Witnessing the CSI Effect first hand

Recently someone internally proposed that Windows automatically filter the output of the speakers from the audio samples being captured.

I was chatting with Valorie at dinner last night (we had dinner at The Herbfarm for a combined 20th anniversary and 45th birthday celebration – major yum) and their proposal came up.  All of a sudden I realized that I was seeing a perfect example of the CSI Effect.

The CSI effect is SO cool.  The CSI techs get a 911 recording of a phone call, and they ask Archie the digital magic guy to “clean it up”.  Archie starts playing it back, and the cool waveform shows up on the screen.  Catherine Willows says “Get rid of that road noise”, Archie clicks his mouse a couple of times, and voilà! All of a sudden you have a perfectly clear representation of the voice of the kidnapper’s accomplice.

This shows up in other ways, but most often, it’s a variant of asking the lab rat to “remove that guy’s voice”, and magically he (or she) does it.

For CSI shows, it’s easy.  Why?  Because the CSI special effects people created the tape in the first place by taking all the original elements as separate tracks (the kidnapper’s voice, their accomplice, the road noise) and merging them into a single tape.  Then when it comes time for them to extract the specific sound that is necessary to advance the plot, what do you know, there it is!


However, in the real world, extracting sounds like this is a bit like separating out the salt and pepper from the salt & pepper mixture they use at Subway restaurants (or unscrambling an egg if you’d rather).  The problem is that when you enter the digital domain, all you have are discrete samples, and nothing in those samples says where they came from.

Complicating things is that the input is usually mono, while the audio almost always comes from multiple locations (which means that all the samples from all the different sources get squished together).  As a result, picking out the parts of the input that come from the various discrete sources can be quite complicated (especially indoors, where sounds reflect off walls and stuff).
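To make the "squished together" point concrete, here is a minimal sketch in Python.  The sample values and the names (mix, voice, road_noise and so on) are made up for illustration.  It shows two completely different pairs of sources that add up to exactly the same mono capture, so the capture alone can't tell you which pair produced it.

```python
def mix(*sources):
    """Sum several equal-length sample sequences into one mono signal."""
    return [sum(samples) for samples in zip(*sources)]

# Hypothetical sources (made-up sample values).
voice = [3, -1, 4, 1, -5]
road_noise = [2, 7, -1, 8, 2]

# A completely different pair of hypothetical sources.
other_voice = [5, 0, 0, 9, -9]
other_noise = [0, 6, 3, 0, 6]

captured = mix(voice, road_noise)
same_capture = captured == mix(other_voice, other_noise)

print(captured)       # [5, 6, 3, 9, -3]
print(same_capture)   # True: both pairs produce identical samples
```

Real separation techniques have to lean on extra assumptions about the sources (or on extra microphones) to choose between the possibilities.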

Now in the case of the “automatically filter” proposal, the filterers have an advantage that the CSI guys don’t.   They have the pristine audio samples before they’re handed to the audio card.  That means that they have a representation of the samples they’re trying to remove, and that helps a LOT.  The problem is that between the pristine audio samples being rendered and the captured audio samples, there’s this nasty ugly place called “the real world”.

And in the real world, things get messed up.  And there are TONS of ways that they can get messed up.  They get messed up by low quality PC speakers, by crappy microphones, by ambient room noise, etc. 
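Here is a small Python sketch of why the pristine samples can't simply be subtracted from the capture.  It assumes the only thing the "real world" does is delay the sound by a few samples and make it quieter; the delay and gain values and the function names are invented for illustration, and a real speaker, room and microphone do far worse.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def speaker_to_mic(rendered, delay=3, gain=0.6):
    """Hypothetical real-world path: a short delay plus attenuation only."""
    delayed = [0.0] * delay + list(rendered[:len(rendered) - delay])
    return [gain * s for s in delayed]

# The pristine samples handed to the audio card: a plain sine tone.
rendered = [math.sin(2 * math.pi * i / 16) for i in range(160)]

# What the microphone hears of that tone (no other sound in the room).
captured = speaker_to_mic(rendered)

# Naive "filter": subtract the pristine samples from the capture.
naive = [c - r for c, r in zip(captured, rendered)]

print(f"captured level:       {rms(captured):.3f}")
print(f"after naive subtract: {rms(naive):.3f}")
```

In this toy setup the "filtered" signal comes out louder than the capture it started from, because the subtracted samples no longer line up with what the microphone actually heard.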


Now it turns out that there’s a huge body of work devoted to just this problem, because fixing it is really important to a large set of people (mainly telephone companies, which have to deal with this issue all the time).  In the industry the techniques used to remove the source are known as “Acoustic Echo Cancellation”.  The thing about AEC is that it’s not perfect.  It’s good, and you can apparently remove about 25dB of the input source, but it’s not perfect (the “volume” of a signal halves with every 6dB reduction, so the microphone will still pick up the output of the speakers, but at around 1/18th the original volume).  The Wikipedia entry I linked to above has a list of some of the problems facing AEC algorithms.
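As a rough illustration, here is a Python sketch of a normalized LMS (NLMS) adaptive filter, a textbook starting point for echo cancellers.  I'm not claiming this is what any phone company or product actually ships.  The echo path, noise level, tap count and step size are all made-up assumptions, and the toy room is perfectly linear and never changes, so the attenuation it prints is not a prediction for real hardware.  The db_to_amplitude_ratio helper does the decibel arithmetic: -6dB works out to about 0.501 of the original amplitude and -25dB to about 0.056.

```python
import math
import random

def db_to_amplitude_ratio(db):
    """Convert a level change in dB to an amplitude (sound pressure) ratio."""
    return 10 ** (db / 20)

def power(samples):
    """Mean squared value of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def nlms_echo_cancel(reference, captured, taps=8, mu=0.5, eps=1e-8):
    """Estimate the echo of `reference` inside `captured` and subtract it."""
    weights = [0.0] * taps   # current guess at the speaker-to-mic response
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, captured):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate            # what is left after cancelling
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual

rng = random.Random(1)
reference = [rng.uniform(-1, 1) for _ in range(4000)]  # samples sent to speakers
echo_path = [0.0, 0.5, 0.3, -0.2]   # made-up delay and reflections

captured = []
for n in range(len(reference)):
    echo = sum(g * reference[n - k]
               for k, g in enumerate(echo_path) if n - k >= 0)
    captured.append(echo + rng.uniform(-0.01, 0.01))   # plus faint room noise

residual = nlms_echo_cancel(reference, captured)

# Measure the reduction over the last 1000 samples, after the filter settles.
tail = slice(-1000, None)
attenuation_db = 10 * math.log10(power(captured[tail]) / power(residual[tail]))

print(f"echo reduced by {attenuation_db:.1f} dB in this toy setup")
print(f"-6 dB  -> {db_to_amplitude_ratio(-6):.3f} of the original amplitude")
print(f"-25 dB -> {db_to_amplitude_ratio(-25):.3f} of the original amplitude")
```

Even in this idealized setup the filter can't push the residual below the room noise it never had a reference for, which is one reason the cancellation is never perfect.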

One other problem with AEC is that the process degrades the quality of the captured signal.  Maybe not much, but some.  When the ultimate receiver is a human ear, the reduction in quality isn’t that big a deal (because the human ear is incredibly forgiving), which is why the telephone companies and most voice IM applications use it.  On the other hand, that degradation would be disastrous if you were to record a concert.


But right now, the idea of simply stripping out the output audio samples and generating a pristine version of the captured signal isn’t feasible, at least not without having the benefit of Jerry Bruckheimer’s crack special effects people working for you.


OT: While researching this post, I found a cool blog called “The CSI Effect”, written by Andrea Campbell, an author who specializes in writing books about forensic science and criminal justice.  It looks like fun (yeah, I’m a total geek, I know that).

Comments (19)

  1. Chris Hynes says:

    It seems to me that if you’re trying to remove the possibility of executing voice commands via a malicious output file, not trying to create a concert quality recording, it would be more feasible to filter the output samples from the input. This might even make the input more recognizable as there would be less background noise. Of course, you have to address delay issues, but that should be fairly constant, possibly even trainable. I wonder if this would make the recognition more or less precise…


  2. Did I say anything about voice recognition?  This article has nothing to do with voice recognition, just that it’s not easy to unmingle samples once they’ve been mingled.

    Either way, "filter the output samples from the input" is EXACTLY what I’m talking about when I refer to AEC.  The best you can do is reduce the captured volume to around 1/18th the source volume at a cost of a significant degradation of the input signal.  It’s not clear if the degradation in quality of the input signal would be enough to hinder the ability of a speech recognition algorithm to recognise valid speech (if you were to do such a thing).

    In general, the ultimate destination for audio processed by AEC algorithms is the human ear, and the human ear has a remarkable ability to determine meaning from a degraded signal.

  3. LJ says:

    "separating out the salt and pepper from the salt & pepper mixture they use at Subway restaurants "

    1) put S&P mixture into hot water – stir until salt is dissolved.

    2) Pour salt water off of pepper.

    3) Rinse pepper and save run-off water

    4) Repeat #3 waay too many times.

    5) Dry pepper

    6) Boil off water in low temperature vacuum boiler (we just want the water to go to vapor, not to cook the salt).

    Unscrambling an egg I leave to another reader.

  4. LJ, yeah, I mentioned this to Valorie and she came up with something similar (I didn’t know that pepper was magnetic).

  5. Joku says:

    I’m surprised you didn’t mention the work with the idea of having multiple mics in laptops so the recorded voice could be made out of the background noise easier. Wasn’t there some Microsoft hardware or research team working on this?

  6. KSS says:

    Another approach would be to process both the audio you are playing via the speakers for voice commands and the audio coming in via the microphone for voice commands and if you get the same command from each at the same time then it must be originating from the speakers.

    In other words, you don’t need to untangle the audio, you just need to filter out the commands that originate from the speakers.  Downside is two audio signals to process for verbal commands; but, that may be cheaper than some complicated signal processing approach to removing the speaker signal from the mic anyway.

  7. Joku, this post’s about the CSI effect – people who see stuff on CSI and assume that it works the same way in real life.

    MicArrays are different beasts (although very, very cool).

  8. Dean Earley says:

    We get exactly the same thing, as we are in the digital CCTV market.

    "How can I get a nice, clear, crisp image from this low quality recorded image?"

    "You can’t, digital is lossy. Once it’s saved at that quality, it’s gone."

    "But they can do it on TV…"

    I think this was from the same person who asked if we could detect people wearing motorbike helmets by inverting the face detection system… :o)

  9. Andrew says:

    > The "volume" of a signal halves at every 6dB reduction

    I think you meant 3dB not 6dB, right?

  10. Clinton Pierce says:

    Another salt and pepper separation technique?  Static electricity.  Get a balloon, plastic ruler, etc… all staticy and hold it over the S&P mixture.  The pepper will jump up onto the balloon and leave the salt behind.

    The egg is left as an exercise for the reader.

  11. Dustin Long says:


    If the results mentioned in this article are really what they claim to be, we might soon have the CSI effect become reality. Very interesting stuff.

  12. Jeremy says:

    The egg? That’s easy! Just put the scrambled mess in the transporter, and Scotty will sort out the molecules into the original egg in the transport stream…

  13. James Kilner says:

    I believe that the volume reduction from a 6dB reduction in signal is well over half.  I recall that a 3dB reduction is where the volume is halved (but I may be wrong).  This applies not only to sound, but to any electrical signal.

  14. James, 6dB is roughly half volume (technically it’s 1/2 the loudness as measured via SPL).   Volume is roughly the same as loudness (volume is perceptual, loudness is physical).

    "This means that a doubling in sound pressure output from a speaker relates to a 6 dBSPL increase.", from: http://en.wikipedia.org/wiki/Decibel

  15. RyanBemrose says:

    A doubling of power (W) is 3dB.  A doubling of amplitude (Pa or dBSPL) is 6dB.  You can start knowing that power is proportional to amplitude squared, and do the math from there.

    Loudness corresponds to amplitude, not power.  So Larry is correct.  Plus 6dB is twice as loud.

    (Hearing damage is still proportional to power, though, so watch that amplitude 🙂 )

    -Ryan "I should write a post on this" Bemrose

  16. Maurits says:

    Darn, so all those "freeze and enhance" movie scenes are fake?

    Mathematically, the 6 dB figure comes from 20 * log10(2) dB, which actually works out to 6.02-ish dB.  6 dB is close enough for most purposes.

    For power, it’s 10 * log10(2) dB, or 3.01 dB.

  17. Igor says:

    There is a filter called Sonic Foundry Noise Reduction (and perhaps some others that work the same way). Once you pinpoint the noise part of the signal, noise gets removed completely.

    Anyway, the idea of filtering commands on both input and output at the same time is very good as long as you don’t have zero-latency sound hardware.

  18. anonymous says:

    Igor, it’s probably working on characteristic noise / white noise / or hiss in recordings.  This is easier to remove since it’s easier to model than 1 person’s voice amongst many people talking in a room.

    With the cocktail party solution – they’re using ANNs (neural networks), so how do they determine that they’ve found _the_ solution (it’s a mathematical proof they claim)?

  19. Igor says:

    It is capable of removing all sorts of noise because it takes "fingerprint" of the noise and then "blanks" it (in frequency domain) using FFT and inverse FFT.