Witnessing the CSI Effect firsthand

Recently someone internally proposed that Windows automatically filter the output of the speakers from the audio samples being captured.

I was chatting with Valorie at dinner last night (we had dinner at The Herbfarm for a combined 20th anniversary and 45th birthday celebration - major yum) and their proposal came up.  All of a sudden I realized that I was seeing a perfect example of the CSI Effect.

The CSI effect is SO cool.  The CSI techs get a 911 recording of a phone call, and they ask Archie the digital magic guy to "clean it up".  Archie starts playing it back, and the cool waveform shows up on the screen.  Catherine Willows says "Get rid of that road noise", Archie clicks his mouse a couple of times, and Voila! all of a sudden you have a perfectly clear representation of the voice of the kidnapper's accomplice.

This shows up in other ways, but most often, it's a variant of asking the lab rat to "remove that guy's voice", and magically he (or she) does it.

For CSI shows, it's easy.  Why?  Because the CSI special effects people created the tape in the first place by taking all the original elements as separate tracks (the kidnapper's voice, their accomplice, the road noise) and merging them into a single tape.  Then when it comes time for them to extract the specific sound that is necessary to advance the plot, what do you know, there it is!

 

However, in the real world, extracting sounds like this is a bit like separating out the salt and pepper from the salt & pepper mixture they use at Subway restaurants (or unscrambling an egg if you'd rather).  The problem is that when you enter the digital domain, all you have are discrete samples, and nothing in those samples tells you where they came from.

Complicating things is that many inputs are usually mono, and the source of the audio almost always comes from multiple locations (which means that all the samples from all the different sources get squished together).  As a result, picking out the inputs that come from the various discrete sources can be quite complicated (especially indoors where sounds reflect off walls and stuff).
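
To make the "squished together" point a bit more concrete, here's a tiny sketch (the sample values and names are made up purely for illustration).  It mixes two sources into one mono stream, then shows that a completely different pair of sources produces the same mono samples, which is why the mix alone can't tell you what went into it.

```python
# Two hypothetical sources sampled at the same instants (made-up values).
voice = [0.10, 0.40, -0.20, 0.30]
road_noise = [0.05, -0.10, 0.25, -0.15]


def mix(a, b):
    """Sum two sources sample by sample, the way a single mono mic would."""
    return [x + y for x, y in zip(a, b)]


mixed = mix(voice, road_noise)

# A completely different "voice", plus whatever "noise" makes up the difference.
other_voice = [0.00, 0.20, 0.10, 0.05]
other_noise = [m - v for m, v in zip(mixed, other_voice)]

# Both pairs of sources yield the same mono samples (up to rounding).
same = all(
    abs(p - q) < 1e-12
    for p, q in zip(mixed, mix(other_voice, other_noise))
)
print("Different sources, same mono mix:", same)
```

Real separation algorithms lean on extra assumptions about the sources (or on extra microphones) to get around this; the bare samples don't carry that information.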

Now in the case of the "automatically filter" proposal, the filterers have an advantage that the CSI guys don't.   They have the pristine audio samples before they're handed to the audio card.  That means that they have a representation of the samples they're trying to remove, and that helps a LOT.  The problem is that between the pristine audio samples being rendered and the captured audio samples, there's this nasty ugly place called "the real world".

And in the real world, things get messed up.  And there are TONS of ways that they can get messed up.  They get messed up by low quality PC speakers, by crappy microphones, by ambient room noise, etc. 
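
Here's a rough sketch of what "having the pristine samples" buys you.  It uses a normalized least-mean-squares (NLMS) adaptive filter, which is a standard textbook technique and not necessarily what any shipping product does.  The "room" is an assumption: a short made-up impulse response plus a little random noise, which is far kinder than a real room with real speakers and microphones.  All of the names and numbers are hypothetical.

```python
import random


def nlms_echo_cancel(reference, captured, taps=8, mu=0.5, eps=1e-8):
    """Adaptively estimate the echo of `reference` and subtract it from `captured`."""
    weights = [0.0] * taps   # current guess at the speaker-to-mic path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for ref, mic in zip(reference, captured):
        history = [ref] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - echo_estimate          # what is left after cancellation
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def power(samples):
    """Mean squared value of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


random.seed(0)
n = 4000
reference = [random.uniform(-1.0, 1.0) for _ in range(n)]  # rendered samples

# Hypothetical "real world": a short made-up impulse response plus mic noise.
room = [0.6, 0.3, -0.2, 0.1]
captured = []
for i in range(n):
    echo = sum(room[k] * reference[i - k] for k in range(len(room)) if i - k >= 0)
    captured.append(echo + random.gauss(0.0, 0.01))

residual = nlms_echo_cancel(reference, captured)

# Compare echo power before and after, once the filter has had time to adapt.
before = power(captured[-1000:])
after = power(residual[-1000:])
print("echo power before:", before)
print("echo power after: ", after)
```

Even in this friendly setup the residual never reaches zero, because the filter can't predict the noise.  Longer echoes, speaker distortion, and people talking over the playback all make the job harder.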

 

Now it turns out that there's a huge body of work involved in just this problem because fixing it is really important to a large set of people (mainly telephone companies, which have to deal with this issue all the time).  In the industry the techniques used to remove the source are known as "Acoustic Echo Cancellation".  The thing about AEC is that it's not perfect.  It's good, and you can apparently remove about 25dB of the input source, but it's not perfect (the "volume" of a signal halves at every 6dB reduction, so the microphone will still pick up the output of the speakers, but at around 1/18th the original volume).  The Wikipedia entry I linked to above has a list of some of the problems facing AEC algorithms.
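
For anyone who wants to check the decibel arithmetic, here's a quick sketch.  It assumes "volume" means signal amplitude, where a reduction of N dB scales the amplitude by 10^(-N/20).  The function name is just something I made up for the example.

```python
def db_to_amplitude_ratio(db_reduction):
    """Fraction of the original amplitude left after a reduction of `db_reduction` dB."""
    return 10 ** (-db_reduction / 20)


ratio_6 = db_to_amplitude_ratio(6)     # roughly one half
ratio_25 = db_to_amplitude_ratio(25)   # what is left after ~25 dB of cancellation

print(f"6 dB reduction leaves  {ratio_6:.3f} of the amplitude")
print(f"25 dB reduction leaves {ratio_25:.4f} of the amplitude "
      f"(about 1/{1 / ratio_25:.1f})")
```

So 25dB works out to a bit more than four halvings, or somewhere around 1/18th of the original amplitude (an eighth would correspond to only about 18dB).  That's quiet, but it's a long way from gone.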

One other problem with AEC is that the process degrades the quality of the captured signal.  Maybe not much, but some.  When the ultimate receiver is a human ear, the reduction in quality isn't that big a deal (because the human ear is incredibly forgiving), which is why the telephone companies and most voice IM applications use it.  On the other hand, that degradation would be disastrous if you were to record a concert.

 

But right now, the idea of simply stripping out the output audio samples and generating a pristine version of the captured signal isn't feasible, at least not without having the benefit of Jerry Bruckheimer's crack special effects people working for you.

 

OT: While researching this post, I found a cool blog called "The CSI Effect", written by Andrea Campbell, an author who specializes in writing books about forensic science and criminal justice.  It looks like fun (yeah, I'm a total geek, I know that).