What IS audio on a PC anyway?

This may be well known, but maybe not (I didn’t understand it until I joined the Windows Audio team).

Just what is digital audio, anyway?  Well, at its core, all of digital audio is a “pop” sound made on the speaker.  When you get right down to it, that’s all it is.  A “sound” in digital audio is a voltage spike applied to a speaker jack, with a specific amplitude.  The amplitude determines how much the speaker diaphragm moves when the signal is received by the speaker.

That’s it, that’s all that digital audio is – it’s a “pop” noise.  The trick that makes it sound like Sondheim is that you make a LOT of pops every second – thousands and thousands of pops per second.  When you make the pops quickly enough, your ear puts the pops together to turn them into a discrete sound.  You can hear a simple example of this effect when you walk near a high voltage power transformer.  AC power in the US runs at 60 cycles per second, and as the transformer works, it emits a noise on each cycle.  The brain smears that 60 Hz sound together and turns it into the “hum” that you hear near power equipment.

Another way of thinking about this (thanks Frank) is to consider the speaker on your home stereo.  As you’re listening to music, if you pull the cover off the speaker, you can see the cone move in and out with the music.  Well, if you were to take a ruler and measure the displacement of the cone from 0, the distance that it moves from the origin is the volume of the pop.  Now start measuring really fast – thousands of times a second.  Your collected measurements make up an encoded representation of the sound you just heard.

To play back the audio, take your measurements, and move the cone the same amount, and it will reproduce the original sound.

Since a picture is worth a thousand words, Simon Cooke was gracious enough to draw the following...

Take an audio signal, say a sine wave:

Then, you sample the sine wave (in this case, 16 samples per frequency):

Each of the bars under the sine wave is the sample.  When you play back the samples, the speaker will reproduce the original sound.  One thing to keep in mind (as Simon commented) is that the output waveform doesn't look quite like the stepped function that the samples would generate.  Instead, after the Digital-to-Audio-Converter (DAC) in the sound card, there's a low pass filter that smooths the output of the signal.

When you take an analog audio signal, and encode it in this format, it’s also known as “Pulse Coded Modulation”, or “PCM”.  Ultimately, all PC audio comes out in PCM, that’s typically what’s sent to the sound card when you’re playing back audio.

When an analog signal is captured (in a recording studio, for example), the volume of the signal is sampled at some frequency (typically 44.1 kHz for CD audio).  Each of the samples is captured with a particular range of amplitudes (or quantization).  For CD audio, the quantization is 16 bits, in two samples.  Obviously, this means that each sample has one of at most 65,536 values, which is typically enough for most audio applications.  Since the CD audio is stereo, there are two 16 bit values for each sample. 

Other devices, like telephones, on the other hand, typically uses 8 bit samples, and acquires their samples at 8kHz – that’s why the sound quality on telephone communications is so poor (btw, telephones don’t actually use direct 8 bit samples, instead their data stream is compressed using a format called mu-law (or a-law in Europe), or G.711).  On the other hand, the bandwidth used by typical telephone communication is significantly lower than CD audio – CD audio’s bandwidth is 44,100*16*2=1.35Mb/second, or 176KB/second.  The bandwidth of a telephone conversation is 64Kb/second, or 8KB/second (reduced to from 3.2Kb/s to 11Kb/s with compression), an order of magnitude lower.  When you’re dealing with low bandwidth networks like the analog phone network or wireless networks, this reduction in bandwidth is critical.

It’s also possible to sample at higher frequencies and higher sample sizes.  Some common sample sizes are 20bits/sample and 24bits/sample.  I’ve also seen 96.2 kHz sample frequencies and sometimes even higher.

When you’re ripping your CDs, on the other hand, it’s pointless to rip them at anything other than 44.1 kHz, 16 bit stereo, there’s nothing you can do to improve the resolution.  There ARE other forms of audio that have a higher bit rate, for example, DVD-Audio allows samples at 44.1, 48, 88.2, 96, 176.4 or 192 kHz, and sample sizes of 16, 20, or 24 bits/sample, with up to 6 96 kHz audio channels or 2 192 kHz samples.

One thing to realize about PCM audio is that it’s extraordinarily sparse – there is a huge amount of compression that can be done to the data to reduce the size of the audio data.  But in most cases, when the data finally hits your sound card, it’s represented as PCM data (this isn’t always the case, for example, if you’re using the SPDIF connector on your sound card, then the data sent to the card isn’t PCM).

Edit: Corrected math slightly.

Edit: Added a couple of pictures (Thanks Simon!)

Edit3: Not high pass, low pass filter, thanks Stefan.