What IS audio on a PC anyway?

This may be well known, but maybe not (I didn’t understand it until I joined the Windows Audio team).

Just what is digital audio, anyway?  Well, at its core, all of digital audio is a “pop” sound made on the speaker.  When you get right down to it, that’s all it is.  A “sound” in digital audio is a voltage spike applied to a speaker jack, with a specific amplitude.  The amplitude determines how much the speaker diaphragm moves when the signal is received by the speaker.

That’s it, that’s all that digital audio is – it’s a “pop” noise.  The trick that makes it sound like Sondheim is that you make a LOT of pops every second – thousands and thousands of pops per second.  When you make the pops quickly enough, your ear puts the pops together and turns them into a continuous sound.  You can hear a simple example of this effect when you walk near a high voltage power transformer.  AC power in the US runs at 60 cycles per second, and as the transformer works, it emits a noise on each cycle.  The brain smears that 60 Hz sound together and turns it into the “hum” that you hear near power equipment.

Another way of thinking about this (thanks Frank) is to consider the speaker on your home stereo.  As you’re listening to music, if you pull the cover off the speaker, you can see the cone move in and out with the music.  Well, if you were to take a ruler and measure the displacement of the cone from 0, the distance that it moves from the origin is the volume of the pop.  Now start measuring really fast – thousands of times a second.  Your collected measurements make up an encoded representation of the sound you just heard.

To play back the audio, take your measurements, and move the cone the same amount, and it will reproduce the original sound.
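The measure-and-replay idea above can be sketched in a few lines of Python (a hypothetical illustration, not anything the sound card actually runs): take thousands of “measurements” per second of a 440 Hz sine wave, quantize each one to 16 bits, and write them out with the standard-library wave module so any player can move the cone the same amounts.

```python
import math
import struct
import wave

SAMPLE_RATE = 8000      # measurements ("pops") per second
DURATION = 0.5          # seconds of audio
FREQUENCY = 440.0       # concert A

# "Measure the cone position" SAMPLE_RATE times a second.
samples = []
for n in range(int(SAMPLE_RATE * DURATION)):
    t = n / SAMPLE_RATE
    amplitude = math.sin(2 * math.pi * FREQUENCY * t)  # -1.0 .. 1.0
    samples.append(int(amplitude * 32767))             # quantize to 16 bits

# Playback = hand the same measurements back to the speaker, here via a WAV file.
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

Playing tone.wav reproduces the original tone, exactly as described above.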

Since a picture is worth a thousand words, Simon Cooke was gracious enough to draw the following…

Take an audio signal, say a sine wave:

Then, you sample the sine wave (in this case, 16 samples per cycle):

Each of the bars under the sine wave is a sample.  When you play back the samples, the speaker will reproduce the original sound.  One thing to keep in mind (as Simon commented) is that the output waveform doesn’t look quite like the stepped function that the samples would generate.  Instead, after the digital-to-analog converter (DAC) in the sound card, there’s a low pass filter that smooths the output of the signal.

When you take an analog audio signal and encode it in this format, it’s known as “Pulse Code Modulation”, or “PCM”.  Ultimately, all PC audio comes out as PCM; that’s typically what’s sent to the sound card when you’re playing back audio.

When an analog signal is captured (in a recording studio, for example), the amplitude of the signal is sampled at some frequency (typically 44.1 kHz for CD audio).  Each sample is captured at a particular resolution (or quantization).  For CD audio, the quantization is 16 bits per sample.  This means that each sample has one of at most 65,536 values, which is typically enough for most audio applications.  Since CD audio is stereo, there are two 16 bit values for each sample.
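The 65,536 figure, and the dynamic range it implies, fall out of a one-liner (the ~96 dB result is the figure usually quoted for CD audio):

```python
import math

bits = 16
levels = 2 ** bits                           # 65,536 distinct amplitude values
dynamic_range_db = 20 * math.log10(levels)   # ~96 dB for 16-bit audio

print(levels)                       # 65536
print(round(dynamic_range_db, 1))   # 96.3
```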

Other devices, like telephones, on the other hand, typically use 8 bit samples, and acquire their samples at 8kHz – that’s why the sound quality of telephone communications is so poor (btw, telephones don’t actually use direct 8 bit samples; instead their data stream is compressed using a format called mu-law (or a-law in Europe), standardized as G.711).  On the other hand, the bandwidth used by typical telephone communication is significantly lower than CD audio – CD audio’s bandwidth is 44,100*16*2 = 1.35Mb/second, or 176KB/second.  The bandwidth of a telephone conversation is 64Kb/second, or 8KB/second (reduced to between 3.2Kb/s and 11Kb/s with compression), an order of magnitude lower.  When you’re dealing with low bandwidth networks like the analog phone network or wireless networks, this reduction in bandwidth is critical.
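The CD-versus-telephone bandwidth comparison is just samples-per-second times bits times channels; here it is worked through explicitly:

```python
# CD audio: 44,100 samples/sec, 16 bits/sample, 2 channels
cd_bps = 44_100 * 16 * 2        # 1,411,200 bits/second
cd_Bps = cd_bps // 8            #   176,400 bytes/second (~172 KB/s)

# Telephone (G.711): 8,000 samples/sec, 8 bits/sample, mono
phone_bps = 8_000 * 8           #    64,000 bits/second
phone_Bps = phone_bps // 8      #     8,000 bytes/second

print(cd_bps / phone_bps)       # ~22x: about an order of magnitude
```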

It’s also possible to sample at higher frequencies and with larger sample sizes.  Some common sample sizes are 20 bits/sample and 24 bits/sample.  I’ve also seen 96 kHz sample frequencies and sometimes even higher.

When you’re ripping your CDs, on the other hand, it’s pointless to rip them at anything other than 44.1 kHz, 16 bit stereo; there’s nothing you can do to improve the resolution.  There ARE other forms of audio that have a higher bit rate – for example, DVD-Audio allows samples at 44.1, 48, 88.2, 96, 176.4 or 192 kHz, and sample sizes of 16, 20, or 24 bits/sample, with up to six audio channels at 96 kHz or two channels at 192 kHz.
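To see how much more data those DVD-Audio modes carry than CD audio, the arithmetic is the same channels × rate × bits product as before:

```python
def pcm_bits_per_second(channels, rate_hz, bits):
    """Raw PCM data rate for a given format."""
    return channels * rate_hz * bits

cd   = pcm_bits_per_second(2, 44_100, 16)    #  1,411,200 b/s
dvd6 = pcm_bits_per_second(6, 96_000, 24)    # 13,824,000 b/s
dvd2 = pcm_bits_per_second(2, 192_000, 24)   #  9,216,000 b/s
```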

One thing to realize about PCM audio is that it’s extraordinarily sparse – there is a huge amount of compression that can be done to the data to reduce the size of the audio data.  But in most cases, when the data finally hits your sound card, it’s represented as PCM data (this isn’t always the case, for example, if you’re using the SPDIF connector on your sound card, then the data sent to the card isn’t PCM).
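One way to see that compressibility (a toy sketch, not any real codec): consecutive samples of a low-frequency tone differ by only a tiny amount, so storing the differences between samples needs far fewer bits than storing the values themselves.

```python
import math

rate, freq = 44_100, 220.0   # a 220 Hz tone at the CD sample rate
samples = [int(32767 * math.sin(2 * math.pi * freq * n / rate))
           for n in range(1000)]

# Delta encoding: store only the change from one sample to the next.
deltas = [b - a for a, b in zip(samples, samples[1:])]

print(max(abs(s) for s in samples))   # close to 32767: values need 16 bits
print(max(abs(d) for d in deltas))    # only ~1,000: deltas fit in ~11 bits
```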

Edit: Corrected math slightly.

Edit: Added a couple of pictures (Thanks Simon!)

Edit3: Not high pass, low pass filter, thanks Stefan.

Comments (38)

  1. A couple of quick notes…

    I was under the impression that phones used 8khz 8 bit data, which would be closer to 7.8kb/s, not 11kb/s.

    Also, the "pops" and "brain smearing" arguments are not comet at all; there is hardware filtering after the DAC which bandwidth-limits the signal and forces it to be a smooth continuous waveform. The speaker cone itself also has a mass, and the inertia and momentum of the speaker itself also have a very similar effect on the signal, causing it to roll off the higher frequencies.

    So basically, no pops. Sure, you can create pop ‘sounds’ but only at the limiting frequency of the output of the DAC. The sound of the transformer really is a 60 hz sound – the transformer coils resonate at the AC line frequency + distortion – no brain-driven signal integration required.


  2. comet = correct. Damn TabletPC!

  3. Simon, you’re right, the analog filtering after the DAC smears them out, but from a conceptual standpoint, that was the easiest way of explaining the idea that these are discrete samples.

    And 60Hz is above the human threshold of hearing, which is why you can hear the transformer.  I chose it because it was the most common example of a low frequency sound I could come up with.

  4. Oh, and you’re right, phone is 8kHz, not 11kHz. That’s why I indicated that it was for devices LIKE telephones.

  5. Todd says:

    Your units are wrong on your math. Your "44,1000" also has one too many 0s. The correct formula is:

    44,100 (1/s) * 16 (b) * 2 = 1411200 b/s = 1378.125 Kb/s = 1.35 Mb/s.

    That’s in bits. In bytes, you get:

    (44,100 (1/s) * 16 (b) * 2) * 1/8 (B/b) = 176400 B/s = 172.27 KB/s = 0.17 MB/s

    Thus, your numbers are correct but your units (bytes vs. bits) are wrong assuming the standard notation of ‘b’ = bits and ‘B’ = bytes. I also used 1024 B/KB (b/Kb, KB/MB, Kb/Mb) rather than 1000.

  6. Strangely, given the topic, this is one case where a picture really is worth a thousand words.

  7. Oh, and re: the telephone stuff, I was referring to this line, Larry.

    "The bandwidth of a telephone conversation is 88kB/second, or 11kb/second, an order of magnitude lower."

  8. I’d like to do a picture, but I’m not good enough in Mathematica to do it justice, and I’m not going to steal someone else’s work.

  9. Jeff Parker says:

    Now another question along the same lines: in Media Player, the visualizations when listening to music, like the graphical equalizer and such – do those pretty much tap into the same signal to draw on the screen? I would assume so, but I’m not sure; it’s one of those things I just sat back and enjoyed, was curious about, but never dug into.

    But I am assuming those pulses for the speaker are the same pulses that you see on the screen.

  10. Yup, visualizations operate on the samples being sent to the sound card. Internally they’re implemented as dshow filters that render their samples to the screen instead of performing some kind of transform on them.

  11. Audio dude says:

    Might want to clarify that you’re talking about cellphones, not landline phones. The young’uns might be confused.

    You might also clarify that you can transfer PCM over S/PDIF, but that isn’t the only data format available. Further, in the PCM case, the PCM data might be encoded for SPDIF on the card itself, not on the host PC. πŸ™‚

  12. Larry – send me an email, and let me know what you need. I should be able to knock something out pretty quickly.

  13. Eric Lippert says:

    > One thing to realize about PCM audio is that it’s extraordinarily sparse – there is a huge amount of compression that can be done

    Well, sure — but it is important to note that in many compression schemes, it’s lossy compression.

    Compression works well when there are large ranges with small variations and patterns in the data, both of which are true of SOME audio. Compression works great on "bassy" music because a 220 Hz note has a profile that looks like a sine wave with two hundred samples per wavelength.

    There are therefore two ways you could compress this. You could take advantage of the small change between any two samples and just store the differences, which take up fewer bits than the values. Or, you could take advantage of the fact that a 220 Hz sine wave is computable if you know its duration, initial value, and frequency — just store them!

    Lossy compression schemes do exactly that — they take the Fourier transform of the signal to determine what combination of sine waves make up a particular sample, and then just store information about those sine waves.

    The question then is "what about the high-frequency stuff?" The human threshold for hearing is ~20KHz. Any signal sampled at 40KHz that contains loud sounds in the KHz range is going to have large variations between samples, and the details of the overall wave shape are likely to change rapidly.

    That’s what makes lossy compression schemes lossy. They just throw away the high frequency information because it’s too hard to compress!

    That’s why when you listen to overly compressed audio, things like applause and symbol crashes sound awful. Applause and symbol crashes tend to have lots of random, hard-to-compress, high amplitude, high frequency signal in them.
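    (Eric’s "just store the sine waves" idea can be sketched with a naive discrete Fourier transform – a toy illustration only; real encoders use an MDCT plus psychoacoustic modelling.)

```python
import math, cmath

def dft(samples):
    """Naive DFT: how much of each sine-wave frequency is in the signal."""
    N = len(samples)
    return [sum(samples[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

# 64 samples of a pure tone that completes exactly 5 cycles in the window.
N = 64
signal = [math.cos(2 * math.pi * 5 * n / N) for n in range(N)]

spectrum = [abs(c) for c in dft(signal)]
peak = max(range(N // 2), key=lambda k: spectrum[k])
print(peak)   # 5 – all of the tone's energy sits in a single coefficient
```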

  14. Ziv Caspi says:

    Actually, modern audio compression algorithms can do much more than simply cut off the higher frequencies — encoders like MP3 have a psycho-acoustic model of the way we hear things to improve compression even further. For example, a strong tone will often "mask" a weak tone which is close to it in frequency, so the second one can often be thrown out as part of the lossy compression.

  15. Chris says:

    I’ve read that a sample rate higher than 2x the highest sampled frequency is unnecessary according to Nyquist’s theorem. If this is so, and human hearing tops out at around 20KHz (or let’s say 30KHz for the super humans), then sample rates higher than 40-60KHz are just taking up more space on our storage media for no real gain.

    Some say that there are still effects from those inaudible frequencies on those in the audible range. If that is so then recording microphones (which also top out at around 20KHz) would still record the effects of those frequencies. Any other improvement in perceived sound quality is attributed to the quality of the equipment in the recording and/or playback chain (or is psychological).

    Search for Dan Lavry on usenet or elsewhere for a more in depth discussion/argument of this.
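    (The flip side of the Nyquist argument is aliasing: a tone above half the sample rate doesn’t vanish when sampled, it folds back into the audible band. The helper below is hypothetical and simplified to the first fold.)

```python
def alias(freq_hz, rate_hz):
    """Apparent frequency of a sampled tone (simplified folding rule)."""
    f = freq_hz % rate_hz
    return rate_hz - f if f > rate_hz / 2 else f

print(alias(30_000, 44_100))   # 14100 – an ultrasonic tone lands in-band
print(alias(20_000, 44_100))   # 20000 – in-band tones are unaffected
```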

  16. Chris – there is a subtlety here.

    The Nyquist limit details how many samples one must take to accurately reproduce a signal of a given frequency without aliasing.

    If a sound was generated by a single stationary point source in an infinitely absorbing room (i.e. no echo), then Nyquist will tell you everything you need to know to reproduce that sound.

    However, when you start positioning sounds in space, higher frequencies become important. While the human ear can only hear frequencies up to about 22.5kHz (on average – some people can hear more, some less), it can discriminate between the arrival times of sounds at much higher resolution – on the order of what would be a frequency of 100,000Hz. That is, if the same sound wave arrives 10 microseconds apart, at one ear first and then the other, the listener can tell the difference, and interprets this as spatial separation of the sounds.

    A lot of positioning information is encoded in the higher frequency domain. So while Nyquist is strictly correct for a given signal, it’s a very much idealised form when you’re dealing with stereo positional audio.
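    (Simon’s timing argument in numbers – the 10 microsecond figure is from his comment above:)

```python
sample_period_us = 1_000_000 / 44_100   # one CD sample lasts ~22.7 microseconds
interaural_us = 10                      # arrival-time difference the ear can detect

print(round(sample_period_us, 1))        # 22.7
print(interaural_us < sample_period_us)  # True: finer than one CD sample period
```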

  17. Norman Diamond says:

    10/26/2004 2:31 PM Eric Lippert

    > That’s why when you listen to overly

    > compressed audio, things like applause

    > and symbol crashes sound awful.

    Ann if ewe overtly comprise dictionaries four spilling chequers, sings like cymbal clashes look awe full? @ leased they sound like symbol crashes.

  18. Paul Winwood says:

    The other factor is the dynamic range (loudness) of human hearing. We can hear over a range of 120dB although sustained levels of over 90dB can damage our hearing. CD audio can achieve about 96dB. AC-3 and DVD-Audio formats achieve more than this.

  19. Petr Kadlec says:

    Simon: "about 22.5kHz (on average – some people can hear more, some less)"

    Are you sure about that? I’ve always thought that it is significantly lower (somewhere around 16 kHz). In fact, if this were true, it would mean that some people would actually be able to hear that some high frequency signal is missing from CD audio.

  20. Stefan Kuhr says:

    >>"there’s a high pass filter that smooths the output of the signal."

    Just one comment: It’s not a high pass, it is a *low* pass filter that is used to suppress the frequency portions that are "mirrored" into the signal from the next band (overlaid by a sin(x)/x curve), assuming it is not an ideally bandlimited signal that you are sampling. The low pass should let pass all frequency portions from DC to the highest frequency (i.e. 20kHz) and should then have a very steep curve to suppress everything at or above half the sample frequency (i.e. 22.05kHz). This is also where the quality of the analog circuitry comes into play: You can create very steeply curved low pass filters with only a few RC elements (e.g. Chebyshev filters) but those have a non-constant group delay (some people can actually hear this), which means that e.g. the lower frequencies arrive later at the listener’s ear than the higher frequencies or the other way round. If you want steep filters but constant group delay you need more RC elements in the analog filter and thus more complex and expensive analog filters (e.g. Bessel filters).
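    (Stefan is describing the analog reconstruction filter; purely to illustrate what "low pass" means, here is a crude digital stand-in – a moving average, nothing like the steep Chebyshev/Bessel designs he mentions.)

```python
import math

def moving_average(samples, width):
    """Crude FIR low-pass: each output is the mean of `width` neighbouring samples."""
    return [sum(samples[i:i + width]) / width
            for i in range(len(samples) - width + 1)]

rate = 44_100
low  = [math.sin(2 * math.pi * 100 * t / rate) for t in range(441)]           # keep
high = [0.5 * math.sin(2 * math.pi * 15_000 * t / rate) for t in range(441)]  # block

print(round(max(abs(s) for s in moving_average(low, 8)), 2))    # ~1.0: passed
print(round(max(abs(s) for s in moving_average(high, 8)), 2))   # ~0.05: suppressed
```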

  21. When I was taking engineering at university one of my profs mentioned that transformers hum at twice the input frequency. So transformers in North America actually hum at 120Hz, not 60Hz.

    Here’s a link that explains why:


  22. Roland Boon says:

    There is a different method of raw audio encoding than PCM which is used by Super Audio CD; a 1 bit digital stream known as DSD.

    "The DSD technology uses a sampling frequency of 2.8224 MHz, which is 64 times higher than that of CD. This enables a frequency response up to 100 kHz and a dynamic range of 120 dB across the entire audible range."

    From http://www.licensing.philips.com/information/sacd/protec/documents1089.html

  23. Petr – sorry, I meant 22.05kHz πŸ™‚ I missed a decimal.

    Nyquist states that you sample at twice the frequency of the signal you want to reproduce (with the caveats I previously mentioned re: spatial positioning).

    CD audio was chosen to record at 44.1kHz because most people top-out their hearing at the high end at 22.05kHz (which, by Nyquist, is sampled at 44.1kHz).

    DAT tapes record at 48kHz because they didn’t want them to be compatible with CD audio on a direct binary level – the idea being to reduce piracy. In actuality, all that really happened because of it was tape went the way of the dodo.

  24. John Topley says:

    DAT is still used in the music industry though.

  25. Times like this make me wish I had your brain. I don’t really want to go work for the Windows Audio team, but it’s almost as if I don’t really have to if I could just get exclusive access to just SOME of your brain. If there was just some way of harnessing it safely and cheaply.

    I guess I’ll just have to settle for more posts like this. I used to think some of this stuff was over my head, but it really isn’t if it’s given in such a clear explanation. One can get lost in the techno-babble of the audio world quite easily, but somehow I understood everything that was said.

    Thanks again.

  26. Stewart says:

    I remember being told that one of the other problems with the sampling rates is that before encoding, the source signal must be low pass filtered to prevent frequencies above the Nyquist limit sneaking through and causing aliasing.

    Because this filtering cannot be ideal (because of the group delays mentioned above) you lose some of the higher frequencies below the Nyquist limit. If you raise the sample rate, you can move the filter’s cutoff higher and save more of the perceptible audio.

  27. ATZ Man says:

    Interesting article and comments.

    48KHz was already established as a standard for digital audio when CD Audio was being defined. Stewart seems to state the motive for 48KHz. The 44.1KHz standard supposedly comes from making the physical CD fit in a Japanese-size car stereo slot and still hold a certain amount of music.

    I remember in the late 80’s PC games were attempting to use the original PC speaker to output digitized sound. It was worse than phone audio, and compute intensive, and 24-bit audio cards in a retail box cost $30, so it wasn’t much use.

  28. Norman Diamond says:

    10/27/2004 5:25 AM Petr Kadlec

    >> Simon: "about 22.5kHz (on average – some

    >> people can hear more, some less)"

    [or 22.05 kHz, it doesn’t matter much to this]

    > I’ve always thought that it is significantly

    > lower (somewhere around 16 kHz).

    The average varies with age and gender. For adult men it might well be 16 kHz. But the overall average is still the overall average.

    > In fact, if this would be true, it would

    > mean that some people would be able to

    > actually hear that there is missing some

    > high frequency signal on a CD audio.

    That’s exactly true. No matter what the average is, that means some people are below it and some people are above it. There always have been people who are put off by the missing high frequencies on CD audio — and by inconsistent phase shifts, and on white noise or pink noise added by ordinary electronic amplifiers.

    10/28/2004 10:59 AM ATZ Man

    > The 44.1KHz standard supposedly comes from

    > making the physical CD fit in a Japanese-

    > size car stereo slot and still hold a

    > certain amount of music.

    Huh? Japanese car stereo slots are the same size as import car stereo slots. Of course there aren’t a lot of imports, and most of those are from Germany not from the NAFTA zone, but they all take the same kinds of stereos.

    There are other kinds of optical disks such as MDs, but those didn’t exist when CDs were first developed, and they sure didn’t figure in setting frequencies. I’ve read that MDs use a lossy compression algorithm. They were popular for a while because of their portability (e.g. can be worn while riding a train) but that market has moved to flash memory.

  29. It might annoy some at Microsoft to call out the comparison but when I read Scoble this morning I couldn’t help but think of Big Blue. One of factors about IBM i always find impressive is seniority. Folks stay at…

  30. Before I can talk about reading audio CDs using DAE (Digital Audio Extraction), I need to talk a bit…

  31. I’ve been talking about audio controls – volumes and mutes and the like, but one of the more confusing…
