Intro to Audio Programming, Part 2: Demystifying the WAV Format

The WAV format is arguably the most basic of sound formats out there. It was developed by Microsoft and IBM, and it is rather loosely defined. As a result, there are a lot of WAV files out there that theoretically should not work, but somehow do.

Even if you do not plan to work with WAV data directly, future entries in this series will use the DirectSound library, which has direct analogs to elements of the WAV file, so it’s important to understand what all of this means.

WAV files usually store uncompressed audio, which means they can get quite large, but they cannot exceed 4 gigabytes, because the file size header field is a 32-bit unsigned integer.

WAV files are stored in a binary format. They are made up of chunks, where each chunk tells you something about the data in the file.

Let’s dig a little deeper into how WAV works.

Chunks

A chunk represents a certain piece of metadata (or actual data) in the file. Each chunk serves a specific purpose and is structured very specifically – order matters.

Note: There are a lot of available chunks that you can use to accomplish different things, but not all of them are required to be in a wave file. To limit the amount of stuff you have to absorb, we’ll only be looking at the header and the two chunks you need to create a functioning WAV file.

Also, every chunk (including the file header) starts with a four-character (four-byte) identifier, called sGroupID, that defines what’s coming. We’ll see more about this… now.

Header

While not really a chunk according to the WAV spec, the header is what comes first in every WAV file. Here is what the header looks like:

Field Name | Size (bytes) | C# Data Type | Value | Description
sGroupID | 4 | char[4] | “RIFF” | For WAV files, this value is always RIFF. RIFF stands for Resource Interchange File Format, and is not limited to WAV audio – RIFFs can also hold AVI video.
dwFileLength | 4 | uint | varies | The total file size, in bytes, minus 8 (the “RIFF” marker and this length field itself are not counted).
sRiffType | 4 | char[4] | “WAVE” | For WAV files, this value is always WAVE.
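
If you like to think in code, here is a minimal C# sketch of the header as a class (the class name is my own invention; the field names and types come straight from the table):

    // A minimal sketch of the WAV file header. The sGroupID and sRiffType
    // values are fixed; dwFileLength gets computed after the chunks are built.
    public class WaveHeader
    {
        public char[] sGroupID = { 'R', 'I', 'F', 'F' };
        public uint dwFileLength;  // total file size in bytes, minus 8
        public char[] sRiffType = { 'W', 'A', 'V', 'E' };
    }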

Format Chunk

The Format chunk is the metadata chunk. It specifies many of the things we talked about in part 1 of this series, such as the sample rate, bit depth, number of channels, and more.

Before we look at the format chunk structure, let’s run through some definitions in gory detail.

  • Sample – A single, scalar value representing the amplitude of the sound wave in one channel of audio data.
  • Channel – An independent waveform in the audio data. The number of channels is important: one channel is “Mono,” and two channels is “Stereo” – there are separate waves for the left and right speakers. 5.1 surround sound has six channels: five full-range channels plus a low-frequency effects channel (the “.1”) that is usually sent to a subwoofer. Again, each channel holds audio data that is independent of all the other channels, although all channels will be the same overall length.
  • Frame – A frame is like a sample, but in multichannel format – it is a snapshot of all the channels at a specific data point.
  • Sampling Rate / Sample Rate – The number of samples (or frames) that exist for each second of data. This field is represented in Hz, or “per second.” For example, CD-quality audio has 44,100 samples per second. A higher sampling rate means higher fidelity audio.
  • Bit Depth / Bits per Sample – The number of bits available for one sample. Common bit depths are 8-bit, 16-bit and 32-bit. A sample is almost always represented by a native data type, such as byte, short, or int. A higher bit depth means each sample can be more precise, resulting in higher fidelity audio.
  • Block Align – This is the number of bytes in a frame. It is calculated by multiplying the number of channels by the number of bytes (not bits) in a sample. To get the number of bytes per sample, we divide the bit depth by 8 (assuming a byte is 8 bits). The resulting formula looks like blockAlign = nChannels * (bitsPerSample / 8). For 16-bit stereo format, this gives you 2 channels * 2 bytes = 4 bytes.
  • Average Bytes per Second – Used mainly to allocate memory, this measurement is equal to sampling rate * block align (see the worked example just after this list).
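
To make those last two definitions concrete, here is the arithmetic for CD-quality stereo audio as a small C# snippet:

    // Worked example: 16-bit stereo audio at 44,100 Hz.
    ushort channels = 2;
    ushort bitsPerSample = 16;
    uint samplesPerSec = 44100;

    // Bytes per multichannel frame: 2 channels * 2 bytes per sample = 4.
    ushort blockAlign = (ushort)(channels * (bitsPerSample / 8));

    // Bytes of audio data per second: 44100 * 4 = 176,400.
    uint avgBytesPerSec = samplesPerSec * blockAlign;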

Now that we know what all these things mean (or at least, you can scroll up and read them when you need to), let’s dive into the format chunk’s structure.

Field Name | Size (bytes) | C# Data Type | Value | Description
sGroupID | 4 | char[4] | “fmt ” | Indicates that the format chunk is being defined. Note the single trailing space to fill out the 4 bytes required here.
dwChunkSize | 4 | uint | 16 | The length of the rest of this chunk, in bytes (not including sGroupID or dwChunkSize). For the PCM layout shown here, it is always 16.
wFormatTag | 2 | ushort | 1 | The format code: 1 indicates integer PCM. (32-bit float samples technically use format tag 3, WAVE_FORMAT_IEEE_FLOAT, instead.)
wChannels | 2 | ushort | varies | Indicates the number of channels in the audio. 1 for mono, 2 for stereo, etc.
dwSamplesPerSec | 4 | uint | varies | The sampling rate for the audio (e.g. 44100, 8000, 96000, depending on what you want).
dwAvgBytesPerSec | 4 | uint | dwSamplesPerSec * wBlockAlign | The average number of bytes of audio data per second. Used to estimate how much memory is needed to play the file.
wBlockAlign | 2 | ushort | wChannels * (wBitsPerSample / 8) | The number of bytes in a multichannel audio frame.
wBitsPerSample | 2 | ushort | varies | The bit depth (bits per sample) of the audio. Usually 8, 16, or 32.
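
Mirrored in code, the format chunk might be sketched like this (the class name is mine; the defaults shown describe 16-bit stereo at 44,100 Hz):

    // A minimal sketch of the format chunk, one field per table row.
    public class FormatChunk
    {
        public char[] sGroupID = { 'f', 'm', 't', ' ' };  // note the trailing space
        public uint dwChunkSize = 16;                     // byte count of the six fields below
        public ushort wFormatTag = 1;                     // 1 = integer PCM
        public ushort wChannels = 2;                      // stereo
        public uint dwSamplesPerSec = 44100;              // CD-quality sampling rate
        public uint dwAvgBytesPerSec = 176400;            // dwSamplesPerSec * wBlockAlign
        public ushort wBlockAlign = 4;                    // wChannels * (wBitsPerSample / 8)
        public ushort wBitsPerSample = 16;                // bit depth
    }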

As I mentioned, this chunk gives you pretty much everything you need to specify the wave format. When we look at DirectSound later, we will see the same fields to describe the wave format.

Now, let’s look at the data chunk, my favorite chunk of them all!

Data Chunk

The data chunk is really simple. It’s got the sGroupID, the length of the data and the data itself. Depending on your chosen bit depth, the data type of the array will vary.

Field Name | Size (bytes) | C# Data Type | Value | Description
sGroupID | 4 | char[4] | “data” | Indicates that the data chunk is coming next.
dwChunkSize | 4 | uint | varies | The length of the sample data, in bytes.
sampleData | dwChunkSize | byte[] (8-bit audio), short[] (16-bit audio), or float[] (32-bit audio) | varies | All sample data is stored here. The number of elements in the array is dwSamplesPerSec * wChannels * the duration of the audio in seconds.
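
As a sketch, the data chunk for 16-bit audio could look like this in C# (swap the array type for other bit depths):

    // A minimal sketch of the data chunk for 16-bit samples.
    public class DataChunk
    {
        public char[] sGroupID = { 'd', 'a', 't', 'a' };
        public uint dwChunkSize;    // sampleData.Length * 2 bytes for short samples
        public short[] sampleData;  // interleaved by channel: L, R, L, R, ... for stereo
    }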

For the sample data, it’s important to note the range of values each element of the array can take, because each sample represents amplitude.

For 16-bit audio, each sample is a signed two’s complement integer, so it can be negative, positive or zero, and it ranges from -32768 to 32767 (two’s complement gives one more value on the negative side than the positive). For 8-bit audio, the WAV format actually stores unsigned bytes, so the range is 0 to 255, with 128 representing silence. For 32-bit audio, samples are represented as proper floats from -1.0f to 1.0f. (Endianness, by the way, does not change these ranges – WAV files simply store multi-byte values little-endian, which is what BitConverter produces on typical .NET platforms anyway.)

Here are the value ranges for each data type:

  • 8-bit audio: 0 to 255 (unsigned – 128 is the midpoint, i.e. silence)
  • 16-bit audio: -32768 to 32767
  • 32-bit audio: -1.0f to 1.0f
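
A common pattern is to generate amplitudes as floats in the -1.0f to 1.0f range and then scale them to whatever storage type the file uses. Here is a quick sketch of that mapping (my own illustration, not part of the WAV spec):

    // Scaling a normalized amplitude to each storage type.
    float amplitude = 0.5f;

    byte sample8 = (byte)(128 + amplitude * 127f);         // unsigned: 0..255, 128 = silence
    short sample16 = (short)(amplitude * short.MaxValue);  // signed: -32768..32767
    float sample32 = amplitude;                            // floats are stored as-is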

Making a Wave File

Writing the wave file is as easy as constructing the Header, Format Chunk and Data Chunk and writing them in binary fashion to a file in that order.

There are a couple of caveats. First, determining the dwChunkSize for the format and data chunks can be fiddly, because you have to sum up the byte count of every counted field and use that total as your chunk size. One way to do this is with the .NET Framework’s BitConverter class: convert each field to a byte array, take its length, and sum the totals for all the counted fields in the chunk. If you get the chunk size wrong, the wave file won’t work, because whatever is trying to play it will read an incorrect number of bytes for that chunk.
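
For example, here is how the format chunk’s size could be totaled up with BitConverter, assuming the FormatChunk sketch from earlier (the data chunk is simpler – its dwChunkSize is just the byte length of the sample array):

    // Sum the byte count of every counted field.
    // sGroupID and dwChunkSize themselves are excluded from the count.
    FormatChunk fmt = new FormatChunk();
    uint chunkSize = 0;
    chunkSize += (uint)BitConverter.GetBytes(fmt.wFormatTag).Length;       // 2
    chunkSize += (uint)BitConverter.GetBytes(fmt.wChannels).Length;        // 2
    chunkSize += (uint)BitConverter.GetBytes(fmt.dwSamplesPerSec).Length;  // 4
    chunkSize += (uint)BitConverter.GetBytes(fmt.dwAvgBytesPerSec).Length; // 4
    chunkSize += (uint)BitConverter.GetBytes(fmt.wBlockAlign).Length;      // 2
    chunkSize += (uint)BitConverter.GetBytes(fmt.wBitsPerSample).Length;   // 2
    fmt.dwChunkSize = chunkSize;                                           // 16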

The other thing you have to calculate AFTER generating the chunks is the dwFileLength field in the header. This value is equal to the total size of the entire file, in bytes, minus 8. The 8 bytes we subtract are the “RIFF” marker and the dwFileLength field itself; everything after them is counted, including the “WAVE” marker and each chunk’s sGroupID and dwChunkSize fields (even though those are ignored when calculating the individual chunk sizes). There are some other ways to do this, which we’ll explore in the next article.

The best way to accomplish all of this cleanly is to implement a structure or class for the header and for each of the two chunks, with all the fields defined using the data types shown above. Put methods on each chunk class that calculate that chunk’s dwChunkSize as well as the total size of the chunk including sGroupID and dwChunkSize. Then add together the total chunk sizes of the format and data chunks, plus 4 for the “WAVE” marker – the only part of the header that counts toward the total. Assign that value to dwFileLength and you are golden.
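
Putting it all together, the write step might be sketched like this. GetTotalSize() is a hypothetical helper on each chunk class (it returns the chunk’s size including sGroupID and dwChunkSize, as described above); the rest is the standard BinaryWriter from System.IO:

    // Only the 4-byte "WAVE" marker from the header counts toward dwFileLength.
    // format and data are instances of the chunk classes sketched earlier;
    // GetTotalSize() is a hypothetical helper, not a built-in method.
    uint dwFileLength = 4 + format.GetTotalSize() + data.GetTotalSize();

    using (BinaryWriter writer = new BinaryWriter(File.Create("output.wav")))
    {
        writer.Write(new char[] { 'R', 'I', 'F', 'F' });
        writer.Write(dwFileLength);
        writer.Write(new char[] { 'W', 'A', 'V', 'E' });
        // ...then write every field of the format chunk, then the data chunk,
        // in exactly the order shown in the tables above.
    }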

That’s It, For Now…

What, no example code? I know, I know. That’s what the next blog post is for, so stay tuned. In the next post, you’ll learn how to use all this knowledge to write a C# program that synthesizes audio, generates the samples, and writes the results to a working WAV file that you can listen to!

Currently Playing: Silversun Pickups – Swoon – The Royal We