Using the Hashing Transforms (or How Do I Compute a Hash Block by Block)

Occasionally I get asked how to use the hashing algorithms that ship with .NET to hash data when only pieces of the input are available at a time. This comes up for various reasons: sometimes the input data is too big to fit entirely into the available memory, sometimes the data isn't all available at once, and sometimes it is already being processed a block at a time and it would be inefficient to read it all again through a stream.

Instead of just showing how to manually hash data a block at a time, I'll show all three methods for computing the hash, starting with the most automated and ending with the manual method. The data I'll use in the samples is just a simple string. I've divided it into blocks, and also kept a copy as a whole.

// set up some data to hash -- either in whole form, or in multiple blocks
byte[] data = Encoding.UTF8.GetBytes("Test input data for hashing.");
byte[] subBlock1 = Encoding.UTF8.GetBytes("Test input d");
byte[] subBlock2 = Encoding.UTF8.GetBytes("ata for");
byte[] subBlock3 = Encoding.UTF8.GetBytes(" hashing.");

The Easy Way

The easiest way to compute a hash is to simply call the ComputeHash() method of the hashing algorithm. This method has several overloads, but the two basic variations are passing a byte array or passing a stream. Either way, the return value is a byte array representing the hash code.

// create the hash the easy way
byte[] easyHash = new SHA256Managed().ComputeHash(data);
Console.WriteLine(Convert.ToBase64String(easyHash));
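
To make the two variations concrete, here's a minimal sketch that hashes the same bytes through both the byte array overload and the stream overload. The EasyHashSketch class and its method names are made up for illustration, and I've used SHA256.Create() instead of SHA256Managed on the assumption that you're on a newer framework where the factory method is preferred; either should behave the same way here.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper class, not part of the framework.
public static class EasyHashSketch
{
    // Hash a byte array with the ComputeHash(byte[]) overload.
    public static byte[] HashBytes(byte[] input)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return sha.ComputeHash(input);
        }
    }

    // Hash whatever remains in a stream with the ComputeHash(Stream) overload.
    public static byte[] HashStream(Stream input)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return sha.ComputeHash(input);
        }
    }
}
```

Calling HashBytes(data) and HashStream(new MemoryStream(data)) should give you identical results, since the stream overload simply reads the stream to its end and hashes what it finds.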

Using Streams

The easiest way to hash a stream is to call the ComputeHash() method and pass it the stream you want hashed. However, I'll show you here how to hash a stream yourself, in a manner very similar to what ComputeHash() does to accomplish its task. Understanding how this works is key to understanding how to compute the hash of several blocks of data by hand. One reason you might want to do this in a real-world application is if you were computing the hash of several files. If you called ComputeHash() with many different file streams, you'd get a separate hash for each file. By feeding the data through the hashing transform yourself, you can keep a single hash value.

The first thing I do is create a memory stream that holds the data I want to hash. This could just as easily be a file stream, or an isolated storage stream, if you were hashing data from one of these sources. Next, I create the hashing algorithm, and attach a crypto stream to the algorithm and memory stream. The crypto stream should be in read mode, since it is going to be reading data out of the underlying memory stream.

The next step is to actually pull the data from the underlying stream through the crypto stream and into a temporary buffer. I set up a scratch buffer that is the same size as the data I'm going to hash, but if the data is large and you want to save some memory, you could use a smaller buffer and perform the read in several steps. The call to Read() on the crypto stream pulls the data out of the underlying memory stream, through the hashing transform, and out into my scratch buffer. The value in the scratch buffer is not the hash value, however, since that value changes with each byte read. Instead, the scratch buffer is filled with the data being hashed.

Once all of the data has been read out of the memory stream, the hash value is available from the hashing algorithm's Hash property. It's OK to read this property after you close the crypto stream; however, if you read it after you dispose of the crypto stream (either by calling Dispose() yourself or by exiting a using block), the hash value will be destroyed.

// create the hash by reading a stream
using(MemoryStream ms = new MemoryStream(data))
{
    SHA256Managed shaStream = new SHA256Managed();
    using(CryptoStream cs = new CryptoStream(ms, shaStream, CryptoStreamMode.Read))
    {
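      // read the contents of the underlying stream into the scratch buffer;
      // a single Read() may return fewer bytes than requested, so keep
      // reading until it returns 0
      byte[] scratch = new byte[data.Length];
      while(cs.Read(scratch, 0, scratch.Length) > 0)
      {
      }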
      cs.Close();

      // make sure to save the hash value before the crypto stream is disposed
      byte[] streamHash = shaStream.Hash;
      Console.WriteLine(Convert.ToBase64String(streamHash));
    }
    ms.Close();
}
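
If the data is too large for a scratch buffer of the same size, the read can be done in several steps with a small buffer, as mentioned above. Here's a rough sketch of that idea. StreamHashSketch, HashThroughCryptoStream and the bufferSize parameter are names I invented for illustration, and SHA256.Create() is assumed to be an acceptable stand-in for SHA256Managed.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper class, not part of the framework.
public static class StreamHashSketch
{
    // Pulls 'source' through a CryptoStream in chunks of at most 'bufferSize' bytes.
    public static byte[] HashThroughCryptoStream(Stream source, int bufferSize)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");

        using (SHA256 sha = SHA256.Create())
        using (CryptoStream cs = new CryptoStream(source, sha, CryptoStreamMode.Read))
        {
            byte[] scratch = new byte[bufferSize];

            // Each Read() pushes more data through the hashing transform.
            // The bytes that land in scratch are the original data and are ignored.
            // Read() returns 0 once the underlying stream is exhausted.
            while (cs.Read(scratch, 0, scratch.Length) > 0)
            {
            }

            // Grab the hash while the crypto stream is still alive.
            return sha.Hash;
        }
    }
}
```

The buffer size only affects how much memory is used at a time; a 16 byte buffer and a 64 KB buffer should produce the same hash for the same stream.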

Creating the Hash One Block at a Time

If pulling the data to be hashed through a crypto stream doesn't provide enough control for you, it's possible to create the hash by transforming one block of data at a time. This is actually what the crypto stream is doing beneath the covers. The process can be somewhat counter-intuitive, since it's not expected that most people will get hash values this way, and the interface involved was designed primarily for the crypto stream to use.

In order to hash data one block at a time, you need to use the hash algorithm's ICryptoTransform interface. This interface provides two methods of note:

  • TransformBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
  • TransformFinalBlock(byte[] inputBuffer, int inputOffset, int inputCount)

For all but the last block of your data, you call TransformBlock(). The last block gets passed to TransformFinalBlock(). Getting the parameters correct can be tricky, however. TransformFinalBlock() is the obvious one: it takes an array containing the bytes to hash, the offset of the first byte to hash, and the number of bytes that should be hashed. TransformBlock()'s parameters are a little more subtle.

The first three parameters are the same as for TransformFinalBlock(), but the output buffer is not actually an output buffer for the hash. Instead, you should pass the same buffer for outputBuffer as you did for inputBuffer; otherwise the hashing algorithm may throw an exception (generally an IOException). The value written to the output buffer is just a copy of the input, so it's usually a good idea to give the same starting offset as well.

The reason for this is that every bit read into a hashing algorithm can change any byte of the hash, which prevents the output from being written to an output buffer as the hash is computed. However, the parameter is necessary, since TransformBlock() is a member of the ICryptoTransform interface, and this interface is what CryptoStream works with. Since standard symmetric or asymmetric encryption algorithms can compute their output value as they go, these algorithms actually do make use of outputBuffer. Internally, CryptoStream reads a block from the input stream, passes it to the ICryptoTransform, and has the transform write a block back to the output stream.
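
To show how the offset and count parameters fit together, here's a small sketch that hashes one large array in slices, passing the same array and the same offset for both the input and output parameters as described above. SliceHashSketch, HashInSlices and sliceSize are hypothetical names of my own, and SHA256.Create() is assumed to be interchangeable with SHA256Managed for this purpose.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper class, not part of the framework.
public static class SliceHashSketch
{
    // Hashes 'data' in slices of at most 'sliceSize' bytes.
    public static byte[] HashInSlices(byte[] data, int sliceSize)
    {
        if (data == null) throw new ArgumentNullException("data");
        if (sliceSize <= 0) throw new ArgumentOutOfRangeException("sliceSize");

        using (SHA256 sha = SHA256.Create())
        {
            int offset = 0;

            // Every slice except the last one goes through TransformBlock().
            while (data.Length - offset > sliceSize)
            {
                // Same array and same offset for input and output.
                sha.TransformBlock(data, offset, sliceSize, data, offset);
                offset += sliceSize;
            }

            // Whatever remains (possibly zero bytes) is the final block.
            sha.TransformFinalBlock(data, offset, data.Length - offset);
            return sha.Hash;
        }
    }
}
```

Since the input and output locations are identical, the contents of data are left untouched, and the slice size has no effect on the resulting hash.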

You should also note that in my samples the blocks that I transform are all of different lengths. This is just fine with the hashing algorithm, which, unlike some encryption algorithms, will work with blocks of any length. Enough talk about the parameters to these methods; let's look at some sample code:

// create the hash by manually transforming blocks
SHA256Managed shaManual = new SHA256Managed();
shaManual.TransformBlock(subBlock1, 0, subBlock1.Length, subBlock1, 0);
shaManual.TransformBlock(subBlock2, 0, subBlock2.Length, subBlock2, 0);
shaManual.TransformFinalBlock(subBlock3, 0, subBlock3.Length);
Console.WriteLine(Convert.ToBase64String(shaManual.Hash));
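
The same pattern extends to the case mentioned earlier where several files should end up in a single hash value. The sketch below reads each stream into a reusable buffer and feeds every chunk to TransformBlock(). Because it can't know in advance which chunk is the last one, it finishes with an empty call to TransformFinalBlock(). MultiStreamHashSketch, HashAll and bufferSize are names I made up for this example, and SHA256.Create() stands in for SHA256Managed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper class, not part of the framework.
public static class MultiStreamHashSketch
{
    // Computes one hash over the contents of all 'sources', in order.
    public static byte[] HashAll(IEnumerable<Stream> sources, int bufferSize)
    {
        if (sources == null) throw new ArgumentNullException("sources");
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");

        using (SHA256 sha = SHA256.Create())
        {
            byte[] buffer = new byte[bufferSize];

            foreach (Stream source in sources)
            {
                int read;
                // Read() returns 0 at the end of each stream.
                while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Only the first 'read' bytes of the buffer are valid.
                    sha.TransformBlock(buffer, 0, read, buffer, 0);
                }
            }

            // No data is left, so close out the hash with an empty final block.
            sha.TransformFinalBlock(buffer, 0, 0);
            return sha.Hash;
        }
    }
}
```

Keep in mind that the result is the hash of the concatenated data. The boundaries between the streams are not recorded, so splitting the same bytes across the streams differently gives the same hash.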

If you put all the sample code from this post together, you'll notice that all three methods of hashing will print out the same hash value: P0KAzj5A473O8iRKbg++AiZHN6elbfSfb+GmwuG2F14= (in base64). Hopefully the samples in this post will help to clear up confusion on computing hash values in more advanced scenarios.