Byte of a struct and onto endian concerns

[Updated: added performance data. 7/19]

I haven't written for a while, but it hasn't been for lack of things to write about.  Just been a little busier than normal lately.  I'll let you imagine why. ;)

Something that I've been working on is parsing graphics files for testing (in C#).  It's useful to know all of the metadata in a JPG, PNG, GIF, BMP, etc. for testing, not just the simpler facts like height & width or color depth.  The specs for graphics files can be quite vast (take a look at TIFF), and testing against them requires understanding the specs in depth if your application does any parsing of its own.

The specifications usually give struct definitions to document their internal format.  You can create C# structs that match the layout, but then how do you get data from a file into them?  You can just read ints and bytes and such into the struct fields, but that's a bit tedious to do--I hate tedious.  If something is tedious, there is always a better way. Ok then, but how can you yank a "legacy" struct out of a data file?  Let's take a look at a simple struct example and talk further:

[StructLayout(LayoutKind.Sequential)]
public struct BITMAPCOREHEADER
{
   public UInt32 bcSize;
   public UInt16 bcWidth;
   public UInt16 bcHeight;
   public UInt16 bcPlanes;
   public UInt16 bcBitCount;
}

This is a real example from a BMP/DIB.  (Note that LayoutKind.Sequential isn't really needed, since it's the default for C# structs.  I like using it as a "flag-to-self" that the struct is for interop.)
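A quick sanity check is worth having before reading anything into a struct like this. Here's a minimal sketch, assuming the struct is declared exactly as above: Marshal.SizeOf should report the 12 bytes the fields add up to (4 + 2 + 2 + 2 + 2). LayoutCheck is a hypothetical helper name used only for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct BITMAPCOREHEADER
{
   public UInt32 bcSize;
   public UInt16 bcWidth;
   public UInt16 bcHeight;
   public UInt16 bcPlanes;
   public UInt16 bcBitCount;
}

// Hypothetical helper, not part of the original parser.
public static class LayoutCheck
{
   // Returns the marshaled (unmanaged) size of BITMAPCOREHEADER in bytes.
   public static int CoreHeaderSize()
   {
      return Marshal.SizeOf(typeof(BITMAPCOREHEADER));
   }
}
```

If the number that comes back doesn't match the size given in the file format's documentation, the struct definition (or its packing) is the first place to look.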

It turns out that most intrinsic types (the integer and floating-point types, for instance) and structs made only of those types are stored exactly the same way in managed and unmanaged memory.  The interop documentation calls these "blittable" types; bool and char are notable exceptions.  Knowing that, and using a few methods from the Marshal class, you can copy structs to and from pointers.  Here's an example of a generic method to do just that.  Note that the method lives in a class (StreamParser) that wraps a BinaryReader: this.Length (length of the file), this.Position (current position in the file), and this.ReadBytes (reads a byte array) all map to that reader.  (More on why later.)

/// <summary>
/// Reads the specified struct from an unmanaged stream.
/// </summary>
/// <typeparam name="T">The struct type you're trying to read.</typeparam>
/// <returns>The struct read from the stream.</returns>
public T ReadStruct<T>() where T: struct
{
   int objectSize = Marshal.SizeOf(typeof(T));
   if (this.Length - this.Position < objectSize)
   {
      // Throw appropriately here
   }
   // Allocate some unmanaged memory.
   IntPtr buffer = Marshal.AllocHGlobal(objectSize);
   try
   {
      // Copy the read byte array (byte[]) into the unmanaged memory block.
      Marshal.Copy(this.ReadBytes(objectSize), 0, buffer, objectSize);
      // Push the memory into a new struct of type (T).
      return (T)Marshal.PtrToStructure(buffer, typeof(T));
   }
   finally
   {
      // Free the unmanaged memory block, even if the read or the copy throws.
      Marshal.FreeHGlobal(buffer);
   }
}

You can simply call this method as follows to get a BITMAPCOREHEADER (assuming you've already moved to the right position in the file):

// streamParser is an instance of the class that contains ReadStruct
BITMAPCOREHEADER header = streamParser.ReadStruct<BITMAPCOREHEADER>();
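To make the pieces above concrete, here's a self-contained sketch of what such a wrapper class might look like. This is a guess at a minimal shape, not the actual StreamParser: the constructor, the Dispose method, the choice of EndOfStreamException, and the StreamParserDemo class are all assumptions added for illustration. The sample bytes are laid out little-endian, so the values in the comments assume a little-endian machine.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct BITMAPCOREHEADER
{
   public UInt32 bcSize;
   public UInt16 bcWidth;
   public UInt16 bcHeight;
   public UInt16 bcPlanes;
   public UInt16 bcBitCount;
}

// Hypothetical minimal wrapper: exposes Length, Position, and ReadBytes
// on top of a BinaryReader, as described in the text.
public class StreamParser : IDisposable
{
   private readonly BinaryReader reader;

   public StreamParser(Stream stream)
   {
      this.reader = new BinaryReader(stream);
   }

   public long Length { get { return this.reader.BaseStream.Length; } }

   public long Position
   {
      get { return this.reader.BaseStream.Position; }
      set { this.reader.BaseStream.Position = value; }
   }

   public byte[] ReadBytes(int count) { return this.reader.ReadBytes(count); }

   public T ReadStruct<T>() where T : struct
   {
      int objectSize = Marshal.SizeOf(typeof(T));
      if (this.Length - this.Position < objectSize)
      {
         // One possible choice of exception; the original leaves this open.
         throw new EndOfStreamException("Not enough data left for " + typeof(T).Name);
      }
      IntPtr buffer = Marshal.AllocHGlobal(objectSize);
      try
      {
         Marshal.Copy(this.ReadBytes(objectSize), 0, buffer, objectSize);
         return (T)Marshal.PtrToStructure(buffer, typeof(T));
      }
      finally
      {
         Marshal.FreeHGlobal(buffer);
      }
   }

   public void Dispose() { this.reader.Close(); }
}

public static class StreamParserDemo
{
   // Parses a hand-built 12-byte BITMAPCOREHEADER from memory.
   public static BITMAPCOREHEADER ReadSample()
   {
      byte[] raw =
      {
         0x0C, 0x00, 0x00, 0x00, // bcSize = 12
         0x40, 0x00,             // bcWidth = 64
         0x20, 0x00,             // bcHeight = 32
         0x01, 0x00,             // bcPlanes = 1
         0x18, 0x00              // bcBitCount = 24
      };
      using (StreamParser parser = new StreamParser(new MemoryStream(raw)))
      {
         return parser.ReadStruct<BITMAPCOREHEADER>();
      }
   }
}
```

Feeding it a MemoryStream instead of a file keeps the example easy to run; a FileStream would go through the same code path.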

Not bad, but careful readers may have noticed that we're copying out to unmanaged memory and back and yet I said earlier that structs are stored the same way in both places.  Well, using the unsafe keyword we can get around this.

// Requires compiling with /unsafe.
public T ReadStruct<T>() where T: struct
{
   int objectSize = Marshal.SizeOf(typeof(T));
   if (this.Length - this.Position < objectSize)
   {
      // Throw appropriately here
   }
   // Doesn't allocate unmanaged memory.
   T returnStruct;
   unsafe
   {
      byte[] buffer = this.ReadBytes(objectSize);
      // Pin the managed array so the GC can't move it while we hold a pointer to it.
      fixed (byte* bufferPointer = &buffer[0])
      {
         returnStruct = (T)Marshal.PtrToStructure((IntPtr)bufferPointer, typeof(T));
      }
   }
   return returnStruct;
}
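If compiling with unsafe code isn't an option, pinning the array with a GCHandle is another way to skip the unmanaged allocation. To be clear, this is a different technique from the fixed statement shown above and isn't something I measured; it's a sketch, and PinnedReader, FromBytes, and TwoBytes are hypothetical names made up for the example.

```csharp
using System;
using System.Runtime.InteropServices;

// Tiny sample struct so the helper can be exercised without worrying about byte order.
[StructLayout(LayoutKind.Sequential)]
public struct TwoBytes
{
   public byte First;
   public byte Second;
}

// Hypothetical helper, not part of the original parser.
public static class PinnedReader
{
   // Builds a struct of type T from the start of a byte array by pinning
   // the array with a GCHandle instead of using a fixed statement.
   public static T FromBytes<T>(byte[] buffer) where T : struct
   {
      if (buffer == null)
      {
         throw new ArgumentNullException("buffer");
      }
      if (buffer.Length < Marshal.SizeOf(typeof(T)))
      {
         throw new ArgumentException("Buffer is too small for " + typeof(T).Name, "buffer");
      }

      // Pin the array so the garbage collector cannot move it while we use its address.
      GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
      try
      {
         return (T)Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(T));
      }
      finally
      {
         // Always unpin, or the array stays stuck in place for the GC.
         handle.Free();
      }
   }
}
```

A call such as PinnedReader.FromBytes<TwoBytes>(bytes) then takes the place of the unsafe block. Whether it is faster or slower than either version above is something you'd have to measure yourself.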

Nice, but how does this affect performance? To check, I set up a 128-byte buffer and struct and copied the data a million times with each of the two methods above. On my machine the "unsafe" method took about a second and the "safe" method about a second and a half. That's roughly a third less time, but at around half a microsecond saved per call you're unlikely to notice unless you're doing some really heavy-duty data handling.
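For anyone who wants to repeat the measurement, here's a rough sketch of a timing harness. It is not the code I used for the numbers above; Block128, CopyBenchmark, and the method names are made up for illustration, and only the AllocHGlobal-based copy is included so the sample compiles without /unsafe. Pass a different Action to Time to measure another variant.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// 16 fields x 8 bytes = 128 bytes, matching the buffer size described above.
[StructLayout(LayoutKind.Sequential)]
public struct Block128
{
   public ulong F0, F1, F2, F3, F4, F5, F6, F7;
   public ulong F8, F9, F10, F11, F12, F13, F14, F15;
}

// Hypothetical harness, not the original test code.
public static class CopyBenchmark
{
   // The "safe" copy: byte[] -> unmanaged block -> struct.
   public static T CopyViaHGlobal<T>(byte[] source) where T : struct
   {
      int size = Marshal.SizeOf(typeof(T));
      IntPtr buffer = Marshal.AllocHGlobal(size);
      try
      {
         Marshal.Copy(source, 0, buffer, size);
         return (T)Marshal.PtrToStructure(buffer, typeof(T));
      }
      finally
      {
         Marshal.FreeHGlobal(buffer);
      }
   }

   // Runs the action the given number of times and returns the elapsed wall-clock time.
   public static TimeSpan Time(int iterations, Action action)
   {
      action(); // Warm-up call so JIT compilation isn't counted.
      Stopwatch watch = Stopwatch.StartNew();
      for (int i = 0; i < iterations; i++)
      {
         action();
      }
      watch.Stop();
      return watch.Elapsed;
   }

   // Times the AllocHGlobal-based copy of a 128-byte buffer.
   public static TimeSpan MeasureHGlobal(int iterations)
   {
      byte[] data = new byte[128];
      return Time(iterations, () => { CopyViaHGlobal<Block128>(data); });
   }
}
```

Calling CopyBenchmark.MeasureHGlobal(1000000) repeats the million-copy run for the "safe" path. Absolute times will vary a lot by machine and runtime version, so compare the two methods on the same box rather than against my numbers.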

I also looked at using Marshal.Copy() with unsafe. It turns out that you can't declare a pointer to a generic type parameter (a T*), which made that difficult, if not impossible. (See the first comment for details on why this is.)

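The function is useful, and thanks to generics the syntax is pretty readable.  It won't work as-is if the file stores its values in big-endian order, however, because the bytes are copied into the struct exactly as they appear in the file.  I'll talk more about endian swapping in my next post.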
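To see why byte order matters: a 16-bit width of 64 stored big-endian sits in the file as the bytes 00 40, and copying those straight into a UInt16 on a little-endian machine yields 0x4000 (16384). Here's a minimal sketch of the kind of swap that fixes this up after the read. EndianSwap is a hypothetical helper, and this isn't necessarily the approach the next post will take.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class EndianSwap
{
   // Reverses the two bytes of a 16-bit value.
   public static ushort Swap(ushort value)
   {
      // The mask keeps the result in range even when compiled with overflow checking on.
      return (ushort)(((value >> 8) | (value << 8)) & 0xFFFF);
   }

   // Reverses the four bytes of a 32-bit value.
   public static uint Swap(uint value)
   {
      return (value >> 24)
           | ((value >> 8) & 0x0000FF00u)
           | ((value << 8) & 0x00FF0000u)
           | (value << 24);
   }
}
```

With this, EndianSwap.Swap((ushort)0x4000) gives back 0x0040, the intended 64. Each multi-byte field of a struct would need the same treatment; single bytes are unaffected.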

One last bit... there is some cool stuff in the Marshal class, but it is limited.  Marshal.AllocHGlobal(), for example, actually calls the Win32 LocalAlloc() API with no flags.  This means the block of memory is not initialized and can have garbage in it.  Beware.  LocalAlloc() has a flag to zero the memory, but there is no way to get at it via Marshal.  The only way I've found so far to clear the block is to create a managed array of bytes and Marshal.Copy() it over.  (You could, of course, P/Invoke into the memory APIs yourself.)
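The zeroing workaround looks something like this. It's a sketch only; ZeroedAlloc and AllocZeroed are hypothetical names, and it relies on the fact that a freshly created managed byte array is always filled with zeros.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration only.
public static class ZeroedAlloc
{
   // Allocates a block with Marshal.AllocHGlobal and overwrites it with zeros.
   // The caller owns the block and must release it with Marshal.FreeHGlobal.
   public static IntPtr AllocZeroed(int size)
   {
      IntPtr block = Marshal.AllocHGlobal(size);
      // A new managed byte[] is zero-initialized, so copying it over clears the block.
      Marshal.Copy(new byte[size], 0, block, size);
      return block;
   }
}
```

The cost is one throwaway managed array per allocation, which is fine for header-sized structs but worth keeping in mind for large blocks.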