Thoughts on writing an IL Disassembler

I’m a big fan of writing your own version of tools, even if there’s already an existing tool. Why? Because writing your own tool is absolutely the best way to understand the concepts. You’ll run into all sorts of little gotchas and special case exceptions that force you to really understand what you’re working with.

 

My canonical example is showing the contents of a Win32 (PE) executable. Sure, there are plenty of tools, such as DUMPBIN and Russ Osterlund’s fabulous PEBrowse Professional, that break a PE file apart for your viewing pleasure. But if you can write an equivalent tool, you’ll be forced to understand every nuance of the file format. That’s why I keep my own PE dumping program up to date.

 

A similar story exists for assembler instructions, be they x86, IA64, or the CLR’s IL. It’s one thing to be able to read them; it’s entirely another to know how to crack apart any instruction. It takes your knowledge to a whole new level.

 

Over ten years ago, I wrote my first disassembler, for the x86 platform. At the time, the 80486 was the top dog chip, and my goal was to handle every valid instruction, even the wacky floating point instructions. That code (which I still use today) is filled with all sorts of code for weird special cases, and isn’t particularly pleasant to read. Thus, you might imagine my trepidation when it came to writing an IL disassembler. I kept pushing it off until I had a real need for it.

 

That time came recently. Sure, I could have looked at the code for existing disassemblers, but I wouldn’t have learned as much. And to my delight, it turned out to be quite a simple task compared to my prior x86 experience.

 

At a high level, writing an IL disassembler has two main tasks:

  • Cracking the headers and exception handling clauses
  • Decoding the instruction stream

In the end, I spent far more time understanding the headers and clauses than I did writing the IL decoding portion. However, decoding the instruction stream is what I’m going to talk about here.

 

I’m a big advocate of data-driven programming. My goal is to write as little code as possible, and embed in data as much of the behavior as I can. Decoding IL instructions is a perfect example of where data-driven programming makes things easier.

 

There are quite a few IL instructions that branch to another address. The destination is embedded as a relative offset that immediately follows the instruction’s opcode. In a data-driven approach, I have a structure that represents all the various attributes of each opcode, and one of those attributes is whether the instruction branches. In my first stab at the code, the output simply showed the offset value. Now, let’s say I wanted to get fancy and add annotations to the output, for instance indicating whether the target is before or after the current instruction. Or I might wish to indicate that the target is the RET instruction at the end of the function. I can add these features in one place in my code and know that they’ll show up for every branch instruction. In a non-data-driven approach, I’d have to find the code that handles each specific branch instruction and modify it.
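A minimal sketch of that idea follows. It is illustrative only, not the actual disassembler code: the OpcodeInfo structure, the three-entry table, and the Describe function are hypothetical names invented for this example. The opcode values (0x2A for ret, 0x2B for br.s, 0x38 for br) and the rule that a branch offset is relative to the start of the next instruction come from the IL specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, heavily trimmed opcode table: one row of attributes per opcode.
struct OpcodeInfo
{
    std::uint8_t value;        // single-byte opcode value
    const char  *name;         // mnemonic
    std::size_t  operandSize;  // bytes of operand that follow the opcode
    bool         isBranch;     // operand is a relative branch offset
};

static const OpcodeInfo g_opcodes[] =
{
    { 0x2A, "ret",  0, false },
    { 0x2B, "br.s", 1, true  },
    { 0x38, "br",   4, true  },
};

const OpcodeInfo *FindOpcode(std::uint8_t value)
{
    for (const OpcodeInfo &op : g_opcodes)
        if (op.value == value)
            return &op;
    return nullptr;
}

// Formats the instruction at 'offset'. Every branch annotation lives in the
// single 'isBranch' block, so all branch opcodes in the table pick it up.
std::string Describe(const std::uint8_t *code, std::size_t size, std::size_t offset)
{
    if (offset >= size)
        return "<out of range>";

    const OpcodeInfo *op = FindOpcode(code[offset]);
    if (op == nullptr || op->operandSize > size - offset - 1)
        return "<unknown>";

    std::string text = op->name;
    if (op->isBranch)
    {
        // Read the signed, little-endian offset (1 or 4 bytes).
        const std::uint8_t *operand = code + offset + 1;
        std::int64_t rel;
        if (op->operandSize == 1)
        {
            rel = static_cast<std::int8_t>(operand[0]);
        }
        else
        {
            std::uint32_t raw = 0;
            for (int i = 0; i < 4; ++i)
                raw |= static_cast<std::uint32_t>(operand[i]) << (8 * i);
            rel = static_cast<std::int32_t>(raw);
        }

        // The offset is relative to the start of the next instruction.
        std::int64_t next = static_cast<std::int64_t>(offset + 1 + op->operandSize);
        std::int64_t target = next + rel;

        text += " -> " + std::to_string(target);
        text += (target <= static_cast<std::int64_t>(offset)) ? " (backward)" : " (forward)";
        if (target == static_cast<std::int64_t>(size) - 1 && code[target] == 0x2A)
            text += " [final ret]";
    }
    return text;
}
```

Supporting another branch opcode in this sketch means adding one row to the table; the forward/backward and final-ret annotations come along for free.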

 

I included this little digression on data driven programming to help show my delight when I discovered the OPCODE.DEF file in the .NET Framework SDK. If you’re at all curious, go look at it. You’ll see a bunch of lines that begin with “OPDEF”, such as:

 

OPDEF(CEE_LDARG_1, "ldarg.1", Pop0, Push1, InlineNone, // Rest of line omitted

 

It doesn’t take too long to see the similarity to an initialized C++ array of structures. For instance, consider this very much simplified example:

 

struct S
{
    int m_int;
    const char * m_str;   // const, since the initializers below are string literals
};

S MyArray[] =
{
    {1, "hello"},
    {2, "goodbye"},
    {3, "Ciao!"},
};

 

In the OPCODE.DEF file, all the OPDEF lines sure look like they’re initializing a structure in an array (albeit a structure with many more fields than my simplified example above).

 

What you won’t find in OPCODE.DEF is any sort of structure definition or meaning for OPDEF. However, crafty programmers that we are, it’s easy enough to examine the OPDEF lines and figure out how to declare our own structure (say, “struct ILOpcode”). Next, it’s not hard to write an OPDEF macro that initializes one ILOpcode instance.
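Here is a rough sketch of what such a structure and macro might look like. Treat all of it as an assumption: the ILOpcode field names are invented, and the SAMPLE_OPDEF_LINES macro is a hand-written stand-in for the file's contents so that the example compiles on its own. In particular, the five trailing arguments on each sample line (after the operand kind) are illustrative guesses at the omitted part of the line; check them against your own copy of OPCODE.DEF.

```cpp
#include <cstddef>
#include <cstring>

// Operand kinds, named after the fifth OPDEF argument. Only the kinds used by
// the sample lines below are listed here.
enum OperandKind { InlineNone, ShortInlineBrTarget, InlineBrTarget };

// Hypothetical structure; the field names are invented for this sketch.
struct ILOpcode
{
    const char   *symbol;   // e.g. "CEE_LDARG_1"
    const char   *name;     // e.g. "ldarg.1"
    const char   *pop;      // stack pop behavior, kept as text
    const char   *push;     // stack push behavior, kept as text
    OperandKind   operand;  // what follows the opcode bytes
    unsigned char byte1;    // assumed: prefix byte, 0xFF for one-byte opcodes
    unsigned char byte2;    // assumed: the opcode value itself
};

// Stand-in for the file. With the real thing, this would be replaced by an
// #include of OPCODE.DEF at the point where the array is initialized.
#define SAMPLE_OPDEF_LINES \
    OPDEF(CEE_LDARG_1, "ldarg.1", Pop0,   Push1, InlineNone,          IMacro,     1, 0xFF, 0x03, NEXT)   \
    OPDEF(CEE_RET,     "ret",     VarPop, Push0, InlineNone,          IPrimitive, 1, 0xFF, 0x2A, RETURN) \
    OPDEF(CEE_BR_S,    "br.s",    Pop0,   Push0, ShortInlineBrTarget, IMacro,     1, 0xFF, 0x2B, BRANCH) \
    OPDEF(CEE_BR,      "br",      Pop0,   Push0, InlineBrTarget,      IPrimitive, 1, 0xFF, 0x38, BRANCH)

// Each OPDEF line becomes one initializer. Arguments this sketch does not
// need are dropped; the pop/push arguments are stringized instead of parsed.
#define OPDEF(sym, str, pop, push, operand, kind, len, b1, b2, flow) \
    { #sym, str, #pop, #push, operand, b1, b2 },

static const ILOpcode g_ilOpcodes[] = { SAMPLE_OPDEF_LINES };

#undef OPDEF

static const std::size_t g_ilOpcodeCount = sizeof(g_ilOpcodes) / sizeof(g_ilOpcodes[0]);

// Linear lookup by mnemonic; returns nullptr if the name is not in the table.
const ILOpcode *FindByName(const char *name)
{
    for (std::size_t i = 0; i < g_ilOpcodeCount; ++i)
        if (std::strcmp(g_ilOpcodes[i].name, name) == 0)
            return &g_ilOpcodes[i];
    return nullptr;
}
```

The nice property of this arrangement is that the same OPDEF lines can be expanded more than once with different macro definitions, for example once into an enum of symbols and once into the attribute table.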

 

Given a couple of hours, and a few minor hacks, I had the relevant portions of OPCODE.DEF compiling in C++ as an array of ILOpcode structures. Once I had the array, the rest of the decoding process was a piece of cake. I’d write a bit of code, and try running it against a .NET executable. When I encountered an instruction that I didn’t handle properly, I’d go back to figure out what needed to be implemented next. By iterating on this develop-test-analyze cycle, I had the opcode decoding portion done in a few hours.

 

In the end, I was pleasantly surprised at how much simpler IL is than the x86 instruction set. The biggest simplification is that each IL instruction can have at most one argument (the one pseudo-exception being the SWITCH instruction). Similarly, there are far fewer possible argument sizes than in x86 land. Typically, arguments are one or four bytes long. If you’ve ever had to deal with the x86’s ModR/M and SIB encodings, you know exactly how much more complicated the x86 instructions are.
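To show how little there is to it, here is a sketch of working out an argument's size from its operand kind. The kind names follow the operand types used in the IL specification; the OperandSize function and its signature are hypothetical. Besides the common one- and four-byte cases, there is a two-byte form (InlineVar) and two eight-byte forms (InlineI8 and InlineR). SWITCH is the odd one out: a four-byte count followed by that many four-byte branch offsets.

```cpp
#include <cstddef>
#include <cstdint>

enum class OperandKind
{
    InlineNone,
    ShortInlineVar, ShortInlineI, ShortInlineBrTarget,
    InlineVar,
    InlineI, InlineBrTarget, InlineField, InlineMethod, InlineString,
    InlineTok, InlineType, InlineSig, ShortInlineR,
    InlineI8, InlineR,
    InlineSwitch
};

// Computes how many operand bytes follow the opcode. 'operand' points at the
// first operand byte and 'available' is the number of bytes left in the method
// body. Returns false if the operand would run past the end of the buffer.
bool OperandSize(OperandKind kind, const std::uint8_t *operand,
                 std::size_t available, std::size_t &size)
{
    std::uint64_t needed = 0;
    switch (kind)
    {
    case OperandKind::InlineNone:
        needed = 0;
        break;

    case OperandKind::ShortInlineVar:
    case OperandKind::ShortInlineI:
    case OperandKind::ShortInlineBrTarget:
        needed = 1;
        break;

    case OperandKind::InlineVar:
        needed = 2;
        break;

    case OperandKind::InlineI:
    case OperandKind::InlineBrTarget:
    case OperandKind::InlineField:
    case OperandKind::InlineMethod:
    case OperandKind::InlineString:
    case OperandKind::InlineTok:
    case OperandKind::InlineType:
    case OperandKind::InlineSig:
    case OperandKind::ShortInlineR:   // 32-bit float
        needed = 4;
        break;

    case OperandKind::InlineI8:
    case OperandKind::InlineR:        // 64-bit float
        needed = 8;
        break;

    case OperandKind::InlineSwitch:
    {
        // Little-endian count N, followed by N four-byte branch offsets.
        if (available < 4)
            return false;
        std::uint32_t count = 0;
        for (int i = 0; i < 4; ++i)
            count |= static_cast<std::uint32_t>(operand[i]) << (8 * i);
        needed = 4 + 4 * static_cast<std::uint64_t>(count);
        break;
    }
    }

    if (needed > available)
        return false;
    size = static_cast<std::size_t>(needed);
    return true;
}
```

Compare that single switch statement with what it takes to find the length of an arbitrary x86 instruction, and the difference in complexity is obvious.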

 

One final note on code design: It’s important to remember that something like a disassembler may be used in multiple ways.  In one use, you may just display string output to the console. But it would be bad form to simply printf the opcode names and argument values in the opcode decoding code. What happens if you want to use the disassembler in a GUI program? Suddenly your printf statements don’t work.  In a case like this, I’d make the disassembler return a class instance that has all the relevant attributes of the instruction it decoded. This way, the client of the disassembler can decide how the output should be formatted and displayed. It can also implement other useful features like highlighting branch instructions or decoding IL tokens into meaningful names.
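As a sketch of that separation, consider the following. Everything here is hypothetical: DecodedInstruction, DecodeOne, FormatPlain, and FormatHighlighted are names made up for the example, and the decoder is a stub that knows only four one-byte opcodes (nop, ldc.i4.s, ret, and br.s) to keep things short. A real decoder would be driven by the opcode table instead of a switch statement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// What the decoder hands back: facts about the instruction, no formatting.
struct DecodedInstruction
{
    std::size_t  offset;      // where the instruction starts
    std::size_t  length;      // opcode plus operand bytes
    std::string  mnemonic;
    bool         hasOperand;
    std::int32_t operand;     // meaningful only if hasOperand is true
    bool         isBranch;
};

// Stub decoder (an assumption for this sketch). Returns false for anything
// it does not recognize or that runs off the end of the buffer.
bool DecodeOne(const std::uint8_t *code, std::size_t size, std::size_t offset,
               DecodedInstruction &out)
{
    if (offset >= size)
        return false;

    out = DecodedInstruction{ offset, 1, "", false, 0, false };
    switch (code[offset])
    {
    case 0x00: out.mnemonic = "nop"; return true;
    case 0x2A: out.mnemonic = "ret"; return true;
    case 0x1F: out.mnemonic = "ldc.i4.s"; break;
    case 0x2B: out.mnemonic = "br.s"; out.isBranch = true; break;
    default:   return false;
    }

    // Both remaining opcodes carry a one-byte signed operand.
    if (offset + 1 >= size)
        return false;
    out.length = 2;
    out.hasOperand = true;
    out.operand = static_cast<std::int8_t>(code[offset + 1]);
    return true;
}

// One client: plain text, suitable for a console dump.
std::string FormatPlain(const DecodedInstruction &insn)
{
    char prefix[32];
    std::snprintf(prefix, sizeof(prefix), "IL_%04zx: ", insn.offset);
    std::string text = prefix + insn.mnemonic;
    if (insn.hasOperand)
        text += " " + std::to_string(insn.operand);
    return text;
}

// Another client: same data, but branch instructions get a marker. A GUI
// might use the isBranch flag to pick a color instead.
std::string FormatHighlighted(const DecodedInstruction &insn)
{
    return (insn.isBranch ? "* " : "  ") + FormatPlain(insn);
}
```

Neither formatter needs to know anything about opcode encodings, and the decoder never touches an output device. A console tool would simply loop, calling DecodeOne and advancing by the returned length.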