Thoughts on writing an IL Disassembler


I’m a big fan of writing your own version of tools, even if there’s already an existing tool. Why? Because writing your own tool is absolutely the best way to understand the concepts. You’ll run into all sorts of little gotchas and special case exceptions that force you to really understand what you’re working with.


 


My canonical example is showing the contents of a Win32 (PE) executable. Sure, there are plenty of tools such as DUMPBIN and Russ Osterlund’s fabulous PEBrowse Professional that break a PE file apart for your viewing pleasure. But if you can write an equivalent tool, you’ll be forced to understand all nuances of the file format. That’s why I keep my own PE dumping program up to date.


 


A similar story exists for assembler instructions, be they x86, IA64, or the CLR’s IL. It’s one thing to be able to read them; it’s entirely another to know how to crack apart any instruction. Doing so takes your knowledge to a whole new level.


 


Over ten years ago, I wrote my first disassembler, for the x86 platform. At the time, the 80486 was the top dog chip, and my goal was to handle every valid instruction, even the wacky floating point instructions. That code (which I still use today) is filled with all sorts of code for weird special cases, and isn’t particularly pleasant to read. Thus, you might imagine my trepidation when it came to writing an IL disassembler. I kept pushing it off until I had a real need for it.


 


That time came recently. Sure, I could have looked at the code for existing disassemblers, but I wouldn’t have learned as much. And to my delight, it turned out to be quite a simple task compared to my prior x86 experience.


 


At a high level, writing an IL disassembler has two main tasks:



  • Cracking the headers and exception handling clauses

  • Decoding the instruction stream

In the end, I spent far more time understanding the headers and clauses than I did writing the IL decoding portion. However, decoding the instruction stream is what I’m going to talk about here.


 


I’m a big advocate of data-driven programming. My goal is to write as little code as possible, and embed in data as much of the behavior as I can. Decoding IL instructions is a perfect example of where data-driven programming makes things easier.


 


There are quite a few IL instructions that branch to another address. The destination is embedded as a relative offset that immediately follows the instruction’s opcode. In a data-driven approach, I have a structure that represents all the various attributes of each opcode. One of the attributes is whether the instruction branches or not. In my first stab at the code, the output simply showed the offset value. Now, let’s say I wanted to get fancy and add additional annotations to the output, for instance indicating whether the target is before or after the current instruction. Or I might wish to indicate that the target is the RET instruction at the end of the function. I can add these features in one place in my code and I know that they’ll show up for every branch instruction. In a non-data-driven approach, I’d have to find the code for handling each specific branch instruction and modify it.
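To make the idea concrete, here is a minimal sketch of what such a table might look like. Everything in it is an assumption made for illustration: the OpcodeInfo structure, the Lookup and Describe helpers, and the five-entry table are hypothetical and are not taken from any real disassembler. The opcode byte values are the standard single-byte IL encodings, and the branch offset is treated as relative to the start of the next instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical per-opcode attributes (field names are illustrative only).
struct OpcodeInfo {
    std::uint8_t opcode;       // single-byte opcode value
    const char*  name;         // mnemonic
    int          operandSize;  // operand bytes following the opcode
    bool         isBranch;     // operand is a relative branch offset
};

// A tiny subset of the IL opcodes, just enough to show the pattern.
static const OpcodeInfo kOpcodes[] = {
    {0x00, "nop",       0, false},
    {0x2A, "ret",       0, false},
    {0x2B, "br.s",      1, true},
    {0x2C, "brfalse.s", 1, true},
    {0x38, "br",        4, true},
};

const OpcodeInfo* Lookup(std::uint8_t opcode) {
    for (const OpcodeInfo& info : kOpcodes)
        if (info.opcode == opcode) return &info;
    return nullptr;
}

// Describes the instruction at 'offset'. All branch annotation lives here,
// so every opcode flagged isBranch in the table picks it up automatically.
std::string Describe(const std::uint8_t* code, std::size_t size, std::size_t offset) {
    if (offset >= size) return "<out of range>";
    const OpcodeInfo* info = Lookup(code[offset]);
    if (info == nullptr) return "<unknown>";
    const std::size_t next = offset + 1 + static_cast<std::size_t>(info->operandSize);
    if (next > size) return "<truncated>";

    std::string text = info->name;
    if (!info->isBranch) return text;

    // Read the signed, little-endian relative offset (1 or 4 bytes).
    const std::uint8_t* p = code + offset + 1;
    std::int32_t rel;
    if (info->operandSize == 1) {
        rel = static_cast<std::int8_t>(p[0]);
    } else {
        const std::uint32_t raw = static_cast<std::uint32_t>(p[0])
                                | (static_cast<std::uint32_t>(p[1]) << 8)
                                | (static_cast<std::uint32_t>(p[2]) << 16)
                                | (static_cast<std::uint32_t>(p[3]) << 24);
        rel = static_cast<std::int32_t>(raw);
    }
    // The target is measured from the start of the next instruction.
    const long long target = static_cast<long long>(next) + rel;

    text += " " + std::to_string(rel);
    text += (target <= static_cast<long long>(offset)) ? " (backward" : " (forward";
    if (target == static_cast<long long>(size) - 1 && code[target] == 0x2A)
        text += ", to final ret";
    text += ")";
    return text;
}
```

Supporting another branch instruction in this sketch means adding one row to the table; the annotation code does not change.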


 


I included this little digression on data driven programming to help show my delight when I discovered the OPCODE.DEF file in the .NET Framework SDK. If you’re at all curious, go look at it. You’ll see a bunch of lines that begin with “OPDEF”, such as:


 


OPDEF(CEE_LDARG_1, "ldarg.1", Pop0, Push1, InlineNone, // Rest of line omitted


 


It doesn’t take too long to see the similarity to an initialized C++ array of structures. For instance, consider this very much simplified example:


 


struct S
{
    int          m_int;
    const char * m_str;   // string literals are const in C++
};

S MyArray[] =
{
    {1, "hello"},
    {2, "goodbye"},
    {3, "Ciao!"},
};


 


In the OPCODE.DEF file, all the OPDEF lines sure look like they’re initializing a structure in an array (albeit a structure with many more fields than my simplified example above).


 


What you won’t find in OPCODE.DEF is any sort of structure definition or meaning for OPDEF. However, crafty programmers that we are, it’s easy enough to examine the OPDEF lines and figure out how to declare our own structure (say, “struct ILOpcode”). Next, it’s not hard to write an OPDEF macro that initializes one ILOpcode instance.
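As a rough sketch of the technique, the following shows one way an OPDEF macro and an ILOpcode structure could fit together. It rests on several assumptions: the SAMPLE_OPCODES list is a stand-in for OPCODE.DEF so the example is self-contained, it carries only the five fields visible in the line quoted above (the real file has more, which are deliberately not reproduced here), and the enum types and the FindByName helper are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for OPCODE.DEF (an assumption for this sketch). With the real
// file, the same effect would come from defining OPDEF and then writing
// #include "opcode.def" at the point where the list is expanded.
#define SAMPLE_OPCODES(OPDEF)                                          \
    OPDEF(CEE_NOP,     "nop",     Pop0, Push0, InlineNone)             \
    OPDEF(CEE_LDARG_1, "ldarg.1", Pop0, Push1, InlineNone)             \
    OPDEF(CEE_BR_S,    "br.s",    Pop0, Push0, ShortInlineBrTarget)

// The symbolic values used by the OPDEF lines need definitions of our own.
enum StackPop    { Pop0, Pop1 };
enum StackPush   { Push0, Push1 };
enum OperandKind { InlineNone, ShortInlineBrTarget };

// First expansion: each OPDEF line contributes one enumerator.
#define OPDEF(id, name, pop, push, operand) id,
enum OpcodeId { SAMPLE_OPCODES(OPDEF) CEE_COUNT };
#undef OPDEF

// Hypothetical structure matching the fields of one OPDEF line.
struct ILOpcode {
    OpcodeId     id;
    const char*  name;
    StackPop     pop;
    StackPush    push;
    OperandKind  operand;
};

// Second expansion: each OPDEF line initializes one ILOpcode instance.
#define OPDEF(id, name, pop, push, operand) { id, name, pop, push, operand },
static const ILOpcode g_opcodes[] = { SAMPLE_OPCODES(OPDEF) };
#undef OPDEF

// Linear search by mnemonic; returns nullptr when the name is not present.
const ILOpcode* FindByName(const char* name) {
    for (std::size_t i = 0; i < sizeof(g_opcodes) / sizeof(g_opcodes[0]); ++i)
        if (std::strcmp(g_opcodes[i].name, name) == 0) return &g_opcodes[i];
    return nullptr;
}
```

Because the enum and the array are generated from the same list, the enumerator value doubles as the array index, so g_opcodes[CEE_LDARG_1] is the entry for ldarg.1.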


 


Given a couple of hours, and a few minor hacks, I had the relevant portions of OPCODE.DEF compiling in C++ as an array of ILOpcode structures. Once I had the array, the rest of the decoding process was a piece of cake. I’d write a bit of code, and try running it against a .NET executable. When I encountered an instruction that I didn’t handle properly, I’d go back to figure out what needed to be implemented next. By iterating on this develop, test, analyze cycle, I had the opcode decoding portion done in a few hours.


 


In the end, I was pleasantly surprised at how much simpler IL is than the x86 instruction set. The biggest simplification is that each IL instruction can have at most one argument (the one pseudo-exception being the SWITCH instruction). Similarly, there are far fewer possible argument sizes than in x86 land. Typically, arguments are one or four bytes long. If you’ve ever had to deal with the x86’s ModR/M and SIB encodings, you know exactly how much more complicated the x86 instructions are.
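The point about argument sizes can be sketched as a single function that maps an operand kind to a byte count. This is an illustrative sketch, not the code described above: the OperandSize function and its signature are hypothetical, and the enum lists only a subset of the operand kinds (the other metadata-token kinds are also four bytes). The SWITCH case shows why it is a pseudo-exception: its operand is a four-byte count followed by that many four-byte branch offsets.

```cpp
#include <cstddef>
#include <cstdint>

// A subset of the IL operand kinds, enough to show every distinct size.
enum OperandKind {
    InlineNone, ShortInlineVar, ShortInlineI, ShortInlineBrTarget,
    InlineVar, InlineI, InlineBrTarget, InlineMethod, InlineString,
    ShortInlineR, InlineI8, InlineR, InlineSwitch
};

// Computes the operand size in bytes for 'kind'. 'operand' points at the
// first operand byte and 'available' is how many bytes remain in the method
// body. Returns false if the operand would run past the end of the code.
bool OperandSize(OperandKind kind, const std::uint8_t* operand,
                 std::size_t available, std::size_t& size) {
    switch (kind) {
    case InlineNone:
        size = 0;
        break;
    case ShortInlineVar:
    case ShortInlineI:
    case ShortInlineBrTarget:
        size = 1;
        break;
    case InlineVar:
        size = 2;
        break;
    case InlineI:
    case InlineBrTarget:
    case InlineMethod:   // metadata token
    case InlineString:   // metadata token
    case ShortInlineR:   // 32-bit float
        size = 4;
        break;
    case InlineI8:
    case InlineR:        // 64-bit float
        size = 8;
        break;
    case InlineSwitch: {
        // Layout: 4-byte little-endian count, then 'count' 4-byte offsets.
        if (available < 4) return false;
        const std::uint32_t count = static_cast<std::uint32_t>(operand[0])
                                  | (static_cast<std::uint32_t>(operand[1]) << 8)
                                  | (static_cast<std::uint32_t>(operand[2]) << 16)
                                  | (static_cast<std::uint32_t>(operand[3]) << 24);
        if (count > (available - 4) / 4) return false;
        size = 4 + 4 * static_cast<std::size_t>(count);
        break;
    }
    default:
        return false;
    }
    return size <= available;
}
```

With the operand kind stored in the opcode table, the length of any instruction is the opcode length plus whatever this function reports, which is a far cry from walking x86 prefix, ModR/M, and SIB bytes.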


 


One final note on code design: It’s important to remember that something like a disassembler may be used in multiple ways.  In one use, you may just display string output to the console. But it would be bad form to simply printf the opcode names and argument values in the opcode decoding code. What happens if you want to use the disassembler in a GUI program? Suddenly your printf statements don’t work.  In a case like this, I’d make the disassembler return a class instance that has all the relevant attributes of the instruction it decoded. This way, the client of the disassembler can decide how the output should be formatted and displayed. It can also implement other useful features like highlighting branch instructions or decoding IL tokens into meaningful names.
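A minimal sketch of that separation might look like the following. All of the names here (DecodedInstruction, DecodeOne, FormatForConsole, FormatForMarkup) are hypothetical, the decoder handles only four opcodes to keep the example short, and the output formats are arbitrary choices made by the two example clients rather than anything the decoder dictates.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// What the decoder hands back: facts about the instruction, no formatting.
struct DecodedInstruction {
    std::size_t  offset = 0;       // where the instruction starts
    std::size_t  length = 0;       // opcode plus operand bytes
    std::string  mnemonic;
    bool         hasOperand = false;
    std::int32_t operand = 0;
    bool         isBranch = false;
};

// Toy decoder: knows nop, ret, ldc.i4.s and br.s only. Returns false for an
// unknown opcode or a truncated instruction.
bool DecodeOne(const std::uint8_t* code, std::size_t size, std::size_t offset,
               DecodedInstruction& out) {
    if (offset >= size) return false;
    out = DecodedInstruction();
    out.offset = offset;
    switch (code[offset]) {
    case 0x00: out.mnemonic = "nop"; break;
    case 0x2A: out.mnemonic = "ret"; break;
    case 0x1F: out.mnemonic = "ldc.i4.s"; out.hasOperand = true; break;
    case 0x2B: out.mnemonic = "br.s"; out.hasOperand = true; out.isBranch = true; break;
    default:   return false;
    }
    out.length = 1;
    if (out.hasOperand) {
        if (offset + 1 >= size) return false;
        out.operand = static_cast<std::int8_t>(code[offset + 1]);  // signed byte
        out.length = 2;
    }
    return true;
}

// Client 1: plain text for a console listing.
std::string FormatForConsole(const DecodedInstruction& insn) {
    char prefix[32];
    std::snprintf(prefix, sizeof prefix, "IL_%04x: ",
                  static_cast<unsigned>(insn.offset));
    std::string text = prefix + insn.mnemonic;
    if (insn.hasOperand) text += " " + std::to_string(insn.operand);
    return text;
}

// Client 2: markup for a GUI view that highlights branch instructions.
std::string FormatForMarkup(const DecodedInstruction& insn) {
    return insn.isBranch ? "<b>" + insn.mnemonic + "</b>" : insn.mnemonic;
}
```

Neither formatter touches the raw bytes, and the decoder never calls printf, so a new front end only needs a new formatting function.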


Comments (9)

  1. Nick Parker says:

    Will you be posting this your version of the IL Disassembler?

  2. Matt Pietrek says:

    Alas, I can’t post it. It’s for a project I’m working on internally.

  3. which brings up a good point (your previous comment)…now that you work for MS, how much of the cool internal "hush hush" stuff (that you used to write about so freely) can you actually write about now?

  4. Matt Pietrek says:

    Jayson,

    If you look at a lot of my writings, the majority of it covered documented topics, albeit things that weren’t *well* documented. Quite often, my writing merely cast a brighter light into some of the darker corners of Windows. Three examples that come to mind are:

    1) The PE format

    2) Vectored Exception Handling

    3) DbgHelp features

    Obviously I can’t talk about anything that’s still Microsoft NDA material. And not many people know this, but almost all of my writing (outside of my books) was for MSJ/MSDN magazine, which were editorially controlled by Microsoft. All in all, I don’t feel any more constrained than I did before joining Microsoft.

  5. Roshan James says:

    I am nowhere in comparison, but I still wanted to point you at my own little PE viewer tool.

    The difference from yours being mainly that it can display some .net structures and (more importantly) the metadata tables (in table form).

    http://www27.brinkster.com/sparksite/work/trashbin/trashbin.htm

  6. Tom Guinther says:

    I’ve been a Matt Pietrek fan since he was a newbie member of Borland’s CompuServe Forum support team, so I am really glad to see that he is finally blogging. Anyway, Matt and I discussed an IL Disassembler I had written over margaritas (at Margaritas in Nashua.) He was surprised that I claimed to have written it in about an hour. He won’t remember the conversation because he was, shall we say, "distracted" by the pleasant scenery and ambiance.

    Anyway, the IL Disassembler I wrote is sitting on a backup DVD somewhere in my code collection. Maybe since Matt can’t release his version he should host mine so that people can get an example of what he is talking about. I have no interest in the code; it was part of a personal project I won’t get around to finishing.

    It uses basically the same techniques and ideas Matt describes with smart formatting for exception/fault/finally handlers. It is driven by a metadata engine that I thought I did a pretty good job designing (at least at the time I liked it!) I would have to find the code and add a public domain license, then Matt could look at it, tweak it, and publish it.

    Of course, Matt has better things to do with his time so you’ll probably have to bug him a bit.

    Tom

  7. someone says:

    sure i for one would like to play around with your tool
