The world is a better place if you generate verifiable IL

If you are writing a compiler that targets IL or just emitting IL, you may find this an interesting read:

 

The JIT compiler will always try to generate code, even if the IL is bad. From the JIT’s point of view, IL code falls in 3 categories:

 

1) Verifiable IL. Most of the code can be verifiable. You can look for bugs in it using offline tools (PEVerify)

2) Non verifiable but correct IL

3) Non verifiable and incorrect IL

 

It is difficult for us to tell the difference between 2) and 3). Whenever we see something that looks really wrong (bad EH regions, IL stack imbalances, etc.) we bail out by throwing an InvalidProgramException. The JIT could probably do a better job here by being stricter on things that are obviously wrong, but even if we did, bad IL would still get through (we can’t prove it’s incorrect for all cases). Historically, it has been difficult to be in the 2) camp and stay in good shape: I’ve had to debug many more bugs coming from people that generate unverifiable code than from, let’s say, the C# compiler, which generates verifiable code (unless you use the unsafe keyword).

 

The problem is that it is very easy to go from 2) to 3). What can happen when you are in 3)?

 

- If you get very lucky, you will immediately crash, with some clear indication of what method had a problem, you will inspect the IL and find the problem.

- If you are less lucky, you will break some CLR invariant and crash in a way that makes it hard to figure out what went wrong.

- If you are unlucky, you will have a bug that won’t reproduce deterministically, that will behave differently on different machines (especially on your customers’ machines), and that will need debugging by somebody with expertise in CLR internals. Trust me, you don’t want to be here.

 

What’s my recommendation? It’s very likely that 100% of your code could be in camp 1). If not, try to get close to 100%. Once you’re there, separate verifiable from non verifiable code into different assemblies and make sure the code you think is verifiable passes verification. You can do this by saving your assembly to disk and running PEVerify on it, or by refusing the SkipVerification permission, in which case the JIT compiler will verify your methods as you go and will throw a nice, easy-to-debug exception when it encounters code that isn’t verifiable.

 

Here is an example of how easy and how non obvious it is to get into really hard to debug problems. The CLR Garbage Collector performs all its work on a huge dynamic graph of objects. To put it in a simple form, the GC follows pointers from one object to the other in order to obtain the set of objects that are currently alive. For example, say you have a live object of the following class:

 

class Foo
{
    int i;
    Foo next;
}

 

To see what other live objects this one is keeping alive, the GC has to walk the object looking for pointers. To do that, the GC obtains the layout of the object and follows the managed pointers, in this case, ‘next’.

 

Now, what happens if ‘next’ is a bad pointer? Most likely you will crash. Why? To figure out what objects ‘next’ is keeping alive, we need to know what type ‘next’ is. The GC figures this out by looking at the method table of ‘next’, which is, let’s say, at *( (MethodTable**) next). Now, if ‘next’ is a bad pointer, you are basically reading from a random location. If you get lucky you will crash right there; if not, more dangerous things can happen: heap corruption, type safety violations, etc. This class of bug is what we call here a ‘GC Hole’.
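To make that walk a bit more concrete, here is a toy sketch in C++. It is not how the CLR is actually implemented: the MethodTable struct, its refOffsets field, and the Mark and CountLiveFromRoot functions are all made up for illustration, and the only thing taken from the discussion above is the assumption that the first pointer-sized word of an object leads to its type information.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unordered_set>
#include <vector>

// Hypothetical, heavily simplified stand-in for a method table: it only
// records where the object references live inside an instance.
struct MethodTable {
    std::vector<std::size_t> refOffsets;  // byte offsets of reference fields
};

// Toy version of the Foo class. The first word of every object is assumed
// to be its MethodTable pointer.
struct Foo {
    MethodTable* mt;
    int i;
    Foo* next;
};

// Marks everything reachable from obj. The type is recovered by reading the
// first pointer-sized word of the object, i.e. *((MethodTable**) obj).
inline void Mark(void* obj, std::unordered_set<void*>& live) {
    if (obj == nullptr || !live.insert(obj).second) {
        return;  // null reference, or object already visited
    }
    MethodTable* mt = nullptr;
    std::memcpy(&mt, obj, sizeof mt);
    // If obj were a bad pointer, mt would be garbage from here on. This
    // check only catches the null case; a real bad pointer reads junk.
    assert(mt != nullptr);
    for (std::size_t offset : mt->refOffsets) {
        void* child = nullptr;
        std::memcpy(&child, static_cast<char*>(obj) + offset, sizeof child);
        Mark(child, live);
    }
}

// Builds a -> b -> c plus one object nobody references, then marks from a.
inline std::size_t CountLiveFromRoot() {
    static MethodTable fooTable{{offsetof(Foo, next)}};
    Foo c{&fooTable, 3, nullptr};
    Foo b{&fooTable, 2, &c};
    Foo a{&fooTable, 1, &b};
    Foo unreachable{&fooTable, 4, &a};  // points at a, but is not a root
    (void)unreachable;
    std::unordered_set<void*> live;
    Mark(&a, live);
    return live.size();
}
```

Calling CountLiveFromRoot() from a main function finds a, b and c and never visits the unreachable object. The interesting part is the first memcpy in Mark: nothing validates obj before its first word is trusted as a type description, which is exactly why a bad reference is so dangerous.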

 

How can you get into this mess by emitting IL? Let’s write a silly example (to the right of each instruction I show what’s on the IL stack after the instruction executes) that will show you that not even inspecting the x86 assembly we generate from your IL is enough proof that you are correct (obviously, for simplicity, this is not real code. Real code can be more subtle and more innocent looking). Let’s hope it can convince you to only generate verifiable code ;)

 

 

newobj instance void Test::.ctor() ( Test*)

castclass Test ( Test*)

ret ()

 

This would generate the following code:

 

IN0001: mov ECX, 0x2cb0dd4 (GC regs: - )

IN0002: call CORINFO_HELP_NEWSFAST (GC regs: EAX)

IN0003: mov EDX, EAX (GC regs: EDX)

IN0004: mov ECX, 0x2cb0dd4 (GC regs: EDX)

IN0005: call CORINFO_HELP_CHKCASTCLASS (GC regs: EAX)

 

This code is fine. Note the ‘GC regs’ on each line. The JIT emits GC information, which tells the GC which registers hold GC pointers at every instruction (again, we don’t always do this, but let’s say we do for this discussion). So basically, after instruction 2 EAX is holding a GC reference (the Test object we just created), after instructions 3 and 4 EDX is holding it, and after instruction 5 the result of the cast is in EAX.

 

Now, let’s introduce a small change (that will make our code unverifiable)

 

newobj instance void Test::.ctor() ( Test*)

conv.i ( I )

castclass Test ( Test*)

ret ()

 

 

This yields the following code:

 

IN0001: mov ECX, 0x2cb0dd4 (GC regs: - )

IN0002: call CORINFO_HELP_NEWSFAST (GC regs: EAX)

IN0003: mov EDX, EAX (GC regs: None!)

IN0004: mov ECX, 0x2cb0dd4 (GC regs: None!)

IN0005: call CORINFO_HELP_CHKCASTCLASS (GC regs: EAX)

 

 

What is the difference? The generated code is exactly the same, but the GC info isn’t. In our second version, after the 3rd instruction the value that came back in EAX (and now sits in EDX) is no longer reported as a GC pointer, because conv.i told us to treat it as a pointer-sized integer. What does this mean to us? It means that if a GC happens while we are at instruction 3 or 4, the GC won’t think there is an object in EDX, which is the only reference we have to our newly created object, so it will reclaim its memory and possibly reuse it. So now the GC returns control to the code and we call into CORINFO_HELP_CHKCASTCLASS with a bogus pointer, oops! And what’s worse, you will only see the bug if a GC happens in that two-instruction window. Combine the above with multithreaded code (let’s say you are running in ASP.NET) and you have the perfect GC hole, which will cause unexplained crashes that you’ll only see on production machines or while demoing your product.
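If you want to play with the idea, here is a small C++ sketch that models the two listings as tables and looks for the window where a GC would miss the object. It is a toy under stated assumptions: InstructionInfo, FindGcHoles, VerifiableVersion and ConvIVersion are names invented for this illustration, and the "register the object is read from next" column is my reading of the listings, not something the JIT emits.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One entry per instruction: the register the code will read the object
// from next, and the registers the toy GC info reports as holding
// references once that instruction has executed.
struct InstructionInfo {
    int number;
    std::string liveIn;
    std::set<std::string> gcRegs;
};

// Returns the instructions after which a GC would not see the object,
// because the register that holds it is not reported in the GC info.
inline std::vector<int> FindGcHoles(const std::vector<InstructionInfo>& method) {
    std::vector<int> holes;
    for (const InstructionInfo& ins : method) {
        assert(!ins.liveIn.empty());
        if (ins.gcRegs.count(ins.liveIn) == 0) {
            holes.push_back(ins.number);
        }
    }
    return holes;
}

// GC info of the first listing (newobj, castclass, ret).
inline std::vector<InstructionInfo> VerifiableVersion() {
    return {{2, "EAX", {"EAX"}},
            {3, "EDX", {"EDX"}},
            {4, "EDX", {"EDX"}},
            {5, "EAX", {"EAX"}}};
}

// GC info of the second listing, where conv.i hides the reference.
inline std::vector<InstructionInfo> ConvIVersion() {
    return {{2, "EAX", {"EAX"}},
            {3, "EDX", {}},
            {4, "EDX", {}},
            {5, "EAX", {"EAX"}}};
}
```

FindGcHoles(VerifiableVersion()) comes back empty, while FindGcHoles(ConvIVersion()) reports instructions 3 and 4. Both tables describe identical machine code, so only the GC info tells them apart.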

 

[Edit: Fixed fonts and a typo]