MC++ / IJW codegen for Native structures in IL

09/07/2005

When MC++ compiles an "unmanaged" class into IL (via IJW), it actually emits the class as an opaque blob and then uses pointer arithmetic to access the fields. The resulting code is not verifiable, but it's still IL. This is exactly the codegen you'd get if the class were compiled to native code.

Consider this snippet that declares a "native" C++ class, which we will then compile to IL.

// A "native" class. Not allocated on GC heap.
class Point
{
public:
    Point(int x, int y)
    {
        m_x = x;
        m_y = y;
    }

    int m_x;
    int m_y;
    int m_z;
};

That compiles to this IL code:

.class private sequential ansi sealed beforefieldinit Test.Point
       extends [mscorlib]System.ValueType
{
  .pack 0
  .size 12
  .custom instance void [Microsoft.VisualC]Microsoft.VisualC.MiscellaneousBitsAttribute::.ctor(int32) = ( 01 00 40 00 00 00 00 00 )                         // ..@.....
  .custom instance void [Microsoft.VisualC]Microsoft.VisualC.DebugInfoInPDBAttribute::.ctor() = ( 01 00 00 00 ) 
} // end of class Test.Point

You'll notice the source-level fields m_x, m_y, m_z are not actually declared here. Instead, there's a ".size 12" directive which will tell the CLR that this managed type is an opaque blob of 12 bytes.
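To make the "opaque blob" idea concrete, here is a minimal sketch in plain standard C++ (not MC++) of the layout facts the compiler already holds at compile time. The constant names (kPointSize, kOffsetX, and so on) are made up for this illustration; they are not something the compiler emits.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Same shape as the native class above.
class Point
{
public:
    Point(int x, int y)
    {
        m_x = x;
        m_y = y;
    }

    int m_x;
    int m_y;
    int m_z;
};

// Layout facts a C++ compiler has in hand at compile time.
// These names are hypothetical and exist only for this sketch.
constexpr std::size_t kPointSize = sizeof(Point);
constexpr std::size_t kOffsetX   = offsetof(Point, m_x);
constexpr std::size_t kOffsetY   = offsetof(Point, m_y);
constexpr std::size_t kOffsetZ   = offsetof(Point, m_z);
```

Assuming a 4-byte int and no padding (true of mainstream compilers, but still an assumption), the size works out to 12, which is the number in the ".size 12" directive. The size is the only one of these facts that makes it into the type's metadata.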

Now look at what happens when you actually access a field:

// Allocate the native class
Point * p1 = new Point(3, 4);
Console::WriteLine(p1->m_y);

This becomes the following IL:

//000128:      // Allocate the native class
//000129:     Point * p1 = new Point(3, 4);
  IL_0006:  ldc.i4.s   12
  IL_0008:  call       void* modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) new(uint32)
  IL_000d:  stloc.3
  .try
  {
    IL_000e:  ldloc.3
    IL_000f:  brfalse.s  IL_001b
    IL_0011:  ldloc.3
    IL_0012:  ldc.i4.3
    IL_0013:  ldc.i4.4
    IL_0014:  call       valuetype Test.Point* modopt([mscorlib]System.Runtime.CompilerServices.CallConvThiscall) 'Test.Point.{ctor} '(valuetype Test.Point* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst),
                                                                                                                                      int32,
                                                                                                                                      int32)
    IL_0019:  br.s       IL_001c
    IL_001b:  ldc.i4.0
    IL_001c:  stloc.s    V_12
    IL_001e:  leave.s    IL_0027
  }  // end .try
  fault
  {
    IL_0020:  ldloc.3
    IL_0021:  call       void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) delete(void*)
    IL_0026:  endfinally
  }  // end handler
  IL_0027:  ldloc.s    V_12
  IL_0029:  stloc.s    p1
//000130:     Console::WriteLine(p1->m_y); 
  IL_002b:  ldloc.s    p1
  IL_002d:  ldc.i4.4   // 4 is the offset of field m_y 
  IL_002e:  add
  IL_002f:  ldind.i4   // equivalent to: *(p1 + offsetof(Point, m_y))
  IL_0030:  call       void [mscorlib]System.Console::WriteLine(int32)

This has two parts. The first source line, which allocates the new Point() object on the unmanaged heap, generates a lot of IL. You'll notice it first calls new() to allocate the unmanaged memory (IL_0008), and then invokes the ctor (IL_0014). Conceptually, this is exactly what we'd expect from native code generation. But what's all the rest of it, the .try / fault block wrapped around the ctor call? It's just an exception backstop to ensure we delete the raw memory if an exception is thrown while the ctor runs.
The second source line, which accesses the field (p1->m_y), is exactly what we expect. It finds the field by 'this ptr' plus offset-to-field, and then dereferences. That's exactly what the native code would do.
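For comparison, here is roughly what those two source lines look like when expanded by hand in plain standard C++. This is a sketch under my own assumptions, not the compiler's actual lowering: AllocatePoint and LoadY are hypothetical helper names, and the standard ::operator new throws on failure rather than returning null, so the null check at IL_000f has no counterpart here.

```cpp
#include <cstddef>   // offsetof
#include <new>       // ::operator new, ::operator delete, placement new

class Point
{
public:
    Point(int x, int y)
    {
        m_x = x;
        m_y = y;
    }

    int m_x;
    int m_y;
    int m_z;
};

// Hand expansion of: new Point(x, y)
Point* AllocatePoint(int x, int y)
{
    // Step 1: grab raw memory (compare IL_0008, new(12)).
    void* raw = ::operator new(sizeof(Point));
    try
    {
        // Step 2: run the ctor on that memory (compare IL_0014).
        return ::new (raw) Point(x, y);
    }
    catch (...)
    {
        // Backstop: free the raw memory, then let the exception
        // continue (compare the fault handler at IL_0020).
        ::operator delete(raw);
        throw;
    }
}

// Hand expansion of: p->m_y
int LoadY(const Point* p)
{
    // 'this ptr' plus offset-to-field, then dereference
    // (compare IL_002b through IL_002f).
    const char* base = reinterpret_cast<const char*>(p);
    return *reinterpret_cast<const int*>(base + offsetof(Point, m_y));
}
```

Calling AllocatePoint(3, 4) and passing the result to LoadY gives back the same value as p->m_y. The catch-and-rethrow is the closest standard C++ analog to an IL fault handler, which likewise runs only when an exception is in flight.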

What about managed classes?

Now consider if we declare it as a "managed" class (I'm using the Whidbey C++ syntax), which is done with the "ref" keyword:

// A "managed" class, must be allocated on GC heap.
ref class ManagedPoint
{
public:
    ManagedPoint(int x, int y)
    {
        m_x = x;
        m_y = y;
    }

    int m_x;
    int m_y;
    int m_z;
};

That compiles to this IL code:

.class /*02000002*/ private auto ansi beforefieldinit Test.ManagedPoint
       extends [mscorlib/*23000001*/]System.Object/*01000011*/
{
  .field public int32 m_x
  .field public int32 m_y
  .field public int32 m_z
  .method /*06000060*/ public hidebysig specialname rtspecialname 
          instance void  .ctor(int32 x,
                               int32 y) cil managed
  {
    // Code size       21 (0x15)
    // ... ctor code omitted ...
  } // end of method ManagedPoint::.ctor

} // end of class Test.ManagedPoint

You can see the explicit field declarations. Now that looks more like a managed class! And again, given the code to access the field:

// Allocate a similar class on the GC heap.
ManagedPoint ^ p2 = gcnew ManagedPoint(5,6);
Console::WriteLine(p2->m_y);

Here's the IL:

//000132:     // Allocate a similar class on the GC heap.
//000133:     ManagedPoint ^ p2 = gcnew ManagedPoint(5,6);
  IL_0035:  ldc.i4.5
  IL_0036:  ldc.i4.6
  IL_0037:  newobj     instance void Test.ManagedPoint::.ctor(int32, int32)
  IL_003c:  stloc.s    p2
//000134:     Console::WriteLine(p2->m_y);
  IL_003e:  ldloc.s    p2
  IL_0040:  ldfld      int32 Test.ManagedPoint::m_y
  IL_0045:  call       void [mscorlib]System.Console::WriteLine(int32)

This looks just like the IL from C# code. The object is allocated on the GC heap with the "newobj" instruction. Unlike the IL for the "native" class, no backstop is needed because the GC handles all the memory. The field is accessed through the "ldfld" (load field) instruction instead of through pointer math. The compiler does not know the offset of the field or the size of the ManagedPoint class; it refers to the field by name and leaves the layout to the CLR.

Some conclusions
It's interesting to print both of these classes from a managed-only debugger like MDbg.

[p#:0, t#:0] mdbg> print p1     <-- the native class, an opaque blob
p1=Test.Point <Test.Point>
[p#:0, t#:0] mdbg> print p2     <-- the managed class, a real class
p2=Test.ManagedPoint
        m_x=5
        m_y=6
        m_z=0

You'll notice that MDbg can print the managed class as you'd expect. However, since MDbg uses metadata to find the fields, and the native class is just an opaque blob, it can't actually print the fields. (MDbg should be able to print the size of the native blob; that's just a missing feature.) VS is smarter and can actually print the fields because it uses more than just the metadata. (I wanted to drill into the pdb more with my pdb2xml reader, but that doesn't seem to work on MC++ dlls. Another investigation for me.)
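Here is a toy model, in plain standard C++, of why a metadata-driven tool ends up in that position. Everything in it is hypothetical: FieldInfo, TypeInfo, and Describe are names I made up, the offsets given for ManagedPoint are invented for the demo (the real layout is the CLR's business), and none of this is how MDbg is actually implemented. It only shows that a printer which walks declared fields has nothing to walk when the type declares none.

```cpp
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for type metadata: a name, a size,
// and whatever fields the type declares (possibly none).
struct FieldInfo
{
    std::string name;
    std::size_t offset;
};

struct TypeInfo
{
    std::string name;
    std::size_t size;
    std::vector<FieldInfo> fields;
};

// Print a value using only the metadata. Every field is assumed
// to be an int, which is enough for this sketch.
std::string Describe(const TypeInfo& type, const void* data)
{
    std::ostringstream out;
    out << type.name;
    if (type.fields.empty())
    {
        // No declared fields: all we can report is the blob size.
        out << " <opaque, " << type.size << " bytes>";
    }
    for (const FieldInfo& f : type.fields)
    {
        int value;
        std::memcpy(&value, static_cast<const char*>(data) + f.offset, sizeof value);
        out << "\n  " << f.name << "=" << value;
    }
    return out.str();
}
```

Given the same three ints in memory, a TypeInfo with an empty field list yields just the type name and size, while one that lists m_x, m_y, and m_z yields the per-field dump. A tool that also reads the pdb, as VS does, has a second source for the field names and offsets.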

One takeaway here is that language compilers can have a loose mapping between types in a language and types at the IL level.
On the down side, this decreases interoperability, and the resulting code is not verifiably safe. While unmanaged C++ code can access such fields, C# can't import the native classes since they're just opaque blobs. Or more specifically, C# can pass the opaque blob around; it just can't do anything useful with it.
On the plus side, this gives the compiler full control. It also paves the way for the compiler to do things outside the CLR's common type system, such as multiple inheritance.
