C# and Fastcall – How to make them work together without C++/CLI (Shellcode!)

I recently ran into a situation where code had to meet the following requirements:

  1. C# language only
  2. No dependence on VC++ runtimes
  3. Support COM Interop with no ole32
  4. Support delegates for Fastcall calling convention callback functions


The first three were no big deal in and of themselves.  The fourth, however, introduced a big problem.  The CLR natively supports several different calling conventions for functions and delegates used in native interop.  It defaults to stdcall, which is what the Windows API typically uses.  If I needed to make a delegate use cdecl instead, I could simply place an attribute above the delegate like this:


[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
public delegate void ExampleDelegate();
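To show where such a declaration ends up being used, here is a minimal sketch of turning a cdecl delegate into a function pointer that could be handed to native code.  The names CdeclCallbackSketch, CompareCallback, and GetCallbackPointer are made up for illustration; only UnmanagedFunctionPointer and Marshal.GetFunctionPointerForDelegate come from the framework.

```csharp
using System;
using System.Runtime.InteropServices;

static class CdeclCallbackSketch
{
    // Hypothetical callback signature, declared cdecl instead of the stdcall default.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate int CompareCallback(IntPtr left, IntPtr right);

    // Keep the delegate in a field for as long as native code may call it;
    // the function pointer alone does not keep the delegate alive.
    private static readonly CompareCallback keepAlive = Compare;

    private static int Compare(IntPtr left, IntPtr right)
    {
        return left.ToInt64().CompareTo(right.ToInt64());
    }

    public static IntPtr GetCallbackPointer()
    {
        // The returned address is what you would pass to the native API.
        return Marshal.GetFunctionPointerForDelegate(keepAlive);
    }
}
```

Holding the delegate in a static field is one simple way to deal with the lifetime problem; a member of a long-lived object works just as well.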

The CallingConvention enumeration has five members: Winapi, Cdecl, StdCall, ThisCall, and FastCall.  I'm not going to go over the differences between them, since odds are that if you're writing code in C# you don't care beyond knowing which one you should be using.  If you enter the FastCall value in Visual Studio, you'll probably notice the comment that goes along with it: "This calling convention is not supported."  For the other four, you can simply use Marshal.GetFunctionPointerForDelegate and do interop with no issues.  That won't work for FastCall.  If you try to use it anyway, it will compile, it will run, and it will probably produce bad results.  What typically happens is that your delegate is called but none of the parameters are what you expect them to be; if you declared them as IntPtr rather than as references to structure types, that won't cause a crash in and of itself.  However, since the function calling your delegate expects the stack to be in a certain state afterwards, it is highly likely to cause an access violation, which surfaces as the .NET exception "Attempted to read or write protected memory."

Normally, the easy way around this is to use C++/CLI: create a native fastcall function that holds a function pointer for a delegate, and use it to call a managed function in a convention that the Common Language Runtime can understand.  Here's a simple example of doing that for the ExampleDelegate shown above:


#include "stdafx.h"

using namespace System;

#pragma unmanaged

typedef void(__stdcall *PExampleDelegateManaged)();

PExampleDelegateManaged managedDelegate;

//Native fastcall entry point that forwards to the managed delegate through a stdcall pointer
void __fastcall ExampleDelegateNativeFunction()
{
    managedDelegate();
}

#pragma managed

delegate void ExampleDelegateManaged();

void ExampleDelegateManagedFunction()
{
    Console::WriteLine("Yay this works just fine");
}

int main(array<System::String ^> ^args)
{
    //You'd want the delegate below to be a member of a class or otherwise pinned to something so that the GC wouldn't clean it up when it could still be called
    //and obviously this probably wouldn't be in your main function:-)
    ExampleDelegateManaged^ managedDelegatePinnedHereToAvoidGarbageCollection = gcnew ExampleDelegateManaged(ExampleDelegateManagedFunction);

    IntPtr funcPointer = System::Runtime::InteropServices::Marshal::GetFunctionPointerForDelegate(managedDelegatePinnedHereToAvoidGarbageCollection);

    managedDelegate = (PExampleDelegateManaged)(void*)funcPointer;

    //other code here

    return 0;
}

While that's not overly complicated to do, it wouldn't work with the requirements that had to be met, so another way to do it needed to be developed.  The solution that I came up with was to use "shellcode", which is the term I'm going to use since it's the best way that I know of to describe native code that executes at an arbitrary point in memory without needing any modules or linked libraries to execute successfully.  Normally writing shellcode isn't something that a programmer has to do very often unless they're doing "security research", but it can occasionally be useful in legitimate programming as well.  Since we're going to be using such code on purpose in our own process, we don't have to worry about removing null bytes, which makes it not too difficult to do.  So here's what needs to be done in Visual Studio to make this happen in C without writing assembly (though the same thing can be done with other compilers and operating systems too!):

  1. Create a New Project with the VC++ language.
  2. Choose Win32 Console Application, name it, hit ok, and click Finish.
  3. Right-click the project and go to its properties page.
  4. In the C/C++ -> Optimization section, set Optimization to Disabled.
  5. In the C/C++ -> Code Generation section, set Runtime Library to Multi-Threaded DLL and set  Security Check to Disabled.
  6. In the C/C++ -> Output Files section, set Assembler Output to Assembly, Machine Code, and Source.

Now that that is done, we have a framework to make the code we need.  In my case, one of the callbacks that I had to wrap took four arguments: a pointer, an integer, a pointer, and a pointer.  In order to make code to wrap it, I simply coded the following in the file that Visual Studio created for me with a main method already in it:

#include "stdafx.h"

extern "C" {

#pragma runtime_checks( "", off ) //Security and cookie checks create dependencies that we can't have in order to make this code truly independent

typedef void(__stdcall *PStdCallPrototype)(void* Argument1, int Argument2, void* Argument3, void* Argument4);

//You don't want it to be inlined if you call it in main, so tell the compiler not to inline it
__declspec(noinline) void __fastcall FastCallPrototypeCalledFixedAddress(void* Argument1, int Argument2, void* Argument3, void* Argument4)
{
    //We'll use fixed addresses for where this gets called as it will make it easier to replace with a real address later in C#
    PStdCallPrototype protoFunction;

#ifdef _WIN64
    protoFunction = (PStdCallPrototype)0xF0F0F0F0F0F0F0F0;
#else
    protoFunction = (PStdCallPrototype)0xF0F0F0F0;
#endif

    protoFunction(Argument1, Argument2, Argument3, Argument4);
}

#pragma runtime_checks( "", restore )

}

int _tmain(int argc, _TCHAR* argv[])
{
    //Use four very easy to identify parameters
    void* Argument1 = (void*)1;
    int Argument2 = 2;
    void* Argument3 = (void*)3;
    void* Argument4 = (void*)4;

    //Call the wrapper so you can step through it if you want to see it in action
    FastCallPrototypeCalledFixedAddress(Argument1, Argument2, Argument3, Argument4);

    return 0;
}

Once you build that, the compiler emits a listing file (a .cod file with the Assembler Output setting chosen above) that shows the raw machine bytes next to each instruction.  I needed my code to work in both x86 and x64 scenarios since my C# code compiled to Any CPU, so I had to make a second file for x64.  To do that, go to Build->Configuration Manager, click on the Platform dropdown for the project, select New, then choose x64 (you can leave the "Create new solution platforms" checkbox checked or uncheck it - it doesn't matter for this small use case) and click OK.  Close the Configuration Manager, go back to the project properties, and repeat the above configuration options if necessary.  Then build the project again.  Now there are both x86 and x64 raw machine bytes available.

Now that we have raw machine code, we need a way to execute it.  I put together a simple class that does the following:

  1. Takes a delegate, the raw machine bytes, and the placeholder address used in the raw machine bytes (0xF0F0F0F0 / 0xF0F0F0F0F0F0F0F0 in the above)
  2. Uses VirtualAlloc to create PAGE_EXECUTE_READWRITE memory in the process.  That allows it to copy code to that area and then execute it.
  3. Pins the supplied delegate and creates an IntPtr to that pinned delegate using Marshal.GetFunctionPointerForDelegate
  4. Replaces the original address with the address represented by the IntPtr
  5. Exposes an IntPtr property pointing to the wrapped delegate that can be supplied as the delegate procedure for a fastcall callback

It looks like this: 

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.InteropServices;

class UnmanagedCodeSupplier : IDisposable
{
    private Delegate delegateInstance;
    private IntPtr delegatePointer;
    private IntPtr newFunctionAddress;

    public IntPtr WrappedDelegateFunction { get { return newFunctionAddress; } }

    public UnmanagedCodeSupplier(Delegate actualDelegate, byte[] codeBytes, UIntPtr addressToReplace)
    {
        newFunctionAddress = VirtualAlloc(IntPtr.Zero, new IntPtr(codeBytes.Length), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        if (newFunctionAddress == IntPtr.Zero)
        {
            throw new Win32Exception(); //Picks up the last Win32 error set by VirtualAlloc
        }

        List<byte> newcode = new List<byte>();
        List<byte> currentcodequeued = new List<byte>();
        delegateInstance = actualDelegate;
        delegatePointer = Marshal.GetFunctionPointerForDelegate(delegateInstance);

        //Get the original address to replace as bytes
        byte[] cBytes = IntPtr.Size == sizeof(int) ? BitConverter.GetBytes(addressToReplace.ToUInt32()) : BitConverter.GetBytes(addressToReplace.ToUInt64()); //Account for the size difference

        //Get the new bytes to use
        byte[] nBytes = IntPtr.Size == sizeof(int) ? BitConverter.GetBytes(delegatePointer.ToInt32()) : BitConverter.GetBytes(delegatePointer.ToInt64()); //Account for the size difference

        int currentMatchNumber = 0; bool matched = false;

        //Loop through the code, find the matching address, replace it with the address from the delegate
        for (int i = 0; i < codeBytes.Length; i++)
        {
            if (matched) {
                newcode.Add(codeBytes[i]); //Address already replaced, copy the rest as-is
            }
            else if (codeBytes[i] == cBytes[currentMatchNumber]) {
                currentMatchNumber++;
                if (currentMatchNumber == cBytes.Length) {//Add the real address instead of the fake
                    newcode.AddRange(nBytes);
                    currentcodequeued.Clear();
                    matched = true;
                }
                else {
                    currentcodequeued.Add(codeBytes[i]); //Possible partial match, hold these bytes back
                }
            }
            else {
                if (currentcodequeued.Count > 0) { //Partial match failed, emit the held bytes unchanged
                    newcode.AddRange(currentcodequeued);
                    currentcodequeued.Clear();
                    currentMatchNumber = 0;
                }
                newcode.Add(codeBytes[i]);
            }
        }

        if (!matched)
        {
            //cleanup - this happens to be implemented in such a way that just calling dispose can be used
            Dispose();
            throw new ArgumentException("Invalid addressToReplace specified for the specified codeBytes");
        }

        //Now just copy that executable code over to where it should be
        Marshal.Copy(newcode.ToArray(), 0, newFunctionAddress, codeBytes.Length);
    }

    #region Native Interop

    const uint MEM_COMMIT = 0x1000;
    const uint MEM_RESERVE = 0x2000;
    const uint MEM_RELEASE = 0x8000;
    const uint PAGE_EXECUTE_READWRITE = 0x40;

    [DllImport("kernel32", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr startAddress, IntPtr size, uint allocationType, uint protectionType);

    [DllImport("kernel32", SetLastError = true)]
    static extern IntPtr VirtualFree(IntPtr address, IntPtr size, uint freeType);

    #endregion

    ~UnmanagedCodeSupplier()
    {
        Dispose(false);
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    private void Dispose(bool calledDispose)
    {
        if (calledDispose)
        {
            //Do managed cleanup
            delegateInstance = null;
        }

        //Do native cleanup
        if (newFunctionAddress != IntPtr.Zero)
        {
            VirtualFree(newFunctionAddress, IntPtr.Zero, MEM_RELEASE);
            newFunctionAddress = IntPtr.Zero;
        }
    }
}

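The search-and-replace step is the part most worth testing on its own, since it needs no executable memory and runs anywhere.  Here is a minimal sketch of that idea as a standalone helper.  PlaceholderPatcher, Patch, and Example are names made up for illustration and are not part of the class above; unlike the constructor, this version also rejects code in which the placeholder shows up more than once.

```csharp
using System;

static class PlaceholderPatcher
{
    // Returns a copy of code with the single occurrence of placeholder
    // replaced by replacement. Both must have the same length so that no
    // instruction offsets shift.
    public static byte[] Patch(byte[] code, byte[] placeholder, byte[] replacement)
    {
        if (placeholder.Length == 0 || placeholder.Length != replacement.Length)
            throw new ArgumentException("placeholder and replacement must be non-empty and equal in length");

        int found = -1;
        for (int i = 0; i + placeholder.Length <= code.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < placeholder.Length && match; j++)
                match = code[i + j] == placeholder[j];
            if (!match) continue;

            if (found >= 0)
                throw new ArgumentException("placeholder occurs more than once");
            found = i;
            i += placeholder.Length - 1; // skip past this match
        }
        if (found < 0)
            throw new ArgumentException("placeholder not found in code");

        byte[] patched = (byte[])code.Clone();
        Buffer.BlockCopy(replacement, 0, patched, found, replacement.Length);
        return patched;
    }

    // Usage: patch a short excerpt of the x86 stub with a made-up address.
    public static byte[] Example()
    {
        byte[] stub = { 0xc7, 0x45, 0xf4, 0xf0, 0xf0, 0xf0, 0xf0, 0x8b };
        return Patch(stub,
                     BitConverter.GetBytes(0xF0F0F0F0u),
                     BitConverter.GetBytes(0x12345678u));
        // On a little-endian machine the result is c7 45 f4 78 56 34 12 8b.
    }
}
```

Checking for a second occurrence is a cheap guard against a placeholder value that happens to collide with real instruction bytes; if that ever triggers, picking a different placeholder constant in the C source is the simplest fix.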
I then opened up my raw machine code files and put them in my code as static byte arrays like this:


static byte[] x86CodeForFastcallWrapperForExecutionDelegate = new byte[] {
    0x55,   //push  ebp
    0x8b, 0xec,  // mov  ebp, esp
    0x83, 0xec, 0x4c, // sub  esp, 76   ; 0000004cH
    0x53,   //push  ebx
    0x56,   //push  esi
    0x57,   //push  edi
    0x89, 0x55, 0xf8,  //mov  DWORD PTR _Argument2$[ebp], edx
    0x89, 0x4d, 0xfc,  //mov  DWORD PTR _Argument1$[ebp], ecx
    0xc7, 0x45, 0xf4, 0xf0, 0xf0, 0xf0, 0xf0,   //mov  DWORD PTR _protoFunction$[ebp], -252645136 ; f0f0f0f0H
    0x8b, 0x45, 0x0c,  //mov  eax, DWORD PTR _Argument4$[ebp]
    0x50,  // push  eax
    0x8b, 0x4d, 0x08, // mov  ecx, DWORD PTR _Argument3$[ebp]
    0x51,  // push  ecx
    0x8b, 0x55, 0xf8, // mov  edx, DWORD PTR _Argument2$[ebp]
    0x52,  // push  edx
    0x8b, 0x45, 0xfc, // mov  eax, DWORD PTR _Argument1$[ebp]
    0x50,  // push  eax
    0xff, 0x55, 0xf4, // call  DWORD PTR _protoFunction$[ebp]
    0x5f,  // pop  edi
    0x5e,  // pop  esi
    0x5b,  // pop  ebx
    0x8b, 0xe5,  // mov  esp, ebp
    0x5d,  // pop  ebp
    0xc2, 0x08, 0x00 // ret  8
};

static byte[] x64CodeForFastcallWrapperForExecutionDelegate = new byte[] {
    0x4c, 0x89, 0x4c, 0x24, 0x20, // mov  QWORD PTR [rsp+32], r9
    0x4c, 0x89, 0x44, 0x24, 0x18, // mov  QWORD PTR [rsp+24], r8
    0x89, 0x54, 0x24, 0x10, // mov  DWORD PTR [rsp+16], edx
    0x48, 0x89, 0x4c, 0x24, 0x08, // mov  QWORD PTR [rsp+8], rcx
    0x48, 0x83, 0xec, 0x38, // sub  rsp, 56   ; 00000038H
    0x48, 0xb8, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,//  mov  rax, -1085102592571150096 ; f0f0f0f0f0f0f0f0H
    0x48, 0x89, 0x44, 0x24, 0x20, // mov  QWORD PTR protoFunction$[rsp], rax
    0x4c, 0x8b, 0x4c, 0x24, 0x58, // mov  r9, QWORD PTR Argument4$[rsp]
    0x4c, 0x8b, 0x44, 0x24, 0x50, // mov  r8, QWORD PTR Argument3$[rsp]
    0x8b, 0x54, 0x24, 0x48,  //mov  edx, DWORD PTR Argument2$[rsp]
    0x48, 0x8b, 0x4c, 0x24, 0x40,//  mov  rcx, QWORD PTR Argument1$[rsp]
    0xff, 0x54, 0x24, 0x20,  //call  QWORD PTR protoFunction$[rsp]
    0x48, 0x83, 0xc4, 0x38,  //add  rsp, 56   ; 00000038H
    0xc3  // ret  0
};

And finally used it all in my code like this:


private delegate void PStdCallDelegateVersion(IntPtr context, TTCallbackType type, IntPtr arg1, IntPtr arg2);

private UnmanagedCodeSupplier callbackProvider;

private MyClass()
{
    callbackProvider = new UnmanagedCodeSupplier(
        (PStdCallDelegateVersion)ttExecutionCallback,
        IntPtr.Size == sizeof(int) ? x86CodeForFastcallWrapperForExecutionDelegate : x64CodeForFastcallWrapperForExecutionDelegate,
        IntPtr.Size == sizeof(int) ? new UIntPtr(0xF0F0F0F0) : new UIntPtr(0xF0F0F0F0F0F0F0F0));
}

private void ttExecutionCallback(IntPtr cntxt, TTCallbackType type, IntPtr arg1, IntPtr arg2) { ... }


So while it's far from the simplest thing in the world to do this, it's not that bad, and it certainly made for an interesting afternoon of problem solving.  Once you have a template like this together, it's easy and relatively quick to do for additional function prototypes.

Follow us on Twitter, www.twitter.com/WindowsSDK.
