Minimize the size of your program - low level

NOTICE: The following is not intended for real-world application. It is just an intellectual exercise to minimize the size of the program. The generated PE file may or may not be valid as it depends on behavior of specific architecture, OS and toolset.

The previous post: Minimize the size of your program - high level

Do you know the smallest x86 PE file in the world?
Here is the answer: https://www.phreedom.org/solar/code/tinype/

  • Smallest possible PE file: 97 bytes
  • Smallest possible PE file on Windows 2000: 133 bytes
  • Smallest PE file that downloads a file over WebDAV and executes it: 133 bytes

BTW: You can use IDA Pro, WinDBG or OllyDbg to open and debug the tiny PE :P

The main idea of tiny PE is simple: Striping unnecessary bytes and collapsing data structure by overlapping. So the overhead of PE format will be minimized (however, striping will lead to invalid PE header). Here, we are more concerned about how to decrease the size of our code.

To investigate the code generated by the compiler, you can use the /FAsc flag.

First, we can apply the technique used in tiny PE to collapse the PE Header. There're lots of fields that are not used by the current OS loader. It is unsafe to assume that they are not used in real-world application, but as far as our main purpose here is to minimize the size, we will move our data into these fields to save space. Another benefit is that, the address difference of these data will be within 0x80, which is suitable for our further optimization of the code.

Now let's begin our optimization!

1. Algorithm

a. Local variable initialization

Normally, the initialization involves lots of bytes of code.

    unsigned long keyT[4]={0xCBDCEDFE,0x8798A9BA,0x43546576,0x00102132};

The above code will be translated to assembly like:

    mov    DWORD PTR [ebp+offset8], imm32

It is 7 bytes. Then 7*4 = 28 bytes to initialize the whole array.

Instead, we can change the initialization into for-loop.

    for (size_t i=0;i<16;++i) ((char *)keyT)[i]=0xFE-i*0x11;
    ((char *)keyT)[15]++;

The above code can be implemented as:

    lea    edx, [ebp+keyT-1]
    xor    ecx, ecx
    mov    cl, 16
    mov    al, 0FEh
loc_4:
    inc    edx
    mov    [edx], al
    sub    al, 11h
    loop    loc_4
    inc    byte [edx]

28 -> 18 bytes!

b. memcpy

By default, the compiler will call the API memcpy in msvcrt.dll. But the cost is high. The alternative is to use intrinsic version of memcpy which takes advantage of "rep movs" instruction instead.
If we take a closer look at the algorithm, it turns out that the code simply rotates keyT array t bytes left.

    memcpy(Buf1,(char *)keyT+t,0x10-t);
    memcpy((char *)Buf1+0x10-t,keyT,t);
    memcpy(keyT,Buf1,16);

That means most of the parameter for memcpy can be reused, which will save lots of store/load.

    ;ebx == t, edi == end of keyT
    xor    ecx, ecx
    mov    cl, 10h
    sub    cl, bl
    mov    esi, edi
    sub    esi, ecx
    lea    edi, [ebp+Buf1]
    push    edi
    rep    movsb

    mov    cl, bl
    lea    esi, [ebp+keyT]
    rep    movsb

    pop    esi
    mov    cl, 10h
    rep    movsb

Thanks to the value reuse, the last two memcpy only use 12 bytes, and 35 bytes in total! In comparison, 54 bytes before optimization, not including more than 10 bytes for PE format overhead related to reference memcpy API

2. 8-bit vs 32-bit

In 32-bit environment, all the data and address are 32-bit. Then the 32-bit immediate numbers and long jump instructions can contribute to large portion of our code.
We can try to make these value small enough, so they can be held in 8-bit data type.
For example, when we call:

    printf("%08X",keyT[k]);

The compiler will generate the code like this:

    push    hexstr+ImageBase    ; "%08X"
    call    [printf+ImageBase]

Because we arrange the positions of hexstr and printf within 80 bytes, we can use the following code instead:

    mov    eax, hexstr+ImageBase
    push    eax
    add    al, printf-hexstr
    call    [eax]

Because printf uses cdecl calling convention, the caller is responsible to balance the stack. When we pop up the arguments from stack, we can reuse the pushed "eax" for "printf("\n");"

    add    al, newline-hexstr
    push    eax
    add    al, printf-newline
    call    [eax]

22 -> 17 bytes!

We can also split the long jmp into two short jmps (5 -> 4 bytes)

3. Misc

  • Prevent using 16-bit instruction.
  • Use "loop" instruction to do the loop.
  • Use enter/leave the setup stack.
  • Use "lods/stos" to load and store from the memory.
  • Use "pusha/popa" for multiple push/pop.
  • Sometimes, the instruction for "eax" will be a little shorter than for other registers.
  • Sometimes, 8-bit instruction will be a little shorter than the 32-bit version.
  • Return value is not needed here, and OS doesn't require us to save the registers. So we can omit both the "xor eax, eax" at the end of the program and the "push/pop" instructions to save the register (this is unsafe for real-world application)
  • Register allocation is also very important. If chosen properly, we can share their values and prevent many load/store.
  • Take advantage of al, ah and alike. Otherwise the limited number of registers on X86 will be a big issue.

For more detailed information on how to optimize your code for size, please see the excellent optimization manual by Agner Fog.

BTW: Don't forget to set the "Debug Directory Size" in the PE Header to 0, otherwise the program will crash on WinXP. Because that data member overlaps with our code, that means we have to insert instructions like this into our code:

    jmp    short skip_debug
    times 4 db 0 ;ensure debug directory size is 0
skip_debug:

Here is the source (you can compile it using "nasmw -f bin") and the result:

  PE Header Code Data Total size
Before optimization 480 272 144 896
After optimization 136 180 7 323