Minimize the size of your program - high level

NOTICE: The following is not intended for real-world application. It is just an intellectual exercise to minimize the size of the program. The generated PE file may or may not be valid as it depends on behavior of specific architecture, OS and toolset.

Nowadays, the size of the program is no longer a big issue. But it is still very interesting to write program with extremely small size.

Here, I'll limit the discussion on X86 + WinXP + PE format to simplify the analysis. And I choose VC9 as our C++ compiler. The target program reads the number of queries first, then calculates the digest for each query and outputs the result.

If you use the default setting to compile the code, you will get an executable of size 56320 bytes.

cl /c test.cpp
link test.obj

That is small, right? But most of the code in the executable is not yours. By default, the compiler will statically link to the CRT library, so they will reside in your program. We'd better get rid of these stuff. Let's add /MD flag to cl.exe. WOW! test.exe is only 5632 bytes now! But that is not the end of the story, we can decrease the size even further.

If you check the file generated, you can still find lots of stuff not belong to you. For example, the entry point procedure "start" is added "magically" by the compiler to initialize the CRT (you can check the source code of this function in Microsoft Visual Studio 9.0\VC\crt\src\crt0.c). In our program, these work can be safely ignored. We can pass /entry:"main" flag to tell the linker to use our main function as the entry point directly. Unfortunately, you will get the following linker errors if you do that:

MSVCRT.lib(gs_report.obj) : error LNK2019: unresolved external symbol __imp__TerminateProcess@8 referenced in function ___report_gsfailure
MSVCRT.lib(gs_report.obj) : error LNK2019: unresolved external symbol __imp__GetCurrentProcess@0 referenced in function ___report_gsfailure
MSVCRT.lib(gs_report.obj) : error LNK2019: unresolved external symbol __imp__UnhandledExceptionFilter@4 referenced in function ___report_gsfailure
MSVCRT.lib(gs_report.obj) : error LNK2019: unresolved external symbol __imp__SetUnhandledExceptionFilter@4 referenced in function ___report_gsfailure
MSVCRT.lib(gs_report.obj) : error LNK2019: unresolved external symbol __imp__IsDebuggerPresent@0 referenced in function ___report_gsfailure

Don't worry. That is what /GS flag does, and we will discuss it later. Here we simply pass kernel32.lib to link.exe to resolve the failure. The size of test.exe is now 3072 bytes.

/GS is used to detect buffer overrun, and is on by default. This flag is very useful for real-world application, but our program will not take the potential buffer overrun into consideration here. So we turn it off by passing /GS- to cl.exe and test.exe decrease to 2560 bytes!

There are no redundant code now, and we can focus on other stuff in our program. In PE format, different sections are used to store different kinds of data. For example, VC store code into ".text" and data into ".data". Each section will align to the 512-byte boundary in the file. That is why all the previous size of test.exe are multiply of 512. We can merge them to save the space (this is not safe for real-world application and may cause problems). These flags for link.exe are what we need: /MERGE:.rdata=.text /MERGE:.data=.text. We will then have test.exe of 1536 bytes!

With /O1 for cl.exe (minimize size), we can decrease the size to 1024 bytes. After loosing the section alignment requirement to 4 by linker flag /ALIGN:4 (again, this is not safe for real-world application and may cause problems), we achieve the size of 896.

All of these tunings are high level, and finally we decrease the size of the original exe by more than 98%!

cl /O1 /MD /GS- /c test.cpp
link test.obj /entry:"main" /MERGE:.rdata=.text /MERGE:.data=.text /ALIGN:4

BTW: If the main is completely empty, the above flag setting can generate a program of only 468 bytes. That is the lower bound of high level size optimization. To go even further, we have to resort to low level approach. That will be described in the next post.

The program:

#include <cstdio>
#include <cstring>

static unsigned char Tbl[9]={1,1,2,3,7,5,6,4,8};

int main()
{
    unsigned char BufT[0x401];
    int n;
    scanf("%d",&n);
    for (;n>0;--n) {
        scanf("%s",BufT);

        unsigned long keyT[4]={0xCBDCEDFE,0x8798A9BA,0x43546576,0x00102132};
        unsigned char *Buf=BufT;
        int tt;
        while (tt=*Buf++) {
            {
                int t=Tbl[tt%9];
                for (int k=0;k<4;++k) keyT[k]=keyT[k]*t+0xFBFC;
            }
            unsigned long Buf1[4];
            int t=tt&0xF;
            memcpy(Buf1,(char *)keyT+t,0x10-t);
            memcpy((char *)Buf1+0x10-t,keyT,t);
            memcpy(keyT,Buf1,16);
        }
       
        for (size_t k=0;k<4;++k) printf("%08X",keyT[k]);
        printf("\n");
    }
    return 0;
}