Optimizing BitBlt by generating code on the fly


The initial implementation of the Bit­Blt function in 16-bit Windows didn't have any special tricks. It was static code that supported the sixteen raster operations that involve a source and destination.

The second version of Bit­Blt generated code on the fly. Specifically, the Bit­Blt function generated code onto the stack which performed the block transfer with the appropriate operation, and then called the code as a subroutine.

These were clearly the days before Data Execution Prevention (DEP).

Internally, the function that did this was called CBlt, short for "compiled block transfer." The generated code followed a template, hard-coding the array bounds and bitmap stride directly into the emitted instructions, thereby transforming variables into constants, which avoided having to consume a register (or worse, access memory) to hold the value.
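
To give a feel for the general technique, here is a minimal sketch in C for a modern system (x86-64, Linux assumed). It is not the actual CBlt code, which was 16-bit assembly and could simply use the stack because there was no DEP to object; but the idea is the same: emit a template into a buffer, bake the caller's value in as an immediate constant, then call the buffer as a subroutine.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Hypothetical illustration: generate "mov eax, imm32; ret" at run time,
       patching the caller's value in as an immediate constant, then call it. */
    typedef int (*generated_fn)(void);

    static generated_fn emit_constant_returner(uint8_t *buf, int32_t value)
    {
        buf[0] = 0xB8;                /* mov eax, imm32 */
        memcpy(buf + 1, &value, 4);   /* the "variable" becomes a constant */
        buf[5] = 0xC3;                /* ret */
        return (generated_fn)buf;
    }

    int main(void)
    {
        /* A modern system demands explicitly executable memory;
           16-bit Windows could just use the stack. */
        uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        generated_fn fn = emit_constant_returner(buf, 42);
        printf("%d\n", fn());         /* prints 42 */

        munmap(buf, 4096);
        return 0;
    }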

The code generator emitted code to loop over the source and destination bitmaps, taking care to check for overlap between the source and destination and looping in the correct direction accordingly. The code generator also had to generate code to handle the so-called "phase mismatch" in the case where the source and destination do not start at the same bit offset within the starting byte. And of course to handle the case where the starting or ending pixels are not on a byte boundary. And then there was code to handle the case of interlaced displays, where the way to move from one scan line to the next depends on whether you're starting from an odd-numbered or even-numbered scan line.¹ Basically, all the stuff that you need to worry about when doing Bit­Blt, but instead of doing it, you are generating code that does it.
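
The overlap part is the same consideration that makes memmove different from memcpy. Here is a hypothetical C rendering of the per-row direction choice; CBlt made this decision once, at generation time, and emitted only the loop it needed.

    #include <stddef.h>

    /* Hypothetical sketch: if the destination starts inside the source range,
       copy backward so source bytes are read before they are overwritten;
       otherwise copy forward. */
    static void copy_row(unsigned char *dst, const unsigned char *src, size_t n)
    {
        if (dst > src && dst < src + n) {
            for (size_t i = n; i > 0; i--)   /* backward */
                dst[i - 1] = src[i - 1];
        } else {
            for (size_t i = 0; i < n; i++)   /* forward */
                dst[i] = src[i];
        }
    }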

Inside the loop body, the code generator inserted a code fragment to perform the block transfer operation. For example, if the operation was SRCERASE, then the generated code would do something like

    ; By this point, the source is in AL
    ; and the destination is in ES:[DI].

    mov ah, es:[di]     ; fetch the destination byte
    not ah              ; invert it
    and al, ah          ; AND with the source (SRCERASE is S AND NOT D)

    ; On exit from the fragment, the result is in AL.

The fragment is where the donuts are made. All the rest of the generated code is just scaffolding so we can get to this point. And as you can see, the fragment is usually rather anticlimactic.

The code generator had a table of sixteen fragments, so it knew what instructions to place inside the loop body.
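
In spirit, and only in spirit, the table might be sketched like this in C; the real one was assembly, and the byte values below are placeholders rather than real instruction encodings.

    #include <string.h>

    struct rop_fragment {
        const unsigned char *code;   /* machine-code bytes to copy into the loop body */
        unsigned int         len;    /* how many bytes */
    };

    /* Placeholder bytes, not real encodings. */
    static const unsigned char frag_srcerase[] = { 0x90, 0x90, 0x90 };
    static const unsigned char frag_srcand[]   = { 0x90, 0x90 };

    /* Indexed by the operation's truth table: bit (S*2 + D) of the index
       is the result for that combination of source and destination bits. */
    static const struct rop_fragment fragments[16] = {
        [0x4] = { frag_srcerase, sizeof frag_srcerase },   /* S AND NOT D */
        [0x8] = { frag_srcand,   sizeof frag_srcand   },   /* S AND D     */
        /* ...the other fourteen operations... */
    };

    /* Copy the selected fragment into the code being generated. */
    static unsigned char *emit_fragment(unsigned char *out, unsigned rop)
    {
        memcpy(out, fragments[rop].code, fragments[rop].len);
        return out + fragments[rop].len;
    }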

The third version of Bit­Blt, known as SuperBlt at the time, extended its support to three-parameter raster operations (source, pattern, and destination). There are 256 possible operations, but to avoid exploding the number of fragments, there was some consolidation. For example, if two raster operations are the same except that one is the bitwise inverse of the other, then the same fragment was used for both, and the compiler appended a not al to one of them.
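
A hypothetical sketch of that consolidation: treat the operation as its 8-bit truth table (one result bit per combination of pattern, source, and destination bits); if the table is the bitwise complement of one we have a fragment for, reuse that fragment and append the inversion. The helper functions here are assumptions for the sketch, not part of the original code.

    /* Assumed helpers for this sketch. */
    int            have_fragment(unsigned table);
    unsigned char *emit_fragment8(unsigned char *out, unsigned table);

    static unsigned char *emit_rop(unsigned char *out, unsigned table)
    {
        table &= 0xFF;
        if (have_fragment(table))
            return emit_fragment8(out, table);

        /* Reuse the inverse operation's fragment and invert the result. */
        out = emit_fragment8(out, ~table & 0xFF);
        *out++ = 0xF6;   /* not al */
        *out++ = 0xD0;
        return out;
    }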

The fragment table also noted which inputs were required by the operation. For example, the DSTINVERT operation doesn't use the source at all, so the code generator can avoid generating the code to loop through the source bytes and load them into the al register. No point calculating values you're never going to use.
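
The real table simply recorded these flags, but you can see where they would come from: an input is needed only if flipping it can change the result. A hypothetical sketch, again treating the operation as its 8-bit truth table:

    /* Bit (P*4 + S*2 + D) of the table is the result for that combination
       of pattern, source, and destination bits. */
    #define NEEDS_PATTERN 0x1
    #define NEEDS_SOURCE  0x2
    #define NEEDS_DEST    0x4

    static unsigned rop_inputs(unsigned table)
    {
        unsigned flags = 0;
        if (((table >> 4) ^ table) & 0x0F) flags |= NEEDS_PATTERN;
        if (((table >> 2) ^ table) & 0x33) flags |= NEEDS_SOURCE;
        if (((table >> 1) ^ table) & 0x55) flags |= NEEDS_DEST;
        return flags;
    }

    /* DSTINVERT's table is 0x55 (result = NOT D), so rop_inputs(0x55)
       returns only NEEDS_DEST, and the source-fetch code can be skipped. */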

The result of all this compilation was around 120 instructions of machine code to perform a block transfer operation. Each of these custom-generated subroutines handled a particular bitmap size, overlap scenario, phase match, block transfer operation, and interlace state.

The fourth version of Bit­Blt added support for blitting between color bitmaps and monochrome bitmaps. So now you had color conversion as another input to the code generator.

In Windows 3.0, the fifth version added support for bitmaps larger than 64KB. The code generator took advantage of 32-bit registers so that it could index into the entire memory block at once, instead of having to break it up into 64KB pieces.

In Windows 95, the code generator got crazy and used the esp register as a general-purpose register. The 80386 has only eight 32-bit registers, so gaining an extra register was a big help. The code doesn't actually use the stack, so the fact that the esp register doesn't point to the stack isn't a problem. (Note that normal Win32 code can't get away with this trick because the stack pointer must remain valid for stack unwinding purposes. But this was special code running under special conditions, and it was in cahoots with the kernel so the kernel didn't freak out when it saw this wacko stack.)

Uh oh, but this means that you can't use the esp register to access your local variables. No problem! We'll run the code on a custom stack, too, so that our local variables are at fixed offsets relative to the stack selector register.

Nearly all of GDI was written in assembly language in versions of Windows up to and including the Windows 95 family. In that era, being able to not only read but also write assembly language was a core developer skill.

Bonus reading: The idea of generating block transfer code on the fly has been around for a while. (If impatient, skip to the bottom of page 43.)

¹ The way the code managed this was rather clever. It calculated the stride to go from an odd-numbered scan line to an even-numbered scan line, and then it calculated the stride to go from an even-numbered scan line to an odd-numbered scan line. It then xor'd the two values together to create a toggle value. After each scan line was complete, the current stride was applied, and then the stride was xor'd with the toggle value. This caused the stride to flip back and forth between the two desired values.
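
In C, the trick looks something like this (the names are made up for the sketch):

    /* Hypothetical sketch of the footnote's toggle trick: the stride
       alternates between two values, so xor-ing with a precomputed toggle
       flips it back and forth with no per-scan-line test. */
    void walk_interlaced_rows(unsigned char *p, int rows,
                              int even_to_odd_stride, int odd_to_even_stride)
    {
        int stride = even_to_odd_stride;                       /* assume we start on an even line */
        int toggle = even_to_odd_stride ^ odd_to_even_stride;  /* xor of the two strides */

        for (int row = 0; row < rows; row++) {
            /* ... process the scan line at p ... */
            p += stride;        /* apply the current stride */
            stride ^= toggle;   /* flip to the other stride */
        }
    }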

Comments (24)
  1. Pietro Gagliardi (andlabs) says:

    Ah yes, the Blit. If you were to ask me 10 years ago (holy cow it’s been 10 years now), I would’ve been super extra double obsessed with all the things coming out of Bell Labs at that time.

    Before reading the SIGGRAPH paper today, I didn’t know that the Blit had a 68000 as its CPU, and WOW the AT&T dialect of 68000 assembly language looks quite different from the Motorola variant that I am familiar with! Not as extreme as AT&T vs Intel x86 assembly, mind (the biggest difference is that where AT&T says %a0@+, Motorola says (a0)+), but still odd somehow…

    Bonus watching: Penn and Teller using what looks like a Blit but probably isn’t to prank Rob Pike’s and Dennis Ritchie’s then-boss, co-discoverer of the cosmic microwave background Arno Penzias: https://www.youtube.com/watch?v=fxMKuv0A6z4

  2. Andre says:

    Very interesting to see to what lengths the Windows devs went to make things fast. Any ballpark numbers by how much this improved performance over the first approach?

    And, since it’s probably costly to maintain, are there still many of these performance tricks inside Windows? As nowadays it seems common that the hardware improvements of the last 20 years are being spent by userland applications adding about a dozen levels of abstraction, until a blinking cursor uses up the CPU, so all these neat improvements will get eaten up anyway.

  3. John Elliott says:

    The Windows 1.x drivers I’ve looked at all generate blit code on the fly. I think they’d have been for Windows 1.03 or 1.04.

    The drivers for GEM on the Apricot F-series also generate code, but they’re the only ones I’ve seen that do. Presumably Digital Research decided the performance increase wasn’t worth the extra burden of maintenance.

  4. Darran Rowe says:

    I would imagine that BitBlt still evolved to include even more crazy stuff. Windows NT 4 made the GDI kernel mode and the latest versions also include hardware acceleration when available.
    Thinking about that, and going off on a tangent. These parts of Windows were originally made kernel mode to improve performance. Raymond, do you know if this is still holding true, especially with Meltdown and Spectre, or has nobody at Microsoft tested this?

    1. Klimax says:

      It moved into user-mode in Vista and IIRC stayed there. Cost of switching was amortized by doing bulk of work in user-mode code and only sending finished result to kernel-mode for displaying. One of major reasons for introduction of WDDM.

      So none of fixes/workarounds for those two vulnerabilities should have any measurable impact on graphical performance.

      See:
      https://docs.microsoft.com/en-us/windows-hardware/drivers/display/windows-vista-and-later-display-driver-model-architecture
      https://docs.microsoft.com/en-us/windows-hardware/drivers/display/windows-vista-and-later-display-driver-model-operation-flow

      1. Darran Rowe says:

        I knew that happened with the Direct3D runtime. But are you sure that it also happens with the GDI?
        The first link you provided shows that even in the WDDM model, the Win32 GDI goes straight to Win32K. From that, my understanding was that even in Vista and newer versions, Win32K, or kernel mode, was still where the GDI was implemented.

        1. GDI was pulled out of kernel along with everything else in Vista. Where did you get the idea that it wasn’t? All GDI now runs through cdd.dll, which manages the actual interfacing with WDDM and the driver from there.

          1. Darran Rowe says:

            General lack of major interest and no real pressing need to research into it. The little bit of prodding around that I did do made it seem that the flow went gdi32.dll -> win32u.dll or maybe gdi32full.dll -> win32u.dll, and win32u.dll is the entry into win32k. There also wasn’t any obvious links between cdd.dll and anything else, since it is obviously a kernel mode library and it depends on win32k and it doesn’t have any exports.
            But well, if people keep saying that part of GDI is user mode then who am I to argue.
            The question was originally just one of those random passing thoughts anyway.

          2. Mark S says:

            (I can’t reply to Darran’s message, presumably because the thread is too deeply nested)

            Note that as a consequence, in Vista, GDI could no longer be hardware accelerated. But partial support for GDI hardware acceleration was restored in Windows 7.

        2. Klimax says:

          Ok. Looks like the answer is far less clear than I remember it. As far as I can tell based upon following links, GDI is handled by DWM. Apparently it is not known how much is done in user-mode.

          Engineering Windows 7 Graphics Performance:
          https://blogs.msdn.microsoft.com/e7/2009/04/25/engineering-windows-7-graphics-performance/

          Also this page might give you some idea about answer:
          https://docs.microsoft.com/en-us/windows-hardware/drivers/display/rendering-on-a-discrete-gpu-using-cross-adapter-resources

          1. camhusmj38 says:

            Articles like this strongly suggest GDI is still in the Kernel. https://blog.quarkslab.com/reverse-engineering-the-win32k-type-isolation-mitigation.html

            Apparently, there have been many exploits that relied on that fact.

  5. laonianren says:

    In the interlaced case is it hopping forward and back between two separate bitmaps for the odd and even fields? On 80s systems I programmed, the video controllers handled interlacing transparently, which made things easier.

  6. JT says:

    In about 1984 my friend John Butler told me about this fancy Windows BitBlt and credited Rao Remala with the design and implementation. I still remember John’s excitement when he uttered the strange sentence “And then it executes the stack!”. What a cool idea.

  7. IanBoyd says:

    I remember an old Michael Abrash article in Dr. Dobbs talking about WinG. (http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1995/9514/9514f/9514f.htm)

    WinG gave developers a flat 32-bit bitmap model, with a fast, double-buffered copy to the screen. He also mentioned that all the functionality was built into Windows NT – and so on NT it was a simple wrapper.

    > You see, Windows is a DOS extender, it provides driver support in a generally device-independent way, and it attracts far more drivers than any other platform. The only thing missing for real-time graphics is fast, direct, double-buffered graphics– or, rather, I should say “was,” because that gap is now filled by WinG (pronounced win-GEE), the software that made WinDOOM possible. WinG does nothing more or less profound than support fast, double-buffered drawing on Win 3.1, Win32s, and Win32, and that’s precisely the missing ingredient in making Windows an excellent real-time animation platform.
    >
    > I know what you’re thinking: Is that all? Is that what you’re getting so excited about? Well, yes. Think of it this way. Right now, if you want to do real-time, 256-color graphics above 320×200 under DOS, you have to deal with the complexities of Mode X. If you want to go past 360×480, you have to deal with supporting dozens of different SuperVGAs, and you have to handle banked video memory. You also have to do all the drawing yourself, including text. Furthermore, you have to deal with protected mode somehow, plus input handling, and all the rest.
    >
    > With WinG, the details of protected mode, devices, and input handling are handled by Windows, and 32-bit programming is a snap with Windows NT or Win32s or, soon, Windows 95, with tools galore available. Better yet, you can do all your drawing into a single, linear pixel buffer, with GDI’s help, if you’d like, and then WinG will copy or stretch that buffer to the screen at memory bandwidth. No banking, no Mode X, no mode-set complexities, just the big, linear pixel buffer with 32-bit addressing that you’ve always dreamed of–and Windows is everywhere nowadays, so we’re talking about a huge and rapidly growing installed base.

    1. Beldantazar says:

      Ah, the days before directx and gpu’s doing all the work for you far faster than the cpu ever could.

      1. Anon says:

        Microsoft’s strategy with gaming was actually pretty clever. The first stage was to get all the Dos games running on WinG with the “Windows as a Dos extender that lets you render into a linear framebuffer and then blast it to the screen” model Abrash was advocating. The second was to develop DirectX as an abstraction layer between games and the GPU so things like drawing a textured polygon could be hardware accelerated.

        Pretty quickly Windows was a better platform for gaming than Dos was.

  8. Bram Stolk says:

    Heh, takes me back.
    I used to do this too.
    Convert sprite-data into blit-code (even with alpha-transparency added.)
    I would convert it into x86 code, and load it as a shared object.

    Later, I even ported it from x86 asm to a dec-alpha assembly system, because I had a toy dec-alpha running linux at the time.

    1. Mark S says:

      You might enjoy this bit of craziness: https://www.youtube.com/watch?v=vAHXK2wut_I
      Using the sprite positions *as* code in this intricate Super Mario World exploit

  9. i336_ says:

    It would be so awesome for the Windows 3.1 source code to become available someday. Windows 95 would be… I’d love to live to see that.

    I’d love to see the full JIT routines in question – since that’s what this is! I’m very happy to learn that Windows 1.x (see other comment here) through Windows 95 and probably even NT used JIT techniques to manipulate the contents of the screen. Wow!

    (A tangent on the subject of graphics: I want to do a feasibility study on making a (barebones) Win3.x driver for the Bochs GPU (which is very simple). I’ve timidly poked through the 3.1 DDK (which appears to have some reference code), but I can’t figure out what I’m looking at. I guess my first question is if I’d be able to shim everything and write the bulk of the code in C. End thread hijack – but I suspect that ONT could be the best place to ask this…)

    1. John Elliott says:

      You can find source for the Windows 2.x version in the Windows 2 DDK — see, for example, vga-ega/cblt.asm.

  10. Neil says:

    Huh, and here was me thinking that the 256 ROP codes were handled by reversing the computation in the table at https://msdn.microsoft.com/en-us/library/aa932106.aspx i.e. D = ROP >> 16 >> P * 4 >> S * 2 >> D & 1 (except I was assuming it used some sort of optimised bit twiddling magic instead of computing it bit by bit).

  11. Stu says:

    I seem to remember Executor, the Mac emulator did something similar for blitting.

  12. Joshua says:

    Now I don’t feel bad for my own trampolines on the stack.

  13. R P (MSFT) says:

    Can you shed some light on the low-order word of the predefined ROP codes? I have a vague memory that the SDK documentation used to say that it was opaque input to the code generator, but that might have been Petzold, Pietrek, Schulman, or someone else who said that.

Comments are closed.
