The initial implementation of the
in 16-bit Windows
didn't have any special tricks.
It was static code that supported the
sixteen raster operations that involve a source and destination.
The second version of
generated code on the fly.
BitBlt function generated code onto the stack
which performed the block transfer with the appropriate operation,
and then called the code as a subroutine.
This was clearly the days before Data Execution Prevention (DEP).
Internally, the function that did this was called
short for "compiled block transfer."
The generated code followed a template,
hard-coding the array bounds and bitmap stride
into the generated code,
thereby transforming variables into constants,
which avoided having to consume a register (or worse,
accessing memory) to check the value.
The code generator emitted code to loop over the source
and destination bitmaps,
taking care to check for overlap between the source and destination
and looping in the correct direction accordingly.
The code generator also had to generate code to handle the so-called
"phase mismatch" in the case where the source and destination
do not start at the same bit offset within the starting byte.
And of course to handle the case where the starting or ending
pixels are not on a byte boundary.
And then there was code to handle the case of interlaced displays,
where the way to move from one scan line to the next
depends on whether you're starting from an odd-numbered
or even-numbered scan line.¹
Basically, all the stuff that you need to worry about when doing
but instead of doing it,
you are generating code that does it.
Inside the loop body,
the code generator
inserted a code fragment to perform the block transfer operation.
For example, if the operation was
then the generated code would do something like
; By this point, the source is in AL ; and the destination is in ES:[DI]. not al and al, es:[di] ; On exit from the fragment, the result is in AL.
The fragment is where the donuts are made. All the rest of the generated code is just scaffolding so we can get to this point. And as you can see, the fragment is usually rather anticlimactic.
The code generator had a table of sixteen fragments, so it knew what instructions to place inside the loop body.
The third version of
known as SuperBlt at the time,
extended its support to three-parameter raster operations
(source, pattern, and destination).
There are 256 possible operations,
but to avoid exploding the number of fragments,
there was some consolidation.
if two raster operations are the same except that one is the
bitwise inverse of the other, then the same fragment was used for
both, and the compiler appended a
not al to one of them.
The fragment table also noted which inputs were required
by the operation.
For example, the
DSTINVERT operation doesn't
use the source at all,
so the code generator can avoid generating the code to
loop through the source bytes and load them into the
No point calculating values you're never going to use.
The result of all this compilation was around 120 instructions of machine code to perform a block transfer operation. Each of these custom-generated subroutines handled a particular bitmap size, overlap scenario, phase match, block transfer operation, and interlace state.
The fourth version of
BitBlt added support
for blitting between color bitmaps and monochrome bitmaps.
So now you had color conversion as another input to the code
In Windows 3.0, the fifth version added support for bitmaps larger than 64KB. The code generator took advantage of 32-bit registers so that it could index into the entire memory block at once, instead of having to break it up into 64KB pieces.
In Windows 95,
the code generator got crazy
and used the
esp register as a general-purpose register.
The 80386 has only eight 32-bit registers,
so gaining an extra register was a big help.
The code doesn't actually use the stack, so the fact that the
esp register doesn't point to the stack isn't a problem.
(Note that normal Win32 code can't get away with this trick because the stack
pointer must remain valid for stack unwinding purposes. But this was
special code running under special conditions, and it was in cahoots with
the kernel so the kernel didn't freak out when it saw this wacko stack.)
but this means that you can't use the
to access your local variables.
We'll run the code on a custom stack, too,
so that our local variables are at fixed offsets relative
to the stack selector register.
Nearly all of GDI was written in assembly language in versions of Windows up to and including the Windows 95 family. In that era, being able to not only read but also write assembly language was a core developer skill.
Bonus reading: The idea of generating block transfer code on the fly has been around for a while. (If impatient, skip to the bottom of page 43.)
¹ The way the code managed this was rather clever. It calculated the stride to go from an odd-numbered scan line to an even-numbered scan line, and then it calculated the stride to go from an even-numbered scan line to an odd-numbered scan line. It then xor'd the two values together to create a toggle value. After each scan line was complete, the current stride was applied, and then the stride was xor'd with the toggle value. This causes the stride to flip back and forth between the two desired values.