The Beginning of the Endian


So the rumors have been true. Apple is moving to Intel processors, which, in some ways, makes sense. At least it does from Apple’s point of view. The Kool-Aid has never really affected their ability to figure out what’s right for Apple regardless of the direction they’ve taken.

For users, the transition won’t be all that painful either. Indeed, the prospect of being able to run VirtualPC on a native Intel processor has me, as a user, licking my chops. Buy a Mac and you’ll get the best of all three worlds: Mac, Unix and Windows.

When you’re a developer, though, the Kool-Aid can be rather heady. For those of us who’ve been around for a while, developing software for Apple’s computers has often been rather like being a farm animal with a ring through its nose. In fact, the primary difference between Apple and Microsoft as platform vendors is that Apple generates excitement in the user base while Microsoft tends toward generating excitement in the independent software developer base.

So, when Steve Jobs starts talking about certain software developers having little trouble making the transition from the PowerPC to x86 processors, you can pretty well bet that the Kool-Aid is laced with all sorts of compounds that are not there for your benefit.

No doubt, some, perhaps even many, applications will have little difficulty making the transition. There is, however, a class of applications that will require a significant amount of work. The feature they all have in common: they produce documents, files that store data in binary form. To understand why, we need a little lesson in the history of microprocessor design.

Microprocessors have data registers. These are analogous to little scratch-pads on your desktop. When the computer needs to do some manipulation of data, it pulls that data out of memory and puts it into these little scratch pads. When the computer is done with that data, yet wants to save it for later use, it writes the data on the scratch pad out to memory.

Early microprocessors had 8-bit, or byte-sized, data registers. This was generally convenient, because memory is considered to be just a very huge array of bytes. However, when the first 16-bit microprocessors were designed, the size of the scratch pads, or data registers, was twice the size of a single, addressable piece of memory.

This introduced a design problem. When you read a 16-bit value from a byte-addressable data store, like memory, which byte do you load first? You have two choices. You can load the most-significant byte first (known as big-endian), or you can load the least-significant byte first (known as little-endian). The difference between big-endian and little-endian systems is often referred to as “byte-sex”. It’s a term I’ve used before, and, for those who’ve wondered what it means, now you know.

As luck would have it, the two most active designers of microprocessors at the time, Intel and Motorola, answered that question in opposite ways. I’m not well-versed in VLSI design, but I have little doubt that each group of chip designers had legitimate reasons for the choices they made. None of that is important here. What is important is that, way back in those days, Apple chose Motorola processors while IBM chose Intel. Thus began the great endian divide.

Taken independently of each other, neither choice presents a significant issue. Big-endian is slightly more difficult to work with than little-endian, but the parameter type-checking provided by modern compilers renders the problem negligible. The real problem starts when binary data gets saved to disk and shared.

To see how this works, imagine a program that keeps track of people’s salaries. Your boss decides to give you a raise, so this program reads your salary from memory into one of these data registers, adds your raise to it, and then writes it back out to memory. At the end of the day, the program stores all employee records to disk as just an array of bytes, exactly as they reside in memory. Let’s also pretend that the accountants use computers with Intel processors, so the data gets written in little-endian format.

Now suppose you want to check that you got the correct raise. You fire up a program on your Mac that reads your employee record, but when it reads your salary, it’s going to read the bytes of that number in the reverse of the order in which they were written on the Intel processor. If your new salary is supposed to be $128 a week, then, unless the data is byte-swapped when it’s read from the disk, your Mac will lead you to believe that your salary is really $2,147,483,648 per week. Now that’s a raise!

If you think this is messy, then know that I’ve oversimplified the problem significantly. Rarely do programs maintain various numbers in isolation from each other. Employee records often have other numbers associated with them, like hours, hire dates, number of children, exemption status, etc. These are kept in compound data structures generally known as records. Many of them are quite huge.

When transferring data between x86 and PowerPC based computers, each numeric value in each of these records needs to be byte-swapped. Code that doesn’t properly byte-swap data written on the other processor type has a bug. I call these “unsafe byte-sex” bugs. If you’re going to move documents with binary data between PowerPC and Intel processors, you have to write code that practices safe byte-sex.

For some developers, this problem will be huge. Word, for example, has a total of 176 distinct data structures in its binary file format, each of which needs to be properly byte-swapped whenever one of them is read from the disk. Word’s example is probably a bit extreme, but even 20 or 30 data structures requiring proper byte-swapping represents a good chunk of work for both developers and testers.

That said, life for Word will actually get quite a bit easier. Why? Because Word’s files are always written in Intel byte-sex. Given the number of unsafe byte-sex bugs I’ve had to fix over the course of my career working on Word, I’m actually looking forward to the day when I can forget about them almost entirely.

 

Rick

Currently playing in iTunes: Saint Augustine In Hell by Sting

Comments (21)

  1. Alicia says:

It’s all good for progs like Word, but what about Windows?

  2. mschaef says:

    So how does VirtualPC virtualize x86 on the G5, given that the G5 no longer has little endian support?

  3. Travis Owens says:

I also blogged on this subject, but I have a much more negative attitude about the whole thing, even though I would love to run Mac OS X 10.4 within a VMWare scenario.

  4. Joe developer says:

    I have ported lots of C code between systems with different endian conventions.

    It’s generally easy to do. One reason is that the mistakes tend to be easy to spot, like your example.

    Another problem is missing Altivec support. But of course Word probably doesn’t have much of that. 🙂

  5. Rajesh says:

    When we had to write our Word importer/exporter we first wrote it for Mac. So while compiling it for Windows we just had to disable the macro for byte swapping and it worked without any hitch.

Well, switching to Intel will have its own headaches due to architectural differences. I’m not a VLSI guy, but I would like to see some kind of emulation inside the CPU that can handle both kinds.

  6. Mike Dimmick says:

    I guess you’ll be making that transition from Metrowerks CodeWarrior to XCode and from CFM to Mach-O now, then. The ‘Universal Binary Programming Guidelines’ (at http://developer.apple.com/documentation/MacOSX/Conceptual/universal_binary/universal_binary.pdf) states that "Carbon applications based on the Code Fragment Manager Preferred Executable Format (PEF) must be changed to Mach-O."

  7. I’ve seen quite a few people comment in blogs that the endian issue I mentioned yesterday won’t turn…

  8. Travis Owens says:

Just when Apple finally has an OS that’s being fully embraced by its users, Apple goes and flips everything upside down yet again.

Time to update and recompile all your code. And this time I’m not so sure we’ll see any included emulation for older apps, but only time will tell.

  9. Err… This misses out on really rather a lot of computing history.

    The big/little endian byte ordering issue arose long before microprocessors came on the scene. It dates back to the days of mainframes and mini-computers. IBM used big-endian. DEC used little-endian.

    Intel and Motorola were very late to the party, and were just carrying on two traditions that were already long-established by that time.

  10. Carlos says:

    Some older processors, like the PDP-11, confuse things even more when storing 32-bit values: the bytes are stored in the order 2,3,0,1 (where 3 is most-significant) so they are neither big-endian nor little-endian.

  11. steamer25 says:

    …until the next time Apple decides they haven’t been thinking differently enough ;).

  12. Could you please clarify why big-endian is more difficult for you to deal with?

When you’re designing a processor, the choice of endianness isn’t a big deal; it just boils down to whether your microcode ascends or descends through the byte order, and whether you figure smaller addresses as higher or lower in your diagram… In any case, big-endian systems are usually easier to debug, as you don’t have to "unwrap" memory dumps…

  13. Rick Schaut says:

    Juan,

    Big-endian is slightly more difficult, because, on little-endian systems, you can pass the address of a short to a routine that expects a pointer to a long (and vice-versa). If you do that on a big-endian system, you’ll either overwrite the wrong word in memory or you’ll get a value back that’s way too large.
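    To illustrate with a hypothetical snippet (using memcpy to stand in for what sloppy pointer-passing effectively does): read a 32-bit value through a 16-bit-sized window, and a little-endian machine hands you the low half, so the mistake hides for small values, while a big-endian machine hands you the wrong half.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t wide = 7;      /* a long-sized value */
        uint16_t narrow;

        /* Take only the first two bytes at wide's address, as code
           passing &wide to a routine expecting a short* would. */
        memcpy(&narrow, &wide, sizeof narrow);

        /* Little-endian: narrow == 7 (the low half, so the bug hides).
           Big-endian: narrow == 0 (the high half, the wrong answer). */
        printf("%u\n", narrow);
        return 0;
    }
    ```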

    Rick