ia64 – misdeclaring near and far data

As I mentioned yesterday, the ia64 is a very demanding architecture. Today I'll discuss another way that lying to the compiler will come back and bite you.

The ia64 does not have an absolute addressing mode. Instead, you access your global variables through the r1 register, nicknamed "gp" (global pointer). This register always points to your global variables. For example, if you had three global variables, one of them might be kept at [gp+0], the second at [gp+8] and the third at [gp+16].

(I believe the Win32 MIPS calling convention also used this technique.)

On the ia64, there is a limitation in the "addl" instruction: You can only add constants up to 22 bits, which comes out to 4MB. So you can have only 4MB of global variables.

Well, it turns out that some people want more than 4MB of global variables. Fortunately, these people don't have one million DWORD variables. Rather, they have a few really big global arrays.
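
The arithmetic behind the 4MB figure can be checked with a minimal sketch. The function names here are hypothetical, and the code assumes the 22-bit immediate is signed, which is consistent with the later remark that gp points into the middle of the global variables.

```c
#include <assert.h>
#include <stdint.h>

/* Width of the addl immediate, as stated in the text. */
enum { ADDL_IMM_BITS = 22 };

/* Total number of bytes reachable from gp with a single addl: 2^22. */
static int64_t addl_span_bytes(void)
{
    return (int64_t)1 << ADDL_IMM_BITS;
}

/* How many 4-byte DWORD variables fit in that span. */
static int64_t addl_span_dwords(void)
{
    return addl_span_bytes() / 4;
}

/* Assumption: the immediate is signed, so offsets run from -2^21 to 2^21 - 1.
   Returns 1 if the offset can be encoded, 0 otherwise. */
static int fits_addl_imm(int64_t offset)
{
    int64_t half = (int64_t)1 << (ADDL_IMM_BITS - 1);
    assert(half > 0);
    return offset >= -half && offset < half;
}
```

Under these assumptions, the span is 4,194,304 bytes, which is room for 1,048,576 DWORD variables, and an offset such as -205584 fits comfortably.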

The ia64 compiler solves this problem by splitting global variables into two categories, "small" and "large". (The boundary between "small" and "large" can be set by a compiler flag. I believe the default is to treat anything larger than 8 bytes as "large".)

The code to access a "small" variable goes like this:

        addl    r30 = -205584, gp;; // r30 -> global variable
        ld4     r30 = [r30]         // load a DWORD from the global variable

(The gp register actually points into the middle of your global variables, so that both positive and negative offsets can be used. In this case, the variable happened to live at a negative offset from gp.)

By comparison, "large" global variables are accessed through a two-step process. First, the variable itself is allocated in a separate section of the file. Second, a pointer to the variable is placed into the "small" global variables section of the module. As a result, accessing a "large" global variable requires an added level of indirection.

        addl    r30 = -205584, gp;; // r30 -> global variable forwarder
        ld8     r30 = [r30];;       // r30 -> global variable
        ld4     r30 = [r30]         // load a DWORD from the global variable
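
The two access patterns can be modelled in portable C. This is only a sketch of the idea, not what the compiler emits: the byte array standing in for the "small" section, the offset of -8, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Small" access: the DWORD lives directly at gp + offset. */
static uint32_t load_small(const unsigned char *gp, ptrdiff_t offset)
{
    uint32_t value;
    assert(gp != NULL);
    memcpy(&value, gp + offset, sizeof value);   /* like ld4 r30 = [r30] */
    return value;
}

/* "Large" access: gp + offset holds a forwarder pointer to the variable. */
static uint32_t load_large(const unsigned char *gp, ptrdiff_t offset)
{
    const unsigned char *target;
    uint32_t value;
    assert(gp != NULL);
    memcpy(&target, gp + offset, sizeof target); /* like ld8: fetch forwarder */
    memcpy(&value, target, sizeof value);        /* like ld4: fetch the data */
    return value;
}

/* Hypothetical setup: a small variable stored at gp - 8. */
static uint32_t demo_small(void)
{
    unsigned char small_area[32] = {0};
    unsigned char *gp = small_area + 16;         /* gp points into the middle */
    uint32_t v = 0x12345678u;
    memcpy(gp - 8, &v, sizeof v);
    return load_small(gp, -8);
}

/* Hypothetical setup: a large array elsewhere, with its forwarder at gp - 8. */
static uint32_t demo_large(void)
{
    static const uint32_t big_array[1024] = { 0xCAFEF00Du };
    unsigned char small_area[32] = {0};
    unsigned char *gp = small_area + 16;
    const unsigned char *forwarder = (const unsigned char *)big_array;
    memcpy(gp - 8, &forwarder, sizeof forwarder);
    return load_large(gp, -8);
}
```

In this model, demo_small() returns the value stored directly in the small area, while demo_large() has to follow the forwarder before it reaches the first element of the array.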

If you leave the size of an object unspecified, like

    extern BYTE b[];

then the compiler plays it safe and assumes the variable is large. If it turns out that the variable is small, the forwarder pointer will still be there, and the code will do the double-indirection to fetch something that could have been accessed with a single indirection. The code is slightly less efficient, but at least it still works.

On the other hand, if you misdeclare the object as being small when it is actually large, then you end up in trouble. For example, if you write

    extern BYTE b;

in one file, and

    extern BYTE b[256];

in another, then files that include the first header will think the object is small and generate "small" code to access it, but files that include the second header will think it is large. And if the object turns out to be large after all, the code that used the first header file will fail pretty spectacularly.
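
To see why the mismatch hurts, here is a small C model of the failure. It is a sketch under stated assumptions (a byte array standing in for the "small" section, a forwarder stored at gp - 8, hypothetical function names), not a reproduction of real ia64 code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char BYTE;

/* Code generated from "extern BYTE b;": the byte is assumed to be at gp + offset. */
static BYTE read_as_small(const BYTE *gp, ptrdiff_t offset)
{
    assert(gp != NULL);
    return gp[offset];
}

/* Code generated from "extern BYTE b[256];": gp + offset holds a forwarder. */
static BYTE read_as_large(const BYTE *gp, ptrdiff_t offset)
{
    const BYTE *target;
    assert(gp != NULL);
    memcpy(&target, gp + offset, sizeof target);
    return target[0];
}

/* Returns 1 if the "large" read sees b[0] while the mismatched "small" read
   sees a byte of the forwarder pointer instead. */
static int mismatch_reads_pointer_byte(void)
{
    static BYTE b[256] = { 0x42 };               /* the object really is large */
    BYTE small_area[32] = {0};
    BYTE *gp = small_area + 16;
    const BYTE *forwarder = b;
    BYTE first_pointer_byte;

    memcpy(gp - 8, &forwarder, sizeof forwarder); /* slot holds a pointer */
    memcpy(&first_pointer_byte, &forwarder, 1);   /* first byte of that pointer */

    return read_as_large(gp, -8) == 0x42
        && read_as_small(gp, -8) == first_pointer_byte;
}
```

The "small" reader never reaches the array at all. It treats part of the forwarder pointer as if it were the data, and a write through the same path would corrupt the pointer that the correctly compiled code depends on.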

So don't do that. When you declare a variable, make sure to declare it accurately. Otherwise the ia64 will catch you in a lie, and you will pay.

Comments (22)
  1. Anonymous says:

    Re: So don’t do that. When you declare a variable, make sure to declare it accurately.

    I could be way off here, but it seems like the compiler might be able to stub out ambiguous declarations, and then LTCG (link-time code generation) could be used to fix up the stub using the correct code?


  2. Anonymous says:

    The linker should take care of this. It should notice that "b" is being declared in the small data section in one object and the large data section in another. Two same-named symbols going into two different sections should be causing a link error every time. We have this on a PPC architecture and it’s exactly what happens.

  3. Anonymous says:

    So what happens if I *do* declare a million DWORDs? (OK, slightly more than a million – enough to go over 4MB.)

    That’s obviously unlikely to come up in any real hand-written code, but it’s not completely out of the question for generated code, even if it is somewhat outlandish. Does the compiler simply tell me to stop being so stupid? (Or will I crash and burn due to other resource constraints long before hitting this problem?)

  4. Anonymous says:

    Anyone care to recommend a book on the ia64 architecture? Or are these only available from Intel?

    I haven’t seen a lot of new books on assembly language in the past few years…

  5. Anonymous says:

    Hey Raymond,

    In today’s examples I noticed that you used double semicolons on some lines, single semicolons on others, and none at all on some. Was this on purpose, or is it an error?

    Just curious…

  6. Anonymous says:

    Do Intel employees get a bonus when they introduce segmentation into their architectures? This whole "small" / "large" thing is right out of the 8086 segmented architecture playbook. That did <nothing> but make life difficult for everyone who had to use it.

    What possible reason do they have for this latest design? Not enough transistors? Compatibility with an imaginary previous architecture? A desire to make Z80 assembly code run un-changed? Seriously, it’s like they just aren’t happy until they have some useless level of indirection built permanently into their chips, thus causing programmers and compiler writers no end of extra work and grief.

  7. Anonymous says:

    The architectural limit is likely due to the tradeoffs of making processor instructions fit into 64 bits. You have to have enough bits to encode the opcode, the registers affected, conditional execution flags, and the immediate value. This kind of limit isn’t unique to the IA-64. Almost any RISC architecture has it: both ARM and Thumb code have similar limits, and there was a similar issue back on the 68000, as you could only access ±32K from an address register without generating a lot of extra code.

  8. Anonymous says:

    Jack: You’re right, if you mismatch the linker does yell at you. Serves me right for trying to blog from memory.

    Ian: I haven’t actually tried declaring four million global variables. I suspect you’ll run into a linker or compiler limitation before you get that far.

    Steve: The double-semicolons have special meaning in ia64 assembly. (I don’t see any single-semicolons in my assembly code; if there are then they are typos.) If people are interested in learning more about ia64 I can write more about it, but it’s pretty deep geek stuff that is probably not of general interest.

    Peter: The ia64 was actually developed originally by HP. Intel joined in later and wound up with star billing. I suspect the things you’re complaining about were developed in the HP era.

    Ben: Actually the architectural limits aren’t really limits; they’re just a different way of doing things. The ia64 was designed around the principles of EPIC: Explicitly Parallel Instruction Computing. The theory behind EPIC is "Do a lot of work at compile time (happens only once) instead of doing it at run time (happens every time the instruction is executed)". The theory is that by doing the hard work offline, the runtime performance is much better.

  9. Anonymous says:

    "If people are interested in learning more about ia64 I can write more about it, but it’s pretty deep geek stuff that is probably not of general interest."

    IMO it’s very interesting to see how IA64 works in practice, since I’ve only experienced it in theory. From the theory I’ve read it seems great :). So, yeah – please write more about IA64!

    But if you’d like to share some insight into the Windows shell please do that too :)

  10. Anonymous says:

    IIRC the double semicolon indicates an architectural stop. An instruction bundle on IA64 is 128 bits wide: three 41-bit instruction slots plus a 5-bit template field. The template indicates which type of execution unit each slot of the bundle runs on, and whether there are architectural stops between the positions in the bundle.

    The processor is allowed to execute any operations between architectural stops in parallel. All operations in one group must be completed before the next group is begun.

    Almost all instructions on IA64 can be predicated, using a predicate register to decide whether to execute an operation. There are nominally 64 1-bit predicate registers (which can be saved to an integer register as required). If the register is 1 at the point of instruction retire, the instruction’s final state is kept, otherwise it is discarded. 6 of the 41 bits of every instruction slot are used to encode the qualifying predicate (pr0 is hardwired to 1 and means ‘always’, and is normally omitted from assembly listings).

    Short sequences of predicated instructions can typically perform better than branchy code, due to branch penalties.

    The earliest predicated architecture I’m aware of is ARM, where instructions are predicated on the state of the flags. IA64 improves on this by only setting predicates for the conditions we’re actually interested in, and in being able to hold onto them for longer.

    The 22 bit limit simply comes from the fact that the immediate field of the ‘add immediate to register’ opcode is 22 bits in size. Unlike x86, IA64 instructions generally allow source and destination to be specified separately: of the form rx = y + z rather than just rx += y. In this case, you’ve got 7 bits for specifying the source and 7 for the destination.

    There are ‘very long’ load-immediate operations, but these are obviously unsuitable for position-independent code without requiring a relocation fixup for every load. Microsoft have clearly decided to go for position independent code (as far as practical) this time around. Relocation requires the loader to patch every relocated address, which requires the OS to allocate fresh pages to the relocated code (pages can be shared between processes if the DLLs are loaded at their expected base address).

    You can get IA64 manuals from http://developer.intel.com/design/itanium/manuals/iiasdmanual.htm. That’s where I learned all this from – I haven’t yet touched an actual IA64 box (and am probably unlikely to in my current job).

  11. Anonymous says:

    I don’t know of a 64-bit architecture that allows 64-bit literals embedded in the code. e.g. on Alpha, to load a 64-bit constant into the register you have to load half the bits, shift them into place, and then load the other bits.

    x86 is really an oddball as far as allowing opcodes and literals of varying sizes. (from 1 to over 10 bytes I think)

  12. Anonymous says:

    This reminds me of the near/far models of 16-bit DOS/Windows programming. You could mix near/far code/data with enough combinations to shoot yourself in both feet and hands. Programmers were really happy when it was replaced by the flat 32-bit model. Why are they bringing this back? When will we get flat 64-bit pointers?

  13. Anonymous says:

    Um, the pointers in Win64 mode are already flat. Once a (flat) pointer to the global variable has been computed, you can use it like any other pointer.

    The only wrinkle is the way the address of the global variable is computed in the first place. But that’s all invisible to the programmer, assuming you declared your variables correctly.

  14. Anonymous says:

    Manual trackback: I wrote more about global pointers (which I hope is correct!) at my blog.

    The link above should be correct, or follow http://mikedimmick.blogspot.com/2004_01_01_mikedimmick_archive.html#107490611481306996.

  15. Anonymous says:

    > If people are interested in learning more about ia64 I can write

    Yup! More IA-64. It’s an interesting architecture and geeks love new CPU architectures. If only I could come across an IA-64 box to actually play with. They seem to be rather rare.

  16. Anonymous says:

    I’m also in favour of more 64 bit articles. I’d also be interested in AMD64 stuff if you ever come across it, not only IA64.

  17. Anonymous says:

    IA64, an architecture with perhaps even more quirks than the IA32. What’s next, Windows on the TMS320? ;)

    What about an article on debug info formats? Is there anything similar to DWARF2 in the Windows world? If there isn’t, how can the debugger handle single-stepping and local variable display in the face of compiler optimizations?

  18. Anonymous says:

    I’m going to pass on debugger issues, since I know very little about how they work. (I only do object-level debugging; I don’t use a source-level debugger.)

  19. Anonymous says:

    So you haven’t yet encountered a buggy program that relied on some peculiarity of its debug info to work? ;)

    I’m pretty sure the different debugging APIs various Windows versions have had would make for a good article, though :)

    I rarely use debuggers myself but I’m fascinated by how they work.

  20. Anonymous says:

    The Itanium has two stacks, so don’t assume that there’s only one.
