Why have a stack?


Last time I discussed why it is that we have all the .NET compilers target an “intermediate language”, or “IL”, and then have jitters that translate IL to machine code: because doing so ultimately reduces the costs of building a multi-language, multi-hardware platform. Today I want to talk a bit about why IL is the way it is; specifically, why is it a “stack machine”?

To begin with, what is a “stack machine”? Let’s consider how you might design a machine language that could describe the operation of adding together two integers to make a third. You could do it like this:

add [address of first addend], [address of second addend], [address of sum]


When the machine encounters this instruction it looks up the values of the addends stored in the two addresses, somehow adds them together — how it does so is its business — and stores the result in the third address.

You might instead say that there is a special region of memory called the “accumulator” which knows how to add a given value to itself:

clear_accumulator
increase_accumulator [address of first addend]
increase_accumulator [address of second addend]
save_accumulator [address of sum]


Or, you could say that there is a special region of memory called the “stack” which can grow and shrink; you get access to the items on the top:

push_value_at [address of first addend]
push_value_at [address of second addend]
add
pop_value_into [address of sum]


The “add” instruction takes the two values off the top of the stack, somehow adds them, and then puts the result back on the stack; the net result is that the stack shrinks by one.

A virtual machine where most of the instructions are of this third form is called a “stack machine”, for obvious reasons.
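
To make this concrete, here is a minimal sketch of such a machine written as a C# interpreter. To be clear, this is not IL and not how any real runtime works; the opcode names and the flat Memory array are invented purely for illustration.

using System;
using System.Collections.Generic;

// Toy opcodes for the third kind of machine described above.
enum Op { PushValueAt, Add, PopValueInto }

struct Instruction
{
    public Op Op;
    public int Address; // used by PushValueAt and PopValueInto
    public Instruction(Op op, int address = 0) { Op = op; Address = address; }
}

class TinyStackMachine
{
    public int[] Memory = new int[16];

    public void Run(IEnumerable<Instruction> program)
    {
        var stack = new Stack<int>();
        foreach (var i in program)
        {
            switch (i.Op)
            {
                case Op.PushValueAt: // push the value stored at an address
                    stack.Push(Memory[i.Address]);
                    break;
                case Op.Add: // pop the two topmost values, push their sum;
                             // the net effect is that the stack shrinks by one
                    int right = stack.Pop();
                    int left = stack.Pop();
                    stack.Push(left + right);
                    break;
                case Op.PopValueInto: // pop the top value into an address
                    Memory[i.Address] = stack.Pop();
                    break;
            }
        }
    }

    static void Main()
    {
        var vm = new TinyStackMachine();
        vm.Memory[0] = 2; // first addend
        vm.Memory[1] = 3; // second addend
        vm.Run(new[]
        {
            new Instruction(Op.PushValueAt, 0),
            new Instruction(Op.PushValueAt, 1),
            new Instruction(Op.Add),
            new Instruction(Op.PopValueInto, 2),
        });
        Console.WriteLine(vm.Memory[2]); // prints 5, the sum
    }
}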

IL specifies a stack machine, just like many other virtual machines. But most hardware instruction sets actually more closely resemble the second form: registers are just fancy accumulators. Why then are so many virtual machines specifying stack machines?

There are several reasons, but again, it primarily comes down to lowering costs. Stack machines are very easy to understand, they are very easy to write a compiler front-end for, they are easy to build interpreters for, they are easy to build jitters for, and they provide a concise abstraction for the common case. The common case is that the result of any particular operation is only going to be of interest for a brief period.

Imagine, for example, that we chose the first strategy for IL and then had to compile an expression like x = A() + B() + C(). What would the IL have to look like? Something like this:

create_temporary_storage // for result of A()
call A(), [address of temporary storage]
create_temporary_storage // for result of B()
call B(), [address of temporary storage]
create_temporary_storage // for result of first addition
add [address of first temporary storage], [address of second temporary storage], [address of third temporary storage]



You see how this goes? The IL is getting huge, and all so that we can keep track of precisely which memory locations are used to store values that we will stop caring about a moment later. A stack abstraction lets the stack implementation deal with the temporary storages; in a stack machine, the same code looks something like:

push [address of x]
call A() // implicitly creates a temporary storage by pushing the result on the stack
call B()
add
call C()
add
store // store result on top of stack in address just below it.


The code is much smaller and much easier to understand. A stack machine is a very simple way to describe a complex computation; being able to target such a simple machine lowers the cost of writing a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well. The IL verifier, for example, can quickly determine when there is a code path through a method that, say, misaligns the stack or passes the wrong types of arguments to a method.
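
As a rough illustration of why such checks are cheap, here is a sketch of a stack-depth check over a straight-line sequence of the toy opcodes from the sketch above; each opcode pops and pushes a fixed number of values, so a single linear pass detects code that pops from an empty stack or leaves the stack unbalanced. (This is not the real verifier, which must also follow branches and check types.)

using System;
using System.Collections.Generic;

// The same toy opcodes as in the interpreter sketch above; not real CIL.
enum Op { PushValueAt, Add, PopValueInto }

static class StackChecker
{
    // How many values each toy opcode pops and pushes.
    static readonly Dictionary<Op, (int Pops, int Pushes)> Effect = new()
    {
        [Op.PushValueAt]  = (0, 1),
        [Op.Add]          = (2, 1),
        [Op.PopValueInto] = (1, 0),
    };

    public static bool IsBalanced(IEnumerable<Op> program)
    {
        int depth = 0;
        foreach (var op in program)
        {
            var (pops, pushes) = Effect[op];
            if (depth < pops)
                return false; // would pop from an empty stack
            depth = depth - pops + pushes;
        }
        return depth == 0; // everything pushed was consumed
    }

    static void Main()
    {
        // push, push, add, pop: balanced.
        Console.WriteLine(IsBalanced(new[] { Op.PushValueAt, Op.PushValueAt, Op.Add, Op.PopValueInto }));
        // add with nothing on the stack: rejected.
        Console.WriteLine(IsBalanced(new[] { Op.Add }));
    }
}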

Comments (25)

  1. Rick Brewster says:

    It's simple, it's well understood, and it works. Three wonderful, enviable traits for any system to have.

  2. DaRage says:

    This article demonstrates the pros of using a stack, which is simplicity, but what about the cons? What are we giving away by using a stack machine?

    Good question; just because the "pros" are enormous does not mean that there are no "cons". The CIL system avoids many of the "cons" because (1) the stack machine is just an intermediate form on the way to being realized as machine code, and (2) the CIL system is not a "pure" stack machine; you can get access to "local" items that are not at the top of the stack easily enough. These are effectively arbitrarily many registers, which you can use if you want to, say, perform common subexpression elimination. For a good discussion of some of the problems that stack machines have, see http://en.wikipedia.org/wiki/Stack_machine for details. — Eric

  3. barrkel says:

    I think you're being a tiny bit disingenuous.

    I think I am not being even a tiny bit disingenuous and frankly I resent the statement that I'm being deliberately misleading in any way. If you'd like to have a discussion of the technical merits of various solutions I am happy to do so; you can leave the ad hominem attacks out. — Eric

    Three-address code, your "first strategy", is typically not as verbose as your example of its use; this "create_temporary_storage()" bit is a poor design choice. Rather, it might look like:

       %1 = A()
       %2 = B()
       %3 = %1 + %2
       %4 = C()
       %5 = %3 + %4
       x = %5

    I'm deliberately using syntax reminiscent of LLVM's SSA-based approach. On a line count basis, it's actually shorter than the stack machine; and with an SSA encoding, I think it's easier to apply many analyses than with a stack machine encoding. SSA is also not particularly harder to generate code for than a stack machine (particularly an impure one like the CLR or JVM, where locals and parameters have register-oriented storage rather than living on the stack, and you have access to something like LLVM's alloca and mem2reg pass); nor is SSA any harder to interpret. I'll grant that the stack machine format, when encoded in bytecode, is probably easier to make compact. But if size is that important, you'll still want to apply compression, and I wouldn't like to guess which of the straightforward encodings for SSA vs a stack machine naturally gives rise to lower entropy per instruction for common compression algorithms.

  4. barrkel says:

    Actually, the three-address code could look a line shorter again:

       %1 = A()
       %2 = B()
       %3 = %1 + %2
       %4 = C()
       x = %3 + %4

  5. Joren says:

    But still, the stack-based code does not have to specify all the data sources and destinations (be it memory addresses or registers).

  6. Lalala says:

    @Joren: no, but in more complex scenarios it sometimes has to do a bit of copying around or shuffling of the stack ordering, because the data sources and destinations are fixed and implicit… You always gain some, lose some.

    That said, I understand stack machines are very easy to work with, but I would also tend to think that SSA is not that bad either and lends itself pretty well to various analyses and optimizations. Anyway, we're stuck with IL the way it is, and it's surely not gonna change soon.

  7. MikeCaron says:

    @Barry.Kelly: What, exactly, would %1, etc be representative of? Registers? Memory locations? If we assume the former, then we run into an issue with re-entrancy. Let's pretend this is a (non-tail) recursive function. If we're always writing into (say) EAX, then each iteration of the function will be overwriting that register, which may or may not be a problem.

    Let's assume, though, that %1 doesn't mean a specific register, but perhaps it's a placeholder for "the next register" (or maybe the registers shuffle around so that EAX doesn't mean the same thing all the time, like how (I believe, without bothering to go look it up) Itanium works). Now, what happens if you run out of registers? Even in a non-recursive function, there are a limited number of those to pass around. In the interests of efficiency, we'd want to save the contents of any register we care about, so that we don't have to worry about other pieces of code stepping on our toes. Now we're back to the initial problem of where to store these temporary values.

    If %1, etc, instead represent memory directly, then who owns this memory? Do they live in the DATA section of the executable? If so, what happens if we recurse? If memory is allocated for each stack frame (whoops, there's another problem with avoiding the stack), then where do we keep the address? In a register? But, there's only so many of those to go around…

    I'm sure these problems are not insurmountable (after all, I'm fairly certain that there exist stackless architectures, even if I'm not nearly smart enough to conceive of one), but the stack certainly lends itself to solving these problems in a concise manner.

    @DaRage: I imagine that one particularly nasty con of a stack-based system is that it's possible to run out of stack space. Technically, the hypothetical ideal stack has no limit on its size, but as we all know, our programs and machines cannot be infinite in size, so limits exist, and running out of stack space is an expressly fatal problem.

    Another con might be that now you need to worry about balancing the stack and ensuring that your code doesn't push more than it pops, or vice versa.

  8. @Mike: says:

    I'd imagine they represent local execution frame slots that are enregistered and/or spilled automatically by the JITter; after all, knowing about the hardware is the JITter's job, not the compiler's. So for @Barry's example, you could trivially generate something like (assuming callee cleanup for simplicity):

    ; %1 = C()
    call C()
    mov [ebp+4], eax
    ; %2 = D()
    call D()
    mov [ebp+8], eax
    ; %3 = %1 + %2
    mov eax, [ebp+4]
    mov ecx, [ebp+8]
    add eax, ecx
    mov [ebp+12], eax
    ; %4 = E()
    call E()
    mov [ebp+16], eax
    ; x = %3 + %4
    mov eax, [ebp+12]
    mov ecx, [ebp+16]
    add eax, ecx
    mov x, eax

    Which you could simplify of course:

    call C()
    mov [ebp+4], eax
    call D()
    mov ecx, [ebp+4]
    add ecx, eax
    mov [ebp+4], ecx
    call E()
    mov ecx, [ebp+4]
    add ecx, eax
    mov x, ecx

  9. The nice thing about stack VMs is how easy it is to walk the expression tree and generate bytecode for it – you don't need to share any information between the nodes; every one of them can generate its own code independently, without any assumptions about what the stack looks like beforehand, so long as it ends with the newly computed value on top.
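
    A rough sketch of that tree walk in C# might look like the following; the node classes and the textual instruction names are invented for illustration, and this is of course not the real C# compiler's emitter.

        using System;
        using System.Collections.Generic;

        // Each node emits code for its children and then one instruction for
        // itself, knowing nothing about its neighbours or the current stack depth.
        abstract class Expr
        {
            public abstract void Emit(List<string> code);
        }

        class Constant : Expr
        {
            public int Value;
            public Constant(int value) { Value = value; }
            public override void Emit(List<string> code) => code.Add($"push {Value}");
        }

        class Add : Expr
        {
            public Expr Left, Right;
            public Add(Expr left, Expr right) { Left = left; Right = right; }
            public override void Emit(List<string> code)
            {
                // Whatever Left and Right are, each leaves exactly one value on
                // top of the stack, so emitting "add" here is always correct.
                Left.Emit(code);
                Right.Emit(code);
                code.Add("add");
            }
        }

        class EmitDemo
        {
            static void Main()
            {
                // (1 + 2) + 3 compiles to: push 1, push 2, add, push 3, add
                var tree = new Add(new Add(new Constant(1), new Constant(2)), new Constant(3));
                var code = new List<string>();
                tree.Emit(code);
                Console.WriteLine(string.Join(", ", code));
            }
        }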

  10. Gabe says:

    Mike Caron: The SSA "registers" (%1, %2, %3 in the example) are all virtual — they don't "live" anywhere. The JIT compiler's job is to figure out where (if anywhere) each register is stored at any point in a function. The values might be stored in a CPU register, on the stack, or not stored at all. For example, %3 might be stored in a register when it's first computed, then moved to the stack to use it in a function call, and then discarded completely once it's unneeded. Once a register's lifetime has expired, or its value has been moved, the location that stored it may be reused.

    Also, I should add that stack-based machines are not uncommon. The x87 is a stack architecture that probably everybody reading this post has an implementation of in some form or another. The Burroughs B5000, like the CLR, has a typed stack architecture.

  11. barrkel says:

    Woah, Eric, I didn't mean to attack or insult you. I'm sorry it came out that way. I just think you picked a bad example.

    But I also explained why I think your example was bad. I notice you didn't address my technical argument.

  12. barrkel says:

    @Mike.Caron: read up on LLVM. %1 etc. are pseudo-registers (most register VMs provide an infinite number of logical registers for convenience). But they can be viewed in another way: as names for nodes in an expression DAG. It's from this that simpler analyses flow.

  13. barrkel says:

    Eric, FWIW, I actually believe the CLR chose a stack machine because the JVM chose a stack machine, and the JVM chose a stack machine because it was a reasonably compact representation when kept uncompressed in memory, and that in turn was important because it was designed to be interpreted on small devices (set-top boxes and the like). But the CLR was from the get-go designed to compile its code, rather than interpret it in any circumstances. I think the CLR would actually have been slightly better off with an SSA design. But there's little to choose between them; it's not difficult to turn one into the other. From what I understand, most non-trivial JIT compilers convert the stack code into SSA anyway.

  14. James Noble says:

    The JVM (and thus the CLR) stack machines look very much like the Smalltalk bytecode stack machines with types strapped on. The Smalltalk machine was, I think, originally microcoded on the Alto.

    At around the same time, the AT&T folks, including, I think, Rob Pike, worked on the Inferno/Limbo system, which had a 3-address IL. They certainly argued it was quicker and easier to write a (simple, not-very-optimising) translator from 3-address code to machine code than it would have been with a stack machine.

  15. rbirkby@hotmail.com says:

    To add balance, Microsoft also ship register-based virtual machines. So even within mscorp there is no consensus on the best approach. Some history, perhaps:

    Microsoft acquired a company in 1996 whose product was a register-based virtual machine. There was lots of speculation at the time that this was the basis for MSIL, but that speculation was wrong.

    Fast forward 15 years and the person who created that virtual machine also designed Microsoft's newest register-based virtual machine – namely the Chakra JavaScript engine:

    > "IE9 includes a new interpreter which uses a register-based layout, efficient opcode, and use of type optimizations."

    That company was Colusa and the person is Steve Lucco.

    http://www.microsoft.com/…/default.mspx

  16. tobi says:

    I have always asked myself exactly this question. Why have a stack?

    I would like to know why building an (optimizing) jitter is easier for a stack machine when the jitter's target platform uses assignments.

  17. Mike says:

    @tobi: Actually, there's probably no such thing as a jitter that performs optimizations on the stack representation; the stack-based representation is usually used as a persistent, on-disk representation.

    IMO, it's rather irrelevant whether this persistent representation is stack based or register based. Sure, the SSA form may look shorter in terms of number of instructions, but those instructions usually require more explicit operands than in a stack-based representation.

    The main advantage of the SSA representation is that it requires less work on the jit compiler's part and since the jit compiler runs at runtime…

    The stack-based representation could have some advantage too: the stack-based temporaries tend to have a short, explicit lifetime. Once an instruction like "add" is executed you know that the 2 stack temporaries used are dead. But with a register-based representation you need to do data flow analysis to figure this out.

  18. Paul says:

    Your "stack machine" example reminds me of Reverse Polish Notation calculators:

    7 * (3 + 5)  =  3 5 + 7 *

  19. Tanveer Badar says:

    @Paul. Well that is exactly how you encode a calculation to be executed by a stack machine. I would say, rather sleepily, that they are equivalent.

  20. Phil Wilson says:

    Paul's comment about reverse Polish is spot on. People keep thinking that computers were invented around 1986, but mainframe computers with stack architectures based on reverse Polish were around in the 1960s. I worked on one for many years, and the instruction set and opportunities for optimization make them a good choice.

    en.wikipedia.org/…/Stack_machine

  21. Jsmith says:

    @Paul

    Whoever thinks computers were invented in 1986 must not have been paying any attention what-so-ever. Even if we assume they were oblivious to the existence and development of "full size" computers, they should still know better. One of the earlier microcomputers, the Altair 8800, came out in 1975.

  22. barrkel says:

    @Mike

    A very long time ago (at the time, I was still in secondary school), I wrote a simple expression compiler that targeted a stack machine; and to make the RPN "program" faster, I implemented a peephole optimizer. For patterns that looked like 'push a; push b; add' where a and b are constants, I would replace them with 'push c', where c = a+b. I did the same for sub, mul, div etc. And I ran this trivial optimizer iteratively over the instruction stream until it no longer made any replacements.

    It turns out that if you do this, it amounts to constant subexpression evaluation for expression subtrees that are wholly constant (i.e. it can resolve 1 – 2 – x to -1 – x, but can't do the same for 1 – x – 2). So there are some meaningful optimizations that you can do with stack representations, even fairly cheaply.
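
    A rough sketch of that kind of fold in C#, over a toy instruction list where instructions are just strings such as "push 1" and "add" (and handling only addition, unlike the original), might look like:

        using System;
        using System.Collections.Generic;

        // Wherever two constant pushes are immediately followed by "add", replace
        // the three instructions with a single push of the sum, and repeat until
        // a full pass makes no replacement.
        static class Peephole
        {
            public static List<string> Fold(List<string> code)
            {
                bool changed = true;
                while (changed)
                {
                    changed = false;
                    for (int i = 0; i + 2 < code.Count; i++)
                    {
                        if (TryGetConstant(code[i], out int a) &&
                            TryGetConstant(code[i + 1], out int b) &&
                            code[i + 2] == "add")
                        {
                            code[i] = $"push {a + b}";
                            code.RemoveRange(i + 1, 2);
                            changed = true;
                            break; // restart the scan after a replacement
                        }
                    }
                }
                return code;
            }

            static bool TryGetConstant(string instruction, out int value)
            {
                value = 0;
                return instruction.StartsWith("push ") &&
                       int.TryParse(instruction.Substring(5), out value);
            }

            static void Main()
            {
                // (1 + 2) + x: the constant subtree folds away, the rest is untouched.
                var code = new List<string> { "push 1", "push 2", "add", "push x", "add" };
                Console.WriteLine(string.Join("; ", Fold(code))); // push 3; push x; add
            }
        }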

  23. Tom.Wesy says:

    @Jsmith

    I think "not paying attention what-so-ever" is a bit harsh.  From what I remember of my youth, it was awfully easy to dismiss anything that was invented before I was born as not terribly relevant.  I suspect the same is true for many young people today :-).

  24. Rob Kent says:

    Re "a tiny bit disingenuous". Do people realise that 'disingenuous' means: "The opposite of ingenuous; lacking in candour or frankness, insincere, morally fraudulent. (Said of persons and their actions.)" [OED]?

  25. Mike says:

    I'm a tad disappointed in humanity that this somehow turned into a religious battle. Thanks for the post Eric.