The Intel 80386, part 6: Data transfer instructions


The most common data transfer operation is the simple copy of a value from one place to another.

    MOV     r/m, r/m/i  ; d = s, does not set flags

The MOV instruction does not support an 8-bit immediate with sign extension. The immediate must be the same size as the other operand. This is annoying if you want to move a small constant into a 32-bit register, because you have to burn four bytes to encode the constant.

Not often seen, but included here for completeness, is the exchange of two values:

    XCHG    r/m, r/m    ; exchange d and s, does not set flags

The XCHG instruction exchanges two values, and it has the bonus property of being atomic if one of the operands is a memory location. It is the only read-modify-write memory instruction on the 80386 that is automatically atomic. We'll learn more about atomic operations later.

The 80386 has an architectural stack register, which was the style at the time. (Nowadays, RISC-style processors establish the stack register by convention.)

    PUSH    r/m/i32     ; push d onto stack
                        ; esp = esp - 4
                        ; *esp = d
    POP     r/m32       ; pop top of stack into d
                        ; d = *esp
                        ; esp = esp + 4

The stack grows downward (toward lower addresses), and the stack pointer points to the top of stack. Therefore, pushing a value onto the stack means decrementing the stack pointer by 4, and then storing the value at the new top-of-stack address. Popping a value consists of loading from the top-of-stack into the destination, and then incrementing the stack pointer by 4.
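To make the ordering of the two steps concrete, here is a toy model in Python. It is only a sketch: the `Stack386` class, its dictionary-based memory, and the starting `esp` value are all made up for illustration, and it handles only 32-bit pushes and pops.

```python
MASK32 = 0xFFFFFFFF

class Stack386:
    """Toy model of the 80386 stack: a sparse memory plus an esp register."""

    def __init__(self, esp=0x0012FF00):  # arbitrary illustrative starting esp
        self.esp = esp
        self.mem = {}  # address -> 32-bit value stored at that address

    def push(self, value):
        self.esp = (self.esp - 4) & MASK32   # esp = esp - 4
        self.mem[self.esp] = value & MASK32  # *esp = d

    def pop(self):
        value = self.mem[self.esp]           # d = *esp
        self.esp = (self.esp + 4) & MASK32   # esp = esp + 4
        return value

stack = Stack386()
stack.push(1)
stack.push(2)
top = stack.pop()  # the most recently pushed value comes back first
```

The decrement happens before the store on a push, and the load happens before the increment on a pop, so `esp` always points at the most recently pushed value.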

Note that you are allowed to push or pop a memory operand. This makes PUSH and POP unusual in that they access two memory locations in a single instruction. One memory location is the operand, and the other is the top of the stack.

There are other types of push and pop operations, but you are not going to see them in compiler-generated code, so I won't bother covering them here.

We saw above that moving small constants into registers requires a large instruction encoding. We saw earlier how the XOR instruction could be used to generate the value zero. To generate small signed integers, you might see this:

    PUSH     1          ; push sign-extended 8-bit constant to stack
    POP      EAX        ; pop the value from the stack to the EAX register
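To see the size difference, here is a small Python sketch that builds the machine-code bytes for both approaches. The helper function names are made up for illustration; the opcodes are the standard encodings (`B8+r` for MOV r32, imm32; `6A` for PUSH imm8; `58+r` for POP r32).

```python
import struct

EAX = 0  # register number of eax in the instruction encoding

def encode_mov_r32_imm32(reg, imm):
    # MOV r32, imm32: opcode B8+reg, then the full 4-byte little-endian immediate
    return bytes([0xB8 + reg]) + struct.pack("<I", imm & 0xFFFFFFFF)

def encode_push_imm8(imm):
    # PUSH imm8: opcode 6A, then one signed byte that the CPU sign-extends.
    # struct.pack raises struct.error if imm does not fit in -128..127.
    return bytes([0x6A]) + struct.pack("<b", imm)

def encode_pop_r32(reg):
    # POP r32: single-byte opcode 58+reg
    return bytes([0x58 + reg])

mov_bytes = encode_mov_r32_imm32(EAX, 1)
push_pop_bytes = encode_push_imm8(1) + encode_pop_r32(EAX)

print(mov_bytes.hex(), len(mov_bytes))            # b801000000 5
print(push_pop_bytes.hex(), len(push_pop_bytes))  # 6a0158 3
```

The PUSH/POP pair takes three bytes where the MOV takes five. The price is that the value makes a round trip through memory.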

The next category of data transfer operations consists of those that change size.

    CBW                 ; sign-extend al to ah:al
    CWD                 ; sign-extend ax to dx:ax
    CDQ                 ; sign-extend eax to edx:eax
    CWDE                ; sign-extend ax to eax

The Convert (Byte|Word|Dword) to (Word|Dword|Qword) instructions perform sign extension by copying the sign bit of the implied source into all of the newly added upper bits of the implied destination.
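Here is a Python sketch of what these four instructions compute. The function names mirror the mnemonics, but the `sign_extend` helper and the choice to return register pairs as tuples are conventions made up for this illustration.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy bit (from_bits - 1) of value into every bit above it, up to to_bits."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    if value & (1 << (from_bits - 1)):           # sign bit set?
        value |= ((1 << to_bits) - 1) & ~low_mask  # fill the upper bits with ones
    return value

def cbw(al):   # returns the new ax
    return sign_extend(al, 8, 16)

def cwde(ax):  # returns the new eax
    return sign_extend(ax, 16, 32)

def cwd(ax):   # returns (dx, ax)
    full = sign_extend(ax, 16, 32)
    return (full >> 16) & 0xFFFF, full & 0xFFFF

def cdq(eax):  # returns (edx, eax)
    full = sign_extend(eax, 32, 64)
    return (full >> 32) & 0xFFFFFFFF, full & 0xFFFFFFFF
```

For example, `cdq(0x80000000)` produces `(0xFFFFFFFF, 0x80000000)`: the high register is filled with copies of the sign bit, and the low register is unchanged.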

Whereas the above "convert" instructions have implied source and destination, the "move and extend" instructions let you specify which registers you want as the source and destination.

    MOVSX   rM, r/mN    ; sign-extend N-bit s to M-bit d (M > N)
    MOVZX   rM, r/mN    ; zero-extend N-bit s to M-bit d (M > N)

These instructions are unusual in that the source and destination are different sizes. The destination of a "move with sign extension" or "move with zero extension" must be a register larger than the source. (Because if it were the same size or smaller, there would be no extension going on.) Note that the source cannot be an immediate, so you can say good-bye to your dreams of using MOVZX eax, 2 to load a small constant into eax.
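The difference between the two kinds of extension can be sketched in Python. The explicit bit-width parameters are an artifact of this illustration; in the real instructions, the sizes come from the operands.

```python
def movzx(src, src_bits, dst_bits):
    """Zero-extend: keep the low src_bits and fill the rest with zeros."""
    if dst_bits <= src_bits:
        raise ValueError("destination must be larger than source")
    return src & ((1 << src_bits) - 1)

def movsx(src, src_bits, dst_bits):
    """Sign-extend: fill the upper bits with copies of the source's sign bit."""
    if dst_bits <= src_bits:
        raise ValueError("destination must be larger than source")
    src &= (1 << src_bits) - 1
    sign = 1 << (src_bits - 1)
    signed = (src ^ sign) - sign           # reinterpret the bits as a signed value
    return signed & ((1 << dst_bits) - 1)  # wrap into the destination width
```

With an 8-bit source of `0xFF`, `movzx(0xFF, 8, 32)` gives `0x000000FF`, whereas `movsx(0xFF, 8, 32)` gives `0xFFFFFFFF`.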

There's another instruction which is in the same family as the data transfer operations, except that it doesn't actually transfer any data.

    LEA     r, m        ; d = address of s, does not set flags

The "load effective address" instruction goes through the motions of loading a value from memory, except that instead of loading the value from memory, it loads the address of the memory that would have been loaded. In other words, it lets you take advantage of the arithmetic built into the memory address circuitry.

For example, there is no three-register ADD instruction, but you can simulate it with the LEA instruction:

    LEA     eax, DWORD PTR [ecx + ebx] ; eax = ecx + ebx, does not set flags

Even though there is no memory access, you still have to specify a memory size in order to keep the memory address circuitry happy.

As we saw in our exploration of addressing modes, the memory address circuitry can perform computations of the form reg + reg * scale + constant, so you can use the LEA instruction to perform those calculations, provided you don't care about flags.

    LEA     eax, DWORD PTR [ecx + ecx * 4] ; eax = ecx + ecx * 4 = ecx * 5
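The calculation that LEA performs can be modeled in a few lines of Python. The `lea` function and its keyword parameters are invented for this sketch. The restriction of the scale to 1, 2, 4, or 8 and the wraparound at 32 bits reflect how 32-bit addressing behaves.

```python
def lea(base=0, index=0, scale=1, disp=0):
    """Compute base + index * scale + disp the way a 32-bit address would be formed."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF  # result wraps at 32 bits

ecx = 7
eax = lea(base=ecx, index=ecx, scale=4)  # eax = ecx + ecx * 4 = ecx * 5
print(eax)  # 35
```

Nothing is read from memory, and no flags are involved. The result is just the number that would have been used as the address.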

If you happen to know the value of a particular register, you can use LEA to calculate a small constant.

    XOR     ebx, ebx        ; ebx = 0
    LEA     eax, [ebx+1]    ; eax = ebx + 1 = 1

This gives you another way to load a small constant without burning a large MOV eax, 1.

Okay, so those are the data transfer instructions. Next time, we'll look at conditional instructions and control transfer.

Comments (21)
  1. Darran Rowe says:

    So far I have been enjoying this series.
    While I know a decent amount of assembly myself, these posts have been getting me to think about a simpler version of the architecture. For example, when you mentioned move, I initially thought “what no CMOVcc?” then remembered that was added around the Pentium era.
    Sometimes going back to an earlier iteration of an architecture is fun and interesting.

  2. Erik F says:

    Do compilers use the string-transfer instructions (STOS, LODS) anymore? I thought that they would, but disassembly of compiled code in godbolt.org seems to indicate otherwise.

    1. Darran Rowe says:

      I had to jump into the disassembly for an application today and the COM support library which comes with Visual Studio 2017 update 9 has a rep stos in the debug version of the library.

    2. kantos says:

      Depends on the operation and if you specify an architecture optimization target when you compile. Some architectures do better with certain instruction patterns. I know that memmove tends to use those instructions to prevent some architectural race conditions that would happen otherwise with SSE/AVX versions.

    3. REP STOS / REP MOVS yes (sort of), non-REP variants and all the other string operations (REP CMPS*, SCAS*, OUTS*, INS*) no.

      “Sort of” because while many compilers don’t generate REP STOS/REP MOVS either, low-level code like C runtime libraries often uses them from ASM, so even when a compiler doesn’t directly generate say “REP STOSB”, it might end up say transforming a hand-written zero-clearing loop into “call memset”, which then uses one of the REP STOS forms internally.

      From a CPU point of view, REP STOS and REP MOVS are considered special because a) the operations they represent are quite common (memset and memcpy/memmove, respectively), b) there are certain internal optimizations the CPUs can perform if they know they’re going to be copying (or writing) data for quite a while. Specifically the internal implementation of the “fast string” operations can choose to bypass parts of the cache hierarchy and coherency protocol on large block transfers, avoiding thrashing and reducing memory bandwidth requirements. With a write-back cache, storing to a cache line typically requires loading its previous contents; the block transfers know when they’re going to be overwriting every single byte, and can use a special type of coherency message “give me ownership of that cache line but don’t bother sending me the current contents, I don’t need them”.

      The flipside is that those extra smarts (Intel has them in Ivy Bridge architecture chips and later) take some extra time to set up. REP MOVS and REP STOS usually hit higher peak rates than what normal user-mode code can achieve for large transfers (and come with the extra benefits mentioned above), but they do have something on the order of 20 cycles setup time, before they start doing anything, and their rate never gets any faster than what you could do with regular code if the source and destination are already in the L1 cache. So short copies or clears (meaning less than 2-4 kilobytes or so) are usually done “by hand” using vector instructions, and then for the larger ones you use REP STOS/MOVS.

      1. smf- says:

        I just threw an example together where MSVC generated a REP STOSD

        https://godbolt.org/z/xWsPD5

        GCC uses a call to memset, so it depends on the compiler.

  3. FrostCat says:

    “RISC-style processors establish the stack register by convention”

    What does this mean?

    1. FrostCat says:

      (I wish there were an edit feature. At first, I didn’t think that searching for “the stack register is established by convention” would get useful results).

      Oh, I get it. On, for example, ARM, R13 is implicitly the stack pointer, although at first glance it looks like just another one of the Rx registers, and only certain instructions can use that register.

      1. ARM isn’t a good example, or not anymore at least. But look at say MIPS, PowerPC or Alpha (all architectures Raymond has posted about already!), where the stack pointer really is just a normal general-purpose register that the OS and compilers happen to agree on.

        32-bit ARM used to be like that originally, but then with Thumb they added PUSH and POP instructions wiring the function of R13 as the stack pointer into the hardware. Thumb also made R13 special in other contexts. So ever since, they kind of do have an architectural stack pointer – which happens to be called R13, but R13 behaves differently from the other registers. Even more magic on 32-bit ARM is R15 aka PC – yeah, the program counter in 32-bit ARM is exposed as a numbered general-purpose register! But one that is really weird in many many ways.

        64-bit ARM then went for an architectural stack pointer from the beginning.

  4. PiotrSiodmak says:

    Is what you’re describing in this series just about one old processor or is it the general x86 as “defined” by WinDbg’s output?

    1. One specific old processor, although the instructions still work on newer processors. But newer processors have newer instructions that I won’t cover here (except at the very end in an appendix).

  5. Vilx- says:

    Nice! I’ve always wanted to get to know more about x86 assembly and this is a very easy format to digest it. Thank you!

  6. For some strange reason, this article on my RSS feed shows an image. It looks like a screenshot from a cartoon and shows some goofy-looking yellow-skinned anthropomorphic characters in dark blue police uniforms.

  7. Paradice says:

    I’m really interested in the follow up about atomic operations. I’m trying to think why they would be important on a 386 which is only ever single-core. I must be missing something!

    1. Have you considered DMA by peripherals? Even as far back as the 386 the CPU was not the only source of writes to RAM

      1. smf- says:

        The 68000 cpu has some atomic instructions, they aren’t supported on the Amiga or Atari ST because their dma controllers can’t support them. That is what you get in a unified memory system when the display hardware needs priority over the cpu.

        It’s pretty clear that processor designers back then were just winging it and throwing in features they thought might be useful.

        https://www.youtube.com/watch?v=UaHtGf4aRLs

    2. Erik F says:

      There were multi-processor systems for the 386: the Compaq SystemPro is an example of one. As the system was insanely expensive, it’s unlikely you ever saw one outside of a server room, but it existed.

  8. Anonymous says:
    (The content was deleted per user request)
  9. Erik F says:

    Did ENTER, LEAVE, PUSHA or POPA ever get widely used? They were introduced with the 80186 but I can’t remember ever seeing compilers, let alone assembly programmers, use them.

    1. RP (MSFT) says:

      I saw ENTER and LEAVE used on the ‘286 and ‘386 by an OS that shall not be named, back in the late eighties. That OS also used Call Gates and such, though, so it was completely alien to Windows devs in lots of ways.

      I used PUSHA/POPA myself on a (not related to or endorsed by my employer) project last year.

      1. smf- says:

        All my x86 assembly language had to run on 8088 and up, so I never got a chance to use pusha/popa/enter/leave.

        I ended up porting my interrupt driven serial port code to C when I needed to support 32 bit. Of course it ended up being faster and I think I fixed some bugs in the process, so I should probably never have used assembler in the first place.
