Understanding ARM Assembly Part 1

My name is Marion Cole, and I am a Sr. EE in Microsoft Platforms Serviceability group.  You may be wondering why Microsoft support would need to know ARM assembly.  Doesn’t Windows only run on x86 and x64 machines?  No.  Windows has run on a variety of processors in the past.  Those include i860, Alpha, MIPS, Fairchild Clipper, PowerPC, Itanium, SPARC, 286, 386, IA-32, x86, x64, and the newest one, ARM.  Most of these processors are antiquated now.  The common ones now are IA-32, x86, and x64.  However, Windows has started supporting ARM processors in order to move into the portable devices arena.  You will find them in the Microsoft Surface RT, Windows Phones, and, I am sure, other devices in the future.  You may be saying that these devices are locked and cannot be debugged.  That is true from a live debug perspective, but you can get memory dumps and application dumps from them, and those can be debugged.




There are limitations on the ARM processors that Windows supports.  Three System on Chip (SoC) vendors are supported: NVIDIA, Texas Instruments, and Qualcomm.  Windows only supports the ARMv7 (Cortex, Scorpion) architecture in its ARMv7-A (Application Profile) form.  This implements a traditional ARM architecture with multiple modes and supports a Virtual Memory System Architecture (VMSA) based on an MMU.  It supports the ARM and Thumb-2 instruction sets, and Thumb-2 code mixes 16-bit and 32-bit opcodes.  So it will look strange in the assembly.  Luckily the debuggers know this and handle it for you.  The mixed encoding also helps to shrink the size of the assembly code in memory.  The processor also has to have the optional ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).
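To make the 16/32-bit mixture concrete, here is a minimal sketch of how a disassembler can tell the two sizes apart.  It relies on the Thumb-2 encoding rule that a halfword whose top five bits are 11101, 11110, or 11111 starts a 32-bit instruction.  The function name is made up for illustration, and this is not how any particular debugger is implemented.

```python
def thumb_instruction_size(first_halfword: int) -> int:
    """Return the size in bytes (2 or 4) of a Thumb-2 instruction.

    A halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the
    first half of a 32-bit encoding; anything else is a complete 16-bit
    instruction.
    """
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


# 0x429a and 0xd104 are the cmp and bne opcodes shown later in this article.
# 0xf7ff is a typical first halfword of a 32-bit bl instruction.
for halfword in (0x429A, 0xD104, 0xF7FF):
    print(f"{halfword:04x}: {thumb_instruction_size(halfword)} bytes")
```

Because the size is decided by the first halfword alone, a disassembler can step through a Thumb-2 code stream two bytes at a time without ambiguity.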


In order to understand the assembly that you will see you need to understand the processor internals.


ARM is a Reduced Instruction Set Computer (RISC) architecture, much like some of the previous processors that Windows ran on.  It is a 32-bit load/store style processor.  It has a “weakly-ordered” memory model, similar to Alpha and IA64, and it requires specific memory barriers to enforce ordering.  On ARM devices these are the ISB, DSB, and DMB instructions.




The processor has 16 available registers r0 – r15.

0: kd> r
r0=00000001  r1=00000000  r2=00000000  r3=00000000  r4=e1820044  r5=e17d0580
r6=00000001  r7=e17f89b9  r8=00000002  r9=00000000 r10=1afc38ec r11=e1263b78
r12=e127813c  sp=e1263b20  lr=e16c12c3  pc=e178b6d0 psr=00000173 ----- Thumb


r0, r1, r2, r3, and r12 are volatile registers.  Volatile registers are scratch registers presumed by the caller to be destroyed across a call.  Nonvolatile registers are required to retain their values across a function call and must be saved by the callee if used. 


On Windows, four of these registers have a designated purpose.  They are listed here along with the CPSR, which is a separate status register:

  • PC (r15) – Program Counter (EIP on x86)
  • LR (r14) – Link Register.  Used as a return address to the caller.
  • SP (r13) – Stack Pointer (ESP on x86).
  • R11 – Frame Pointer (EBP on x86).
  • CPSR – Current Program Status Register (Flags on x86).


In Windbg all but r11 will be labeled appropriately for you.  So you may be asking why r11 is not labeled “fp” in the debugger.  That is because r11 is only used as a frame pointer in non-leaf subroutines.  The way it works is this: when a non-leaf subroutine is called, it pushes the value of the previous frame pointer (in r11) to the stack (right after the lr), and then r11 is set to point to this location in the stack.  Eventually we end up with a linked list of frame pointers in the stack that easily enables the construction of the call stack.  The frame pointer is not pushed to the stack in leaf functions.  We will discuss leaf functions later.
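The linked list of frame pointers can be sketched as follows.  This is only an illustration: the memory contents, the addresses, the function name, and the convention that a saved frame pointer of 0 ends the chain are all assumptions made for the example.  It models memory as a dictionary from address to 32-bit word, with the saved r11 at [r11] and the saved lr in the next word.

```python
def walk_frames(memory, fp, max_frames=16):
    """Return a list of (frame_pointer, return_address) pairs.

    memory maps an address to the 32-bit word stored there (hypothetical data).
    Each frame stores the previous r11 at [fp] and the saved lr at [fp + 4].
    """
    frames = []
    while fp != 0 and len(frames) < max_frames:
        saved_fp = memory[fp]          # previous frame pointer pushed by the callee
        return_addr = memory[fp + 4]   # saved lr sits right after it
        frames.append((fp, return_addr))
        fp = saved_fp                  # follow the link to the caller's frame
    return frames


# Hypothetical stack contents: three nested non-leaf calls.
stack = {
    0x1000: 0x1020, 0x1004: 0x003A11E5,
    0x1020: 0x1040, 0x1024: 0x003A1301,
    0x1040: 0x0000, 0x1044: 0x003A1455,
}

for fp, ret in walk_frames(stack, 0x1000):
    print(f"fp={fp:08x} return={ret:08x}")
```

A real stack walker also has to validate each pointer and cope with leaf functions, which do not add a link to the chain.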


CPSR (Current Program Status Register)


Now we need to understand a little about the CPSR register.  Here is the bit breakdown:

  • Bits [31:28] – Condition Code Flags
    • N – bit 31 – If this bit is set, the result was negative.  If bit is cleared the result was positive or zero.
    • Z – bit 30 – If set this bit indicates the result was zero or values compared were equal.  If it is cleared, the value is non-zero or the compared values are not equal.
    • C – bit 29 – If this bit is set the instruction resulted in a carry condition.  E.g. Adding two unsigned values resulted in a value too large to be stored.
    • V – bit 28 – If this bit is set then the instruction resulted in an overflow condition.  E.g. An overflow of adding two signed values.
  • Instruction variants ending with ‘s’ set the condition codes (mov/movs)
  • E – bit 9 – Endianness (big = 1/Little = 0)
  • T – bit 5 – Set if executing Thumb instructions
  • M – bits [4:0] – CPU Mode (User 10000/Supervisor 10011)
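As a quick check of the bit positions above, here is a small sketch that pulls those fields out of a raw CPSR value.  The function name and the dictionary layout are made up for illustration, and only the two modes listed above are named.

```python
def decode_cpsr(cpsr: int) -> dict:
    """Split a 32-bit CPSR value into the fields described above."""
    modes = {0b10000: "User", 0b10011: "Supervisor"}
    mode_bits = cpsr & 0x1F            # M, bits [4:0]
    return {
        "N": (cpsr >> 31) & 1,         # negative
        "Z": (cpsr >> 30) & 1,         # zero / equal
        "C": (cpsr >> 29) & 1,         # carry
        "V": (cpsr >> 28) & 1,         # overflow
        "E": (cpsr >> 9) & 1,          # endianness (0 = little)
        "T": (cpsr >> 5) & 1,          # Thumb state
        "mode": modes.get(mode_bits, f"other ({mode_bits:05b})"),
    }


# psr value taken from the register dump earlier in this article
print(decode_cpsr(0x00000173))
```

For psr=00000173 this gives T=1 and mode Supervisor with all four condition flags clear, which agrees with the “Thumb” annotation the debugger printed next to the psr value.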


So why do I need to know about the CPSR (Current Program Status Register)?  You need to know where some of these bits are because of how some of the assembly instructions affect these flags.  For example:


ADD will add two registers together, or add an immediate value to a register.  However it will not affect the flags.


ADDS will do the same as ADD, but it does affect the flags.


MOV will allow you to move an immediate value into a register, or copy a value between registers.  This is not like the x86/x64: MOV will not let you read or write memory.  It does not affect the flags.


MOVS does the same thing as MOV, but it does affect the flags.


I hope you are seeing a trend here.  There are instructions that will look the same.  However if they end in “S” then you need to know that this will affect the flags.  I am not going to list all of those assembly instructions here.  Those are already listed in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html.
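To show what “affects the flags” means in practice, here is a rough model of a 32-bit ADDS.  It is a sketch, not taken from the ARM manual's pseudocode, and the function name and return shape are assumptions.  A plain ADD would produce the same result and leave the previous flag values alone.

```python
MASK32 = 0xFFFFFFFF


def adds(a: int, b: int):
    """Return (result, flags) for a 32-bit ADDS of a and b."""
    a &= MASK32
    b &= MASK32
    full = a + b                       # unbounded sum, to detect the carry out
    result = full & MASK32
    flags = {
        "N": result >> 31,                            # top bit of the result
        "Z": int(result == 0),                        # result is zero
        "C": int(full > MASK32),                      # unsigned carry out of bit 31
        "V": ((a ^ result) & (b ^ result)) >> 31,     # signed overflow
    }
    return result, flags


for a, b in ((1, 2), (0xFFFFFFFF, 1), (0x7FFFFFFF, 1)):
    result, flags = adds(a, b)
    print(f"{a:08x} + {b:08x} = {result:08x} {flags}")
```

The second pair wraps around to zero and sets Z and C, while the third pair sets N and V because two positive signed values produced a negative result.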


So now we have an idea of what can set the flags.  Now we need to understand what the flags are used for.  They are mainly used for branching instructions.  Here is an example:

003a11d2 429a     cmp         r2,r3

003a11d4 d104     bne         |MyApp!FirstFunc+0x28 (003a11e0)|


The first instruction in this code (cmp) compares the value stored in register r2 to the value stored in register r3. This comparison instruction sets or resets the Z flag in the CPSR register. The second instruction is a branch instruction (b) with the condition code ne which means that if the result of the previous comparison was that the values are not equal (the CPSR flag Z is zero) then branch to the address MyApp!FirstFunc+0x28 (003a11e0). Otherwise the execution continues.


There are a few compare instructions.  “cmp” subtracts two register values, sets the flags, and discards the result.  “cmn” adds two register values, sets the flags, and discards the result.  “tst” does a bitwise AND of two register values, sets the flags, and discards the result.  There is even an If Then (it) instruction.  I am not going to discuss that one here as I have never seen it in any of the Windows code.
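The three compare instructions can be modelled roughly like this.  It is a simplified sketch with made-up function names.  For tst only N and Z are shown, because its effect on C depends on the shifter and it leaves V alone.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a: int, b: int) -> dict:
    """Flags produced by cmp a, b (a - b, result discarded)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),                         # on ARM, C set means "no borrow"
        "V": ((a ^ b) & (a ^ result)) >> 31,      # signed overflow of the subtraction
    }


def cmn_flags(a: int, b: int) -> dict:
    """Flags produced by cmn a, b (a + b, result discarded)."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(full > MASK32),
        "V": ((a ^ result) & (b ^ result)) >> 31,
    }


def tst_flags(a: int, b: int) -> dict:
    """N and Z as produced by tst a, b (a AND b, result discarded)."""
    result = a & b & MASK32
    return {"N": result >> 31, "Z": int(result == 0)}


print(cmp_flags(5, 5))            # equal values set Z
print(cmp_flags(3, 5))            # 3 - 5 borrows, so C is clear and N is set
print(tst_flags(0b1010, 0b0101))  # no common bits, so Z is set
```

Note that the carry flag after a subtraction is inverted compared to x86: C is set when no borrow occurred.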


So is “bne” the only branch instruction?  No.  There are a lot of them.  Here is a table of the suffixes that can be seen beside “b”, and what they check in the CPSR register:



Mnemonic   Meaning (Integer)              Condition Flags (in CPSR)
EQ         Equal                          Z==1
NE         Not Equal                      Z==0
MI         Negative (Minus)               N==1
PL         Positive or Zero (Plus)        N==0
HI         Unsigned higher                C==1 and Z==0
LS         Unsigned lower or same         C==0 or Z==1
GE         Signed greater than or equal   N==V
LT         Signed less than               N!=V
GT         Signed greater than            Z==0 and N==V
LE         Signed less than or equal      Z==1 or N!=V
VS         Overflow                       V==1
VC         No overflow                    V==0
CS         Carry set                      C==1
CC         Carry clear                    C==0
None (AL)  Execute always                 Any
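The condition checks can also be written as small predicates over the four flags.  This is an illustrative sketch using the standard ARM condition mnemonics.  The dictionary and function names are made up, and the flag values in the example are what a cmp of 3 against 5 would produce.

```python
# Each entry maps a condition suffix to a test over a dict of N, Z, C, V bits.
CONDITIONS = {
    "eq": lambda f: f["Z"] == 1,
    "ne": lambda f: f["Z"] == 0,
    "mi": lambda f: f["N"] == 1,
    "pl": lambda f: f["N"] == 0,
    "hi": lambda f: f["C"] == 1 and f["Z"] == 0,
    "ls": lambda f: f["C"] == 0 or f["Z"] == 1,
    "ge": lambda f: f["N"] == f["V"],
    "lt": lambda f: f["N"] != f["V"],
    "gt": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "le": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "vs": lambda f: f["V"] == 1,
    "vc": lambda f: f["V"] == 0,
    "cs": lambda f: f["C"] == 1,
    "cc": lambda f: f["C"] == 0,
    "al": lambda f: True,
}


def branch_taken(suffix: str, flags: dict) -> bool:
    """Return True if a branch with this condition suffix would be taken."""
    return CONDITIONS[suffix.lower()](flags)


# Flags as left by comparing 3 with 5: negative result, a borrow, no overflow.
flags = {"N": 1, "Z": 0, "C": 0, "V": 0}
for suffix in ("ne", "eq", "lt", "ge", "hi", "ls"):
    print(suffix, branch_taken(suffix, flags))
```

With these flags the ne, lt, and ls branches are taken and the eq, ge, and hi branches fall through, which matches 3 being both signed and unsigned less than 5.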



Floating Point Registers


As mentioned earlier the processor also has to have the ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).  Here is what they are.

Floating Point


The floating point unit has sixteen 64-bit registers (d0-d15) that are overlaid with thirty-two 32-bit registers (s0-s31).  There are varieties of the ARM processor that have thirty-two 64-bit registers (d0-d31); on those, s0-s31 still overlay only d0-d15.  Windows 8 will support both the 16 and 32 register variants.  You have to be careful when using these, because if you access unaligned floats you may cause an exception.




The SIMD (NEON) extension adds sixteen 128-bit registers (q0-q15) that are overlaid on the floating point registers.  So if you reference Q0 it is the same as referencing D0-D1 or S0-S1-S2-S3.
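The overlay can be pictured as one 16-byte block of storage viewed three different ways.  The sketch below is only an illustration using Python's struct module.  The function name is made up, and it assumes little-endian layout, which is what the E bit value of 0 in the earlier register dump indicates.

```python
import struct


def view_q_register(s_values):
    """Pack four single-precision values (S0..S3) and view the same bytes
    as two 64-bit registers (D0, D1) and one 128-bit register (Q0)."""
    raw = struct.pack("<4f", *s_values)       # 16 bytes: S0, S1, S2, S3
    d0, d1 = struct.unpack("<2Q", raw)        # the same bytes as two 64-bit words
    q0 = int.from_bytes(raw, "little")        # the same bytes as one 128-bit value
    return d0, d1, q0


d0, d1, q0 = view_q_register((1.0, 2.0, 3.0, 4.0))
print(f"D0 = {d0:016x}")
print(f"D1 = {d1:016x}")
print(f"Q0 = {q0:032x}")
```

S0 ends up in the low half of D0 and S1 in the high half, and D0 in turn is the low half of Q0, so writing any one of the views changes what the others read back.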


In part 2 we will discuss how Windows utilizes this processor.

Comments (11)

  1. FX says:

    This is fascinating, please keep it coming.

  2. SS says:

    Excellent intro! Eagerly looking forward to the next parts.

  3. bajjibala@gmail.com says:

    Is it possible to install Windows 64-bit in a device with ARM 64-bit processor? And if SPARC processors were supported, why wasn't the support existed till now, as SPARC and ARM comes under RISC family?

    [The 64 bit builds of Windows run on x64 and ia64 platforms.  The source code needs to be built using a compiler that is specific to each platform.  Also, there are small parts of the Windows kernel and hal that are written in native assembler.]

  4. Henry Brown says:

    Look like 6502 instruction set. Wonder if Don Lancaster should write a new book? I used to code this on Apple II.

  5. tonberry277@gmail.com says:

    Very interesting and helpful info about ARM. Thanks for sharing!

  6. Marc Sherman says:

    Good stuff, looking forward to the next installment.

  7. Hawk777 says:

    Odd that the IT instruction isn’t used in Windows code; GCC certainly generates it when appropriate (at least for V7M, which is all I’ve coded for).

  8. MazeGen says:

    Is there an ARM assembler provided by Microsoft?

    [Yes, armasm.]

  9. jas71@hotmail.com says:

    Henry: There are historical reasons for some 6502 similarities: the ARM was originally developed by Acorn Computers Ltd of England to power the successor to their 6502-based micros, so the design drew on aspects of that.

    One of the strengths is orthogonality. You can apply condition codes to virtually any instruction, not just branch: for example, "if (a<b) { b-=a; } else { a-=b; }" could translate to:

    cmp r0,r1

    sublt r1,r1,r0

    subge r0,r0,r1

    On x86/x64, you need to use a conditional branch to jump over the code you don't want run. (On the Pentium Pro and later, you can use the special conditional-move instruction to achieve something similar, but it's much less flexible.) You can also put the S suffix on the arithmetic instructions too, then use a conditional instruction to act on the result of that operation – or, as I did above, omit the S so the result of the subtraction is ignored, leaving the result of the comparison in place for the next instruction to use.

    It's slightly odd to me to see IA-32 and x86 listed separately, and to see a reference to the ARM/Thumb 16+32 bit instruction mix as "strange" by comparison to x86, where instructions vary from 8 to a downright scary 120 bits in length, with no alignment requirement! The requirement for ARM/Thumb instructions to be aligned to 4 or 2 byte boundaries respectively also makes disassembly slightly easier: with x86, it's sometimes hard to be sure which byte is actually the start of a series of instructions.

  10. Juan Carlos says:

    Double thanks to you for taking the time of enlighten me and for sharing your acknowledge….

  11. Sean Liming says:

    Nice, but ARM is not ARM. I do appreciate the challenge of the porting.