Go arm64 Function Call Assembly

An in-depth analysis of the assembly code emitted by the Go compiler for function calls on arm64.

I am currently working on implementing frame pointer unwinding for the Go execution tracer. This involves debugging various problems and crashes caught by TestTraceSymbolize on arm64.

One of these crashes seems to be caused by a goroutine overwriting the frame pointer on the stack of another goroutine. To debug this, I'm trying to study the assembly output of the compiler. (Un)fortunately I don't read or write Go assembly every day, so this process usually involves chaotic jumping between the following resources:

  1. A Quick Guide to Go's Assembler (Go's Assembly Language is quite quirky)
  2. Go ARM64 Assembly Instructions Reference Manual (arm64 specific quirks)
  3. Go internal ABI specification (Go's register based calling convention)
  4. Arm Architecture Reference Manual (11,952-page PDF)
  5. Arm A64 Instruction Set Architecture (HTML subset of the above)
  6. ChatGPT (with a very healthy amount of mistrust, as it's often wrong)
  7. Various Google Search Results

To make this process a little easier for my future self, I decided it's time to create some high-quality notes for my own needs. In particular, I'll try to explain in great detail every assembly instruction the Go compiler (go1.20 darwin/arm64) emits for the following code:

//go:noinline
func foo() { bar() }

//go:noinline
func bar() {}

To get the assembly, I use the steps shown below.

$ go build main.go 
$ go tool objdump -gnu -s 'main.(foo|bar)' ./main

The -s 'main.(foo|bar)' flag filters the output down to the two functions we are interested in, and the -gnu flag adds GNU assembly comments which make it easier to look up instructions in the official Arm documentation. Normally I'd also include the -S flag to see each line of source code above the instructions it generated, but that's not needed here.

After some manual trimming, the objdump output looks like this:

TEXT main.foo(SB) /Users/felixge/Desktop/main.go
  MOVD 16(R28), R16                    // ldr x16, [x28,#16]		
  CMP R16, RSP                         // cmp sp, x16			
  BLS 8(PC)                            // b.ls .+0x20			
  MOVD.W R30, -16(RSP)                 // str x30, [sp,#-16]!		
  MOVD R29, -8(RSP)                    // stur x29, [sp,#-8]		
  SUB $8, RSP, R29                     // sub x29, sp, #0x8		
  CALL main.bar(SB)                    // bl .+0x28			
  LDP -8(RSP), (R29, R30)              // ldp x29, x30, [sp,#-8]		
  ADD $16, RSP, RSP                    // add sp, sp, #0x10		
  RET                                  // ret				
  MOVD R30, R3                         // mov x3, x30			
  CALL runtime.morestack_noctxt.abi0(SB) // bl .+0xffffffffffffdd54	
  JMP main.foo(SB)                     // b .+0xffffffffffffffd0		
  ?									
  ?									
  ?									

TEXT main.bar(SB) /Users/felixge/Desktop/main.go
  RET                                  // ret	
  ?						
  ?						
  ?						

Before we dig into this output I want to mention that I'm not using Compiler Explorer for this, because its output seems to be missing important instructions for some reason. I'm also not using go tool compile -S main.go or go build -gcflags=-S ./main.go because those output additional pseudo instructions that are used by the compiler and linker, but don't remain in the final assembly.

Ok, let's dig in!

Prologue 1: Check if the stack needs growing

  MOVD 16(R28), R16                    // ldr x16, [x28,#16]
  CMP R16, RSP                         // cmp sp, x16
  BLS 8(PC)                            // b.ls .+0x20
  1. Store the value of g.stackguard0 in the R16 register.
  2. Compare the value of R16 with the value of the stack pointer RSP.
  3. If RSP <= R16, jump 8 instructions forward to grow the stack.

In Detail

  1. MOVD, aka LDR (immediate), loads the data from the memory address [R28 + 16] into R16. The R28 register points to the current goroutine, and the stackguard0 field is found at offset 16 in the g struct because the stack field above it is 16 bytes wide.
  2. CMP, aka CMP (extended register), compares R16 with RSP and stores the results in the condition flags.
  3. BLS, aka B.cond, checks if the condition flags match the condition code ls (unsigned lower or same). If yes, the CPU is instructed to jump 8 instructions (0x20 = 32 bytes) forward. We'll cover those instructions in the Prologue 3: Growing the stack section. Usually the stack is big enough, so the branch is not taken and execution continues with the instructions below.
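
The check above can be sketched in Go. This is only an illustration: the gHeader and stack types are hypothetical mirrors of the first two fields of the runtime's g struct, and the addresses in main are made up (main.foo is assumed to start at address 0).

```go
package main

import (
	"fmt"
	"unsafe"
)

// stack and gHeader are illustrative copies of the first fields of the
// runtime's g struct. They are not the real runtime types.
type stack struct {
	lo uintptr
	hi uintptr
}

type gHeader struct {
	stack       stack
	stackguard0 uintptr
}

// stackguard0Offset reports where stackguard0 sits inside gHeader.
func stackguard0Offset() uintptr {
	return unsafe.Offsetof(gHeader{}.stackguard0)
}

// needsMoreStack models CMP R16, RSP followed by BLS: the branch is taken
// when the stack pointer is lower than or equal to the guard (unsigned).
func needsMoreStack(sp, stackguard0 uintptr) bool {
	return sp <= stackguard0
}

// branchTarget returns the address a PC-relative branch such as BLS 8(PC)
// jumps to. Every arm64 instruction is 4 bytes wide.
func branchTarget(pc uintptr, instructions uintptr) uintptr {
	return pc + instructions*4
}

func main() {
	fmt.Println(stackguard0Offset())            // 16 on 64-bit platforms
	fmt.Println(needsMoreStack(0x1000, 0x2000)) // true
	fmt.Println(needsMoreStack(0x3000, 0x2000)) // false
	// BLS is the third instruction of main.foo, so it sits at offset 0x8.
	fmt.Printf("%#x\n", branchTarget(0x8, 8)) // 0x28
}
```

The result 0x28 is the offset of the eleventh instruction of main.foo, which is the MOVD R30, R3 that starts the stack growth path.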

Prologue 2: Setup foo's frame

  MOVD.W R30, -16(RSP)                 // str x30, [sp,#-16]!		
  MOVD R29, -8(RSP)                    // stur x29, [sp,#-8]		
  SUB $8, RSP, R29                     // sub x29, sp, #0x8
  1. Store the return address from the link register R30 at 16 bytes below the stack pointer register RSP and update RSP to that location.
  2. Store the frame pointer register R29 of the caller 8 bytes below the stack pointer.
  3. Set the frame pointer register R29 to point to the previous frame pointer we just pushed onto the stack.

In Detail

  1. MOVD.W, aka STR (immediate), stores the value of the R30 register at the memory address of [RSP - 16]. R30 is the link register which holds the return address of the caller. The operation is pre-indexed, which means that the memory address is computed before the memory is accessed and that the RSP pointer is updated to the computed address afterwards (i.e. RSP = RSP - 16). Technically this frame only uses 8 bytes above RSP (for storing the return address), but the architecture requires the stack pointer to be 16-byte aligned, so 8 bytes of memory are wasted here.
  2. MOVD, aka STUR, stores the value of the R29 register at the memory address of [RSP - 8]. R29 is the frame pointer register that is holding the caller's frame pointer.
  3. SUB, aka SUB (immediate), subtracts 8 from the stack pointer register RSP and stores the result in the frame pointer register R29. In other words, it points the frame pointer register to the caller's frame pointer that was just pushed onto the stack by the previous operation.
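
To convince myself of the resulting stack layout, the three instructions can be modeled with a toy register file and memory in Go. The regs and memory types, as well as the starting values, are made up for this sketch and have nothing to do with how the runtime represents them.

```go
package main

import "fmt"

// regs holds the three registers that the prologue touches.
type regs struct {
	RSP uint64 // stack pointer
	R29 uint64 // frame pointer
	R30 uint64 // link register (return address)
}

// memory is a toy model of the stack: address -> 8-byte word.
type memory map[uint64]uint64

// prologue mirrors the three frame setup instructions of main.foo.
func prologue(r *regs, m memory) {
	// MOVD.W R30, -16(RSP): pre-indexed store, RSP becomes RSP-16.
	r.RSP -= 16
	m[r.RSP] = r.R30
	// MOVD R29, -8(RSP): save the caller's frame pointer below RSP.
	m[r.RSP-8] = r.R29
	// SUB $8, RSP, R29: point R29 at the saved frame pointer.
	r.R29 = r.RSP - 8
}

func main() {
	// Made-up starting values, chosen only to make the output readable.
	r := regs{RSP: 0x1000, R29: 0x2000, R30: 0x4444}
	m := memory{}
	prologue(&r, m)
	fmt.Printf("RSP=%#x R29=%#x\n", r.RSP, r.R29)               // RSP=0xff0 R29=0xfe8
	fmt.Printf("[RSP]=%#x [RSP-8]=%#x\n", m[r.RSP], m[r.RSP-8]) // [RSP]=0x4444 [RSP-8]=0x2000
}
```

After the prologue, the word at [RSP] holds the return address, the word at [RSP-8] holds the caller's frame pointer, and R29 points at the latter.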

Body: Call bar

  CALL main.bar(SB)                    // bl .+0x28			
  1. Call main.bar.

In Detail

  1. CALL, aka BL, implicitly adds 4 bytes (the size of an instruction) to the value of the current program counter and stores it in the link register R30. After that it jumps to the first instruction of the main.bar function which is located 10 instructions (0x28 = 40 bytes) below this instruction.
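
The arithmetic behind BL can be written down as a small Go sketch. The bl function is a hypothetical model, and I'm assuming main.foo starts at address 0 so that the CALL, being its seventh instruction, sits at 0x18.

```go
package main

import "fmt"

// instrSize is the width of every arm64 instruction in bytes.
const instrSize = 4

// bl models the BL instruction: it returns the new value of the link
// register R30 and the address that execution continues at.
func bl(pc uint64, offset int64) (lr, target uint64) {
	lr = pc + instrSize          // return to the instruction after the call
	target = pc + uint64(offset) // PC-relative jump, wraps for negative offsets
	return lr, target
}

func main() {
	// Assumption: main.foo starts at address 0, so the CALL is at 0x18.
	lr, target := bl(0x18, 0x28)
	fmt.Printf("R30=%#x target=%#x\n", lr, target) // R30=0x1c target=0x40
	fmt.Println(target / instrSize)                // 16
}
```

Index 16 matches the listing: main.foo occupies 13 instructions plus 3 words of padding, so main.bar begins at the 17th word.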

Epilogue: Return from foo

  LDP -8(RSP), (R29, R30)              // ldp x29, x30, [sp,#-8]		
  ADD $16, RSP, RSP                    // add sp, sp, #0x10		
  RET                                  // ret
  1. Restore the frame pointer register R29 and the link register R30 from the values stored 8 bytes below the stack pointer RSP.
  2. Restore the stack pointer RSP to its original value by adding 16 to it.
  3. Return to the caller.

In Detail

  1. LDP, aka LDP, loads 16 bytes from the memory address [RSP - 8] into a pair of registers. The first 8 bytes of the data are loaded into the frame pointer register R29, and the second 8 bytes are loaded into the link register R30. This is needed because these registers are call-preserved, so we need to restore them to their original values. Remember: The link register R30 was overwritten by Body: Call bar and the frame pointer register R29 was overwritten in Prologue 2: Setup foo's frame.
  2. ADD, aka ADD (immediate), adds 16 to the stack pointer register RSP and stores the result in the stack pointer register RSP. This is needed because RSP is another call-preserved register that we overwrote in Prologue 2: Setup foo's frame.
  3. RET, aka RET, returns to the caller. The implied jump location is the value of the link register R30.
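
The epilogue can be modeled in Go in the same spirit. The regs and memory types are toy stand-ins, and the state in main is a made-up snapshot of what the registers and stack could look like right after bar has returned.

```go
package main

import "fmt"

// regs holds the three registers that the epilogue touches.
type regs struct {
	RSP uint64 // stack pointer
	R29 uint64 // frame pointer
	R30 uint64 // link register (return address)
}

// memory is a toy model of the stack: address -> 8-byte word.
type memory map[uint64]uint64

// epilogue mirrors LDP, ADD and RET. It returns the address that RET
// jumps to.
func epilogue(r *regs, m memory) uint64 {
	// LDP -8(RSP), (R29, R30): load two consecutive 8-byte words.
	r.R29 = m[r.RSP-8]
	r.R30 = m[r.RSP]
	// ADD $16, RSP, RSP: release the frame.
	r.RSP += 16
	// RET: jump to the address held in R30.
	return r.R30
}

func main() {
	// Made-up state after the call to bar: R30 was clobbered by BL and
	// R29 points at the saved frame pointer of foo's caller.
	r := regs{RSP: 0xff0, R29: 0xfe8, R30: 0x1c}
	m := memory{0xff0: 0x4444, 0xfe8: 0x2000}
	ret := epilogue(&r, m)
	fmt.Printf("RSP=%#x R29=%#x return to %#x\n", r.RSP, r.R29, ret)
	// RSP=0x1000 R29=0x2000 return to 0x4444
}
```

All three registers end up with the values the caller had before it called foo, which is what call-preserved means in practice.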

Prologue 3: Growing the stack

  MOVD R30, R3                         // mov x3, x30			
  CALL runtime.morestack_noctxt.abi0(SB) // bl .+0xffffffffffffdd54	
  JMP main.foo(SB)                     // b .+0xffffffffffffffd0	
  ?									
  ?									
  ?			

This is the continuation of Prologue 1: Check if the stack needs growing.

  1. Store the value of the R30 link register in the R3 register.
  2. Call the runtime.morestack_noctxt function to grow the stack.
  3. Jump back to the beginning of the main.foo function.
  4. ? indicates zero padding.

In Detail

  1. MOVD, aka MOV (register), copies the value of the link register R30 to the R3 register. This is done to make it available to runtime.morestack_noctxt. As far as I can tell the value is only used to help with printing debug information, especially when a function is trying to split the stack when it's not supposed to.
  2. CALL, aka BL, implicitly adds 4 bytes (the size of an instruction) to the value of the current program counter and stores it in the link register R30. After that it jumps to the first instruction of the runtime.morestack_noctxt function which is located 2219 instructions (0xffffffffffffdd54 = -8876 bytes in two's complement) above this instruction. The called function usually grows the stack of the goroutine before it implicitly returns. Another use case is the preemption of goroutines.
  3. JMP, aka B, jumps to the first instruction of the current function (main.foo) which is located 12 instructions (0xffffffffffffffd0 = -48 bytes in two's complement) above this instruction. This causes the function to be executed again, now with a big enough stack to hold all of its values.
  4. The three ? each indicate 4 bytes of zero padding which are emitted by the compiler to 16-byte align all function entry points. I'm unable to find an official reference that mentions this alignment value. But since it was provided by an Arm employee, and another Arm employee added a similar patch to gcc citing performance benefits, this alignment is probably a good idea.
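
The two's complement offsets and the amount of padding are easy to double check with a few lines of Go. The helper names signedOffset and padding are my own, and the instruction counts (13 for main.foo, 1 for main.bar) are taken from the objdump listing above.

```go
package main

import "fmt"

// signedOffset reinterprets the unsigned branch offset printed by objdump
// as a signed number of bytes (two's complement).
func signedOffset(raw uint64) int64 {
	return int64(raw)
}

// padding returns how many bytes are needed to round size up to the next
// multiple of align.
func padding(size, align uint64) uint64 {
	return (align - size%align) % align
}

func main() {
	for _, raw := range []uint64{0xffffffffffffdd54, 0xffffffffffffffd0} {
		off := signedOffset(raw)
		fmt.Printf("%#x = %d bytes = %d instructions\n", raw, off, off/4)
	}
	// 0xffffffffffffdd54 = -8876 bytes = -2219 instructions
	// 0xffffffffffffffd0 = -48 bytes = -12 instructions

	// main.foo has 13 instructions, main.bar has 1. Each is 4 bytes wide.
	fmt.Println(padding(13*4, 16)/4, padding(1*4, 16)/4) // 3 3
}
```

Both functions need 12 bytes of padding to reach the next 16-byte boundary, which explains the three ? after each of them.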

Function bar

  RET                                  // ret	
  ?						
  ?						
  ?	
  1. Return to the caller.
  2. ? indicates zero padding.

In Detail

  1. RET, aka RET, returns to the caller. The implied jump location is the value of the link register R30.
  2. The three ? indicate zero padding as explained above.

Final Thoughts

I only covered arm64 in this article because that's the architecture used by my laptop and perhaps also the future of the cloud. Writing a similar article for amd64 would be nice, but I'm not sure if I'll find the time.

Anyway, I hope that this information will be useful to others as well as to my future self. Please let me know if you spot any mistakes or have any questions!