Assembly
SIMD Optimization in Go
February 3rd, 2016
There's something beautiful about programming in assembly.
With the complexity of modern computers, it's easy to forget the relatively simple interface they provide to the programmer at the lowest level. Machine code is an almost impervious abstraction over the unavoidably physical transistors and capacitors that make up a computer. The abstractions are still human-made, but they are well-tested and reliable. The computers we program are machines underneath, and programming in assembly lets you see just how machiney they really are.
Assembly programming can be very slow and crash-filled compared to programming in a language like Go, but very occasionally it's a good idea or at least a lot of fun. The amazing assembly programming game TIS-100, with its 14 page instruction manual, captures the fun of programming assembly without the mess of doing it on a real computer with a 3,883 page reference manual and an arsenal of build tools.
Why waste your time programming in assembly when there are perfectly good languages to program in? Your father built that language himself and now it's not good enough for you? He spent years unrolling those loops by hand before he got put out of a job by -funroll-loops and all those highfalutin optimizing compilers. Don't make the same mistake he did.
Even with today's fancy compilers, there are still a few cases where you might want to write assembly. The ones that stand out are cryptography, performance optimization, and accessing things not normally available from the language. The most fun of these is, of course, performance optimization.
When the performance of some piece of your code actually does matter to a user, and you've already tried all the easier ways of making it faster, assembly could be a good place to go next. Although the compiler might be better at optimizing assembly than you, you know more about your particular case than the compiler can safely assume.
Writing Assembly in Go
The best place to start here is to write the simplest function possible. So here's one that adds two int64s together, called add.
package main

import "fmt"

func add(x, y int64) int64 {
	return x + y
}

func main() {
	fmt.Println(add(2, 3))
}
go get goroutines.com/asm-add-go
To implement this in assembly, you create a separate file called add_amd64.s where you put the assembly implementation. I'm using a MacBook for these examples, so the assembly will be for the AMD64 architecture.
// add.go
package main

import "fmt"

func add(x, y int64) int64

func main() {
	fmt.Println(add(2, 3))
}

// add_amd64.s
TEXT ·add(SB),NOSPLIT,$0
	MOVQ x+0(FP), BX
	MOVQ y+8(FP), BP
	ADDQ BP, BX
	MOVQ BX, ret+16(FP)
	RET
go get goroutines.com/asm-add
To run this example, fetch and build it with go get as above, then run the installed asm-add binary.
The assembly syntax is...obscure at best. There is the official Go guide and what appears to be a somewhat ancient manual for the Plan 9 assembler, which give some hints as to how the Go assembly language works. The best references are existing Go assembly code and the compiled assembly versions of Go functions, which you can get with:
go tool compile -S <go file>
The most important things to know are the function declaration and the stack layout.
The magical incantation to start the function is TEXT ·add(SB),NOSPLIT,$0. There is a unicode dot character · separating the package name from the function name. In this case the package name is main, so the package is left off of the identifier and the name of the function is add. The NOSPLIT directive tells the assembler not to insert the usual preamble that checks whether the stack needs to grow, and it also lets you omit the argument size that would otherwise have to follow the frame size (written as something like $0-24). The $0 at the end is the size of the function's local stack frame, which is zero here since we work entirely in registers. The (SB) after the function name refers to the static base pseudo-register: named symbols like functions are addressed as offsets from SB, so it's required even though it looks redundant.
The stack layout puts every argument to the function in order on the stack, starting at address 0(FP), which means a zero-byte offset from the FP pseudo-register, continuing through all the arguments until there is a space for the return value. For func add(x, y int64) int64, it looks like this:

 +0(FP)  x    int64 argument
 +8(FP)  y    int64 argument
+16(FP)  ret  int64 return value
Here's the assembly again for reference:
TEXT ·add(SB),NOSPLIT,$0
	MOVQ x+0(FP), BX
	MOVQ y+8(FP), BP
	ADDQ BP, BX
	MOVQ BX, ret+16(FP)
	RET
The assembly version of the add function is loading the variable x at memory address +0(FP) into the register BX. It then loads y at memory address +8(FP) into the register BP, adds BP to BX storing the result in BX, and finally copies BX to the memory address +16(FP) and returns from the function. The calling function, which put all the arguments onto the stack, will read the return value from where we left it.
Optimizing a Function with Assembly
We don't really need to write assembly to add two numbers together, so what could we actually use it for?
Let's say you have a bunch of vectors, and you want to multiply them by a transformation matrix. Maybe the vectors are points and you want to translate them through 3D space. We'll use 4D vectors with a 4x4 transformation matrix, with the vectors in a densely packed array. This is better than, say, an array of objects with a vector property on each, because a linear scan through densely packed memory is more cache-friendly.
type V4 [4]float32
type M4 [16]float32

func M4MultiplyV4(m M4, v V4) V4 {
	return V4{
		v[0]*m[0] + v[1]*m[4] + v[2]*m[8] + v[3]*m[12],
		v[0]*m[1] + v[1]*m[5] + v[2]*m[9] + v[3]*m[13],
		v[0]*m[2] + v[1]*m[6] + v[2]*m[10] + v[3]*m[14],
		v[0]*m[3] + v[1]*m[7] + v[2]*m[11] + v[3]*m[15],
	}
}

func multiply(data []V4, m M4) {
	for i, v := range data {
		data[i] = M4MultiplyV4(m, v)
	}
}
go get goroutines.com/asm-process-vectors
This takes 140ms for 128MB of data on a 2012 MacBook Pro using Go 1.5.3. What's the fastest this implementation could be? A memory copy over the same data is a reasonable lower bound, and it takes about 14ms.
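The article doesn't show its timing harness, but a rough one is easy to sketch: fill 128MB of vectors, apply the transform once, and time it (the harness details here are my own, not from the article):

```go
package main

import (
	"fmt"
	"time"
)

type V4 [4]float32
type M4 [16]float32

func M4MultiplyV4(m M4, v V4) V4 {
	return V4{
		v[0]*m[0] + v[1]*m[4] + v[2]*m[8] + v[3]*m[12],
		v[0]*m[1] + v[1]*m[5] + v[2]*m[9] + v[3]*m[13],
		v[0]*m[2] + v[1]*m[6] + v[2]*m[10] + v[3]*m[14],
		v[0]*m[3] + v[1]*m[7] + v[2]*m[11] + v[3]*m[15],
	}
}

func multiply(data []V4, m M4) {
	for i, v := range data {
		data[i] = M4MultiplyV4(m, v)
	}
}

func main() {
	// 128MB of V4 values: each V4 is 16 bytes
	data := make([]V4, 128<<20/16)
	// identity matrix, so the transform leaves each vector unchanged
	m := M4{
		1, 0, 0, 0,
		0, 1, 0, 0,
		0, 0, 1, 0,
		0, 0, 0, 1,
	}
	start := time.Now()
	multiply(data, m)
	fmt.Println(time.Since(start))
}
```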
Here's a version of the function written in assembly using SIMD instructions to perform the multiplications, which lets us multiply 4 float32s in parallel:
// func multiply(data []V4, m M4)
//
// memory layout of the stack relative to FP
//  +0 data slice ptr
//  +8 data slice len
// +16 data slice cap
// +24 m[0]  | m[1]
// +32 m[2]  | m[3]
// +40 m[4]  | m[5]
// +48 m[6]  | m[7]
// +56 m[8]  | m[9]
// +64 m[10] | m[11]
// +72 m[12] | m[13]
// +80 m[14] | m[15]

TEXT ·multiply(SB),NOSPLIT,$0
  // data ptr
  MOVQ data+0(FP), CX
  // data len
  MOVQ data+8(FP), SI
  // index into data
  MOVQ $0, AX
  // return early if zero length
  CMPQ AX, SI
  JE END
  // load the matrix into 128-bit wide xmm registers
  // load [m[0], m[1], m[2], m[3]] into xmm0
  MOVUPS m+24(FP), X0
  // load [m[4], m[5], m[6], m[7]] into xmm1
  MOVUPS m+40(FP), X1
  // load [m[8], m[9], m[10], m[11]] into xmm2
  MOVUPS m+56(FP), X2
  // load [m[12], m[13], m[14], m[15]] into xmm3
  MOVUPS m+72(FP), X3
LOOP:
  // load each component of the vector into xmm registers
  // load data[i][0] (x) into xmm4
  MOVSS    0(CX), X4
  // load data[i][1] (y) into xmm5
  MOVSS    4(CX), X5
  // load data[i][2] (z) into xmm6
  MOVSS    8(CX), X6
  // load data[i][3] (w) into xmm7
  MOVSS    12(CX), X7
  // broadcast each component of the vector across its own register
  // [0, 0, 0, x] => [x, x, x, x]
  SHUFPS $0, X4, X4
  // [0, 0, 0, y] => [y, y, y, y]
  SHUFPS $0, X5, X5
  // [0, 0, 0, z] => [z, z, z, z]
  SHUFPS $0, X6, X6
  // [0, 0, 0, w] => [w, w, w, w]
  SHUFPS $0, X7, X7
  // xmm4 = [m[0], m[1], m[2], m[3]] .* [x, x, x, x]
  MULPS X0, X4
  // xmm5 = [m[4], m[5], m[6], m[7]] .* [y, y, y, y]
  MULPS X1, X5
  // xmm6 = [m[8], m[9], m[10], m[11]] .* [z, z, z, z]
  MULPS X2, X6
  // xmm7 = [m[12], m[13], m[14], m[15]] .* [w, w, w, w]
  MULPS X3, X7
  // xmm4 = xmm4 + xmm5
  ADDPS X5, X4
  // xmm4 = xmm4 + xmm6
  ADDPS X6, X4
  // xmm4 = xmm4 + xmm7
  ADDPS X7, X4
  // data[i] = xmm4
  MOVNTPS X4, 0(CX)
  // data++
  ADDQ $16, CX
  // i++
  INCQ AX
  // if i >= len(data) break
  CMPQ AX, SI
  JLT LOOP
END:
  // since we use a non-temporal write (MOVNTPS)
  // make sure all writes are visible before we leave the function
  SFENCE
  RET
go get goroutines.com/asm-process-vectors-asm-simd
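An assembly port like this is easy to get subtly wrong, so it's worth checking it against the Go reference on random inputs. Here's a sketch of such a check (both sides are pure Go so the snippet stands alone; in the real project the multiply symbol would resolve to the .s implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

type V4 [4]float32
type M4 [16]float32

// reference implementation, kept around to check the optimized version
func M4MultiplyV4(m M4, v V4) V4 {
	return V4{
		v[0]*m[0] + v[1]*m[4] + v[2]*m[8] + v[3]*m[12],
		v[0]*m[1] + v[1]*m[5] + v[2]*m[9] + v[3]*m[13],
		v[0]*m[2] + v[1]*m[6] + v[2]*m[10] + v[3]*m[14],
		v[0]*m[3] + v[1]*m[7] + v[2]*m[11] + v[3]*m[15],
	}
}

// multiply stands in for the assembly version under test
func multiply(data []V4, m M4) {
	for i, v := range data {
		data[i] = M4MultiplyV4(m, v)
	}
}

func main() {
	var m M4
	for i := range m {
		m[i] = rand.Float32()
	}
	data := make([]V4, 1000)
	expected := make([]V4, len(data))
	for i := range data {
		for j := range data[i] {
			data[i][j] = rand.Float32()
		}
		expected[i] = M4MultiplyV4(m, data[i])
	}
	multiply(data, m)
	for i := range data {
		if data[i] != expected[i] {
			fmt.Println("mismatch at", i)
			return
		}
	}
	fmt.Println("ok")
}
```

Exact float comparison works here because the SIMD version does the same operations in the same order; if an assembly version reordered the additions, you'd compare with a tolerance instead.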
This one takes 18ms, so it's pretty close to the speed of the memory copy. A better optimization might be to run this sort of thing on the GPU instead of on the CPU because the GPU is really good at it. Here are the running times for the different programs, including a manually inlined Go version and a non-SIMD assembly implementation:
program    time     speedup
original   140ms    1x
inline     69ms     2x
assembly   43ms     3x
simd       17ms     8x
copy       15ms     9x
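The "inline" row refers to the manually inlined Go version. The article doesn't show that code, but it presumably looks something like this sketch, with the body of M4MultiplyV4 expanded into the loop so there's no function call per vector (apparently the compiler of the time wasn't doing this for you, hence the 2x speedup):

```go
package main

import "fmt"

type V4 [4]float32
type M4 [16]float32

// multiplyInline is a hand-inlined version of the multiply loop:
// the matrix-vector product is written directly into the loop body
func multiplyInline(data []V4, m M4) {
	for i := range data {
		x, y, z, w := data[i][0], data[i][1], data[i][2], data[i][3]
		data[i][0] = x*m[0] + y*m[4] + z*m[8] + w*m[12]
		data[i][1] = x*m[1] + y*m[5] + z*m[9] + w*m[13]
		data[i][2] = x*m[2] + y*m[6] + z*m[10] + w*m[14]
		data[i][3] = x*m[3] + y*m[7] + z*m[11] + w*m[15]
	}
}

func main() {
	data := []V4{{1, 2, 3, 4}}
	identity := M4{1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1}
	multiplyInline(data, identity)
	fmt.Println(data[0]) // [1 2 3 4]
}
```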
When optimizing, we can pay the cost of some code complexity to make things easier for the machine. Assembly is a complicated way to do that, but sometimes it's the best method available.
Implementation Notes
I developed the assembly parts mostly in C and x64 assembly using Xcode, and then ported the assembly to the Go format. Xcode has a nice debugger that lets you inspect CPU registers while the program is running. If you include an assembly .s file in an Xcode project, it will build it and link it into your executable.
I used the Intel x64 Instruction Set Reference and Intel Intrinsics Guide to figure out which instructions to use. Converting to Go assembly is not necessarily straightforward, but many x64 instructions are included in x86/anames.go, and if they are not, they can be encoded directly with their binary representation.
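As a sketch of what encoding an instruction directly looks like: the Go assembler emits whatever bytes you give it via BYTE directives. SFENCE is supported by name, but its encoding (0F AE F8, from the Intel reference) makes a compact example of the technique:

```
  // SFENCE, written as raw bytes instead of by mnemonic
  BYTE $0x0F
  BYTE $0xAE
  BYTE $0xF8
```

This is how existing Go assembly handles newer instructions the assembler doesn't yet know by name.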
goroutines is a series of articles related somehow to the Go programming language
all code samples on this site are in the public domain