Note on worker pools in Go

Goroutines are cheap. Unlike in some languages, you can usually get away with spawning a Goroutine for each incoming request, but at scale the non-zero cost can become a bottleneck. Or perhaps you'd like to cap your server's concurrency because not all machines are created equal. In both situations, worker pools (called thread pools in other languages) are the next logical step: you amortise the cost of spawning Goroutines by reusing a fixed pool of long-lived workers.

There are plenty of implementation details you must pay attention to when designing your worker pools, but I’m going to focus on one thing in particular in this post: the stack.

When you create a Goroutine, the runtime allocates it a stack of at least 2048 bytes. The exact number varies across operating systems and architectures, but it doesn't really matter; most non-trivial applications will eventually outgrow it. When a Goroutine is about to overflow its stack, the Go runtime calls the morestack procedure, which allocates a new stack twice the size, copies the old stack's contents over, and rewrites pointers to point at the new addresses. All of this happens transparently and the application is none the wiser.
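To see this in action, here's a minimal sketch of my own (nothing to do with gRPC) in which each recursive call carries a roughly 1 KiB frame. A few calls in, the Goroutine has blown well past its initial 2048 bytes, and the runtime quietly grows the stack underneath it:

package main

import "fmt"

// burnStack recurses with a roughly 1 KiB frame per call, forcing the runtime
// to grow the Goroutine's stack several times along the way.
func burnStack(depth int) int {
	var frame [1024]byte
	frame[0] = byte(depth)
	if depth == 0 {
		return int(frame[0])
	}
	return burnStack(depth-1) + int(frame[0])
}

func main() {
	done := make(chan int)
	go func() {
		done <- burnStack(512) // roughly half a megabyte of stack, grown on demand
	}()
	fmt.Println(<-done)
}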

The most interesting gotcha in Go's stack handling is that Go stacks never shrink. Once a Goroutine has gone through morestack and had its stack doubled, that stack stays at its new size. There is no lessstack.

Like any operation that copies a non-trivial amount of memory, morestack can cause a latency spike, and, like GC pauses, one that is hard to spot without actively profiling the application under load. In gRPC, which is where I first noticed this additional latency, it showed up as at least two morestack calls for every RPC. gRPC used to spawn a Goroutine for each request, so you can see where I'm going with this. What if request processing started with a large enough stack to begin with?
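As an aside, the easiest way to check whether morestack is biting your own service is a CPU profile captured under load; runtime.morestack showing up prominently in the profile is the tell. Here's a minimal, hand-rolled sketch of capturing one, where serveTestLoad is a placeholder for whatever drives your server rather than anything from gRPC:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// serveTestLoad is a placeholder for code that exercises the server under a
// representative load while the profile is being collected.
func serveTestLoad() {}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Collect a CPU profile for the duration of the load.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	serveTestLoad()
}

Inspect the result with go tool pprof cpu.pprof and look at the heaviest functions.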

Since worker pools reuse Goroutines, and since Goroutine stacks never shrink, a worker pool is the perfect way to eliminate these morestack calls. And it worked. Median request latency dropped by around 15%, and throughput correspondingly went up by roughly the same amount.

And then I realised there was a problem. gRPC is a library that calls into external functions to handle requests. Once control passes into that external code, all bets are off about how much stack the Goroutine will use. At scale, in long-running services, out at the tail of the distribution, a few rare calls can inflate a worker's stack to an abnormal size, and these large stacks will never be freed even though the vast majority of requests don't need anywhere near that much memory.

The solution is fairly straightforward: kill and respawn each worker after a sufficiently large number of requests. Respawning in a new Goroutine lets the worker start from scratch with a fresh, small stack. Exactly how often your workers should die and respawn isn't an exact science, but there is a trade-off: respawn too frequently and you throw away the performance benefits of the worker pool; respawn too infrequently and memory balloons and never comes down.

I went with a straightforward request counter that counts to approximately 2^16 within each worker before respawning, though you could choose something fancier like a timer-based approach. If your technique for dispatching requests to workers is as simple as round-robin, take extra care to make sure all worker Goroutines don't respawn at the same time. A random perturbation of the threshold number of requests after which each worker dies and respawns takes care of this.

Below are some highlights from gRPC’s current worker pool. See the PR for the full diff switching from Goroutine-per-request to worker pools that pay attention to stack sizes.

package main

import (
	"log"
	"math/rand"
)

// workerResetThreshold defines how often the stack must be reset. Every N
// requests, by spawning a new Goroutine in its place, a worker can reset its
// stack so that large stacks don't live in memory forever. 2^16 should allow
// each Goroutine stack to live for at least a few seconds in a typical
// workload (assuming a QPS of a few thousand requests/sec).
const workerResetThreshold = 1 << 16

type request struct {
	// whatever
}

type server struct {
	workerChannels []chan *request
}

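// worker serves requests from ch until the channel is closed or the reset
// threshold is reached, at which point it respawns itself in a fresh
// Goroutine so that any stack growth it accumulated is released.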
func (s *server) worker(ch chan *request) {
	// To make sure all server workers don't reset at the same time, choose a
	// random number of iterations before resetting.
	threshold := workerResetThreshold + rand.Intn(workerResetThreshold)
	for completed := 0; completed < threshold; completed++ {
		req, ok := <-ch
		if !ok {
			return
		}
		s.handleSingleRequest(req)
	}
	// Restart in a new Goroutine.
	go s.worker(ch)
}

func (s *server) initWorkers(numWorkers int) {
	s.workerChannels = make([]chan *request, numWorkers)
	for i := 0; i < numWorkers; i++ {
		// One channel per worker reduces contention.
		s.workerChannels[i] = make(chan *request)
		go s.worker(s.workerChannels[i])
	}
}

func (s *server) stopWorkers() {
	for _, ch := range s.workerChannels {
		close(ch)
	}
}

func (s *server) handleSingleRequest(req *request) {
	log.Printf("processing req=%v\n", req)
}

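// listenForRequest stands in for the transport layer; it isn't part of the
// worker-pool changes and is only stubbed out here so the listing is
// self-contained. A real implementation would block until a request arrives.
func listenForRequest() *request {
	return &request{}
}
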
func (s *server) listenAndHandleForever() {
	for counter := 0; ; counter++ {
		req := listenForRequest()
		select {
		case s.workerChannels[counter%len(s.workerChannels)] <- req:
		default:
			// TODO: If this worker is busy, fall back to spawning a Goroutine. Or
			// find a different worker. Or dynamically increase the number of workers.
			// Or just reject the request.
		}
	}
}

func newServer() *server {
	s := &server{}
	s.initWorkers(16)
	return s
}

func main() {
	s := newServer()
	s.listenAndHandleForever()
}