Optimizing M3: How Uber Halved Our Metrics Ingestion Latency by (Briefly) Forking the Go Compiler

April 18, 2019 / Global
Figure 1: The arrow on the left shows our typical end-to-end latency, hovering around 10 seconds with occasional spikes. The arrow on the right shows our end-to-end latency after the performance regression, where we see regular spikes rising to 20 seconds.
Figure 2: In this high-level look at M3’s metric ingestion pipeline, metrics are emitted from various containers to a local daemon called Collector that runs on each of our hosts. The Collectors use a shard-aware topology to forward the metrics to our aggregation tier, where they’re aggregated into 10-second and one-minute tiles. Finally, the aggregation tier flushes the tiles to various backends, including the M3DB ingester, which is responsible for writing them to M3DB.
Figure 3: The rate at which the M3DB ingesters receive new metrics is not constant. At regular intervals, the ingesters receive a large number of new metrics all at once because the aggregation tier is creating and flushing tiles of various sizes.
Figure 4: Performing a git bisect showed us a version change in M3DB, which in turn necessitated another git bisect of the M3DB monorepo, leading us to the M3X monorepo.
Figure 5: After 81 tries, our git bisect finally revealed a small change to the Clone method that had somehow caused the performance regression.
Figure 6: We found that the performance regression could be narrowed down even further to the small change of replacing some existing inline code with a helper function.
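The captions don’t reproduce the actual diff, so the following is only a hypothetical illustration of that kind of change (the bytesRef type and cloneBytes helper are invented for this sketch): the two versions behave identically, but the helper adds one more frame to every call on a hot path.

```go
package sketch

// Hypothetical illustration of the kind of change described in Figure 6
// (not the actual M3X code): the same copy written inline versus factored
// out into a helper function.
type bytesRef struct {
	data []byte
}

// Inline version: the copy happens directly in the method's own stack frame.
func (b *bytesRef) CloneInline() []byte {
	out := make([]byte, len(b.data))
	copy(out, b.data)
	return out
}

// Helper version: behaviorally identical, but each call descends one extra
// frame, which can matter if the goroutine is already near its stack boundary.
func (b *bytesRef) CloneViaHelper() []byte {
	return cloneBytes(b.data)
}

func cloneBytes(data []byte) []byte {
	out := make([]byte, len(data))
	copy(out, data)
	return out
}
```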
Figure 7: These two flame graphs show how the version of the code that exhibited the performance regression (right) was spending significantly more time in the runtime.morestack function.
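The captions don’t say exactly how these profiles were collected, but a common way to expose CPU profiles (from which flame graphs like these are rendered) in a long-running Go service is the standard net/http/pprof package; the address and port below are just examples.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a side port (6060 is conventional, not required).
	// A 30-second CPU profile can then be fetched and viewed as a flame graph with:
	//   go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
	// Functions such as runtime.morestack stand out in the flame graph when they
	// account for a significant share of the sampled CPU time.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```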
Figure 8: The runtime.morestack function grows the stack of a goroutine that needs more stack space by pausing execution, allocating a new, larger stack, copying the old stack into the new one, and then resuming the function call.
Figure 9: In production, our stack was over 30 function calls deep, which made it possible to trigger the stack growth issue. In our benchmarks, however, the stack was very shallow and it was unlikely that we would exceed the default stack size.
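A self-contained benchmark sketch (illustrative, not the actual M3 benchmarks) of why stack depth matters: each newly spawned goroutine starts with a small stack of a few kilobytes, so a deep call chain inside the goroutine forces runtime.morestack to grow it, while a shallow one never does.

```go
package work

import (
	"sync"
	"testing"
)

// callChain simulates a call stack d frames deep; the 64-byte pad argument
// stands in for real locals so each frame occupies a realistic amount of stack.
func callChain(d int, pad [64]byte) int {
	if d == 0 {
		return int(pad[0])
	}
	return callChain(d-1, pad) + 1
}

// spawn starts n goroutines that each run a call chain of the given depth.
func spawn(n, depth int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var pad [64]byte
			callChain(depth, pad)
		}()
	}
	wg.Wait()
}

// Shallow call chains fit comfortably inside a fresh goroutine's initial stack,
// so benchmarks like this one never exercise runtime.morestack.
func BenchmarkShallowStack(b *testing.B) {
	for i := 0; i < b.N; i++ {
		spawn(100, 2)
	}
}

// A call chain dozens of frames deep typically overflows the initial stack, so
// every freshly spawned goroutine pays for one or more stack growths.
func BenchmarkDeepStack(b *testing.B) {
	for i := 0; i < b.N; i++ {
		spawn(100, 40)
	}
}
```

Run with go test -bench=.; the deep variant is where runtime.morestack would be expected to show up in a CPU profile.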
Figure 10: A commonly leveraged pattern in Go is to use a channel as a semaphore for controlling concurrency. Work can only be performed once a token has been reserved, so the total amount of concurrency is limited by the number of tokens (in other words, the size of the channel).
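A minimal sketch of that pattern (the type and method names are illustrative, not M3’s actual code): a buffered channel holds the tokens, and a brand-new goroutine is spawned for each piece of work once a token has been reserved.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenPool limits concurrency with a buffered channel used as a semaphore:
// a goroutine may only be spawned after reserving a token, and the token is
// returned when the work completes.
type tokenPool struct {
	tokens chan struct{}
}

func newTokenPool(size int) *tokenPool {
	return &tokenPool{tokens: make(chan struct{}, size)}
}

// Go reserves a token, then runs fn in a freshly spawned goroutine.
func (p *tokenPool) Go(fn func()) {
	p.tokens <- struct{}{} // blocks until a token is available
	go func() {
		defer func() { <-p.tokens }() // release the token when done
		fn()
	}()
}

func main() {
	pool := newTokenPool(4) // at most 4 pieces of work run concurrently
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		i := i
		wg.Add(1)
		pool.Go(func() {
			defer wg.Done()
			fmt.Println("work item", i)
		})
	}
	wg.Wait()
}
```

The hidden cost in this version is that every work item gets a fresh goroutine, and therefore a fresh stack that may need to be grown all over again if the work’s call chain is deep.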
Figure 11: This implementation of the worker pool takes a different approach. Instead of using tokens to limit the number of goroutines that can be spawned, we spawn all the goroutines upfront and then use the channel to assign them work. This still limits the concurrency to the specified limit, but prevents us from having to allocate new goroutine stacks over and over again.
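Again as an illustrative sketch rather than M3’s exact implementation, the same concurrency cap can be enforced by spawning the workers once and handing them work over an unbuffered channel:

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool spawns all of its goroutines upfront and hands them work over a
// channel. Goroutine stacks are allocated (and grown) once and then reused,
// instead of being re-created for every piece of work.
type workerPool struct {
	work chan func()
}

func newWorkerPool(size int) *workerPool {
	p := &workerPool{work: make(chan func())}
	for i := 0; i < size; i++ {
		// Long-lived workers; shutdown handling is omitted for brevity.
		go func() {
			for fn := range p.work {
				fn()
			}
		}()
	}
	return p
}

// Go blocks until one of the long-lived workers is free to pick up fn,
// so concurrency is still capped at the pool size.
func (p *workerPool) Go(fn func()) {
	p.work <- fn
}

func main() {
	pool := newWorkerPool(4)
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		i := i
		wg.Add(1)
		pool.Go(func() {
			defer wg.Done()
			fmt.Println("work item", i)
		})
	}
	wg.Wait()
}
```

The calling code is unchanged from the token-based sketch; only where the goroutines come from differs. Because the worker goroutines, and their stacks once grown, are reused for the lifetime of the pool, the cost of runtime.morestack is paid at most a few times per worker rather than once per work item.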
Figure 12: With the new worker pool, the amount of time spent in runtime.morestack was even lower than it was before we introduced the performance regression.
Figure 13: The new worker pool was so effective that even if we deployed our service with the code that initially caused the performance issue, the end-to-end latency was still lower than it was before we introduced the regression. This meant that with the new worker pool, we could safely write our code without having to worry about the cost of goroutine stack growth.
Commit                  Sampled Average Number of Occurrences (of runtime.morestack)
With regression         15,685
With regression fix      3,465
With new worker pool       171
Richard Artoul

Richard Artoul is a senior software engineer on Uber's Primary Storage team. He enjoys geeking out over Golang performance optimizations and spends most of his time tinkering away on M3DB.

Posted by Richard Artoul
