@twotwotwo
Last active December 9, 2023 08:41
Sorting 5x faster with Go: how it's possible, what didn't work so well, and what I learned

github.com/twotwotwo/sorts is a Go package with parallel radix- and quicksorts. It can run up to 5x faster than stdlib sort on the right kind of large sort task, so it could be useful for analysis and indexing/database-y work in which you have to sort millions of items. (To be clear, I don't recommend most folks drop stdlib sort, which is great, and which sorts depends on.)

While the process of writing it is fresh in my mind, here are some technical details, some things that didn't make the cut, and some thoughts about the process:

Concretely, what this looks like inside:

  • Both number and string versions are in-place MSD radix sorts that look at a byte at a time and, once the range being sorted gets down to 128 items, call (essentially) the stdlib's quicksort.

  • The parallelization code's basic principle is "first, try to keep all the cores busy, then worry about everything else."

    Concretely: when a sort routine has a >128-element subarray to sort, it tries to do a non-blocking send of a "task" to a channel to potentially hand it to another goroutine. If the send fails, it does the sort immediately, in the current goroutine. The channel is buffered so if a worker finishes its task, there's a good chance there'll be another job ready--I thought buffering would hurt, but it helped. Timings on 1 to 16 cores are online.

    There are nicer approaches you could take: samplesort partitions the array into more-or-less even-sized buckets, one to a core. There also might be designs broadly similar to the current one that are better at minimizing how often parts of the array are tossed back and forth between cores. In the latter category, work stealing seems like a promising approach: work only migrates between workers when a worker runs out of sorting to do.

  • The integer and string sorts both do pragmatic tricks to save work in the common case where some bytes of input are the same across the whole range being sorted. If you're sorting, say, a bunch of ints between 0 and 1000000 (so only using 20 of 64 bits), or a bunch of strings that all have the same long common prefix (think lines in a log file), this makes a noticeable difference.

  • In early revisions, the string sort could skip multiple bytes ahead in a single pass, but I removed that: it was tricky, tricky to test, and not hugely faster than what it does now, even when I artificially built a file with a really long common prefix.

Some more things I tried but didn't make it:

  • Keeping my own stack. Radix sorts, especially the string sort, can use a lot of stack--they need a couple of 256-entry tables per level of recursion. American flag sort reduces stack use by keeping its own stack (or "to-do list", if you want to think about it informally) to avoid the need for recursion. I tried that, but it didn't usually affect speed, it made the code trickier, and we can afford to blow a certain amount of memory on stack in 2015 (they were writing in '93), so it's not there in the released version. If you added it back, you could lift the hard limit that radix sorts strings only by their first 32 bytes. It might help the same type of data the long-common-prefix trick did.

  • For parallel radix sorts, trying to get all cores busy sooner by doing a few (fast) quicksort-style partitioning passes before radix sorting the pieces.

    The first pass of a parallel radix sort happens on one core, and takes longer than the first pass of a quicksort. Perhaps because of that, the timings show quicksort getting higher CPU usage on multicore sorts. But I couldn't improve benchmark results by partitioning before radix sorting. It might be worth revisiting and fiddling with the implementation, or maybe it just makes more sense to go directly to samplesort.

  • Caching key bytes for string sorts between two passes over smallish ranges. Only helped a little and the code wasn't pretty.

  • Using more than 8 bits at a time to sort ints. Sometimes helped but it seemed very sensitive to the exact array size.

I still think there might be some wins in sorting string data by collecting the first bytes of the strings in uint64s and sorting them as numbers, then using the actual strings as a tiebreaker. (Keeping string bytes in one little contiguous region has to beat jumping all over RAM for them.) And there might be something to concrete (interfaceless) sort code, but then the trick is coming up with a great API.

Some meta thoughts:

  • Don't be too quick to write anything off as not worth trying. Kind of the computing equivalent of "it never hurts to ask." When I first started futzing around I figured single-type versions of the sort code (e.g., one that only compares uint64s) were the way to go, and only tried the current sort.Interface thing out of curiosity, to see (I thought) how badly it'd do. Turned out, not that bad, so here we are.

  • You have to remember what code runs frequently and what doesn't. That long-common-prefix code felt like it should be making everything slower, but wasn't. Intuitively, when something is a third of the code you see in front of you, and worse yet in an inner loop, it's hard not to imagine it's hurting somehow. Benchmarks didn't bear this out; it probably usually boiled down to a never-taken branch that the processor could zip right past.

    Of course, it turned out that particular code was doomed, haha, but by its trickiness, not its slowness.

  • Sometimes the hard part of a change is testing it. Optimizations have weird special cases to tickle. Multiple algorithms (radix sort plus the stdlib's qSort, which itself breaks down into insertionSort/heapSort), multiple data types for each algorithm, and each can run serial or parallel. Plus, you don't just want to be able to say you have coverage; you ideally want each type of sort to see as many kinds of weird data as you can. Brute force (here, sorting each dataset twelve ways) can help.
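That brute-force cross-product can be sketched as a tiny harness: run every sort under test over every kind of weird data and diff each result against the stdlib. The sorts and dataset generators here are illustrative stand-ins, not the package's real tests.

```go
package main

import (
	"fmt"
	"math/rand"
	"reflect"
	"sort"
)

// runAll runs every sort on every dataset and returns how many
// (sort, dataset) pairs disagreed with the stdlib's answer.
func runAll() int {
	sorts := map[string]func([]int){
		"stdlib": sort.Ints,
		"insertion": func(a []int) { // a second implementation to cross-check
			for i := 1; i < len(a); i++ {
				for j := i; j > 0 && a[j] < a[j-1]; j-- {
					a[j], a[j-1] = a[j-1], a[j]
				}
			}
		},
	}
	datasets := map[string]func() []int{
		"empty":     func() []int { return nil },
		"all equal": func() []int { return []int{7, 7, 7, 7} },
		"sorted":    func() []int { return []int{1, 2, 3, 4, 5} },
		"reversed":  func() []int { return []int{5, 4, 3, 2, 1} },
		"random": func() []int {
			r := rand.New(rand.NewSource(1))
			a := make([]int, 500)
			for i := range a {
				a[i] = r.Intn(50) // lots of duplicates on purpose
			}
			return a
		},
	}
	failures := 0
	for dname, gen := range datasets {
		want := gen()
		sort.Ints(want) // the stdlib is the reference answer
		for sname, s := range sorts {
			got := gen()
			s(got)
			if !reflect.DeepEqual(got, want) {
				fmt.Printf("FAIL: %s on %q\n", sname, dname)
				failures++
			}
		}
	}
	return failures
}

func main() {
	fmt.Println("failures:", runAll())
}
```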

Anyway, perspective! I did this specifically because it's fun sometimes to not have to think too hard about sorting out users' requirements and instead just take an already-well-understood problem and fiddle with computers for a bit. If I were to futz with this more, I'd much rather use it for any of the fun analysis, indexing, etc. one can build with sorts (something with actual benefit for some end user), rather than iterate more on a low-level part of the picture like this.

Questions, thoughts, non-thoughts, reply below or holler at me on the Twitters (@rf). Thanks for reading!

@mijia

mijia commented May 7, 2015

This is awesome.

@shawnsmithdev

I also wrote a radix sorting library. It is not parallel or in-place, and doesn't support strings (I couldn't beat the stdlib on my first attempt), and it is some of the least DRY code I've ever written. But it does support floats!

@twotwotwo

@shawnsmithdev -- that is neat! It does two of the big things that can push perf further (specialized code per type, and using a buffer for an out-of-place sort), and it's cool to see that they actually do help and are out there in a library.

sorts/sortutil does have func Float64s([]float64) to sort float arrays, and some helper functions (Float{32,64}{Key,Less}) for adapting floats to work with sorts.ByUint64. I got the trick from the same pages you linked to from your README. I don't advertise the float stuff much because I don't love the interface as it stands now: if I want folks using this on floats, I should just add sorts.ByFloat64.

@jcoffin01

If you haven't tried it, when you reach 128 items, you might want to consider switching to a Shell-Metzner sort instead of a quicksort (and you might want to experiment with that threshold as well). A quicksort tends to be overkill for only 128 items.

There's one other technique that's initially counter-intuitive (at least it seemed that way to me), but actually works well. It's easiest to explain in the context of a quicksort, but may well apply to your situation as well. The typical modified Quicksort switches to an insertion sort when the partition size drops below some particular threshold.

A modification of that (due to Robert Sedgewick) is to not invoke the insertion sort immediately upon reaching that threshold--instead, just return at that point. Then when all the partitioning is done, do an insertion sort once for the entire collection.

This reduces the function call overhead. At first glance, it sounds almost suicidal because you're doing an insertion sort (O(N^2) algorithm) on a large collection. In reality, however, the time with insertion sort is basically proportional to the maximum distance an element might move rather than the overall size of the collection. Since you've already partitioned the collection, no element has to move very far, so the insertion sort ends up nearly linear.

@twotwotwo

(@jcoffin01) Huh! I don't think I have time to try those now, but if you or anyone does, and they turn out to help and are pretty well tested, I'd look at merging them. Since they're in-place comparison thingies, they'd be candidates to see if the stdlib wants to merge them, too.

@jcoffin01

It'll probably have to be somebody other than me, at least for now. I haven't used Go enough to trust any code I write in it.

@twotwotwo

@jcoffin01--someone merged in improvements to stdlib quicksort after much work, including switching from insertion sort to shellsort for smaller subarrays. The changeset is at https://go-review.googlesource.com/#/c/15688/ if you're curious.
