When More Parallelism != More Performance
2022 January 11 22:38 stuartscott 10665¤ 1309174¤
After reading @Kostadin's Why Amdahl's Law is Important article, I wanted to explore the concepts in more detail by parallelizing a serial algorithm.
Bubble Sort is a serial sorting algorithm which iterates through an array comparing each adjacent pair of elements and swapping them if they are out of order. This is repeated until the entire array is sorted. Bubble Sort cannot take advantage of multiple cores, but it holds the curious title of fastest sorting algorithm, provided the data is already sorted.
Here is Bubble Sort in Go:
func main() {
	array := []int{1, 5, 2, 6, 3, 7, 4, 9, 8, 0}
	log.Println(array)
	BubbleSort(array, func(a, b int) bool {
		return array[a] < array[b]
	})
	log.Println(array)
}
func BubbleSort(array []int, less func(int, int) bool) {
	limit := len(array)
	// Repeats until no swaps occur.
	for swapped := true; swapped; {
		swapped = false
		// Iterate through all adjacent pairs.
		for i := 0; i < limit-1; {
			j := i + 1
			// Swap only if strictly out of order; swapping
			// equal elements would loop forever.
			if less(j, i) {
				array[i], array[j] = array[j], array[i]
				swapped = true
			}
			i = j
		}
	}
}
If the hardware running the Bubble Sort had two CPU cores, couldn't we modify the algorithm so the first core checks one pair of adjacent elements while the second core checks a different pair?
Foam Sort is an attempt to parallelize Bubble Sort - just as foam consists of many bubbles, Foam Sort consists of multiple Bubble Sorts executing in parallel. Well, that was the idea anyway. Note: this is not a new idea, it was originally presented by Habermann in 1972 and has the descriptive, though uninspiring, name of Odd-Even Transposition Sort.
Here is Foam Sort in Go:
func main() {
	array := []int{1, 5, 2, 6, 3, 7, 4, 9, 8, 0}
	log.Println(array)
	FoamSort(array, func(a, b int) bool {
		return array[a] < array[b]
	})
	log.Println(array)
}
func FoamSort(array []int, less func(int, int) bool) {
	limit := len(array)
	jobs := make(chan int, limit/2)
	results := make(chan bool, limit/2)
	// Repeatedly takes an index from the job queue.
	// Compares, and optionally swaps, with the adjacent index.
	work := func() {
		for j := range jobs {
			a := j
			b := j + 1
			// Swap only if strictly out of order, so equal
			// elements don't swap forever.
			if less(b, a) {
				array[a], array[b] = array[b], array[a]
				results <- true
			} else {
				results <- false
			}
		}
	}
	// Spawns a worker goroutine for each available processor.
	for w := 0; w < runtime.GOMAXPROCS(0); w++ {
		go work()
	}
	// Adds indices of non-overlapping pairs to the job queue,
	// then waits for all results.
	sort := func(start int) (result bool) {
		for i := start; i < limit-1; i += 2 {
			jobs <- i
		}
		for i := start; i < limit-1; i += 2 {
			result = <-results || result
		}
		return
	}
	// Sorts even, then odd, pairs.
	// Repeats until no swaps occur.
	for swapped := true; swapped; {
		swapped = sort(0)
		swapped = sort(1) || swapped
	}
	// Closes queues so worker goroutines can terminate.
	close(jobs)
	close(results)
}
Is Foam Sort the best new sorter on the block?
Short answer: no.
To understand the longer answer, let's compare Foam Sort with Bubble Sort and Standard Sort (the optimized Quick Sort implemented in the Go standard library) in each of three cases: Best - the data is already sorted, Worst - the data is sorted in reverse order, and Random - the data is unsorted.
The results below show how the algorithms compare when sorting an array of 1000 integers on a MacBook Pro with 4 worker threads.
$ go test -bench=. ./...
goos: darwin
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
BenchmarkFoamSort_Best_1K-4 4484 227880 ns/op
BenchmarkBubbleSort_Best_1K-4 169204 6207 ns/op
BenchmarkStandardSort_Best_1K-4 29431 40083 ns/op
BenchmarkFoamSort_Worst_1K-4 10 110654902 ns/op
BenchmarkBubbleSort_Worst_1K-4 318 3710267 ns/op
BenchmarkStandardSort_Worst_1K-4 28496 41871 ns/op
BenchmarkFoamSort_Random_1K-4 10 102962330 ns/op
BenchmarkBubbleSort_Random_1K-4 294 4333471 ns/op
BenchmarkStandardSort_Random_1K-4 8998 132668 ns/op
PASS
ok github.com/stuartmscott/foamsort 13.392s
The results above show that in the Best case, Foam Sort is ~37x slower than Bubble Sort, in the Worst case it is ~30x slower, and in the Random case it is ~24x slower!
How could this happen?! The key issue is that Foam Sort parallelizes the comparison operation, which is not the most expensive part of the sort. The main cost of Bubble Sort comes from the sheer number of comparison operations it must do, which in the best case is N, and in the worst case N^2. Reducing the number of comparisons does more for performance than reducing the cost of each individual comparison, which is why the Standard Sort is so much quicker in most cases - it simply does fewer comparisons.
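The difference in comparison counts is easy to measure directly by instrumenting the less callback. A small sketch (not part of the project's code) counting comparisons for a bubble sort versus the standard library on the same reversed 100-element input:

```go
package main

import (
	"fmt"
	"sort"
)

// reversed returns n integers in descending order (the Worst case).
func reversed(n int) []int {
	a := make([]int, n)
	for i := range a {
		a[i] = n - i
	}
	return a
}

// bubbleComparisons bubble-sorts a and returns how many comparisons
// were made: reversed input needs n passes of n-1 comparisons.
func bubbleComparisons(a []int) int {
	count := 0
	for swapped := true; swapped; {
		swapped = false
		for i := 0; i < len(a)-1; i++ {
			count++
			if a[i] > a[i+1] {
				a[i], a[i+1] = a[i+1], a[i]
				swapped = true
			}
		}
	}
	return count
}

// stdComparisons sorts a with the standard library, counting the
// comparisons via the less callback.
func stdComparisons(a []int) int {
	count := 0
	sort.Slice(a, func(i, j int) bool {
		count++
		return a[i] < a[j]
	})
	return count
}

func main() {
	fmt.Println("bubble:", bubbleComparisons(reversed(100))) // 9900
	fmt.Println("stdlib:", stdComparisons(reversed(100)))    // far fewer
}
```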
What if the comparison operations were more expensive?
We can simulate this by adding a small sleep into the comparison operation:
func(a, b int) bool {
	time.Sleep(100 * time.Nanosecond)
	return array[a] < array[b]
}
$ go test -bench=. ./...
goos: darwin
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
BenchmarkFoamSort_Best_1K-4 3717 293859 ns/op
BenchmarkBubbleSort_Best_1K-4 1878 577005 ns/op
BenchmarkStandardSort_Best_1K-4 277 4195029 ns/op
BenchmarkFoamSort_Worst_1K-4 8 139885513 ns/op
BenchmarkBubbleSort_Worst_1K-4 3 482107601 ns/op
BenchmarkStandardSort_Worst_1K-4 272 4268562 ns/op
BenchmarkFoamSort_Random_1K-4 8 137650085 ns/op
BenchmarkBubbleSort_Random_1K-4 3 485127224 ns/op
BenchmarkStandardSort_Random_1K-4 229 5092066 ns/op
PASS
ok github.com/stuartmscott/foamsort 19.632s
The results above are slightly more in Foam Sort's favour; in the Best case, it is ~2x faster than Bubble Sort, in the Worst case it is ~3.4x faster, and in the Random case ~3.5x faster! If you had expected the speed-up to be 4x due to the number of worker threads, then I invite you to re-read @Kostadin's Why Amdahl's Law is Important article.
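Amdahl's Law makes the gap concrete: if only a fraction p of the work can be parallelized, the serial remainder caps the speed-up regardless of core count. A quick sketch of the formula:

```go
package main

import "fmt"

// speedup computes Amdahl's Law: with parallelizable fraction p of
// the work spread over n cores, the serial remainder (1-p) limits
// the achievable gain.
func speedup(p, n float64) float64 {
	return 1 / ((1 - p) + p/n)
}

func main() {
	// Even if 90% of the work parallelizes perfectly, 4 cores give
	// only ~3.1x, and no number of cores can exceed 1/0.1 = 10x.
	fmt.Printf("4 cores:    %.2fx\n", speedup(0.9, 4))
	fmt.Printf("1000 cores: %.2fx\n", speedup(0.9, 1000))
}
```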
What happens if we just throw more cores at the problem?
I went to my cloud provider of choice, Digital Ocean, and purchased their top-of-the-line droplet with 32 dedicated CPU cores!
The results were poor.
$ go test -bench=. ./...
goos: linux
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel Xeon Processor (Cascadelake)
BenchmarkFoamSort_Best_1K-32 2278 506672 ns/op
BenchmarkBubbleSort_Best_1K-32 219459 5479 ns/op
BenchmarkStandardSort_Best_1K-32 37087 32299 ns/op
BenchmarkFoamSort_Worst_1K-32 6 195969930 ns/op
BenchmarkBubbleSort_Worst_1K-32 415 2901585 ns/op
BenchmarkStandardSort_Worst_1K-32 35242 33854 ns/op
BenchmarkFoamSort_Random_1K-32 6 197362338 ns/op
BenchmarkBubbleSort_Random_1K-32 406 2949955 ns/op
BenchmarkStandardSort_Random_1K-32 10000 107531 ns/op
PASS
ok github.com/stuartmscott/foamsort 16.011s
When using the normal comparison function, Foam Sort was on average ~75x slower than Bubble Sort. And even when using the 'expensive' comparison function, Foam Sort was still roughly 2x slower than Bubble Sort in the Best case, and only marginally faster in the Worst and Random cases.
$ go test -bench=. ./...
goos: linux
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel Xeon Processor (Cascadelake)
BenchmarkFoamSort_Best_1K-32 1672 718890 ns/op
BenchmarkBubbleSort_Best_1K-32 3628 335208 ns/op
BenchmarkStandardSort_Best_1K-32 441 2683244 ns/op
BenchmarkFoamSort_Worst_1K-32 4 278975445 ns/op
BenchmarkBubbleSort_Worst_1K-32 4 303722158 ns/op
BenchmarkStandardSort_Worst_1K-32 452 2721583 ns/op
BenchmarkFoamSort_Random_1K-32 4 273567180 ns/op
BenchmarkBubbleSort_Random_1K-32 4 306366861 ns/op
BenchmarkStandardSort_Random_1K-32 379 3126928 ns/op
PASS
ok github.com/stuartmscott/foamsort 16.490s
An algorithm's ability to scale and make use of the available hardware resources depends on many factors. The parallelism of the algorithm is critical - a serial algorithm will never speed up no matter how many cores there are. But the data, and how it is shared by the algorithm, is also important due to the physical constraints of the hardware.
Main memory operates at a much lower frequency than the CPU, so reads and writes to main memory are avoided as much as possible by holding copies of the data in layers of increasingly smaller and faster caches. Some of the larger caches may be shared by multiple CPU cores, while the smallest are typically private to a single core. Regardless, the data in these caches needs to be kept consistent across all of them, which is where Cache Coherency comes in.
The caches communicate and coordinate with each other to ensure they hold the correct data, and that changes made by one CPU core are visible to all other cores. In the case of sorting, many pieces of data are compared and swapped with each other, so every core needs access to every piece of data. This means the caches have to work very hard to maintain consistency across the cores, and if one core tries to read a value that another core has just updated, the reader must wait for the caches to synchronize.
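This coherency cost can be demonstrated with a toy experiment (a sketch under the assumption of 64 byte cache lines; the type and function names are mine, not the project's): two goroutines each increment their own counter, once with both counters on the same cache line and once padded apart. On most machines the shared-line version is measurably slower, purely from cache-line ping-pong.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedLine keeps both counters within one 64 byte cache line, so
// two cores incrementing them fight over the same line.
type sharedLine struct {
	a, b int64
}

// paddedLines inserts padding so each counter sits on its own line.
type paddedLines struct {
	a int64
	_ [56]byte
	b int64
}

// race runs two goroutines, each hammering its own counter, and
// returns the wall-clock time taken.
func race(iters int, incA, incB func()) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			incA()
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			incB()
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	var s sharedLine
	var p paddedLines
	const iters = 10_000_000
	fmt.Println("shared line:", race(iters, func() { s.a++ }, func() { s.b++ }))
	fmt.Println("padded:     ", race(iters, func() { p.a++ }, func() { p.b++ }))
}
```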
In the case of Foam Sort, any potential speed up from executing the comparisons in parallel is dwarfed by the overhead of spawning, scheduling, and coordinating multiple worker threads and keeping the caches consistent while they operate.
In other situations like computer graphics, each operation typically works in isolation on its own subset of the data and so each cache only needs to hold that subset and won't need to coordinate with the other caches as much. This is why you see so much parallelism in GPU design - where a typical CPU has 2-64 cores, a GPU can have hundreds and sometimes even thousands.
While this adventure in parallelism did not yield a new super-fast sorting algorithm, it was fun. I hope you enjoyed it too. If you're interested in exploring the code, or learning more about the benchmarks, check out the project on GitHub.
-
2022 January 13 21:06 stuartscott 7239¤ 2610714¤
After posting this article on Reddit a couple of people pointed out ways in which the algorithm could be improved. Thanks to all the Redditors who responded, and a special thanks to u/pdpi and u/coloredgreyscale.
Optimizing for the Cache
As mentioned before, a cache is used to provide faster access to a small subset of the data held in main memory. This "quick lookup table" maps a memory address to the data stored at that address. However, it would be inefficient to store only one piece of data per entry, so instead a group of contiguous memory addresses is mapped into a Cache Line, typically 64 bytes, ensuring that a request for the data at that address, or any of the 63 adjacent addresses, can be served quickly.
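The mapping is simple arithmetic: an address belongs to the line found by discarding its low-order bits. A small sketch, assuming 64 byte lines and 8 byte integers (typical on x86-64):

```go
package main

import "fmt"

const lineSize = 64 // bytes per cache line, typical on x86-64

// lineIndex maps a byte address to its cache line by discarding the
// low-order bits of the address.
func lineIndex(addr uintptr) uintptr {
	return addr / lineSize
}

func main() {
	// Eight adjacent int64s (8 bytes each) fit in a single 64 byte
	// line when the first is line-aligned; the ninth spills onto
	// the next line.
	base := uintptr(0x1000) // a line-aligned example address
	for i := uintptr(0); i < 9; i++ {
		fmt.Printf("element %d -> line %d\n", i, lineIndex(base+i*8))
	}
}
```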
The worker threads in the Foam Sort algorithm operated on adjacent pairs of elements which share a Cache Line. This resulted in poor performance because each time a pair of elements was swapped, that Cache Line had to be written back to main memory, and every copy of it in the other cores' caches had to be invalidated and re-read.
To alleviate this issue, workers can instead operate on a chunk of pairs, which reduces the number of Cache Line invalidations and main memory accesses. This new algorithm, which I called Reddit Sort after the insightful folks aforementioned, breaks the array to be sorted into chunks whose size is a multiple of the Cache Line capacity.
Here is Reddit Sort in Go:
func RedditSort(slice []int, less func(int, int) bool) {
	limit := len(slice)
	workers := runtime.GOMAXPROCS(0)
	batch := limit / workers
	cache := 16 // Integers per cache line (assumes 4 byte ints and 64 byte lines)
	// Align batch size to a multiple of the cache line size.
	for ; batch%cache != 0; batch++ {
	}
	jobs := make(chan int, workers)
	results := make(chan bool, workers)
	// Each worker processes one extra pair so that batches overlap
	// into the next batch. (Computed once here; having each worker
	// increment the shared batch variable would be a data race.)
	span := batch + 1
	// Repeatedly takes an index from the job queue.
	// For each element in the batch starting at the index,
	// compares, and optionally swaps, with the adjacent element.
	work := func() {
		for j := range jobs {
			swapped := false
			for i := 0; i < span && j < limit-1; i++ {
				a := j
				b := j + 1
				if slice[a] != slice[b] && !less(a, b) {
					slice[a], slice[b] = slice[b], slice[a]
					swapped = true
				}
				j = b
			}
			results <- swapped
		}
	}
	// Spawns a worker goroutine for each available processor.
	for w := 0; w < workers; w++ {
		go work()
	}
	// Repeatedly tells each worker to process its batch until no swaps occur.
	for swapped := true; swapped; {
		swapped = false
		for w := 0; w < workers; w++ {
			jobs <- w * batch
		}
		for w := 0; w < workers; w++ {
			swapped = <-results || swapped
		}
	}
	// Closes queues so worker goroutines can terminate.
	close(jobs)
	close(results)
}
The results below show how the algorithms compare when running the same benchmarks - sorting an array of 1000 integers on a MacBook Pro with 4 worker threads.
$ go test -bench=. ./...
goos: darwin
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
BenchmarkFoamSort_Best_1K-4 4338 244691 ns/op
BenchmarkBubbleSort_Best_1K-4 156889 6982 ns/op
BenchmarkRedditSort_Best_1K-4 77558 16628 ns/op
BenchmarkStandardSort_Best_1K-4 27688 43719 ns/op
BenchmarkFoamSort_Worst_1K-4 10 111162742 ns/op
BenchmarkBubbleSort_Worst_1K-4 268 4374331 ns/op
BenchmarkRedditSort_Worst_1K-4 100 12113010 ns/op
BenchmarkStandardSort_Worst_1K-4 24626 43831 ns/op
BenchmarkFoamSort_Random_1K-4 10 104008787 ns/op
BenchmarkBubbleSort_Random_1K-4 238 5083578 ns/op
BenchmarkRedditSort_Random_1K-4 103 12125052 ns/op
BenchmarkStandardSort_Random_1K-4 9252 134686 ns/op
PASS
ok github.com/stuartmscott/foamsort 19.481s
In the Best case Reddit Sort is ~2.4x slower than Bubble Sort due to the same parallelism overheads mentioned earlier, but ~14.7x faster than Foam Sort! In the Worst and Random cases Reddit Sort is 2.8x and 2.4x slower than Bubble Sort respectively, but 9.2x and 8.6x faster than Foam Sort respectively!
Granted, the Standard Sort is still the fastest for anything other than a mostly sorted array, but this is a huge win for Reddit Sort, all because the algorithm takes the hardware design into account!
Simulating an Expensive Comparison
Initially, a sleep was used to artificially increase the time taken to execute the comparison operation. However, this is not very effective because it doesn't actually keep the CPU busy in any meaningful way; the CPU can pick up other tasks, which might skew the benchmark results. Instead, the sleep was replaced with a "busy loop" which adds the first 1000 integers together and asserts the sum is correct:
var sum int
for i := 0; i < 1000; i++ {
	sum += i
}
expected := 499500
if sum != expected {
	panic(fmt.Sprintf("Expected %d, got %d", expected, sum))
}
A very clever compiler could see that the calculated value of sum after this loop never changes and that the panic can never fire. It could then optimize the whole workload away in the name of performance. Fortunately, the version of the Go compiler I'm using (go1.16.3) leaves the loop as-is, so it serves as a useful artificial workload, though this may change in future.
The results below show how the algorithms compare when running the same benchmarks with the more expensive comparison operation.
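A common belt-and-braces defence, should a future compiler get cleverer, is to publish the loop's result to a package-level "sink" variable, which current Go compilers do not prove unobserved. A sketch (the sink approach is my suggestion, not part of the original code):

```go
package main

import "fmt"

// sink is a package-level variable; current Go compilers do not
// eliminate writes to it, so the busy loop below stays alive.
var sink int

// expensiveLess is a comparison with an artificial workload that
// survives dead-code elimination by publishing its result to sink.
func expensiveLess(array []int, a, b int) bool {
	sum := 0
	for i := 0; i < 1000; i++ {
		sum += i
	}
	sink = sum
	return array[a] < array[b]
}

func main() {
	array := []int{3, 1, 2}
	fmt.Println(expensiveLess(array, 0, 1)) // false: 3 is not < 1
	fmt.Println(sink)                       // 499500
}
```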
$ go test -bench=. ./...
goos: darwin
goarch: amd64
pkg: github.com/stuartmscott/foamsort
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
BenchmarkFoamSort_Best_1K-4 3576 341223 ns/op
BenchmarkBubbleSort_Best_1K-4 2499 406355 ns/op
BenchmarkRedditSort_Best_1K-4 4612 219639 ns/op
BenchmarkStandardSort_Best_1K-4 343 3430212 ns/op
BenchmarkFoamSort_Worst_1K-4 7 157339212 ns/op
BenchmarkBubbleSort_Worst_1K-4 3 396582115 ns/op
BenchmarkRedditSort_Worst_1K-4 5 213440380 ns/op
BenchmarkStandardSort_Worst_1K-4 346 3451012 ns/op
BenchmarkFoamSort_Random_1K-4 6 171546250 ns/op
BenchmarkBubbleSort_Random_1K-4 4 289870738 ns/op
BenchmarkRedditSort_Random_1K-4 7 157816732 ns/op
BenchmarkStandardSort_Random_1K-4 291 4040693 ns/op
PASS
ok github.com/stuartmscott/foamsort 19.881s
Unsurprisingly, Bubble Sort is the slowest across all benchmarks and the Standard Sort is the fastest in the Worst and Random cases. But the most interesting comparison is between Foam and Reddit Sort - Reddit Sort is quicker in the Best and Random cases, but not in the Worst case, and I'm not sure why. We know that in the Worst case every element needs to cross the midpoint of the array to reach its correct location, and that in each iteration of Reddit Sort's outer loop only one element transitions each boundary between chunks, so perhaps the outer loop needs to do more iterations.
-
2022 January 13 21:24 stuartscott 3021821¤ 2199607¤
Sorting algorithms each have unique characteristics, however they can be difficult to visualize. To tackle this, I decided to go one step further and modify the code to generate animations showing how the data is being sorted.
Shout out to nitoyon on GitHub for their helpful gist on generating GIFs in Go.
Here's Bubble Sort in the Random case showing how the largest elements are put in their correct location first while the smaller elements slowly migrate towards theirs:
-
2022 January 13 21:31 stuartscott 1578105¤ 2821112¤
When compared with the animation for Bubble Sort, Foam Sort's animation seems to sort the data bidirectionally, like the two arms reaching out in Michelangelo's "The Creation of Adam".
-
2022 January 13 21:32 stuartscott 3073870¤ 2568354¤
Reddit Sort's animation is very similar to Bubble Sort's.
-
2022 January 13 21:36 stuartscott 3438320¤ 1698389¤
In the Best case scenario the data is already sorted, so the animations are not very interesting. Instead, let's see what the animations look like in the Worst case - where the data is sorted in reverse order.
Here is Bubble Sort.
-
2022 January 13 21:38 stuartscott 1678916¤ 1717862¤
When compared with the animation for Bubble Sort, Foam Sort's animation in the Worst case scenario seems 2D, like a rectangle expanding and then converging into alignment.
-
2022 January 12 15:55 stuartscott 396¤
To those asking about the cost of renting the 32 core server from Digital Ocean: it is $640/month, charged on an hourly basis at $0.952/hour, so it cost me less than a dollar to run the two benchmarks above.
If you'd like to try Digital Ocean use my referral link to get $100 in credit over 60 days. If you like it and spend $25, I'll get $25 - win/win!