Skip to content

Go is not C, so there is not an extreme fast way to merge slices

Go101 edited this page Sep 4, 2021 · 19 revisions

The builtin make function may be used to allocate a slice. As Go is a memory safe language which tries to remove the unspecified behaviors in C, all the elements allocated by a make call will be reset to zero in the make call.

Go toolchain 1.15 introduces an optimization to avoid resetting some elements allocated in a make call presenting in the following alike code.

// x is a slice which element type is T.
// n is an integer expression, either constant or not.
y = make([]T, n)
copy(y, x)
// Only the elements within y[len(x):] will be reset.

For other scenarios, make calls will still reset elements, which is often unnecessary in practice. For example, the make call in the following code.

func MergeSlices(data ...[]int) []int {
	n := 0
	for _, s := range data {
		n += len(s)
	}
	r := make([]int, 0, n)
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

That means there is not an extreme fast way to merge slices in Go.

How large is the performance downgrade caused by the unnecessary element resetting? From 1% to 45%, depending on sizes of the merged slices and CPU models. For example, on my machine, the downgrade in merging the 3 slices shown in the following code is about 20%.

package main

import "testing"

var (
	x = make([]byte, 10000)
	y = make([]byte, 20000)
	z = make([]byte, 50000)
	
	w = MergeSlices(x, y, z)
	
	r [128][]byte
)

func MergeSlices(data ...[]byte) []byte {
	type _ int
	n := 0
	for _, s := range data {
		n += len(s)
	}
	r := make([]byte, 0, n)
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func ClearSlice(data []byte) []byte {
	type _ int

	// The following loop will be optimized
	// as a memclr call internally.
	for i := range data {
		data[i] = 0
	}
	return data
}

func Benchmark_MergeSlices(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = MergeSlices(x, y, z)
	}
}

func Benchmark_ClearSlice(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = ClearSlice(w)
	}
}

The benchmark result:

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Benchmark_MergeSlices-4     66850       17283 ns/op      81920 B/op      1 allocs/op
Benchmark_ClearSlice-4      375154       2998 ns/op	     0 B/op      0 allocs/op

2998.0 / (17283 - 2998) == 0.21

(The methodology used in this article is in a rough way. So I accept Axel Wagner's criticism.)

Clone this wiki locally