The Hidden Dangers of Default Rand

This post is based on a real-world problem that I faced while developing a load generator program for work. I looked through my code carefully to make sure there weren’t any bottlenecks in the design or bugs in the implementation, but didn’t find anything obviously wrong. So I did some profiling, and found something very interesting: deep down in the rand package, a lock was taking up the majority of the application’s time. The following three programs show a distilled version of the start, middle, and end states that my program went through while working with, and around, the rand package.

Program Design

The program, in all three cases, has the same design. There is a pool of producer goroutines generating random strings, which are passed through a single channel to a pool of consumer goroutines. In the real-world application the strings were used in a network call, but in these test programs each string’s length is simply added to a counter. CPU profiling is on for each program the whole time, which slows it down; since we’re looking only at relative performance, this doesn’t matter.

(Figure: program architecture, showing the producer pool feeding the consumer pool through a single channel)

Getting the SVGs

First, a quick note about the SVGs below: I find it easiest to analyze where the time goes using the graph generated by the web command in the pprof tool. Below is an example pprof session. You can read more about the pprof tool at the golang blog.

$ go tool pprof rand_default rand_default.prof
Entering interactive mode (type "help" for commands)
(pprof) web
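
If you prefer text to graphs, the top10 command (covered in the same blog post) prints the functions the most time was spent in; I’m omitting its output here since the graphs below tell the same story:

(pprof) top10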

Unfortunately, the SVGs don’t work well in <img> tags out of the box, so I had to open them up in a text editor and fiddle with the width and height attributes to make them display properly in a browser. (In case you’re wondering, I made them fixed and guessed at values until the graph showed in its entirety.) The text might be a bit small, but I’ve included links to them below each one, and they come with built-in pan-and-zoom functionality so you can see them better.

Initial code

Below is the entirety of the example program I’ve written to demonstrate the issues with just using the default rand methods.

package main

import (
    "fmt"
    "math/rand"
    "os"
    "runtime/pprof"
    "sync"
    "time"
)

var letters = []byte("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

// randData returns n random bytes drawn from letters, using the
// package-level rand functions (and thus the shared locked source).
func randData(n int) []byte {
    b := make([]byte, n)

    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }

    return b
}

const numRuns = 100

func main() {
    start := time.Now()
    for i := 0; i < numRuns; i++ {
        do()
    }
    total := time.Since(start)
    perRun := float64(total) / numRuns

    fmt.Printf("Time per run: %fns\n", perRun)

    // Now generate the pprof data for a single run
    f, err := os.Create("rand_default.prof")
    if err != nil {
        panic(err.Error())
    }

    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err.Error())
    }
    defer pprof.StopCPUProfile()
    do()
}

const numRoutines = 10

func do() {
    start := make(chan struct{})
    comm := make(chan []byte)

    var read, write sync.WaitGroup
    read.Add(numRoutines)
    write.Add(numRoutines)

    for i := 0; i < numRoutines; i++ {
        // Producer: generates a random string of random length each iteration.
        go func() {
            <-start
            for j := 1; j < 10000; j++ { // j starts at 1 because rand.Intn(0) panics
                comm <- randData(rand.Intn(j))
            }
            write.Done()
        }()

        // Consumer: stands in for the network call in the real application.
        go func() {
            var sum int
            <-start
            for c := range comm {
                sum += len(c)
            }
            fmt.Println(sum)
            read.Done()
        }()
    }

    close(start) // release all producers and consumers at once
    write.Wait()
    close(comm)
    read.Wait()
}

Running the program on the console gives 13.1715 seconds per iteration:

$ go build rand_default.go
$ ./rand_default
(output elided)
Time per run: 13171548294.990000ns

Below is the graph generated by the pprof web command, showing the timing and connectedness of the different pieces of the program.

rand_default link to image

Ouch, those timings hurt. The total time spent in math/rand is 2066.06 seconds, with a whopping 1067.64 seconds spent obtaining the underlying lock and another 530.70 seconds spent unlocking it. Only 99.75 seconds went to the work we actually wanted done. That means 1598.34 / 2066.06 = 77.36% of the time inside math/rand, and 72.85% of the program’s overall time, was spent on synchronization overhead.

Checking the math/rand Source Code

Now that we can see the problem in the graph, we should take a look at a bit of the rand package’s source code to find the lock.

var globalRand = New(&lockedSource{src: NewSource(1)})

type lockedSource struct {
    lk  sync.Mutex // <------------------ Found it!
    src Source
}

func (r *lockedSource) Int63() (n int64) {
    r.lk.Lock()    // <----------------------------------
    n = r.src.Int63()
    r.lk.Unlock()  // <----------------------------------
    return
}

func (r *lockedSource) Seed(seed int64) {
    r.lk.Lock()    // <----------------------------------
    r.src.Seed(seed)
    r.lk.Unlock()  // <----------------------------------
}

The math/rand package makes it easy to get started with random numbers. The exposed package-level functions all pass through to the globalRand variable. Behind globalRand (and thus those functions) is a locked pseudo-random number generator (RNG). This makes sense: the generator produces one number at a time, and each number becomes the basis for the next, so its state can’t be safely accessed concurrently and a lock is needed to serialize access. I should point out that this is not a design flaw; it’s the package doing the right thing by default.
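
For reference, here is how those exported helpers delegate to globalRand (abridged from the same source file):

func Int63() int64 { return globalRand.Int63() }

func Intn(n int) int { return globalRand.Intn(n) }

Every call from every goroutine funnels through that one lockedSource, which is exactly the contention the profile shows.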

An Easy Optimization

Fortunately for me, the kind people in the Gophers Slack team directed me towards the rand.New() function, which takes a rand.Source as an argument and returns a *rand.Rand with the same methods as the package-level functions. A Source can be obtained from rand.NewSource(), which takes an int64 seed and returns the same underlying generator as the default package-level one, but without the locking wrapper; that also means each goroutine needs its own. Applying this new-found knowledge only takes a few lines of code difference. I’ll show just the updated bits here.

The randData function now takes a *rand.Rand instance to use when generating random strings:

func randData(r *rand.Rand, n int) []byte {
    b := make([]byte, n)

    for i := range b {
        b[i] = letters[r.Intn(len(letters))]
    }

    return b
}

And each producer goroutine now creates its own instance of rand.Rand when it starts so it can be passed to the randData function each time:

go func() {
    r := rand.New(rand.NewSource(time.Now().UnixNano())) // UnixNano: goroutines seeded in the same second must still differ
    <-start
    for j := 1; j < 10000; j++ {
        comm <- randData(r, r.Intn(j))
    }
    write.Done()
}()

Running the optimized program gives us a time per iteration of 2.2668 seconds:

$ go build rand_optimized.go
$ ./rand_optimized
(output elided)
Time per run: 2266835998.330000ns

That’s quite an improvement: the time is down to 17.21% of the original, a 5.8x speedup. Recall from the first profile that synchronization was 72.85% of the run time, so eliminating the locks alone should have left us at 27.15%. We beat that, which means using separate instances sped the program up by more than just removing the locking overhead. My educated guess is that true parallelism allowed the program to run multiple RNGs at the same time on different cores.
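
One quick way to sanity-check that guess (a hypothetical probe, not part of the original programs) is to print how much parallelism the runtime actually has available:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // GOMAXPROCS(0) reports the current setting without changing it.
    fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", runtime.GOMAXPROCS(0), runtime.NumCPU())
}

If GOMAXPROCS is greater than 1, goroutines can genuinely run in parallel rather than merely being interleaved on one core.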

From the pprof graph of the optimized version, we can see that the majority of the time is still spent generating random numbers. This isn’t unexpected, but it’s still a lot of time overall: the RNGs account for 335.14 / 429.49 = 78.03% of the program’s run time. What if we could eliminate most of that as well?

rand_optimized link to image

A Step Further

Let’s take things a step further. What if, instead of generating entirely random strings, I picked a random length and sliced a premade random string? This relaxation lets me generate the random data once and select a subset of the needed length each time, which saves memory, GC pressure, and a lot of computation. The changes to do this are small as well.

I added an init function and a package variable:

// rData holds pre-generated random data for the producers to slice from.
var rData []byte

func init() {
    r := rand.New(rand.NewSource(time.Now().Unix()))
    rData = randData(r, 10000)
}

And in the producer goroutine, I changed the call to randData to a slice operation on the pre-generated random data (slicing makes no copy, so every message shares rData’s backing array):

go func() {
    r := rand.New(rand.NewSource(time.Now().UnixNano())) // UnixNano: goroutines seeded in the same second must still differ
    <-start
    for j := 1; j < 10000; j++ {
        comm <- rData[:r.Intn(j)]
    }
    write.Done()
}()

Running this extra-optimized code gives an average time per iteration of 0.0252 seconds:

$ go build rand_extraopt.go
$ ./rand_extraopt
(output elided)
Time per run: 25190140.240000ns

Now the single RNG takes 0.32 / 2.81 = 11.39% of the run time, but that cost is paid only once, at startup; its share shrinks as the rest of the code runs more iterations. Compared to the initial version, this code takes 0.1912% of the time, which makes it a 522.88x improvement over the original. Most of the time, at this point, is actually spent communicating over the channel that pipes data from producers to consumers. In the real-world application, this meant that the time was spent hammering the target, not overheating the server it was running on.

rand_extraopt link to image

Profiling Reveals Many Surprises

The profiling tools in Go were an enigma to me before this process of debugging a slow program. Along the way, I found a huge time sink in the math/rand locking, but also some other surprises. I’d encourage anyone to profile their Go programs (once they are functioning properly) to see whether they can find something easily optimized that will save them some time. The rand package is just the example I now use to illustrate the value of doing so. If profiling has ever helped you, let me know in the comments. I’d be very interested to hear some more optimization war stories.
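
As a closing aside (not something these test programs needed), long-running services can expose the same profiles over HTTP with the standard net/http/pprof package, so you can grab a profile from a live process whenever something looks slow:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
)

func main() {
    // The real program would run its own work here; for a service that
    // already serves HTTP on the default mux, the import alone is enough.
    //
    // Fetch a 30-second CPU profile with:
    //   go tool pprof http://localhost:6060/debug/pprof/profile
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

With that in place, a profile is only ever one HTTP request away.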
