The Golang Scheduler
Kevin Kelche


Introduction

What makes Go a great language to work with is its approach to concurrency. Concurrency, in its simplest definition, is the ability to make progress on multiple tasks over the same period of time (as opposed to parallelism, which is running tasks at the same instant). It is a highly powerful concept, but it presents numerous challenges. Fortunately, the Go team has made concurrency easy to handle in Go. In this article, we will explore the Go runtime and uncover how the scheduler operates.

Some Preamble

main.go
package main

import (
    "fmt"
    "time"
)

func main() {
    // Spawn ten goroutines, each printing its own copy of i.
    for i := 0; i < 10; i++ {
        go func(i int) {
            fmt.Println(i)
        }(i)
    }
    // Crude synchronization: give the goroutines a second to finish.
    time.Sleep(1 * time.Second)
}


The Go runtime is responsible for the communication between the Go program and the hardware (via the OS), as demonstrated in the image below.

Before running the program

When the above code is compiled, the go func(...) statement is compiled into a call to a runtime function called runtime.newproc. This function is responsible for creating a new goroutine: a lightweight thread managed by the Go runtime. The code below sketches what the compiled program effectively does.

compiled_main.go
...
// Conceptual sketch of the lowered code, not valid Go source.
func main() {
    for i := 0; i < 10; i++ {
        runtime.newproc(0, func(i int) {
            fmt.Println(i)
        }, i)
    }
    time.Sleep(1 * time.Second)
}


After running the program

As you can see in the above image, goroutines operate in user space, not kernel space. The difference between the two is that kernel space is managed by the OS while user space is managed by the Go runtime. In kernel space, the OS manages threads, and operating at this level is expensive: creating threads and switching between them goes through the kernel. The Go runtime manages goroutines in user space, and as explained earlier, goroutines are lightweight threads.

All execution ultimately happens on kernel threads running on CPU cores. There can be more kernel threads than cores, but there can never be more simultaneously running threads than cores. The same applies to goroutines: there can be more goroutines than kernel threads.

This begs the question: how do we get the goroutines to run on the cores? In other words, how do we map goroutines onto kernel threads? This is where user space scheduling comes in.

The Scheduler

Now that we have a basic understanding of how the Go runtime works, we can take a look at the scheduler.

In computing, a scheduler is a program that decides which task should be executed next. The Go scheduler is responsible for scheduling goroutines onto kernel threads. There are various types of schedulers, but the one we are interested in is the user space scheduler, which schedules tasks managed in user space onto resources provided by the kernel.

The M:N Scheduler

There are different types of user space schedulers, but we are interested in the one implemented by the Go runtime: an M:N scheduler, which maps N goroutines onto M kernel threads.

The advantage of such a scheduler is its flexibility: it allows far more goroutines to exist than kernel threads.

M:N scheduler

States of a Goroutine

A goroutine can be in one of three broad states: runnable (ready and waiting to be scheduled), running (executing on a kernel thread), or blocked (waiting on I/O, a channel, a lock, or a system call):

Goroutine states

How Does the Scheduler Keep Track of Goroutines?

The scheduler keeps track of goroutines in a run queue: a queue of goroutines that are ready to be executed. The scheduler is responsible for adding goroutines to the run queue and removing them from it. The run queue follows the FIFO (First In, First Out) principle.

Since multiple kernel threads access the same run queue, access to it must be synchronized. This is done using a mutex: a lock that serializes access to a shared resource, in our case the run queue. The scheduler uses the mutex to make sure that only one kernel thread can access the run queue at a time.

Mutexed run queue

This is how the scheduler was implemented in Go 1.0. The problem with such an implementation is that it does not scale: with many goroutines and many threads, the scheduler spends much of its time locking and unlocking the mutex around the single run queue, which slows down both scheduling and the goroutines themselves.
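
To make this concrete, here is a minimal sketch of such a mutex-protected FIFO run queue. The G and RunQueue types, and every name in it, are illustrative, not the runtime's actual implementation:

runqueue_sketch.go
package sched

import "sync"

// G stands in for a goroutine descriptor (the runtime's g struct).
type G struct{ id int }

// RunQueue is a FIFO queue guarded by a single mutex, similar in
// spirit to the global run queue in Go 1.0.
type RunQueue struct {
    mu sync.Mutex
    gs []*G
}

// Put appends a runnable goroutine to the tail of the queue.
func (q *RunQueue) Put(g *G) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.gs = append(q.gs, g)
}

// Get pops the goroutine at the head of the queue, or nil if it is empty.
func (q *RunQueue) Get() *G {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.gs) == 0 {
        return nil
    }
    g := q.gs[0]
    q.gs = q.gs[1:]
    return g
}

Every Put and Get takes the same lock, so the more threads there are, the more they queue up behind each other.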

Before coming back to this, let's explore the concept of fairness.

Fairness

In the context of the scheduler, fairness is the ability of the scheduler to give each goroutine a fair chance to be executed. The scheduler should not give more time to goroutines that are more active than others, because that leads to starvation of the less active goroutines.

Think of a program with a goroutine that is constantly reading from a channel. That goroutine is constantly active and constantly re-added to the run queue, which can starve the other, less active goroutines.

main.go
package main

import (
    "fmt"
    "sync"
)

var w sync.WaitGroup

func main() {
    w.Add(1)
    go func() {
        for {
            fmt.Println("Hello")
        }
    }()
    go func() {
        for {
            fmt.Println("World")
        }
    }()
    // w.Done() is never called, so Wait blocks forever and keeps
    // both goroutines running.
    w.Wait()
}


If the above program were run on a runtime without fairness, the goroutine that prints Hello could keep getting re-added to the run queue and monopolize execution, starving the goroutine that prints World.

How Fairness was Achieved in the Go Scheduler

Preemption

Another important concept in the scheduler is preemption: the ability of the scheduler to preempt (stop) a goroutine that is currently running and execute another one. This ensures that long-running goroutines don't consume all the available time and that other goroutines get a chance to be executed.

The scheduler uses a time slice to determine when to preempt a goroutine. The time slice is 10ms: if a goroutine has been running for more than 10ms, the scheduler will preempt it and execute another goroutine. The 10ms is a soft limit, because strictly enforcing it would carry a performance cost.

A preempted goroutine is added to the end of the global run queue, which is a FIFO queue. This means the goroutine will be executed after all the other goroutines already in the global run queue.
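
A quick way to see preemption in action. This sketch assumes Go 1.14 or newer, where the runtime preempts goroutines asynchronously; on older versions, the busy loop below would never yield the single processor and the program would hang:

preemption.go
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // Force a single processor so both goroutines compete for it.
    runtime.GOMAXPROCS(1)

    go func() {
        for {
            // Busy loop with no function calls: nothing here
            // voluntarily yields to the scheduler.
        }
    }()

    time.Sleep(100 * time.Millisecond)
    // The runtime preempts the busy loop, so main still gets to run.
    fmt.Println("still scheduled")
}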

Local Run Queues

Local run queues were introduced to address this. A local run queue is a run queue that is local to a kernel thread, so each kernel thread has its own. This tackles the contention problem, because each kernel thread mostly works with its own run queue instead of competing with every other thread for a single shared one.

Local run queues introduced a concept called per-thread state.

Local run queue

Each local run queue has a fixed capacity of 256 goroutines. If a local run queue is full, new goroutines are added to the global run queue, which is shared between all the kernel threads.
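
A sketch of that enqueue path, reusing the illustrative G and RunQueue types from earlier; the real runtime's runqput also moves half of the local queue to the global queue on overflow, which is omitted here:

runqput_sketch.go
// localQueueCap mirrors the 256-slot local run queue.
const localQueueCap = 256

// putRunnable tries the local queue first and overflows to the
// global queue when the local one is full. A buffered channel
// stands in for the runtime's fixed-size ring buffer.
func putRunnable(local chan *G, global *RunQueue, g *G) {
    select {
    case local <- g: // room in the local queue
    default: // local queue is full
        global.Put(g)
    }
}

Here local would be created as make(chan *G, localQueueCap).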

M:P:N Threading Model

With the introduction of per-thread state, if a thread blocks in a system call, it no longer makes sense for that thread to hold on to its local run queue; those goroutines have to be run elsewhere.

Also, if we were to implement work stealing between threads to re-balance the workload, we would have a problem: the number of kernel threads is unbounded, since new threads keep being created as others block.

To deal with this, the scheduler was changed to use an M:P:N threading model, where M is the number of kernel threads, P the number of processors, and N the number of goroutines. A processor (P) is in charge of managing the interaction between its local run queue, the global run queue, and the kernel threads.

The number of processors is set by GOMAXPROCS, which defaults to the number of CPU cores.
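
You can inspect and change this from code; a small, runnable example:

gomaxprocs.go
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // GOMAXPROCS(0) queries the current value without changing it.
    fmt.Println("Ps:", runtime.GOMAXPROCS(0))
    fmt.Println("CPU cores:", runtime.NumCPU())

    // Setting a new value returns the previous one.
    prev := runtime.GOMAXPROCS(4)
    fmt.Println("previous:", prev)
}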

M:P:N scheduler

This threading model is important when exploring the concepts of work stealing and handoff.

Work Stealing

The modern Go scheduler uses a technique called work stealing to balance the load between processors. The idea behind work stealing is that if a processor is idle, it will steal work from another processor, the global run queue, or the network poller. When stealing from another processor, it takes half of the goroutines in that processor's run queue and adds them to its own.

Before a processor starts stealing work, it first checks its own run queue. If the local run queue is not empty, it pulls a goroutine from it and executes it. It only consults the global run queue once in every 61 scheduler ticks: a counter tracks how many times the processor has scheduled from its local queue, and at every multiple of 61 it checks the global run queue instead. This avoids the overhead of checking the global queue on every scheduling decision while still preventing starvation of the goroutines in it.

The value is set to 61 so that it is not too low (which would mean checking the global run queue excessively often) and not too high (which would starve the goroutines in it). Another reason is that 61 is prime, and, as in hashing, a prime period avoids falling into lockstep with periodic patterns in application behavior. A round number like 64 would be far more likely to resonate with such patterns, interfering with the runtime's scheduling and resulting in unfairness.
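
A sketch of that tick logic, continuing with the illustrative types from earlier (the field and method names are made up, not the runtime's):

findrunnable_sketch.go
// P is a processor with its own local run queue.
type P struct {
    schedTick uint32
    local     *RunQueue // this P's local run queue
    global    *RunQueue // the shared global run queue
}

// findRunnable looks for the next goroutine to run.
func (p *P) findRunnable() *G {
    p.schedTick++
    // Once every 61 ticks, check the global queue first so the
    // goroutines parked there are not starved.
    if p.schedTick%61 == 0 {
        if g := p.global.Get(); g != nil {
            return g
        }
    }
    if g := p.local.Get(); g != nil {
        return g
    }
    if g := p.global.Get(); g != nil {
        return g
    }
    // Otherwise: poll the network, then try to steal (not shown).
    return nil
}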

Idle processor 2

As you can see in the above diagram, processor 2 is idle. It will check its run queue and find that it is empty. It will then check the global run queue for goroutines.

Stealing from the global run queue

The diagram shows processor 2 taking 2 goroutines from the global run queue. To keep things fair, a processor only takes its share: the number of goroutines in the global queue divided by the number of processors. With 3 processors and 6 goroutines in the global run queue, each processor takes 2 goroutines.
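
As a sketch, that share works out roughly like this; note that the actual runtime (globrunqget) computes approximately length/GOMAXPROCS + 1 and caps the batch size, so the arithmetic here is simplified to match the example above:

grabshare_sketch.go
// grabShare returns how many goroutines a processor takes from the
// global run queue: its per-processor share, and at least one if
// any work exists at all.
func grabShare(queueLen, numProcs int) int {
    n := queueLen / numProcs
    if n == 0 && queueLen > 0 {
        n = 1
    }
    return n
}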

Network Poller

The network poller is another component of the Go runtime; it is responsible for handling asynchronous system calls such as network I/O. For instance, if a goroutine is waiting for a network request to complete, the goroutine is parked in the network poller instead of blocking a kernel thread.

If the global run queue is also empty, the processor will then have to poll the network.

Polling the network

If the network poller has a goroutine that is ready to run, the processor will add it to its local run queue.

Cases where the global run queue, local run queue, and network poller are empty

If both the global run queue and the local run queue are empty and the network poller has nothing ready, the processor randomly picks a processor (possibly itself) to steal work from. If it finds a processor with runnable goroutines, it takes half of them and adds them to its own run queue.

Local run queue, global run queue, and network poller all empty

As you can see in the above diagram, processor 2 is idle. It checks all of the run queues, including the network poller, and finds them all empty. It then randomly picks a processor to steal work from.

Stealing from another processor

Processor 2 randomly picked processor 1 to steal work from. It takes half of the work from processor 1's run queue and adds it to its own. Now no processor is idle.
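
A sketch of the stealing step, again reusing the mutexed RunQueue from earlier; the real runtime does this with lock-free ring buffers rather than locks:

steal_sketch.go
// stealHalf moves half of the victim's run queue to the thief.
func stealHalf(victim, thief *RunQueue) {
    victim.mu.Lock()
    n := len(victim.gs) / 2
    stolen := append([]*G(nil), victim.gs[:n]...)
    victim.gs = victim.gs[n:]
    victim.mu.Unlock()

    thief.mu.Lock()
    thief.gs = append(thief.gs, stolen...)
    thief.mu.Unlock()
}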

From the runtime code, this is roughly how the scheduler is implemented:

runtime/proc.go
...
// schedule is the scheduler's main loop, paraphrased:
func schedule() {
    // Only once in every 61 ticks, check the global run queue
    // for a runnable goroutine.
    // If none is found, try the local run queue.
    // If that is empty:
    //     try to steal from other processors;
    //     if there is nothing to steal, try to poll the network.
}


Fairness Hierarchy

The Go runtime uses a hierarchy of fairness mechanisms to ensure that goroutines are not starved. The hierarchy is as follows:

  1. Goroutine - preemption.
  2. Local run queue - time slice inheritance.
  3. Global run queue - check once in a while.
  4. Network poller - background thread.

All these steps result in fairness at a minimal cost.

Handoff

We have seen how goroutines end up running on kernel threads, but what happens when a thread itself blocks? For instance, if a goroutine makes a blocking system call, it blocks its kernel thread, and the other goroutines in that thread's local run queue are starved.

In a syscall

When this happens, the Go runtime calls releasep to release the processor, disassociating it from the blocked kernel thread. The processor is then handed to another kernel thread: either one that is already available, or a newly created one.

Handoff

This process is referred to as handoff.

Handoffs can be expensive, especially when a new kernel thread has to be created to associate with the processor. Doing a handoff for every syscall is also not optimal, as some syscalls are short-lived and the kernel thread might not be blocked for long. To optimize this, the Go runtime handles handoff in a slightly more intelligent way, deferring it when a syscall is expected to be quick and reclaiming the processor only if the thread stays blocked.

What happens when you return from a handed-off syscall?

When a goroutine returns from a syscall, the scheduler checks whether the old processor (the one that was handed off) is available. If it is, the goroutine re-associates itself with it. If the old processor is not available, the goroutine associates itself with any idle processor. If there are no idle processors, the goroutine is added to the global run queue, and the kernel thread that was in the syscall is parked.

Runtime APIs

The Go runtime provides a few APIs that can be used to influence the scheduler. I'd recommend that you not use these APIs unless it is necessary. Among them:

  1. runtime.GOMAXPROCS - sets the number of processors (Ps).
  2. runtime.Gosched - yields the processor, letting other goroutines run.
  3. runtime.Goexit - terminates the calling goroutine.
  4. runtime.NumGoroutine - returns the number of live goroutines.
  5. runtime.LockOSThread / runtime.UnlockOSThread - pin the calling goroutine to its kernel thread.
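
For example, runtime.Gosched explicitly yields; a small, runnable illustration:

gosched.go
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // A single processor makes the interleaving easy to see.
    runtime.GOMAXPROCS(1)

    done := make(chan struct{})
    go func() {
        fmt.Println("goroutine runs")
        close(done)
    }()

    // Yield the processor so the goroutine above gets to run first.
    runtime.Gosched()
    <-done
    fmt.Println("main resumes")
}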

We’re done!

There it is, folks! I know this was a long one. Though I did not cover everything, I hope this was a good introduction to the Go scheduler. I will be writing more posts on the Go runtime in the future, so stay tuned.

Further Reading

Dmitry Vyukov - Go scheduler: Implementing a language with lightweight concurrency

Madhav Jivrajani - Queues, Fairness, and The Go Scheduler

JBD - Go's work stealing scheduler

Kavya Joshi - The Scheduler Saga
