Performance analysis of goroutine switching

Introduction

In the previous article, we verified experimentally that the context switch overhead of Linux processes and threads is approximately 3-5 microseconds. That is not much in isolation, but massively concurrent internet servers differ from typical computer programs. Their characteristics are as follows:

High concurrency: Thousands to tens of thousands of user requests need to be processed per second.
Short cycles: The processing time per user should be as short as possible, often in the millisecond range.
High network I/O: Often requires network I/O from other machines, such as Redis, MySQL, etc.
Low computation: General CPU-intensive operations are not frequent.

This article was first published on Medium as part of the MPP plan. If you are a Medium user, please follow me on Medium. Thank you very much.

Even at 3-5 microseconds per switch, the overhead can noticeably degrade performance when the volume of context switches is particularly high. The Apache web server, a typical software product of this model, suffered from exactly this. (In fact, Linux was designed as a general-purpose operating system rather than one built specifically for high-concurrency server-side applications.)

To avoid frequent context switches, there is another development model: asynchronous and non-blocking. A single process or thread handles a large number of user requests, and I/O multiplexing keeps it from blocking, which saves the context switch overhead. Nginx and Node.js are typical representatives of this model. Frankly speaking, in terms of execution efficiency this model is the most machine-friendly and the most efficient (better than the coroutine model discussed below), which is a large part of why Nginx has displaced Apache as the preferred web server. The problem with this model is that it is unfriendly to developers. It is overly mechanical and departs from the original intent behind the process abstraction: the natural, linear way humans think is broken up, application developers are forced to write code in a way that does not match how they reason, and debugging becomes extremely difficult.

So some smart people kept brainstorming at the application layer and designed “threads” that need no process/thread context switch: coroutines. Handling high-concurrency workloads with coroutines keeps the original intent of the process abstraction, lets developers write their business logic in a normal linear style, and avoids the expensive process/thread context switch. In that sense, coroutines are a good patch for the Linux process model when processing massive numbers of requests.
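To make the "linear thinking" point concrete, here is a minimal sketch of one goroutine per request. The names handleRequest and serveAll are hypothetical, and the Sleep only stands in for a blocking network call; this is an illustration under those assumptions, not code from the original test.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// handleRequest is a hypothetical handler written as plain sequential code.
// The Sleep stands in for a blocking network call (Redis, MySQL, ...); while
// this goroutine waits, the Go runtime can run other goroutines.
func handleRequest(id int) string {
	time.Sleep(time.Millisecond) // simulated network I/O
	return fmt.Sprintf("response-%d", id)
}

// serveAll starts one goroutine per request and collects the responses.
func serveAll(n int) []string {
	responses := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			responses[id] = handleRequest(id) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()
	return responses
}

func main() {
	start := time.Now()
	responses := serveAll(1000)
	fmt.Println(len(responses), "requests handled in", time.Since(start))
}
```

The handler reads top to bottom like ordinary blocking code, with no callbacks or explicit state machine, yet the simulated waits of all requests overlap.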

With the background covered, my point is this: although coroutines are a lightweight abstraction, they still incur some extra cost. So let’s look at how small that cost is.

Coroutine Overhead Test

This article is based on Go 1.22.1.

1. Coroutine Context Switch CPU Overhead
The test has two coroutines continuously yield the CPU to each other. The core code is as follows.

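package main

import (
    "fmt"
    "runtime"
    "time"
)

// switches is the number of times each goroutine yields the CPU.
const switches = 1000000

// cal yields the CPU in a loop so the scheduler switches back to main.
func cal() {
    for i := 0; i < switches; i++ {
        runtime.Gosched()
    }
}

func main() {
    // Use a single logical processor so the two goroutines must take turns.
    runtime.GOMAXPROCS(1)
    start := time.Now()
    fmt.Println(start)
    go cal()
    for i := 0; i < switches; i++ {
        runtime.Gosched()
    }

    // Both goroutines yield `switches` times, so average over 2 * switches.
    fmt.Println(time.Since(start) / (2 * switches))
}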

Compilation and execution

➜  trace git:(main) ✗ go run main.go              
2024-03-20 19:52:24.772579 +0800 CST m=+0.000114834
54ns
➜  trace git:(main) ✗ 

The average cost of each coroutine switch is 54ns. That is roughly 1/65 of the approximately 3.5 microsecond context switch overhead measured in the previous article, and it is also lower than the cost of a system call.

2. Coroutine Memory Overhead
In terms of space, a coroutine is given a 2KB stack when it is created, and the Go runtime grows that stack on demand. A thread's stack is far larger; its size can be checked with the ulimit command and is usually several megabytes. On my Mac it is 8MB. If one coroutine is created per user request, 2GB of stack memory is enough for 1 million concurrent requests (as long as the stacks stay at their initial size), while the thread model would need 8TB.

➜  trace git:(main) ✗ ulimit -a   
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       2666
-n: file descriptors                12544
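The 2KB figure can be checked roughly from inside a program. The sketch below parks a batch of goroutines and divides the growth of runtime.MemStats.StackInuse by their count; the helper name stackBytesPerGoroutine is hypothetical, and the measured value is only an approximation because the runtime allocates stacks in spans. It also repeats the arithmetic for 1 million requests.

```go
package main

import (
	"fmt"
	"runtime"
)

// stackBytesPerGoroutine parks n goroutines and reports how much the
// runtime's in-use stack memory grew per goroutine.
func stackBytesPerGoroutine(n int) float64 {
	var before, after runtime.MemStats
	stop := make(chan struct{})
	ready := make(chan struct{})

	runtime.GC() // release stacks left over from earlier goroutines
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		go func() {
			ready <- struct{}{} // signal that this goroutine has started
			<-stop              // then park until released
		}()
	}
	for i := 0; i < n; i++ {
		<-ready
	}
	runtime.ReadMemStats(&after)
	close(stop) // release every parked goroutine

	grown := float64(after.StackInuse) - float64(before.StackInuse)
	return grown / float64(n)
}

func main() {
	fmt.Printf("stack per goroutine: about %.0f bytes\n", stackBytesPerGoroutine(100_000))

	// The arithmetic for 1 million concurrent requests.
	const requests = 1_000_000
	const goroutineStack = 2 << 10 // 2KB initial goroutine stack
	const threadStack = 8 << 20    // 8MB thread stack, as reported by ulimit
	fmt.Printf("goroutines: %.1f GiB\n", float64(requests*goroutineStack)/(1<<30))
	fmt.Printf("threads:    %.1f TiB\n", float64(requests*threadStack)/(1<<40))
}
```

In binary units the two totals come to about 1.9 GiB and 7.6 TiB, which round to the 2GB and 8TB quoted above.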

Conclusion

Because coroutines complete their context switches in user space, a switch takes only slightly over 50ns, roughly 65 times faster than a process switch. The stack memory a single coroutine needs is also small, starting at only 2KB. This is why coroutines have shone in high-concurrency backend internet applications in recent years.

In both space and time they perform much better than processes (threads). Then why doesn’t Linux implement them in the operating system? For the sake of better real-time behavior, the operating system can take the CPU away from a running process in favor of one with higher priority. Classic coroutines cannot do this: they rely on the coroutine currently using the CPU to release it voluntarily, which is at odds with what an operating system is designed to do. (Go's runtime has preempted long-running goroutines on its own since Go 1.14, but that happens inside the runtime, not in the kernel scheduler.) The efficiency of coroutines therefore comes at the cost of sacrificing preemption.

Coroutines ultimately run on top of operating system threads.

A question we need to consider is:
Does using coroutines mean that threads no longer switch? The frequency of thread switches basically depends on the number of threads. With coroutines, you have to assign work to each thread yourself, so for the same workload the number of threads needed with coroutines should always be higher than with an automatically sized thread pool.
Therefore:
Using threads = thread switch overhead (low)
Using coroutines = thread switch overhead (high) + coroutine switch overhead

Then CPU overhead:
Instruction cycles of threads = interrupt detection + instruction execution (including fetch, decode, and execute)
Instruction cycles of coroutines = interrupt detection + instruction execution + interrupt detection + coroutine signal detection

So, I have the following conclusion:
In terms of performance, IO multiplexing + thread pool completely outperforms coroutines; but in terms of convenience, coroutines are still easier to use.

Because starting a coroutine in Go is so convenient, some Go programmers use the go keyword casually. Keep in mind that a coroutine must be created before it can be switched to. With creation added to the scheduling overhead, the cost rises to about 400ns, which is almost the same as a system call. Coroutines are efficient, but they should not be used casually.