A Study of Concurrency Bugs in Real-World Go Projects

Today I read a paper about real-world concurrency bugs in Go. Here are my notes on it, as a first step toward learning concurrent programming in Go.

Link

Recently, I came across a paper in one of the newsletters I subscribe to. The paper, from researchers at Pennsylvania State University and Purdue University, presents the first systematic study of concurrency bugs in six major open-source Go projects. The researchers examined the commit histories of Docker, Kubernetes, etcd, gRPC, CockroachDB, and BoltDB, and drew several interesting conclusions.

Research Method

The study focused on concurrency-related bugs. The researchers searched the commit histories of these projects for keywords such as “race,” “deadlock,” and “synchronization,” as well as Go-specific synchronization primitives like “context,” “once,” and “WaitGroup.” From the matching commits they identified fixes for synchronization bugs, and they replayed and reproduced some of the bugs. Each bug was classified as either “blocking” or “non-blocking.”

[Figure: Number and types of bugs in different projects]

Blocking Bugs

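Blocking bugs are bugs where one or more goroutines are blocked, leading to a partial or global deadlock. These bugs typically arise from circular dependencies. In the example below, goroutine 1 blocks on a channel send while holding a mutex, and goroutine 2 blocks on that same mutex before it can reach the receive: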

// goroutine 1
  func goroutine1() {
      m.Lock()
-     ch <- request // blocks while still holding m
+     select {
+     case ch <- request:
+     default: // no receiver ready: skip the send instead of blocking
+     }
      m.Unlock()
  }

// goroutine 2
func goroutine2() {
    for {
        m.Lock()   // blocks while goroutine 1 holds m
        m.Unlock()
        request := <-ch // receive; the request would be handled here
    }
}
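The patched send can be tried out in isolation. The following is a minimal runnable sketch, not the project's actual code: `trySend` is a hypothetical wrapper around the patched body of goroutine1, and the `int` request type is an assumption. No receiver is running, which is the situation in which the unpatched code would hang.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	m  sync.Mutex
	ch = make(chan int) // unbuffered: a plain send blocks until someone receives
)

// trySend (hypothetical name) mirrors the patched goroutine1. It holds the
// mutex, but select with a default case makes the send non-blocking, so the
// lock is always released. It reports whether the request was delivered.
func trySend(request int) bool {
	m.Lock()
	defer m.Unlock()
	select {
	case ch <- request:
		return true
	default:
		return false // no receiver ready: the request is dropped
	}
}

func main() {
	// With no receiver, the unpatched `ch <- request` would block here
	// forever while holding m. The patched version returns immediately.
	fmt.Println("delivered:", trySend(1))

	// The mutex was released, so it can still be acquired.
	m.Lock()
	fmt.Println("lock acquired after trySend")
	m.Unlock()
}
```

The trade-off is that a request is silently dropped when no receiver is ready, so this style of fix only suits cases where losing a message is acceptable.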

In the authors’ study, the proportion of blocking bugs caused by message passing was even higher than the proportion caused by traditional shared-memory synchronization such as mutexes. Furthermore, there are currently no mature detection methods for such bugs.

“Overall, we found that there are around 42% blocking bugs caused by errors in protecting shared memory, and 58% are caused by errors in message passing. Considering that shared memory primitives are used more frequently than message passing ones (Section 3.2), message passing operations are even more likely to cause blocking bugs.”

Personal opinion: Possible reasons include unfamiliarity with the newer channel-based synchronization primitives, or an overreliance on channels that makes developers less vigilant about bugs. In short, channels are powerful, but they are not a panacea for synchronization problems.

Non-blocking Bugs

Non-blocking bugs include data races caused by inadequate protection of shared memory, as well as problems unique to Go, such as goroutine leaks caused by delayed sends or receives on channels.

Here’s an example where the original code uses select with a default case. When several goroutines run it at the same time, the default case can unintentionally execute more than once. The fix replaces the select entirely with a Once:

// When multiple goroutines execute the following code, default
// can execute multiple times, closing the channel more than once,
// which leads to a panic in the Go runtime.

- select {
- case <-c.closed:
-     // do something
- default:
+ Once.Do(func() {
      close(c.closed)
+ })
- }
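The original check is racy because two goroutines can both find the channel still open, both fall through to default, and both call close. A minimal runnable sketch of the Once-based fix follows, assuming a hypothetical `conn` type that carries the `closed` channel and its own `sync.Once` (the names other than `closed` are not from the paper):

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical type; only the closed channel appears in the
// example above.
type conn struct {
	closed chan struct{}
	once   sync.Once
}

func newConn() *conn { return &conn{closed: make(chan struct{})} }

// Close may be called from many goroutines. sync.Once guarantees that
// close(c.closed) runs exactly once, so a double close cannot happen.
func (c *conn) Close() {
	c.once.Do(func() { close(c.closed) })
}

func main() {
	c := newConn()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Close()
		}()
	}
	wg.Wait()
	<-c.closed // a receive on a closed channel returns immediately
	fmt.Println("closed exactly once")
}
```

Keeping the Once next to the channel it guards, in the same struct, makes it harder to close the channel through some other path by accident.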

Additionally, the paper examines the strategies used to fix each type of bug and offers recommendations for the development of future bug detection tools.

Conclusions and Reflections

Go channels provide powerful concurrency patterns, but they are not a panacea. The study found that message passing causes a higher proportion of blocking bugs than shared memory, and there are currently no well-established detection methods for them. Concurrency is inherently complex; language features can reduce that complexity, but they cannot make the problems go away.
In Go programs, shared-memory synchronization (e.g., classic lock/unlock) is still used more often than message passing.
Furthermore, misuse of channels can lead to performance issues (see this article).
The authors observe that many Go bugs follow similar patterns, which suggests that static analysis tools targeting specific bug patterns could be developed.
Personal opinion: When writing Go code, whether using synchronization locks or channels, it is advisable to keep the synchronized code concise, clear, and easy to verify and inspect in order to reduce bugs.
The Go toolchain includes a runtime deadlock detector and a data race detector (enabled with the -race flag), but they miss many cases. More in-depth bug research could use debuggers like gdb and runtime profiling tools like pprof (see this article and the official documentation).
There is further discussion of this paper on Hacker News (link).
Commit messages are not only helpful for tracing bugs, but also for systematically analyzing historical bugs in the future. (Remember not to write “asdfasdf” in Git commit messages, kids!)