Demystifying the Go atomic Package: Addressing Concurrency Issues


Cache Consistency Issues

Issues like this are common in concurrent programming: when two goroutines read and write the same variable simultaneously, a data race arises. Take, for example, the following code:

This article was first published in the Medium MPP plan.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var a int
	count := 1000000
	wg := sync.WaitGroup{}

	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(wg *sync.WaitGroup) {
			for i := 0; i < count; i++ {
				a++ // data race: a++ is not an atomic operation
			}
			wg.Done()
		}(&wg)
	}
	wg.Wait()
	fmt.Println(a)
}

The result of this code is nondeterministic, but it is almost certainly not the desired 5000000; it will very likely be a much smaller number. Why does this happen?

In modern computer architecture, multi-core processors have multiple levels of cache, and each core has its own. These caches hold recently used data to speed up access. When a thread writes to a variable, it first loads a copy of the variable from memory into its cache, modifies that copy, and then writes the modified value back to memory. During this window, other threads may modify the same variable without this thread being aware of it; once they do, the value in this thread's cache is stale. This is the cache consistency problem: data inconsistencies arise because the caches of multiple cores are not kept in sync with one another at every instant.

How to Solve Cache Consistency Issues

In Section 8.2.5 of the Intel Software Developer's Manual, we find the following explanation:

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

From this description, we understand that the LOCK prefix and the XCHG instruction provide strong consistency guarantees for read-modify-write operations on memory, even when the target is cached: instructions following a LOCKed instruction execute only after the LOCKed instruction completes. The manual also tells us that on modern CPUs, a LOCK operation does not simply lock the communication bus between the CPU and main memory; Intel implements it at the cache level ("cache locking"). Therefore, we need not worry much about the cost of executing LOCK.

The atomic package in Go mainly guarantees three things: atomicity, visibility, and ordering. Looking at the atomic API in the Go source code, it mainly comprises the Swap, CAS, Add, Load, Store, and Pointer families. The corresponding assembly instructions on amd64 CPUs are as follows:

  • Swap: mainly XCHGQ instruction
  • CAS: mainly LOCK CMPXCHGQ instruction
  • Add: mainly LOCK XADDQ instruction
  • Load: mainly MOVQ (Load64) instruction
  • Store: mainly XCHGQ instruction
  • Pointer: mainly treated as a 64-bit int, calling the aforementioned related methods.

So, we can use atomic to solve this problem, as shown in the following code:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var a int64
	count := 1000000
	wg := sync.WaitGroup{}

	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(wg *sync.WaitGroup) {
			for i := 0; i < count; i++ {
				atomic.AddInt64(&a, 1) // atomic read-modify-write
			}
			wg.Done()
		}(&wg)
	}
	wg.Wait()
	fmt.Println(a)
}

Now we can get the correct result.

5000000
Program exited.

Expansion: How to Implement a Lock-Free Queue Using Atomic

https://gist.github.com/hxzhouh/cfa8571f5c8d70422a37fdf9bd395d91

Through these examples, we have covered the basic operations of the atomic package. Building on atomic operations, we can go deeper into Go concurrent programming.
