Working on a High-Performance Golang Client Library — Remove the Bad Busy Loops With the Sync.Cond
Use sync.Cond with and without the sync.Mutex

This is the third topic in the series about the common tips I learned from building rueidis, a high-performance Golang Redis client library.
I think those tips are worth sharing because they can also be useful for daily Golang programming:
- Part 1 — Batching on pipeline
- Part 2 — Reading again from channels?
- Part 3 — Remove the bad busy loops with the sync.Cond
- Part 4 — Gracefully close channels from concurrent writers
In the previous Part 2, we achieved better throughput for pipelining request/response communication with our custom ring queue, compared to the double-channel approach.
However, there are two places where the custom ring queue uses busy waiting:
- The EnqueueRequest uses a busy loop to wait for the slot to be available.
- The writing goroutine calls NextRequestToSend in a busy loop because it doesn’t have blocking behavior, while a Golang channel does.
In the following sections, I will cover:
- What problems do these busy loops have?
- Remove the bad busy loop with the sync.Cond and the sync.Mutex
- Minimize the bad busy loop with the sync.Cond but without the sync.Mutex
What problems do these busy loops have?
Golang is known for making concurrent programming easy, and its runtime scheduler does a great job of scheduling goroutines onto the processes of the operating system. But the real concurrency of a Go program is still limited by the CPU cores you have.
A busy loop basically occupies one CPU core. Furthermore, if the loop takes an unpredictable amount of time to complete, that core can hardly do other useful work during that time, which leads to bad overall performance.
That is the case with our custom ring queue: the EnqueueRequest will loop until the ring slot is available, but the ring slot becomes available only after its previous response has been processed. That is, we must have already sent out the previous request and received its response from the server, and most importantly, how long that will take is unknown.
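In code, the Part 2 EnqueueRequest looked roughly like this (a simplified, reconstructed sketch: the slot fields, state constants, and ring layout are illustrative, not the exact rueidis code):

```go
import (
    "runtime"
    "sync/atomic"
)

const (
    slotEmpty uint32 = iota // the slot can be reused
    slotInUse               // a request is being written into the slot
    slotReady               // the request is ready for the writing goroutine
)

type slot struct {
    mark uint32
    cmd  []string
    ch   chan []string
}

type ring struct {
    slots []slot // length is a power of two
    mask  uint64 // len(slots) - 1
    write uint64 // claimed by EnqueueRequest callers
    read  uint64 // advanced only by the writing goroutine
}

// EnqueueRequest claims the next slot and spins until that slot is free.
func (r *ring) EnqueueRequest(cmd []string) chan []string {
    n := &r.slots[atomic.AddUint64(&r.write, 1)&r.mask]
    // Busy loop: the slot is freed only after its previous response has been
    // received from the server, and how long that takes is unknown.
    for !atomic.CompareAndSwapUint32(&n.mark, slotEmpty, slotInUse) {
        runtime.Gosched()
    }
    n.cmd = cmd
    atomic.StoreUint32(&n.mark, slotReady)
    return n.ch
}
```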
Similarly, our writing goroutine just keeps looping and calling NextRequestToSend, but when the user will make requests is also unknown:
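(A rough sketch, continuing the snippet above; NextRequestToSend is the non-blocking method from Part 2, and writeToConn is just a placeholder for flushing a command to the connection.)

```go
// NextRequestToSend returns the next ready command without blocking.
func (r *ring) NextRequestToSend() (cmd []string, ok bool) {
    n := &r.slots[(r.read+1)&r.mask]
    if atomic.LoadUint32(&n.mark) == slotReady {
        r.read++
        return n.cmd, true
    }
    return nil, false
}

// writing drives the connection by polling the ring in a busy loop.
func (r *ring) writing(writeToConn func(cmd []string)) {
    for {
        // Busy loop: even when no user is making requests, this loop keeps
        // spinning and occupies one CPU core.
        if cmd, ok := r.NextRequestToSend(); ok {
            writeToConn(cmd)
        }
    }
}
```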
The writing goroutine will just keep occupying one of your CPU cores, and the EnqueueRequest, in the worst case, will occupy all of them.
We can confirm the performance degradation by benchmarking the custom ring queue with higher parallelism settings.
As you can tell from the result, the custom ring queue performs dramatically worse than the channel approach when the parallelism goes up. That is because the competition among goroutines for the underlying OS processes also becomes fiercer.
Remove the bad busy loop with the sync.Cond and the sync.Mutex
For our EnqueueRequest, we need the ability to put a goroutine to sleep when the slot is not available and wake it up once the slot becomes available. This ability is like what a semaphore provides in other programming languages.
There are two recommended ways to get semaphore-like synchronization in Golang:
- golang.org/x/sync/semaphore
- sync.Cond
The former provides a complex weighted semaphore mechanism, implemented with a sync.Mutex and a linked list of channels, allowing users to Acquire and Release multiple signals at once. I put more about this in the appendix.
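For example, a minimal use of the weighted semaphore looks roughly like this (a generic sketch with a made-up doGuardedWork function, not code from rueidis):

```go
import (
    "context"

    "golang.org/x/sync/semaphore"
)

// sem allows at most 16 units to be held at once across goroutines.
var sem = semaphore.NewWeighted(16)

// doGuardedWork acquires one unit before running work and releases it after.
func doGuardedWork(ctx context.Context, work func()) error {
    // Acquire blocks until a unit is available or the context is cancelled.
    if err := sem.Acquire(ctx, 1); err != nil {
        return err
    }
    defer sem.Release(1)

    work()
    return nil
}
```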
The latter provides a much simpler interface with Wait and Signal methods, but requires users to prepare a sync.Locker to avoid racing on the condition.
Since our waiting condition is based on the slot state, which is external to the semaphore, the sync.Cond is a better fit for the EnqueueRequest.
We can add the sync.Cond into our slot and initialize it with a sync.Mutex:
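(A sketch of that, reusing the names from the earlier snippets; the real rueidis code may differ.)

```go
import "sync"

type slot struct {
    cond *sync.Cond // used to sleep on and to signal changes of mark
    mark uint32
    cmd  []string
    ch   chan []string
}

func newRing(size int) *ring {
    r := &ring{slots: make([]slot, size), mask: uint64(size - 1)}
    for i := range r.slots {
        // Each slot gets its own condition variable backed by a sync.Mutex.
        r.slots[i].cond = sync.NewCond(&sync.Mutex{})
        r.slots[i].ch = make(chan []string, 1)
    }
    return r
}
```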
And rewrite our custom ring queue with it:
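(Again a sketch: DeliverResponse is a made-up name for the step on the reading side that hands a response back and recycles its slot.)

```go
// EnqueueRequest now sleeps on the slot's condition variable instead of spinning.
func (r *ring) EnqueueRequest(cmd []string) chan []string {
    n := &r.slots[atomic.AddUint64(&r.write, 1)&r.mask]
    n.cond.L.Lock()
    for !atomic.CompareAndSwapUint32(&n.mark, slotEmpty, slotInUse) {
        n.cond.Wait() // sleep until the slot is recycled
    }
    n.cmd = cmd
    atomic.StoreUint32(&n.mark, slotReady)
    n.cond.L.Unlock()
    return n.ch
}

// DeliverResponse is called for the slot whose response has just been read
// from the server: it hands the response to the caller, frees the slot, and
// wakes up one goroutine that may be waiting for it.
func (r *ring) DeliverResponse(n *slot, resp []string) {
    n.ch <- resp
    n.cond.L.Lock()
    atomic.StoreUint32(&n.mark, slotEmpty)
    n.cond.L.Unlock()
    n.cond.Signal() // wake just one waiter of this slot
}
```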
We now put the EnqueueRequest to sleep if the slot is not available, and wake just one goroutine up with cond.Signal() when the previous slot response is delivered.
Now the benchmark result is better than that of the channel approach:
Then, we are going to deal with the busy loop in our writing goroutine.
Minimize the bad busy loop with the sync.Cond but without the sync.Mutex
In our new EnqueueRequest, there will be lock contention on a slot only when the ring keeps recycling.
But if we used the sync.Cond in the same way for our writing goroutine, every EnqueueRequest call would need to access the writing goroutine’s sync.Cond to check whether it is necessary to wake it up. That would undoubtedly cause lots of lock contention.
Fortunately, we don’t need a real sync.Locker in this case. We can slightly relax the sleeping condition of our writing goroutine and still keep it from occupying a CPU core all the time.
We let our writing goroutine go to sleep only when there are no in-flight EnqueueRequest calls, and we wake it up only on the next EnqueueRequest.
To do that, we use two atomic counters: waits and sleep.
We use the sleep counter to mark when the writing goroutine is going to sleep and when it is awakened. We increase the waits counter before entering the EnqueueRequest and decrease it after leaving. If the waits counter is 1 after we increase it, we then try to wake the writing goroutine up with cond.Broadcast().
It is important that we access the waits counter first and the sleep counter later in our makeRequest function, while, on the other hand, we access the sleep counter first and the waits counter later in the writing function. These reversed access sequences guarantee that we will not miss the chance to wake the goroutine up, as the sketch below shows.
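Putting the pieces together, the mechanism might look like the following sketch (an illustrative reconstruction, not the exact rueidis code: noopLock, client, and writeToConn are made-up names, and in this sketch the waits counter stays incremented until the response has been received):

```go
import (
    "sync"
    "sync/atomic"
)

// noopLock satisfies sync.Locker without doing any real locking; the atomic
// counters below protect the sleeping condition instead of a mutex.
type noopLock struct{}

func (noopLock) Lock()   {}
func (noopLock) Unlock() {}

type client struct {
    ring  *ring
    cond  *sync.Cond
    waits int32 // number of in-flight requests
    sleep int32 // set to 1 while the writing goroutine is going to sleep
}

func newClient(r *ring) *client {
    return &client{ring: r, cond: sync.NewCond(noopLock{})}
}

func (c *client) makeRequest(cmd []string) []string {
    // Access the waits counter first, then the sleep counter.
    if atomic.AddInt32(&c.waits, 1) == 1 {
        // A short busy loop: keep broadcasting until the writing goroutine
        // has actually woken up and cleared the sleep flag.
        for atomic.LoadInt32(&c.sleep) == 1 {
            c.cond.Broadcast()
        }
    }
    resp := <-c.ring.EnqueueRequest(cmd)
    atomic.AddInt32(&c.waits, -1)
    return resp
}

func (c *client) writing(writeToConn func(cmd []string)) {
    for {
        if cmd, ok := c.ring.NextRequestToSend(); ok {
            writeToConn(cmd)
            continue
        }
        // Access the sleep counter first, then the waits counter: the reverse
        // of makeRequest, so the wakeup cannot be missed.
        atomic.StoreInt32(&c.sleep, 1)
        if atomic.LoadInt32(&c.waits) == 0 {
            c.cond.Wait() // sleep until the next makeRequest broadcasts
        }
        atomic.StoreInt32(&c.sleep, 0)
    }
}
```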
We still use a busy loop in our makeRequest function to wake the writing goroutine up, but this busy loop is better than the previous ones because we know it will complete shortly.
Putting the writing goroutine to sleep does add some overhead. However, it will no longer occupy and waste one of your CPU cores while there is no request to send.
UPDATE: It is actually possible to remove the busy loop completely. Check out https://github.com/rueian/rueidis/pull/9/files to see how to do it.
What’s next: Closing a channel that has concurrent writers
The final piece of a thread-safe client library is probably the problem of how to close it. In the final post, I will share how rueidis handles it gracefully.