
Working on High-Performance Golang Client Library — Remove the Bad Busy Loops With the Sync.Cond

Use sync.Cond with and without the sync.Mutex

Rueian
Published in Better Programming · 5 min read · Feb 21, 2022



This is the third post in a series of common tips I learned from building rueidis, a high-performance Golang Redis client library.

I think these tips are worth sharing because they can also be useful in daily Golang programming.

In Part 2, we achieved better throughput for pipelining request/response communication with our custom ring queue than with the double-channel approach.

However, there are two places where the custom ring queue uses busy waiting:

  1. EnqueueRequest uses a busy loop to wait for a slot to become available.
  2. The writing goroutine calls NextRequestToSend in a busy loop because, unlike a Golang channel, it has no blocking behavior.

In the following sections, I will cover:

  1. What problems do these busy loops have?
  2. Remove the bad busy loop with sync.Cond and a sync.Mutex
  3. Minimize the bad busy loop with sync.Cond without a sync.Mutex

What problems do these busy loops have?

Golang is known for making concurrent programming easy, and its runtime scheduler does a great job of scheduling goroutines onto operating system threads. But the real concurrency of a Go program is still limited by the CPU cores you have.

A busy loop basically occupies one CPU core. Furthermore, if the loop takes an uncertain amount of time to complete, that core cannot do other useful work for an uncertain amount of time, which leads to bad overall performance.

That is the case with our custom ring queue:

The EnqueueRequest will loop until the ring slot is available, but the slot becomes available only once we have processed its previous response. That is, we must already have sent out the previous request and received its response from the server, and, most importantly, how long that will take is unknown.
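The original gist isn't embedded here, so below is a minimal sketch of such a busy-waiting ring queue. The type layout, the mark states, and the string payloads are illustrative assumptions, not the exact rueidis code; recycling a slot back to free happens on the response-delivery path, which is not shown:

```go
package ring

import (
	"runtime"
	"sync/atomic"
)

const (
	free  uint32 = iota // slot is recycled and may accept a new request
	inUse               // slot holds a request whose response is pending
)

type slot struct {
	mark uint32      // slot state, manipulated atomically
	cmd  string      // the request to send
	ch   chan string // where the response will be delivered
}

type ring struct {
	slots []slot // length must be a power of two
	write uint64 // monotonically increasing write position
	mask  uint64 // len(slots) - 1, for cheap modulo
}

// EnqueueRequest claims the next slot and spins until it is free. The spin
// can last arbitrarily long, because the slot is recycled only after its
// previous response has come back from the server.
func (r *ring) EnqueueRequest(cmd string) chan string {
	s := &r.slots[atomic.AddUint64(&r.write, 1)&r.mask]
	for !atomic.CompareAndSwapUint32(&s.mark, free, inUse) {
		runtime.Gosched() // busy wait, occupying a CPU core
	}
	s.cmd = cmd
	return s.ch
}
```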

Similarly, our writing goroutine just keeps looping and calling NextRequestToSend, but when the user will make the next request is also unknown:
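Continuing the sketch above (NextRequestToSend is the method from Part 2; its body is elided here, and the io.Writer plumbing is an assumption):

```go
// NextRequestToSend returns the next unsent request, if any. Unlike
// receiving from a channel, it never blocks.
func (r *ring) NextRequestToSend() (cmd string, ok bool) {
	// ... scan for the next slot that is inUse but not yet sent ...
	return
}

// writing flushes requests to the connection. Because NextRequestToSend
// never blocks, this loop keeps spinning even when there is nothing to
// send, occupying a CPU core the whole time.
func (r *ring) writing(conn io.Writer) {
	for {
		if cmd, ok := r.NextRequestToSend(); ok {
			conn.Write([]byte(cmd)) // error handling elided in this sketch
		}
	}
}
```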

The writing goroutine will just keep occupying one of your CPU cores, and EnqueueRequest, in the worst case, will occupy all of them.

We can confirm the performance degradation by benchmarking the custom ring queue with higher parallelism settings.

Benchmark source code: https://gist.github.com/rueian/ffa36c008be14717732377a35a3956d0

As you can tell from the result, the custom ring queue performs dramatically worse than the channel approach as the parallelism goes up. That is because the goroutines' contention for the underlying OS threads also becomes heavier.

Remove the bad busy loop with sync.Cond and a sync.Mutex

For our EnqueueRequest, we need the ability to put a goroutine to sleep when the slot is not available and wake it up once the slot becomes available.

This ability is similar to what a semaphore provides in other programming languages.

There are two recommended ways to get semaphore-like synchronization in Golang:

  1. golang.org/x/sync/semaphore
  2. sync.Cond

The former provides a more complex weighted-semaphore mechanism, implemented with a sync.Mutex and a linked list of channels, that allows users to Acquire and Release multiple units at once. I put more about this in the appendix.
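For a concrete picture of the weighted API, here is a small standalone example (mine, not from the original post):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

func main() {
	sem := semaphore.NewWeighted(3) // at most 3 units held concurrently

	ctx := context.Background()
	if err := sem.Acquire(ctx, 2); err != nil { // take 2 of the 3 units at once
		panic(err)
	}
	fmt.Println(sem.TryAcquire(2)) // false: only 1 unit left
	sem.Release(2)                 // give both units back
	fmt.Println(sem.TryAcquire(2)) // true: 3 units available again
}
```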

The latter provides a much simpler interface, with Wait and Signal methods, but requires users to provide a sync.Locker to avoid races on the condition.

Since our waiting condition is based on the slot state, which is external to the semaphore, sync.Cond is the better fit for EnqueueRequest.

We can add a sync.Cond to our slot and initialize it with a sync.Mutex:
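A sketch of that change, keeping the illustrative names from above:

```go
import "sync"

type slot struct {
	cond *sync.Cond  // lets waiters sleep instead of spin; guards mark
	mark uint32
	cmd  string
	ch   chan string
}

func makeSlots(n int) []slot {
	slots := make([]slot, n)
	for i := range slots {
		slots[i].cond = sync.NewCond(&sync.Mutex{})
		slots[i].ch = make(chan string, 1)
	}
	return slots
}
```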

And rewrite our custom ring queue with it:
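A sketch of the rewritten EnqueueRequest and the response-delivery side (deliverResponse is an illustrative name for the code that recycles a slot):

```go
// EnqueueRequest now sleeps on the slot's sync.Cond instead of spinning.
func (r *ring) EnqueueRequest(cmd string) chan string {
	s := &r.slots[atomic.AddUint64(&r.write, 1)&r.mask]
	s.cond.L.Lock()
	for s.mark != free { // recheck the condition after every wakeup
		s.cond.Wait() // releases the lock and sleeps until signaled
	}
	s.mark = inUse
	s.cmd = cmd
	s.cond.L.Unlock()
	return s.ch
}

// deliverResponse recycles a slot and wakes up exactly one waiter.
func (r *ring) deliverResponse(s *slot, resp string) {
	s.ch <- resp
	s.cond.L.Lock()
	s.mark = free
	s.cond.L.Unlock()
	s.cond.Signal() // wake just one goroutine blocked in EnqueueRequest
}
```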

We now put EnqueueRequest to sleep if the slot is not available, and we wake just one goroutine up with cond.Signal() when the slot's previous response is delivered.

Now the benchmark result is better than the channel approach:

Benchmark source code: https://gist.github.com/rueian/8c18e905aa6b543d2b2dac8f975f2bef

Then, we are going to deal with the busy loop in our writing goroutine.

Minimize the bad busy loop with sync.Cond without a sync.Mutex

In our new EnqueueRequest, there is lock contention on a slot only when the ring is constantly recycling.

But if we used sync.Cond the same way in our writing goroutine, every EnqueueRequest call would need to access the writing goroutine's sync.Cond to check whether it needs to be woken up. That would undoubtedly cause lots of lock contention.

Fortunately, we don't need a real sync.Locker in this case. We can slightly relax the sleeping condition of our writing goroutine and still keep it from occupying a CPU core.

We let our writing goroutine go to sleep only when there are no in-flight EnqueueRequest calls, and we wake it up only on the next EnqueueRequest.

To do that, we use two atomic counters: waits and sleep.

We use the sleep counter to mark when the writing goroutine is going to sleep and when it is awakened.

We increase the waits counter before entering EnqueueRequest and decrease it after leaving. If the waits counter is 1 after the increment, we try to wake the writing goroutine up with cond.Broadcast().

It is important that our makeRequest function accesses the waits counter first and the sleep counter second, while our writing function accesses them in the reverse order: the sleep counter first and the waits counter second.

These reversed access sequences guarantee that we will not miss the chance to wake the goroutine up, as the sketch below shows.
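Here is a sketch of the whole arrangement, continuing the names above. The noopLocker, the pipe struct, and the counter handling are illustrative assumptions (the linked PR shows rueidis' actual code); note that the sync.Cond is built with a no-op sync.Locker, so no sync.Mutex is involved:

```go
// noopLocker satisfies sync.Locker without taking any real lock.
type noopLocker struct{}

func (noopLocker) Lock()   {}
func (noopLocker) Unlock() {}

type pipe struct {
	ring  *ring
	cond  *sync.Cond // built with noopLocker{}: no sync.Mutex involved
	waits int32      // number of in-flight makeRequest calls
	sleep int32      // 1 while the writing goroutine is going to sleep
}

func newPipe(r *ring) *pipe {
	return &pipe{ring: r, cond: sync.NewCond(noopLocker{})}
}

func (p *pipe) makeRequest(cmd string) chan string {
	if atomic.AddInt32(&p.waits, 1) == 1 { // 1) touch waits first ...
		// The remaining busy loop: it spins only until the writing
		// goroutine confirms it is awake by clearing the sleep flag,
		// so it completes shortly.
		for atomic.LoadInt32(&p.sleep) == 1 { // 2) ... then check sleep
			p.cond.Broadcast()
		}
	}
	ch := p.ring.EnqueueRequest(cmd)
	atomic.AddInt32(&p.waits, -1)
	return ch
}

func (p *pipe) writing(conn io.Writer) {
	for {
		if cmd, ok := p.ring.NextRequestToSend(); ok {
			conn.Write([]byte(cmd)) // error handling elided
			continue
		}
		// Reversed order relative to makeRequest: touch sleep first,
		// then check waits. One side is guaranteed to see the other,
		// so a wakeup can never be missed.
		atomic.StoreInt32(&p.sleep, 1)
		if atomic.LoadInt32(&p.waits) == 0 { // no in-flight makeRequest calls
			p.cond.Wait() // sleeps; makeRequest broadcasts until we wake
		}
		atomic.StoreInt32(&p.sleep, 0) // confirm we are awake
	}
}
```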

We still use a busy loop to wake up the writing goroutine in our makeRequest function, but this busy loop is better than the previous one because we know it will complete shortly.

Putting the writing goroutine to sleep does add some overhead. However, it will no longer occupy and waste one of your CPU cores when there are no requests to send.

UPDATE: It is actually possible to remove the busy loop completely. Check out https://github.com/rueian/rueidis/pull/9/files to see how to do it.

What's next: Closing a channel that has concurrent writers

The final piece of a thread-safe client library is probably how to close it. In the final post, I will share how rueidis handles that gracefully.

Appendix



Written by Rueian

Software engineer experienced in Golang, Database and Networking. https://github.com/rueian
