Working on High-Performance Golang Client Library — Reading Again From Channels?
Custom ring queue and its problems
This post is the second in a series about common tips I learned from building rueidis, a high-performance golang Redis client library. These tips can also be useful for daily golang programming:
- Part 1 — Batching on pipeline
- Part 2 — Reading again from channels?
- Part 3 — Removing the bad busy loop with the sync.Cond
- Part 4 — Closing channels gracefully from concurrent writers
In part 1, we talked about how batching on the pipeline can improve performance, and we implemented the writing function in golang with just a channel and bufio.Writer.
In this post, I will talk about the implementation of the reading function that maps responses from the incoming stream back to their original requests.
The Reading Function
As mentioned in the previous part, to take advantage of pipelining, the client library should keep writing to the outgoing stream and keep reading from the incoming stream simultaneously. Therefore, the reading function should be like this:
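A minimal sketch of such a reading function, assuming a hypothetical readResponse helper that extracts one complete response from the buffer and a hypothetical deliver helper that informs the original caller (neither name is the actual rueidis API):

```go
import (
	"bufio"
	"net"
)

func reading(conn net.Conn) {
	reader := bufio.NewReader(conn) // batches socket read syscalls for us
	for {
		// Drive the loop from the incoming stream: read at the beginning.
		resp, err := readResponse(reader)
		if err != nil {
			// The socket is closed or the stream is corrupted:
			// abort the pipeline and fail all pending requests.
			return
		}
		// Map the response back to the original request, i.e. inform the caller.
		deliver(resp)
	}
}
```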
Similar to bufio.Writer, bufio.Reader automatically batches the socket read system calls for us. We just need to extract our responses from the buffer, and that is what the readResponse function does. After that, we need to map each response back to its original request, or more precisely, inform the original caller.
One key point is that the loop must be driven by the incoming stream. That is, keep reading from the bufio.Reader at the beginning of the loop, because this way we can not only handle out-of-band messages from the server easily but also get notified immediately once the socket is closed. The latter is crucial for pipeline liveness monitoring.
There are many ways to find the original caller in this loop. The following sections describe the two approaches I have tried and their pros and cons.
Double Channels
Since we are pipelining a Request/Response protocol, we should expect the server to respond in the same order as the requests we sent. However, we are not allowed to read the same data from channels again.
Therefore, a naive approach to keeping requests in order is pushing them into another waiting channel after retrieving them from the writing channel.
We can add the following two waiting <- req lines into our previous writing function:
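Here is a sketch of such a writing function, assuming the batching loop style from part 1 and a hypothetical request struct that carries the payload and a per-request response channel:

```go
import (
	"bufio"
	"net"
)

// Hypothetical request/response types for illustration.
type response struct{ payload []byte }

type request struct {
	payload []byte
	respCh  chan response
}

func writing(conn net.Conn, requests, waiting chan *request) {
	writer := bufio.NewWriter(conn)
	for req := range requests {
		waiting <- req // keep the request for the reading function
		writer.Write(req.payload)
		for len(requests) != 0 { // keep writing while more requests are queued
			req = <-requests
			waiting <- req // keep the request for the reading function
			writer.Write(req.payload)
		}
		writer.Flush() // flush the batch once the requests channel is drained
	}
}
```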
And complete the reading function by reading the waiting channel:
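A matching sketch, again assuming the hypothetical readResponse helper:

```go
func reading(conn net.Conn, waiting chan *request) {
	reader := bufio.NewReader(conn)
	for {
		resp, err := readResponse(reader)
		if err != nil {
			return // the socket is closed: abort the pipeline
		}
		req := <-waiting   // the oldest in-flight request, in sending order
		req.respCh <- resp // inform the original caller
	}
}
```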
And here is how we use these functions:
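A sketch of the wiring; the channel buffer sizes here are arbitrary:

```go
func pipelining(conn net.Conn) chan *request {
	requests := make(chan *request, 512)
	waiting := make(chan *request, 512) // needs as much capacity as requests
	go writing(conn, requests, waiting)
	go reading(conn, waiting)
	return requests
}

func makeRequest(requests chan *request, payload []byte) response {
	req := &request{payload: payload, respCh: make(chan response, 1)}
	requests <- req     // hand the request to the pipeline
	return <-req.respCh // wait for the reading function to deliver the response
}
```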
Once we call the pipelining function, the pipeline is set up, and we can then call the makeRequest function concurrently to send requests and receive responses.
This approach is simple and works correctly. However, it is not good enough in two ways:
- Moving requests from the requests channel to the waiting channel looks inefficient and requires double the channel buffer size.
- It requires allocating a respCh channel for every concurrent request, which is also costly.
The benchmark result below also reveals these two issues. The double channels approach has only ~50% of the throughput of the approach I used in rueidis:
Custom Ring Queue
Channels don’t allow us to read the same data from them again. Since they use a ring queue internally, why couldn’t we build our own custom ring queue to address the above issues?
On the web, there are many articles about how to build a lockless ring queue or ring buffer, and they are all worth reading. I also use some tips from them in rueidis, including:
- Using & instead of % to cycle the ring.
- Padding between counters to reduce CPU false sharing.
- Relying on the happens-before guarantees provided by Golang.
I am not going to cover them all in this series of posts. If you are interested in them, you may check out the reference links I put in the appendix.
In the following sections, we will get started by building a lockless ring queue, as I did earlier in rueidis when I tried to pursue optimal performance. But in the next post, we will abandon the lockless ring queue, and I will show you why it is also not a good fit for our case.
To support the pipelining use case, what we need is a ring queue that supports multiple concurrent writers and exactly two concurrent readers. That is:
- The user can write requests concurrently into the queue.
- One goroutine reads requests from the queue and then writes to the socket.
- One goroutine reads responses from the socket and then finds corresponding requests from the queue.
We can start by defining the queue interface which, I believe, is a common interface for pipelining a request/response protocol:
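A sketch of such an interface; the Request and Response types are placeholders rather than the actual rueidis types:

```go
// Placeholder types; in rueidis these carry Redis commands and replies.
type Request struct{ payload []byte }
type Response struct{ payload []byte }

type Queue interface {
	// EnqueueRequest is called by concurrent writers and blocks
	// until the response for the given request is returned.
	EnqueueRequest(req Request) Response
	// NextRequestToSend is called by the writer goroutine of the pipeline;
	// ok is false when there is nothing ready to send.
	NextRequestToSend() (req Request, ok bool)
	// ReplyToNextRequest is called by the reader goroutine of the pipeline
	// to deliver a response to the caller of the oldest sent request.
	ReplyToNextRequest(resp Response)
}
```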
- EnqueueRequest is used by concurrent writers and blocks until the response is returned.
- NextRequestToSend is used by the writer goroutine of the pipeline to send the next request to the outgoing stream.
- ReplyToNextRequest is used by the reader goroutine of the pipeline to find the original caller once a response is received from the incoming stream.
We can define our ring queue implementation like this:
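A sketch under the above requirements; the ring size and the padding layout are illustrative guesses, not the actual rueidis values:

```go
const size = 4096     // must be a power of two
const mask = size - 1 // so that (counter & mask) cycles the ring

type ring struct {
	write uint32   // atomic counter shared by concurrent writers
	_     [60]byte // padding to keep the counters on separate cache lines
	read1 uint32   // owned exclusively by the writer goroutine of the pipeline
	_     [60]byte
	read2 uint32   // owned exclusively by the reader goroutine of the pipeline
	slots [size]slot
}
```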
- The slots field is where we store the requests.
- The write field is an atomic counter incremented by concurrent writers to find which slot they should write to.
- The read1 and read2 fields are dedicated to the writer goroutine and the reader goroutine of the pipeline respectively, and they don’t need to be atomic counters.
Access to the ring slots still needs to be synchronized. That is:
- Writers should not overwrite slots that have an unfulfilled request.
- The writer goroutine of the pipeline should not read slots that haven’t been written.
- The reader goroutine of the pipeline should not read slots that haven’t been read by the writer goroutine of the pipeline.
Furthermore, to avoid allocating a response channel for the caller on every request, we preallocate one on each slot and reuse it. We need to find a way to guarantee that once a caller acquires the channel, no other caller can read it at the same time; otherwise, responses will be delivered to the wrong callers.
The resulting slot struct is defined like this:
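A sketch of the slot, plus a constructor that preallocates the reusable response channels:

```go
type slot struct {
	mark uint32        // 0: free, 1: being written, 2: ready to send, 3: sent, awaiting reply
	ch   chan Response // preallocated, unbuffered, and reused across requests
	req  Request
}

func newRing() *ring {
	r := &ring{}
	for i := range r.slots {
		r.slots[i].ch = make(chan Response) // unbuffered on purpose, explained below
	}
	return r
}
```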
We can use the mark field of the slot to achieve the above requirements:
EnqueueRequest increments the write counter on every call. It busy-waits for the mark field of its slot to become zero, and the CompareAndSwap operation guarantees that, among concurrent calls, it is the only one waiting on the ch for the response. After putting the request into the slot, it changes the mark to the next value to allow our writer goroutine to read it:
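A sketch of EnqueueRequest following that description; the yield inside the busy loop is my own addition for illustration:

```go
import (
	"runtime"
	"sync/atomic"
)

func (r *ring) EnqueueRequest(req Request) Response {
	s := &r.slots[atomic.AddUint32(&r.write, 1)&mask] // & instead of % to cycle the ring
	// Busy wait until the slot is free (mark == 0); CompareAndSwap guarantees
	// we are the only concurrent caller that acquired it.
	for !atomic.CompareAndSwapUint32(&s.mark, 0, 1) {
		runtime.Gosched()
	}
	s.req = req
	atomic.StoreUint32(&s.mark, 2) // now the writer goroutine may read this slot
	return <-s.ch                  // block until ReplyToNextRequest delivers the response
}
```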
NextRequestToSend is used only by our writer goroutine, so it can access the read1 counter without atomic operations. It checks whether the next slot’s mark is 2 and, if so, copies the req before changing the mark to the next value. If the mark is not 2, it returns ok = false to inform the writer to flush the outgoing buffer:
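A sketch of NextRequestToSend:

```go
func (r *ring) NextRequestToSend() (req Request, ok bool) {
	s := &r.slots[(r.read1+1)&mask] // read1 is touched only by the writer goroutine
	if atomic.LoadUint32(&s.mark) == 2 {
		req = s.req // copy the request out before releasing the slot
		r.read1++
		atomic.StoreUint32(&s.mark, 3) // sent; now waiting for its response
		return req, true
	}
	return req, false // nothing ready: the caller should flush the outgoing buffer
}
```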
Finally, ReplyToNextRequest is similar to NextRequestToSend, but it is used only by our reader goroutine; it increments the read2 counter and passes the response back to the caller:
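A sketch of ReplyToNextRequest:

```go
func (r *ring) ReplyToNextRequest(resp Response) {
	r.read2++ // read2 is touched only by the reader goroutine
	s := &r.slots[r.read2&mask]
	s.ch <- resp                   // unbuffered: blocks until the caller has received it
	atomic.StoreUint32(&s.mark, 0) // free the slot for other EnqueueRequest callers
}
```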
Note that the s.ch channel should be unbuffered to ensure that the response has already been received by the caller before we change the mark back to zero and allow other EnqueueRequest callers to acquire the slot with the CompareAndSwap operation.
That’s it. However, it is actually not fully lockless, because we still use blocking channel operations on the s.ch.
The benchmark result on EnqueueRequest looks astonishing if we omit the blocking s.ch channel operations. The fully lockless ring queue itself has almost ~6x the throughput of the double channels approach:
However, benchmarking only the lockless part is misleading in our case. After putting the ring queue into rueidis, it didn’t show a ~6x improvement, but only the 2x improvement mentioned previously.
And most importantly, the lockless ring queue has a clear drawback: busy waiting.
What’s next: Remove The Bad Busy Loop
Now, not only does EnqueueRequest have a busy-waiting loop in it, but the writer goroutine also has the same issue, because we have lost the blocking behavior of the writing channel:
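A sketch of what the writer goroutine looks like on top of the ring queue, with no channel left to block on:

```go
func writing(conn net.Conn, queue *ring) {
	writer := bufio.NewWriter(conn)
	for {
		if req, ok := queue.NextRequestToSend(); ok {
			writer.Write(req.payload)
		} else if writer.Buffered() != 0 {
			writer.Flush()
		}
		// No blocking point anymore: this loop keeps spinning
		// and burns CPU even when the connection is idle.
	}
}
```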
In the next post, I will share why they are bad and how I removed them without affecting rueidis performance.