Working on High-Performance Golang Client Library — Reading Again From Channels?
Custom ring queue and its problems
This post is the second in a series about common tips I learned from building rueidis, a high-performance golang Redis client library. These tips can also be useful for daily golang programming:
- Part 1 — Batching on pipeline
- Part 2 — Reading again from channels?
- Part 3 — Removing the bad busy loop with the sync.Cond
- Part 4 — Closing channels gracefully from concurrent writers
In part 1, we talked about how batching on the pipeline can improve performance, and we implemented the writing function in golang with just a channel and bufio.Writer.
In this post, I will talk about the implementation of the reading function that maps responses from the incoming stream back to their original requests.
The Reading Function
As mentioned in the previous part, to take advantage of pipelining, the client library should keep writing to the outgoing stream and keep reading from the incoming stream simultaneously. Therefore, the reading function should be like this:
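A minimal sketch of such a reading function, assuming a hypothetical readResponse helper that extracts one complete response from the buffer and a hypothetical deliver helper that informs the original caller (neither name is the actual rueidis API):

```go
import (
	"bufio"
	"net"
)

func reading(conn net.Conn) {
	reader := bufio.NewReader(conn) // batches socket read syscalls for us
	for {
		// Drive the loop from the incoming stream: read at the beginning.
		resp, err := readResponse(reader)
		if err != nil {
			// The socket is closed or the stream is corrupted:
			// abort the pipeline and fail all pending requests.
			return
		}
		// Map the response back to the original request, i.e. inform the caller.
		deliver(resp)
	}
}
```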
Similar to bufio.Writer, bufio.Reader automatically batches the socket read system calls for us. We just need to extract our responses from the buffer, and that is what the readResponse function does. After that, we need to map each response back to its original request, or more precisely, inform the original caller.
One key point is that the loop must be driven by the incoming stream. That is, keep reading from the bufio.Reader at the beginning of the loop, because this way we can not only handle out-of-band messages from the server easily but also get notified immediately once the socket is closed. The latter is crucial for pipeline liveness monitoring.
There are many ways to find the original caller in this loop. The following sections describe the two approaches I have tried and their pros and cons.
Double Channels
Since we are pipelining a Request/Response protocol, we should expect the server to respond in the same order as the requests we sent. However, we are not allowed to read the same data from channels again.
Therefore, a naive approach to keeping requests in order is pushing them into another waiting channel after retrieving them from the writing channel.
We can add the following two waiting <- req lines into our previous writing function:
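Here is a sketch of such a writing function, assuming the batching loop style from part 1 and a hypothetical request struct that carries the payload and a per-request response channel:

```go
import (
	"bufio"
	"net"
)

// Hypothetical request/response types for illustration.
type response struct{ payload []byte }

type request struct {
	payload []byte
	respCh  chan response
}

func writing(conn net.Conn, requests, waiting chan *request) {
	writer := bufio.NewWriter(conn)
	for req := range requests {
		waiting <- req // keep the request for the reading function
		writer.Write(req.payload)
		for len(requests) != 0 { // keep writing while more requests are queued
			req = <-requests
			waiting <- req // keep the request for the reading function
			writer.Write(req.payload)
		}
		writer.Flush() // flush the batch once the requests channel is drained
	}
}
```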
And complete the reading function by reading the waiting channel:
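A matching sketch, again assuming the hypothetical readResponse helper:

```go
func reading(conn net.Conn, waiting chan *request) {
	reader := bufio.NewReader(conn)
	for {
		resp, err := readResponse(reader)
		if err != nil {
			return // the socket is closed: abort the pipeline
		}
		req := <-waiting   // the oldest in-flight request, in sending order
		req.respCh <- resp // inform the original caller
	}
}
```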
And here is how we use these functions:
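A sketch of the wiring; the channel buffer sizes here are arbitrary:

```go
func pipelining(conn net.Conn) chan *request {
	requests := make(chan *request, 512)
	waiting := make(chan *request, 512) // needs as much capacity as requests
	go writing(conn, requests, waiting)
	go reading(conn, waiting)
	return requests
}

func makeRequest(requests chan *request, payload []byte) response {
	req := &request{payload: payload, respCh: make(chan response, 1)}
	requests <- req     // hand the request to the pipeline
	return <-req.respCh // wait for the reading function to deliver the response
}
```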
Once we call the pipelining function, the pipeline is set up, and we can then call the makeRequest function concurrently to send requests and receive responses.
This approach is simple and works correctly. However, it is not good enough in two ways:
- Moving requests from the requests channel to the waiting channel looks inefficient and requires double the channel buffer size.
- It requires allocating a respCh channel for every concurrent request, which is also costly.
The benchmark result below also reveals these two issues. The double channels approach has only ~50% of the throughput of the approach I used in rueidis:
Custom Ring Queue
Channels don’t allow us to read the same data from them again. Since they use a ring queue internally, why couldn’t we build our own custom ring queue to address the above issues?
On the web, there are many articles about how to build a lockless ring queue or ring buffer, and they are all worth reading. I also use some tips from them in rueidis, including:
- Using & instead of % to cycle the ring.
- Padding between counters to reduce CPU false sharing.
- Relying on the happens-before guarantees provided by Golang.
I am not going to cover them all in this series of posts. If you are interested in them, you may check out the reference links I put in the appendix.
In the following sections, we will get started by building a lockless ring queue, as I did earlier in rueidis when I tried to pursue optimal performance. But in the next post, we will abandon the lockless ring queue, and I will show you why it is also not a good fit for our case.
To support the pipelining use case, what we need is a ring queue that supports multiple concurrent writers and exactly two concurrent readers. That is:
- The user can write requests concurrently into the queue.
- One goroutine reads requests from the queue and then writes to the socket.
- One goroutine reads responses from the socket and then finds corresponding requests from the queue.
We can start by defining the queue interface which, I believe, is a common interface for pipelining a request/response protocol:
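A sketch of such an interface; the Request and Response types are placeholders rather than the actual rueidis types:

```go
// Placeholder types; in rueidis these carry Redis commands and replies.
type Request struct{ payload []byte }
type Response struct{ payload []byte }

type Queue interface {
	// EnqueueRequest is called by concurrent writers and blocks
	// until the response for the given request is returned.
	EnqueueRequest(req Request) Response
	// NextRequestToSend is called by the writer goroutine of the pipeline;
	// ok is false when there is nothing ready to send.
	NextRequestToSend() (req Request, ok bool)
	// ReplyToNextRequest is called by the reader goroutine of the pipeline
	// to deliver a response to the caller of the oldest sent request.
	ReplyToNextRequest(resp Response)
}
```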
- EnqueueRequest is used by concurrent writers and blocks until the response is returned.
- NextRequestToSend is used by the writer goroutine of the pipeline to send the next request to the outgoing stream.
- ReplyToNextRequest is used by the reader goroutine of the pipeline to find the original caller once a response is received from the incoming stream.
We can define our ring queue implementation like this:
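A sketch under the above requirements; the ring size and the padding layout are illustrative guesses, not the actual rueidis values:

```go
const size = 4096     // must be a power of two
const mask = size - 1 // so that (counter & mask) cycles the ring

type ring struct {
	write uint32   // atomic counter shared by concurrent writers
	_     [60]byte // padding to keep the counters on separate cache lines
	read1 uint32   // owned exclusively by the writer goroutine of the pipeline
	_     [60]byte
	read2 uint32   // owned exclusively by the reader goroutine of the pipeline
	slots [size]slot
}
```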
- The slots field is where we store the requests.
- The write field is an atomic counter incremented by concurrent writers to find which slot they should write to.
- The read1 and read2 fields are dedicated to the writer goroutine and the reader goroutine of the pipeline respectively, and they don’t need to be atomic counters.
Access to the ring slots still needs to be synchronized. That is:
- Writers should not overwrite slots that have an unfulfilled request.
- The writer goroutine of the pipeline should not read slots that haven’t been written.
- The reader goroutine of the pipeline should not read slots that haven’t been read by the writer goroutine of the pipeline.
Furthermore, to avoid allocating a response channel for the caller on every request, we preallocate one on each slot and reuse it. We need to find a way to guarantee that once a caller acquires the channel, no other caller can read it at the same time; otherwise, responses will be delivered to the wrong callers.
The resulting slot struct is defined like this:
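A sketch of the slot, plus a constructor that preallocates the reusable response channels:

```go
type slot struct {
	mark uint32        // 0: free, 1: being written, 2: ready to send, 3: sent, awaiting reply
	ch   chan Response // preallocated, unbuffered, and reused across requests
	req  Request
}

func newRing() *ring {
	r := &ring{}
	for i := range r.slots {
		r.slots[i].ch = make(chan Response) // unbuffered on purpose, explained below
	}
	return r
}
```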
We can use the mark field of the slot to achieve the above requirements:
EnqueueRequest increments the write counter on every call. It busy-waits for the mark field of its slot to become zero, and the CompareAndSwap operation guarantees that, among concurrent calls, it is the only one waiting on the ch for the response. After putting the request into the slot, it changes the mark to the next value to allow our writer goroutine to read it:
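A sketch of EnqueueRequest following that description; the yield inside the busy loop is my own addition for illustration:

```go
import (
	"runtime"
	"sync/atomic"
)

func (r *ring) EnqueueRequest(req Request) Response {
	s := &r.slots[atomic.AddUint32(&r.write, 1)&mask] // & instead of % to cycle the ring
	// Busy wait until the slot is free (mark == 0); CompareAndSwap guarantees
	// we are the only concurrent caller that acquired it.
	for !atomic.CompareAndSwapUint32(&s.mark, 0, 1) {
		runtime.Gosched()
	}
	s.req = req
	atomic.StoreUint32(&s.mark, 2) // now the writer goroutine may read this slot
	return <-s.ch                  // block until ReplyToNextRequest delivers the response
}
```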
NextRequestToSend is used only by our writer goroutine, so it can access the read1 counter without atomic operations. It checks whether the next slot’s mark is 2 and, if so, copies the req before changing the mark to the next value. If the mark is not 2, it returns ok = false to inform the writer to flush the outgoing buffer:
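A sketch of NextRequestToSend:

```go
func (r *ring) NextRequestToSend() (req Request, ok bool) {
	s := &r.slots[(r.read1+1)&mask] // read1 is touched only by the writer goroutine
	if atomic.LoadUint32(&s.mark) == 2 {
		req = s.req // copy the request out before releasing the slot
		r.read1++
		atomic.StoreUint32(&s.mark, 3) // sent; now waiting for its response
		return req, true
	}
	return req, false // nothing ready: the caller should flush the outgoing buffer
}
```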
Finally, ReplyToNextRequest is similar to NextRequestToSend, but it is used only by our reader goroutine; it increments the read2 counter and passes the response back to the caller:
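A sketch of ReplyToNextRequest:

```go
func (r *ring) ReplyToNextRequest(resp Response) {
	r.read2++ // read2 is touched only by the reader goroutine
	s := &r.slots[r.read2&mask]
	s.ch <- resp                   // unbuffered: blocks until the caller has received it
	atomic.StoreUint32(&s.mark, 0) // free the slot for other EnqueueRequest callers
}
```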
Note that the s.ch channel should be unbuffered to ensure that the response has already been received by the caller before we change the mark back to zero and allow other EnqueueRequest callers to acquire the slot with the CompareAndSwap operation.
That’s it. However, it is actually not fully lockless, because we still use blocking channel operations on the s.ch.
The benchmark result on EnqueueRequest looks astonishing if we omit the blocking s.ch channel operations. The fully lockless ring queue itself has almost ~6x the throughput of the double channels approach:
However, benchmarking only the lockless part is misleading in our case. After putting the ring queue into rueidis, it didn’t show a ~6x improvement, but only the 2x improvement mentioned previously.
And most importantly, the lockless ring queue has a clear drawback: busy waiting.
What’s next: Remove The Bad Busy Loop
Now, not only does EnqueueRequest have a busy-waiting loop in it, but the writer goroutine also has the same issue, because we have lost the blocking behavior of the writing channel:
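A sketch of what the writer goroutine looks like on top of the ring queue, with no channel left to block on:

```go
func writing(conn net.Conn, queue *ring) {
	writer := bufio.NewWriter(conn)
	for {
		if req, ok := queue.NextRequestToSend(); ok {
			writer.Write(req.payload)
		} else if writer.Buffered() != 0 {
			writer.Flush()
		}
		// No blocking point anymore: this loop keeps spinning
		// and burns CPU even when the connection is idle.
	}
}
```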
In the next post, I will share why they are bad and how I removed them without affecting rueidis performance.