
Working on High-Performance Golang Client Library — Reading Again From Channels?

Custom ring queue and its problems

Rueian · Published in Better Programming · Feb 11, 2022 · 6 min read


This post is the second in a series about the common tips I learned from building rueidis, a high-performance golang Redis client library.

Those tips can also be useful for daily golang programming:

  • Part 1 — Batching on pipeline
  • Part 2 — Reading again from channels?
  • Part 3 — Removing the bad busy loop with the sync.Cond
  • Part 4 — Closing channels gracefully from concurrent writers

In part 1, we talked about how batching on the pipeline can improve performance, and we implemented the writing function in golang with just a channel and a bufio.Writer.

In this post, I will talk about the implementation of the reading function that maps responses from the reading stream back to the original requests.

The Reading Function

As mentioned in the previous part, to take advantage of pipelining, the client library should keep writing to the outgoing stream and keep reading from the incoming stream simultaneously. Therefore, the reading function should be like this:

Similar to bufio.Writer, bufio.Reader automatically batches the socket read system calls for us. We just need to extract our responses from the buffer, and that is what the readResponse function does. After that, we need to map the response back to the original request, or more precisely, inform the original caller.

One key point is that the loop must be driven by the incoming stream. That is, keep reading from the bufio.Reader at the beginning of the loop, because this way we can not only handle out-of-band messages from the server easily but also get notified immediately once the socket is closed. The latter is crucial for monitoring pipeline liveness.

There are many ways to find the original caller in this loop. The following sections are about the two approaches I have tried and their pros and cons.

Double Channels

Since we are pipelining a request/response protocol, we can expect the server to respond in the same order in which we sent the requests. However, channels do not allow us to read the same data from them twice.

Therefore, a naive approach to keeping requests in order is to push them into another waiting channel right after retrieving them from the writing channel.

We can add the following two waiting <- req lines to our previous writing function:

And complete the reading function by reading the waiting channel:

And here is how we use these functions:

Once we call the pipelining function, the pipeline is set up, and we can call the makeRequest function concurrently to send requests and receive responses.

This approach is simple and works correctly. However, it is not good enough in two ways:

  1. Moving requests from the requests channel to the waiting channel looks inefficient and requires twice the channel buffer size.
  2. It requires allocating a respCh channel for every concurrent request, which is also costly.

The benchmark result below also reveals these two issues. The double-channels approach has only ~50% of the throughput of the approach I used in rueidis.

Custom Ring Queue

Channels don’t allow us to read the same data from them again. Since they use a ring queue internally, why not build our own custom ring queue to address the above issues?

There are many articles on the web about building a lockless ring queue or ring buffer, and they are all worth reading. I also use some tips from them in rueidis, including:

  1. Using & instead of % to cycle the ring.
  2. Padding between counters to reduce CPU false sharing.
  3. Relying on the happens-before guarantees provided by Golang.
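The first tip, for instance, relies on the ring size being a power of two, so that masking with size minus one is equivalent to the modulo but cheaper:

```go
package main

import "fmt"

const ringSize = 16 // must be a power of two for the mask trick to work

func main() {
	for _, counter := range []uint64{3, 16, 17, 1 << 40} {
		masked := counter & (ringSize - 1) // cheap bitwise AND
		modulo := counter % ringSize       // what it replaces
		fmt.Println(masked == modulo)      // true for power-of-two sizes
	}
}
```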

I am not going to cover them all in this series of posts. If you are interested, you can check out the reference links in the appendix.

In the rest of this post, we will build a lockless ring queue, as I did earlier in rueidis while pursuing optimal performance. In the next post, however, we will abandon the lockless ring queue, and I will show you why it is also not a good fit in our case.

To support the pipelining use case, what we need is a ring queue that supports multiple concurrent writers and exactly two concurrent readers. That is:

  1. The user can write requests concurrently into the queue.
  2. One goroutine reads requests from the queue and then writes to the socket.
  3. One goroutine reads responses from the socket and then finds corresponding requests from the queue.

We can start by defining the queue interface, which, I believe, is a common interface for pipelining request/response protocols:
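A version of that interface, together with a toy channel-backed implementation just to make the contract concrete (rueidis's real definitions differ):

```go
package main

import "fmt"

// queue is the pipelining contract: many concurrent EnqueueRequest
// callers, one NextRequestToSend reader, one ReplyToNextRequest reader.
type queue interface {
	EnqueueRequest(req string) (resp string)  // blocks until the response arrives
	NextRequestToSend() (req string, ok bool) // writer goroutine of the pipeline
	ReplyToNextRequest(resp string)           // reader goroutine of the pipeline
}

type pair struct {
	req string
	ch  chan string
}

// chanQueue implements queue with the double-channels approach above.
type chanQueue struct {
	pending chan *pair
	waiting chan *pair
}

func newChanQueue(n int) *chanQueue {
	return &chanQueue{pending: make(chan *pair, n), waiting: make(chan *pair, n)}
}

func (q *chanQueue) EnqueueRequest(req string) string {
	p := &pair{req: req, ch: make(chan string, 1)}
	q.pending <- p
	return <-p.ch
}

func (q *chanQueue) NextRequestToSend() (string, bool) {
	select {
	case p := <-q.pending:
		q.waiting <- p // move to the waiting channel, keeping the order
		return p.req, true
	default:
		return "", false // nothing to send: the caller may flush its buffer
	}
}

func (q *chanQueue) ReplyToNextRequest(resp string) {
	p := <-q.waiting
	p.ch <- resp
}

func main() {
	var q queue = newChanQueue(4)
	done := make(chan string)
	go func() { done <- q.EnqueueRequest("PING") }()
	req, ok := q.NextRequestToSend()
	for !ok {
		req, ok = q.NextRequestToSend() // spin until the goroutine enqueues
	}
	q.ReplyToNextRequest("+" + req)
	fmt.Println(<-done) // +PING
}
```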

  1. EnqueueRequest is used by concurrent writers and blocks until the response is returned.
  2. NextRequestToSend is used by the writer goroutine of our pipeline to send the next request to the outgoing stream.
  3. ReplyToNextRequest is used by the reader goroutine of our pipeline to find the original caller once it receives a response from the incoming stream.

We can define our ring queue implementation with the following fields:

  1. The slots field is where we store the requests.
  2. The write field is an atomic counter incremented by concurrent writers to find which slot they should write to.
  3. The read1 and read2 fields are dedicated to the writer and reader goroutines of the pipeline respectively, and they don’t need to be atomic counters.

Access to the ring slot still needs to be synchronized. That is:

  1. Writers should not override slots that hold an unfulfilled request.
  2. The writer goroutine of the pipeline should not read slots that haven’t been written.
  3. The reader goroutine of the pipeline should not read slots that haven’t been read by the writer goroutine of the pipeline.

Furthermore, to avoid allocating a response channel for the caller on every request, we preallocate the channels on the slots and reuse them. We need a way to guarantee that once a caller acquires a channel, no one else can read from that channel at the same time; otherwise, responses will be delivered to the wrong callers.

The resulting slot struct therefore holds the request, its preallocated response channel, and a mark field.

We can use the mark field of the slot to achieve the above requirements:

EnqueueRequest increments the write counter on every call. It busy-waits for the mark field to become zero and uses a CompareAndSwap operation to guarantee that, among concurrent calls, it is the only one waiting on the ch for the response. After putting the request into the slot, it changes the mark to the next value to allow our writer goroutine to read it.

NextRequestToSend is used only by our writer goroutine, so it can access the read1 counter without atomic operations. It checks whether the next slot’s mark is 2 and, if so, copies the req before changing the mark to the next value. If the mark is not 2, it returns ok = false to inform the writer to flush the outgoing buffer.

Finally, ReplyToNextRequest is similar to NextRequestToSend, but it is used only by our reader goroutine; it increments the read2 counter and passes the response back to the caller.

Note that the s.ch channel should be unbuffered to ensure that the response has already been received by the caller before we change the mark back to zero and allow other EnqueueRequest callers to acquire the mark with the CompareAndSwap operation.

That’s it. Note, though, that it is not fully lockless, because we still use blocking channel operations on the s.ch.

The benchmark result for EnqueueRequest looks astonishing if we omit the blocking s.ch channel operations. The fully lockless ring queue itself has almost ~6x the throughput of the double-channels approach.

However, benchmarking only the lockless part is misleading in our case. After putting the ring queue into rueidis, it doesn’t show a ~6x improvement, but only the ~2x mentioned previously.

And most importantly, the lockless ring queue has a clear drawback: busy waiting.

What’s Next: Removing the Bad Busy Loop

Now, not only does EnqueueRequest have a busy-waiting loop in it, but the writer goroutine has the same issue, because we have lost the blocking behavior of the writing channel.

In the next post, I will share why they are bad and how I removed them without affecting rueidis performance.

Appendix

Written by Rueian: Software engineer experienced in Golang, Database and Networking. https://github.com/rueian