Detecting Race Conditions in Go

Race condition detection with a reverse proxy example

Antonio K
Better Programming



Race conditions are subtle yet devastating software defects.

ChatGPT describes them as:

a software defect that occurs when the correctness of a program depends on the relative timing or interleaving of multiple concurrent operations.

The term is often used as a mental shortcut to explain otherwise inexplicable software behavior.

They are a huge time sink and an endless source of frustration.

Humans and their single-threaded brains usually fail to detect them. It doesn’t matter how good you think you are at multitasking. If you’re playing with concurrency, you’re going to get burnt by race conditions.

What’s worse, there’s a good chance that you’ll observe the issue well after you’ve deployed the changes to production.

One such scenario played out in the 1980s with the Therac-25, a revolutionary dual-mode radiation therapy machine. In layman’s terms, it had low-power and high-power modes.

Diagram of Typical Therac-25 facility by ComputingCases.org

A skilled technician could type the commands to set up the machine in the blink of an eye.

One fateful day, the technician made an error setting up the machine for X mode (x-ray) rather than e mode (electron). The technician noticed the mistake and quickly fixed it.

The machine halted and displayed the error “Malfunction 54”. The technician interpreted the error as a low-priority issue and continued with the process.

The patient reported hearing a loud buzzing sound followed by a burning sensation as though someone had poured hot coffee on their skin.

A couple of days later, the patient suffered paralysis due to radiation overexposure and soon died. Between 1985 and 1987, the Therac-25 massively overdosed six patients, several of whom died.

Later investigation uncovered that a race condition in the machine’s software caused the incident.

There’s a lot to unpack in the Therac-25 cautionary tale. For the purposes of this article, we’ll focus on a thin slice related to race conditions.

In the next section, we’ll put on our debugging hat and explore race conditions in a much safer environment.

Setting the stage

We want to implement an HTTP reverse proxy that will forward the request to the appropriate system based on some condition.

A reverse proxy is an intermediary server that receives requests from clients and directs them to backend servers. They are typically used to address performance, security, and scalability concerns.

High-level architecture diagram

The one in our example forwards requests to either system A or B depending on the path of the original request. The original request path has to be mapped to the appropriate upstream path.

For example:

api.com/a/foo/bar -> system-a.com/v1/foo/bar

api.com/b/baz/14 -> system-b.com/baz/14

A variation of this setup is commonly used for:

  • transparent replacement of system internals (strangler pattern)
  • traffic shadowing
  • API aggregation

Show me the code

We’ll use Go’s ReverseProxy from the httputil package in the standard library to implement the proxy.

ReverseProxy exposes a variety of hooks that let clients modify its behavior. They come with sensible defaults, so clients don’t have to implement everything manually.

We need to modify the incoming request and change the path based on some mapping rules.

To achieve this, we’ll implement our own Director function, which the documentation defines as:

Director is a function which modifies the request into a new request to be sent using Transport. Its response is then copied back to the original client unmodified.

Our Director function will modify the incoming request by changing the URL based on the URL path prefix. The modified request should target the correct subsystem URL without modifying anything else.

// director is a function that takes a pointer to http request and modifies it
director := func(req *http.Request) {
    // store original URL
    originalURL := req.URL

    if strings.HasPrefix(originalURL.Path, systemARoutePrefix) {
        req.URL = urlA
    } else if strings.HasPrefix(originalURL.Path, systemBRoutePrefix) {
        req.URL = urlB
    } else {
        return
    }

    // don't forget to take all the URL parts
    req.URL.Fragment = originalURL.Fragment
    req.URL.RawQuery = originalURL.RawQuery
    // map the original path based on some rules
    req.URL.Path = mapPath(originalURL.Path)
}

Here’s the full proxy implementation:

package main

import (
    "fmt"
    "net/http"
    "net/http/httputil"
    "net/url"
    "strings"
)

// URL mapping rules
var subsystemUrlPrefix map[string]string = map[string]string{
    // system A
    "/a/foo/bar": "/v1/foo/bar",
    // system B
    "/b/baz": "/baz",
}

const (
    systemARoutePrefix = "/a"
    systemBRoutePrefix = "/b"
)

// create a new proxy for system A and system B
func NewProxy(systemAURL, systemBURL string) (*httputil.ReverseProxy, error) {
    urlA, urlErr := url.Parse(systemAURL)
    if urlErr != nil {
        return nil, fmt.Errorf("cannot parse system A URL: %w", urlErr)
    }

    urlB, urlErr := url.Parse(systemBURL)
    if urlErr != nil {
        return nil, fmt.Errorf("cannot parse system B URL: %w", urlErr)
    }

    // set up a director function to modify incoming requests
    director := func(req *http.Request) {
        originalURL := req.URL

        if strings.HasPrefix(originalURL.Path, systemARoutePrefix) {
            req.URL = urlA
        } else if strings.HasPrefix(originalURL.Path, systemBRoutePrefix) {
            req.URL = urlB
        } else {
            return
        }

        req.URL.Fragment = originalURL.Fragment
        req.URL.RawQuery = originalURL.RawQuery
        req.URL.Path = mapPath(originalURL.Path)
    }

    return &httputil.ReverseProxy{Director: director}, nil
}

// map path based on the URL prefix
func mapPath(path string) string {
    for apiPrefix, subsystemPrefix := range subsystemUrlPrefix {
        if strings.HasPrefix(path, apiPrefix) {
            return strings.Replace(path, apiPrefix, subsystemPrefix, 1)
        }
    }

    return path
}
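Since ReverseProxy implements http.Handler, wiring it into a server is straightforward. Here’s a minimal sketch of a main function; the upstream URLs and listen port below are placeholders, not values from the article’s repository.

package main

import (
    "log"
    "net/http"
)

func main() {
    // Placeholder upstream URLs; substitute the real service addresses.
    proxy, err := NewProxy("http://system-a:9202", "http://system-b:9203")
    if err != nil {
        log.Fatalf("cannot create proxy: %v", err)
    }

    // ReverseProxy satisfies http.Handler, so it can be served directly.
    if err := http.ListenAndServe(":8080", proxy); err != nil {
        log.Fatal(err)
    }
}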

Testing

We can implement a test to verify that:

  • given the request, the URL shall be modified to match the correct subsystem
  • given the request, the HTTP method shall not be modified

We’ll implement a fixture proxy on top of the actual proxy to help us out with this.

First, sending actual HTTP requests to verify the above behavior is unnecessary. We’ll set the proxy transport to use a noopRoundTripper to ensure tests don’t make any network calls.

Second, we’ll define an onOutgoing hook that will allow the test code to inspect the outgoing request.

func fixtureProxy(t *testing.T, onOutgoing func(r *http.Request)) *httputil.ReverseProxy {
    p, err := NewProxy(systemABaseUrl, systemBBaseURL)
    require.NoError(t, err)

    originalDirector := p.Director
    p.Director = func(outgoing *http.Request) {
        onOutgoing(outgoing)
        originalDirector(outgoing)
    }
    p.Transport = noopRoundTripper{onRoundTrip: successRoundTrip}
    return p
}
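The helpers referenced here (noopRoundTripper, successRoundTrip, fixtureRequest, fixtureWriter and the base URL constants) live in the repository and aren’t shown in the article. A minimal sketch of what they might look like, assuming testify and the standard httptest package:

// Assumed imports: io, net/http, net/http/httptest, strings, testing,
// and github.com/stretchr/testify/require.

// Base URLs used by the tests; the actual values live in the repository.
const (
    systemABaseUrl = "http://system-a.com"
    systemBBaseURL = "http://system-b.com"
)

// noopRoundTripper satisfies http.RoundTripper without touching the network.
type noopRoundTripper struct {
    onRoundTrip func(*http.Request) (*http.Response, error)
}

func (n noopRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {
    return n.onRoundTrip(r)
}

// successRoundTrip returns a canned 200 OK response.
func successRoundTrip(r *http.Request) (*http.Response, error) {
    return &http.Response{
        StatusCode: http.StatusOK,
        Header:     make(http.Header),
        Body:       io.NopCloser(strings.NewReader("")),
        Request:    r,
    }, nil
}

// fixtureRequest builds a test request against the proxy's public host.
func fixtureRequest(t *testing.T, path, method string) *http.Request {
    t.Helper()
    req, err := http.NewRequest(method, "http://api.com"+path, nil)
    require.NoError(t, err)
    return req
}

// fixtureWriter returns a recorder that captures the proxy's response.
func fixtureWriter() *httptest.ResponseRecorder {
    return httptest.NewRecorder()
}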

The test will instantiate the fixture proxy, fire a test request and inspect its URL to ensure that it’s been correctly modified.

func TestProxy(t *testing.T) {
    testCases := []struct {
        desc             string
        originalPath     string
        originalMethod   string
        expectedProxyURL string
    }{
        {
            desc:             "System A POST",
            originalPath:     "/a/foo/bar",
            originalMethod:   "POST",
            expectedProxyURL: fmt.Sprintf("%s/v1/foo/bar", systemABaseUrl),
        },
        {
            desc:             "System B POST",
            originalPath:     "/b/baz/14",
            originalMethod:   "POST",
            expectedProxyURL: fmt.Sprintf("%s/baz/14", systemBBaseURL),
        },
    }
    for _, tC := range testCases {
        t.Run(tC.desc, func(t *testing.T) {
            var proxiedRequest *http.Request
            p := fixtureProxy(t, func(r *http.Request) {
                proxiedRequest = r
            })

            writer := fixtureWriter()
            req := fixtureRequest(t, tC.originalPath, tC.originalMethod)
            p.ServeHTTP(writer, req)
            require.Equal(t, tC.expectedProxyURL, proxiedRequest.URL.String())
            require.Equal(t, tC.originalMethod, proxiedRequest.Method, "HTTP method should not be modified on proxy")
        })
    }
}

All tests passed, as expected. So far so good.

Observing the problem

Now it’s time to run our proxy in production.

To simulate production conditions we’ll implement two simple HTTP servers for service A and service B and run them using Docker Compose.

Both services will have a single HTTP listener handling the route the proxy targets.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Return "Hello from service A" when any HTTP request reaches the /v1/foo/bar URL
    http.HandleFunc("/v1/foo/bar", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello from service A")
    })

    // start the HTTP server running on port 9202
    if err := http.ListenAndServe(":9202", nil); err != nil {
        panic(err)
    }
}
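Service B looks almost identical. The port and route pattern in the sketch below are assumptions rather than values from the repository; note that with Go’s default mux, the trailing slash in "/baz/" matches /baz/14 and any other sub-path.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Return "Hello from service B" for any request under /baz/
    http.HandleFunc("/baz/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello from service B")
    })

    // The port here is an assumption; check the repository for the actual value.
    if err := http.ListenAndServe(":9203", nil); err != nil {
        panic(err)
    }
}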

Next, we’ll define Dockerfiles and spin everything up using Docker Compose (check out the GitHub repository for details).

Once the system is up and running, we can send some traffic to see how proxying behaves.

Sending sequential requests works like a charm.

As our service grows in popularity, the number of requests pouring in reaches new heights.

To simulate these high-traffic conditions, we’ll make use of k6.

The k6 script will randomly send HTTP requests to either the route targeting Service A or the one targeting Service B.

import http from 'k6/http';
import { check } from 'k6';

// Testing constants
const SERVICE_A_URL = 'http://localhost:8080/a/foo/bar';
const SERVICE_A_EXPECTED_RESPONSE = 'Hello from service A';
const SERVICE_A_METHOD = 'POST';
const SERVICE_B_URL = 'http://localhost:8080/b/baz/14';
const SERVICE_B_EXPECTED_RESPONSE = 'Hello from service B';
const SERVICE_B_METHOD = 'GET';

export default function () {
  // Randomly choose between the two URLs
  const url = Math.random() > 0.5 ? SERVICE_A_URL : SERVICE_B_URL;
  const expectedResponse = url === SERVICE_A_URL ? SERVICE_A_EXPECTED_RESPONSE : SERVICE_B_EXPECTED_RESPONSE;
  const method = url === SERVICE_A_URL ? SERVICE_A_METHOD : SERVICE_B_METHOD;

  // Make the request with the chosen method
  const res = http.request(method, url);

  // Check that the response was successful and has the expected body
  check(res, {
    'status is 200': (r) => r.status === 200,
    'OK response': (r) => r.body === expectedResponse,
  });
}
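Assuming the script is saved as loadtest.js (the filename is arbitrary), it can be run with a chosen number of virtual users and duration:

k6 run --vus 50 --duration 30s loadtest.js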

Both checks expect a 200 OK response status and the correct response message.

After running the script we found that almost 50% of requests failed. What gives?

If we let the concurrent requests run in the background and try a couple of manual requests, we’ll see that some of them fail with 404 Not Found.

Despite passing our tests for expected proxy behavior, our system returns 404s during heavy load conditions.

Typically, engineers would lose their sanity over this kind of issue (as I did one too many times).

However, this is a blog post about race conditions, so you likely have a hunch as to what’s going on.

Race detection tool to the rescue

The Go ecosystem has a plethora of tools that improve productivity and help engineers build robust software.

One such tool is the Go Race Detector. As the name suggests, we’ll use the tool to see if our code has any race conditions.

The Go compiler instruments the code to record memory accesses, while the runtime library watches for unsynchronized access to shared variables.

According to the docs:

… the race detector can detect race conditions only when they are actually triggered by running code

Let’s create a test with a realistic scenario that might cause the race condition to surface.

The test will send 100 concurrent requests using the previously described fixture proxy.

func TestProxy_ConcurrentRequests(t *testing.T) {
    // create a new fixture proxy
    p := fixtureProxy(t, func(r *http.Request) {})
    // define a new WaitGroup that enables the testing code to wait for all
    // goroutines to finish with their work
    wg := sync.WaitGroup{}

    for i := 0; i < 100; i++ {
        // increment the WaitGroup counter
        wg.Add(1)
        // start a new goroutine
        go func() {
            // don't forget to decrement the WaitGroup counter
            defer wg.Done()
            writer := fixtureWriter()
            req := fixtureRequest(t, "/a/foo/bar", "GET")
            // serve the test request with the fixture proxy
            p.ServeHTTP(writer, req)
        }()
    }

    // wait until all goroutines are done
    wg.Wait()
}

The tests should be run with the -race flag to enable the race detector.
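For example, from the module root:

go test -race ./...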

Bingo! Tests failed with 3 detected data races. Let’s zoom into the issue and figure out what went wrong.

Deciphering the output

The race detector prints stack traces explaining the race condition. The output can be divided into two parts:

  1. Race condition stack traces pointing to a memory address and the line where it happened (What/Where)
  2. Origin of goroutines involved in the race condition (Who/How)

What/Where?

The first part of the output tells the engineers what kind of issue happened and where exactly it happened.

==================
# What happened
WARNING: DATA RACE
# Where it happened
Write at 0x00c0001ccbb0 by goroutine 14:
github.com/pavisalavisa/race-condition-detection.NewProxy.func1()
/Users/pavisalavisa/repos/race-condition-detection/proxy.go:47 +0x174 #the problematic write by goroutine 14
...

Previous write at 0x00c0001ccbb0 by goroutine 13:
github.com/pavisalavisa/race-condition-detection.NewProxy.func1()
/Users/pavisalavisa/repos/race-condition-detection/proxy.go:47 +0x174 #the problematic write by goroutine 13
...

The tool found concurrent writes to memory address 0x00c0001ccbb0 on line 47 of the proxy implementation.

This is the line inside the director function that copies the original URL fragment to the proxy URL.

director := func(req *http.Request) {
    // Rest of the code

    req.URL.Fragment = originalURL.Fragment // Line 47 DATA RACE

    // Rest of the code
}

Who/How?

The second part of the output tells the engineers which goroutines were involved and how they came to life:

# Goroutine origins
Goroutine 14 (running) created at:
github.com/pavisalavisa/race-condition-detection.TestProxy_ConcurrentRequests()
/Users/pavisalavisa/repos/race-condition-detection/proxy_test.go:92 +0x64
...

Goroutine 13 (finished) created at:
github.com/pavisalavisa/race-condition-detection.TestProxy_ConcurrentRequests()
/Users/pavisalavisa/repos/race-condition-detection/proxy_test.go:92 +0x64
...

The goroutines were created by the test code; no surprises there.

Had the application been deployed, these goroutines would have been created by the HTTP library, leading to the very same failure.

In Go servers, each incoming request is handled in its own goroutine. (source)

The race detector can also be used to inspect running applications by building or running the service with the -race flag.

Be careful experimenting with this feature in production, though, because:

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5–10x and execution time by 2–20x. (source)
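In practice, enabling the detector on a running service just means building or running the binary with the same flag, for example:

go run -race .
go build -race -o proxy .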

Fixing the problem

Now that we understand the data race, let’s see what we did wrong in the director function.

director := func(req *http.Request) {
    originalURL := req.URL

    if strings.HasPrefix(originalURL.Path, systemARoutePrefix) {
        req.URL = urlA
    } else if strings.HasPrefix(originalURL.Path, systemBRoutePrefix) {
        req.URL = urlB
    } else {
        return
    }

    req.URL.Fragment = originalURL.Fragment
    req.URL.RawQuery = originalURL.RawQuery
    req.URL.Path = mapPath(originalURL.Path)
}

The director function updates parts of the proxy request URL with data from the original request. The concurrent write to the Fragment struct field indicates that multiple goroutines have access to the same URL.

Proxy URLs are created by the url.Parse function, which returns a pointer to a URL when the provided string is a valid URL.

> go doc net/url URL.Parse

func (u *URL) Parse(ref string) (*URL, error)
Parse parses a URL in the context of the receiver. The provided URL may be
relative or absolute. Parse returns nil, err on parse failure, otherwise its
return value is the same as ResolveReference.

These URLs are parsed only once at startup when creating a new proxy. As a result, every request uses the same URL pointer and modifies the underlying data.

func NewProxy(systemAURL, systemBURL string) (*httputil.ReverseProxy, error) {
    urlA, urlErr := url.Parse(systemAURL)
    if urlErr != nil {
        return nil, fmt.Errorf("cannot parse system A URL: %w", urlErr)
    }

    // Rest of the code
}

That is a bit of an issue. No need to panic, though, because we have (at least) two options to fix it:

  1. clone the proxy URLs
  2. avoid modifying the URL pointer

Let’s proceed with option one and see how that goes.

According to a GitHub discussion in the official Go repo, it’s safe to clone the URL by dereferencing the pointer.

director := func(req *http.Request) {
    // store the original URL
    originalURL := req.URL
    var proxyURL url.URL

    if strings.HasPrefix(originalURL.Path, systemARoutePrefix) {
        proxyURL = *urlA // dereference the parsed urlA to ensure we get a copy
    } else if strings.HasPrefix(originalURL.Path, systemBRoutePrefix) {
        proxyURL = *urlB // dereference the parsed urlB to ensure we get a copy
    } else {
        return
    }

    req.URL = &proxyURL
    req.URL.Fragment = originalURL.Fragment
    req.URL.RawQuery = originalURL.RawQuery
    req.URL.Path = mapPath(originalURL.Path)
}

After making the above changes, we can now proudly say that the issue has been resolved! Our tests have passed and we can deploy our service with confidence.
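For completeness, option two from the list above would avoid replacing req.URL altogether: leave the parsed base URLs untouched and rewrite only the fields of the incoming request’s own URL, which is never shared between goroutines. A rough sketch of that approach:

director := func(req *http.Request) {
    var base *url.URL

    if strings.HasPrefix(req.URL.Path, systemARoutePrefix) {
        base = urlA
    } else if strings.HasPrefix(req.URL.Path, systemBRoutePrefix) {
        base = urlB
    } else {
        return
    }

    // only read from the shared base URL; mutate the per-request URL instead
    req.URL.Scheme = base.Scheme
    req.URL.Host = base.Host
    req.URL.Path = mapPath(req.URL.Path)
}

This is essentially what httputil.NewSingleHostReverseProxy’s default director does with its target URL, and it sidesteps the shared pointer entirely.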

Takeaways

Race conditions are elusive and dangerous programming errors that have persisted for decades.

The damage they can cause ranges from zero harm in educational and toy projects to loss of life in extreme cases.

We’ve shown that even a service with no more than a hundred lines of code can be plagued with issues.

If you’re building a critical piece of infrastructure for your system, don’t let the simplicity deceive you.

Throw a curve ball at your system, or better yet, thousands of curve balls per second.

While laboratory condition tests can help you catch some bugs, don’t expect them to be perfect. Production is a completely different beast and you better be prepared for it.

When things go south (and they will), make sure you have proper observability set up. Blind debugging is a fool’s errand.

Finally, bring some peers along for the ride. Sometimes, a fresh pair of eyes is all you need to spot a tricky race condition. It’s a great bonding experience. Plus, it’s more fun to solve problems with friends.

In short:

  • stress test your system
  • look at the edge cases
  • use specialized tools
  • bring in peers to help you out
  • and don’t forget to have fun while you’re at it. After all, if you’re not enjoying the process, what’s the point?

You can find the code examples on GitHub.

The animated terminal recordings were created with terminalizer.

