A Case Study on Dealing with a Noisy Load Balancer Alarm

How to diagnose and resolve load balancer issues

Talha Malik
Better Programming


Photo by Yeo Khee on Unsplash

Introduction

To deliver a highly available application, it’s standard practice to set up alarms that get triggered if something goes wrong. Alarms may be triggered by:

  • An instance with high CPU or memory usage
  • Exceptions or non-zero exit codes
  • A high number of HTTP error status codes

Good alarms are actionable; otherwise, important issues may be masked by seemingly unimportant alerts and swept under the rug. An alarm that fires frequently under irrelevant circumstances is considered noisy. To reduce the cognitive load on developers and first responders, it’s important that irrelevant circumstances don’t spawn alerts.

Recently, our AWS Application Load Balancer had been generating a lot of 502 and 504 error codes, which triggered an alarm a handful of times. We didn’t find any smoking guns during our investigation, and everything seemed to be fine. We realized this alarm had become noisy.

In this article, I’ll talk about the steps we took to bring the alarm back to a healthy state. Who’ll benefit from this article:

  • People interested in network configurations for REST services
  • Developers who lack experience with infrastructure and want exposure to it. This exercise is an excellent example of the kind of work developers on platform, DevOps, and site reliability teams do.

Terminology

A load balancer reverse proxying to multiple servers (Image source: Author)

Load balancer — A load balancer is an architectural component that distributes requests amongst a group of servers so that no one server is overwhelmed.

Upstream service — An upstream service refers to one or more servers that are running the same business logic. In the diagram above, Servers A, B, and C form an upstream service (from the perspective of the load balancer).

The Reaction

The process starts with an ELB_5XX alarm notification from PagerDuty:

Slack notification triggered by our PagerDuty integration

Before we resolve the issue, we need to see if there are any immediate steps we can take to reduce the level of service degradation experienced by users. We’ve written “run books” for different failure scenarios that help us expedite the process. The link to the appropriate run book is included in the description of the triggered alarm. The first responder simply needs to follow the run book, which usually has the following structure:

  • Are upstream servers experiencing high CPU or memory usage? If so, increase the number of servers to cope with the load.
  • Is only one upstream server experiencing high CPU or memory usage? If so, kill that server and replace it with a new one.
  • Check upstream logs. Are the alarms originating from the same server? If so, kill that server and replace it with a new one.
  • And so on…

A few things to note about run books:

  • The instructions are extremely simple.
  • The steps are sequential and easily followed.
  • They are designed to help first responders systematically resolve the issue.

In our case, the associated run book didn’t result in a resolution, and our dashboards didn’t show any error/warning signs. During our investigation, the alarm auto-resolved, and the 502 errors went away.

5XX error count from our elastic load balancer

Before we do a deeper dive into the issue, we need to determine the impact of the errors and give constant status updates to stakeholders. In our case, less than 0.5% of requests resulted in 502 errors.

The Discovery

Now that we’ve updated stakeholders, it’s time to investigate and resolve the issue. I temporarily increased the threshold for the alarm to make sure it wouldn’t trigger during our investigation:

GitHub PR to change the threshold that triggers the alarm

There are many reasons why a load balancer would throw 5XX errors:

  • One (or more) upstream servers have become unresponsive.
  • There’s a transient network issue that’s disrupting upstream connections.
  • An upstream service responds with malformed or corrupt data.

AWS has many articles that explain how to troubleshoot ELB 5XX errors. Our job is to systematically check and eliminate possible failure modes. The easiest way to uncover the root cause is to query the ELB (elastic load balancer) access logs and analyze the results:

First, we index relevant log files in Athena:
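
A table definition along these lines does the job. Treat it as a sketch rather than our exact DDL: the full column list and the RegexSerDe input.regex should be copied together from the AWS documentation on querying Application Load Balancer logs with Athena, and the table name, bucket, account ID, and dates below are placeholders. The part that matters for the rest of this article is the year/month/day partitioning.

    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code int,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string
        -- ...remaining columns as listed in the AWS documentation...
    )
    PARTITIONED BY (year string, month string, day string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        -- The regex must match the full column list, so copy both
        -- from the same AWS documentation page.
        'input.regex' = '<regex from the AWS documentation>'
    )
    LOCATION 's3://my-alb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/';

    -- ALB writes logs under .../yyyy/mm/dd/ rather than Hive-style paths,
    -- so each day's partition is registered explicitly:
    ALTER TABLE alb_logs ADD IF NOT EXISTS
        PARTITION (year = '2021', month = '03', day = '15')
        LOCATION 's3://my-alb-logs/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2021/03/15/';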

Let’s check whether the 5XX errors originate from the same upstream server. We can do this by grouping the failing requests by their target IP; if a single server were to blame, we’d see one IP address dominating the results.
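
A query along these lines does it, assuming the hypothetical alb_logs table sketched above (the dates are illustrative):

    -- Count 5XX responses per upstream target for one day of logs.
    -- A single misbehaving instance would show one target_ip dominating.
    SELECT target_ip,
           count(*) AS error_count
    FROM alb_logs
    WHERE elb_status_code >= 500
      AND year = '2021' AND month = '03' AND day = '15'
    GROUP BY target_ip
    ORDER BY error_count DESC;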

Results from the query.

The query result includes many different IP addresses, so we know the error isn’t originating from one upstream server. Here are some other queries we can use to help with our investigation:

Determining the response time of requests which resulted in a 5XX:
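
A sketch of that query against the same hypothetical table, bucketing by whole seconds so the timing pattern is easy to read:

    -- Distribution of target processing time for 5XX responses.
    SELECT floor(target_processing_time) AS target_time_seconds,
           count(*) AS request_count
    FROM alb_logs
    WHERE elb_status_code >= 500
      AND year = '2021' AND month = '03' AND day = '15'
    GROUP BY floor(target_processing_time)
    ORDER BY target_time_seconds;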

  • If the target processing time is 0, it means the ELB is having trouble establishing a connection with upstream services.
  • If the target processing time is almost 60s (the default ELB idle timeout), then upstream services aren’t responding fast enough. In this case, you’ll need to investigate why upstream services are taking so long to respond or increase the ELB’s idle timeout.

Grouping the requests based on client IP:
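
Another sketch against the same hypothetical table:

    -- Check whether the failing requests come from a handful of clients.
    SELECT client_ip,
           count(*) AS error_count
    FROM alb_logs
    WHERE elb_status_code >= 500
      AND year = '2021' AND month = '03' AND day = '15'
    GROUP BY client_ip
    ORDER BY error_count DESC
    LIMIT 25;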

  • If all erroneous requests were generated from a handful of clients, it may be the case that those clients are experiencing issues with their internet connection.

Grouping requests using HTTP verbs and URLs to discover slow endpoints:
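
Something like the following surfaces the slowest endpoints; url_extract_path and approx_percentile are built-in Presto functions available in Athena, and the rest of the names follow the hypothetical table above:

    -- Slowest endpoints by average (and p99) target processing time.
    SELECT request_verb,
           url_extract_path(request_url) AS path,
           count(*) AS request_count,
           avg(target_processing_time) AS avg_target_time,
           approx_percentile(target_processing_time, 0.99) AS p99_target_time
    FROM alb_logs
    WHERE year = '2021' AND month = '03' AND day = '15'
      AND target_processing_time >= 0
    GROUP BY request_verb, url_extract_path(request_url)
    ORDER BY avg_target_time DESC
    LIMIT 25;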

Note: We’ve partitioned (indexed) the Athena table for our ELB logs by year, month, and day. Using a date-based partition is crucial because it lets us load and query only the log files in the timeframe we’re interested in (see the example after this list). These scoped queries have the following benefits:

  • Our queries are faster because we’re processing a subset of the log files.
  • Athena charges by the amount of data processed; processing a subset of log files means we spend less per query.
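
As a final sketch (same hypothetical table, illustrative dates), widening the partition predicate is how we look at a longer window while still scanning only the days we care about, for example counting 5XX responses per day over a week:

    SELECT day,
           count(*) AS error_count
    FROM alb_logs
    WHERE elb_status_code >= 500
      AND year = '2021' AND month = '03'
      AND day BETWEEN '09' AND '15'
    GROUP BY day
    ORDER BY day;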

If you want to avoid spending money on Athena queries, another way to comb through the logs is to download the relevant log files from S3 and write a script that parses and filters the requests.

Reducing the Noise

After thoroughly querying the ELB access logs, we saw that all of our 5XX requests had a target processing time of roughly 60 seconds (i.e., the ELB waited 60 seconds for an upstream server to respond). It looked like upstream servers weren’t responding fast enough, which resulted in the ELB throwing a 5XX. We tested our upstream servers to figure out why they were responding so slowly, and we found they weren’t actually responding slowly at all.

After doing some research, we discovered the problem was an inconsistent network configuration.

Open connections between the load balancer and upstream servers

When the ELB forwards a request to an upstream server, a connection is established and data is subsequently transmitted and received over that connection. Establishing a connection is slow and expensive; to avoid opening a new connection for every request, our ELB and upstream servers were using the same connection to handle several requests (known as keep-alive connections).

The load balancer maintaining a connection with no receiving server

Our upstream servers were configured to close these connections after a period of inactivity to free up resources. Due to a race condition, our ELB would sometimes think these connections were still open. The ELB would try sending data over this “ghost” connection, and the request would time out since nothing on the other end would respond.

To fix this issue, we made sure our upstream servers had a larger keep-alive timeout than our ELB’s idle connection timeout, ensuring the ELB never holds on to a ghost connection. For example, with the ELB’s idle timeout at its default of 60 seconds, setting the upstream keep-alive timeout comfortably higher (say, 75 to 120 seconds) guarantees the load balancer is always the side that closes an idle connection first. After making this change, we stopped receiving periodic alerts, and our alarm was once again healthy.

Conclusion

Once an alarm is resolved, there are some metrics we consider when evaluating the significance of an alert:

  • Number/percentage of users affected
  • Duration of degraded user experience
  • The monetary cost of the event

If the alarm was triggered even though none of these factors show meaningful impact, you may have a noisy alarm, and in our case, we did. Our users weren’t significantly impacted, we were in an “alarm” state for less than a minute, and the alert auto-resolved.

Let’s recap why we want to make sure this alarm never triggers under insignificant circumstances again:

  • We need to reduce the cognitive load for our developers.
  • Actionable alerts shouldn’t be hidden under a blanket of noisy alarms.
  • First responders shouldn’t be disturbed with an alert after business hours.

In some cases, making an alarm less noisy simply means raising or lowering the alerting threshold. In others, you may have to do a deeper dive and adjust server or network configurations.

Closing remarks

  • Whether you’re using Athena, CloudWatch, or custom Python/C scripts, teams should conduct ELB diagnosis drills every couple of months. Having a practiced workflow for investigating and resolving issues is extremely valuable in critical situations.
  • Some ELB 5XX errors result from an AWS outage. It’s best to timebox investigations and open an AWS support ticket if you’re not able to arrive at a root cause. AWS specialists will be able to tell you if the issue was on their end.

