Improving Incident Recovery by Using the SLI Pyramid
Improving system reliability and uptime by prioritising SLIs
Let me take you back to one of my most painful production incidents: one of our main datastores failed during what we thought would be a routine maintenance operation.
The datastore was a main component in our system at that time and turned out to be a single point of failure. While it was down, we couldn’t process any requests.
Full recovery was estimated to take a few hours, which seemed like a lifetime.
About ten minutes into the recovery process, a few talented engineers from the incident response team came up with an idea that might get our system operating again: disconnecting the logic that used the datastore from the main process by toggling a “doomsday” configuration setting we had.
Sounds amazing! The first rule in the incident response rule book — “Stop the bleeding! Get the system back up.” Well, it turned out not to be that simple. Rule one in the rule book has some vague implications we were about to uncover.
In this article, I’ll explain that although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.
The Incident — And The Response
Circling back to my story, our system was processing messages from a message bus using a microservice dependent on the failing datastore.
The first thing we did when we identified the issue was to scale down our workers that pulled requests from one of our message buses.
By doing this, we limited the noise in the system and gave the impacted services room to recover. At this point, requests were queued up in our message bus, waiting to be processed.
Circling back to the configuration setting, we debated whether to toggle it.
Toggling it meant we could turn off the component that used the datastore, thus “unblocking” the system, scale it back up, and start processing the (already huge) backlog of requests. But the microservice using the failing datastore was there for a reason! It performed an important role in our system — could we provide valid results without it?
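The kind of “doomsday” setting we were debating can be sketched as a kill switch around the dependent component. Everything below is illustrative — the names are made up, and an environment variable stands in for whatever dynamic configuration service a real system would use:

```python
import os

def enrichment_enabled() -> bool:
    # In a real system this would read a dynamic configuration service;
    # an environment variable stands in for it in this sketch.
    return os.environ.get("ENRICHMENT_ENABLED", "true") == "true"

def process_request(request: dict) -> dict:
    result = {"id": request["id"], "payload": request["payload"]}
    if enrichment_enabled():
        # Normal path: the microservice backed by the datastore
        # would be called here.
        result["enriched"] = True
    else:
        # Doomsday toggle flipped: skip the dependent component so the
        # main pipeline keeps flowing, at the cost of partial results.
        result["enriched"] = False
    return result
```

The catch we were wrestling with lives in the `else` branch: the pipeline is technically “up,” but every result it produces is partial.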
This debate lasted about an hour and involved various people from the organization:
- The engineering team was eager to toggle the setting and get the system back up and running
- One of the business liaisons on the incident response team found out that toggling it would degrade our results in a way that would breach our SLA with a few of our top customers
- Next, the head of engineering pointed out it might cause issues downstream when other “offline processes” tried to handle partially processed requests (partially = without the input from the failing microservice)
We couldn’t decide what to do; each alternative had pros and cons. Eventually, it escalated to the company’s executives, who decided the SLA impact was unacceptable, and we couldn’t toggle it.
We sat there waiting for the recovery process to finish. Although this was the correct decision, it was still an awful feeling.
The bottom line is that it’s “easy” to mitigate an incident when the mitigation process fixes the underlying issue and returns the system to its correct state. But when the mitigation process puts your system in a different bad state, it’s much harder to decide whether to go down this mitigation path.
The SLI Pyramid
First, it’s important to mention that the method I propose is not always possible and is deeply dependent on the system you have at hand, your understanding of the business and technology, and your company’s business stance.
The issue that made our decision so difficult was that “getting the system back up” was infeasible at that point. We could only choose between two different bad states.
I suggest addressing this by slightly redefining what “up” means. Instead of a single state, think of it as an “SLI Pyramid”: a visual representation of SLI prioritization. It means we decide — during system design, a system launch review, or at any other point in time — what our most important SLIs are, and order them by importance. An example pyramid might look something like this:
This pyramid means that SLIs at the top of the pyramid can be sacrificed if we can secure the SLIs at the bottom of the pyramid — when forced to choose between bad states, we now have a way of defining which are preferred.
Consider it to be the Maslow pyramid of your system.
In the pyramid example above, when an incident escalates to a state where the mobile platform is degraded and we also have a data quality issue — if turning the mobile platform off would fix the data quality issue — we are authorized (and expected) to do it!
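To make the prioritization concrete, a pyramid can be encoded as nothing more than an ordered list, with a helper that answers “may we sacrifice X to protect Y?”. This is a minimal sketch; the SLI names and the helper are illustrative, not from any real system:

```python
# The pyramid as an ordered list: index 0 is the bottom of the pyramid
# (most important); later entries may be sacrificed to secure earlier ones.
# SLI names here are illustrative.
SLI_PYRAMID = [
    "data_quality",
    "sla_latency",
    "mobile_platform_availability",
]

def may_sacrifice(sacrifice: str, protect: str) -> bool:
    """A mitigation that degrades `sacrifice` in order to secure
    `protect` is allowed only if `protect` sits lower in the pyramid."""
    return SLI_PYRAMID.index(protect) < SLI_PYRAMID.index(sacrifice)

# Turning the mobile platform off to fix a data quality issue: allowed.
print(may_sacrifice("mobile_platform_availability", "data_quality"))   # True
# Degrading data quality to keep the mobile platform up: not allowed.
print(may_sacrifice("data_quality", "mobile_platform_availability"))   # False
```

The value isn’t in the code, of course — it’s that the ordering was agreed on and signed off *before* the incident, so the on-call team can apply it without convening the executives.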
The pyramid should be defined with the product owner, business, and whoever can sign off on such an important decision.
It’s important to mention that this doesn’t always work; sometimes making such a discrete prioritization is impossible. But in most cases I’ve encountered, it’s completely doable (even if it means slightly more complex SLIs). For example, it could be something like this:
The pyramid doesn’t have to cover 100% of use cases, incidents, and SLIs, but I am certain some of your SLIs are so clear they can easily be prioritized.
A Word About Analysis Paralysis
In my story, the incident response team suffered from analysis paralysis — being unable to decide due to overanalyzing the problem.
It took us hours of debating, weighing alternatives and outcomes, and even then, only escalation to a company executive enabled us to decide.
This was quite a frustrating scenario: frustrating for the response team, helplessly waiting for the recovery process to finish, and frustrating for the creative engineer who came up with the alternative mitigation path, since we weren’t acting on it.
The SLI Pyramid tries to “pre-analyze” the most important SLIs of the system and make the decision-making during outages a little easier.
Closing Thoughts
I believe that applying this method can really help with reducing incident friction and maintaining the psychological safety of your responders. This is also a great way to think of production drills or “game days” where we can simulate choosing between sub-optimal mitigation steps and track how our response team operates under these tough conditions.
This article is based on my talk “The (ir)rational Incident Response: How Psychological Biases Affect Incident Response” and is available in English and Hebrew.
As always, thoughts and comments are welcome on Twitter at @cherkaskyb.