Designing a Backup and Disaster Recovery Plan
A comprehensive disaster recovery guide takes time, planning, and automation
A deep-dive into backup and disaster recovery planning, as a follow-up post to: The Many Facets of Infrastructure
A comprehensive backup plan is much more complex than just taking nightly backups. There are many different scenarios to consider and covering all of them can be unrealistic for your organization’s current state. The critical part of a good backup and disaster recovery plan is having a clear understanding of what you cover, what your risks are, and ensuring all decision makers are comfortable with those risks.
For the purpose of this article, I’m writing from the perspective of someone responsible for an application on a major cloud provider, using AWS for specific examples. There are different complexities to consider if you are designing a disaster recovery program for an on-prem system.
First, we are going to focus on 3 different types of disaster scenarios to plan around:
- Redundancy for Hardware/Network Disasters → depending on the scope and duration of the failure, what is your plan to recover your infrastructure? This covers everything from simple hardware failures to ecological disasters.
- Recoverability from Application-Driven Disasters → if the application itself is driving the deletion (software bug, misconfiguration, or user error), how do you recover from that?
- Recoverability from Insider Threat → can your most senior admin override all the security controls? Or is everyone bound by a technical control?
Then we are going to look at some qualities that reflect healthy backups:
- Integrity → how do you know your backups are intact, uncorrupted, and haven’t been tampered with?
- Usability → if you actually tried to restore from the backup, would it work?
- Security → how are your backups encrypted, and who can access them?
- Documentation → what’s the point of backups, if you don’t know how to use them?
Then we are going to look over some commonly overlooked non-technical topics:
- Personnel redundancy — sure, your infrastructure is hosted in two regions, but what about your team?
- Cross-System Dependencies and Composite SLAs — what services need to be online in order for you to restore your service, and how likely would an event be that could take out both?
And finally, we are going to wrap up by talking about Routine Disaster Recovery Testing. Ignoring the fact that this is required by multiple compliance programs, this is a healthy habit every team should be following.
It also wouldn’t be me if I didn’t include some templates or Excel spreadsheets, right?
How should I read/use this document?
Read, Assess, Improve, Repeat
This article covers many aspects of contingency planning, and you are not meant to crank each dial to max. Contingency planning is about risk avoidance, risk mitigation, and risk acceptance. My recommendation is that you read the material, evaluate where your team is at, and propose incremental improvements. If successful, then re-read, reassess, and incrementally improve.
If you read this article and try to suggest massive, expensive revamps for extremely rare error conditions, you will be met with pitchforks. Conversely, if you skim this and think that I’m suggesting everyone should build a multi-cloud, multi-region active-active replication strategy with cold backups shipped to Iron Mountain, then you missed the point: identifying and accepting risk is part of a healthy ecosystem.
To create a consistent narrative, let’s say your job is to build a disaster recovery plan for an AWS RDS database. The database is critical to your business, but it is not overly massive in size or complexity. Your company isn’t overly concerned about cost for this use case.
Disaster Scenarios To Consider
Redundancy for Hardware/Network Disasters
Someone newer to DevOps might assume that environmental disasters only refer to apocalyptic events. However, cloud providers have had region-wide disruptions due to bad code changes or unexpected load. From the DevOps perspective, you are still responsible for getting your service back up and running.
Scope-of-Failure:
- Single Server Failure: a single server or container unexpectedly dies
- Single Site Failure: a single datacenter is affected
- Region Failure: all datacenters in a region are affected
- Multi-Region Cloud Service Failure: service(s) provided by the Cloud Provider go down globally
The only case I don’t bother considering is Multi-Region Multi-Provider Failure, as this means that every major cloud provider in every region of the world is simultaneously destroyed. At this point, I would begin barricading my front door in prep for the inevitable collapse of society, and my job would be the last thing on my mind.
For your RDS database, your minimum bar is going to be running AWS RDS in a multi-AZ architecture. To cover region failure, you need to either use a multi-region architecture or use a tool like AWS Backup to copy the data to another region. To be extra cautious about a multi-region cloud service failure, you set up an Amazon SWF workflow to dump the DB and copy it to Azure Blob Storage.
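As a rough illustration, here is a minimal boto3/azure-storage-blob sketch of the cross-region and cross-cloud copies. The instance, snapshot, region, bucket, and container names are all hypothetical, and the Azure half assumes you have already produced a database dump (the orchestration itself, whether SWF or a plain cron job, is left out).

```python
import os
import boto3
from azure.storage.blob import BlobServiceClient

# --- Cross-region copy of an RDS snapshot (run from the destination region) ---
# Note: encrypted snapshots also require a KmsKeyId in the destination region.
rds_dr = boto3.client("rds", region_name="us-west-2")  # hypothetical DR region
rds_dr.copy_db_snapshot(
    SourceDBSnapshotIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:snapshot:orders-db-2024-06-01"
    ),
    TargetDBSnapshotIdentifier="orders-db-2024-06-01-dr",
    SourceRegion="us-east-1",
)

# --- Cross-cloud copy: push a plain DB dump into Azure Blob Storage ---
blob_service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
blob = blob_service.get_blob_client(
    container="db-dumps", blob="orders-db-2024-06-01.dump"
)
with open("/backups/orders-db-2024-06-01.dump", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```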
Duration-of-Failure:
- Short term: <15 minutes
- Medium term: <2 hours
- Long term: <24 hours
- Unknown: >24 hours
- Permanent: No recovery
The reason to consider durations of failure is that some disaster recovery plans will only get activated under certain conditions. For example, changing a DNS entry for a short-term outage might cause more harm than good. For your organization, change these bullets/numbers based on your situation: these aren’t hard and fast rules. The strategy of documenting plans based on duration-of-failure is a really useful way to organize the information.
I personally consider unknown outages to be worse than permanent from a psychological standpoint. If you are already over a day into an outage and don’t know the end, how many days do you go before you consider the outage conditions permanent? For example: New York City has been out of power for 48 hours, FEMA is called in. Will power be restored today? Tomorrow? Next week? How long do you wait before taking drastic action?
For your RDS database, since you hosted it in a multi-AZ architecture, your strategy for single server/site failure is the same regardless of duration: fail over to the standby in the other AWS AZ. For a regional failure of short-to-medium duration, you rely on the read-replica you set up in the multi-region architecture and leave the service in a degraded state. For any duration longer than medium term, you spend the time to promote that regional read-replica to be the new primary instance. If an AWS (or RDS) outage lasts for more than 24 hours, you assume the worst and start engineering an Azure-based architecture from the most recent data dump.
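For the long-duration regional case, the promotion step itself is small. A hedged boto3 sketch (the replica identifier and DR region are assumptions):

```python
import boto3

# Promote the cross-region read replica once the outage has clearly exceeded
# your medium-term threshold. Promotion breaks replication, so only do this
# when you've accepted that the primary region is gone for a while.
rds = boto3.client("rds", region_name="us-west-2")  # hypothetical DR region
rds.promote_read_replica(DBInstanceIdentifier="orders-db-replica-west")

# Promotion is asynchronous; wait for the instance to report "available"
# before repointing application connection strings at it.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="orders-db-replica-west"
)
```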
Recoverability from Application-Driven Disasters
By application-driven, I mean that a user/admin/attacker drives a data corruption/deletion action through the application. For example, a user accidentally deletes the content of an important table. Normal database-replication strategies don’t work here because the unintended deletion would still get propagated to all the replicas.
One strategy is to use a cold backup. The idea here is that you capture the state of the data at a point in time, and then roll back (or spin up a new host) from that point in time. This doesn’t protect data that was created and then deleted in between two snapshots. This is where your recovery point objective (RPO) comes in: what you are trying to weigh is the cost of more frequent snapshots vs the business risk of losing the data created between them.
For your RDS database, you determine that it is essential to minimize data loss, and the dev team cannot afford the cycles to build this functionality into the application. You choose to use continuous backups via AWS Backup, which gets your cold backups down to 1-second restore granularity.
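A minimal sketch of what that looks like with boto3, assuming hypothetical plan, vault, role, and instance names (continuous backups cap out around a 35-day retention window):

```python
import boto3

backup = boto3.client("backup")

# EnableContinuousBackup turns on point-in-time recovery for supported
# resources (RDS included), giving roughly per-second restore granularity.
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "orders-db-continuous",
        "Rules": [
            {
                "RuleName": "continuous",
                "TargetBackupVaultName": "orders-db-vault",
                "EnableContinuousBackup": True,
                "Lifecycle": {"DeleteAfterDays": 35},  # PITR maximum
            }
        ],
    }
)

# Attach the database to the plan by ARN (role name is an assumption).
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "orders-db",
        "IamRoleArn": "arn:aws:iam::123456789012:role/AWSBackupDefaultServiceRole",
        "Resources": ["arn:aws:rds:us-east-1:123456789012:db:orders-db"],
    },
)
```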
An alternative strategy to cold backups is delayed deletion. This can be implemented on the application layer or underlying datastore (if compatible). The idea is that instead of deleting or modifying data, the system keeps both copies of the state, and purges the old copy after a period of time (e.g. AWS S3’s Object Versioning).
Implementing delayed deletion on the application layer is great for creating a seamless user recovery story (e.g. have an Undo button in the app). Implementing delayed deletion on the infrastructure side can be much simpler. I personally would make this a product decision based on the frequency or criticality of the workflow.
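On the infrastructure side, the S3 flavor of delayed deletion is just two API calls. A sketch with boto3, using a hypothetical bucket name and a 30-day grace period:

```python
import boto3

s3 = boto3.client("s3")
bucket = "orders-exports"  # hypothetical bucket name

# Keep prior versions of every object instead of deleting them outright...
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# ...then purge the old versions after a grace period (30 days here).
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```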
Recoverability from Insider Threat
Alright you security-minded individuals, this one’s for you. What if one of your administrators decides they’re upset with your company’s politics and does the software equivalent of burning the building down? Most companies I’ve talked to just accept that if their CISO or head of cloud operations goes rogue, they could end the company. I hate this give-up attitude: we should strive to make sure it’s not possible for a single individual to end your entire business. It’s worth noting that this doesn’t have to be malicious; people can just make mistakes.
The gold standard would be to make your backups immutable to even a group of administrators with nefarious intent. There are security settings in each cloud provider that are meant to protect against accidental deletion (e.g. AWS EC2 Termination Protection or AWS S3 mfa-delete), but an administrator with root access can override all these trivially. What you are looking for is a backup mechanism that cannot be overridden (e.g. AWS Backup Vault locks).
For your RDS database, you set up an AWS Backup Vault to take hourly snapshots, and you use a Vault Lock to prevent all modifications to the backups for a 7 day retention window.
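A hedged sketch of that vault lock with boto3 (the vault name and grace period are assumptions; once the `ChangeableForDays` window passes, the lock cannot be loosened, even with root credentials):

```python
import boto3

backup = boto3.client("backup")

# Create the vault that the backup plan targets, then lock it so recovery
# points cannot be deleted before the 7-day retention window elapses.
backup.create_backup_vault(BackupVaultName="orders-db-vault")
backup.put_backup_vault_lock_configuration(
    BackupVaultName="orders-db-vault",
    MinRetentionDays=7,    # matches the 7-day retention window above
    ChangeableForDays=3,   # grace period before the lock becomes immutable
)
```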
Ready to make things even worse? Grab that tinfoil hat. What if a rogue administrator on the cloud provider side decides to burn you down (and yes, this is a real risk)? No matter what AWS tells you, I still wouldn’t believe them (no offense). The question you have to ask yourself is: am I willing to let my company fail based on the actions of a single individual? When you are a startup of 3 people, then yeah, that’s a fine way to go down. If you are a $100mm company, that seems irresponsible.
Qualities of a Healthy Backup Process
The goal of this section is to go through several key properties of a healthy backup process: Integrity, Usability, Security, and Documentation. There are other qualities, but these came to mind as the most universal qualities. If you think I’m missing anything, let me know, and I can update this in a future post.
Integrity
This seems as good a place as any to talk about Backup Integrity. Whether this is for compliance or security reasons, the question here is how do you know if your backups have maintained their integrity.
- The backups are complete and usable — not silently missing due to a failed snapshot, a partial upload (especially prevalent with large files), etc.
- The backups are not tampered with — modification of backups is either entirely prevented or is at least detectable
A great real-life example of this was a former project where we set up a script to copy backups to S3. For large file transfers, you are expected to use AWS S3 multipart upload to handle normal network disruptions, but we didn’t use it because our backup size was so small. This was fine for 3–4 years, until the system grew 100x and a single backup took ~12 hours to upload. As the backups grew in size, so did the error rate, and pretty soon most backups were partial uploads. At the time we weren’t monitoring the integrity of the backups, and this went undetected for months before getting caught by our disaster recovery testing process.
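For reference, here is roughly what the fix looked like, as a hedged boto3 sketch (bucket, key, and file path are placeholders): `upload_file` switches to multipart uploads and retries individual parts once the file crosses the threshold, and a simple size comparison catches truncated uploads.

```python
import os
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
bucket, key, path = "orders-db-backups", "dumps/orders-db.dump", "/backups/orders-db.dump"

# Above the threshold, upload_file() uses multipart uploads with per-part
# retries instead of one giant, fragile PUT.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
)
s3.upload_file(path, bucket, key, Config=config)

# A cheap integrity check: the object in S3 should match the local size.
remote_size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
assert remote_size == os.path.getsize(path), "backup upload is incomplete"
```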
If you’ve implemented backup protection using tools like AWS Backup Vault locks, then you shouldn’t need to worry about modification of your backups. If you weren’t able to implement a strong technical control, there are still things you can do to detect modification of backups, and in most cloud providers this comes for free (or almost free) by monitoring creation and modification dates.
For your RDS database: AWS RDS Snapshots cannot be modified once created (only deleted). In theory, someone could delete a backup and create a new one with the same name, but AWS will still show the new creation date.
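If you want to turn that into a cheap monitoring check, a sketch like the following works (the instance name and manifest path are assumptions): it diffs today's snapshot inventory against yesterday's, so both deletions and snapshots recreated under the same name raise an alert.

```python
import json
import boto3

rds = boto3.client("rds")

# Today's inventory of snapshots for the instance.
snaps = rds.describe_db_snapshots(DBInstanceIdentifier="orders-db")["DBSnapshots"]
current = {
    s["DBSnapshotIdentifier"]: s["SnapshotCreateTime"].isoformat() for s in snaps
}

# Yesterday's inventory, written by the previous run of this script.
try:
    with open("snapshot-manifest.json") as f:
        previous = json.load(f)
except FileNotFoundError:
    previous = {}

for name, created in previous.items():
    if name not in current:
        print(f"ALERT: snapshot {name} has been deleted")
    elif current[name] != created:
        # Same name, different creation time: the snapshot was recreated.
        print(f"ALERT: snapshot {name} was recreated at {current[name]}")

with open("snapshot-manifest.json", "w") as f:
    json.dump(current, f, indent=2)
```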
Usability
Just because you have a backup, doesn’t mean it’s feasible to use it under normal operations. Here are some examples:
- When an EBS volume is created from an EBS snapshot, the drive is created immediately but the data is restored lazily over time. If disk performance is critical, you’d need to warm the drive, which itself could take hours (or look into EBS Fast Snapshot Restore).
- For some backups, the duration to restore is based on the number of items, not the size. A backup that worked fine at 10k objects might be entirely unusable at 100mm items, even if it’s small in total size.
- Though rare (and a sign of awful engineering decisions), some backups can be encoded with data that isn’t easily transferable to a new host, such as certificates, host names, DNS names, database names, etc.
The important part here is that you’ve fully tested your backup routines, and you know approximately how they scale. Periodic testing of the backup (at least yearly, or more often based on risk) becomes an important validation. You could even consider setting alarms to monitor if the data reaches a certain scale that would necessitate that you re-test.
For your RDS database: AWS uses a background job to restore data from a snapshot to a volume, but prioritizes data that is being read. When restoring from a snapshot, you document a command that reads across all the table data to force AWS to prioritize loading it. Only after this is complete do you put the new RDS database into production.
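The "read everything" command doesn't need to be fancy. A sketch for a PostgreSQL-flavored RDS instance (the engine, endpoint, and credentials are assumptions; adapt the query for your engine):

```python
import os
import psycopg2

# Hypothetical connection details for the freshly restored instance.
# Sweeping every table forces RDS to pull the snapshot blocks into the
# volume before the database takes production traffic.
conn = psycopg2.connect(
    host="orders-db-restored.xxxx.us-east-1.rds.amazonaws.com",
    dbname="orders",
    user="admin",
    password=os.environ["DB_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
          AND table_schema NOT IN ('pg_catalog', 'information_schema')
    """)
    for schema, table in cur.fetchall():
        # Identifiers come from information_schema, so this internal sweep
        # is safe to build with simple quoting.
        cur.execute(f'SELECT COUNT(*) FROM "{schema}"."{table}"')
        print(schema, table, cur.fetchone()[0])
```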
Security
All the privacy, permissions, encryption, and security standards that apply to your running app most likely apply to your backups as well. Security engineers are generally quick to look for this, but DevOps folks early in their careers can easily miss it.
- Unauthorized Access — Whoever has access to your backups should be held to a similar standard as the live data
- Encryption — Apply the same encryption standards across both the app and the backup
A word of caution from experience: usability and security need to work together. Our team was so security-centric that we accidentally made the backups unusable by the operations team, and during an emergency outage the operations team wasted valuable time trying to debug why their restore snapshot workflow wasn’t working before escalating. After implementing your security policies, re-check your restoration procedures in full to ensure that nothing was overlooked.
For your RDS database: The database uses KMS for encrypting its live data, and the same key is used to encrypt the snapshots to ensure uniformity of permissions. To prevent easy leakage, the KMS key policy is set to only allow decryption from specific users and AWS RDS.
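As a sketch, the key policy might look something like the following. The account ID, role name, and key ID are hypothetical, and an administrator statement is kept so the key cannot be orphaned; in practice RDS accesses the key through grants created when the encrypted instance is built.

```python
import json
import boto3

kms = boto3.client("kms")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Keep an admin statement so the key cannot be orphaned.
            "Sid": "KeyAdministration",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
            "Action": "kms:*",
            "Resource": "*",
        },
        {
            # Only the DBA on-call role and the RDS service may use the key
            # for decryption; everyone else is denied by default.
            "Sid": "AllowUseByRdsAndDbas",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/dba-oncall",
                "Service": "rds.amazonaws.com",
            },
            "Action": ["kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
            "Resource": "*",
        },
    ],
}

kms.put_key_policy(
    KeyId="1234abcd-12ab-34cd-56ef-1234567890ab",  # hypothetical key ID
    PolicyName="default",
    Policy=json.dumps(policy),
)
```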
Documentation
Every engineer’s least favorite topic: documentation.
No backup process would be complete without documentation to talk through how and when to use the backup. An operator isn’t going to jump to restoring from a backup until they’ve exhausted the other options, so their starting place will be the operations guide for the service. This is where the backup and restore procedures should be documented.
Here are some example ideas to get you started:
On a per-service layer, have a document that covers:
- Architecture diagram showing the various layers of redundancy
- SLAs, Recovery-Point Objective, Recovery-Time-Objective
- Descriptions of the different backups and recovery procedures
- What monitors for failure to capture the backup?
- Under what conditions should the backups be used?
- Who are the system experts for emergencies?
- Internal service experts, enterprise support contacts, escalation process
- A list of each cross-system dependency: contact information, hard vs soft dependency, composite SLA
On an infrastructure-wide layer, have a document that covers:
- Architecture diagram showing critical components, traffic flow, ingress/egress points, etc
- Service listing with links to each contingency plan document
- The organization policies around annual testing, approved backup utilities, etc.
- An organization-wide outage management policy, including standards for communication, war rooms, retrospectives, escalation thresholds, etc
Commonly Overlooked Topics
Now that we’ve covered the core technical parts, let’s dive into two commonly overlooked topics when creating a disaster recovery plan: personnel redundancy and cross-system dependencies. The mindset here is to take a step back from the raw mechanics of restoring just your service, and think about what else might be going on at the same time.
For example, an ecological disaster in New York has taken out your company’s main product and its enterprise services (auth, vpn, etc). You are working remotely and are unaffected. What enterprise services need to be restored first before working on the main product? How many of your team are in New York and are now unavailable to help you?
These types of questions are an important part of a holistic disaster recovery plan.
Personnel Redundancy
The worst time for something bad to happen is when something bad is already happening.
What is the point of spending so much time and money on multi-region, multi-cloud strategies when a bad Chipotle dinner during an offsite could take out the whole team? Ensuring that there are adequately trained people, available in multiple regions, is a minimum bar for any company worth a significant amount.
- People Redundancy: there exists multiple experts for each critical system
- Regional Distribution: the system experts are regionally distributed, in case a region-specific event occurs
Regardless of whether it’s COVID, a plane crash, or food poisoning during an offsite, a good engineering leader thinks about their people just as much as their infrastructure. I’m not suggesting you do extreme things like never hold off-sites, or force people to relocate, but at a minimum you need to think about the risks you incur on a daily basis and the risks you incur during get-together events like a company offsite. If your tour bus gets lost in a shady part of town, ensure the rest of the company can function without you.
Cross-System Dependencies and Composite SLAs
This is my favorite overlooked topic because it’s so obvious yet so easily discounted. You’ve made all these plans to make your infrastructure region-agnostic and ensure that your team members remain geographically separated, but you forget that during a large enough disaster, the SaaS you use for documentation could also undergo an outage. Now you are stuck recovering your backups without any documentation. Let’s hope everyone has a great memory.
Here is a list of common dependencies that people forget:
- Authentication
- DNS
- Documentation
- Hosting Provider Console Access
- Secret Management
- Source Control / CI / In-House Artifacts / Third-party Artifacts
- VPN
If you have not done a cross-system dependency analysis from a disaster recovery perspective, you should make this a top priority.
Let’s say you are unlucky enough to have a hard dependency on all of the services mentioned above. The SLA for recovering your service is now a composite SLA: the time to fix your service is blocked on all the other services recovering first. If those services also get restored in a linear fashion, you’d best get ready for a long wait.
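It's worth doing the arithmetic at least once. A toy calculation with made-up SLA numbers (substitute your providers' real figures):

```python
# Toy composite-availability calculation for hard dependencies.
# The SLA figures below are illustrative, not anyone's real commitments.
dependencies = {
    "authentication": 0.999,
    "dns": 0.9999,
    "documentation": 0.995,
    "cloud console": 0.999,
    "secret management": 0.999,
    "vpn": 0.999,
}

composite = 1.0
for sla in dependencies.values():
    composite *= sla

downtime_days = (1 - composite) * 365
print(f"Composite availability: {composite:.4%} "
      f"(~{downtime_days:.1f} days of potential unavailability per year)")
# Prints roughly 99.09%, i.e. over three days per year, even though each
# dependency looks highly available on its own.
```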
There are often steps you can take today to remove these cross-system dependencies. Here are some examples to get your brain moving:
- Authentication, Documentation, and Secret Management — create a local administrator user, print out the secret and its corresponding disaster recovery plan, and put both in a fire-proof safe.
- Source Control / In-House Artifacts / Third-party Artifacts — bake all of your service dependencies into an image that you copy between multiple regions/cloud providers.
- CI — ensure your recovery plan includes how to execute the steps manually.
For your RDS database, you determine that your disaster recovery plan relies on access to either the AWS or Azure console and the database master password. You print out the credentials of an unrestricted admin user in AWS/Azure, the database master password, and the disaster recovery documentation, and put them in a safe. You use Shamir’s Secret Sharing on the secrets and spread the shares out amongst members of your team and the security team.
I know this sounds like a lot of overhead, but this is where standards and team practices come into play. Getting a fireproof safe, learning how to Shamir a secret, and figuring out your system for printing out documentation is a pain the first time. But after that, it becomes part of a pattern that is cheap to follow.
Consider this: if your company already considers things like earthquake kits or emergency power units as a standard, then this proposal is not a big leap.
Disaster Recovery Testing
If you’ve made it this far in the article, then you either believe I’m crazy (and should stop reading) or you have bought into the idea that backup and disaster recovery is a complex topic. Especially when it comes to cross-system dependencies, the only way to be confident in your plan is to test it. This section goes through one approach to building a disaster recovery testing program, starting with the compliance requirements as a foundation.
Compliance Requirements for Disaster Recovery Testing
Multiple compliance programs and audits specifically require documented disaster recovery plans and testing:
- FedRAMP via NIST 800-53 CP-4 — “at least annually”
- SOX via Section 404 — “each annual report”
- HIPAA via 45 CFR 164.308(a)(7)(ii)(D) — “periodic testing”
Amongst engineers, compliance is pretty universally disliked, but from my perspective, the vast majority of compliance is just asking you to do things you already know you should be doing. I would like to write an entire post just on how toxic compliance can evolve in a company, but here’s the important part: if your disaster recovery plan is comprehensive and reduces risk, the added work from compliance is a non-issue.
Even if your company doesn’t need to do disaster recovery testing for a compliance program, let’s talk about what the compliance requirements almost always look like:
- Redundancy — data should be available in multiple sites
- Recoverability — cold backups of the data should exist
- Integrity — proof that the backups aren’t modified after-the-fact
- Security — backups don’t expand access and are encrypted properly
- Documentation — plans are documented, reviewed, and updated
- Training — people actually know the plan
- Testing — (at least) annual testing is performed, with follow ups documented and actioned upon
Everything above is a reasonable ask. Don’t make compliance out to be the problem. Implement these steps early in your organization, and you won’t be stuck trying to half-ass a testing program the week before your first audit.
Annual Disaster Recovery Testing
We now have all the components:
- Redundancy — covered in “Redundancy for Hardware/Network Disasters”
- Recoverability — covered in “Recoverability from Application-Driven Disasters” and “Recoverability from Insider Threat”
- Integrity, Security, and Documentation — covered in “Qualities of a Healthy Backup Process”
All we need now is Training and Testing, and what better way to do training than via your testing! Your disaster recovery exercise should cover (at minimum):
For each service:
- Perform an AZ, Region, or Cloud failover test, validating a multi-site recovery story
- Validate your cold backups (or equivalent): existence, integrity, and usability
- Update the documentation, even if it’s just updating a timestamp/sign-off sheet
- Update the disaster recovery testing spreadsheet with results, follow-ups, who performed the test(s), and when the test(s) were performed.
Have the team-lead or department lead sign off on the collective results.
Pro tip: some auditors are particular about wanting clear instructions on what the disaster recovery exercise covers, who is responsible, who the testers were, and dates for everything. Do yourself a favor and spend the 5 minutes necessary to document this.
Example Disaster Recovery Template 1 — This DR template is made in Google Docs, and it is meant to break the work up into a summary table and a follow-up table. It gives much more room for taking notes, extra links, extra narrative, etc. This format works for some, and not for others. Copy what you like out of it, change what you dislike.
Example Disaster Recovery Template 2 — This DR template is made in Google Sheets, and it focuses more on minimizing the amount of documentation that the engineers have to type up. Much more of a “click the box” and move on type approach. If you thought the above format was too verbose, this might work better for you.
Wrap-up
If you made it this far in the article, I am impressed. For most DevOps engineers, this is by far one of the most de-energizing efforts, but anyone who has gone through a disastrous outage knows how important this is.
If you had one takeaway, though, it would be this: a comprehensive disaster recovery guide takes time, planning, and automation. Every year, you should be making incremental improvements.
I am very interested in hearing about gaps in my proposed mindset or the example templates, so hit me up.