Error budgets
In a team following DevOps principles, error budgets play an important part in deciding the direction the team takes. An error budget is calculated with this formula: Error budget = 1 - SLA (as a decimal)
In other words, the error budget is the percentage left over once the SLA is met. With an SLA of 99%, the error budget is 1%: the downtime that is tolerated alongside the promised uptime. Over a 30-day month (720 hours), that comes to about 7.2 hours. The budget then shapes how the team works, depending on its goals:
- If the team’s goal is reliability, then the objective should be to tighten the error budget. Doing this lets the team deliver a higher SLO and gain more trust from customers. If you tighten an SLO from 99% to 99.9%, you reduce the tolerable downtime from 7.2 hours to about 43 minutes per month, so you need to be sure you can deliver on such a promise. Conversely, if you cannot deliver on such an SLO, you shouldn’t promise it in any agreement.
- If the team’s goal is developing new features, then it mustn’t come at the cost of a decreased SLO. If a large amount of the error budget is being consumed every month, then the team should pivot from working on new features to making the system more reliable.
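The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names are hypothetical, a month is assumed to be 30 days, and the 80% threshold for switching from feature work to reliability work is an arbitrary assumption that a real team would choose for itself.

```python
def error_budget_fraction(sla_percent: float) -> float:
    """Error budget as a fraction: 1 - SLA (as a decimal)."""
    return 1.0 - sla_percent / 100.0


def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime tolerated by the budget over the period, in minutes."""
    period_minutes = period_days * 24 * 60
    return error_budget_fraction(sla_percent) * period_minutes


def budget_consumed(downtime_minutes: float, sla_percent: float,
                    period_days: int = 30) -> float:
    """Share of the error budget already spent (1.0 means fully spent)."""
    return downtime_minutes / allowed_downtime_minutes(sla_percent, period_days)


def should_prioritize_reliability(downtime_minutes: float, sla_percent: float,
                                  threshold: float = 0.8,
                                  period_days: int = 30) -> bool:
    """True when spent budget reaches the (assumed) threshold."""
    return budget_consumed(downtime_minutes, sla_percent, period_days) >= threshold


if __name__ == "__main__":
    for sla in (99.0, 99.9):
        minutes = allowed_downtime_minutes(sla)
        print(f"SLA {sla}%: {minutes:.1f} minutes ({minutes / 60:.1f} hours) per 30 days")
```

With these assumptions, 40 minutes of downtime in a month would use most of a 99.9% budget and signal a pivot to reliability work, while 100 minutes would barely dent a 99% budget.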
All these statistics give us metrics that can be used to maintain high availability. But we aren’t the ones who will act on them by hand; we will simply configure them to be used automatically.
How to automate for high availability?
Now that you know the rules of the game, you need to figure out how to work within the rules and deliver on the promises that you have given your customers. To do this, you simply have to meet the commitments set out in your SLAs. That is not particularly difficult on a small scale, but we’re not here to think small.
There are some essentials that every DevOps engineer needs to know to accomplish high availability:
- How to use desired state configurations on virtual machines to prevent state drift
- How to properly back up data and recover it quickly in the event of a disaster
- How to automate recovery of servers and instances with minimal downtime
- How to properly monitor workloads for signs of errors or disruptions
- How to succeed, even when you fail
Sounds easy, doesn’t it? Well, in a way it is. All these things are interconnected, woven into the fabric of DevOps, and dependent on each other. Recovering from failure is one of the most important skills to learn in life, not just in DevOps.
This concept of failing and recovering to a successful state has been taken even further by the DevOps community through the development of tools that maintain the desired state of a workload through code.
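To make the idea of maintaining state through code more concrete, here is a minimal sketch of a reconciliation step, assuming state can be modelled as a simple mapping from a component name to its version. It does not reflect the API of any particular tool; the function names, component names, and version strings are all hypothetical.

```python
from __future__ import annotations


def plan_actions(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compare desired and actual state and list the actions needed to converge."""
    actions: list[tuple[str, str]] = []
    for name, wanted in sorted(desired.items()):
        current = actual.get(name)
        if current is None:
            actions.append(("create", name))   # missing entirely
        elif current != wanted:
            actions.append(("update", name))   # present but drifted
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))       # present but not wanted
    return actions


def apply_actions(desired: dict[str, str], actual: dict[str, str],
                  actions: list[tuple[str, str]]) -> dict[str, str]:
    """Return the state that results from carrying out the planned actions."""
    result = dict(actual)
    for verb, name in actions:
        if verb == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


if __name__ == "__main__":
    desired = {"nginx": "1.25", "app": "2.0"}   # hypothetical desired state
    actual = {"nginx": "1.24", "cron": "1.0"}   # hypothetical drifted state
    plan = plan_actions(desired, actual)
    print(plan)
    print(apply_actions(desired, actual, plan))
```

Running the plan step again on the converged state yields no actions, which is the property that lets such a check run repeatedly and unattended.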