RTOs and RPOs
These two abbreviations are much more availability-focused than the other three. Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) are yardsticks that mark the boundaries of availability. If an application fails to stay within its RTO or RPO, it hasn’t fulfilled its guarantee of availability. RTOs and RPOs are largely concerned with recovering operations after a disaster. There are financial, medical, and other critical systems in this world that couldn’t function if their underlying systems went down for even a few minutes. And given the “everything fails all the time” motto, such a disaster or failure is not unrealistic.
An RTO is placed on a service that needs to be up constantly. The time in an RTO is the maximum amount of time the service can afford to be offline before it recovers and comes back online. The SLA states the RTO as the maximum time that a system will be down before it is available again. To comply with the SLA, the DevOps team must recover the system within that time frame.
Now, you may think this is easy: just turn the thing off and on again, right? Well, in many cases that’ll do the job, but remember that this is not just about doing the job; it’s about doing the job within a set amount of time.
In most cases, when a server goes down, restarting the server will do the trick. But how long does that trick take? If your RTO is five minutes and you take six minutes to restart your server, you have violated your RTO (and in a lot of critical enterprise systems, the RTO is lower than that). This is why, whenever you first define RTOs, you should do two things: ask for more time than you think you need, and think in terms of automation.
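The two pieces of advice above can be sketched in a few lines of Python. This is only an illustration: the function names, the safety factor of 1.5, and the sample timings are assumptions made for the example, not values taken from any real SLA.

```python
from datetime import timedelta


def propose_rto(measured_recoveries: list[timedelta],
                safety_factor: float = 1.5) -> timedelta:
    """Suggest an RTO with headroom over the slowest measured recovery.

    safety_factor is a hypothetical margin; choose one that suits your system.
    """
    if not measured_recoveries:
        raise ValueError("need at least one measured recovery time")
    if safety_factor < 1:
        raise ValueError("safety_factor should be at least 1")
    return max(measured_recoveries) * safety_factor


def meets_rto(recovery_time: timedelta, rto: timedelta) -> bool:
    """A recovery is compliant only if it finishes within the RTO."""
    return recovery_time <= rto


# The example from the text: a five-minute RTO and a six-minute restart.
rto = timedelta(minutes=5)
print(meets_rto(timedelta(minutes=6), rto))  # False: the RTO is violated
print(meets_rto(timedelta(minutes=4), rto))  # True: recovered in time
```

Feeding `propose_rto` with timings from automated recovery drills, rather than from manual restarts, is one way to keep the proposed figure realistic.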
Modern SLAs of 99% (about seven hours of downtime a month) or even 99.9% (about 44 minutes a month) are achieved by removing human interaction (specifically, hesitation) from the recovery process. Services recover automatically because their health is constantly monitored, so when an instance shows signs of unhealthiness, it can be either corrected or replaced. This concept is what gave rise to the popularity of Kubernetes, which in its production form has some of the strongest recovery and health check concepts on the market.
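The downtime figures quoted for each availability level come from simple arithmetic, sketched below. The function name and the default 30-day month are assumptions for illustration; the exact budget shifts slightly with the length of the period, which is why published figures are usually rounded.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float,
                    period: timedelta = timedelta(days=30)) -> timedelta:
    """Return the maximum downtime allowed in `period` for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


# Over a 30-day month, 99% leaves roughly 7.2 hours of downtime
# and 99.9% leaves roughly 43 minutes.
for target in (99.0, 99.9):
    budget = downtime_budget(target)
    print(f"{target}% -> {budget.total_seconds() / 60:.1f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why manual recovery stops being practical so quickly.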
RPOs are different in that they relate largely to data: an RPO defines a specific date or time (the point) from which the data in a database or instance can be restored. The RPO is the maximum tolerable gap between the present and the time of the latest backup or recovery point. For example, a user database for a small internal application can have an RPO of one day, but a business-critical application may have an RPO of only a few minutes (if that).
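An RPO check boils down to comparing the age of the latest recovery point with the tolerated gap. The sketch below assumes you already know when the last backup or recovery point was taken; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_recovery_point: datetime,
                     failure_time: datetime) -> timedelta:
    """Time span of data that would be lost if we restored right now."""
    if last_recovery_point > failure_time:
        raise ValueError("recovery point cannot be later than the failure")
    return failure_time - last_recovery_point


def meets_rpo(last_recovery_point: datetime,
              failure_time: datetime,
              rpo: timedelta) -> bool:
    """The RPO holds if the lost window is no larger than the objective."""
    return data_loss_window(last_recovery_point, failure_time) <= rpo


# Hypothetical timestamps: nightly backup at 02:00, failure at 14:30 the same day.
backup = datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc)
failure = datetime(2024, 1, 10, 14, 30, tzinfo=timezone.utc)

print(meets_rpo(backup, failure, timedelta(days=1)))      # True: within a one-day RPO
print(meets_rpo(backup, failure, timedelta(minutes=5)))   # False: far outside a five-minute RPO
```

The same nightly backup satisfies the internal application's one-day RPO but falls far short of a business-critical one, which is what pushes such systems toward continuous replication.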
RPOs are maintained through constant backups and replicas of databases. In most applications that you use, the database you read from isn’t the primary database but a read replica, often placed in a different geographical region. This takes load off the primary database, leaving it free to handle write operations. If the primary does go down, it can usually be recovered very quickly by promoting one of the read replicas to be the new primary. The read replica will have all of the necessary data, so consistency is usually not a problem. In the event of a disaster in a data center, such backup and recovery options become very important for restoring system functions.
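Choosing which replica to promote can be tied back to the RPO. The sketch below is a purely in-memory model, not the API of any real database: the `Replica` class, its fields, and the sample lag values are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Replica:
    name: str
    region: str
    replication_lag: timedelta  # how far behind the primary this replica is


def choose_replica_to_promote(replicas: list[Replica],
                              rpo: timedelta) -> Optional[Replica]:
    """Pick the most up-to-date replica, or None if even it would exceed the RPO."""
    if not replicas:
        return None
    best = min(replicas, key=lambda r: r.replication_lag)
    return best if best.replication_lag <= rpo else None


# Hypothetical replicas in two regions.
replicas = [
    Replica("replica-a", "eu-west", timedelta(seconds=40)),
    Replica("replica-b", "us-east", timedelta(seconds=5)),
]

chosen = choose_replica_to_promote(replicas, rpo=timedelta(minutes=1))
print(chosen.name if chosen else "no replica satisfies the RPO")  # replica-b
```

When the function returns `None`, promotion alone cannot meet the objective, and the recovery would have to fall back on backups or accept a larger loss of data.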
Based on these objectives and agreements, we can come up with metrics that can affect team behavior, like our next topic.