Monitoring

When you look at the title of this section, Understanding logging and monitoring, some of you might wonder, what’s the difference? That is a valid question, and it took me a while to figure out as well. I believe it comes down to two things:

  1. Monitoring looks at a specific metric (usually derived from logs) and whether or not that metric has passed a certain threshold. Logging, by contrast, simply collects the data without generating any insight or information from it.
  2. Monitoring is active and focuses on the current state of an instance or object that is being monitored, whereas logging is passive and focuses more on the collection of largely historical data.

In many ways, it is like the difference between a transactional database and a data warehouse. One operates on current data, while the other stores historical data to find trends. The two are almost inextricably intertwined and thus are usually spoken of together. Now that you have logged and monitored all the data, you might ask yourself, what is it for? The next section will help with that.

Alerts

You cannot have a conversation about logging and monitoring without bringing up the concept of alerts. A logged metric is monitored by a monitoring service. This service looks at the data produced from the logs and measures it against a threshold that is set for that metric. If the threshold is crossed for a sustained, defined period of time, an alert or alarm is raised.
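To make the "sustained, defined period of time" idea concrete, here is a minimal sketch of such a check in Python. It is not the implementation of any particular monitoring service; the names `ThresholdRule` and `evaluate` are hypothetical, and the sketch assumes the metric arrives as time-ordered (timestamp in seconds, value) pairs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: fire once the metric has stayed above
    `threshold` for at least `sustain_seconds`."""
    threshold: float
    sustain_seconds: float


def evaluate(rule: ThresholdRule,
             samples: Iterable[Tuple[float, float]]) -> List[float]:
    """Return the timestamps at which an alert would be raised.

    `samples` is assumed to be (timestamp_seconds, value) pairs in time order.
    """
    alerts: List[float] = []
    breach_start = None  # when the current run of breaching samples began
    firing = False       # True once this breach has already raised an alert

    for timestamp, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = timestamp
            # Raise only once per breach, and only after it has been sustained.
            if not firing and timestamp - breach_start >= rule.sustain_seconds:
                alerts.append(timestamp)
                firing = True
        else:
            # The metric recovered, so the sustained-period clock resets.
            breach_start = None
            firing = False
    return alerts


# Example: CPU utilization sampled once a minute (illustrative values).
cpu = [(0, 50), (60, 95), (120, 96), (180, 97), (240, 40), (300, 99), (360, 30)]
rule = ThresholdRule(threshold=90, sustain_seconds=120)
print(evaluate(rule, cpu))  # [180]
```

Note that the single spike at 300 seconds does not raise an alert, because it is not sustained. Requiring a sustained breach is what keeps short, harmless spikes from paging someone.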

Most of the time, these alerts or alarms are either connected to a notification system that can inform the necessary personnel regarding the heightened alarm state, or a response system that can automatically trigger a response to the event.

Now that you have learned about the powers of observation and insight that you gain from logging and monitoring, it is time to learn how to wield that power. Let’s find out the actions we should take when we find significant and concerning insights through logging and monitoring.

Incident and event response

I’m going to put Murphy’s Law here again because I cannot state this enough:

Anything that can go wrong will go wrong at the worst possible time.

Dealing with incident and event response involves either a lot of work or zero work. It depends on how prepared you are and how unique the incident or event is. Incident and event response covers a lot of ground, from automation and cost control to cybersecurity.

How a DevOps engineer responds to an event depends on a great number of things. When clients and customers are involved, a Service Level Objective (SLO) is used to decide when a response is necessary. This applies largely to production environments and requires the definition of a Service Level Indicator (SLI), the measurement against which the objective is judged. It also involves the creation of an error budget, which helps determine when it is the right time to add new features and when it is the right time to work on the maintenance of a system. Lower-priority development environments are used to stress test potential production cases and the effectiveness of incident response strategies. These objectives will be further explored in the Understanding high availability section.
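The relationship between an SLI, an SLO, and an error budget can be sketched with a little arithmetic. The example below assumes a request-based SLI (good requests divided by total requests) and treats the error budget as the share of requests the SLO allows to fail. The function names and the simple "features versus maintenance" rule are illustrative assumptions, not a prescribed policy.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Service Level Indicator: the fraction of requests that were good."""
    return good_requests / total_requests


def error_budget_remaining(slo_target: float,
                           good_requests: int,
                           total_requests: int) -> float:
    """Fraction of the error budget still unspent in this window.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


def next_focus(budget_remaining: float) -> str:
    """Hypothetical policy: ship features while budget remains,
    otherwise prioritize maintenance and reliability work."""
    return "features" if budget_remaining > 0 else "maintenance"


# Example: a 99.9% SLO over one million requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, good_requests=999_400,
                                   total_requests=1_000_000)
print(f"SLI: {sli(999_400, 1_000_000):.4f}")
print(f"Budget remaining: {remaining:.0%}, focus on {next_focus(remaining)}")
```

With 600 failures against roughly 1,000 allowed, about 40% of the budget is left, so under this assumed policy the team can keep shipping features. Had 1,500 requests failed, the remaining budget would be negative and the focus would shift to maintenance.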

If you work on the Site Reliability Engineering (SRE) side of DevOps, then incidents are going to be your bread and butter. A large part of the job description for that role involves having the correct metrics set up so that you can respond to a situation. Many SRE teams these days have active personnel around the globe who can monitor sites during their own working hours. The response to the incident itself is handled by an incident response team, which I will cover in detail in the next section.

Another part of incident response is understanding what caused the incident, how long it took to recover, and what could be done better in the future. This is covered by post-mortems, which usually result in a clear, unbiased report that can help with future incidents. The incident response team is responsible for the creation of this document.