What is a Resilience Matrix and how can it boost your Digital Product?

One of the key practices for Digital Teams that aim for 99.999% availability on their systems in the cloud is tracking the versions of Applications and Services and getting to know all the dependencies between them. This is part of the old (but still gold) discipline of Configuration Management, which is very often ignored.

A Resilience Matrix can be considered a tool that displays Applications and Services of a given Product and also connects every Application to its dependencies. Additionally, it’s possible to track the level of impact caused by failures in each dependency of a given Application.

By doing this, a digital team can effectively identify what’s causing a problem for clients and fix all the issues much faster than might be expected.

In the Clouds: The Tracing x Monitoring Challenge

Not so long ago, companies had a well-known number of infrastructure resources in their on-premises environments, and architectures were all static. After setting up the monitoring configuration for those applications, frequent updates were simply not necessary.

Fast forward to 2021, and what we see is that many cloud-native applications have been built while many traditional applications have been remodeled to benefit from the cloud. Not a surprise at all. But the big thing is that architectures are now elastic and distributed. In short, monitoring became a challenge of tracking moving targets. A few applications run in Infrastructure-as-a-Service (IaaS) deployments, others are already running as Platform-as-a-Service (PaaS) components, and a growing number of applications are purchased as Software-as-a-Service (SaaS) offerings.

How to deal with that?

There is a famous quote in the Cloud/DevOps world: “Treat servers like cattle instead of pets”.

The image behind it is simple: no one can guide cattle by calling each cow by its name.

Indeed, if you need to take care of a few pets alone, you’ll be able to get along and interact with each one (possibly) with minor issues. That’s what system administrators were able to do with their servers during the pre-cloud age.

However, if you need to manage hundreds of components serving your Product, much like someone guiding cattle, chances are you will want to learn new DevOps techniques, such as Distributed Monitoring and Tracing. That is because just a few configurations and conventions will enable you to keep up with hundreds or even thousands of smaller components.

Since transactions now happen in several locations, our Monitoring routine must be able to trace and aggregate every single step in the process. Log Management has also changed a lot, since many machines and containers are kept alive only for short periods of time. The standard logging process now involves centralized logging with stacks like ELK or Graylog, among others.

The combination of this new tooling and process is often called Observability, since it’s more than pure monitoring. We’ll talk a lot about the subject in this blog (so please sign up for our newsletter to stay tuned!).

Products, Applications and Services

Before we go further in discussing monitoring, it’s worth agreeing on semantics and explaining what we call each of the elements involved.

Product is something a company offers to its clients or maybe to an internal group of employees. It has business meaning and is usually composed of several Applications and Services.

Application is something the organization develops internally (maybe with the help of partners or contractors) to create value within a Product. It typically improves over time and has frequent updates through new features and bugfixes. Applications usually depend upon Services.

Service is something the company uses to build Applications. It could be an infrastructure service like a database or something more sophisticated, like a Pricing API.

Although servers, containers and other technical components change very often in cloud computing environments, Products, Applications and Services change much less. They’re also much more relevant for clients, so it makes sense to monitor them instead of the underlying moving pieces.

ITIL best practices recommended the use of a Configuration Management Database (CMDB) to manage servers and other kinds of devices, which are often tied to legacy on-premises environments. However, most of the ideas behind it are still relevant in cloud computing environments if we shift the mindset.

We like to manage the Applications, Services and other relevant entities like environments in an Application and Service Catalog. Managing Application lifecycle becomes a lot easier with quality information presented in the Catalog.

Application Health-Checks

Focusing on high-level components like Applications, we must have effective ways to check whether everything is running as expected. Much like with an airplane, we want to know whether it’s working OK or not, filtering out all the noise that would distract us from the task.

This is done with Application Health-Checks. In a nutshell, a Health-Check is a way we can ask an Application if everything it needs is still fine. Applications should be built with Observability in mind, so that it’s easy to figure out if we need to do something or not.

When we ask for a Health-Check, an Application should talk to each of its services the same way it does all day long. If it needs data from a database, it should try to get data from that database. If it needs to send mail to clients, it should try to send a simple mail to a real address. This way, if we face any kind of issue, the Health-Check will let us know immediately and (hopefully) tell us which check failed, so that we can start troubleshooting the problem right away.
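A minimal sketch of this idea in Python, assuming each dependency is exercised by a small function that raises an exception when something goes wrong. The names `run_health_check`, `query_database` and `send_test_mail` are hypothetical, and the two check functions are stand-ins for real calls:

```python
from typing import Callable, Dict


def run_health_check(checks: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every dependency check and report "OK" or "ERROR" for each one."""
    results = {}
    for name, check in checks.items():
        try:
            check()  # exercise the dependency the way the application normally does
            results[name] = "OK"
        except Exception:
            # Any failure (network, credentials, timeout) marks the dependency as failing.
            results[name] = "ERROR"
    return results


def query_database() -> None:
    # Hypothetical stand-in for a real query against the database.
    pass


def send_test_mail() -> None:
    # Hypothetical stand-in for sending a mail; here it simulates a failure.
    raise ConnectionError("mail provider rejected the request")


report = run_health_check({"database": query_database, "mail": send_test_mail})
print(report)
```

Because each check is isolated in its own `try` block, one failing dependency does not prevent the others from being verified, and the report says which check failed.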

A Sample Resilience Matrix

Below we can see a Sample Resilience Matrix, from our Product One Platform.

In this sample, we have 4 applications (in the columns):

Admin
Provisioner
1P-Backend
Steras

In each Matrix row we can see the Application and Service dependencies. Let’s pick the 1P-Backend application to describe its dependencies:

1P-Nats-prod: infrastructure service
1P-Elasticsearch-prod: infrastructure service
Sendgrid: external service
1P-Postgres-prod: infrastructure service
1P-Redis-prod: infrastructure service

A nice thing about the Matrix is that it becomes easy to check what might be causing a given problem. When an incident occurs, we can instantly check if any dependency is also failing, speeding up our troubleshooting.
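As a rough illustration, the dependency side of the Matrix could be kept as a simple mapping. In this sketch the 1P-Backend entry follows the list above, while the Admin entry and the function names are assumptions made up for the example:

```python
from typing import Dict, List, Set

# Each application maps to the services it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "1P-Backend": [
        "1P-Nats-prod",
        "1P-Elasticsearch-prod",
        "Sendgrid",
        "1P-Postgres-prod",
        "1P-Redis-prod",
    ],
    "Admin": ["1P-Postgres-prod"],  # hypothetical entry, not taken from the real Matrix
}


def failing_dependencies(application: str, failing_services: Set[str]) -> List[str]:
    """Return the dependencies of `application` that are currently failing."""
    return [dep for dep in DEPENDENCIES[application] if dep in failing_services]


def affected_applications(service: str) -> List[str]:
    """Return every application that depends on `service`."""
    return [app for app, deps in DEPENDENCIES.items() if service in deps]


print(failing_dependencies("1P-Backend", {"Sendgrid"}))
print(affected_applications("1P-Postgres-prod"))
```

During an incident, the first function answers "which of my dependencies are failing?" and the second answers "who else is affected by this service?".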

In each Matrix cell, we can have one of three options, depending on the impact a dependency causes on the applications that depend upon it:

Available: when the application tolerates failures in the dependency without any relevant impact for its clients

Degraded: when the application can still partially operate but lacks some of its behaviour. For example, YouTube stops rendering thumbnails but keeps playing videos.

Unavailable: when the application cannot serve its purpose
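One way to reason about these three levels is to treat them as an ordered scale and take the worst impact among the failing dependencies. The sketch below does that for 1P-Backend. The impact assigned to each dependency is an assumption for illustration only, not the content of the real Matrix:

```python
from enum import IntEnum
from typing import Dict, Iterable


class Impact(IntEnum):
    """Impact levels, ordered from least to most severe."""
    AVAILABLE = 0
    DEGRADED = 1
    UNAVAILABLE = 2


# Impact on 1P-Backend when each dependency fails (assumed values).
BACKEND_IMPACT: Dict[str, Impact] = {
    "1P-Nats-prod": Impact.DEGRADED,
    "1P-Elasticsearch-prod": Impact.DEGRADED,
    "Sendgrid": Impact.DEGRADED,
    "1P-Postgres-prod": Impact.UNAVAILABLE,
    "1P-Redis-prod": Impact.AVAILABLE,
}


def application_status(impacts: Dict[str, Impact], failing: Iterable[str]) -> Impact:
    """Return the worst impact among the failing dependencies."""
    return max((impacts[dep] for dep in failing), default=Impact.AVAILABLE)


print(application_status(BACKEND_IMPACT, []).name)
print(application_status(BACKEND_IMPACT, ["Sendgrid", "1P-Redis-prod"]).name)
print(application_status(BACKEND_IMPACT, ["1P-Postgres-prod"]).name)
```

Taking the maximum is a simplification: it assumes that failures do not compound, which may not hold when several degraded dependencies fail at the same time.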

There are some techniques the applications can use during failures, like the Circuit Breaker pattern. However, designing for failures and building resilient applications will be covered in another article. 🙂

Improving our Health-Checks

Knowing our dependencies, we can improve our Health-Checks to properly validate them. We already monitor each service independently, so in our Health-Checks we can validate that the given application can successfully use each dependency.

An example would be running a query against our database, or trying to write to and read from a caching service. By doing this, we can be sure that credentials and network connectivity are also working fine.
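The write-and-read check for a caching service could look roughly like the sketch below. `InMemoryCache` is a hypothetical stand-in for a real cache client, and the `set`/`get`/`delete` method names are assumptions, so a real client may need a thin adapter:

```python
import uuid
from typing import Dict, Optional


class InMemoryCache:
    """Hypothetical stand-in for a real cache client."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def check_cache(cache) -> str:
    """Write a value under a unique key, read it back, and clean up afterwards."""
    key = f"health-check:{uuid.uuid4()}"  # unique key, so checks never collide with real data
    value = "ping"
    try:
        cache.set(key, value)
        ok = cache.get(key) == value
        cache.delete(key)
        return "OK" if ok else "ERROR"
    except Exception:
        # Connection, credential and timeout errors all end up here.
        return "ERROR"


print(check_cache(InMemoryCache()))
```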

Picking our 1P-Backend application again, we could have a health-check payload like this:

{
   "1P-Nats-prod": "OK",
   "1P-Elasticsearch-prod": "OK",
   "Sendgrid": "ERROR",
   "1P-Postgres-prod": "OK",
   "1P-Redis-prod": "OK"
}

We have multiple key-value attributes, one for each dependency, and simple values like “OK” and “ERROR” to describe if the application is able to use the dependency or not. In this example, the only failing dependency is Sendgrid.

This might not mean that Sendgrid is down for everyone. It might just be an issue with our account. What’s relevant for us is the fact that our transactional emails are likely not being sent, since the dependency that takes care of it is returning errors for our application.
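A monitoring script could consume such a payload along these lines. This is only a sketch: the `summarize` function is hypothetical, and mapping the result to HTTP status codes 200 and 503 is an assumed convention, not something the payload itself defines:

```python
import json
from typing import List, Tuple


def summarize(payload: str) -> Tuple[int, List[str]]:
    """Return an HTTP-style status code and the list of failing dependencies."""
    results = json.loads(payload)
    failing = [name for name, status in results.items() if status != "OK"]
    # Assumed convention: 200 when every dependency is usable, 503 otherwise.
    return (200 if not failing else 503), failing


payload = """
{
   "1P-Nats-prod": "OK",
   "1P-Elasticsearch-prod": "OK",
   "Sendgrid": "ERROR",
   "1P-Postgres-prod": "OK",
   "1P-Redis-prod": "OK"
}
"""

status_code, failing = summarize(payload)
print(status_code, failing)
```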

This health-check approach is recommended for all applications, due to its simplicity and the gains it brings for diagnosis and troubleshooting.

Using Resilience Matrixes in your company

Although it’s easy to understand a Matrix with several Applications and Services, it can become much harder if you have dozens of each in the same Matrix.

However, Product Teams typically don’t manage that many Applications and Services. Usually, for a given Product you’ll need one or two Matrixes to display it completely. In an Organization, you can have many Matrixes if you have many Product Teams, but each Matrix will be focused on a specific domain.

If you want to apply these techniques in your team, feel free to reach out to us! 🙂
