
How to not build resilient systems?

"Anything that can go wrong will go wrong." - Murphy's law

Charu Upadhyay / June 19, 2023

Most systems have recovery plans, backups, and resiliency measures in place. But when things start going sideways, we realize something is missing. In this article, we discuss what prevents us from building a truly resilient system and how we can break out of this cycle.

Survivorship Bias

During World War II, researchers at the Center for Naval Analyses conducted a study to assess the damage done to aircraft that had returned from missions. The initial inclination was to add armor to the areas that showed the most damage, as those were perceived as the most vulnerable parts. However, Abraham Wald, a mathematician, observed that the study covered only the aircraft that had withstood the attacks and still returned safely (their damage is represented by the red dots in the figure). The areas that remained unscathed, such as the cockpit and engine, were overlooked, and they told a different story: these are the areas that, if hit, would cause the plane to crash and be lost.

[Figure: damage pattern on returned aircraft, marked with red dots. Image: Wikipedia]

 

This brings us to a critical point: what assumptions do we make regarding the resilience and recovery capabilities of our systems, and are they well-justified?

False assumptions can undermine our recovery capabilities. Had the aircraft been reinforced in the most-hit areas, that decision would have been a product of survivorship bias, because crucial data from the fatally damaged planes was missing from the assessment. In the same way, we run the risk of overlooking significant factors and failing to cover the ‘unknowns’ that may impact our operations.
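The effect can be reproduced with a small simulation. The following is a minimal sketch, not a model of the historical data: the section names and the loss probabilities are invented for illustration. Every section is hit equally often, but only planes that return are counted by the analysts.

```python
import random
from collections import Counter

SECTIONS = ["wings", "fuselage", "tail", "engine", "cockpit"]

# Hypothetical chance that a hit in a section brings the plane down.
LOSS_PROBABILITY = {
    "wings": 0.05, "fuselage": 0.05, "tail": 0.05,
    "engine": 0.80, "cockpit": 0.80,
}

def simulate(missions=10_000, seed=42):
    """Return (hits on all planes, hits visible on returned planes)."""
    rng = random.Random(seed)
    all_hits = Counter()
    survivor_hits = Counter()
    for _ in range(missions):
        section = rng.choice(SECTIONS)  # each section is equally likely to be hit
        all_hits[section] += 1
        if rng.random() >= LOSS_PROBABILITY[section]:
            # The plane returned, so this hit is the only kind analysts ever see.
            survivor_hits[section] += 1
    return all_hits, survivor_hits

all_hits, survivor_hits = simulate()
for section in SECTIONS:
    print(f"{section:9s} actual hits: {all_hits[section]:5d}   "
          f"seen on returned planes: {survivor_hits[section]:5d}")
```

Under these assumptions, the engine and cockpit look almost untouched in the returned sample even though they were hit as often as the wings, which is exactly the gap Wald pointed out.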

When was the last time your organization experienced an outage, and did it have a substantial impact?

The answer to this question cannot be reduced to a simple yes or no. In today's dynamic business environment, where downtime is barely tolerated, even a brief interruption can result in heavy financial losses. The repercussions of such downtime extend beyond the financial and include reduced productivity, damage to reputation, loss of customer confidence, mental stress, and more.

Regardless of whether your operations are on-premises or in the cloud, system outages are an inevitable reality. They can arise from many factors, including hardware and software faults, human error, system malfunctions, and natural disasters. This issue is not new; we have encountered it repeatedly in the past and will encounter it again in the future. Failures occur all the time, and while eliminating them completely may be unattainable, we must acknowledge and embrace them to improve our preparedness.

 


According to Uptime Institute’s 2022 Data Center Resiliency Survey:

 

  • Networking issues are causing a large portion of IT service downtime.

Networking-related problems have been the single biggest cause of all IT service downtime incidents – regardless of severity – over the past three years. Outages attributed to software, network and systems issues are on the rise due to complexities from the increasing use of cloud technologies, software-defined architectures and hybrid, distributed architectures.

  • The overwhelming majority of human error-related outages involve ignored or inadequate procedures.

Nearly 40% of organizations have suffered a major outage caused by human error over the past three years. Of these incidents, 85% stem from staff failing to follow procedures or from flaws in the processes and procedures themselves.

Resilience is often underestimated, but its significance cannot be emphasized enough. We all have experts who have helped us build a stable, functioning system that follows best practices, but no architecture is fail-safe.

We may have ensured high availability, replication, geographic redundancy, backup strategies, a BCDR strategy, and so on, but are we truly confident in their resilience? Merely assuming everything is taken care of is a risky mindset.

Before we dive in, let us make sure we understand some terms very clearly:

Reliability and Resiliency

The Oxford Dictionary definition of reliability is "the quality of being trustworthy or of performing consistently well," whereas resilience is "the capacity to recover quickly from difficulties."

In the world of cloud computing, reliability means that services run as intended at any given point in time, whereas resilience means that a service can withstand certain types of failure and still remain functional from the customer's perspective. In other words, reliability is the result we strive for, and resiliency is the way to achieve it.

Business continuity and disaster recovery plan (BCDR)

A BCDR plan is a set of strategies, policies, and procedures that help an organization respond, adapt, continue its essential operations, and recover when a disruptive event occurs. While BC (Business Continuity) deals with business processes and functions, DR (Disaster Recovery) focuses primarily on recovery on the IT side.

Recovery Point Objective (RPO)

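RPO refers to the point in time in the past to which you will recover. In other words, it is the maximum amount of data loss, measured as time, that the business can accept.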

Recovery Time Objective (RTO)

RTO refers to the point in time in the future at which you will be up and running again. In other words, it is the maximum acceptable time to restore service after a disruption.

Let us understand this with an analogy:

A baker bakes pies in batches that take 2 hours each, using an oven that runs continuously. One day, the oven breaks mid-batch and the pies are ruined. To get back to baking, the baker has a second oven available, but it needs 1 hour to preheat.

RPO is 2 hours because the ruined batch, at most 2 hours of work, is the loss the business must accept.
RTO is 1 hour, which is the time it takes to resume baking once the second oven has preheated.
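The same arithmetic can be written down as a small check. This is a minimal sketch under stated assumptions: the function name `recovery_report` and the timestamps are hypothetical, chosen only to mirror the baker's story (a batch started at 08:00, an oven failure at 09:30, and a one-hour preheat).

```python
from datetime import datetime, timedelta

def recovery_report(last_recovery_point, failure_time, service_restored, rpo, rto):
    """Compare the actual loss and downtime of an incident with the objectives."""
    data_loss = failure_time - last_recovery_point  # work lost since the last good state
    downtime = service_restored - failure_time      # time until service resumed
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }

# Hypothetical timeline for the baker.
batch_started = datetime(2023, 6, 19, 8, 0)
oven_broke = datetime(2023, 6, 19, 9, 30)
baking_resumed = oven_broke + timedelta(hours=1)  # second oven needs 1 hour to preheat

report = recovery_report(
    last_recovery_point=batch_started,
    failure_time=oven_broke,
    service_restored=baking_resumed,
    rpo=timedelta(hours=2),
    rto=timedelta(hours=1),
)
print(report)
```

Here the baker loses 1.5 hours of work and is down for 1 hour, so both objectives are met. If the second oven took longer than an hour to preheat, `rto_met` would turn false.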

Backup strategies

A backup strategy is essential for data protection, business continuity, compliance with legal requirements, disaster recovery, and overall peace of mind. Many solutions exist for backing up data, including hardware-based, software-based, and cloud-based methods. Cloud providers offer different backup plans with strategies such as incremental, full, automated, hybrid, and multi-cloud backups. Assessing your needs helps you choose the backup strategy that is right for you.

Chaos engineering

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. A harsh way to ensure that your failure recovery is working correctly is to intentionally crash your production servers.

There are many tools available on the market to assist with resilience testing, such as Netflix's well-known Chaos Monkey, AWS FIS (Fault Injection Simulator), Azure Chaos Studio, and many others.
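The idea behind such an experiment can be illustrated in a few lines. The following is a toy, in-process sketch and does not reflect how Chaos Monkey, AWS FIS, or Azure Chaos Studio work: the class `FlakyDependency`, the 30% fault rate, and the 95% target are all assumptions made for illustration. It injects faults into a dependency and checks whether a retry policy keeps the success rate above the target.

```python
import random

class FlakyDependency:
    """A stand-in dependency that fails with a configurable probability."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"

def call_with_retries(dependency, attempts):
    """Return True if any of the attempts succeeds."""
    for _ in range(attempts):
        try:
            dependency.call()
            return True
        except ConnectionError:
            continue
    return False

def run_experiment(failure_rate, attempts, requests=1_000, seed=7):
    """Send requests through the faulty dependency and return the success rate."""
    rng = random.Random(seed)
    dependency = FlakyDependency(failure_rate, rng)
    successes = sum(call_with_retries(dependency, attempts) for _ in range(requests))
    return successes / requests

TARGET_SUCCESS_RATE = 0.95  # hypothetical steady-state hypothesis

baseline = run_experiment(failure_rate=0.3, attempts=1)
with_retries = run_experiment(failure_rate=0.3, attempts=3)
print(f"success rate without retries: {baseline:.1%}")
print(f"success rate with 3 attempts: {with_retries:.1%}")
print("hypothesis holds:", with_retries >= TARGET_SUCCESS_RATE)
```

The point of the sketch is the shape of the experiment: state a hypothesis about normal behavior, inject a fault, and measure whether the hypothesis still holds. Real chaos tooling applies the same loop to actual infrastructure rather than to a simulated call.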

Ask yourself the following questions:

  • Do we have a well-defined business continuity and disaster recovery (BCDR) plan in place? If so, are we implementing it effectively?
  • How frequently do we test the resiliency of our application?
  • Do we share the results and learnings from testing with different teams?
  • Are our recovery time objectives (RTO) and recovery point objectives (RPO) realistic enough and aligned with the business needs?
  • Do we have reliable runbooks that are regularly updated? Can we trust these runbooks during critical failures?
  • Have we ever rehearsed these runbooks and playbooks?
  • Are our monitoring alarms properly configured to detect failures in a timely manner? Even the best backup strategy can fail if our alarms are not set up correctly.
  • Are we collecting the right metrics (technical metrics and business metrics)?
  • Do we conduct thorough post-mortem analyses to learn from failures and prevent their recurrence?
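The alarm question in the list above can be made concrete with a rough calculation. This is a minimal sketch under simplifying assumptions: the alarm fires after a fixed number of consecutive failed checks, checks run at a fixed interval, and the time a check itself takes is ignored. The function names and numbers are hypothetical.

```python
from datetime import timedelta

def worst_case_detection_delay(check_interval, failures_to_alarm):
    """Longest time between a failure starting and the alarm firing.

    In the worst case the failure begins right after a passing check,
    so the alarm needs `failures_to_alarm` full intervals to trigger.
    """
    return check_interval * failures_to_alarm

def fits_rto(check_interval, failures_to_alarm, expected_recovery, rto):
    """True if detection plus recovery still fits within the RTO."""
    delay = worst_case_detection_delay(check_interval, failures_to_alarm)
    return delay + expected_recovery <= rto

rto = timedelta(hours=1)
recovery = timedelta(minutes=50)  # hypothetical time to restore once alerted

# Checks every 5 minutes, alarm after 3 consecutive failures: 15 + 50 > 60 minutes.
print(fits_rto(timedelta(minutes=5), 3, recovery, rto))
# Checks every minute, alarm after 3 consecutive failures: 3 + 50 <= 60 minutes.
print(fits_rto(timedelta(minutes=1), 3, recovery, rto))
```

Even with a working backup and a rehearsed runbook, slow detection alone can consume the RTO budget, which is why alarm settings belong in the same review as the recovery plan.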

Final Reflections

So, to reiterate Murphy’s law: "Anything that can go wrong will go wrong." Are you prepared for your next outage?

While no one desires an outage, it is prudent to be prepared for the unexpected in our ever-changing world. It is wise to be proactive before such events take us by surprise. These strategies are designed to build trust in our applications, allowing us to detect any unforeseen problems at the earliest stages. Prioritizing resilience is vital for sustaining seamless operations, safeguarding customer trust, and securing long-term business prosperity.

We at Tietoevry have the expertise to ensure your business is safe and secure with us. Whether you are a small business or an organization on a transformation journey, our utmost responsibility is to safeguard the trust of your customers while you place your trust in us.

 

I hope you enjoyed reading this!

Charu Upadhyay

 

 

 

Charu Upadhyay
Cloud Consultant
