Distributed software systems and cloud applications are especially prone to failures because of their complexity and the interdependence of their components. Companies do their best to make their systems resilient, but some failures are unexpected and outside your control, so you can’t really predict them. What do you do in this case?
Chaos engineering strives to resolve this issue and helps companies prepare for the unexpected. In this article, we answer the “what is chaos engineering” question and explain how controlled chaos can benefit your company.
Chaos engineering: definition
Wikipedia describes chaos engineering as a discipline that tests a system’s resilience by deliberately injecting failures into it. In other words, you intentionally break the system, observe how it behaves, and analyze what exactly causes the outage and why.
Chaos engineering can be compared to a flu shot: you introduce a small, controlled dose of something harmful into your body so that it builds immunity. In the same way, you introduce a controlled failure to make the system more resilient to similar events in the future.
In addition to making the system more stable, chaos engineering trains your team for potential emergencies and builds its muscle memory for responding to them. A fast reaction to a threat and its prompt resolution can save you from the massive financial losses that a system outage causes.
How does chaos engineering differ from testing?
Some may confuse chaos engineering with stress testing or fault injection, since all of these approaches probe the system’s limits and show how it behaves under stress. However, they are not the same.
Testing, in general, aims to:
- Verify if the system works as expected;
- Test one condition at a time.
On the other hand, chaos engineering studies issues that have a near-infinite number of possible causes and does not aim to test a specific condition. Instead, chaos engineering is more about letting things loose and then examining what exactly happened and why. It therefore covers a much broader area and generates new knowledge, since chaos engineering principles are based on experimenting.
A brief history of chaos engineering
Chaos engineering dates back to 2010, when businesses started widely adopting distributed systems and moving their operations to the cloud. Its pioneer is Netflix, which once suffered a three-day outage caused by database corruption and then decided to move to a distributed cloud architecture. While this decision benefited the overall business, it brought Netflix new challenges related to the complexity and interdependence of its systems.
Netflix knew that its systems had to be more reliable and resilient, and for that the company had to test how they would behave in abnormal conditions. This is how Chaos Monkey was designed. It is a tool that intentionally and randomly terminates VM instances or containers in a production environment. Put simply, it simulates the behavior of a monkey let loose in a server room. You never know which instance will be terminated, and that is exactly what Netflix engineers wanted. The main purpose of Chaos Monkey is to emulate real-world emergencies and to show how you can make your system more resilient.
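The core idea can be illustrated with a minimal Python sketch. This is not Netflix’s implementation: the fleet is a hypothetical in-memory dictionary, and the names `pick_victim` and `unleash_monkey` are made up for this example. A real tool would call the cloud provider’s API where the comment indicates.

```python
import random


def pick_victim(instances, rng):
    """Return one running instance chosen uniformly at random, or None."""
    running = [name for name, state in instances.items() if state == "running"]
    return rng.choice(running) if running else None


def unleash_monkey(instances, rng=None):
    """Terminate one random running instance of the in-memory fleet."""
    rng = rng if rng is not None else random.Random()
    victim = pick_victim(instances, rng)
    if victim is not None:
        # Assumption: a real tool would ask the cloud API to terminate here.
        instances[victim] = "terminated"
    return victim


# Hypothetical fleet: instance name -> state.
fleet = {"web-1": "running", "web-2": "running", "db-1": "running"}
victim = unleash_monkey(fleet)
print(f"terminated: {victim}")
```

Each call takes down at most one instance, and nobody knows in advance which one, so every service in the fleet has to tolerate losing an instance.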
For Netflix, the ultimate goal was to ensure that the termination of an Amazon Elastic Compute Cloud (EC2) instance would not have a negative impact on the overall service experience. Chaos Monkey proved highly successful, and soon many similar tools (such as Gremlin) appeared. Meanwhile, Netflix developed a set of additional tools called the Simian Army to inject more complex failures into the system, beyond the loss of a VM instance or a container. This tool set helped Netflix implement further improvements and significantly reduce the number of outages.
Which companies need chaos engineering?
Chaos engineering is highly beneficial, but it is not suitable for all organizations. Hence, before discussing the main benefits, let’s first look at the companies that might want to consider chaos engineering adoption:
- Organizations that have high observability, digital maturity, and high resilience: such organizations have enough skills and resources to promptly and effectively perform chaos engineering experiments.
- Organizations that operate in the cloud: the cloud brings an additional layer of complexity, such as the need to coordinate with the cloud provider, and chaos engineering can help uncover the risks this creates.
- Organizations that use microservices and distributed systems: the complexity and interdependence of such systems cause additional risks in case of an emergency, since the failure of one instance can bring down the whole system.
Chaos engineering is most commonly used in fast-paced and large organizations with complex distributed systems. For them, chaos engineering serves as an effective method of increasing the resilience and stability of the system while continuously delivering stellar customer service.
Biggest benefits of chaos engineering
In general, chaos engineering helps improve reliability and resilience of a system while helping your team prepare for emergencies in advance. But if we drill down to other benefits, we can define the following:
- Better availability and durability of services for customers;
- Reduced financial losses thanks to preventive maintenance and downtime budget planning;
- Improved on-call training for the teams;
- Fewer incidents and failures thanks to thorough inspection of the system during testing.
As you can see, chaos engineering allows companies to become proactive rather than reactive about emergencies and failures. Such an approach leads to a significant reduction in financial losses as well as in the number of problems during an unexpected event.
Chaos engineering principles: how does it work?
Despite its name, chaos engineering follows a structured, step-by-step approach:
- Create a hypothesis. The first step is setting the baseline and defining how a system should behave (in your opinion) in case of an emergency. In other words, you consider a potential failure and theorize about its effects on the system. An example would be something like, “if A occurs, B will happen”.
- Define the blast radius and conduct a small test. In chaos engineering, the blast radius is the number of resources targeted in your experiment. It is common practice to start with the smallest blast radius and expand it if no issue is found.
- Measure the impact of the failure. After conducting the experiment from the previous step, measure the impact of the injected failure and continue if needed. Compare the results with the hypothesis and, if issues appeared, determine how to fix them.
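The three steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: the `Experiment` fields, the five-replica service, the retry policy, and the 95% threshold are hypothetical and not taken from any particular chaos engineering tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    hypothesis: str          # "if A occurs, B will happen"
    blast_radius: int        # number of replicas to take down
    min_success_rate: float  # the outcome the hypothesis predicts


def success_rate(replicas, failed, rng, requests=1000, retries=2):
    """Simulate requests sent to random replicas, retrying when one is down."""
    succeeded = 0
    for _ in range(requests):
        for _ in range(retries + 1):
            if rng.choice(replicas) not in failed:
                succeeded += 1
                break
    return succeeded / requests


def run(experiment, replicas, seed=0):
    """Inject the failure, measure the impact, and compare with the hypothesis."""
    rng = random.Random(seed)
    failed = set(rng.sample(replicas, experiment.blast_radius))
    observed = success_rate(replicas, failed, rng)
    return {
        "failed": failed,
        "observed": observed,
        "hypothesis_holds": observed >= experiment.min_success_rate,
    }


replicas = ["r1", "r2", "r3", "r4", "r5"]
small = Experiment(
    hypothesis="if one replica fails, at least 95% of requests still succeed",
    blast_radius=1,
    min_success_rate=0.95,
)
result = run(small, replicas)
print(result["observed"], result["hypothesis_holds"])
```

If the hypothesis holds at the smallest blast radius, the same experiment can be repeated with a larger `blast_radius`. If it does not, the observed value shows how far the system is from the expected behavior.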
The eight fallacies of distributed computing
If you don’t know where to start with chaos engineering, you can always use the eight fallacies of distributed computing, developed by L. Peter Deutsch and colleagues at Sun Microsystems:
- The network is reliable;
- Latency is zero;
- Bandwidth is infinite;
- The network is secure;
- There is one administrator;
- Topology doesn’t change;
- Transport cost is zero;
- The network is homogeneous.
These fallacies are the wrong assumptions that developers tend to make about their systems, and they point out the main areas to be tested.
Chaos engineering examples
To better understand the process, let’s look at some chaos engineering examples that a chaos engineer may perform:
- A failure of a micro component;
- A sudden increase in traffic and a high CPU load;
- Injection of latency failures;
- Failure of the entire Availability Zone;
- Host failure.
As you can see, all these examples mirror emergencies that companies face on a regular basis and should prepare for.
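Latency injection, for instance, can be approximated in application code with a decorator. The sketch below is a hypothetical helper written for this article and is not part of any chaos engineering library. `inject_latency`, its parameters, and `fetch_profile` are assumed names, and a production setup would more likely inject latency at the network or proxy layer.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, rng=None, sleep=time.sleep):
    """Delay a call by delay_s seconds with the given probability."""
    rng = rng if rng is not None else random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_s)  # the injected failure: extra latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call that always gets 50 ms of extra latency.
@inject_latency(delay_s=0.05, probability=1.0)
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}


start = time.perf_counter()
profile = fetch_profile(7)
elapsed = time.perf_counter() - start
print(f"call took {elapsed:.3f}s")
```

Setting `probability` below 1.0 delays only a fraction of the calls, which is one way of keeping the experiment small.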
Chaos engineering best practices
The following chaos engineering best practices apply to most companies, whatever their experimentation strategy. You can use them as a baseline and as a checklist to make sure that all important aspects of the process are considered.
Know your system
For chaos engineering to be successful, you first need to understand your whole system, including its architecture, topology, steady-state behavior, and characteristics like latency or availability. When planning the experiments, you will base the hypothesis on this knowledge, and an incorrect assumption about the system can lead to wrong results.
Define steady-state behavior
You need to know how your system behaves during normal operation and what is expected from it in uninterrupted conditions. For that, you can use monitoring and tracking tools to collect the data. These findings serve as a benchmark: you will later compare the results of chaos engineering tests against them.
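As a rough sketch of such a benchmark, the Python snippet below summarizes latency samples and an error count into a baseline and checks whether a later measurement stays close to it. The metric names, the sample data, and the tolerance values are assumptions chosen for illustration. In practice, the numbers would come from your monitoring tools.

```python
import statistics


def summarize(latencies_ms, errors, total):
    """Condense raw measurements into a few steady-state metrics."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "error_rate": errors / total,
    }


def within_steady_state(baseline, observed, tolerance=0.2, max_error_increase=0.01):
    """True if p95 latency and error rate stay close to the baseline."""
    latency_ok = observed["p95_ms"] <= baseline["p95_ms"] * (1 + tolerance)
    errors_ok = observed["error_rate"] <= baseline["error_rate"] + max_error_increase
    return latency_ok and errors_ok


# Made-up samples: normal operation versus a run where latency doubled.
baseline = summarize(list(range(100, 200)), errors=1, total=1000)
degraded = summarize([2 * ms for ms in range(100, 200)], errors=1, total=1000)

print(within_steady_state(baseline, baseline))  # baseline compared with itself
print(within_steady_state(baseline, degraded))  # doubled latency
```

The first check passes and the second fails, because a doubled p95 latency exceeds the 20% tolerance assumed here.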
Define real-world failure scenarios
Chaos engineering is not about extraordinary emergencies – on the contrary, it mimics real-world failure scenarios that your system may encounter. Thus, for accurate testing, you need to list potential real-world scenarios like power issues or traffic overload to use in your experiments.
Conduct experiments in the production environment
We’ve already mentioned it, but it is worth repeating: perform your chaos experiments in the production environment for the most accurate results. Since chaos engineering strives to simulate real-world scenarios and emergencies, the production environment is the most suitable place for these tests. In addition, duplicating your production environment would be costly and cumbersome, which is another reason why chaos experiments are carried out in real-world conditions.
Restrict the blast radius
Since chaos engineering tests are run in a production environment, the results can be quite disruptive, and nobody wants the whole system to go down. Hence, you should always start with a narrow blast radius and expand it gradually if the needed results are not achieved. Remember: ideally, no system users should even be aware of chaos experiments taking place. Therefore, you need a high level of redundancy to back up the system in case the experiments cause an issue.
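One way to picture this gradual expansion is a loop that targets a growing share of the fleet and stops as soon as an error threshold is crossed. The sketch below is an assumption-laden illustration: `expand_blast_radius`, the percentage steps, the 1% error limit, and the fake experiment function are all hypothetical.

```python
def expand_blast_radius(total_hosts, run_experiment,
                        steps_percent=(1, 5, 25, 50), max_error_rate=0.01):
    """Run the experiment on growing shares of the fleet; stop at the first breach."""
    history = []
    for percent in steps_percent:
        targets = max(1, total_hosts * percent // 100)
        error_rate = run_experiment(targets)
        history.append((targets, error_rate))
        if error_rate > max_error_rate:
            break  # abort before more users are affected
    return history


def fake_experiment(targets):
    """Stand-in for a real experiment: errors appear above 5 failed hosts."""
    return 0.0 if targets <= 5 else 0.03


history = expand_blast_radius(100, fake_experiment)
print(history)  # [(1, 0.0), (5, 0.0), (25, 0.03)]
```

With the fake experiment above, the loop stops at 25 hosts and never reaches the 50% step, which is the behavior you want from an abort condition.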
Summing up
After answering the “what is chaos engineering?” question, we now understand that it is a highly effective method to identify vulnerabilities in a complex system and understand what causes them and how they can be avoided or eliminated. However, chaos engineering is not suitable for all organizations, so before considering it, first assess your current system and compare the pros and cons of introducing chaos experiments to it.