Chaos Engineering for Resilient Software

Designing software to be resilient and fault tolerant is not enough. Just as we have different kinds of tests to make sure our system behaves as expected, we also need to test it under exceptional conditions to verify that it really is resilient.


The concept of chaos engineering started at Netflix in 2011, when they were moving to the cloud. The main motivation was the lack of resilience testing and the widespread assumption that there would be no breakdowns. This led to the creation of Chaos Monkey, a tool that shuts down random machines.

“Imagine a monkey entering a ‘data center,’ one of these ‘farms’ of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices, and throws everything within its reach [i.e., flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, even though no one ever knows when they will arrive or what they will destroy.”

Types of Tests

Several kinds of abnormal conditions can occur in a production environment. Gremlin, a chaos engineering tool, carefully distinguishes between the different attacks it can run:


Resource: when something is consuming critical resources of your application, such as:

    • CPU: simulated with an application running a CPU-intensive task on one or more cores.
    • Memory: simulated with an application allocating a large amount of RAM.
    • IO: simulated with an application putting read/write pressure on I/O devices such as hard disks.
    • Disk: simulated by filling the disk up to a certain percentage.
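
The CPU and memory attacks above can be approximated with a few lines of Python. This is only a minimal sketch for a test environment, not how Gremlin implements its attacks; the function names burn_cpu and hold_memory are made up for illustration.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Keep one core busy for roughly `seconds`; return the iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work whose only purpose is to use CPU
    return iterations


def hold_memory(megabytes: int) -> bytearray:
    """Allocate a buffer of `megabytes` MiB.

    The memory stays allocated for as long as the caller keeps the
    returned reference, so drop the reference to end the "attack".
    """
    return bytearray(megabytes * 1024 * 1024)


if __name__ == "__main__":
    ballast = hold_memory(64)   # hold 64 MiB
    burn_cpu(2.0)               # saturate one core for about 2 seconds
    del ballast                 # release the memory again
```

To load several cores, one option is to run burn_cpu in several processes, since a single Python thread will only keep one core busy.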


State: when something changes in the environment where the application is running, such as:

    • Shutdown: rebooting or killing one node / VM / machine.
    • Time travel: changes the host’s system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
    • Process killer: killing a specific process, which can be used to simulate application or dependency crashes.
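
Changing the host's clock needs privileges and affects everything on the machine. While starting out, a similar effect can be approximated inside the application by routing time lookups through an injectable clock. The sketch below is an application-level stand-in, not the host-level attack itself; SkewedClock and is_expired are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


class SkewedClock:
    """A clock whose reading can be shifted to simulate a time jump."""

    def __init__(self, offset: timedelta = timedelta(0)) -> None:
        self.offset = offset

    def now(self) -> datetime:
        # Real UTC time plus the injected skew.
        return datetime.now(timezone.utc) + self.offset


def is_expired(expires_at: datetime, clock: SkewedClock) -> bool:
    """Example of time-dependent logic that reads the injectable clock."""
    return clock.now() >= expires_at


if __name__ == "__main__":
    clock = SkewedClock()
    token_expiry = datetime.now(timezone.utc) + timedelta(hours=1)
    print(is_expired(token_expiry, clock))   # token is still valid
    clock.offset = timedelta(hours=2)        # "time travel" two hours ahead
    print(is_expired(token_expiry, clock))   # token now counts as expired
```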


Network: simulating the inherently unpredictable behavior of the network, such as:

    • Blackhole: dropping all matching network traffic.
    • Latency: injecting latency into all matching egress network traffic.
    • Packet loss: introducing packet loss into all matching egress network traffic.
    • DNS: blocking access to DNS servers.

Application: knowing the internals of your applications, you can target specific conditions (e.g., dropping specific requests).
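
At the application level, latency and dropped requests can be injected by wrapping outgoing calls. The following is a minimal sketch under that assumption; FaultInjector and InjectedFault are illustrative names and not part of Gremlin or any other tool.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(ConnectionError):
    """Raised in place of a real network failure."""


class FaultInjector:
    """Wraps a call, adding latency and dropping a fraction of requests."""

    def __init__(
        self,
        latency_s: float = 0.0,
        drop_rate: float = 0.0,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.latency_s = latency_s
        self.drop_rate = drop_rate          # 0.0 = never drop, 1.0 = always drop
        self.rng = rng or random.Random()

    def call(self, func: Callable[[], T]) -> T:
        if self.rng.random() < self.drop_rate:
            raise InjectedFault("request dropped by chaos experiment")
        time.sleep(self.latency_s)          # injected latency before the real call
        return func()


if __name__ == "__main__":
    injector = FaultInjector(latency_s=0.2, drop_rate=0.1)
    try:
        print(injector.call(lambda: "response from dependency"))
    except InjectedFault as exc:
        print(f"caller must handle: {exc}")
```

Passing a seeded random.Random makes a run reproducible, which helps when comparing the system's behavior before and after a fix.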

Perhaps the best-known test is shutting down random servers. However, all the other types are significant, especially network issues, since they are the most likely to happen.

Start Small, Increase the Blast Zone

The final goal is to test abnormal conditions in the production environment to find weaknesses in our system. Of course, we don’t want to affect users while searching for these weaknesses. So the first step is to start small.

You should start in a pre-prod or testing environment, so you can get an idea of how your system behaves when something goes wrong. You may also start by affecting one particular user (a test user) or one specific node, or by running the test for a very short time. As your confidence in the system increases, you want to increase the blast zone: kill more servers, try higher latencies, affect requests from more users, run the test for longer.
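
One way to grow the blast zone gradually is to pick affected users by a stable hash, so that raising the percentage only adds users and never swaps them. This is a sketch of that idea under assumed names (in_blast_zone and the experiment label are hypothetical), not a prescribed method.

```python
import hashlib


def in_blast_zone(user_id: str, percent: float, experiment: str = "latency-test") -> bool:
    """Return True if this user falls inside the given percentage.

    The user is mapped to a stable bucket in 0..9999, so the same user
    always gets the same answer for the same experiment, and everyone
    included at 1% is still included at 5%.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (1, 5, 25):
        affected = sum(in_blast_zone(u, percent) for u in users)
        print(f"{percent}% target -> {affected} of {len(users)} users affected")
```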

While you want to push your system further and further, you must make sure that you don’t affect your users. An abort condition should always be present (e.g., stop the test and revert everything if you start getting 4xx / 5xx error codes).
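
An abort condition can be as simple as watching the error rate while the fault is active and always reverting at the end. The sketch below assumes the experiment is described by two callables, inject and revert, plus a stream of observed HTTP status codes; all names and thresholds are illustrative.

```python
from typing import Callable, Iterable


def run_experiment(
    inject: Callable[[], None],
    revert: Callable[[], None],
    status_codes: Iterable[int],
    max_error_rate: float = 0.05,
    min_samples: int = 20,
) -> bool:
    """Run a chaos experiment; return True if it completed, False if aborted.

    Responses with a 4xx or 5xx status count as errors. Once at least
    `min_samples` responses were seen and the error rate exceeds
    `max_error_rate`, the experiment stops early. `revert` runs in every
    case, including when an exception is raised.
    """
    inject()
    total = errors = 0
    try:
        for code in status_codes:
            total += 1
            if code >= 400:
                errors += 1
            if total >= min_samples and errors / total > max_error_rate:
                return False  # abort condition hit
        return True
    finally:
        revert()


if __name__ == "__main__":
    observed = [200] * 30 + [503] * 10   # stand-in for live monitoring data
    completed = run_experiment(
        inject=lambda: print("fault injected"),
        revert=lambda: print("fault reverted"),
        status_codes=observed,
    )
    print("completed" if completed else "aborted")
```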

Be Reliable and Resilient

Chaos engineering won’t fix your problems, nor will it create them. It just exposes them. The next step is to change your software so that it can cope with the issues that will arise. The question is not whether they will appear, but when, and whether we are prepared for them.

Reliable means that your software works correctly a certain percentage of the time. While we should aim for reliability, it is not possible to be 100% reliable.

Resilient means that your software can recover from abnormal conditions and remain functional. For example, in a food delivery application composed of several microservices, if the rating microservice is down, you may return a degraded response. Examples are “Rating not available right now,” “There are no current ratings,” or even a cached or default response. The user will still be able to use the application, though not all of its functionality.
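
The degraded response for the rating example could look like the sketch below: try the rating service, fall back to the last cached value, and finally to a default message. RatingClient and the shape of the response are assumptions made for illustration.

```python
from typing import Callable, Dict


class RatingClient:
    """Fetches ratings and degrades gracefully when the service fails."""

    def __init__(self, fetch: Callable[[str], float]) -> None:
        self.fetch = fetch                   # call to the rating microservice
        self.cache: Dict[str, float] = {}    # last known good values

    def get(self, restaurant_id: str) -> dict:
        try:
            value = self.fetch(restaurant_id)
        except Exception:
            # Rating service is down: serve a cached value if we have one.
            if restaurant_id in self.cache:
                return {"rating": self.cache[restaurant_id], "stale": True}
            # Otherwise return a default the UI can still render.
            return {
                "rating": None,
                "stale": True,
                "message": "Rating not available right now",
            }
        self.cache[restaurant_id] = value
        return {"rating": value, "stale": False}


if __name__ == "__main__":
    def broken_service(restaurant_id: str) -> float:
        raise ConnectionError("rating service unreachable")

    client = RatingClient(broken_service)
    print(client.get("pizza-place"))   # default response, the app keeps working
```

Catching every exception keeps the sketch short; a real client would probably catch only timeouts and connection errors so that programming mistakes still surface.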

In a production environment, things will go wrong. Abnormal conditions occur and affect our systems: servers reboot, networks aren’t 100% reliable, applications can be misconfigured. Chaos engineering lets you find out how to cope safely with these problems, avoiding outages in the future and providing a much better user experience.

