When people talk about quality in the software industry, they usually mean software testing, much of which is now automated. There are several types of tests, each focused on a different need. Modern software services are distributed and complex, and they fail in unpredictable ways that traditional automated tests often cannot detect. This problem matters greatly to companies, and one answer to it is called “Chaos Engineering.” Chaos engineering injects controlled faults into a system to identify its weaknesses. But what is chaos engineering? How can it be implemented? And why is it important for your system? Read on to find the answers to these questions and learn why chaos engineering should be implemented in your company.
Companies that run internet services change their systems continuously to improve their services, functionality, and availability. These changes may ship monthly or even daily. The process can introduce errors that automated tests do not detect, so these companies take a different approach. Netflix is a pioneer in chaos engineering and has developed principles that describe how to design and run the experiments. To run them, you first need an inventory of your services and an inventory of the errors that could occur. In a distributed system, building that inventory is very complex. The system should be treated as dynamic, with many interactions happening in real time, rather than as something with typical, stable behavior. The changes a dynamic system goes through include shifts in the type of requests, configuration changes, and the loss of servers, nodes, or whole regions. To deal with this problem, chaos engineering applies the scientific method, which leads to the following principles:
- Generate a system hypothesis
- Inject real failures
- Collect data and validate hypotheses
- Automate, add to continuous integration and generate reports
Generate a system hypothesis
To define a hypothesis, you must first define the system’s stable, or normal, state. This depends on each service and on the period of time over which its behavior can be classified as normal. You should also define how important each service is and identify the main thing that must not happen. For example, Netflix runs multiple services, such as search, recommendations, and the catalog, but the one that must not fail is video streaming. Then generate a behavior hypothesis by asking “What if?” and draw conclusions from the answer.
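As a rough illustration, a hypothesis can be written down as data before any fault is injected. The sketch below is only one possible shape for it: the class name, the metric name, the 0.99 threshold, and the "what if" text are all hypothetical and are not taken from Netflix or from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """Hypothetical record of what 'normal' means for one service."""

    service: str      # the service that must keep working
    metric: str       # the measurement that represents its normal state
    min_value: float  # lowest value still classified as normal (assumed)
    what_if: str      # the fault the experiment will introduce

    def holds(self, observed: float) -> bool:
        """Return True when the observed metric is still in the normal range."""
        return observed >= self.min_value


# Illustrative values only: streaming is the service that must not fail.
hypothesis = SteadyStateHypothesis(
    service="streaming",
    metric="stream_start_success_rate",
    min_value=0.99,
    what_if="one recommendation-service instance is terminated",
)
```

Writing the hypothesis down this way forces us to name the metric and the threshold up front, so the later validation step cannot quietly move the goalposts.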
Inject real failures
Many companies, such as Amazon, Google, Microsoft, and Facebook, apply chaos engineering techniques, and many have created their own tools. Netflix, for its part, has created an open-source tool that facilitates fault injection. It is called Chaos Monkey, and it helps inject configurable types of faults at random.
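To make the idea of random, configurable faults concrete, here is a minimal application-level sketch. It is not Chaos Monkey and does not use its API: the wrapper, the exception, and the `fetch_recommendations` stand-in are hypothetical names invented for this example, and a real tool would act on infrastructure rather than on a Python function.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real response when a fault is injected."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_recommendations(user_id):
    """Stand-in for a real downstream call (hypothetical)."""
    return [f"title-{user_id}-{i}" for i in range(3)]


# A seeded generator lets the same experiment be replayed later.
rng = random.Random(42)
flaky_fetch = with_faults(fetch_recommendations, failure_rate=0.3, rng=rng)
```

The failure rate is the configurable part and the random generator supplies the unpredictability. Passing the generator in, instead of using the global one, is a design choice that keeps experiments repeatable.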
Collect data and validate hypotheses
Have monitoring tools in place to collect data during the experiment, and use that data to try to refute the hypothesis stated at the start of the experiment.
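The validation step can be sketched as a comparison between data collected before and during the experiment. In the example below, the function names, the boolean "request succeeded" observations, and the 0.05 tolerance are assumptions made for illustration; real data would come from your monitoring tools.

```python
def success_rate(outcomes):
    """Return the fraction of True values in a list of request outcomes."""
    if not outcomes:
        raise ValueError("no observations collected")
    return sum(outcomes) / len(outcomes)


def validate(baseline, experiment, max_drop):
    """Compare experiment data against the baseline.

    The hypothesis is refuted when the success rate falls by more than max_drop.
    """
    base = success_rate(baseline)
    observed = success_rate(experiment)
    return {
        "baseline": base,
        "observed": observed,
        "hypothesis_holds": (base - observed) <= max_drop,
    }


# Illustrative data: 100 healthy requests, then 3 failures out of 100
# while the fault was active.
baseline = [True] * 100
experiment = [True] * 97 + [False] * 3
result = validate(baseline, experiment, max_drop=0.05)
```

Here the experiment tries to refute the hypothesis and fails to do so, because the drop stays within the tolerance. A larger drop would flip `hypothesis_holds` to `False` and point at a weakness worth fixing.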
Automate, add to continuous integration and generate reports
Running the tests as part of continuous integration lets us replicate the hypotheses, study them, and correct them. The generated reports are the evidence that demonstrates system improvements.
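One way this could look in a pipeline is a small step that turns experiment results into a report and an exit code. The sketch below assumes each result is a dictionary with a `name` and a `hypothesis_holds` flag; that shape, the experiment names, and the report fields are hypothetical and not tied to any particular CI system.

```python
import json


def build_report(results):
    """Summarise experiment results as a JSON string plus a CI exit code."""
    failed = [r["name"] for r in results if not r["hypothesis_holds"]]
    report = {
        "total": len(results),
        "passed": len(results) - len(failed),
        "failed": failed,
    }
    # A non-zero exit code is the usual way to make a CI job fail.
    exit_code = 1 if failed else 0
    return json.dumps(report, indent=2, sort_keys=True), exit_code


# Illustrative results from two hypothetical experiments.
results = [
    {"name": "terminate-recommendation-instance", "hypothesis_holds": True},
    {"name": "add-latency-to-catalog", "hypothesis_holds": False},
]
report_json, exit_code = build_report(results)
```

In a real job, the script would print or store `report_json` as a build artifact and finish with `sys.exit(exit_code)`, so that a refuted hypothesis stops the pipeline and the stored reports show how the system improves over time.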
Conclusion
In conclusion, practicing chaos engineering on our systems brings great advantages, because it shows how robust or, on the contrary, how fragile a system is. Generating hypotheses and validating them reduces uncertainty, identifies major risks, and adds value for customers.