Chaos Engineering - System resilience in practice. Book notes

System resilience in practice by Casey Rosenthal and Nora Jones

Book notes

Dynamic safety model

Safety

Engineers tend to optimize for what they see. Since they have an intuition for the Economics and Workload properties, they see those properties in their day-to-day work. This inadvertently leads them further away from Safety. This effect sets up conditions for a silent, unseen drift towards failure, as the success of their endeavours provides opportunity to optimize towards Economics and Workload, but away from Safety.

One way to interpret the benefit of Chaos Engineering on an organization is that it helps engineers develop an intuition for Safety where it is otherwise lacking. The empirical evidence provided by the experiments informs the engineers’ intuition.

By far the greater value is in teaching the engineers things they did not anticipate about how safety mechanisms interplay in the complexity of the entire system.

Economic pillars of complexity

The gut instinct most engineers have when faced with complexity is to avoid or reduce it. Unfortunately, simplification removes utility and ultimately limits business value. The potential for success rises with complexity.

Reversibility

Optimizing for reversibility is a virtue in contemporary software engineering, and it pays dividends down the road when working with complex systems. This is also a foundational model for Chaos Engineering. The experiments expose properties of a system that are counterproductive to reversibility. In many cases these might be efficiencies that are purposefully built by engineers.

Overview of Principles

Experimentation versus testing

Testing, strictly speaking, does not create new knowledge.

Experimentation, on the other hand, creates new knowledge. Experiments propose a hypothesis, and as long as the hypothesis is not disproven, confidence in that hypothesis grows. If it is disproven, then we learn something new.
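The shape of such an experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_experiment`, `serves_requests`, `kill_replica_a`, `restore_replica_a` and the two-replica toy system are hypothetical names made up for this note, not something taken from the book.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    held: bool  # True means the hypothesis was not disproven in this run


def run_experiment(
    hypothesis: str,
    steady_state_ok: Callable[[], bool],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
) -> ExperimentResult:
    """Inject a fault and check whether the steady state still holds."""
    if not steady_state_ok():
        # The system is already unhealthy, so injecting a fault teaches nothing.
        raise RuntimeError("steady state not met before the experiment")
    inject_fault()
    try:
        held = steady_state_ok()
    finally:
        # Always undo the fault, even if the steady-state probe raises.
        remove_fault()
    return ExperimentResult(hypothesis, held)


# Hypothetical toy system: two replicas, requests succeed if any one is up.
replicas = {"a": True, "b": True}


def serves_requests() -> bool:
    return any(replicas.values())


def kill_replica_a() -> None:
    replicas["a"] = False


def restore_replica_a() -> None:
    replicas["a"] = True


result = run_experiment(
    "Requests are still served when replica a is down",
    serves_requests,
    kill_replica_a,
    restore_replica_a,
)
print(result)
```

A run where `held` is true only adds confidence in the hypothesis. A run where it is false is the one that produces new knowledge.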

Verification versus validation

Chaos Engineering strongly prefers verification over validation. Chaos Engineering cares whether something works, not how.

What Chaos Engineering is not

“Breaking stuff” could be done in countless ways, with little time invested. The larger question here is, how do we reason about things that are already broken, when we don’t even know they are broken?

“Fixing stuff in production” does a much better job of capturing the value of Chaos Engineering since the point of the whole practice is to proactively improve availability and security of a complex system.

Build a Hypothesis around steady-state behavior

Doing a deep dive can help with exploration, but it is a distraction from the best learning that Chaos Engineering can offer. At its best, Chaos Engineering is focused on key performance indicators (KPIs) or other metrics that track with clear business priorities, and those make for the best steady-state definitions.
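One way to make a steady-state definition concrete is to compare a business KPI during the experiment against a baseline window. The sketch below assumes a hypothetical "orders per minute" metric and a 5% tolerance; both the function name `within_steady_state` and the threshold are illustrative choices, not values from the book.

```python
from statistics import mean


def within_steady_state(baseline, observed, tolerance=0.05):
    """Return True if the observed mean KPI deviates from the baseline mean
    by at most `tolerance`, expressed as a fraction of the baseline."""
    if not baseline or not observed:
        raise ValueError("need at least one sample in each window")
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative deviation is undefined")
    deviation = abs(mean(observed) - base) / base
    return deviation <= tolerance


# Hypothetical samples of orders per minute, before and during an experiment.
baseline_window = [100, 102, 98, 101]
experiment_window = [97, 99, 96, 98]

print(within_steady_state(baseline_window, experiment_window))
```

The point of the design is that the check looks only at the business-level output, not at the internals of any single service.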

Google DiRT: Disaster Recovery Testing

Minimize cost, maximize value

If you already know a system is broken, you may as well prioritize the engineering work to address the known risks and then disaster test your mitigations later.

What to test

Which systems keep you up at night? Are you aware of singly homed data or services? Are there processes that depend on people in a single location, or on a single vendor? Are you 100% confident that your monitoring and alerting systems raise alarms when expected? When was the last time you performed a cutover to your fallback systems? When was the last time you restored your system from backup? Have you validated your system’s behavior when its “noncritical” dependencies are unavailable?
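The last question lends itself to a small check. The sketch below assumes a hypothetical home page that treats a recommendations service as noncritical; `render_home_page`, `failing_recommendations` and `DependencyDown` are invented names for illustration, and the simulated outage is just a function that raises.

```python
class DependencyDown(Exception):
    """Raised by the stand-in client to simulate an outage."""


def working_recommendations(user_id: str) -> list:
    return ["book-1", "book-2"]


def failing_recommendations(user_id: str) -> list:
    raise DependencyDown("recommendations service unavailable")


def render_home_page(user_id: str, fetch_recommendations) -> dict:
    """Build the page; recommendations are optional and must not break it."""
    page = {"user": user_id, "recommendations": []}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except Exception:
        # Degrade gracefully: a noncritical dependency failing should leave
        # the rest of the page intact.
        pass
    return page


healthy = render_home_page("u1", working_recommendations)
degraded = render_home_page("u1", failing_recommendations)
print(healthy)
print(degraded)
```

If the `try`/`except` were missing, the "noncritical" dependency would turn out to be critical, which is exactly the kind of surprise this question is meant to surface.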

Data integrity

Backups are only as good as the last time you tested a restore.
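A restore test can be as simple as restoring into a scratch location and comparing checksums with the source. The following is a minimal sketch, assuming file-based data and a zip archive created with the standard library; the helper names `checksums` and `restore_matches_source` and the sample `orders.csv` file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file path relative to `root` to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_matches_source(source: Path, archive: Path) -> bool:
    """Restore `archive` into a scratch directory and compare it with `source`."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return checksums(Path(scratch)) == checksums(source)


with tempfile.TemporaryDirectory() as work:
    source = Path(work) / "data"
    source.mkdir()
    (source / "orders.csv").write_text("id,total\n1,9.99\n")

    # Take the backup, then prove that it can actually be restored.
    archive = Path(
        shutil.make_archive(str(Path(work) / "backup"), "zip", root_dir=source)
    )
    ok = restore_matches_source(source, archive)
    print("restore verified:", ok)
```

In a real system the comparison would run against a restored database or object store rather than a temporary directory, but the idea is the same: the backup only counts once a restore has been checked.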

How to test

Aim, as much as possible, to test only one hypothesis at a time, and be especially wary of mixing the testing of automated system reactions with the testing of human reactions.

This post is licensed under CC BY 4.0 by the author.