Resilience Testing Tools
Testing Resilience
The complexity of modern application infrastructure enables a wide range of capabilities, but that same complexity can make a system fragile under certain conditions.
The purpose of resilience testing is to mitigate the risk of a system’s functionality being lost or degraded during an error or failure. To do so, we subject the system under test to controlled failure conditions, such as the loss of core functions or data.
Typically, we can cover scenarios such as:
- turning off hardware
- turning off specific jobs
- turning off specific services
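As an illustration of the scenarios above, the following Python sketch turns off a dependency in a controlled way and checks whether the system degrades gracefully. The names `PriceService`, `Checkout` and `run_resilience_check` are hypothetical stand-ins for a real service and system under test. In practice the failure would be injected by stopping real infrastructure rather than flipping a flag.

```python
class PriceService:
    """Hypothetical dependency that can be switched off to simulate a failure."""

    def __init__(self):
        self.available = True

    def get_price(self, item):
        if not self.available:
            raise ConnectionError("price service is down")
        return {"apple": 1.20}.get(item, 0.0)


class Checkout:
    """Hypothetical system under test; falls back to the last known price."""

    def __init__(self, price_service):
        self.price_service = price_service
        self.cache = {}

    def quote(self, item):
        try:
            price = self.price_service.get_price(item)
            self.cache[item] = price
            return price, "live"
        except ConnectionError:
            # Degraded mode: serve a cached value if one exists.
            if item in self.cache:
                return self.cache[item], "cached"
            return None, "unavailable"


def run_resilience_check():
    """Run a baseline call, inject the failure, then verify recovery."""
    service = PriceService()
    checkout = Checkout(service)
    results = [checkout.quote("apple")]       # baseline, service is up
    service.available = False                 # controlled failure: service off
    results.append(checkout.quote("apple"))   # previously seen item
    results.append(checkout.quote("pear"))    # item with no cached price
    service.available = True                  # restore the service
    results.append(checkout.quote("apple"))   # confirm recovery
    return results
```

The check records behaviour before, during and after the failure, so both the degraded mode and the recovery are observed rather than assumed.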
Alternatively, we can simulate a wide range of chaotic conditions using specialised tools known as chaos engineering tools.
Chaos Engineering Tools
Chaos Monkey
- In 2010, Netflix moved its systems to the cloud and designed Chaos Monkey to test them by randomly terminating instances and/or services within Netflix’s architecture.
- The tool terminates services during a time of day that the user sets.
- Chaos Monkey only supports deployments made through Spinnaker.
- The tool is open source and free.
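The idea of random termination within a user-defined time window can be sketched in a few lines of Python. This is not Chaos Monkey's code or configuration: `pick_victim`, the instance names and the 09:00 to 15:00 window are assumptions made purely for illustration.

```python
import random
from datetime import time


def in_window(now, start, end):
    """Return True when `now` falls inside the allowed termination window."""
    return start <= now <= end


def pick_victim(instances, now, start=time(9, 0), end=time(15, 0), rng=None):
    """Choose one instance to terminate, or None outside the window.

    `instances` is a collection of hypothetical instance names; a real tool
    would then call the platform's API to terminate the chosen instance.
    """
    if rng is None:
        rng = random.Random()
    if not instances or not in_window(now, start, end):
        return None
    # Sort first so that a seeded rng gives a repeatable choice.
    return rng.choice(sorted(instances))
```

For example, `pick_victim(["web-1", "web-2"], time(20, 0))` returns `None` because 20:00 is outside the assumed window, while a call at `time(10, 0)` returns one of the two names at random.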
Because chaos engineering is a relatively new area of testing, the available tooling may be limited, but it is expanding rapidly as more tools are created. Only a couple, however, support multiple infrastructure solutions.
Gremlin
- Operating as a resilience-testing-as-a-service platform, Gremlin offers a comprehensive set of failure modes across multiple cloud systems, containerised environments and servers. It gives you the option to throttle CPU, RAM or network throughput, as well as to terminate hosts and services.
- It is cloud agnostic and works on AWS, Azure, Google Cloud Platform and even physical servers in a datacentre.
- It has native integration with Kubernetes.
- There is a limited free option, but full use requires an enterprise license.
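Degraded conditions, as opposed to outright termination, can also be illustrated with a small Python sketch. It does not use Gremlin's API. It simulates a slow network by adding latency to a call and checks whether the call still completes within a time budget. The helpers `with_latency` and `call_within_budget` are hypothetical names chosen for this example.

```python
import time


def with_latency(func, delay_seconds, sleep=time.sleep):
    """Wrap `func` so that every call is delayed, simulating a slow network."""

    def wrapper(*args, **kwargs):
        sleep(delay_seconds)
        return func(*args, **kwargs)

    return wrapper


def call_within_budget(func, budget_seconds, clock=time.monotonic):
    """Call `func` and report whether it finished within the time budget."""
    start = clock()
    result = func()
    elapsed = clock() - start
    return result, elapsed <= budget_seconds
```

The `sleep` and `clock` parameters are injectable so that the behaviour can be checked with a fake clock instead of real waiting.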