A simple Ansible playbook to run chaos experiments on infrastructure that has been deployed via this project.
The playbook chooses a random host from every group in the inventory (a minimal sketch of how this can be done follows the experiment list below):
- etcd_cluster
- postgres_cluster (patroni and psql)
- haproxy (balancer)
and runs several experiments against them:
- Network packet loss / delay
- RAM / CPU / IO load
- Service stop (etcd, patroni)
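As an illustrative minimal example (not necessarily the playbook's actual implementation), one victim host per group can be picked with Ansible's `random` filter, using the group names listed above:

```yaml
# Illustrative sketch: pick one random victim host from each inventory group.
- name: Choose chaos victims
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Pick a random host from every target group
      ansible.builtin.set_fact:
        chaos_victims:
          etcd: "{{ groups['etcd_cluster'] | random }}"
          postgres: "{{ groups['postgres_cluster'] | random }}"
          haproxy: "{{ groups['haproxy'] | random }}"
```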
Note
You can disable some of these experiments in vars.yml.
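For illustration only, such toggles in vars.yml could look like this (the variable names are hypothetical, check the actual file):

```yaml
# Hypothetical vars.yml fragment; the real variable names may differ.
network_experiments_enabled: true   # packet loss / delay via tc
load_experiments_enabled: true      # RAM / CPU / IO load
service_stop_enabled: false         # skip stopping etcd/patroni
```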
After the specified delay, the rollback role puts everything back in its place.
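As a rough sketch (assuming the experiments use tc/netem, systemd services named patroni and etcd, and the eth0 interface), rollback tasks could look like this:

```yaml
# Illustrative rollback: undo netem rules and restart stopped services.
- name: Remove tc netem rules
  ansible.builtin.command: tc qdisc del dev eth0 root netem
  become: true
  ignore_errors: true  # nothing to delete if the network experiment was skipped

- name: Start services that were stopped
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: started
  become: true
  loop:
    - patroni
    - etcd
```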
First, clone the repo and check the playbook against your hosts:
ansible-playbook chaos.yml -i inventory --check
and run it:
ansible-playbook chaos.yml -i inventory
Warning
The cluster will remain active only while at least:
- one patroni node
- two etcd nodes
are available.
1.1 Experiment description: Stop the patroni service on the current Patroni leader with the systemd role.
1.2 Expected results: A Patroni replica promotes itself to the new leader.
1.3 Real outcomes:
Patroni logs:
INFO: Cleaning up failover key after acquiring leader lock...
INFO:patroni.watchdog.base:Software Watchdog activated with 25 second timeout, timing slack 15 seconds
INFO:patroni.__main__:promoted self to leader by acquiring session lock
INFO:patroni.ha:Lock owner: psql-2; I am psql-2
INFO:patroni.__main__:updated leader lock during promote
1.4 Results analysis: The Patroni replica promoted itself to the new leader, as expected.
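A minimal sketch of this experiment's stop-and-verify steps (the patronictl config path and service name are assumptions):

```yaml
# Illustrative: stop patroni on the current leader, then list cluster state
# from a surviving replica to confirm the promotion.
- name: Stop patroni on the leader
  ansible.builtin.systemd:
    name: patroni
    state: stopped
  become: true

- name: Show cluster state
  ansible.builtin.command: patronictl -c /etc/patroni/patroni.yml list
  register: patroni_state
  changed_when: false
```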
2.1 Experiment description: Stop the etcd service on the current etcd leader with the systemd role.
2.2 Expected results: The etcd hosts retain a valid quorum with at least two active nodes, so a new leader will be re-elected.
2.3 Real outcomes:
e1f06668267121f5 [term 38] received MsgTimeoutNow from b586ded327f9460d and starts an election to get leadership.
e1f06668267121f5 lost leader b586ded327f9460d at term 39
e1f06668267121f5 became leader at term 39
e1f06668267121f5 elected leader e1f06668267121f5 at term 39
2.4 Results analysis: During the quorum decision, etcd randomly elected one of the remaining nodes as the new leader.
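The etcd case can be sketched in the same way; the leadership check below uses `etcdctl endpoint status`, which prints an IS LEADER column (endpoint and TLS settings are omitted and would need to match your cluster):

```yaml
# Illustrative: stop etcd on the leader and ask a surviving member who took over.
- name: Stop etcd on the leader
  ansible.builtin.systemd:
    name: etcd
    state: stopped
  become: true

- name: Check the new leader from a surviving member
  ansible.builtin.command: etcdctl endpoint status --cluster -w table
  environment:
    ETCDCTL_API: "3"
  register: etcd_status
  changed_when: false
```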
3.1 Experiment description: Create network packet loss on the Patroni master via tc.
3.2 Expected results: Increased latency for any API request, as measured by the blackbox probes.
3.3 Real outcomes: Latency has increased.
3.4 Results analysis: Packet loss has a clear negative impact on API request latency.
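The packet-loss injection can be reproduced with a task like the following (the interface name, loss percentage, and delay are assumptions):

```yaml
# Illustrative: add 20% packet loss and 100 ms delay on the Patroni master.
- name: Inject packet loss and delay with tc/netem
  ansible.builtin.command: tc qdisc add dev eth0 root netem loss 20% delay 100ms
  become: true
# Roll back later with: tc qdisc del dev eth0 root netem
```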
4.1 Experiment description: Put maximum load on the CPU with stress-ng.
4.2 Expected results: An email alert from Alertmanager and a latency bump in the blackbox probe statistics.
4.3 Real outcomes:
4.4 Results analysis: The CPU stress test did not show a significant impact on latency for API requests.
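The CPU attack can be reproduced with a stress-ng task like this (worker count and duration are assumptions; `--cpu 0` means one worker per core):

```yaml
# Illustrative: load all CPU cores for two minutes.
- name: Stress all CPU cores
  ansible.builtin.command: stress-ng --cpu 0 --timeout 120s
  become: true
```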
5.1 Experiment description: Check that the Alertmanager alert rules work as expected.
5.2 Expected results: New email alerts from the Prometheus Alertmanager.
5.3 Real outcomes: Email alerts were received due to the abnormal cluster state.
5.4 Results analysis: Operating-system level attacks (disk, memory, RAM load) mostly end up loading the CPU, so the CPU-load alert fires most of the time, even though alerts are configured for the other resource types. A solution would be to lower the thresholds for the other attack types (disk, memory) so that those alerts fire earlier.
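For illustration, a lowered threshold could be expressed as a Prometheus alert rule like the one below (the metric expression, threshold, and timings are assumptions, not the project's actual rules):

```yaml
# Hypothetical rule: fire earlier on memory pressure so it is not masked by the CPU alert.
- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage above 80% on {{ $labels.instance }}"
```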