A simple Ansible playbook to run chaos experiments on infrastructure that has been deployed via this project.
The playbook chooses a random host from every group in the inventory (a minimal sketch of how this can be done follows the experiment list below):
- etcd_cluster
- postgres_cluster (patroni and psql)
- haproxy (balancer)
and runs several experiments against them:
- Network packet loss / delay
- RAM / CPU / IO load
- Service stop (etcd, patroni)
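As an illustrative minimal example (not necessarily the playbook's actual implementation), one victim host per group can be picked with Ansible's `random` filter, using the group names listed above:

```yaml
# Illustrative sketch: pick one random victim host from each inventory group.
- name: Choose chaos victims
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Pick a random host from every target group
      ansible.builtin.set_fact:
        chaos_victims:
          etcd: "{{ groups['etcd_cluster'] | random }}"
          postgres: "{{ groups['postgres_cluster'] | random }}"
          haproxy: "{{ groups['haproxy'] | random }}"
```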
Note
You can disable some of these experiments in vars.yml.
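For illustration only, such toggles in vars.yml could look like this (the variable names are hypothetical, check the actual file):

```yaml
# Hypothetical vars.yml fragment; the real variable names may differ.
network_experiments_enabled: true   # packet loss / delay via tc
load_experiments_enabled: true      # RAM / CPU / IO load
service_stop_enabled: false         # skip stopping etcd/patroni
```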
After the specified delay, the rollback role puts everything back in its place.
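As a rough sketch (assuming the experiments use tc/netem, systemd services named patroni and etcd, and the eth0 interface), rollback tasks could look like this:

```yaml
# Illustrative rollback: undo netem rules and restart stopped services.
- name: Remove tc netem rules
  ansible.builtin.command: tc qdisc del dev eth0 root netem
  become: true
  ignore_errors: true  # nothing to delete if the network experiment was skipped

- name: Start services that were stopped
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: started
  become: true
  loop:
    - patroni
    - etcd
```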
First, clone the repo and check the playbook against your hosts:
ansible-playbook chaos.yml -i inventory --check
and run it:
ansible-playbook chaos.yml -i inventory
Warning
The cluster will remain active only while at least:
- one patroni node
- two etcd nodes
are available.
1.1 Experiment description: Stop the patroni service on the current Patroni leader with the systemd role.
1.2 Expected results: A Patroni replica promotes itself to the new leader.
1.3 Real outcomes:
Patroni logs:
INFO: Cleaning up failover key after acquiring leader lock...
INFO:patroni.watchdog.base:Software Watchdog activated with 25 second timeout, timing slack 15 seconds
INFO:patroni.__main__:promoted self to leader by acquiring session lock
INFO:patroni.ha:Lock owner: psql-2; I am psql-2
INFO:patroni.__main__:updated leader lock during promote
1.4 Results analysis: The Patroni replica promoted itself to the new leader, as expected.
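A minimal sketch of this experiment's stop-and-verify steps (the patronictl config path and service name are assumptions):

```yaml
# Illustrative: stop patroni on the current leader, then list cluster state
# from a surviving replica to confirm the promotion.
- name: Stop patroni on the leader
  ansible.builtin.systemd:
    name: patroni
    state: stopped
  become: true

- name: Show cluster state
  ansible.builtin.command: patronictl -c /etc/patroni/patroni.yml list
  register: patroni_state
  changed_when: false
```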
2.1 Experiment description: Stop the etcd service on the current etcd leader with the systemd role.
2.2 Expected results: The etcd hosts retain a valid quorum with at least two active nodes, so a new leader will be re-elected.
2.3 Real outcomes:
e1f06668267121f5 [term 38] received MsgTimeoutNow from b586ded327f9460d and starts an election to get leadership.
e1f06668267121f5 lost leader b586ded327f9460d at term 39
e1f06668267121f5 became leader at term 39
e1f06668267121f5 elected leader e1f06668267121f5 at term 39
2.4 Results analysis: During the quorum decision, etcd randomly elected one of the remaining nodes as the new leader.
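The etcd case can be sketched in the same way; the leadership check below uses `etcdctl endpoint status`, which prints an IS LEADER column (endpoint and TLS settings are omitted and would need to match your cluster):

```yaml
# Illustrative: stop etcd on the leader and ask a surviving member who took over.
- name: Stop etcd on the leader
  ansible.builtin.systemd:
    name: etcd
    state: stopped
  become: true

- name: Check the new leader from a surviving member
  ansible.builtin.command: etcdctl endpoint status --cluster -w table
  environment:
    ETCDCTL_API: "3"
  register: etcd_status
  changed_when: false
```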
3.1 Experiment description: Create network packet loss on the Patroni master via tc.
3.2 Expected results: Increased latency for any API request, as measured by the blackbox probes.
3.3 Real outcomes: Latency has increased.
3.4 Results analysis: Packet loss has a clear negative impact on API request latency.
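The packet-loss injection can be reproduced with a task like the following (the interface name, loss percentage, and delay are assumptions):

```yaml
# Illustrative: add 20% packet loss and 100 ms delay on the Patroni master.
- name: Inject packet loss and delay with tc/netem
  ansible.builtin.command: tc qdisc add dev eth0 root netem loss 20% delay 100ms
  become: true
# Roll back later with: tc qdisc del dev eth0 root netem
```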
4.1 Experiment description: Put maximum load on the CPU with stress-ng.
4.2 Expected results: An email alert from Alertmanager and a latency bump in the blackbox probe statistics.
4.3 Real outcomes:
4.4 Results analysis: The CPU stress test did not show a significant impact on latency for API requests.
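The CPU attack can be reproduced with a stress-ng task like this (worker count and duration are assumptions; `--cpu 0` means one worker per core):

```yaml
# Illustrative: load all CPU cores for two minutes.
- name: Stress all CPU cores
  ansible.builtin.command: stress-ng --cpu 0 --timeout 120s
  become: true
```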
5.1 Experiment description: Check that the Alertmanager alert rules work as expected.
5.2 Expected results: New email alerts from the Prometheus Alertmanager.
5.3 Real outcomes: Email alerts were received due to the abnormal cluster state.
5.4 Results analysis: Operating-system level attacks (disk, memory, RAM load) mostly end up loading the CPU, so the CPU-load alert fires most of the time, even though alerts are configured for the other resource types. A solution would be to lower the thresholds for the other attack types (disk, memory) so that those alerts fire earlier.
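For illustration, a lowered threshold could be expressed as a Prometheus alert rule like the one below (the metric expression, threshold, and timings are assumptions, not the project's actual rules):

```yaml
# Hypothetical rule: fire earlier on memory pressure so it is not masked by the CPU alert.
- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage above 80% on {{ $labels.instance }}"
```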