Load Testing Framework for internal and external usage #412

markmandel · 2018-11-09T22:12:39Z

Problem

We need some way to (a) be able to load test Agones at scale and (b) help users or Agones load test it for their own workloads.

Notes

I feel like we should be able to work both of these out at the same time - if we can create a framework for load testing we can use internally, such that it can also be used externally, that would be ideal.

Thoughts, feeling and opinions are appreciated 😄

Research

http://blog.supertuxkart.net/2018/11/supertuxkart-networking-looking-for.html

cyriltovena · 2018-11-12T19:24:15Z

I would vote for locust and we could build some pre-build plan working against the kubernetes api.
Also very interested in helping/building on this.

markmandel · 2018-11-13T01:15:46Z

Maybe we start with a simple example using something like xonotic/simple-udp? and then we can look at how to make it a bit more customisable?

I've not got strong opinions personally on load testing systems.

Side thought - we may also want to think about how we can automate some of this. Wondering if we should have nightly jobs for autoscale testing and load testing??? (but that can be a phase two)

stephbu · 2018-11-28T05:47:24Z

Definitely +1 having a standard "small" load that can be assigned a port, generates logs, simulates and emits metrics, can defer response to SIGTERM (upto and beyond termination grace period) is really useful. We use sample PSU curves from real data that we scale and/or offset to simulate daily, as well as compress timeline to accelerate growth.

Zanderax · 2018-11-28T07:32:22Z

I'd be interested in using this. It'd be good to be able to show stats from a cluster at scale.

pm7h · 2018-12-18T01:53:29Z

I have started looking into using Locust for this. From a quick scan, it seems to be straight-forward. We can start by writing a client to start Agones game servers. We will also need to define the test scenarios.

In the first step, we can run the test from a single machine. We can then extend it spin up master and slave nodes: https://docs.locust.io/en/stable/running-locust-distributed.html. A single docker image can run as standalone, master or slave: https://docs.locust.io/en/latest/running-locust-docker.html

markmandel · 2018-12-19T18:58:18Z

Should we make a plan for what types of load tests we should have in our system?
I'm thinking of several that could be possible:

How many game servers until things break
How many allocations per second until things break
How fast can we allocate
How long does it take to scale up a Fleet

There are likely more?

Then there is likely also load testing for real CPU & network metrics for determining limits etc for the gameserver itself - to which I'm not sure how to tackle.

pm7h · 2018-12-20T00:57:30Z

Makes sense. So, two categories of tests:

Load testing: 1,2,3,4 above.
Performance tests for CPU and network metrics.

Let's start by 1 since it seems to be more straight-forward using Locust. I will think about what approach we can take for 2.

pm7h · 2019-01-24T18:13:39Z

Adding my notes on the design:

Design

We will focus on two categories of tests, performance tests and load tests. These two categories have different requirements and goals which implies different test approaches.

Performance Tests

The goal of performance tests is to provide metrics on various operations such as fleet scaling up/down. The existing Agones e2e test framework can be used for performance tests.

Test Cases

Fleet scaling up. Create a fleet of size 1, increase the size to 100/1000/100000, and measure the time it takes to fully scale up the fleet.

In addition to the time it takes to fully scale up the fleet, the test should also emit continues metrics on game servers. This includes how many game servers are in different states (PortAllocation, Creating, Starting, Scheduled, RequestReady, Ready).

If tested with GKE, this test should be repeated with GKE cluster Autoscaling enabled and disabled. When GKE cluster Autoscaling is disabled we should test two scenarios. One where the cluster has sufficient capacity and one where it does not.

Fleet scaling down. Create a fleet of size 100/1000/100000, scale down to 1, and measure the time it takes to fully scale down the fleet.

In addition to the time it takes to fully scale down the fleet, the test should also emit continues metrics on game servers. This includes how many game servers are in different states (PortAllocation, Creating, Starting, Scheduled, RequestReady, Ready).

If tested with GKE, this test should be repeated with GKE cluster Autoscaling enabled and disabled. When GKE cluster Autoscaling is disabled we should test two scenarios. One where the cluster has sufficient capacity and one where it does not.

Load Tests

Load tests aim to test the performance of the system under heavy load. Game server allocation is an example where multiple parallel operations should be tested.

Locust is a good option for load tests. Unfortunately, Locust integration with go is not stable so the only options are raw HTTP requests, or the Python client library.

Locust can be easily integrated with other open source tools for storage and visualization. I have tested integration with Graphite and Grafana. Prometheus is more powerful that Graphite and is therefore a better option to Graphite.

The final Locust tasks that are for running the test, and the server that is being tested should be containerized for easy adoption.

Test Cases

GameServerAllocation. Create a fleet of size 10/100/1000, and allocate multiple game servers in parallel. Measure the time it takes to allocate a game server. This test includes two scenarios, one in which the number of allocations exceeds the fleet size and one that it doesn’t. The tests should evaluate whether allocation time depends on the number of ready GameServers.

pm7h · 2019-01-25T00:04:23Z

Observations

Testing Fleet scaling up/down with GKE Cluster Autoscaling enabled

Test Environment

GKE cluster with the following configurations:

Node version: 1.11.6-gke.2
Node image: Container-Optimized OS (cos)
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
Automatic node upgrades: Disabled
Automatic node repair: Enabled
Autoscaling: On
Minimum size (in all zones): 3
Maximum size (in all zones): 100
Preemptible nodes: Disabled
Boot disk type: Standard persistend disk
Boot disk size in GB (per node): 100

Results - Fleet Scaling

I have observed that with GKE Cluster Autoscaling enabled, scaling up the Fleet gets stuck at some point.

Pod: At this point there are a number of pods that are out of CPU (example error: Node didn't have enough resource: cpu, requested: 130, used: 3840, capacity: 3920) but no new nodes are being created.
GameServer: The corresponding game servers get stuck at "Scheduled" state. The GameServerSet for the Fleet has a number of UnhelathyDelete where a number of GameServers where successfully deleted.
Quota: No GCE quota errors.
Other errors: I noticed errors on getting external address for GameServers (example error: error getting external address for GameServer : error retrieving node for Pod : node "" not found)

Testing Fleet scaling up/down with GKE Cluster Autoscaling disabled

Average time to spin up a Fleet: 9.3 seconds
Average time to scale up a Fleet from 0 to 1000: 19.4 minutes

On Fleet scaling down, I have observed that there are cases where the Fleet scales down (all game servers are deleted), but the Fleet is not updated and still shows 1000 ready GameServers.

Test Environment

GKE cluster with the following configurations:

Size (in all zones): 60
Node version: 1.11.6-gke.2
Node image: Container-Optimized OS (cos)
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
Total cores: 240 vCPUs
Total memory: 1,560.00 GB
Automatic node upgrades: Disabled
Automatic node repair: Enabled
Autoscaling: Off
Preemptible nodes: Disabled
Boot disk type: Standard persistend disk
Boot disk size in GB (per node): 100

Results - Fleet Autoscaling

Test Scenario. Spin up a fleet, scale it up to 100 replicas, and then scale down to 0. Repeat multiple times.

Results - Fleet Allocation

Test Scenario. Spin up a fleet, scale it up to 100 replicas, and then start a Locust test where 100 users try to do a game server allocation in parallel.

roberthbailey · 2019-07-10T21:15:40Z

@markmandel - I see that this is marked as part of the 0.12.0 milestone (but it was also in 0.11.0, 0.10.0, 0.9.0, and 0.8.0). Is it part of the milestone optimistically (hoping for someone to finish it)?

Also, for @markmandel or @pm7h - can you summarize what we think remains for this task? I know that @ilkercelikyilmaz has another test harness that does some load testing that maybe falls under this area as well.

roberthbailey · 2019-07-10T21:22:03Z

Reading through the other issues in the 0.12.0 milestone, I see that this is referenced from the top level plan for the 1.0 release, which at least partially answers my questions here:

Once 1.0 features are complete, performance testing will be completed, and the project will publish supported cluster sizes, fleet sizes and throughput metrics based on the current code base.

pm7h · 2019-07-12T17:44:19Z

I think the main remaining item is providing automation and dashboards for running these tests.

markmandel · 2022-03-15T19:50:44Z

@roberthbailey @ilkercelikyilmaz do we feel we can close this, now that we have the scenario load tests?

roberthbailey · 2022-03-15T22:15:33Z

I think so. We now have the locust load tests, the allocation load tests (gRPC and k8s API), and now the scenario tests as well. We have been using the allocation load tests to verify that new k8s versions don't introduce memory leaks and the scenario tests can be used to verify performance over a long period of time.

markmandel · 2022-03-15T22:24:16Z

CLOSING!

markmandel added kind/feature New features for Agones kind/design Proposal discussing new features / fixes and how they should be implemented area/build-tools Development tooling. I.e. pretty much everything in the `build` directory. labels Nov 9, 2018

markmandel changed the title ~~Load Testing Framework for Agones~~ Load Testing Framework for internal and external usage Nov 9, 2018

markmandel added this to the 0.8.0 milestone Jan 9, 2019

markmandel modified the milestones: 0.8.0, 0.9.0 Feb 13, 2019

pm7h mentioned this issue Feb 20, 2019

Use batching in GameServerAllocation controller to improve throughput. #536

Closed

markmandel modified the milestones: 0.9.0, 0.10.0 Mar 26, 2019

markmandel mentioned this issue Apr 19, 2019

Top Level Plan: 1.0 Release! #732

Closed

markmandel modified the milestones: 0.10.0, 0.11.0 May 7, 2019

markmandel modified the milestones: 0.11.0, 0.12.0 Jun 18, 2019

markmandel modified the milestones: 0.12.0, 1.0 Aug 1, 2019

markmandel removed this from the 1.0.0 milestone Sep 10, 2019

markmandel added this to the 1.1.0 milestone Sep 10, 2019

markmandel modified the milestones: 1.1.0, 1.2.0 Oct 22, 2019

markmandel removed this from the 1.2.0 milestone Dec 4, 2019

roberthbailey mentioned this issue May 7, 2021

Perform some sort of perf tests on release candidates #2097

Closed

markmandel closed this as completed Mar 15, 2022

markmandel added this to the 1.22.0 milestone Mar 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Load Testing Framework for internal and external usage #412

Load Testing Framework for internal and external usage #412

markmandel commented Nov 9, 2018 •

edited

Loading

cyriltovena commented Nov 12, 2018 •

edited

Loading

markmandel commented Nov 13, 2018

stephbu commented Nov 28, 2018

Zanderax commented Nov 28, 2018

pm7h commented Dec 18, 2018 •

edited

Loading

markmandel commented Dec 19, 2018 •

edited

Loading

pm7h commented Dec 20, 2018

pm7h commented Jan 24, 2019 •

edited

Loading

pm7h commented Jan 25, 2019 •

edited

Loading

roberthbailey commented Jul 10, 2019

roberthbailey commented Jul 10, 2019

pm7h commented Jul 12, 2019

markmandel commented Mar 15, 2022

roberthbailey commented Mar 15, 2022

markmandel commented Mar 15, 2022

Load Testing Framework for internal and external usage #412

Load Testing Framework for internal and external usage #412

Comments

markmandel commented Nov 9, 2018 • edited Loading

Problem

Notes

Research

cyriltovena commented Nov 12, 2018 • edited Loading

markmandel commented Nov 13, 2018

stephbu commented Nov 28, 2018

Zanderax commented Nov 28, 2018

pm7h commented Dec 18, 2018 • edited Loading

markmandel commented Dec 19, 2018 • edited Loading

pm7h commented Dec 20, 2018

pm7h commented Jan 24, 2019 • edited Loading

Design

Performance Tests

Test Cases

Load Tests

Test Cases

pm7h commented Jan 25, 2019 • edited Loading

Observations

Testing Fleet scaling up/down with GKE Cluster Autoscaling enabled

Test Environment

Results - Fleet Scaling

Testing Fleet scaling up/down with GKE Cluster Autoscaling disabled

Test Environment

Results - Fleet Autoscaling

Results - Fleet Allocation

roberthbailey commented Jul 10, 2019

roberthbailey commented Jul 10, 2019

pm7h commented Jul 12, 2019

markmandel commented Mar 15, 2022

roberthbailey commented Mar 15, 2022

markmandel commented Mar 15, 2022

markmandel commented Nov 9, 2018 •

edited

Loading

cyriltovena commented Nov 12, 2018 •

edited

Loading

pm7h commented Dec 18, 2018 •

edited

Loading

markmandel commented Dec 19, 2018 •

edited

Loading

pm7h commented Jan 24, 2019 •

edited

Loading

pm7h commented Jan 25, 2019 •

edited

Loading