TelemetryAPIJourney - Retrieving the telemetry payload (cached vs. fresh) #211

Merged 4 commits on Feb 24, 2022

Conversation

afharo
Member

@afharo afharo commented Jan 5, 2022

Summary

This PR adds a Telemetry API Journey with 3 scenarios:

Scenario 1: First hit - non-cached encrypted usage

The first scenario covers users hitting the telemetry endpoint for the first time in the past 4 hours, or after a new node is added. This grabs the non-cached encrypted usage data.

Scenario 2: Second+ hit - cached encrypted usage

The second scenario covers users hitting the endpoint for the second time or more within the past 4 hours. This grabs the cached encrypted usage data.
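To make the difference between the first two scenarios concrete, here is a minimal sketch of a time-based cache with a 4-hour window. It is not Kibana's actual implementation: `TtlCache`, `CacheEntry`, and `CACHE_TTL_MS` are hypothetical names, and the cache is assumed to live in memory per process, which would be consistent with a newly added node starting without a cached payload.

```typescript
// Hypothetical sketch: these names are illustrative, not Kibana's real code.
const CACHE_TTL_MS = 4 * 60 * 60 * 1000; // 4 hours, as in the scenarios above

interface CacheEntry<T> {
  value: T;
  storedAt: number;
}

class TtlCache<T> {
  private entry: CacheEntry<T> | undefined = undefined;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry logic can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the stored value while it is fresh; otherwise recomputes and stores it.
  get(compute: () => T): { value: T; cached: boolean } {
    const t = this.now();
    if (this.entry !== undefined && t - this.entry.storedAt < this.ttlMs) {
      return { value: this.entry.value, cached: true }; // Scenario 2
    }
    const value = compute(); // Scenario 1: first hit or expired entry
    this.entry = { value, storedAt: t };
    return { value, cached: false };
  }
}
```

Under this assumption, only the first request in each 4-hour window pays the cost of generating the payload, which is why the cached scenario can tolerate many more concurrent users.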

Scenario 3: Example flyout and stats API - non-cached non-encrypted usage, check collectors status

The third scenario tests grabbing a fresh copy of non-encrypted usage. This happens when Kibana explicitly asks for this data (the stats API, and example flyout). This grabs non-cached unencrypted usage data.

This scenario also fails when we have a failed, timed-out, or non-ready collector.
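The collector check can be pictured roughly as follows. This is only a sketch under assumptions: the `CollectorStatus` values and the `CollectorReport` shape are hypothetical, and the real response used by the journey may look different.

```typescript
// Hypothetical status values and report shape, for illustration only.
type CollectorStatus = "ready" | "not_ready" | "failed" | "timed_out";

interface CollectorReport {
  name: string;
  status: CollectorStatus;
}

// Returns every collector that should make the scenario fail:
// anything that failed, timed out, or was not ready.
function findUnhealthyCollectors(reports: CollectorReport[]): CollectorReport[] {
  return reports.filter((report) => report.status !== "ready");
}

// The scenario passes only when no collector is unhealthy.
function scenarioPasses(reports: CollectorReport[]): boolean {
  return findUnhealthyCollectors(reports).length === 0;
}
```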

Notes

All scenarios run against a freshly installed Kibana instance. This means that the indices are as small as they can get.

@dmlemeshko
Member

@afharo the code looks good. I started a job to see it in the Jenkins/kibana-stats cluster: https://kibana-ci.elastic.co/view/Kibana/job/elastic+kibana+load-testing/883/

Do you want these simulations to be executed on a daily basis and have the results in the Kibana-stats cluster, like we do for the other ones?
If so, you need to add them here

Of course, you can always run it manually whenever there is a need

@dmlemeshko
Member

@afharo I got a Slack alert for your simulation, with results showing 692 request failures:

Scenario: org.kibanaLoadTest.simulation.branch.TelemetryAPIJourney
Users count: 400
Load testing branch: TelemetryAPIJourney
Kibana branch: main
Elasticsearch: 8.1.0-SNAPSHOT / 2022-01-04T15:05:08.795979329Z / 8ab0d40cb550f3156bd643ae73eabcadfaa607a1
Failed requests: 692 of 1912
Response time (ms):
* 75th percentile: 60000
* 95th percentile: 60000
* 99th percentile: 60001
* Maximum: 60002

It is possible to get the Gatling reports from Jenkins, but it looks like 400 users is too much without the cache.
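For reference, the failure rate implied by the alert can be worked out with a few lines. This is just arithmetic on the numbers quoted above; `failureRatePercent` is a hypothetical helper, not part of the load-testing project.

```typescript
// Figures copied from the Slack alert above.
const failedRequests = 692;
const totalRequests = 1912;

// Share of requests that failed, as a percentage.
function failureRatePercent(failed: number, total: number): number {
  if (total <= 0) {
    throw new Error("total must be positive");
  }
  return (failed / total) * 100;
}

const rate = failureRatePercent(failedRequests, totalRequests);
console.log(`${rate.toFixed(1)}% of requests failed`); // about 36.2%
```

So roughly a third of the requests failed. The percentiles all sitting at about 60000 ms would be consistent with requests hitting a 60-second timeout, though that is an inference and not something the alert states.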

I also have a general question about your use case: is this endpoint triggered by individual Kibana users, or is it just a single call that Kibana makes to the stats cluster once in a while? Is there any reason behind the 400-user threshold?

@afharo
Member Author

afharo commented Jan 6, 2022

@dmlemeshko thanks for the ping!

It's actually good news that it fails with that number of users. We suspected that this endpoint could cause issues if too many users requested it at once.

This load test was built to prove that elastic/kibana#121656 is needed.

@dmlemeshko
Member

@afharo thank you for the explanation. My understanding now is that there is no need to run these simulations on a daily basis, is that correct?

We can merge it as is, and you can run it anytime on CI. Soon it will also be possible to comment on a Kibana PR to trigger load simulations based on the Team label.

@afharo
Member Author

afharo commented Jan 6, 2022

Thank you! Before merging I'd like someone from @elastic/kibana-core to share their thoughts as well 😇

@afharo afharo requested a review from a team January 6, 2022 08:40
@afharo
Member Author

afharo commented Jan 6, 2022

I think that we could apply an additional improvement in the telemetry report generation logic to aggregate multiple concurrent requests into the same promise. Then we can run these journeys on the daily performance checks.

I'm AFK right now. But I'll create an issue on Monday.
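A minimal sketch of what "aggregating concurrent requests into the same promise" could look like is shown here. It illustrates the general technique only; `coalesce` is a hypothetical helper, and the eventual Kibana change may be structured differently.

```typescript
// Hypothetical helper: wraps an async task so that callers arriving while a
// run is still in flight share that run's promise instead of starting another.
function coalesce<T>(task: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | undefined = undefined;

  return () => {
    if (inFlight !== undefined) {
      return inFlight; // Join the run that is already in progress.
    }
    // Start a new run and clear the slot once it settles (success or failure),
    // so the next caller after that triggers a fresh run.
    const started = task().finally(() => {
      inFlight = undefined;
    });
    inFlight = started;
    return started;
  };
}
```

With a wrapper like this around report generation, N simultaneous requests would trigger a single generation, and every caller would receive the same result or the same error.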

@afharo
Member Author

afharo commented Jan 10, 2022

> I think that we could apply an additional improvement in the telemetry report generation logic to aggregate multiple concurrent requests into the same promise. Then we can run these journeys on the daily performance checks.
>
> I'm AFK right now. But I'll create an issue on Monday.

Issue elastic/kibana#122572 created!

@Bamieh
Member

Bamieh commented Feb 23, 2022

I've updated this PR to include 1 more scenario to check for failed collectors, and I merged the journeys into one with three scenarios. @dmlemeshko what are the next steps here? I think we need to:

  1. Test the journey on CI
  2. Make sure our concurrent users' thresholds make sense for the different scenarios. 250 for cached, 30 for non-cached.
  3. Agree whether we want to add a config to Jenkins and merge the PR

@dmlemeshko
Member

@Bamieh I ran it on CI and I think thresholds are reasonable.
[Screenshot 2022-02-23 at 19 22 26]

Code looks good as well.

Though I would not add it to our regular run (4 times a day). We are currently working on removing the noise for the bare metal worker and a limited number of scenarios. You can run it manually anytime and compare the results in Kibana stats. Let me know if that makes sense.

@Bamieh Bamieh merged commit 11eb3ca into elastic:main Feb 24, 2022
@Bamieh Bamieh deleted the TelemetryAPIJourney branch February 24, 2022 10:51