🔥 release-mainnet aggregator is unreachable #1310

Closed · 6 tasks done
jpraynaud opened this issue Oct 19, 2023 · 5 comments
Labels: bug ⚠️ Something isn't working
@jpraynaud (Member) commented Oct 19, 2023

Why

🔥 Alerts are received stating that the release-mainnet aggregator is unreachable.

Two incidents have been created.

What

The source of the problem must be identified and fixed swiftly.

Facts

  • the other Mithril networks in the same datacenter are up and running
  • the aggregator is inaccessible from multiple areas (US / Europe)
  • the status page reports an increase in response time for the /epoch-settings route (> 10 s)
  • the endpoint of the aggregator returns a 404 error (as expected)
  • the VM is accessible over SSH
  • the VM is healthy (load, memory, I/O)
  • the logs from the aggregator container show that the HTTP server is serving some clients/browsers
  • the traffic has increased 4x
  • the aggregator is back up after ~10-20 min
  • the aggregator has been able to produce certificates and artifacts since it came back up

How

Now

  • Reproduce the problem on the testing-preview network
  • Analyze the source of the unreachability
  • Implement a quick fix if possible:
    • Use ConnectionWithFullMutex instead of Mutex<Connection> (releases the Mutex early; see the sketch after this list)
    • Roll back the modifications from PR #1314 (add soft limit in certificate query)
    • Remove the verification key, the verification key signature and the operational certificate from the list of signers in the certificate
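A minimal sketch of the first quick fix, assuming the `sqlite` crate and its `Connection::open_with_full_mutex` constructor; the table names are illustrative, not the aggregator's actual schema:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() -> sqlite::Result<()> {
    // Current shape: the connection sits behind a Rust Mutex, so one read
    // or one write at a time. Any work done while the guard is alive
    // (row hydration, serialization) extends the critical section and
    // queues every other route behind it.
    let guarded = Arc::new(Mutex::new(sqlite::open(":memory:")?));
    {
        let connection = guarded.lock().unwrap();
        connection.execute("CREATE TABLE certificate (hash TEXT)")?;
        // ...anything else done here delays all other queries...
    } // the guard is only dropped here

    // Considered fix: ConnectionWithFullMutex opens SQLite in serialized
    // (FULLMUTEX) mode, so the connection is Sync and can be shared
    // across threads directly, with no Rust-side guard to hold across a
    // whole request.
    let shared = Arc::new(sqlite::Connection::open_with_full_mutex(":memory:")?);
    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || shared.execute("CREATE TABLE pending_certificate (hash TEXT)"))
    };
    worker.join().unwrap()?;

    Ok(())
}
```

With the full mutex, SQLite serializes access itself, so a slow handler can no longer force the Rust-side lock to be held across response construction.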
@jpraynaud added the bug ⚠️ Something isn't working and critical 🔥 Critical bug labels on Oct 19, 2023
@jpraynaud (Member Author)

Analysis

At first glance

It looks like a problem with the database being locked, triggering timeouts from the status service that scrapes the /epoch-settings route.

We have also witnessed a high variance in the response times for the same routes (from 50 ms to 2,500 ms), which indicates that the problem is likely not due to indexing problems in the database.

Reproducing the problem

We have been able to reproduce the problem on the testing-preview network:

  • Performance dropped under load on the /certificates or /artifact/mithril-stake-distributions routes.
  • There is no performance drop under load on the /epoch-settings and /artifact/snapshots routes.
  • There is no performance drop under load on the / route (which does not access the database).

We ran the following commands sequentially:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator
...
Requests per second:    118.30 [#/sec] (mean)
Time per request:       845.330 [ms] (mean)
Time per request:       8.453 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    767
  66%    829
  75%    873
  80%    923
  90%   1041
  95%   1122
  98%   1289
  99%   1362
 100%   1557 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/epoch-settings
...
Requests per second:    119.29 [#/sec] (mean)
Time per request:       838.319 [ms] (mean)
Time per request:       8.383 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    755
  66%    833
  75%    879
  80%    930
  90%   1056
  95%   1190
  98%   1290
  99%   1317
 100%   1557 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/snapshots
...
Requests per second:    78.72 [#/sec] (mean)
Time per request:       1270.301 [ms] (mean)
Time per request:       12.703 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1155
  66%   1380
  75%   1494
  80%   1557
  90%   1748
  95%   1920
  98%   2122
  99%   2196
 100%   2501 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/mithril-stake-distributions
...
Requests per second:    24.00 [#/sec] (mean)
Time per request:       4165.801 [ms] (mean)
Time per request:       41.658 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   4094
  66%   4118
  75%   4138
  80%   4155
  90%   4227
  95%   4303
  98%   4350
  99%   4688
 100%   5205 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    24.00 [#/sec] (mean)
Time per request:       4165.801 [ms] (mean)
Time per request:       41.658 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   4094
  66%   4118
  75%   4138
  80%   4155
  90%   4227
  95%   4303
  98%   4350
  99%   4688
 100%   5205 (longest request)

From these tests, we can deduce that there is probably a lock that is not released early enough by the routes.
Given that the connection to SQLite is made through a Mutex (one read or one write at a time), any delay in releasing the lock on the mutex creates a queue of queries that keeps growing, thus extending the execution delays.

Also, the aggregator HTTP server does not seem to purge requests that are too old; it simply serves them when it can. This probably extends the delay to serve pages even further. When this delay exceeds 30 s, it triggers an alert on the monitoring tool.
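For illustration, a hedged sketch of one way to shed stale requests instead of serving them long after the client has given up: bound each handler with a deadline. `handle_epoch_settings` is a hypothetical stand-in, not the aggregator's actual handler:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical handler standing in for a database-bound route.
async fn handle_epoch_settings() -> Result<String, &'static str> {
    // Simulate a query stuck behind the connection lock.
    tokio::time::sleep(Duration::from_secs(40)).await;
    Ok(String::from("{\"epoch\": 42}"))
}

#[tokio::main]
async fn main() {
    // Bound the time spent per request: past 30 s the monitoring tool has
    // already alerted, so drop the request rather than serve it late.
    match timeout(Duration::from_secs(30), handle_epoch_settings()).await {
        Ok(Ok(body)) => println!("served: {body}"),
        Ok(Err(error)) => eprintln!("handler failed: {error}"),
        Err(_) => eprintln!("deadline exceeded: request dropped, not queued"),
    }
}
```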

@jpraynaud (Member Author)

Following the merge of PR #1316, we have run the same commands and here are the figures:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator
...
Requests per second:    126.53 [#/sec] (mean)
Time per request:       790.333 [ms] (mean)
Time per request:       7.903 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    563
  66%    641
  75%    693
  80%    728
  90%    891
  95%    968
  98%    998
  99%   1044
 100%   1428 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/epoch-settings
...
Requests per second:    123.41 [#/sec] (mean)
Time per request:       810.336 [ms] (mean)
Time per request:       8.103 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    589
  66%    662
  75%    708
  80%    743
  90%    822
  95%    887
  98%   1092
  99%   1204
 100%   1264 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/snapshots
...
Requests per second:    122.67 [#/sec] (mean)
Time per request:       815.213 [ms] (mean)
Time per request:       8.152 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    626
  66%    686
  75%    730
  80%    763
  90%    898
  95%   1019
  98%   1245
  99%   1297
 100%   1509 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/mithril-stake-distributions
...
Requests per second:    60.72 [#/sec] (mean)
Time per request:       1646.788 [ms] (mean)
Time per request:       16.468 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1367
  66%   1544
  75%   1678
  80%   1771
  90%   2015
  95%   2286
  98%   2656
  99%   2740
 100%   2817 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    40.60 [#/sec] (mean)
Time per request:       2462.868 [ms] (mean)
Time per request:       24.629 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1963
  66%   2254
  75%   2487
  80%   2630
  90%   3036
  95%   3422
  98%   3725
  99%   4000
 100%   4400 (longest request)

We have seen significant improvements on the routes:

  • artifact/snapshots: -50% to serve a page
  • artifact/mithril-stake-distributions: -60% to serve a page
  • certificates: -40% to serve a page

@jpraynaud (Member Author) commented Oct 26, 2023

Following the merge of PR #1314, we have run the same command for the certificates route and here are the figures:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    31.24 [#/sec] (mean)
Time per request:       3201.020 [ms] (mean)
Time per request:       32.010 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   3074
  66%   3517
  75%   3765
  80%   3921
  90%   4349
  95%   4746
  98%   5056
  99%   5205
 100%   5444 (longest request)

We have also run the same commands as previously on the other routes, and the results were approximately the same.

We see that performance has worsened with the addition of a LIMIT to the query. We suspect that the way it is implemented is responsible for this drop in performance, and we will try to implement it in a more efficient way.
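For reference, a hedged sketch of the behavior a limit should achieve: pushing the LIMIT into the query so SQLite can stop early, instead of hydrating every row and truncating afterwards. Table and column names are illustrative, not the actual Mithril schema:

```rust
fn main() -> sqlite::Result<()> {
    let connection = sqlite::open(":memory:")?;
    connection.execute(
        "CREATE TABLE certificate (hash TEXT, created_at TEXT);
         CREATE INDEX certificate_created_at ON certificate (created_at);",
    )?;

    // Inefficient shape: fetch and hydrate every row, then truncate in
    // Rust; SQLite still scans and returns the whole table.
    let mut hashes: Vec<String> = Vec::new();
    let mut statement =
        connection.prepare("SELECT hash FROM certificate ORDER BY created_at DESC")?;
    while let Ok(sqlite::State::Row) = statement.next() {
        hashes.push(statement.read::<String, _>("hash")?);
    }
    hashes.truncate(20);

    // Intended shape: push the limit into the query so SQLite can walk
    // the index and stop after 20 rows.
    let mut limited: Vec<String> = Vec::new();
    let mut statement =
        connection.prepare("SELECT hash FROM certificate ORDER BY created_at DESC LIMIT 20")?;
    while let Ok(sqlite::State::Row) = statement.next() {
        limited.push(statement.read::<String, _>("hash")?);
    }

    println!(
        "unlimited fetch kept {} rows, limited fetch kept {} rows",
        hashes.len(),
        limited.len()
    );
    Ok(())
}
```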

@ghubertpalo (Collaborator)

Change the Certificate type to replace signer_with_stake with a new type that only contains the party_id and its associated stake. When this is done, the certificate chain will be broken and all certificate hashes will have to be recalculated (see the dedicated guide).
The API part is also impacted: a PartyMessage must be created alongside to transfer this new type in the certificates.
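A hedged sketch of the shape of that change; field and type names are illustrative, not the actual Mithril definitions:

```rust
/// Current shape: each signer in the certificate carries its verification
/// key, the key's signature and the operational certificate, which bloats
/// every payload served by the /certificates route.
pub struct SignerWithStake {
    pub party_id: String,
    pub verification_key: String,
    pub verification_key_signature: Option<String>,
    pub operational_certificate: Option<String>,
    pub stake: u64,
}

/// Proposed shape: only the party id and its associated stake.
pub struct StakeDistributionParty {
    pub party_id: String,
    pub stake: u64,
}

impl From<SignerWithStake> for StakeDistributionParty {
    fn from(signer: SignerWithStake) -> Self {
        Self {
            party_id: signer.party_id,
            stake: signer.stake,
        }
    }
}
```

Since the dropped fields are part of the hashed content, every certificate hash changes, which is why the chain has to be recomputed.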

@jpraynaud (Member Author)

Following the merge of PR #1333, we have run the same command for the certificates route and here are the figures:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    60.25 [#/sec] (mean)
Time per request:       1659.740 [ms] (mean)
Time per request:       16.597 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1560
  66%   1763
  75%   1878
  80%   1969
  90%   2197
  95%   2427
  98%   3019
  99%   3266
 100%   4030 (longest request)

We see that the performance has improved 👍
