🔥 release-mainnet aggregator is unreachable #1310

Closed · 6 tasks done
jpraynaud opened this issue Oct 19, 2023 · 5 comments
Labels: bug ⚠️ Something isn't working
@jpraynaud (Member) commented Oct 19, 2023

Why

🔥 Alerts are received stating that the release-mainnet aggregator is unreachable.

Two incidents have been created.

What

The source of the problem must be identified and fixed swiftly.

Facts

  • the other Mithril networks in the same datacenter are up and running
  • the aggregator is inaccessible from multiple areas (US / Europe)
  • the status page reports an increase in response time for the /epoch-settings route (> 10 s)
  • the endpoint of the aggregator returns a 404 error (as expected)
  • the VM is accessible over SSH
  • the VM is healthy (load, memory, I/O)
  • the logs from the aggregator container show that the HTTP server is serving some clients/browsers
  • the traffic has increased 4x
  • the aggregator is back up after ~10-20 min
  • the aggregator has been able to produce certificates and artifacts since it came back up

How

Now

  • Reproduce the problem on the testing-preview network
  • Analyze the source of the unreachability
  • Implement a quick fix if possible:
    • Use ConnectionWithFullMutex instead of Mutex<Connection> (releases the Mutex early; see the sketch after this list)
    • Roll back the modifications from PR #1314 (add soft limit in certificate query)
    • Remove the verification key, the verification key signature and the operational certificate from the list of signers in the certificate
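A minimal sketch of the first quick fix, assuming the `sqlite` crate and its `Connection::open_with_full_mutex` constructor; the table names are illustrative, not the aggregator's actual schema:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() -> sqlite::Result<()> {
    // Current shape: the connection sits behind a Rust Mutex, so one read
    // or one write at a time. Any work done while the guard is alive
    // (row hydration, serialization) extends the critical section and
    // queues every other route behind it.
    let guarded = Arc::new(Mutex::new(sqlite::open(":memory:")?));
    {
        let connection = guarded.lock().unwrap();
        connection.execute("CREATE TABLE certificate (hash TEXT)")?;
        // ...anything else done here delays all other queries...
    } // the guard is only dropped here

    // Considered fix: ConnectionWithFullMutex opens SQLite in serialized
    // (FULLMUTEX) mode, so the connection is Sync and can be shared
    // across threads directly, with no Rust-side guard to hold across a
    // whole request.
    let shared = Arc::new(sqlite::Connection::open_with_full_mutex(":memory:")?);
    let worker = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || shared.execute("CREATE TABLE pending_certificate (hash TEXT)"))
    };
    worker.join().unwrap()?;

    Ok(())
}
```

With the full mutex, SQLite serializes access itself, so a slow handler can no longer force the Rust-side lock to be held across response construction.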
@jpraynaud added the bug ⚠️ Something isn't working and critical 🔥 Critical bug labels on Oct 19, 2023
@jpraynaud (Member Author)

Analysis

At first glance

It looks like a problem with the database being locked, triggering timeouts from the status service that scrapes the /epoch-settings route.

We have also witnessed a high variance in the response times for the same routes (from 50 ms to 2,500 ms), which indicates that the problem is likely not due to indexing problems in the database.

Reproducing the problem

We have been able to reproduce the problem on the testing-preview network:

  • Performance dropped under load on the /certificates or /artifact/mithril-stake-distributions routes.
  • There is no performance drop under load on the /epoch-settings and /artifact/snapshots routes.
  • There is no performance drop under load on the / route (which does not access the database).

We ran the following commands sequentially:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator
...
Requests per second:    118.30 [#/sec] (mean)
Time per request:       845.330 [ms] (mean)
Time per request:       8.453 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    767
  66%    829
  75%    873
  80%    923
  90%   1041
  95%   1122
  98%   1289
  99%   1362
 100%   1557 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/epoch-settings
...
Requests per second:    119.29 [#/sec] (mean)
Time per request:       838.319 [ms] (mean)
Time per request:       8.383 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    755
  66%    833
  75%    879
  80%    930
  90%   1056
  95%   1190
  98%   1290
  99%   1317
 100%   1557 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/snapshots
...
Requests per second:    78.72 [#/sec] (mean)
Time per request:       1270.301 [ms] (mean)
Time per request:       12.703 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1155
  66%   1380
  75%   1494
  80%   1557
  90%   1748
  95%   1920
  98%   2122
  99%   2196
 100%   2501 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/mithril-stake-distributions
...
Requests per second:    24.00 [#/sec] (mean)
Time per request:       4165.801 [ms] (mean)
Time per request:       41.658 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   4094
  66%   4118
  75%   4138
  80%   4155
  90%   4227
  95%   4303
  98%   4350
  99%   4688
 100%   5205 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    24.00 [#/sec] (mean)
Time per request:       4165.801 [ms] (mean)
Time per request:       41.658 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   4094
  66%   4118
  75%   4138
  80%   4155
  90%   4227
  95%   4303
  98%   4350
  99%   4688
 100%   5205 (longest request)

From these tests, we can deduce that there is probably a lock that is not released early enough by the routes.
Given that the connection to SQLite is made through a Mutex (one read or one write at a time), any delay in releasing the lock on the mutex creates a queue of queries that keeps growing, thus extending the execution delays.

Also, the aggregator HTTP server does not seem to purge requests that are too old; it simply serves them when it can. This probably extends the delay to serve pages even further. When this delay exceeds 30 s, it triggers an alert on the monitoring tool.
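For illustration, a hedged sketch of one way to shed stale requests instead of serving them long after the client has given up: bound each handler with a deadline. `handle_epoch_settings` is a hypothetical stand-in, not the aggregator's actual handler:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical handler standing in for a database-bound route.
async fn handle_epoch_settings() -> Result<String, &'static str> {
    // Simulate a query stuck behind the connection lock.
    tokio::time::sleep(Duration::from_secs(40)).await;
    Ok(String::from("{\"epoch\": 42}"))
}

#[tokio::main]
async fn main() {
    // Bound the time spent per request: past 30 s the monitoring tool has
    // already alerted, so drop the request rather than serve it late.
    match timeout(Duration::from_secs(30), handle_epoch_settings()).await {
        Ok(Ok(body)) => println!("served: {body}"),
        Ok(Err(error)) => eprintln!("handler failed: {error}"),
        Err(_) => eprintln!("deadline exceeded: request dropped, not queued"),
    }
}
```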

@jpraynaud (Member Author)

Following the merge of PR #1316, we have run the same commands and here are the figures:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator
...
Requests per second:    126.53 [#/sec] (mean)
Time per request:       790.333 [ms] (mean)
Time per request:       7.903 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    563
  66%    641
  75%    693
  80%    728
  90%    891
  95%    968
  98%    998
  99%   1044
 100%   1428 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/epoch-settings
...
Requests per second:    123.41 [#/sec] (mean)
Time per request:       810.336 [ms] (mean)
Time per request:       8.103 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    589
  66%    662
  75%    708
  80%    743
  90%    822
  95%    887
  98%   1092
  99%   1204
 100%   1264 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/snapshots
...
Requests per second:    122.67 [#/sec] (mean)
Time per request:       815.213 [ms] (mean)
Time per request:       8.152 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%    626
  66%    686
  75%    730
  80%    763
  90%    898
  95%   1019
  98%   1245
  99%   1297
 100%   1509 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/artifact/mithril-stake-distributions
...
Requests per second:    60.72 [#/sec] (mean)
Time per request:       1646.788 [ms] (mean)
Time per request:       16.468 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1367
  66%   1544
  75%   1678
  80%   1771
  90%   2015
  95%   2286
  98%   2656
  99%   2740
 100%   2817 (longest request)
$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    40.60 [#/sec] (mean)
Time per request:       2462.868 [ms] (mean)
Time per request:       24.629 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1963
  66%   2254
  75%   2487
  80%   2630
  90%   3036
  95%   3422
  98%   3725
  99%   4000
 100%   4400 (longest request)

We have seen significant improvements on the routes:

  • artifact/snapshots: -50% to serve a page
  • artifact/mithril-stake-distributions: -60% to serve a page
  • certificates: -40% to serve a page

@jpraynaud (Member Author) commented Oct 26, 2023

Following the merge of PR #1314, we have run the same command for the certificates route and here are the figures:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    31.24 [#/sec] (mean)
Time per request:       3201.020 [ms] (mean)
Time per request:       32.010 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   3074
  66%   3517
  75%   3765
  80%   3921
  90%   4349
  95%   4746
  98%   5056
  99%   5205
 100%   5444 (longest request)

We have also run the same commands as previously on the other routes, and the results were approximately the same.

We see that performance has worsened with the addition of a LIMIT to the query. We suspect that the way it is implemented is responsible for this drop in performance, and we will try to implement it in a more efficient way.
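For reference, a hedged sketch of the behavior a limit should achieve: pushing the LIMIT into the query so SQLite can stop early, instead of hydrating every row and truncating afterwards. Table and column names are illustrative, not the actual Mithril schema:

```rust
fn main() -> sqlite::Result<()> {
    let connection = sqlite::open(":memory:")?;
    connection.execute(
        "CREATE TABLE certificate (hash TEXT, created_at TEXT);
         CREATE INDEX certificate_created_at ON certificate (created_at);",
    )?;

    // Inefficient shape: fetch and hydrate every row, then truncate in
    // Rust; SQLite still scans and returns the whole table.
    let mut hashes: Vec<String> = Vec::new();
    let mut statement =
        connection.prepare("SELECT hash FROM certificate ORDER BY created_at DESC")?;
    while let Ok(sqlite::State::Row) = statement.next() {
        hashes.push(statement.read::<String, _>("hash")?);
    }
    hashes.truncate(20);

    // Intended shape: push the limit into the query so SQLite can walk
    // the index and stop after 20 rows.
    let mut limited: Vec<String> = Vec::new();
    let mut statement =
        connection.prepare("SELECT hash FROM certificate ORDER BY created_at DESC LIMIT 20")?;
    while let Ok(sqlite::State::Row) = statement.next() {
        limited.push(statement.read::<String, _>("hash")?);
    }

    println!(
        "unlimited fetch kept {} rows, limited fetch kept {} rows",
        hashes.len(),
        limited.len()
    );
    Ok(())
}
```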

@ghubertpalo (Collaborator)

Change the Certificate type to replace signer_with_stake with a new type that only contains the party_id and its associated stake. When this is done, the certificate chain will be broken and all certificate hashes will have to be recalculated (see the dedicated guide).
The API part is also impacted: a PartyMessage must be created alongside to transfer this new type in the certificates.
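A hedged sketch of the shape of that change; field and type names are illustrative, not the actual Mithril definitions:

```rust
/// Current shape: each signer in the certificate carries its verification
/// key, the key's signature and the operational certificate, which bloats
/// every payload served by the /certificates route.
pub struct SignerWithStake {
    pub party_id: String,
    pub verification_key: String,
    pub verification_key_signature: Option<String>,
    pub operational_certificate: Option<String>,
    pub stake: u64,
}

/// Proposed shape: only the party id and its associated stake.
pub struct StakeDistributionParty {
    pub party_id: String,
    pub stake: u64,
}

impl From<SignerWithStake> for StakeDistributionParty {
    fn from(signer: SignerWithStake) -> Self {
        Self {
            party_id: signer.party_id,
            stake: signer.stake,
        }
    }
}
```

Since the dropped fields are part of the hashed content, every certificate hash changes, which is why the chain has to be recomputed.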

@jpraynaud (Member Author)

Following the merge of PR #1333, we have run the same command for the certificates route and here are the figures:

$ ab -n 1000 -c 100 https://aggregator.testing-preview.api.mithril.network/aggregator/certificates
...
Requests per second:    60.25 [#/sec] (mean)
Time per request:       1659.740 [ms] (mean)
Time per request:       16.597 [ms] (mean, across all concurrent requests)
...
Percentage of the requests served within a certain time (ms)
  50%   1560
  66%   1763
  75%   1878
  80%   1969
  90%   2197
  95%   2427
  98%   3019
  99%   3266
 100%   4030 (longest request)

We see that the performance has improved 👍
