Investigation: Ocean Intermittent Failures #2
Comments
@prasannavl: Thanks for opening an issue, it is currently awaiting triage. The triage/accepted label can be added by foundation members by writing /triage accepted in a comment. I am a bot created to help the DeFiCh developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the DeFiCh/oss-governance-bot repository.
/triage accepted
Update: Waiting for tests to pass on DeFiCh/ain#1731 for the performance improvement release.
https://github.com/DeFiCh/ain/releases/tag/v3.2.3 has been released. Downgrading to P2 since the immediate remediations are complete. Reverting the auto-close and leaving this open until we have feedback on whether it has resolved the issue completely (at least back to the pre-3.2.2 state). cc/ @fuxingloh
Performance regression confirmed resolved, back to the pre-3.2.2 state. We still need a longer-term solution, as these APIs are expensive calculations in the node, but this is unlikely to be addressed any time soon since it requires larger changes in the node to make it more reactive, or indexes built up above the node layer. Closing for now. Sample results with 3.2.3 for ref:
@prasannavl maybe this needs to be reopened. Looks like the FutureSwap triggers some trouble in the node (I had to restart some of mine too). Ocean is super slow and throwing errors again.
Reopening. Many observations of the same in some regions in the last few minutes (to hours from the above comment), and the team is exploring re-adjusting timeouts and scale in the interim.
Trying to add as much information from last night as I have here:
Exact same behaviour on both my European nodes (one Windows, one Ubuntu). All nodes run 3.2.4.
Also attached is the resource usage: you can see the CPU running high in Europe but no issue in SGP. SGP does have a strong uptick in bandwidth after the FutureSwap, and Europe also sees that after the restart. (Graphs and log timing show a 1 hour difference due to timezone.)
Looking at the block content, block 2661124 (the first one that didn't go through, or where the node filled up during processing) is packed with paybackLoan txs (from people cleaning up after the FS). I have the feeling that the FutureSwap is not the reason, but rather the cluster of PaybackLoans.
Europe: (last block received at 22:40 in the chart, restarted at 22:59)
SGP: no anomaly in CPU but increased bandwidth usage after the FS
Hope this helps. The logs I mentioned:
We have similar errors on our nodes at the time of the mentioned block. Most of the nodes have CPU near 100%. What we see on all nodes is a massive number of errors about conflicting (custom) txs, starting at this time and continuing for the next ~40 minutes.
PS: Most of the conflicting txs are unique.
I also see a lot of
Thanks @kuegi and @Stonygan. Most of the above appear to be consequences of the slow down. Just a short update:
Will add more updates as we progress.
Update: Ocean has been operating stably with a 2x scale on top of the 1.5x scale that was already done previously. This temporary mitigation works for the moment, so we have reset the goal for the node release to Mon/Tue to see what additional optimizations can be targeted.
https://github.com/DeFiCh/ain/releases/tag/v3.2.5 has been released with improvements across the board.
Closing this, as it's no longer an issue after https://github.com/DeFiCh/ain/releases/tag/v3.2.8
Summary
Observations
Temporary Mitigation
Hypothesis
Possibly a combination of multiple issues to rule out and/or target with specific action:
Require contracts optimizations
getloaninfo
getburninfo
getgovproposal
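One way to check which of the RPCs named above (getloaninfo, getburninfo, getgovproposal) is the most expensive would be to time repeated calls and compare medians. The following is only a minimal sketch: `call_rpc` is a hypothetical stand-in for whatever client actually issues the request, and `time_calls` and `summarize` are illustrative names, not part of the node or Ocean.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_calls(call_rpc: Callable[[str], object],
               methods: Iterable[str],
               repeats: int = 5) -> Dict[str, List[float]]:
    """Return wall-clock durations (seconds) for `repeats` calls of each method."""
    timings: Dict[str, List[float]] = {}
    for method in methods:
        samples: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            call_rpc(method)  # hypothetical client call; result is ignored
            samples.append(time.perf_counter() - start)
        timings[method] = samples
    return timings


def summarize(timings: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Median duration per method, slowest first."""
    medians = {m: statistics.median(s) for m, s in timings.items()}
    return sorted(medians.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Placeholder that only sleeps; swap in a real RPC client to get real numbers.
    def fake_call(method: str) -> None:
        time.sleep(0.001)

    results = time_calls(fake_call,
                         ["getloaninfo", "getburninfo", "getgovproposal"],
                         repeats=3)
    for method, median in summarize(results):
        print(f"{method}: {median * 1000:.2f} ms")
```

Medians are used rather than means so that a single slow call during a load spike does not dominate the comparison.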
Known Issues
Require contracts that end up in throw-away decimal conversion calculations, which can slow down the node in various code paths. Key validation paths have already been fixed in "Add lambda support to Require" (ain#1705).
Additions
rpcstats: ocean-random-sample-rpcstats.log
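The Require item under Known Issues can be illustrated with a small sketch. It assumes the cost comes from building an error message (including decimal formatting) on every check even when the check passes, and that passing a lambda defers that work to the failure path. This is not the C++ code from ain#1705; `require`, `RequireError` and `format_amount` are hypothetical names used only for illustration.

```python
from typing import Callable, Union


class RequireError(Exception):
    """Raised when a required condition does not hold."""


def require(condition: bool, message: Union[str, Callable[[], str]]) -> None:
    """Raise RequireError if `condition` is false.

    `message` may be a string (built eagerly by the caller) or a zero-argument
    callable that is only invoked when the check fails.
    """
    if condition:
        return
    raise RequireError(message() if callable(message) else message)


def format_amount(value: int) -> str:
    """Stand-in for a costly decimal conversion (8 decimal places)."""
    format_amount.calls += 1
    return f"{value // 10**8}.{value % 10**8:08d}"


format_amount.calls = 0

if __name__ == "__main__":
    balance, amount = 500_000_000, 150_000_000

    # Eager: the message is formatted even though the check passes.
    require(balance >= amount, f"need {format_amount(amount)}")
    print("conversions after eager check:", format_amount.calls)

    # Lazy: the lambda is never called because the check passes.
    require(balance >= amount, lambda: f"need {format_amount(amount)}")
    print("conversions after lazy check:", format_amount.calls)
```

With the lazy form, the conversion runs only when the condition fails, so hot validation paths that almost always pass skip the formatting work entirely.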
Resolution