Investigation: Ocean Intermittent Failures #2

Closed
1 task done
prasannavl opened this issue Feb 3, 2023 · 14 comments · Fixed by DeFiCh/ain#1731

Comments

@prasannavl
Member

prasannavl commented Feb 3, 2023

Summary

  • Ocean has been having intermittent failures over the last 24 hours.
  • Using this thread to consolidate information across different projects here temporarily.

Observations

  • No direct issues on Ocean or Jellyfish or the blockchain.
  • Infrastructure is going into failure recovery because nodes are slower and are failing health-check timeouts.
  • rpc stats attached below.

Temporary Mitigation

  • Adjusted timeouts to allow slower responses and higher scale-ups.
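
To illustrate why relaxing the timeout helps, here is a minimal sketch of a health-check policy. The class, function, thresholds, and latencies are all hypothetical and are not taken from the actual Ocean infrastructure configuration; the assumption is that recovery triggers after several consecutive probes exceed the timeout.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class HealthPolicy:
    """Hypothetical health-check policy (not the real Ocean settings)."""
    timeout_s: float               # a probe slower than this counts as failed
    max_consecutive_failures: int  # failures in a row before recovery triggers


def triggers_recovery(latencies_s: Iterable[float], policy: HealthPolicy) -> bool:
    """Return True if the probe latencies would push a node into failure recovery."""
    streak = 0
    for latency in latencies_s:
        if latency > policy.timeout_s:
            streak += 1
            if streak >= policy.max_consecutive_failures:
                return True
        else:
            streak = 0  # a successful probe resets the failure streak
    return False


# Made-up probe latencies for a node that is slow but still answering.
probe_latencies = [0.8, 2.5, 3.1, 2.7, 0.9]

strict = HealthPolicy(timeout_s=2.0, max_consecutive_failures=3)
relaxed = HealthPolicy(timeout_s=5.0, max_consecutive_failures=3)

strict_result = triggers_recovery(probe_latencies, strict)
relaxed_result = triggers_recovery(probe_latencies, relaxed)
```

With the stricter timeout the slow node is recycled even though it is still serving requests; with the relaxed timeout it stays in rotation, which is the effect the mitigation aims for.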

Hypothesis

Possibly a combination of multiple issues, to rule out and/or target with specific action:

  • Require contract optimizations needed
  • Governance API performance optimizations needed
  • Optimizations or additional indexes might be needed for the most used APIs:
    • getloaninfo
    • getburninfo
    • getgovproposal

Known Issues

  • Require contracts end up performing throw-away decimal conversion calculations, which can slow down the node in various code paths. Key validation paths have already been fixed in "Add lambda support to Require" (ain#1705).
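
The general idea behind passing a lambda to a Require-style check can be sketched as follows. This is an illustration in Python under assumptions, not the node's C++ code: `require`, `ValidationError`, and `format_amount` are hypothetical names, and the sketch only shows how deferring message construction avoids conversions whose result is thrown away when the check passes.

```python
from decimal import Decimal
from typing import Callable, Union

Message = Union[str, Callable[[], str]]


class ValidationError(Exception):
    """Hypothetical error type raised when a requirement does not hold."""


def require(condition: bool, message: Message) -> None:
    """Raise ValidationError if condition is false; build the message only then."""
    if condition:
        return
    text = message() if callable(message) else message
    raise ValidationError(text)


calls = {"format_amount": 0}


def format_amount(amount_sats: int) -> str:
    """Stand-in for a costly decimal conversion used only in error text."""
    calls["format_amount"] += 1
    return str(Decimal(amount_sats) / Decimal(10**8))


balance, needed = 500_000_000, 120_000_000

# Eager: the message is formatted before require() runs, even though the
# check passes and the text is discarded.
require(balance >= needed,
        f"amount {format_amount(balance)} is less than {format_amount(needed)}")
eager_calls = calls["format_amount"]

# Lazy: the lambda is only invoked on failure, so no conversion happens here.
calls["format_amount"] = 0
require(balance >= needed,
        lambda: f"amount {format_amount(balance)} is less than {format_amount(needed)}")
lazy_calls = calls["format_amount"]
```

On a hot validation path that passes almost every time, the lazy form removes the formatting cost entirely, while failures still produce the same message.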

Additions

rpcstats: ocean-random-sample-rpcstats.log

Resolution

  • https://github.com/DeFiCh/ain/releases/tag/v3.2.3 fixes the main regression.
  • Long-term performance improvements are still required to handle larger traffic, either through better reactive indexes on the node or through cache layers higher up in the infrastructure. However, they are low priority on the blockchain node for the time being, especially since they require significant changes.
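
As a rough sketch of what a cache layer above the node could look like, the following keeps the result of an expensive RPC for a short time-to-live. `TTLCache`, `fake_getburninfo`, and the 10-second TTL are assumptions made for illustration; nothing here reflects how Ocean is actually built.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Hypothetical time-based cache placed in front of an expensive RPC call."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh, otherwise call compute()."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]
        value = compute()
        self._entries[key] = (now, value)
        return value


# A fake clock and a fake node call, so the sketch runs without a node.
fake_now = [0.0]
node_calls = {"getburninfo": 0}


def fake_getburninfo() -> dict:
    """Stand-in for the expensive node RPC; counts how often it is hit."""
    node_calls["getburninfo"] += 1
    return {"call": node_calls["getburninfo"]}


cache = TTLCache(ttl_s=10.0, clock=lambda: fake_now[0])

first = cache.get_or_compute("getburninfo", fake_getburninfo)
fake_now[0] = 5.0    # within the TTL: served from the cache
second = cache.get_or_compute("getburninfo", fake_getburninfo)
fake_now[0] = 12.0   # past the TTL: the node is queried again
third = cache.get_or_compute("getburninfo", fake_getburninfo)
```

The trade-off is staleness: callers may see data up to one TTL old, which is why a reactive index that updates on each block is the more thorough (and more invasive) option.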
@defichain-bot
Member

@prasannavl: Thanks for opening an issue, it is currently awaiting triage.

The triage/accepted label can be added by foundation members by writing /triage accepted in a comment.

Details

I am a bot created to help the DeFiCh developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the DeFiCh/oss-governance-bot repository.

@prasannavl prasannavl changed the title from "Investigation: Ocean Regression" to "Investigation: Ocean Intermittent Failures" Feb 3, 2023
@prasannavl
Member Author

/triage accepted

@prasannavl
Member Author

Update: Waiting for tests to pass on DeFiCh/ain#1731 for performance improvement release.

@prasannavl
Member Author

prasannavl commented Feb 3, 2023

https://github.com/DeFiCh/ain/releases/tag/v3.2.3 released.

Downgrading to P2 since immediate remediations are complete.

Reverting the auto-close and leaving it open until we have feedback on whether this has resolved the issue completely (at least back to the pre-3.2.2 state).

cc/ @fuxingloh

@prasannavl
Member Author

Performance regression confirmed resolved, back to the pre-3.2.2 state.

We still need a longer-term solution, as the APIs involve expensive calculations in the node, but this is unlikely to be addressed any time soon since it requires larger changes: either making the node more reactive or building up indexes above the node layer.

Closing for now.

Sample results with 3.2.3 for ref:
listrpcstats.3.2.3.log

@kuegi

kuegi commented Feb 9, 2023

@prasannavl maybe this needs to be reopened. Looks like the FutureSwap triggers some trouble in the node (had to restart some of mine too). Ocean is super slow and throwing errors again.

@prasannavl
Member Author

Reopening.

Many observations of the same in some regions in the last few minutes (to hours since the above comment). The team is exploring re-adjusting timeouts and scale in the interim.

@prasannavl prasannavl reopened this Feb 10, 2023
@kuegi

kuegi commented Feb 10, 2023

Trying to add as much information from last night as I have here:
timeline:

  • futureswap went through in a burst of 3 blocks without any obvious issue
  • another block (2661123) came in within a few seconds, also no issue
  • 50 sec later I see Timeout downloading block...
  • this continues without any block being received, but mempool txs seem to come through
  • getting only Timeout downloading... until I restart the node
  • after restart, everything runs normally again

Exact same behaviour on both my European nodes (one Windows, one Ubuntu).
The node in Singapore (also Ubuntu, but more CPU power) ran through without any issue. Only interesting fact: the SGP node received block 2661123 10 sec after the European node. Normally they have the same timing. The SGP node then got block 2661124 10 sec after that and continued without issues while Europe went into "timeout downloading".

all nodes run 3.2.4

Also attached is the resource usage: you can see the CPU running high in Europe but no issue in SGP. But SGP has a strong uptick in bandwidth after the futureswap; Europe also sees that after the restart. (Graph and log timings show a 1 hour difference due to timezone.)

Looking at the block content, block 2661124 (the first one that didn't go through, or where the node filled up during processing) is packed with paybackLoan txs (from people cleaning up after the FS). I have the feeling that the reason is not the FutureSwap but the cluster of PaybackLoans.

Europe: (last block received at 22:40 in the chart, restarted at 22:59 )
image
image

SGP: no anomaly in CPU but increased bandwidth usage after the FS
image
image

hope this helps

the logs I mentioned:
FSLogs.txt

@Stonygan

Stonygan commented Feb 10, 2023

We have similar errors on our nodes at the time of the mentioned block. Most of the nodes have CPU near 100%. What we see on all nodes is a large number of errors about conflicting (custom) txs, starting at this time and continuing for the next ~40 minutes.

rebuildAccountsView: Remove conflicting custom TX: a02d94c480bdefb2fa44865e4baf28274216f148b23595ff9e9d18f4d7f6eee5
rebuildAccountsView: Remove conflicting custom TX: 7345cb09c1ebedd9162beca40c54629dff24a2b687014ea77d595cb8f7546ac6
rebuildAccountsView: Remove conflicting custom TX: 85f1049e0473fee26c362a7c2935a21049d4e350cffe7df28b65ba251e7e94c2
rebuildAccountsView: Remove conflicting custom TX: 2f9b56565134b18a60ffb6c666c3925314793ceeac8c797a0e9fb2b682cb0679
rebuildAccountsView: Remove conflicting custom TX: 4b20649d6362a5b1753bbb3aa8447d9e39881b685d2b6ad6f7176aaa1c65ec13
rebuildAccountsView: Remove conflicting custom TX: fba3e01e60a5c67d12bbf7dbb780e51c32a3cdb391d71b53ce630a0dc016a313
~20 entries every minute

PS: Most of the conflicting TX are unique
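
For anyone wanting to check the uniqueness claim against their own logs, here is a small sketch that extracts the tx ids from lines in the format quoted above and compares total versus unique counts. The function name and the sample selection are illustrative; only the log line format comes from the excerpt.

```python
import re
from typing import List

# Matches the log format quoted above and captures the hex tx id.
CONFLICT_RE = re.compile(
    r"rebuildAccountsView: Remove conflicting custom TX: ([0-9a-f]+)"
)


def conflicting_txids(log_text: str) -> List[str]:
    """Return every conflicting tx id found in the log text, in order."""
    return CONFLICT_RE.findall(log_text)


# Three of the lines quoted in the comment above.
sample_log = (
    "rebuildAccountsView: Remove conflicting custom TX: "
    "a02d94c480bdefb2fa44865e4baf28274216f148b23595ff9e9d18f4d7f6eee5\n"
    "rebuildAccountsView: Remove conflicting custom TX: "
    "7345cb09c1ebedd9162beca40c54629dff24a2b687014ea77d595cb8f7546ac6\n"
    "rebuildAccountsView: Remove conflicting custom TX: "
    "85f1049e0473fee26c362a7c2935a21049d4e350cffe7df28b65ba251e7e94c2\n"
)

txids = conflicting_txids(sample_log)
total_count = len(txids)
unique_count = len(set(txids))
```

If `unique_count` stays close to `total_count` over a longer log, the node is removing many different transactions rather than retrying the same few.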

@kuegi

kuegi commented Feb 10, 2023

I also see a lot of
2023-02-10T10:11:10Z ERROR: ContextualValidateAnchor: Anchor and blockchain mismatch at height 2656095. Expected c6f5fb0d7bff9055a819c73a2975a72878a81c426cfb58f356c684581ab080fb found 95bac89e1619f688019278eccbb93d110426f8e37a841aaee4167429e35dd5f7
lately. Not sure if it's in any way connected.

@prasannavl
Member Author

Thanks @kuegi and @Stonygan. Most of the above appear to be consequences of the slowdown.

Just a short update:

  • On the node side, we're expanding the 'Require' optimizations to other areas. These were planned for later releases, but are now being expedited given the situation. We expect a 3.2.5 release, targeted within the next 12 hours.
  • We may also add more observational logging to help understand and gather data on the failures before the stateless containers are wiped. This may or may not make it into the release, depending on its state of completion.

Will add more updates as we progress.

@prasannavl
Member Author

prasannavl commented Feb 12, 2023

Update: Ocean has been operating stably with a 2x scale on top of the 1.5x scale that was already in place. This temporary mitigation works for the moment, so the target for the node release has been reset to Mon/Tue to see what additional optimizations can be included.

@prasannavl
Member Author

https://github.com/DeFiCh/ain/releases/tag/v3.2.5 has been released with improvements across the board.

@prasannavl
Member Author

Closing this, as it's no longer an issue after https://github.com/DeFiCh/ain/releases/tag/v3.2.8

@prasannavl prasannavl transferred this issue from DeFiCh/ain Apr 13, 2023