Something blocking Hermes from properly starting up and spinning up workers #4101

Closed
5 tasks
freak12techno opened this issue Jul 28, 2024 · 7 comments · Fixed by #4110

Comments

@freak12techno
Contributor

freak12techno commented Jul 28, 2024

Summary of Bug

Hermes sometimes fails to spin up workers (or rather gets stuck spinning them up). It apparently does this sequentially, so if it fails to spin up workers on the first chain, the others won't spin up either and Hermes won't do anything useful.

Example: here's my test config: https://gist.github.com/freak12techno/1a995d3822d5fee50e4c569298c6b8d6, and running Hermes with trace logging produces these results: https://gist.github.com/freak12techno/74f02dd94df0fd627312f0f62a90a37f. It seems to finish spinning up workers for bitsong-2b, cosmoshub-4, and jackal-1, then gets stuck spinning up workers for osmosis-1 (not sure why; that's a separate matter, likely the node not working properly), and all of the chains that come after osmosis-1 in the config (sentinelhub-2 and neutron-1) are not loaded properly.

A few times I've also hit a case where bitsong-2b is the faulty one, so none of the chains have their workers running and Hermes effectively does nothing for any chain.

I have a feeling that the worker spin-up process is sequential, and the fix would be to make it asynchronous, so that failing to load one chain won't prevent the others from loading.
@ljoss17 I remember you investigating the packet-clearing routine blocking Hermes, which should be somewhat similar. Can you check this out?

Just to clarify: my main concern is not a failure of a single chain (like in my example, Hermes failing to spin up Osmosis worker), but rather a failure of a single chain blocking Hermes from doing anything else.
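
To illustrate the behaviour asked for here, below is a minimal sketch in plain Rust, not Hermes code. It assumes a hypothetical `start_chain` function standing in for per-chain startup, with a `delay` parameter that simulates a slow or stuck node. Each chain starts on its own thread and reports back over a channel, so a chain that never reports only costs the shared deadline instead of blocking the chains listed after it.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for per-chain startup; `delay` simulates a slow node.
fn start_chain(chain_id: &str, delay: Duration) -> String {
    thread::sleep(delay);
    format!("{chain_id}: workers started")
}

/// Starts every chain on its own thread and returns the messages from the
/// chains that reported back before `deadline` elapsed.
fn start_all(chains: &[(&'static str, Duration)], deadline: Duration) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    for &(chain_id, delay) in chains {
        let tx = tx.clone();
        // Detached thread: a stuck chain keeps running without blocking the rest.
        thread::spawn(move || {
            let _ = tx.send(start_chain(chain_id, delay));
        });
    }
    // Drop the original sender so the channel closes once every thread is done.
    drop(tx);

    let begin = Instant::now();
    let mut started = Vec::new();
    loop {
        let remaining = deadline.saturating_sub(begin.elapsed());
        match rx.recv_timeout(remaining) {
            Ok(msg) => started.push(msg),
            // Either the deadline passed or all chains have reported.
            Err(_) => break,
        }
    }
    started
}
```

With a short delay for bitsong-2b and neutron-1 and a delay longer than the deadline for osmosis-1, `start_all` returns the two healthy chains and simply omits the stuck one.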

Version

1.10.1

Steps to Reproduce

  1. Use my Hermes config
  2. Start the relayer
  3. Expect something like this in logs if Hermes somehow is stuck spinning up a worker.

Acceptance Criteria

Failing to spin up a worker for one chain should not prevent Hermes from working properly on the others.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ljoss17
Contributor

ljoss17 commented Jul 29, 2024

Hey @freak12techno, could you run the Hermes instance with the flag --debug=rpc and share the full logs?

@romac
Member

romac commented Jul 29, 2024

The scan is what can take a long time and/or fail, so we should indeed scan the chains in parallel, and gracefully handle any failures there. Once the scan is done, spawning the workers should be very fast and not a bottleneck at all.
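
A rough sketch of what "scan in parallel and handle failures gracefully" could look like, using only the Rust standard library. This is not the Hermes implementation: `scan_chain`, `ScanResult`, the dummy channel count, and the simulated osmosis-1 failure are all assumptions made for illustration.

```rust
use std::thread;

/// Hypothetical scan result: the number of channels found on the chain.
type ScanResult = Result<usize, String>;

/// Stand-in for a real chain scan; osmosis-1 fails to simulate a bad node.
fn scan_chain(chain_id: &str) -> ScanResult {
    if chain_id == "osmosis-1" {
        Err(format!("{chain_id}: RPC node unreachable"))
    } else {
        Ok(2) // dummy channel count
    }
}

/// Scans all chains concurrently and splits the results into successful
/// scans and error messages, preserving the order of `chain_ids`.
fn scan_all(chain_ids: &[&str]) -> (Vec<(String, usize)>, Vec<String>) {
    let results: Vec<(String, ScanResult)> = thread::scope(|s| {
        // Spawn one scoped thread per chain before joining any of them.
        let handles: Vec<_> = chain_ids
            .iter()
            .map(|&id| (id, s.spawn(move || scan_chain(id))))
            .collect();
        handles
            .into_iter()
            .map(|(id, handle)| {
                // A panicking scan is reported as a failure for that chain only.
                let res = handle
                    .join()
                    .unwrap_or_else(|_| Err(format!("{id}: scan panicked")));
                (id.to_string(), res)
            })
            .collect()
    });

    let mut scanned = Vec::new();
    let mut failed = Vec::new();
    for (id, res) in results {
        match res {
            Ok(channels) => scanned.push((id, channels)),
            Err(err) => failed.push(err),
        }
    }
    (scanned, failed)
}
```

The caller can then spawn workers for every entry in the first list and log or retry the entries in the second, so one failed scan does not stop the other chains.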

@freak12techno
Contributor Author

@ljoss17 so I started Hermes with --debug=rpc and ran it for about 10 minutes; here are the logs: https://gist.github.com/freak12techno/43d9b674388f35b000b7b22979424483. Apparently at least the wallet worker for cosmoshub-4 never started, as I don't see any metrics for the cosmoshub-4 wallet balance (I do see the metrics for wallet balances on bitsong-2b, jackal-1, sentinel, and also osmosis-1; these chains are the last in the config).

@romac agreed. I also created a separate issue about scanning the chains in parallel, which should speed things up, but that is out of scope for this one.

@freak12techno
Contributor Author

@ljoss17 I seem to have the same issue again on 1.10.13: after restarting, it seems all the chains are scanned, but Hermes isn't doing anything at all.
I'm pretty sure it's because one of the nodes is misbehaving (likely the Osmosis one), but I have paths I can relay that don't involve Osmosis (for example DVPN <=> ATOM), and Hermes doesn't seem to behave correctly on those either.
I wonder whether I should create another issue for that or reopen this one.

Metrics (as you can see, after the restart it isn't submitting anything at all):
[image: metrics screenshot]
I'm using pretty much the same config as above; here are my logs: https://gist.github.com/freak12techno/8ce404c507700d3ac73f483d5ca6d2db.
Can you check this out?

@romac
Member

romac commented Sep 3, 2024

What happens if you comment out the osmosis chain and all channels tagged # Osmosis from your config?

@romac
Member

romac commented Sep 3, 2024

It's weird that Hermes is scanning all clients/connections/channels on all chains. Are you still using an allowlist for each chain?

@freak12techno
Contributor Author

@romac sorry, I forgot that I had disabled the chain policy. Here's the up-to-date config: https://gist.github.com/freak12techno/3b8f3672521e77e0ff35e464a8dcdd21. Let me know if that helps or if you need anything else.

My concern here is that it did finish scanning chains, but then something weird started to happen,
