Validator observability infra #3116

Merged (11 commits, Jan 19, 2024)

@tkporter (Collaborator) commented Jan 3, 2024

Description

  • Configures the relayers to start tracking the latest checkpoints of the validator sets for the Manta TIA warp route (WR), the Arbitrum TIA WR, and helloworld (and, implicitly, the default ISM validator sets too)
  • Deployed all agent roles & networks with the new image
  • Created alerts and added them to the nexus dashboard
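A rough sketch of how a relayer might decide which app context a message belongs to, so that checkpoint metrics can be labelled per context (for example helloworld). The interfaces, field names and `appContextFor` are hypothetical and are not the relayer's actual config schema; the only idea illustrated is that each tracked context is described by a matching list in which an omitted field acts as a wildcard.

```typescript
// Hypothetical config shapes; field names are illustrative, not the relayer's schema.
interface MatchingListElement {
  originDomain?: number | number[];
  destinationDomain?: number | number[];
  senderAddress?: string | string[]; // 0x-prefixed bytes32
  recipientAddress?: string | string[]; // 0x-prefixed bytes32
}

interface AppContext {
  name: string;
  matchingList: MatchingListElement[];
}

interface Message {
  origin: number;
  destination: number;
  sender: string;
  recipient: string;
}

// An omitted rule matches anything; an array matches any of its entries.
function fieldMatches<T>(rule: T | T[] | undefined, value: T): boolean {
  if (rule === undefined) return true;
  return Array.isArray(rule) ? rule.includes(value) : rule === value;
}

// Lowercase address rules so that addresses are compared case-insensitively.
function lower(rule: string | string[] | undefined): string | string[] | undefined {
  if (rule === undefined) return undefined;
  return Array.isArray(rule) ? rule.map((a) => a.toLowerCase()) : rule.toLowerCase();
}

function elementMatches(el: MatchingListElement, msg: Message): boolean {
  return (
    fieldMatches(el.originDomain, msg.origin) &&
    fieldMatches(el.destinationDomain, msg.destination) &&
    fieldMatches(lower(el.senderAddress), msg.sender.toLowerCase()) &&
    fieldMatches(lower(el.recipientAddress), msg.recipient.toLowerCase())
  );
}

// Name of the first app context whose matching list accepts the message, if any.
function appContextFor(msg: Message, contexts: AppContext[]): string | undefined {
  return contexts.find((ctx) => ctx.matchingList.some((el) => elementMatches(el, msg)))?.name;
}

// Usage with made-up addresses and domain IDs.
const contexts: AppContext[] = [
  { name: 'helloworld', matchingList: [{ recipientAddress: '0x' + 'ab'.repeat(32) }] },
];
const msg: Message = {
  origin: 1,
  destination: 2,
  sender: '0x' + '11'.repeat(32),
  recipient: '0x' + 'AB'.repeat(32),
};
console.log(appContextFor(msg, contexts)); // helloworld
```

Messages that match no context get no label in this sketch; how the real relayer treats them is not covered here.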

One thing that's a bit of a bummer is that the metrics are only updated when we try to construct metadata for a message. Validators naturally poll and sign checkpoints slightly out of sync, so the relayer frequently attempts to deliver a message that needs, say, 4 of 6 signatures while 2 of the validators simply haven't polled and signed the checkpoint yet. The message is still delivered successfully, and the 2 remaining validators probably sign the checkpoint within seconds, but the metrics don't reflect this until metadata is constructed for another message. As a result, the graph looks a bit noisy even when everything is working. When things aren't working, though, we'll still be able to see clearly which validators are behind.
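The staleness described above can be modelled in a few lines. In this hypothetical sketch (`CheckpointGauges` and its methods are illustrative names, not the relayer's code), per-validator gauges are written only during metadata construction, so validators that sign a few seconds later keep showing as behind until the next message arrives.

```typescript
// Hypothetical model: gauges are written only when metadata is built, not when
// validators actually sign. Names are illustrative, not the relayer's code.
class CheckpointGauges {
  private readonly latest = new Map<string, number>();

  // Called during metadata construction with each validator's latest signed index.
  recordOnMetadataBuild(observed: Record<string, number>): void {
    for (const [validator, index] of Object.entries(observed)) {
      this.latest.set(validator, index);
    }
  }

  // Validators whose last *recorded* index is below `index`.
  behind(index: number): string[] {
    return Array.from(this.latest.entries())
      .filter(([, recorded]) => recorded < index)
      .map(([validator]) => validator);
  }
}

const gauges = new CheckpointGauges();

// A message at checkpoint index 100 needs 4 of 6 signatures; v5 and v6 have not
// signed it yet, so they are recorded at 99 and the message is still delivered.
gauges.recordOnMetadataBuild({ v1: 100, v2: 100, v3: 100, v4: 100, v5: 99, v6: 99 });

// Seconds later v5 and v6 sign index 100, but no new metadata is built, so the
// gauges keep reporting them as behind until the next message comes through.
console.log(gauges.behind(100)); // [ 'v5', 'v6' ]
```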

If this proves confusing, we may want to change the metrics. Some ideas:

  1. Move to a separate task that periodically polls the latest checkpoints of the configured app contexts
  2. Also track the threshold. This way we can build alerts & dashboards based on the threshold and not be overly concerned about a few stragglers
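Idea 2 could look roughly like the following sketch, which assumes each validator's latest signed checkpoint index is available alongside the ISM threshold. `evaluate`, `ValidatorSetStatus` and the `page`/`warn`/`ok` levels are hypothetical; "caught up" here simply means a validator's latest signed index is at or past the index in question. The point is that alerts key off whether a threshold of validators has caught up, rather than off any single straggler.

```typescript
// Hypothetical sketch of idea 2: keep the ISM threshold next to each validator's
// latest signed checkpoint index and alert on quorum rather than on stragglers.
interface ValidatorSetStatus {
  threshold: number; // e.g. 4 for a 4-of-6 multisig ISM
  latestSigned: Map<string, number>; // validator -> latest signed checkpoint index
}

type Severity = 'ok' | 'warn' | 'page';

function evaluate(status: ValidatorSetStatus, index: number) {
  const caughtUp: string[] = [];
  const stragglers: string[] = [];
  status.latestSigned.forEach((signed, validator) => {
    if (signed >= index) caughtUp.push(validator);
    else stragglers.push(validator);
  });
  const quorum = caughtUp.length >= status.threshold;
  // Page only when fewer than `threshold` validators have caught up;
  // a few stragglers alongside a healthy quorum is just a warning.
  const severity: Severity = !quorum ? 'page' : stragglers.length > 0 ? 'warn' : 'ok';
  return { quorum, stragglers, severity };
}

const exampleSet: ValidatorSetStatus = {
  threshold: 4,
  latestSigned: new Map([
    ['v1', 100], ['v2', 100], ['v3', 100], ['v4', 100], ['v5', 99], ['v6', 99],
  ]),
};
console.log(evaluate(exampleSet, 100).severity); // warn
console.log(evaluate(exampleSet, 101).severity); // page
```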

Drive-by changes

Added bytesToBytes32 so that we can construct matching lists, which the agents expect to contain 0x-prefixed addresses, from a router address config that may include protocol-specific address formats
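A minimal sketch of what a helper like bytesToBytes32 does, assuming router addresses have already been decoded from their protocol-specific format (for example, bech32 on Cosmos chains or base58 on Solana) into raw bytes. This is a standalone illustration, not the code in @hyperlane-xyz/utils, and the chain names are made up: it left-pads the bytes to 32 and returns a 0x-prefixed hex string, so a 20-byte EVM address and a 32-byte address end up in the same format for a matching list.

```typescript
// Hypothetical stand-in for the bytesToBytes32 helper; not the @hyperlane-xyz/utils
// source. Left-pads raw address bytes to 32 bytes and returns a 0x-prefixed,
// lowercase hex string.
function bytesToBytes32(bytes: Uint8Array): string {
  if (bytes.length > 32) {
    throw new Error(`expected at most 32 bytes, got ${bytes.length}`);
  }
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
  return '0x' + hex.padStart(64, '0');
}

// Assumption: each chain's router address has already been decoded from its
// protocol-specific format into raw bytes (chain names here are made up).
const routerBytes: Record<string, Uint8Array> = {
  evmChain: new Uint8Array(20).fill(0x11), // 20-byte EVM address
  otherChain: new Uint8Array(32).fill(0xab), // already 32 bytes
};

// Every router becomes a 0x-prefixed bytes32, the format a matching list expects.
const recipientAddresses = Object.values(routerBytes).map((b) => bytesToBytes32(b));
console.log(recipientAddresses[0] === '0x' + '00'.repeat(12) + '11'.repeat(20)); // true
```

Left-padding (rather than right-padding) mirrors how a 20-byte EVM address is conventionally widened to bytes32.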

Related issues

Fixes #3109

Backward compatibility

yes

Testing

deployed

changeset-bot bot commented Jan 3, 2024

🦋 Changeset detected

Latest commit: 6966645

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 6 packages:
  • @hyperlane-xyz/utils (Patch)
  • @hyperlane-xyz/core (Patch)
  • @hyperlane-xyz/cli (Patch)
  • @hyperlane-xyz/infra (Patch)
  • @hyperlane-xyz/sdk (Patch)
  • @hyperlane-xyz/helloworld (Patch)

@tkporter tkporter enabled auto-merge (squash) January 5, 2024 12:19
@tkporter tkporter disabled auto-merge January 8, 2024 16:00
@nambrot (Contributor) commented Jan 10, 2024

Should this be merged? It's already live, right?

codecov bot commented Jan 19, 2024

Codecov Report

Merging #3116 (6966645) into main (ae4476a) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3116   +/-   ##
=======================================
  Coverage   67.18%   67.18%           
=======================================
  Files         101      101           
  Lines        1021     1021           
  Branches      106      106           
=======================================
  Hits          686      686           
  Misses        291      291           
  Partials       44       44           
Component      Coverage Δ
core           50.00% <ø> (ø)
hooks          68.79% <ø> (ø)
isms           65.94% <ø> (ø)
token          54.62% <ø> (ø)
middlewares    81.46% <ø> (ø)

@tkporter tkporter enabled auto-merge (squash) January 19, 2024 12:56
@tkporter tkporter merged commit 78e50e7 into main Jan 19, 2024
24 checks passed
@tkporter tkporter deleted the trevor/validator-observability-infra branch January 19, 2024 14:13
ltyu pushed a commit to ltyu/hyperlane-monorepo that referenced this pull request Mar 13, 2024
Projects: Archived in project
Development: merging this pull request may close the issue "Infra support & grafana alerts for validator observability"
4 participants