Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observe 3rd Party validators are signing at the tip in the relayer #1762

Closed
2 tasks done
nambrot opened this issue Feb 4, 2023 · 7 comments · Fixed by #3057
Closed
2 tasks done

Observe 3rd Party validators are signing at the tip in the relayer #1762

nambrot opened this issue Feb 4, 2023 · 7 comments · Fixed by #3057

Comments

@nambrot
Copy link
Contributor

nambrot commented Feb 4, 2023

Compared to v1, the relayer has no practical metric anymore to assess whether validators are signing "at the tip". We should consider adding this functionality to the scraper

Indexing in relayers v2 for validator signature. This will also show up for monitoring and alerting in Grafana.

Tasks

Preview Give feedback
  1. tkporter
  2. infrastructure operations
    tkporter
@nambrot
Copy link
Contributor Author

nambrot commented Feb 4, 2023

@tkporter curious to get your thoughts

@nambrot nambrot moved this to On Deck in Hyperlane Tasks Feb 13, 2023
@nambrot
Copy link
Contributor Author

nambrot commented Feb 17, 2023

@tkporter friendly ping :)

@nambrot
Copy link
Contributor Author

nambrot commented Mar 3, 2023

Another friendly ping @tkporter :)

@tkporter
Copy link
Collaborator

tkporter commented Mar 9, 2023

i don't have any satisfactory ideas that I can come up with off the top of my head but will think about this more

I'm a little hesitant to put this in the scraper, it doesn't feel too much more natural to put this there than in the relayer

If we're happy with only monitoring the default ISM's validator set then it sort of works. But I don't think there's anything satisfactory that works for all validators. I'd be up for having a metrics loop that tries to fetch the latest signed checkpoint for the set relating to the default ISM every 30s or minute or so

Feels like this could live in the relayer or scraper. and would probably make more sense for PI chains to live in the relayer, given not everyone will be operating a scraper and these metrics are probably useful

@nambrot
Copy link
Contributor Author

nambrot commented Mar 9, 2023

i guess you could actually enumerate the validators on ValidatorAnnounce, but agree that defaultISM makes more sense, and i could see it be useful in the relayer

@nambrot nambrot changed the title Observe validator signatures in the scraper Observe validator signatures in the relayer Mar 13, 2023
@nambrot nambrot moved this from On Deck to Backlog in Hyperlane Tasks Mar 16, 2023
@avious00 avious00 moved this from Backlog to Below Deck in Hyperlane Tasks Aug 23, 2023
@avious00 avious00 moved this from Backlog to Next Sprint in Hyperlane Tasks Sep 8, 2023
@avious00 avious00 moved this from Next Sprint to Backlog in Hyperlane Tasks Sep 8, 2023
@avious00
Copy link
Contributor

@tkporter friendly ping

@avious00 avious00 changed the title Observe validator signatures in the relayer Track whether 3rd Party validators are signing at the tip Sep 13, 2023
@avious00 avious00 changed the title Track whether 3rd Party validators are signing at the tip Observe 3rd Party validators are signing at the tip in the v2 relayer Sep 14, 2023
@avious00 avious00 moved this from Backlog to Sprint in Hyperlane Tasks Sep 14, 2023
@mattiekat mattiekat removed their assignment Oct 11, 2023
@avious00 avious00 changed the title Observe 3rd Party validators are signing at the tip in the v2 relayer Observe 3rd Party validators are signing at the tip in the relayer Nov 20, 2023
@tkporter
Copy link
Collaborator

tkporter commented Nov 23, 2023

Thinking the move may be to monitor validators that are enrolled on:

  • default ISMs
  • ISMs used by a configurable list of application recipients (e.g. we will want monitoring for the Neutron <-> arb warp route)

Some interesting things:

  1. We already fetch validator checkpoint info when trying to process a message - it probably doesn’t make sense to have a separate task that does this just for metrics, so instead we can just tap into the existing logic when trying to fetch metadata
  2. It’s possible for the validator set or ISMs used by an application to change, in which case we should not keep around the old irrelevant metrics exposed in the /metrics endpoint, and we should start keeping track of the new set
  3. An app may be using an ISM that’s really e.g. an aggregation pointing to a routing ISM that finally points to a multisig ISM
  4. The validators sets enrolled on destination chains D1 and D2 relating to validators from the same origin chain may be different

Imo the most important thing to track to begin with are validator latest checkpoint indices.
There are a couple snafus we made that are reasonable counterarguments that this isn’t sufficient:

  • We mistakenly removed the updating of checkpoint_latest_index.json when we stopped legacy checkpoint support. But this is now added back & will soon be rolled out to all validators in an update, so I think it’s okay to assume this will exist at least soon for all validators
  • For some older validators that are susceptible to the bug where merkle tree indexing was still done in a way that’s different from message indexing & where they still are signing legacy checkpoints, it’s possible for the checkpoint_latest_index.json to relate to legacy checkpoints and not the new checkpoint format. Similar deal - I think it’s okay to assume this will not be an issue for folks with newer validator version

    It’s also possible in theory for a validator to have gaps in what checkpoints they’ve signed, and it’d be nice to be aware of this, but this feels a bit out of scope and harder to surface using prom metrics, at least for now

Proposed metric:

Name: hyperlane_relayer_validator_checkpoint_latest_index (there's probably a better name for this that ill try to think of...)
Value: the value of checkpoint_latest_index.json
Labels:

  • origin: the origin chain
  • validator: the address of the validator
  • app: "default-ism" or the name of a configured application to be monitored, e.g. "neutron-arb-tia"
  • destination: the destination chain the validator is enrolled on
  • ism_address: the address of the ISM that the validator is enrolled on. Because e.g. the default ISM may be an aggregation ISM pointing to routing ISMs which ultimately point to a multisig ISM variant

I’m open to just having the origin and validator labels and scrapping the rest if it doesn’t seem useful to have the app / destination / ism_address. The usefulness of these latter 3 labels is to just have more granularity to know why certain validator are being tracked. Also segmenting them by app / destination / ism_address will make it easy to remove no longer relevant validators from the exposed metrics in the event of an ISM / validator set config change. Otherwise we may have a metric for a validator that has been rotated out of a validator set stick around a pollute alert conditions because it’ll never have its latest checkpoint updated

So to be clear we'd just update the metrics when fetching metadata for validators that are part of the default ISM or for a specially configured app. This also means that if there are no messages, then the metrics won't be updated because this is tapping into the metadata building logic

@avious00 avious00 moved this from In Progress to In Review in Hyperlane Tasks Jan 2, 2024
tkporter added a commit that referenced this issue Jan 2, 2024
### Description

Goal of this was to have insight into validators of important sets being
"up"

Introduces a new metric used by relayers:
`hyperlane_observed_validator_latest_index`, e.g.:

```
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test1",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 664
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test1",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 641
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test2",hyperlane_baselib_version="0.1.0",origin="test1",validator="0x15d34aaf54267db7d7c367839aaf71a00a2c6a65"} 670
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test2",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 665
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test3",hyperlane_baselib_version="0.1.0",origin="test1",validator="0x15d34aaf54267db7d7c367839aaf71a00a2c6a65"} 652
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test3",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 664
hyperlane_observed_validator_latest_index{agent="relayer",app_context="testapp",destination="test1",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 658
hyperlane_observed_validator_latest_index{agent="relayer",app_context="testapp",destination="test1",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 641
```

Tapping into metadata building for multisig ISMs, the relayer will
update the metric with the latest indices for the validators in a set.
In order to prevent the cardinality being ridiculously high, only
certain validator sets are tracked. This is done by introducing an
`app_context` label (I'm very open to other names here, for some reason
whenever idk how to name some kind of identifier I end up calling it a
context 😆)

The app context can either be:
- if a new setting, --metricAppContexts, is specified, a message will be
classified based off the first matching list it matches. E.g.
`--metricAppContexts '[{"name": "testapp", "matchingList":
[{"recipient_address": "0xd84379ceae14aa33c123af12424a37803f885889",
"destination_domain": 13371 }] }]'`. This is nice for e.g. warp route
deployments, where the ISM is maybe not a default ISM, and can be
changed
- if a message doesn't get classified this way, it can also be
classified with the "default_ism" app context, which is just for any
message that happens to use the default ISM as its "root" ISM

This way we have insight in to the default ISM and any
application-specific ISMs.

Some things to note:
- it's possible for a message to actually have more than one validator
set, e.g. if it's using an aggregation ISM. In this case, we'll have
metrics on the union of all validator sets for that app context
- Some effort is required to make sure that metrics don't stick around
for a validator that has actually been removed from the set. To handle
this, we cache the validator set for an app context and clear out the
entire set each time we set the metrics

### Drive-by changes

- Zod's nonempty function for strings is deprecated, moves to `.min(1)`
instead

### Related issues

- Fixes #1762

### Backward compatibility

yes

### Testing

Ran locally - I think i'll probably add something in e2e tests, but
opening now
@github-project-automation github-project-automation bot moved this from In Review to Done in Hyperlane Tasks Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants