Observe 3rd Party validators are signing at the tip in the relayer #1762

nambrot · 2023-02-04T17:11:41Z

Compared to v1, the relayer no longer has a practical metric to assess whether validators are signing "at the tip". We should consider adding this functionality to the scraper.

Index validator signatures in the v2 relayer. This will also feed monitoring and alerting in Grafana.

Tasks


Scope: 3rd Party Validator #2938
Infra support & grafana alerts for validator observability #3109

infrastructure operations

nambrot · 2023-02-04T17:11:49Z

@tkporter curious to get your thoughts

nambrot · 2023-02-17T22:07:37Z

@tkporter friendly ping :)

nambrot · 2023-03-03T21:47:45Z

Another friendly ping @tkporter :)

tkporter · 2023-03-09T15:52:04Z

I don't have any satisfactory ideas off the top of my head, but will think about this more.

I'm a little hesitant to put this in the scraper; it doesn't feel much more natural to put it there than in the relayer.

If we're happy with only monitoring the default ISM's validator set then it sort of works. But I don't think there's anything satisfactory that works for all validators. I'd be up for having a metrics loop that tries to fetch the latest signed checkpoint for the set relating to the default ISM every 30s or minute or so.

Feels like this could live in the relayer or scraper, and it would probably make more sense for PI chains to have it live in the relayer, given not everyone will be operating a scraper and these metrics are probably useful.
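A minimal sketch of the metrics loop idea above, in Python for brevity. Everything here is hypothetical: `poll_latest_indices`, the `fetchers` mapping (one callable per validator that returns its latest signed checkpoint index), and the plain dict standing in for a gauge are illustrative names, not relayer APIs.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical: a fetcher returns a validator's latest signed checkpoint
# index, or None if nothing could be read.
Fetcher = Callable[[], Optional[int]]


def poll_latest_indices(
    fetchers: Dict[str, Fetcher],
    gauge: Dict[str, int],
    interval_s: float = 30.0,
    rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Record each validator's latest index every `interval_s` seconds.

    `rounds=None` loops forever; a finite value is handy for testing.
    """
    done = 0
    while rounds is None or done < rounds:
        for validator, fetch in fetchers.items():
            try:
                index = fetch()
            except Exception:
                index = None  # one unreachable validator must not stop the loop
            if index is not None:
                gauge[validator] = index
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return gauge
```

Passing `sleep` in as a parameter keeps the loop testable without waiting for real time to pass.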

nambrot · 2023-03-09T17:40:07Z

I guess you could actually enumerate the validators on ValidatorAnnounce, but agree that defaultISM makes more sense, and I could see it being useful in the relayer.

avious00 · 2023-09-13T16:16:20Z

@tkporter friendly ping

tkporter · 2023-11-23T15:41:01Z

Thinking the move may be to monitor validators that are enrolled on:

- default ISMs
- ISMs used by a configurable list of application recipients (e.g. we will want monitoring for the Neutron <-> arb warp route)

Some interesting things:

- We already fetch validator checkpoint info when trying to process a message - it probably doesn’t make sense to have a separate task that does this just for metrics, so instead we can just tap into the existing logic when trying to fetch metadata
- It’s possible for the validator set or ISMs used by an application to change, in which case we should not keep around the old irrelevant metrics exposed in the /metrics endpoint, and we should start keeping track of the new set
- An app may be using an ISM that’s really e.g. an aggregation pointing to a routing ISM that finally points to a multisig ISM
- The validator sets enrolled on destination chains D1 and D2 relating to validators from the same origin chain may be different
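To make the nested-ISM point concrete, here is a rough sketch of walking an ISM tree to collect the validators relevant to one origin. The classes below are a simplified, hypothetical model (an aggregation has child modules, a routing ISM picks a child per origin, a multisig holds validators); they are not the actual contract or agent interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass
class MultisigIsm:
    validators: Set[str]


@dataclass
class RoutingIsm:
    routes: Dict[str, "Ism"]  # origin chain -> child ISM


@dataclass
class AggregationIsm:
    modules: List["Ism"]


Ism = Union[MultisigIsm, RoutingIsm, AggregationIsm]


def collect_validators(ism: Ism, origin: str) -> Set[str]:
    """Union of all validators reachable from `ism` for messages from `origin`."""
    if isinstance(ism, MultisigIsm):
        return set(ism.validators)
    if isinstance(ism, RoutingIsm):
        child = ism.routes.get(origin)
        return collect_validators(child, origin) if child is not None else set()
    if isinstance(ism, AggregationIsm):
        found: Set[str] = set()
        for module in ism.modules:
            found |= collect_validators(module, origin)
        return found
    raise TypeError(f"unknown ISM type: {type(ism).__name__}")


# Two destinations can enroll different sets for the same origin.
ism_on_d1 = AggregationIsm([RoutingIsm({"origin1": MultisigIsm({"0xa", "0xb"})})])
ism_on_d2 = MultisigIsm({"0xa"})
```

With this model, `collect_validators(ism_on_d1, "origin1")` and `collect_validators(ism_on_d2, "origin1")` return different sets, which is why a destination label is useful on the metric.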

Imo the most important thing to track to begin with is each validator's latest checkpoint index.
There are a couple of snafus we made that are reasonable arguments that this isn’t sufficient:

- We mistakenly removed the updating of checkpoint_latest_index.json when we stopped legacy checkpoint support. But this is now added back & will soon be rolled out to all validators in an update, so I think it’s okay to assume this will exist at least soon for all validators
- For some older validators that are susceptible to the bug where merkle tree indexing was still done in a way that’s different from message indexing & where they are still signing legacy checkpoints, it’s possible for the checkpoint_latest_index.json to relate to legacy checkpoints and not the new checkpoint format. Similar deal - I think it’s okay to assume this will not be an issue for folks with newer validator versions
- It’s also possible in theory for a validator to have gaps in what checkpoints they’ve signed, and it’d be nice to be aware of this, but this feels a bit out of scope and harder to surface using prom metrics, at least for now
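As a rough illustration of how latest indices could be turned into an "is this validator at the tip?" signal, the sketch below flags validators that trail the highest index seen for the same origin. Approximating the tip by the maximum observed index, the `max_lag` threshold, and the function name are all assumptions made for this example, not something specified in this thread.

```python
from typing import Dict


def lagging_validators(latest: Dict[str, int], max_lag: int = 10) -> Dict[str, int]:
    """Return {validator: lag} for validators more than `max_lag` behind.

    Assumption: the highest index observed among the validators of one
    origin stands in for the tip.
    """
    if not latest:
        return {}
    tip = max(latest.values())
    return {v: tip - i for v, i in latest.items() if tip - i > max_lag}
```

Note that this only catches a validator falling behind its peers; it says nothing about gaps in the checkpoints a validator has signed.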

Proposed metric:

- Name: `hyperlane_relayer_validator_checkpoint_latest_index` (there's probably a better name for this that I'll try to think of...)
- Value: the value of `checkpoint_latest_index.json`
- Labels:
  - `origin`: the origin chain
  - `validator`: the address of the validator
  - `app`: "default-ism" or the name of a configured application to be monitored, e.g. "neutron-arb-tia"
  - `destination`: the destination chain the validator is enrolled on
  - `ism_address`: the address of the ISM that the validator is enrolled on. Because e.g. the default ISM may be an aggregation ISM pointing to routing ISMs which ultimately point to a multisig ISM variant

I’m open to just having the origin and validator labels and scrapping the rest if it doesn’t seem useful to have the app / destination / ism_address. The usefulness of these latter 3 labels is just to have more granularity to know why certain validators are being tracked. Also, segmenting them by app / destination / ism_address will make it easy to remove no longer relevant validators from the exposed metrics in the event of an ISM / validator set config change. Otherwise we may have a metric for a validator that has been rotated out of a validator set stick around and pollute alert conditions, because it’ll never have its latest checkpoint updated
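A small sketch of why segmenting by app / destination / ism_address makes cleanup easy: if series are grouped per segment, replacing a segment's validator set drops rotated-out validators in one step. `LatestIndexGauge` is a hypothetical stand-in written for this example (a real agent would use a Prometheus client library); only the metric name and labels come from the proposal above.

```python
from typing import Dict, List, Tuple

METRIC = "hyperlane_relayer_validator_checkpoint_latest_index"
Segment = Tuple[str, str, str, str]  # (app, origin, destination, ism_address)


class LatestIndexGauge:
    """Toy gauge that stores validator indices grouped by segment."""

    def __init__(self) -> None:
        self._series: Dict[Segment, Dict[str, int]] = {}

    def set_segment(self, segment: Segment, indices: Dict[str, int]) -> None:
        # Replacing the whole segment removes validators no longer in the set.
        self._series[segment] = dict(indices)

    def render(self) -> List[str]:
        """Render series as Prometheus text exposition lines."""
        lines = []
        for (app, origin, destination, ism_address), indices in sorted(self._series.items()):
            for validator, index in sorted(indices.items()):
                labels = (
                    f'app="{app}",destination="{destination}",'
                    f'ism_address="{ism_address}",origin="{origin}",'
                    f'validator="{validator}"'
                )
                lines.append(f"{METRIC}{{{labels}}} {index}")
        return lines
```

With only `origin` and `validator` labels there would be no natural unit to replace, so stale series would have to be tracked and removed one by one.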

So to be clear we'd just update the metrics when fetching metadata for validators that are part of the default ISM or for a specially configured app. This also means that if there are no messages, then the metrics won't be updated because this is tapping into the metadata building logic

### Description

Goal of this was to have insight into validators of important sets being "up"

Introduces a new metric used by relayers: `hyperlane_observed_validator_latest_index`, e.g.:

```
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test1",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 664
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test1",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 641
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test2",hyperlane_baselib_version="0.1.0",origin="test1",validator="0x15d34aaf54267db7d7c367839aaf71a00a2c6a65"} 670
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test2",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 665
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test3",hyperlane_baselib_version="0.1.0",origin="test1",validator="0x15d34aaf54267db7d7c367839aaf71a00a2c6a65"} 652
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test3",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 664
hyperlane_observed_validator_latest_index{agent="relayer",app_context="testapp",destination="test1",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 658
hyperlane_observed_validator_latest_index{agent="relayer",app_context="testapp",destination="test1",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 641
```

Tapping into metadata building for multisig ISMs, the relayer will update the metric with the latest indices for the validators in a set. In order to prevent the cardinality being ridiculously high, only certain validator sets are tracked. This is done by introducing an `app_context` label (I'm very open to other names here, for some reason whenever idk how to name some kind of identifier I end up calling it a context 😆)

The app context can either be:
- if a new setting, `--metricAppContexts`, is specified, a message will be classified based off the first matching list it matches. E.g. `--metricAppContexts '[{"name": "testapp", "matchingList": [{"recipient_address": "0xd84379ceae14aa33c123af12424a37803f885889", "destination_domain": 13371 }] }]'`. This is nice for e.g. warp route deployments, where the ISM is maybe not a default ISM, and can be changed
- if a message doesn't get classified this way, it can also be classified with the "default_ism" app context, which is just for any message that happens to use the default ISM as its "root" ISM

This way we have insight into the default ISM and any application-specific ISMs.

Some things to note:
- it's possible for a message to actually have more than one validator set, e.g. if it's using an aggregation ISM. In this case, we'll have metrics on the union of all validator sets for that app context
- Some effort is required to make sure that metrics don't stick around for a validator that has actually been removed from the set. To handle this, we cache the validator set for an app context and clear out the entire set each time we set the metrics

### Drive-by changes

- Zod's nonempty function for strings is deprecated, moved to `.min(1)` instead

### Related issues

- Fixes #1762

### Backward compatibility

yes

### Testing

Ran locally - I think I'll probably add something in e2e tests, but opening now
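The classification rule described above can be sketched roughly as follows, in Python for brevity. The config shape is taken from the `--metricAppContexts` example; the matching semantics (an entry matches when every field it specifies equals the message's field, addresses compared case-insensitively, any entry in a list is enough) and the `classify` / `entry_matches` names are assumptions for illustration, not the relayer's actual implementation.

```python
import json
from typing import Any, Dict, List, Optional


def entry_matches(entry: Dict[str, Any], message: Dict[str, Any]) -> bool:
    """Assumed semantics: every field given in the entry must equal the message's."""
    for field, expected in entry.items():
        actual = message.get(field)
        if isinstance(expected, str) and isinstance(actual, str):
            if expected.lower() != actual.lower():  # hex addresses: ignore case
                return False
        elif expected != actual:
            return False
    return True


def classify(
    message: Dict[str, Any],
    app_contexts: List[Dict[str, Any]],
    uses_default_ism: bool,
) -> Optional[str]:
    """First matching app context wins; otherwise fall back to "default_ism"."""
    for context in app_contexts:
        if any(entry_matches(e, message) for e in context["matchingList"]):
            return context["name"]
    return "default_ism" if uses_default_ism else None


contexts = json.loads(
    '[{"name": "testapp", "matchingList": [{"recipient_address": '
    '"0xd84379ceae14aa33c123af12424a37803f885889", "destination_domain": 13371 }] }]'
)
```

A message that matches no configured context and does not use the default ISM as its root gets no app context here, so no metric would be recorded for its validators.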

nambrot added agent operations labels Feb 4, 2023

nambrot added this to Hyperlane Tasks Feb 13, 2023

nambrot moved this to On Deck in Hyperlane Tasks Feb 13, 2023

nambrot changed the title ~~Observe validator signatures in the scraper~~ Observe validator signatures in the relayer Mar 13, 2023

nambrot assigned mattiekat Mar 13, 2023

nambrot moved this from On Deck to Backlog in Hyperlane Tasks Mar 16, 2023

avious00 moved this from Backlog to Below Deck in Hyperlane Tasks Aug 23, 2023

avious00 moved this from Backlog to Next Sprint in Hyperlane Tasks Sep 8, 2023

avious00 moved this from Next Sprint to Backlog in Hyperlane Tasks Sep 8, 2023

avious00 added the tech-debt label Sep 8, 2023

avious00 assigned daniel-savu Sep 8, 2023

avious00 changed the title ~~Observe validator signatures in the relayer~~ Track whether 3rd Party validators are signing at the tip Sep 13, 2023

avious00 changed the title ~~Track whether 3rd Party validators are signing at the tip~~ Observe 3rd Party validators are signing at the tip in the v2 relayer Sep 14, 2023

avious00 moved this from Backlog to Sprint in Hyperlane Tasks Sep 14, 2023

mattiekat removed their assignment Oct 11, 2023

avious00 changed the title ~~Observe 3rd Party validators are signing at the tip in the v2 relayer~~ Observe 3rd Party validators are signing at the tip in the relayer Nov 20, 2023

tkporter assigned tkporter and unassigned daniel-savu Nov 24, 2023

avious00 added the validator label Nov 26, 2023

avious00 moved this from Sprint to In Progress in Hyperlane Tasks Nov 27, 2023

avious00 mentioned this issue Nov 29, 2023

Scope: 3rd Party Validator #2938

Closed

tkporter mentioned this issue Dec 14, 2023

Observability of validators for relayers #3057

Merged

avious00 moved this from In Progress to In Review in Hyperlane Tasks Jan 2, 2024

tkporter closed this as completed in #3057 Jan 2, 2024

github-project-automation bot moved this from In Review to Done in Hyperlane Tasks Jan 2, 2024
