This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Prometheus endpoint "polkadot_node_is_authority" does not switch back when getting out of the active set #5664

Closed
stakeworld opened this issue Jun 12, 2022 · 11 comments · Fixed by #5706

@stakeworld
Contributor

stakeworld commented Jun 12, 2022

Binaries:
polkadot 0.9.23-a7e188cd966 from default apt repo https://releases.parity.io/deb
prometheus, version 2.34.0
grafana from apt repo version 8.5.5 (commit: d32ae18909, branch: HEAD)

Server:
Ubuntu 20.04 LTS 5.4.0-113-generic #127-Ubuntu SMP x86_64 on a dedicated AMD server, 64 GB RAM.

Description:
When a validator enters the active validator set, the Prometheus metric "polkadot_node_is_authority" switches to 1, but when it leaves the active set the metric does not switch back to 0. If you restart the node, it goes to 0 and stays at 0. I witnessed this on 0.9.23 and earlier on 0.9.22. The "polkadot_node_is_parachain_validator" metric does switch on and off depending on the paravalidating state. I have heard rumors that if the authority session ends while the node is in a parachain validator state, that metric also stays at 1, but I can't confirm this because all my parachain sessions ended before the authority session ended.

Expected behaviour: when you leave the active set and are no longer an authority, I would expect "polkadot_node_is_authority" to switch back to 0.

I've heard in the 1000 validator Matrix group that this is a known problem, but I could not find a previous bug report; if one exists, my apologies for the duplicate. A fellow validator uses the "overseer rate" instead, but this seems to change between versions, so it is not perfect.

It seems to have been added in #4699 by @sandreim
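
For readers who want to check the reported behaviour on their own node, the following is a minimal sketch of reading the gauge from a scrape of the node's Prometheus text output. The sample text, its `chain` label, and the helper name `read_gauge` are assumptions made for illustration and are not taken from the issue.

```python
from typing import Optional

# Made-up scrape output for illustration; real label sets may differ.
SAMPLE = """\
# TYPE polkadot_node_is_authority gauge
polkadot_node_is_authority{chain="polkadot"} 1
# TYPE polkadot_node_is_parachain_validator gauge
polkadot_node_is_parachain_validator{chain="polkadot"} 0
"""


def read_gauge(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name`, or None if it is absent."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blanks, "# HELP" / "# TYPE" comments, and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith(name):
            continue
        rest = line[len(name):]
        if rest.startswith("{"):
            # Drop the label set; assumes no "}" inside label values.
            _, closing, rest = rest.partition("}")
            if not closing:
                continue
        elif not rest[:1].isspace():
            # A longer metric name that merely shares this prefix.
            continue
        fields = rest.split()
        if fields:
            return float(fields[0])
    return None


in_active_set = read_gauge(SAMPLE, "polkadot_node_is_authority") == 1.0
```

With the behaviour described in this report, `in_active_set` would keep evaluating to true after the node has left the active set, until the node is restarted.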

@Bruno-Lussan

I confirm that both metrics have changed behavior since version 0.9.20.

@stakeworld
Contributor Author

I can confirm that when the active validation session ends while paravalidating, "polkadot_node_is_parachain_validator" also stays at "1". My node left the active set during a parachain validating session and both metrics ("polkadot_node_is_authority" and "polkadot_node_is_parachain_validator") stayed at "1" while already out of the active set. After a restart they reset to "0" and stay there (until the next session).

@sandreim
Contributor

@stakeworld @Bruno-Lussan This was added in 0.9.16, and I can also confirm that polkadot_node_is_authority seems broken now, but polkadot_node_is_parachain_validator looks fine to me.

Unfortunately I don't have metrics going back that far, but 0.9.20 includes some changes that might have affected the metric update (8bb84d2). Looking into it ...

> I can confirm that when the active validation session ends while paravalidating, "polkadot_node_is_parachain_validator" also stays at "1". My node left the active set during a parachain validating session and both metrics ("polkadot_node_is_authority" and "polkadot_node_is_parachain_validator") stayed at "1" while already out of the active set. After a restart they reset to "0" and stay there (until the next session).

Can you be more specific about the polkadot_node_is_parachain_validator metric: was it lagging behind, or did it not go back to 0 at all?

@stakeworld
Contributor Author

@sandreim, polkadot_node_is_parachain_validator seems not to go back to 0 at all if the validation session ends during a parachain validation session. It seems to turn on and off normally within an active validation session. My validation session ended this morning during a parachain session and both metrics stayed at 1. This evening both were still 1 (so about 10 hours later); after I restarted the node everything was normal. In a previous session only polkadot_node_is_authority stayed at 1.

On a side note, my resource usage also looked as if the node were still (para)validating: the pvf-host processes were still running and network use was the same as in an active session, while in the JS app and on the 1000 validator site I was no longer active. After the restart all resources and metrics went back to an inactive state.

Hope you can find where the problem lies; the metric is very handy for monitoring and alerting on active states.

@sandreim
Contributor

I figured out my initial implementation was wrong because it did not handle errors properly, relying on side effects from the gossip topology implementation. Going to test the fix some more before PR review.
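
The kind of error-handling mistake described above can be sketched in a few lines. This is an illustration of the general pattern only, not the actual Polkadot code or the actual fix: the `Gauge` class and both update functions are hypothetical, and the assumption is that "our key is not in the set" surfaces as an error that the flawed version skips over.

```python
from typing import Sequence


class Gauge:
    """Tiny stand-in for a Prometheus gauge (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.value = 0

    def set(self, value: int) -> None:
        self.value = value


def update_stale(gauge: Gauge, active_set: Sequence[str], our_key: str) -> None:
    """Flawed pattern: 'not in the set' is treated as an error and skipped."""
    try:
        active_set.index(our_key)
    except ValueError:
        # Early return leaves the previous value (possibly 1) in place.
        return
    gauge.set(1)


def update_explicit(gauge: Gauge, active_set: Sequence[str], our_key: str) -> None:
    """Safer pattern: always write the gauge, including the 'not found' case."""
    gauge.set(1 if our_key in active_set else 0)
```

Calling `update_stale` once with the key in the set and once without leaves the gauge at 1, which matches the symptom reported in this issue; `update_explicit` resets it to 0.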

@paulormart
Contributor

Just a note on the flag name polkadot_node_is_authority: shouldn't this always be true if you start your node with the --validator flag, rather than depending on whether the validator is really in the active/waiting set? polkadot_node_is_authority might be misleading otherwise.

A node is acting as an Authority and not a Full node when the --validator flag is present.

@stakeworld
Contributor Author

> Just a note on the flag name polkadot_node_is_authority: shouldn't this always be true if you start your node with the --validator flag, rather than depending on whether the validator is really in the active/waiting set? polkadot_node_is_authority might be misleading otherwise.
>
> A node is acting as an Authority and not a Full node when the --validator flag is present.

Isn't there a difference between "validator" and "authority"? I would think a validator which gets into the active set becomes an authority. If not, then strictly speaking a term like "is_active_authority" would be more fitting, but for me as a validator the most important thing is that the metric works, so thanks @sandreim!

@paulormart
Contributor

Just saying that the initial purpose of that flag might have been to indicate the role status of the node as described when you start your node:

Jun 22 09:51:21 localhost polkadot[423744]: 2022-06-22 09:51:21 Parity Polkadot
Jun 22 09:51:21 localhost polkadot[423744]: 2022-06-22 09:51:21 ✌️  version 0.9.24-22836e55d41
Jun 22 09:51:21 localhost polkadot[423744]: 2022-06-22 09:51:21 ❤️  by Parity Technologies <admin@parity.io>, 2017-2022
Jun 22 09:51:21 localhost polkadot[423744]: 2022-06-22 09:51:21 📋 Chain specification: Polkadot
Jun 22 09:51:21 localhost polkadot[423744]: 2022-06-22 09:51:21 🏷  Node name: some-random-name
Jun 22 09:51:21 localhost polkadot[423744]: 2022-06-22 09:51:21 👤 Role: AUTHORITY

@stakeworld
Contributor Author

Good point, that's what shows in the logs and on the command line... You are right, the terms are confusing and ambiguous. Something to improve in the future, I think.

@Bruno-Lussan

Bruno-Lussan commented Jun 22, 2022

It seems that the flag name did not actually reflect its behaviour, which corresponded to being in the active set. It would indeed be more logical to make it reflect the role displayed in the log, as you proposed. Also, it would be very useful for us to have a flag that tells us that we are in the active set. This flag could be named polkadot_node_is_active or something similar. Sorry for my English and thank you for the time spent on this topic. And many thanks also to stakeworld.

@sandreim
Contributor

sandreim commented Jun 22, 2022

You are right about the confusion. Just to clarify, these metrics are updated at each session boundary to reflect the new session info and not the configured role. That being said, polkadot_node_is_authority means being in the active set, and polkadot_node_is_parachain_validator additionally implies that the node is doing parachain consensus work (which also requires being in the active set).
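
The relationship between the two metrics described above can be written down as a small sketch. The `SessionInfo` type, its field names, and `metrics_for_session` are hypothetical names chosen for illustration; they do not mirror the node's real data structures.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class SessionInfo:
    """Hypothetical per-session snapshot, for illustration only."""
    active_validators: FrozenSet[str]
    parachain_validators: FrozenSet[str]


def metrics_for_session(session: SessionInfo, our_key: str) -> Dict[str, int]:
    """Recompute both gauges from the new session, ignoring the configured role."""
    is_authority = our_key in session.active_validators
    # Parachain validation requires being in the active set as well.
    is_para_validator = is_authority and our_key in session.parachain_validators
    return {
        "polkadot_node_is_authority": int(is_authority),
        "polkadot_node_is_parachain_validator": int(is_para_validator),
    }


# Three consecutive sessions for a node whose key is "us".
sessions = [
    SessionInfo(frozenset({"us", "a"}), frozenset({"us"})),
    SessionInfo(frozenset({"us", "a"}), frozenset({"a"})),
    SessionInfo(frozenset({"a", "b"}), frozenset({"a"})),
]
history = [metrics_for_session(s, "us") for s in sessions]
```

Because both values are derived afresh at each session boundary, the third session yields 0 for both gauges even if the node was started with --validator.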

I'll follow up with a PR to remove this confusion.
