Polkadot validator stops gaining points #4321
Comments
syslog from new server |
syslog from original docker (only has logs from most recent |
I see your node has been running with two different node identities. Unfortunately, that is bound to cause problems, and I think that's why you are getting 0s now; see #3673. I'm not sure when you last rotated your keys, but it seems you are now using an old identity, not your newest one; see the timestamps, at So, to recover, first make sure you keep your identity stable, then rotate your keys and the issue should go away. |
The timestamp 17:36:58 was when I corrected the the issue occurred overnight when the ID was stable and the node achieved an A+ and then an F. This morning I moved the node to a new server: |
You need your key to be present on the new machine before you start the node. Otherwise, if it starts with a new identity and a correct AuthorityId, the node will start publishing the new address on the DHT. Once that happens, because DHT nodes replicate records regularly, there is no determinism as to which address your peers will see; even if you connect to them, they might conclude that the other PeerId should fulfil the job of this AuthorityId. As a matter of fact, it is never safe to publish a new network identity while you are in the active set, because it will take 36h for that record to expire.
Even if the identity is now stable, the old one will still exist on some nodes, and there is no way to know which one your peers think you are using. That is why a key rotation fixes it: by changing your AuthorityId you change the key other nodes use to look up your node in the network. |
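To make the 36h window concrete, here is a minimal sketch of the arithmetic, assuming the record lifetime is exactly the 36 hours quoted above. The function names and the example timestamp are hypothetical and only illustrate the reasoning; they are not taken from the node's code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 36 hours, the record lifetime quoted in this thread.
RECORD_TTL = timedelta(hours=36)


def stale_record_expiry(last_published: datetime) -> datetime:
    """Earliest time at which a record published at `last_published` has expired."""
    return last_published + RECORD_TTL


def old_identity_may_still_be_cached(last_published: datetime, now: datetime) -> bool:
    """True while peers could still hold the record for the previous identity."""
    return now < stale_record_expiry(last_published)


if __name__ == "__main__":
    # Hypothetical time at which the old identity was last published.
    published = datetime(2024, 4, 29, 8, 0, tzinfo=timezone.utc)
    print(stale_record_expiry(published).isoformat())
```

Until that expiry time has passed, a node that looked fine for one session can still be looked up under the old identity by some peers, which matches the A+ then F pattern described in this thread.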
Apologies if it came across the wrong way. The action matters because I'm trying to understand what happened; unfortunately, with this type of bug, all the details that you can offer would help us understand what happened and how to fix it. We know for sure that changing the PeerId and the public address of your node is bound to cause problems until the past record expires 36h later. Even if it seemed your node got to work correctly for a session, the old identity is still cached on other nodes. Fixes are underway to make this mistake harder to make: #3673. Looking at the turboflakes app, it seems your node started getting Fs at session Now, I guess the problem you want us to focus on is what happened at session |
Here is the timeline: Expecting perhaps it could be hardware/ disk/ cpu/ network card issues: in 8598 it got D (after the actions above) |
Scanned the network with https://github.com/lexnv/subp2p-explorer: Your node seems to be advertising a lot of public addresses. Bear in mind that the other nodes would accept only 10 addresses: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/authority-discovery/src/worker.rs#L73
Not sure how you ended up with so many records of different IPs in the DHT, but that is definitely the culprit of your issues. I assume the first AuthorityId is the one from before you rotated the keys and the second one is the one from after the rotation; that probably worked until enough addresses accumulated. Tagging a few more people who know more about networking and infrastructure than me, @dmitry-markin, @BulatSaif, @lexnv, @altonen, but this looks like a repeat of #2523 and #3519 (comment). |
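As a rough illustration of why accumulated addresses hurt, here is a minimal sketch that assumes peers de-duplicate the advertised list and keep only the first 10 entries in record order. The constant name, the selection rule, and the example addresses are assumptions made for illustration; the linked worker.rs is the authoritative source for the real behaviour.

```python
# Assumption: peers accept at most 10 addresses per authority, as stated above.
MAX_ACCEPTED_ADDRESSES = 10


def accepted_addresses(advertised: list[str]) -> list[str]:
    """Return the addresses a peer would keep under the assumed rule:
    drop duplicates (preserving order), then keep the first 10."""
    unique: list[str] = []
    for addr in advertised:
        if addr not in unique:
            unique.append(addr)
    return unique[:MAX_ACCEPTED_ADDRESSES]


if __name__ == "__main__":
    # Hypothetical record: twelve stale private addresses, then the correct one.
    stale = [f"/ip4/172.17.0.{i}/tcp/30333" for i in range(2, 14)]
    correct = "/ip4/195.144.22.130/tcp/15032"
    kept = accepted_addresses(stale + [correct])
    print(correct in kept)
```

Under these assumptions the one reachable address is cut off once enough stale entries sit ahead of it, so peers end up dialing only addresses that cannot work.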
Wow, that's a really useful tool! |
The records are signed with your authority id key, so it is more like DoS-ing yourself. Not sure why your polkadot node decides to publish all those records; perhaps your pod/machine changes IPs that often? Could you also post the command line you use for starting your node? |
from previous server (docker):
polkadot-validator:
  container_name: polkadot-validator-1
  image: parity/polkadot:v1.10.0
  restart: unless-stopped
  ports:
    - "15032:30333" # p2p port
    - "15033:15033" # prometheus port
    - "15034:9933" # rpc port
    - "15035:9944" # ws port
  volumes:
    - /media/nvme-2tb/polkadot-val-1:/data
  command: [
    "--chain", "polkadot",
    "--validator",
    "--name", "METASPAN (also try POOL #18)",
    "--telemetry-url", "wss://telemetry-backend.w3f.community/submit 1",
    "--base-path", "/data",
    "--database", "paritydb",
    "--pruning", "256",
    # "--sync", "warp",
    "--allow-private-ipv4",
    "--discover-local",
    "--listen-addr", "/ip4/0.0.0.0/tcp/30333",
    "--public-addr", "/ip4/195.144.22.130/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD",
    "--prometheus-port", "15033",
    "--prometheus-external",
    # RPC node
    #"--rpc-external",
    "--rpc-methods", "safe",
    #"--rpc-methods", "unsafe",
    "--rpc-cors", "all",
from the new service file:
|
ok, so these are all private addresses and would never be reachable by external parties.
These 2 could be routable, but only the 1st one (relating to the --public-addr) is correct.
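The distinction drawn here (private, routable but wrong, correct) can be checked mechanically. Below is a minimal sketch using Python's standard `ipaddress` module; the helper names and the example list are hypothetical, and only the `/ip4/.../tcp/...` multiaddress form seen in this thread is handled.

```python
import ipaddress


def parse_ip4_tcp(multiaddr: str) -> tuple[ipaddress.IPv4Address, int]:
    """Extract (ip, port) from '/ip4/<ip>/tcp/<port>[/p2p/<peer id>]'."""
    parts = multiaddr.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "ip4" or parts[2] != "tcp":
        raise ValueError(f"unsupported multiaddress: {multiaddr}")
    return ipaddress.IPv4Address(parts[1]), int(parts[3])


def classify(multiaddr: str, public_addr: str) -> str:
    """Label an advertised address relative to the configured --public-addr."""
    ip, port = parse_ip4_tcp(multiaddr)
    if not ip.is_global:
        return "private"  # never reachable by external parties
    if (ip, port) == parse_ip4_tcp(public_addr):
        return "correct"
    return "routable but wrong"


if __name__ == "__main__":
    public_addr = "/ip4/195.144.22.130/tcp/15032"  # from the posted config
    # Hypothetical advertised addresses, for illustration only.
    for addr in [
        "/ip4/172.17.0.2/tcp/30333",
        "/ip4/192.168.1.10/tcp/30333",
        "/ip4/195.144.22.130/tcp/15032",
        "/ip4/195.144.22.130/tcp/30333",
    ]:
        print(addr, "->", classify(addr, public_addr))
```

The last example shows one way a routable address can still be wrong: the public IP paired with the container's internal port rather than the mapped host port.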
|
This one |
@dcolley I see your validator is fixed now; can I go ahead and close this issue? Also, for future reference, could you share which CLI knobs you removed or added to get it fixed? Thank you! |
In the end I set up a new validator with a new ID, then transferred the old ID to the new machine and rotated keys. |
With no changes to the installation, this validator stopped gaining points:
https://apps.turboflakes.io/?chain=polkadot#/validator/5HgM1fbhs7uRCB9KxNquaFBGLPxmQfYEcHT8GbNSrZ9HRWEY?mode=history
After rotating keys and restarting the node, it started gaining points.
In the next session it achieved A+, then stopped again. Next session got an F.
This morning I created a new node, and transferred the network/secret_ed25519.
The new node is not accumulating any points (yet)