Error after upgrading polkadot-parachain to v1.1.0-polkadot on Asset Hub #1566

Closed
ilhanu opened this issue Sep 14, 2023 · 8 comments · Fixed by #1788
Labels
I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@ilhanu

ilhanu commented Sep 14, 2023

I've upgraded 6 system chain nodes from v1.0.0 to v1.1.0 with ansible-galaxy, using the binary published in the releases. 5 out of 6 upgrades went smoothly. One of the machines is giving an error which I can't solve.

2023-09-14 13:25:32.761  INFO main sc_cli::runner: Polkadot parachain
2023-09-14 13:25:32.761  INFO main sc_cli::runner: ✌️  version 1.1.0-783579b4a49
2023-09-14 13:25:32.761  INFO main sc_cli::runner: ❤️  by Parity Technologies <admin@parity.io>, 2017-2023
2023-09-14 13:25:32.761  INFO main sc_cli::runner: 📋 Chain specification: Polkadot Asset Hub
....
....
2023-09-14 13:13:14.603  WARN tokio-runtime-worker sc_service::builder: [Parachain] The NetworkStart returned as part of `build_network` has been silently dropped
Error: Service(Client(RuntimeApiError(Application(Execution(Other("Exported method AuraApi_slot_duration is not found"))))))
asset-hub-dot.service: Main process exited, code=exited, status=1/FAILURE
asset-hub-dot.service: Failed with result 'exit-code'.

There are no other warnings or errors in the start-up logs. After that, the service just keeps restarting.

I have deleted the DB, since it was a redundant node, and tried to re-sync it, but the error persists.

The exec parameters:

COMMON="\
--base-path /home/statemint/.local/share/polkadot \
--detailed-log-output"
RC_CHAIN="--chain polkadot"
RC_ADDR="\
--listen-addr=/ip4/0.0.0.0/tcp/30338 \
--public-addr=/ip4/xxxx/tcp/30338"
RC_CONNECTIONS="--in-peers 25 --out-peers 25"
RC_DB="\
--database paritydb"
RC_PRUNING="--state-pruning=256"
RC_METRICS="\
--prometheus-external \
--prometheus-port 9620"
RC_RPC="--rpc-port 10000"


PC_NAME="--name 'Staker Space [2]'"
PC_ROLE_SPECIFIC="\
--collator"
PC_CHAIN="--chain asset-hub-polkadot"
PC_REMOTE_RC_URLS=""
PC_ADDR="\
--listen-addr=/ip4/0.0.0.0/tcp/31555 \
--public-addr=/ip4/xxxx/tcp/31555"
PC_CONNECTIONS="--in-peers 25 --out-peers 25"
PC_DB="\
--database paritydb"
PC_PRUNING="--state-pruning=256"
PC_LOGS=""
PC_METRICS="\
--prometheus-external \
--prometheus-port 9621"
PC_WS="\
--rpc-max-connections 100"
PC_RPC="--rpc-port 10001"

The server:

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
kernel: 5.4.0-144-generic
cpu: Core(TM) i7-9700K CPU @ 3.60GHz

@github-actions bot added the I10-unconfirmed label Sep 14, 2023

@ilhanu
Author

ilhanu commented Sep 14, 2023

I just tried it on a different machine with different chain specs on Polkadot: with collectives-polkadot and bridgehub-polkadot it works fine. It again errors out with statemint/asset-hub-dot.

The production server also got the upgrade and is collating blocks on assethub-dot, so there is an inconsistency; I'm not sure what is causing it.

@georgepisaltu
Contributor

I managed to reproduce the issue locally by trying to start a node with --chain asset-hub-polkadot. Running with trace logs shows:

main state: [Parachain] Return ext_id=7552 result=Err(Other("Exported method AuraApi_slot_duration is not found"))

In my local tests, this happens only with --chain asset-hub-polkadot, while --chain asset-hub-polkadot-local and --chain asset-hub-polkadot-dev seem to work fine. I also tested the Polkadot bridge hub runtime with --chain bridge-hub-polkadot, and that seems to work fine too.

> The production server also got the upgrade and is collating blocks on assethub-dot, so there is an inconsistency, not sure what is causing it.

It isn't an intermittent issue for me.

I don't see anything abnormal in the logs so far, but they're massive in trace mode so I need to look more. What seems to be certain right now is that it's not a runtime problem regarding the API, since the same runtime blob works with one chainspec but not with the other. It might be that the chain isn't transitioning smoothly from shell, which doesn't have the runtime API implemented in this version, so it errors out when the node can't find the function in the export section of the WASM blob.

@georgepisaltu
Contributor

> I have deleted the DB, since it was a redundant node and tried to re-sync it, but the error persists.

Does this mean that you sync successfully but then the node crashes with that error again, or that you can't even sync? Did you try to purge the chain between restart attempts using purge-chain?

In my local tests the node can't get past the first block while the runtime is still shell, which is to be expected since it doesn't have any peers and it doesn't have a synced DB behind it. Once in sync, the runtime should no longer be shell and it shouldn't show the error again. What's unexpected is the call to the slot_duration Aura API while the node hasn't yet transitioned from the shell runtime, but I think it's a separate issue and you shouldn't encounter it again if you manage to sync. The fact that the other nodes you have are working fine after the upgrade, including the asset hub one, seems to support this theory.

Can you provide full logs at debug level (trace will be too big) and attach the JSON chainspec you're using, if it's different from the default one in cumulus/parachains/chainspec?

@ilhanu
Author

ilhanu commented Sep 27, 2023

I'm using the chainspec baked into the binary, asset-hub-polkadot, nothing other than that. I think you're right: I might have played around with the DB, switching from rocksdb to paritydb on one of the nodes while doing the upgrade, and this triggered the syncing bug. That forced the node to sync the system chain from the start, which caused the bug to show up on our node.

I didn't use the CLI's purge-chain; I most likely just emptied the base-path folder. But, as you witnessed, this didn't get the chain to sync from genesis.

I think this is still worrisome: if you wanted to do a fresh sync, it wouldn't work with this latest stable version for asset-hub-polkadot on mainnet. So I switched back to v1.0.0, synced to the tip, and then switched to v1.1.0 again; this went like a usual upgrade. This also proves that

There is not much to log; the node would just restart in a loop with the above error. Also, I have already moved to the new version, as we needed these redundant nodes to be in sync.

@georgepisaltu
Contributor

Between v1.0.0 and v1.1.0 async backing was merged and, among other changes, it changed the client implementation to always make a call to the Aura API slot_duration in the start_generic_aura_node function that asset-hub nodes would use. It is that specific call that is causing the crash. The asset-hub runtime is a special case because it started off as a shell runtime. This shell runtime is hardcoded as a WASM blob into the genesis chainspec in the CODE section, and the blob in that genesis chainspec didn't have the Aura API configured.

When a user tries to start a node from scratch with --chain asset-hub-polkadot, without a DB and using only the genesis chainspec, the node fails while trying to sync to get the asset-hub runtime, because the client makes that faulty runtime API call.
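
To make the failure mode concrete, here is a minimal toy model in Rust. It is not the real client code: `Runtime`, `shell_runtime`, `aura_runtime` and `start_node` are hypothetical names made up for illustration, a runtime is reduced to the set of method names it exports, and the 12000 ms slot duration is a placeholder value. It only shows why an unconditional call fails against the genesis shell runtime and succeeds once the runtime exports the Aura API.

```rust
use std::collections::HashSet;

/// Toy stand-in for a runtime WASM blob: only the names it exports.
struct Runtime {
    exports: HashSet<&'static str>,
}

impl Runtime {
    /// Mimics a runtime API call: fails if the method is not exported.
    fn call(&self, method: &str) -> Result<u64, String> {
        if self.exports.contains(method) {
            Ok(12_000) // placeholder slot duration in ms (assumption)
        } else {
            Err(format!("Exported method {method} is not found"))
        }
    }
}

/// Hypothetical shell runtime as found at genesis: no Aura API.
fn shell_runtime() -> Runtime {
    Runtime { exports: HashSet::from(["Core_version"]) }
}

/// Hypothetical upgraded runtime that exports the Aura API.
fn aura_runtime() -> Runtime {
    Runtime { exports: HashSet::from(["Core_version", "AuraApi_slot_duration"]) }
}

/// Models a start-up path that always queries the slot duration,
/// whatever runtime is currently the best one known to the node.
fn start_node(best_runtime: &Runtime) -> Result<u64, String> {
    best_runtime.call("AuraApi_slot_duration")
}

fn main() {
    // Fresh node: only the genesis (shell) runtime is available.
    println!("{:?}", start_node(&shell_runtime()));
    // Synced node: the runtime has already been upgraded.
    println!("{:?}", start_node(&aura_runtime()));
}
```

In this model a node that already has a synced DB never hits the error, which matches the observation that in-place upgrades worked while a sync from genesis did not.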

We merged a PR with another one soon to follow to enable Aura and related APIs on all runtimes, but we need to create a fix that accounts for asset-hub chains which have the old shell runtime encoded at genesis.

Before the async backing PR, there was a section of the code in cumulus/polkadot-parachain/src/service.rs, more specifically a WaitForAuraConsensus future, which was waiting for the shell runtime to upgrade to a generic Aura compatible runtime. A fix could look something like that, a special function to handle the asset-hub-polkadot and other chains in the same situation (asset-hub-kusama and asset-hub-westend), though I'm not sure how it fits into the async backing efforts.
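
As a rough sketch of the idea only (not the old WaitForAuraConsensus code and not the eventual fix), the toy Rust below picks the consensus logic per block depending on whether the runtime at that block exposes the Aura API, instead of assuming Aura from genesis. All names here (`Block`, `Consensus`, `consensus_for`, `first_aura_block`, `sample_chain`) and the slot duration value are hypothetical.

```rust
/// Toy block: records only whether the runtime active at this block
/// exposes the Aura API.
struct Block {
    number: u32,
    has_aura_api: bool,
}

#[derive(Debug, PartialEq)]
enum Consensus {
    /// Shell runtime: nothing Aura-related may be called yet.
    Shell,
    /// Aura-compatible runtime: slot duration can be queried.
    Aura { slot_duration_ms: u64 },
}

/// Choose the consensus logic for a block instead of assuming Aura.
fn consensus_for(block: &Block) -> Consensus {
    if block.has_aura_api {
        Consensus::Aura { slot_duration_ms: 12_000 } // placeholder value
    } else {
        Consensus::Shell
    }
}

/// First block at which the Aura API is available, if any. A waiting
/// future would resolve once such a block has been imported.
fn first_aura_block(chain: &[Block]) -> Option<u32> {
    chain.iter().find(|b| b.has_aura_api).map(|b| b.number)
}

/// Hypothetical chain where the runtime upgrade lands at block 3.
fn sample_chain() -> Vec<Block> {
    (0..5)
        .map(|n| Block { number: n, has_aura_api: n >= 3 })
        .collect()
}

fn main() {
    let chain = sample_chain();
    for block in &chain {
        println!("#{} -> {:?}", block.number, consensus_for(block));
    }
    println!("Aura available from block {:?}", first_aura_block(&chain));
}
```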

@bkchr
Member

bkchr commented Sep 29, 2023

> A fix could look something like that, a special function to handle the asset-hub-polkadot and other chains in the same situation (asset-hub-kusama and asset-hub-westend), though I'm not sure how it fits into the async backing efforts.

This is not in any way related to async backing. The async backing changes may have broken it, but this just means we need to fix it. You already analyzed this correctly, and we need to bring back the old logic so that the "hard fork" is handled correctly. Do you think that you can fix this?

As before, the implementation should handle this in the polkadot-parachain binary and not be baked into any other crate.

@georgepisaltu
Contributor

> Do you think that you can fix this?

I think so. I'll have a try at it and I'll ask for help if I can't do it.

@georgepisaltu
Contributor

@bkchr PR with proposed fix is up, PTAL

3 participants