-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Post upgrade with x/upgrade queries not working #7385
Comments
Small update. After checking these are the details of the update: The time in between was due to the initial wrong migration process and its fix release. If you try to query using the REST APIs for any height after the upgrade, the same error shows up: {
"error": "invalid request: failed to load state at height 513575; version does not exist (latest height: 517249)"
} |
@RiccardoM I've tested couple of upgrades locally, everything went smooth and I am able to query LCD without any issues. I didn't cover jailing events and withdraw commissions yet, but tried to cover other possible transactions. Do you mind sharing some info about your fix for the |
Akash Network is going mainnet with |
You can see it here. It was just a small addition to make sure that when reading the old state, Amino could de-serialize maps properly. It was just some application-layer bug. I think this has no impact on this bug. |
I'd be curious to see which specific IAVL substore is failing to load. For example, there might be a substore that is no longer in use after the upgrade, such that new versions aren't written to it - the failing code path tries to load all substores. Might be an idea to write a script that inspects the latest version root of all IAVL substores in the application state. |
Okay. Is this only for REST or are you facing issues with CLI too? |
The same error is returned also for all CLI queries. |
UpdateI have now finished re-syncing a new node from scratch using |
Ok, so it's not related to pruning. I'm not really on the SDK team, so I don't have a lot of insight into what might cause this, but if you could stop a node, create a tarball of the entire application state, and make it available somewhere I can have a look at the data stores. |
@erikgrinaker Suppose the application lives inside |
At least with the standard SDK layout I'd need |
On a quick look, I believe we are facing store issues due to these changes: https://github.com/desmos-labs/desmos/blob/v0.12.0/x/posts/keeper/v0120migration.go#L10-L56 |
Oh, yeah, I think that migration is wrong. It fetches the post value into
It should be using |
I should also mention that it's not safe to write to a key domain while iterating over it. This is documented for tm-db, but doesn't seem to be documented for IAVL or the SDK: https://github.com/tendermint/tm-db/blob/17536ad9315e5c54dd327d4d49c77cad4fbfbb08/types.go#L48 |
Turns out that IAVL doesn't actually use |
EDIT: This link does not download the correct status. I'm removing it. @anilcse @erikgrinaker You're both right. That migration is not correct and I should use the |
Even if it wasn't the case, which is the appropriate way of doing this? I was thinking about
Are there better alternatives? |
Modifying while iterating is generally a bit iffy - I'll have to have a closer look at IAVL to see how it behaves when this happens. The simplest approach is to just collect all the keys in a slice, close the iterator, then iterate over the slice to read and update them - at least this way you don't need to keep all values in memory, just the keys. This is assuming that you have enough memory to hold all keys. |
Ok, I'll wait for a new one then. |
I think I got what you mean. Could you please check if this is better? https://github.com/desmos-labs/desmos/blob/version/v0.12.2/x/posts/keeper/v0120migration.go#L15
It might take a while. It's actually ~20GB of database |
You need to store |
Ok, I ran a script on the database which inspected the latest IAVL version of each substore in the database, it found this:
Notice the Notice that 18596 is almost exactly the number of heights since the upgrade (at 513573). This is not a coincidence. This is probably a new store that was registered during the upgrade, but it starts from version 1 (as all IAVL stores do). Either the SDK needs to handle this case during upgrades by using the initial height API currently included in the IAVL 0.15.x prereleases, or it should just use a single IAVL store for all modules (which would also fix the partial commit problem in #6370) - I'm heavily in favor of the latter. |
Great debugging @erikgrinaker. The issue totally makes sense now. However, for the solution I'm not totally convinced this responsibility should be placed on the SDK. Because either way, the fix is to call to
|
Sure, that's your call, I don't know the details here - I'll backport the IAVL initial height stuff for 0.14. |
Occurred to me that |
I've backported Let me know when you've tried it out, and I'll cut 0.14.1. |
That's unfortunate. How do you recommend we proceed here? |
Off the top of my head, maybe pad out the new store with empty versions corresponding to the versions in the other stores (which will depend on historical pruning settings and such), such that all stores have the same versions. Other options would be to check which stores have which ranges via e.g. AvailableVersions or new methods, or return specific error types that the SDK can handle appropriately. Longer term, use a single IAVL store. |
Hey @alexanderbez, @erikgrinaker are there any news on this? Is there something I can do to help tackle this? |
No news, not clear on how to handle loading new stores on upgrades. I think padding or doing anything fancy in the app construction is likely to be buggy and hacky. I'm liking the idea of where @erikgrinaker was heading with modifying (Current logic) func (rs *Store) CacheMultiStoreWithVersion(version int64) (types.CacheMultiStore, error) {
cachedStores := make(map[types.StoreKey]types.CacheWrapper)
for key, store := range rs.stores {
switch store.GetStoreType() {
case types.StoreTypeIAVL:
// If the store is wrapped with an inter-block cache, we must first unwrap
// it to get the underlying IAVL store.
store = rs.GetCommitKVStore(key)
// Attempt to lazy-load an already saved IAVL store version. If the
// version does not exist or is pruned, an error should be returned.
iavlStore, err := store.(*iavl.Store).GetImmutable(version)
if err != nil {
return nil, err // this fails because store N is NEW and obviously doesn't exist at 'version'
}
cachedStores[key] = iavlStore
default:
cachedStores[key] = store
}
}
return cachemulti.NewStore(rs.db, cachedStores, rs.keysByName, rs.traceWriter, rs.traceContext), nil
} So I think we need to look into modifying this method s.t. it doesn't error possibly when it can't load that version. But, we have to be sure that the query is handled safely still and
Do you feel like this is something you can look into? To start, it could be as simple as just not returning an error. |
But I'm not sure that works either, because the new store needs to be at the same version as the others... |
@alexanderbez I've tried implementing a solution inside #7415. Please let me know what you think about it. |
Origin: #7415 Closes: #7385 Thanks: @RiccardoM for the backport.
Summary of Bug
Today we ran an automatic upgrade using the
x/upgrade
module.After a first error in the upgrade during the de-serialization of the data from the old state, we pushed a fix and every validator was able to upgrade correctly.
After the upgrade, the chain continued to create blocks correctly. You can see it running from here.
Unluckily, we later found out that all the queries are now returning the same error:
From the REST APIs:
From the CLI:
This happened on all the nodes that underwent the upgrade procedure, no matter what their
pruning
strategies was. We tested it with ones havingpruning = "nothing
,pruning = "everything"
andpruning = "default"
. All with the same result.Currently we're testing if restarting a node with
pruning = "nothing"
, and making sure the update works at the first try, solves the problem.Version
0.39.1
Steps to Reproduce
For Admin Use
The text was updated successfully, but these errors were encountered: