fix error in resolve funcs #815

brennanjl · 2024-06-11T19:14:21Z

While working through the Powerpod integration, I spotted a bug in our resolution code. If a resolution fails, it will halt consensus. This should not be the case, since resolutions are made to target databases, which can have arbitrary user-defined logic.

jchappelow

Seems fine. If it's not actually an issue with "arbitrary user-defined logic" but some other fatal error, something local to the node instead, would there at least be an app hash mismatch? Or would that go unnoticed for some arbitrary time?

brennanjl · 2024-06-11T19:38:12Z

> Seems fine. If it's not actually an issue with "arbitrary user-defined logic" but some other fatal error, something local to the node instead, would there at least be an app hash mismatch? Or would that go unnoticed for some arbitrary time?

There are only two cases I could see where the node does something node-specific, resulting in a different result:

1. The node loses connection to Postgres, in which case we will find out immediately after the failed resolution, when we try to roll the tx back.
2. The author creates something non-deterministic, in which case there would likely be an app hash mismatch.

Since we do not include user balances and validator power in app-hashes, there is a risk that a non-deterministic change there might not result in an app-hash mismatch until later, but this seems more like a Kwil issue to fix.
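The first case above can be sketched roughly as follows. This is a minimal illustration, not the actual Kwil code: the `Tx` interface, `fakeTx`, and `applyResolution` are hypothetical names, and the assumption is that each resolution runs in its own nested transaction. A failed resolution is rolled back and logged, and only a failed rollback (for example, a lost Postgres connection) is returned as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is a hypothetical stand-in for a nested database transaction.
type Tx interface {
	Commit() error
	Rollback() error
}

// fakeTx records what happened to it so the sketch runs without a database.
type fakeTx struct {
	rollbackErr error // simulates a lost connection when non-nil
	state       string
}

func (t *fakeTx) Commit() error { t.state = "committed"; return nil }

func (t *fakeTx) Rollback() error {
	if t.rollbackErr != nil {
		return t.rollbackErr
	}
	t.state = "rolled back"
	return nil
}

// applyResolution runs resolve inside tx. An error from resolve discards the
// changes but is not returned, so it cannot halt the caller. An error from
// Rollback or Commit is returned, since it points at a node-local problem.
func applyResolution(tx Tx, resolve func() error) error {
	if err := resolve(); err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			return fmt.Errorf("rollback after failed resolution: %w", rbErr)
		}
		fmt.Println("resolution failed, changes discarded:", err)
		return nil
	}
	return tx.Commit()
}

var (
	errUser = errors.New("user-defined logic failed")
	errConn = errors.New("connection to Postgres lost")
)

// failing is a resolve function that always fails.
func failing() error { return errUser }

func main() {
	healthy := &fakeTx{}
	fmt.Println(applyResolution(healthy, failing), healthy.state)

	broken := &fakeTx{rollbackErr: errConn}
	fmt.Println(applyResolution(broken, failing))
}
```

With a healthy transaction the failure is swallowed and the changes are rolled back. With a broken one, the rollback error surfaces right away.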

jchappelow · 2024-06-11T19:51:09Z

Point 2 there is not really the subject of this, I think. The question is, what is the meaning of a nil vs. non-nil error for the purposes of the extension author writing a ResolveFunc, which is modifying consensus logic for their network. It's a documentation issue. When defining a resolution, the understanding was previously: do not return a non-nil error from ResolveFunc unless it's a fatal error that should halt the node.

But we do have examples and built-in resolutions that look like:

	ResolveFunc: func(ctx context.Context, app *common.App, resolution *resolutions.Resolution) error {
		removeReq := &UpdatePowerRequest{}
		if err := removeReq.UnmarshalBinary(resolution.Body); err != nil {
			return fmt.Errorf("failed to unmarshal remove request: %w", err)
		}
		if removeReq.Power != 0 {
			// this should never happen since UpdatePowerRequest is only used for internal communication
			// between modules. Removes are sent from the client in a separate message.
			return fmt.Errorf("remove request with non-zero power")
		}
		return SetValidatorPower(ctx, app.DB, removeReq.PubKey, 0)
	},

So in part an error can be created by several things:

bad resolution Body that cannot be unmarshalled
invalid arguments according to the logic defined in the function (e.g. if removeReq.Power != 0)
invalid inputs to the DB function (e.g. maybe violates some unique constraint or any other query error)
DB error -- disk full, connection down, whatever

I agree the majority of cases should not halt consensus, so this change is best, but it's funny that we have no distinction between fatal and non-fatal errors here. If there is something funny with the node that caused SetValidatorPower to error where other nodes succeeded, I don't know where we'd find out. I would hope for a mechanism for an actual error to be emitted, like how an error returned to cometbft from any ABCI method causes a node halt, while some methods also have a field in the result like ACCEPT, REJECT, UNKNOWN.
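To make the idea concrete, here is a minimal sketch of one way such a distinction could look. None of these names exist in the current API: `FatalError`, `Outcome`, and `classify` are hypothetical. The assumption is that a resolve function wraps node-local failures in `FatalError`, and that the caller treats every other non-nil error as a rejected resolution.

```go
package main

import (
	"errors"
	"fmt"
)

// FatalError is a hypothetical wrapper marking an error that should halt the node.
type FatalError struct{ Err error }

func (e *FatalError) Error() string { return "fatal: " + e.Err.Error() }
func (e *FatalError) Unwrap() error { return e.Err }

// Outcome is a hypothetical result code, loosely modelled on ACCEPT/REJECT.
type Outcome int

const (
	Applied  Outcome = iota // resolution succeeded
	Rejected                // resolution failed deterministically; discard and continue
	Halt                    // node-local failure; stop before committing bad state
)

// classify maps the error returned by a resolve function to an Outcome.
func classify(err error) Outcome {
	var fatal *FatalError
	switch {
	case err == nil:
		return Applied
	case errors.As(err, &fatal):
		return Halt
	default:
		return Rejected
	}
}

var (
	errBadBody  = errors.New("cannot unmarshal resolution body")
	errDiskFull = &FatalError{Err: errors.New("disk full")}
)

func main() {
	fmt.Println(classify(nil) == Applied)
	fmt.Println(classify(errBadBody) == Rejected)
	// errors.As sees through additional wrapping added by callers.
	fmt.Println(classify(fmt.Errorf("set power: %w", errDiskFull)) == Halt)
}
```

Using `errors.As` rather than a type assertion means the fatal marker survives any `%w` wrapping added between the database call and the caller.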

brennanjl · 2024-06-11T19:54:44Z

I see what you're saying. IMO, it still seems like this could be solved with improved apphash tracking. Right now, only user schemas affect apphash. If the validator schema affected apphash too, then the node that gets an unexpected result / error from changing validator power would recognize their state diverged in the next block, right?
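As a rough sketch of what I mean, assuming we had a hash of the user schemas and a map of validator powers in hand: `validatorHash` and `appHash` below are hypothetical helpers, not existing functions, and the encoding is only illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// validatorHash hashes validator powers in sorted key order, so every node
// derives the same digest from the same state regardless of map iteration order.
func validatorHash(power map[string]int64) [32]byte {
	keys := make([]string, 0, len(power))
	for k := range power {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	var buf [8]byte
	for _, k := range keys {
		// Length-prefix the key so adjacent entries cannot be confused.
		binary.BigEndian.PutUint64(buf[:], uint64(len(k)))
		h.Write(buf[:])
		h.Write([]byte(k))
		binary.BigEndian.PutUint64(buf[:], uint64(power[k]))
		h.Write(buf[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// appHash folds the validator digest into the (existing) user-schema digest.
func appHash(userSchemas, validators [32]byte) [32]byte {
	return sha256.Sum256(append(userSchemas[:], validators[:]...))
}

func main() {
	var userSchemas [32]byte // placeholder for the existing user-schema hash
	a := appHash(userSchemas, validatorHash(map[string]int64{"val1": 10, "val2": 5}))
	b := appHash(userSchemas, validatorHash(map[string]int64{"val1": 10, "val2": 0}))
	fmt.Println("diverged:", a != b)
}
```

A node whose validator update failed locally would compute a different digest and see the mismatch at the next block.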

jchappelow · 2024-06-11T20:00:04Z

Hypothetically. I think it's handled more easily by hard failing where the extension author does not anticipate an error. You cannot roll back an already-committed app state change. If you commit something erroneous (and we can know it is), then you diverge permanently and have to resync if you want to get back on the network, even if an apphash mismatch catches it right away. Not a good outcome.

I think the API is simply lacking for the resolution function, or the ResolveFunc should log (not the app) when it's not fatal, and return an error when it's fatal.
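A minimal sketch of the second option, assuming the resolve function is handed a logger. Everything here is hypothetical and simplified: `request`, `resolve`, and the `setPower` callback stand in for the real types, the body is JSON only to keep the example short, and treating every store error as fatal is a simplifying assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"log"
)

// request is a hypothetical resolution body.
type request struct {
	PubKey string `json:"pub_key"`
	Power  int64  `json:"power"`
}

// resolve logs and returns nil for failures that every node would hit
// identically, and returns an error only when the store itself fails.
func resolve(body []byte, logger *log.Logger, setPower func(pubKey string, power int64) error) error {
	var req request
	if err := json.Unmarshal(body, &req); err != nil {
		logger.Printf("ignoring resolution with bad body: %v", err)
		return nil // deterministic: all nodes see the same bytes
	}
	if req.Power != 0 {
		logger.Printf("ignoring remove request with non-zero power %d", req.Power)
		return nil // deterministic: depends only on the body
	}
	if err := setPower(req.PubKey, 0); err != nil {
		return fmt.Errorf("set validator power: %w", err) // assumed fatal
	}
	return nil
}

var (
	discard  = log.New(io.Discard, "", 0) // silent logger for callers that do not want output
	errStore = errors.New("connection down")
)

func main() {
	ok := func(string, int64) error { return nil }
	down := func(string, int64) error { return errStore }

	fmt.Println(resolve([]byte("not json"), log.Default(), ok))
	fmt.Println(resolve([]byte(`{"pub_key":"ab","power":0}`), log.Default(), down))
}
```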

jchappelow · 2024-06-11T20:03:12Z

In any case, if I were writing an extension, I would look at the docs for the field:

// ResolveFunc is a function that is called once a resolution has
// received a required number of votes, as defined by the
// ConfirmationThreshold. It is given a readwrite database
// connection and the information for the resolution that has
// been confirmed. All nodes will call this function as a part of
// block execution. It is therefore expected that the function is
// deterministic, regardless of a node's local configuration.
ResolveFunc func(ctx context.Context, app *common.App, resolution *Resolution) error

When should I return an error? How does the app handle an error? How should I deal with internal errors?

The contract for that API could use some clarification.

brennanjl · 2024-06-11T20:09:49Z

> The contract for that API could use some clarification.

Agreed

brennanjl · 2024-06-11T20:13:53Z

> I think the API is simply lacking for the resolution function, or the ResolveFunc should log (not the app) when it's not fatal, and return an error when it's fatal.

Unfortunately, this is hard to delineate right now. For example, if procedure execution failed due to an error() function, and if procedure execution failed due to a lost connection to Postgres, the error would be returned in the same place. Maybe something we can solve in the future.

fix error in resolve funcs

ed724f8

brennanjl added this to the v0.8.1 milestone Jun 11, 2024

brennanjl requested a review from charithabandi June 11, 2024 19:14

brennanjl added the backport-to-release-v0.8 backport to release-v0.8 branch label Jun 11, 2024

brennanjl removed this from the v0.8.1 milestone Jun 11, 2024

jchappelow approved these changes Jun 11, 2024

brennanjl merged commit 96b79f7 into main Jun 11, 2024
2 checks passed

brennanjl deleted the failed-resolution branch June 11, 2024 19:42

brennanjl mentioned this pull request Jun 11, 2024

Backport fixes #816

Merged

brennanjl mentioned this pull request Jun 11, 2024

Documentation: Clarify Extension Error Returns #817

Open
