Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix error in resolve funcs #815

Merged
merged 1 commit into from
Jun 11, 2024
Merged

fix error in resolve funcs #815

merged 1 commit into from
Jun 11, 2024

Conversation

brennanjl
Copy link
Collaborator

While working through the Powerpod integration, I spotted a bug in our resolution code. If a resolution fails, it will halt consensus. This should not be the case, since resolutions are made to target databases, which can have arbitrary user-defined logic.

@brennanjl brennanjl added this to the v0.8.1 milestone Jun 11, 2024
@brennanjl brennanjl requested a review from charithabandi June 11, 2024 19:14
@brennanjl brennanjl added the backport-to-release-v0.8 backport to release-v0.8 branch label Jun 11, 2024
@brennanjl brennanjl removed this from the v0.8.1 milestone Jun 11, 2024
Copy link
Member

@jchappelow jchappelow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine. If it's not actually an issue with "arbitrary user-defined logic" but some other fatal error, instead something local to the node, would there least be an app hash mismatch? Or would that be go unnoticed for some arbitrary time?

@brennanjl
Copy link
Collaborator Author

Seems fine. If it's not actually an issue with "arbitrary user-defined logic" but some other fatal error, instead something local to the node, would there least be an app hash mismatch? Or would that be go unnoticed for some arbitrary time?

There are only two cases I could see where the node does something node-specific, resulting in a different result:

  1. The node loses connection to Postgres, in which case we will find out immediately after the failed resolution when we try to roll the tx back.
  2. The author creates something with a non-deterministic, in which case there likely would be an apphash mismatch.

Since we do not include user balances and validator power in app-hashes, there is a risk that a non-deterministic change there might not result in an app-hash mismatch until later, but this seems more like a Kwil issue to fix.

@brennanjl brennanjl merged commit 96b79f7 into main Jun 11, 2024
2 checks passed
@brennanjl brennanjl deleted the failed-resolution branch June 11, 2024 19:42
@brennanjl brennanjl mentioned this pull request Jun 11, 2024
@jchappelow
Copy link
Member

Point 2 there is not really the subject of this, I think. The question is, what is the meaning of a nil vs. non-nil error for the purposes of the extension author writing a ResolveFunc, which is modifying consensus logic for their network. It's a documentation issue. When defining a resolution, the understanding was previously: do not return a non-nil error from ResolveFunc unless it's a fatal error that should halt the node.

But we do have examples and built-in resolutions that look like:

	ResolveFunc: func(ctx context.Context, app *common.App, resolution *resolutions.Resolution) error {
		removeReq := &UpdatePowerRequest{}
		if err := removeReq.UnmarshalBinary(resolution.Body); err != nil {
			return fmt.Errorf("failed to unmarshal remove request: %w", err)
		}
		if removeReq.Power != 0 {
			// this should never happen since UpdatePowerRequest is only used for internal communication
			// between modules. Removes are sent from the client in a separate message.
			return fmt.Errorf("remove request with non-zero power")
		}
			return SetValidatorPower(ctx, app.DB, removeReq.PubKey, 0)
	},

So in part an error can be created by several things:

  1. bad resolution Body that cannot be unmarshalled
  2. invalid arguments according to the logic defined in the function (e.g. if removeReq.Power != 0)
  3. invalid inputs to the DB function (e.g. maybe violates some unique constraint or any other query error)
  4. DB error -- disk full, connection down, whatever

I agree the majority of cases should not halt consensus, so this change is best, but it's funny that we have no distinction between fatal an non-fatal errors here. If there is something funny with the node that caused SetValidatorPower to error where other nodes succeeded, I don't know where we'd find out. I would hope for a mechanism for an actual error to be emitted, like how an error returned to cometbft from any ABCI method causes a node halt, but some methods have a field in the result like ACCEPT, REJECT, UNKNOWN.

@brennanjl
Copy link
Collaborator Author

I see what you're saying. IMO, it still seems like this could be solved with improved apphash tracking. Right now, only user schemas affect apphash. If the validator schema affected apphash too, then the node that gets an unexpected result / error from changing validator power would recognize their state diverged in the next block, right?

@jchappelow
Copy link
Member

jchappelow commented Jun 11, 2024

Hypothetically. I think it's handled more easily by hard failing where the extension author does not anticipate an error. You cannot roll back an already-committed an app state change. If you commit something erroneous (and we can know it is), then you diverge permanently and have to resync if you want to get back on the network even if apphash mismatch catches it right away. Not a good outcome.

I think the API is simply lacking for the resolution resolution function, or the ResolveFunc should log (not the app) when it's not fatal, and return an error when it's fatal.

@jchappelow
Copy link
Member

jchappelow commented Jun 11, 2024

In any case, if I were writing an extension, I would look at the docs for the field:

// ResolveFunc is a function that is called once a resolution has
// received a required number of votes, as defined by the
// ConfirmationThreshold. It is given a readwrite database
// connection and the information for the resolution that has
// been confirmed. All nodes will call this function as a part of
// block execution. It is therefore expected that the function is
// deterministic, regardless of a node's local configuration.
ResolveFunc func(ctx context.Context, app *common.App, resolution *Resolution) error

When should I return an error. How does the app handle an error? How should I deal with internal errors?

The contract for that API could use some clarification.

@brennanjl
Copy link
Collaborator Author

The contract for that API could use some clarification.

Agreed

@brennanjl
Copy link
Collaborator Author

I think the API is simply lacking for the resolution function, or the ResolveFunc should log (not the app) when it's not fatal, and return an error when it's fatal.

Unfortunately, this is hard to delineate right now. For example, if procedure execution failed due to an error() function, and if procedure execution failed due to a lost connection to Postgres, the error would be returned in the same place. Maybe something we can solve in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-release-v0.8 backport to release-v0.8 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants