Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VerifyTxConsistency to backend. #17359

Merged
merged 1 commit into from
Feb 22, 2024
Merged

Conversation

siyuanfoundation
Copy link
Contributor

@siyuanfoundation siyuanfoundation commented Feb 2, 2024

Followup test verification for #17158

Should help us find all places impacted by readbuf inconsistency.
Similar to #17127

@siyuanfoundation
Copy link
Contributor Author

/retest

@siyuanfoundation
Copy link
Contributor Author

cc @serathius @ahrtr

@ahrtr
Copy link
Member

ahrtr commented Feb 4, 2024

Also let's simplify the verification just at one place (e.g. at the end of each etcd transaction)

@siyuanfoundation siyuanfoundation force-pushed the verify-test branch 2 times, most recently from 535ad95 to d20f507 Compare February 5, 2024 18:00
@siyuanfoundation
Copy link
Contributor Author

siyuanfoundation commented Feb 5, 2024

Also let's simplify the verification just at one place (e.g. at the end of each etcd transaction)

Doing it in applyAll is essentially at the end of each etcd transaction.
Please see #17127 for the read tx level approach and why it does not work.
Another good place to do it is in func (t *batchTxBuffered) Unlock() because t.buf.writeback needs to be done before checking, but I cannot call schema.AllBuckets there because of cyclic dependencies.
"wrap backend.ReadTx() and backend.BatchTx() and intercept UnsafeRead, UnsafePut calls." this would involve some non trivial code refactoring, which may be more risky than beneficial for the purpose of verification.

server/storage/backend/verify.go Outdated Show resolved Hide resolved
server/storage/backend/verify.go Outdated Show resolved Hide resolved
if lg != nil {
lg.Debug("verifyBackendConsistency", zap.Bool("skipSafeRangeBucket", skipSafeRangeBucket))
}
b.BatchTx().LockOutsideApply()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we should use LockOutsideApply? We call this function in applyAll

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, LockInsideApply would fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use LockOutsideApply. This is just verification code, obviously we don't need to execute the txPostLockInsideApplyHook.

dataFromReadTxn[string(k)] = string(v)
return nil
})
if diff := cmp.Diff(dataFromWriteTxn, dataFromReadTxn); diff != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you confirmed that this would detect all the recently discovered problems?

When I proposed #17126 I was thinking that we should implement a stub for backend that would implement the same methods but without bbolt and keep the data only in memory. With validate enabled, we would send all request to bbolt to both backend and the stub and compare results.

I think this would be a better long term approach, however your solution should be acceptable assuming it can reproduce previous issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can detect #17158.
For #17263, since it is rare to write multiple keys in 1 tx, there is no existing issues to detect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we should implement a stub for backend that would implement the same methods but without bbolt and keep the data only in memory

Sound like a good idea. We clear/reset the stub backend before each test, and compare the real backend with the stub at the end of each test. But when the member restarts, all the in-memory data will be lost, unless we also support persistent storage for the stub backend. It's a separate topic (This PR just verifies the consistency between read TXN and write TXN.), please feel free to raise a separate ticket to track it if you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when the member restarts, all the in-memory data will be lost, unless we also support persistent storage for the stub backend

You are right that it's impossible to validate cross restart without keeping own persistent storage, however we don't need to test durability, just the runtime behavior. If we validate the data before member is restarted, we should trust the data when backend is loaded back after restart.

@siyuanfoundation
Copy link
Contributor Author

/retest

Comment on lines +76 to +81
if b == nil {
return
}
if lg != nil {
lg.Debug("verifyBackendConsistency", zap.Bool("skipSafeRangeBucket", skipSafeRangeBucket))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: suggest to remove the ugly nil check. We require the callers to pass in valid parameters.

Suggested change
if b == nil {
return
}
if lg != nil {
lg.Debug("verifyBackendConsistency", zap.Bool("skipSafeRangeBucket", skipSafeRangeBucket))
}
lg.Debug("verifyBackendConsistency", zap.Bool("skipSafeRangeBucket", skipSafeRangeBucket))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to be an umbrella check. There are existing tests that have nil srv backend or nil logger. I don't want to fail those tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to fail those tests.

Which tests are "those tests"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestProcessDuplicatedAppRespMessage

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Actually we shouldn't enable & run the verification in unit test at all. They are only supposed to be executed in e2e and integration tests.

We may want to remove the following setting.

export ETCD_VERIFY=all

Since it's unrelated to this PR, so I won't insist on it. It can be discussed separately.

@ahrtr
Copy link
Member

ahrtr commented Feb 19, 2024

Please rebase this PR

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Thanks

cc @fuweid @jmhbnz @serathius

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to test that performance in the follow-up when uses enables CorruptCheckTime. It seems it's too expensive to read all the keys into memory.

@ahrtr
Copy link
Member

ahrtr commented Feb 22, 2024

I think we need to test that performance in the follow-up when uses enables CorruptCheckTime. It seems it's too expensive to read all the keys into memory.

The verification is only executed in test. Usually there isn't too much data in test.

@fuweid
Copy link
Member

fuweid commented Feb 22, 2024

The verification is only executed in test.

Thanks for the comment.

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ahrtr
Copy link
Member

ahrtr commented Feb 22, 2024

I just merged another PR, which updated the go.mod files. Please rebase this PR. Sorry for the inconvenience.

Signed-off-by: Siyuan Zhang <sizhang@google.com>

Update server/storage/backend/verify.go

Co-authored-by: Benjamin Wang <benjamin.wang@broadcom.com>

Update server/storage/backend/verify.go

Co-authored-by: Benjamin Wang <benjamin.wang@broadcom.com>
@siyuanfoundation
Copy link
Contributor Author

rebased.

@ahrtr ahrtr merged commit 8c7f911 into etcd-io:main Feb 22, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants