-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix/fold open delete race #175
Conversation
This fixes the race between a fold getting an epoch and opening the files in the Bitcask directory. If between those two actions a file needed for that snapshot is deleted by a merge, the fold will return incorrect data. A fold will now fetch the list of files that it needs to open and the epoch to use to access a snapshot of data with one operation. It will then try to open the files, and if it fails try again with a new set of files and a new epoch. One question remains open: Now if someone deletes a file from under Bitcask, the fold will fail indefinitely. Before, it would eventually proceed and return whatever data is still left. Obviously we'll need to let it do something similar, falling back to returning partial data.
The fix to the fold open/delete file race made folds fail for a long time if a file was deleted from the cask. This fix makes it so that we detect that and avoid trying to re-open a file that has gone truly missing again and again.
With commit ee05a89 added to |
I've grumpy news and bad news. Upon reflection, commit ee05a89 shouldn't be so necessary for helping find that bug: PULSE ought to have enough instrumentation to do it itself. Either the state space is really huge and PULSE doesn't blunder into that corner often, or there's a problem with the instrumentation. The bad news: counterexamples. They come in two varieties.
... yields an unexpected error error. It works when cut-and-paste'd to my laptop, so I hope it's a stable'ish counterexample.
Number two is:
... and it yields a bad fold: 1 key expected, 0 found. This one is not stable when cut-and-paste'd to my Mac.
On my Mac, it's usually less than 20 iterations to find an error via:
|
Because now we have code appending tombstones to potentially deleted files, file stats entries were occasionally created on already deleted files. Now the update_fstats call has a flag to signal when it is OK to create a file stat entry if it doesn't exist already.
The fix for the first counter example is in. Since we now rely on the file stats entries to tell which files should be there and new code is writing tombstones to files other than the active file, it was possible to materialize a file stats entry on a file that was just deleted. A clunky new flag was added to update_fstats so that for certain cases you can specify "don't create a file stats entry if there isn't one already". |
The second counter example can be explained by #170. (A single delete in a file + merge + close + re-open + re-use same file + delayed delete deletes the new file by mistake) Yay! Here is a redbug trace of what happens with annotations: https://gist.github.com/engelsanchez/873e6b54ccd04b5ef352 |
@engelsanchez You appear to be correct, yay! After commit 925c2f7, then this is still a problem:
But that list comprehension doesn't throw an exception when the |
The merged code tests well (see #170 (comment)). I've removed the += 100 fudge and am now retesting. |
Bugfix/fold open delete race Reviewed-by: slfritchie
@borshop merge |
This is a fix for #174. It changes the opening of files for a fold to use a list from the file stats in the keydir instead of whatever is lying around on the cask, which could result in missing just deleted files.
One interesting thing about the change: folds will tend to fail if an operator deletes a good number of files from the directory by mistake, etc. So a little protection had to be added to detect that and avoid having folds retrying to open the same missing files over and over. Eventually they will succeed and fold over the incomplete data set.