Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/fold open delete race #175

Merged
merged 7 commits into from
Jul 8, 2014
Merged

Conversation

engelsanchez
Copy link
Contributor

This is a fix for #174. It changes the opening of files for a fold to use a list from the file stats in the keydir instead of whatever is lying around on the cask, which could result in missing just deleted files.

One interesting thing about the change: folds will tend to fail if an operator deletes a good number of files from the directory by mistake, etc. So a little protection had to be added to detect that and avoid having folds retrying to open the same missing files over and over. Eventually they will succeed and fold over the incomplete data set.

engelsanchez and others added 3 commits July 1, 2014 19:08
This fixes the race between a fold getting an epoch and opening the
files in the Bitcask directory.  If between those two actions a file
needed for that snapshot is deleted by a merge, the fold will return
incorrect data.

A fold will now fetch the list of files that it needs to open and the
epoch to use to access a snapshot of data with one operation. It will
then try to open the files, and if it fails try again with a new set of
files and a new epoch.

One question remains open: Now if someone deletes a file from under
Bitcask, the fold will fail indefinitely. Before, it would eventually
proceed and return whatever data is still left. Obviously we'll need to
let it do something similar, falling back to returning partial data.
The fix to the fold open/delete file race made folds fail for a long
time if a file was deleted from the cask. This fix makes it so that we
detect that and avoid trying to re-open a file that has gone truly
missing again and again.
@slfritchie
Copy link
Contributor

With commit ee05a89 added to develop, I can pretty quickly see a counterexample well within an hour, both on my laptop and on the Linux boxes in the Boston cluster, yay. No failures yet when used with this branch, double yay.

@slfritchie
Copy link
Contributor

I've grumpy news and bad news. Upon reflection, commit ee05a89 shouldn't be so necessary for helping find that bug: PULSE ought to have enough instrumentation to do it itself. Either the state space is really huge and PULSE doesn't blunder into that corner often, or there's a problem with the instrumentation.

The bad news: counterexamples. They come in two varieties.

Ck1 = [[{set,{var,14},{call,bitcask_pulse,bc_open,[true,{true,{128,39,195},97}]}},{set,{var,15},{call,bitcask_pulse,puts,[{var,14},{1,19},<<0,0,0,0>>]}},{set,{var,24},{call,bitcask_pulse,bc_close,[{var,14}]}},{set,{var,27},{call,bitcask_pulse,bc_open,[true,{true,{3,314,133},12}]}},{set,{var,36},{call,bitcask_pulse,puts,[{var,27},{1,19},<<0,0,0,0,0,0>>]}},{set,{var,39},{call,bitcask_pulse,merge,[{var,27}]}},{set,{var,42},{call,bitcask_pulse,bc_close,[{var,27}]}},{set,{var,43},{call,bitcask_pulse,fork,[[{init,{state,undefined,false,false,[]}},{set,{not_var,5},{not_call,bitcask_pulse,bc_open,[false,{true,{50,436,60},100}]}},{set,{not_var,6},{not_call,bitcask_pulse,fold,[{not_var,5}]}}]]}},{set,{var,65},{call,bitcask_pulse,bc_open,[true,{true,{362,156,190},7}]}},{set,{var,69},{call,bitcask_pulse,merge,[{var,65}]}}],{86462,36436,66255},[{errors,[]}]].

... yields an unexpected error error. It works when cut-and-paste'd to my laptop, so I hope it's a stable'ish counterexample.

errors: failed
[{<0.25755.2>,"Unexpectedly missing file ~s",
  ["1404-275176-752256/2.bitcask.data"]}] /= []

Number two is:

Ck4 = [[{set,{var,1},{call,bitcask_pulse,bc_open,[true,{true,{97,304,441},4}]}},{set,{var,3},{call,bitcask_pulse,delete,[{var,1},1]}},{set,{var,6},{call,bitcask_pulse,bc_close,[{var,1}]}},{set,{var,7},{call,bitcask_pulse,fork,[[{init,{state,undefined,false,false,[]}},{set,{not_var,10},{not_call,bitcask_pulse,bc_open,[false,{true,{437,404,227},31}]}},{set,{not_var,11},{not_call,bitcask_pulse,fold,[{not_var,10}]}},{set,{not_var,13},{not_call,bitcask_pulse,bc_close,[{not_var,10}]}},{set,{not_var,29},{not_call,bitcask_pulse,bc_open,[false,{true,{223,15,87},100}]}},{set,{not_var,30},{not_call,bitcask_pulse,fold,[{not_var,29}]}}]]}},{set,{var,9},{call,bitcask_pulse,bc_open,[true,{true,{196,31,319},100}]}},{set,{var,11},{call,bitcask_pulse,merge,[{var,9}]}},{set,{var,12},{call,bitcask_pulse,bc_close,[{var,9}]}},{set,{var,27},{call,bitcask_pulse,bc_open,[true,{true,{127,87,413},68}]}},{set,{var,51},{call,bitcask_pulse,puts,[{var,27},{1,1},<<>>]}}],{66310,74969,22249},[{errors,[]},{events,[]}]].

... and it yields a bad fold: 1 key expected, 0 found. This one is not stable when cut-and-paste'd to my Mac.

Bad:
[{0,92442,[]},{92442,infinity,[{bad,<<"pid_2">>,{fold,[{1,[<<>>]}],[]}}]}]

On my Mac, it's usually less than 20 iterations to find an error via:

[begin Now = now(), io:format("Now ~p ", [Now]), true = eqc:check(bitcask_pulse:prop_pulse(), [lists:nth(1, Ck4), Now, lists:nth(3, Ck4)]) end || _ <- lists:seq(1,100)].

Because now we have code appending tombstones to potentially deleted files,
file stats entries were occasionally created on already deleted files.
Now the update_fstats call has a flag to signal when it is OK to create
a file stat entry if it doesn't exist already.
@engelsanchez
Copy link
Contributor Author

The fix for the first counter example is in. Since we now rely on the file stats entries to tell which files should be there and new code is writing tombstones to files other than the active file, it was possible to materialize a file stats entry on a file that was just deleted. A clunky new flag was added to update_fstats so that for certain cases you can specify "don't create a file stats entry if there isn't one already".

@engelsanchez
Copy link
Contributor Author

The second counter example can be explained by #170. (A single delete in a file + merge + close + re-open + re-use same file + delayed delete deletes the new file by mistake) Yay!

Here is a redbug trace of what happens with annotations: https://gist.github.com/engelsanchez/873e6b54ccd04b5ef352

@slfritchie
Copy link
Contributor

@engelsanchez You appear to be correct, yay! After commit 925c2f7, then this is still a problem:

Cm2 = [[{set,{var,1},{call,bitcask_pulse,incr_clock,[]}},{set,{var,2},{call,bitcask_pulse,incr_clock,[]}},{set,{var,4},{call,bitcask_pulse,fork,[[{init,{state,undefined,false,false,[]}},{set,{not_var,3},{not_call,bitcask_pulse,bc_open,[false,{true,{396,466,56},19}]}},{set,{not_var,4},{not_call,bitcask_pulse,gets,[{not_var,3},{1,1}]}},{set,{not_var,8},{not_call,bitcask_pulse,bc_close,[{not_var,3}]}},{set,{not_var,13},{not_call,bitcask_pulse,bc_open,[false,{true,{123,478,31},82}]}},{set,{not_var,14},{not_call,bitcask_pulse,fold,[{not_var,13}]}}]]}},{set,{var,11},{call,bitcask_pulse,bc_open,[true,{true,{288,347,452},23}]}},{set,{var,12},{call,bitcask_pulse,delete,[{var,11},1]}},{set,{var,13},{call,bitcask_pulse,bc_close,[{var,11}]}},{set,{var,14},{call,bitcask_pulse,incr_clock,[]}},{set,{var,15},{call,bitcask_pulse,incr_clock,[]}},{set,{var,16},{call,bitcask_pulse,incr_clock,[]}},{set,{var,17},{call,bitcask_pulse,incr_clock,[]}},{set,{var,18},{call,bitcask_pulse,bc_open,[true,{true,{169,422,102},27}]}},{set,{var,20},{call,bitcask_pulse,merge,[{var,18}]}},{set,{var,21},{call,bitcask_pulse,bc_close,[{var,18}]}},{set,{var,29},{call,bitcask_pulse,bc_open,[true,{true,{324,341,134},12}]}},{set,{var,35},{call,bitcask_pulse,puts,[{var,29},{3,37},<<>>]}}],{42269,61720,9051},[{errors,[]},{events,[]}]].
[true = eqc:check(eqc:testing_time(24*3600, bitcask_pulse:prop_pulse()), Cm2) || _ <- lists:seq(1,550)]. 

But that list comprehension doesn't throw an exception when the bug/deferred-delete-Cd8 branch is merged into 925c2f7. I'll try to punish the merged code now.....

@slfritchie
Copy link
Contributor

The merged code tests well (see #170 (comment)). I've removed the += 100 fudge and am now retesting.

@slfritchie
Copy link
Contributor

+1 c8c0acc

Issue #170 is merged. I've been PULSE testing #170 and #175 and #177 together for days without problem.

borshop added a commit that referenced this pull request Jul 8, 2014
Bugfix/fold open delete race

Reviewed-by: slfritchie
@slfritchie
Copy link
Contributor

@borshop merge

@borshop borshop merged commit c8c0acc into develop Jul 8, 2014
@seancribbs seancribbs deleted the bugfix/fold-open-delete-race branch April 1, 2015 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants