Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve tombstone management #128

Closed
wants to merge 39 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
eb182ba
Add regression test for GH #82
slfritchie Dec 15, 2013
1a9c994
Fix race with merge & bitcask:delete(): remove from keydir *first*
slfritchie Dec 15, 2013
d594f90
Change order of scan & fold: now oldest -> youngest
slfritchie Dec 15, 2013
8805790
New regression test (currently broken) for recent scan change of olde…
slfritchie Dec 15, 2013
a331587
Fix rough edges on test/bitcask_qc.erl & test/bitcask_qc_expiry.erl
slfritchie Dec 15, 2013
239ed71
Write tombstone on delete only if there's a keydir entry, plus race b…
slfritchie Dec 15, 2013
afc5c3f
Fix remaining bug when bitcask scan/fold order was changed to oldest-…
slfritchie Dec 17, 2013
1712650
FML: fix tombstone merging bug when K & K's tombstone are in same fileid
slfritchie Dec 17, 2013
dd1c8b4
WIP: new test new_20131217_c_test_() is broken, need to experiment wi…
slfritchie Dec 18, 2013
a99916f
Fix non-determinism in the fold_visits_unfrozen_test test ... now 100…
slfritchie Dec 18, 2013
2af49f0
Fix iterator freeze bug demonstrated by last commit: remove kh_put_wi…
slfritchie Dec 18, 2013
662fba8
Fix bug introduced by almost-bugfix in commit 1a9c99 that was suppose…
slfritchie Dec 18, 2013
990b5ca
Fix unexported function errors, thanks Buildbot
slfritchie Dec 18, 2013
4019fd0
Omnibus bugfixing, including:
slfritchie Dec 21, 2013
89711f3
Add UNIX epoch time to put_int NIF call, to avoid C use of time(3).
slfritchie Dec 21, 2013
dd068bf
Fix buildbot ?TESTDIR definition location problem
slfritchie Dec 21, 2013
0695ba1
Uff da: most PULSE non-determinism problems fixed?
slfritchie Dec 23, 2013
0c3699a
Fix merge race when keydir_put specifies old fileid & offset races wi…
slfritchie Dec 23, 2013
c0f8760
Fix bug in previous commit: offset=0 is valid, do not check it
slfritchie Dec 24, 2013
91a781b
Use larger retry # for open_fold_files() and keydir_wait_ready() when…
slfritchie Dec 24, 2013
3c4ffa5
Small bitcask_pulse changes: file size, pretty the keys used, slightl…
slfritchie Dec 24, 2013
f27c640
Adjust command frequency weightings for PULSE model
slfritchie Dec 25, 2013
a981d53
Add no_tombstones_after_reopen_test() to test desired condition
slfritchie Dec 26, 2013
2690354
Avoid storing tombstones in the keydir when possible
slfritchie Dec 26, 2013
cef207b
Add checks to the PULSE model to try to catch improper tombstones in …
slfritchie Dec 26, 2013
72d87aa
Dialyzer fixes
slfritchie Dec 26, 2013
78a2667
Add failing test: tombstones exist in keydir if hint files are destroyed
slfritchie Dec 26, 2013
7bb2db0
Add {tombstone,Key} wrapper when folding .data files; fixes test fail…
slfritchie Dec 26, 2013
01d4538
Fix intermittent bitcask_merge_worker intermittent timeout failure
slfritchie Jan 7, 2014
1a444dd
NIF review: avoid looking at argv[1] directly
slfritchie Jan 7, 2014
f0d5ffc
Remove unwanted oopsie foo.erl
slfritchie Jan 7, 2014
2518628
Add an 'epoch' counter to subdivide Bitcask timestamps
slfritchie Jan 22, 2014
1563feb
Add sibling->regular entry conversion sweeper, to GC keydir siblings
slfritchie Jan 24, 2014
9db9248
Add new EUnit test and fix PULSE model
slfritchie Jan 28, 2014
432f3b1
Fix last two Dialyzer errors by removing dead code
slfritchie Jan 28, 2014
a50a1c0
Add error filter for harmless race
slfritchie Jan 30, 2014
959ecce
Add gets() to PULSE model, analogus to puts() for multiple keys
slfritchie Jan 30, 2014
5f4d145
Add new NIF, bitcask_nifs_keydir_global_pending_frozen/0, to assist P…
slfritchie Jan 30, 2014
c9bf3a3
Review fixer-upper fixes, both small but worthwhile, thanks!
slfritchie Feb 20, 2014
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
478 changes: 407 additions & 71 deletions c_src/bitcask_nifs.c

Large diffs are not rendered by default.

13 changes: 10 additions & 3 deletions include/bitcask.hrl
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,10 @@
fd, % File handle
hintfd, % File handle for hints
hintcrc=0,% CRC-32 of current hint
ofs }). % Current offset for writing
ofs, % Current offset for writing
l_ofs=0, % Last offset written to data file
l_hbytes=0,% Last # bytes written to hint file
l_hintcrc=0}). % CRC-32 of current hint prior to last write

-record(file_status, { filename,
fragmented,
Expand All @@ -26,8 +29,12 @@
-define(FMT(Str, Args), lists:flatten(io_lib:format(Str, Args))).

-define(TOMBSTONE, <<"bitcask_tombstone">>).
-define(TOMBSTONE2_STR, "bitcask_tombstone2").
-define(TOMBSTONE2,<<?TOMBSTONE2_STR>>).

-define(OFFSETFIELD, 64).
-define(OFFSETFIELD_v1, 64).
-define(TOMBSTONEFIELD_V2, 1).
-define(OFFSETFIELD_V2, 63).
-define(TSTAMPFIELD, 32).
-define(KEYSIZEFIELD, 16).
-define(TOTALSIZEFIELD, 32).
Expand All @@ -36,7 +43,7 @@
-define(HEADER_SIZE, 14). % 4 + 4 + 2 + 4 bytes
-define(MAXKEYSIZE, 2#1111111111111111).
-define(MAXVALSIZE, 2#11111111111111111111111111111111).
-define(MAXOFFSET, 16#ffffffffffffffff). % max 64-bit unsigned
-define(MAXOFFSET_V2, 16#7fffffffffffffff). % max 63-bit unsigned

%% for hintfile validation
-define(CHUNK_SIZE, 65535).
Expand Down
12 changes: 9 additions & 3 deletions rebar.config.script
Original file line number Diff line number Diff line change
Expand Up @@ -7,13 +7,15 @@ PulseBuild = case os:getenv("BITCASK_PULSE") of
case PulseBuild of
true ->
PulseOpts =
[{pulse_no_side_effect,[{erlang,display,1}]},
[{pulse_no_side_effect,
[{erlang,display,1},
{bitcask_nifs, keydir_global_pending_frozen, 0}]},
{pulse_side_effect,
[ {bitcask_nifs, keydir_new, 0}
, {bitcask_nifs, keydir_new, 1}
, {bitcask_nifs, keydir_mark_ready, 1}
, {bitcask_nifs, keydir_put_int, 9}
, {bitcask_nifs, keydir_get_int, 4}
, {bitcask_nifs, keydir_put_int, 10}
, {bitcask_nifs, keydir_get_int, 5}
, {bitcask_nifs, keydir_remove, 3}
, {bitcask_nifs, keydir_remove_int, 6}
, {bitcask_nifs, keydir_copy, 1}
Expand All @@ -23,8 +25,11 @@ case PulseBuild of
, {bitcask_nifs, keydir_info, 1}
, {bitcask_nifs, keydir_release, 1}
, {bitcask_nifs, keydir_trim_fstats, 2}
%% For PULSE purposes, this func is side-effect free (see above)
%% bitcask_nifs:keydir_global_pending_frozen/0

, {bitcask_nifs, increment_file_id, 1}
, {bitcask_nifs, increment_file_id, 2}

, {bitcask_nifs, lock_acquire_int, 2}
, {bitcask_nifs, lock_release_int, 1}
Expand All @@ -41,6 +46,7 @@ case PulseBuild of
, {bitcask_nifs, file_position_int, 2}
, {bitcask_nifs, file_seekbof_int, 1}

, {bitcask_file, '_', '_'}
, {bitcask_time, tstamp, 0}

, {prim_file, '_', '_'}
Expand Down
738 changes: 577 additions & 161 deletions src/bitcask.erl

Large diffs are not rendered by default.

10 changes: 10 additions & 0 deletions src/bitcask_file.erl
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,10 @@
-compile(export_all).
-behaviour(gen_server).

-ifdef(PULSE).
-compile({parse_transform, pulse_instrument}).
-endif.

%% API

%% gen_server callbacks
Expand Down Expand Up @@ -70,6 +74,9 @@ file_position(Pid, Position) ->
file_seekbof(Pid) ->
file_request(Pid, file_seekbof).

file_truncate(Pid) ->
file_request(Pid, file_truncate).

%%%===================================================================
%%% API helper functions
%%%===================================================================
Expand Down Expand Up @@ -168,6 +175,9 @@ handle_call(file_seekbof, From, State=#state{fd=Fd}) ->
check_owner(From, State),
{ok, _} = file:position(Fd, bof),
{reply, ok, State};
handle_call(file_truncate, From, State=#state{fd=Fd}) ->
check_owner(From, State),
{reply, file:truncate(Fd), State};

handle_call(_Request, _From, State) ->
Reply = ok,
Expand Down
108 changes: 51 additions & 57 deletions src/bitcask_fileops.erl
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
close/1,
close_for_writing/1,
data_file_tstamps/1,
write/4,
write/5,
read/3,
sync/1,
delete/1,
Expand All @@ -38,7 +38,8 @@
filename/1,
hintfile_name/1,
file_tstamp/1,
check_write/4]).
check_write/4,
un_write/1]).
-export([read_file_info/1, write_file_info/2, is_file/1]).

-include_lib("kernel/include/file.hrl").
Expand Down Expand Up @@ -120,7 +121,7 @@ open_file(Filename) ->
{ok, FD} ->
{ok, #filestate{mode = read_only,
filename = Filename, tstamp = file_tstamp(Filename),
fd = FD, ofs = 0 }};
fd = FD, ofs = 0}};
{error, Reason} ->
{error, Reason}
end.
Expand Down Expand Up @@ -150,7 +151,7 @@ close_hintfile(State = #filestate { hintfd = HintFd, hintcrc = HintCRC }) ->
%% timestamp and offset as large as the file format supports so opening with
%% an older version of bitcask will just reject the record at the end of the
%% hintfile and otherwise work normally.
Iolist = hintfile_entry(<<>>, 0, {?MAXOFFSET, HintCRC}),
Iolist = hintfile_entry(<<>>, 0, 0, ?MAXOFFSET_V2, HintCRC),
ok = bitcask_io:file_write(HintFd, Iolist),
bitcask_io:file_sync(HintFd),
bitcask_io:file_close(HintFd),
Expand Down Expand Up @@ -191,14 +192,14 @@ delete(#filestate{ filename = FN } = State) ->

%% @doc Write a Key-named binary data field ("Value") to the Filestate.
-spec write(#filestate{},
Key :: binary(), Value :: binary(), Tstamp :: integer()) ->
Key :: binary(), Value :: binary(), Tstamp :: integer(), TombstoneP :: boolean()) ->
{ok, #filestate{}, Offset :: integer(), Size :: integer()} |
{error, read_only}.
write(#filestate { mode = read_only }, _K, _V, _Tstamp) ->
write(#filestate { mode = read_only }, _K, _V, _Tstamp, _TombstoneP) ->
{error, read_only};
write(Filestate=#filestate{fd = FD, hintfd = HintFD,
hintcrc = HintCRC0, ofs = Offset},
Key, Value, Tstamp) ->
Key, Value, Tstamp, TombstoneP) ->
KeySz = size(Key),
true = (KeySz =< ?MAXKEYSIZE),
ValueSz = size(Value),
Expand All @@ -212,14 +213,32 @@ write(Filestate=#filestate{fd = FD, hintfd = HintFD,
ok = bitcask_io:file_pwrite(FD, Offset, Bytes),
%% Create and store the corresponding hint entry
TotalSz = KeySz + ValueSz + ?HEADER_SIZE,
Iolist = hintfile_entry(Key, Tstamp, {Offset, TotalSz}),
TombInt = if TombstoneP -> 1;
true -> 0
end,
Iolist = hintfile_entry(Key, Tstamp, TombInt, Offset, TotalSz),
ok = bitcask_io:file_write(HintFD, Iolist),
%% Record our final offset
TotalSz = iolist_size(Bytes),
HintCRC = erlang:crc32(HintCRC0, Iolist), % compute crc of hint
{ok, Filestate#filestate{ofs = Offset + TotalSz,
hintcrc = HintCRC}, Offset, TotalSz}.

hintcrc = HintCRC,
l_ofs = Offset,
l_hbytes = iolist_size(Iolist),
l_hintcrc = HintCRC0}, Offset, TotalSz}.

%% WARNING: We can only undo the last write.
un_write(Filestate=#filestate{fd = FD, hintfd = HintFD,
l_ofs = LastOffset,
l_hbytes = LastHintBytes,
l_hintcrc = LastHintCRC}) ->
{ok, _O2} = bitcask_io:file_position(FD, LastOffset),
ok = bitcask_io:file_truncate(FD),
{ok, 0} = bitcask_io:file_position(FD, 0),
{ok, _HO2} = bitcask_io:file_position(HintFD, {cur, -LastHintBytes}),
ok = bitcask_io:file_truncate(HintFD),
{ok, Filestate#filestate{ofs = LastOffset,
hintcrc = LastHintCRC}}.

%% @doc Given an Offset and Size, get the corresponding k/v from Filename.
-spec read(Filename :: string() | #filestate{}, Offset :: integer(),
Expand Down Expand Up @@ -406,7 +425,8 @@ read_crc(Fd) ->
{ok, <<0:?TSTAMPFIELD,
0:?KEYSIZEFIELD,
ExpectCRC:?TOTALSIZEFIELD,
(?MAXOFFSET):?OFFSETFIELD>>} ->
_TombInt:?TOMBSTONEFIELD_V2,
(?MAXOFFSET_V2):?OFFSETFIELD_V2>>} ->
ExpectCRC;
_ -> error
end.
Expand Down Expand Up @@ -466,7 +486,7 @@ fold_keys_loop(Fd, Offset, Fun, Acc0) ->

fold_keys_int_loop(<<_Crc32:?CRCSIZEFIELD, Tstamp:?TSTAMPFIELD,
KeySz:?KEYSIZEFIELD, ValueSz:?VALSIZEFIELD,
Key:KeySz/bytes, _:ValueSz/bytes,
Key:KeySz/bytes, Value:ValueSz/bytes,
Rest/binary>>,
Fun, Acc0, Consumed0,
{Offset, AvgValSz0},
Expand All @@ -475,24 +495,13 @@ fold_keys_int_loop(<<_Crc32:?CRCSIZEFIELD, Tstamp:?TSTAMPFIELD,
PosInfo = {Offset, TotalSz},
Consumed = Consumed0 + TotalSz,
AvgValSz = (AvgValSz0 + ValueSz) div 2,
Acc = Fun(Key, Tstamp, PosInfo, Acc0),
KeyPlus = case bitcask:is_tombstone(Value) of
true -> {tombstone, Key};
false -> Key
end,
Acc = Fun(KeyPlus, Tstamp, PosInfo, Acc0),
fold_keys_int_loop(Rest, Fun, Acc, Consumed,
{Offset + TotalSz, AvgValSz}, EOI);
%% in the case where values are very large, we don't actually want to
%% get a larger binary if we don't have to, so just issue a skip.
fold_keys_int_loop(<<_Crc32:?CRCSIZEFIELD, Tstamp:?TSTAMPFIELD,
KeySz:?KEYSIZEFIELD, ValueSz:?VALSIZEFIELD,
Key:KeySz/bytes,
_Rest/binary>>,
Fun, Acc0, _Consumed0,
{Offset, AvgValSz0},
_EOI) when AvgValSz0 > ?CHUNK_SIZE ->
TotalSz = KeySz + ValueSz + ?HEADER_SIZE,
PosInfo = {Offset, TotalSz},
Acc = Fun(Key, Tstamp, PosInfo, Acc0),
AvgValSz = (AvgValSz0 + ValueSz) div 2,
NewPos = Offset + TotalSz,
{skip, Acc, NewPos, {NewPos, AvgValSz}};
fold_keys_int_loop(<<>>, _Fun, Acc, Consumed, _Args, EOI) when EOI =:= true ->
{done, Acc, Consumed};
fold_keys_int_loop(_Bytes, _Fun, Acc, Consumed, Args, EOI) when EOI =:= false ->
Expand Down Expand Up @@ -527,21 +536,26 @@ fold_hintfile(State, Fun, Acc0) ->
%% hint record, three-tuple done indicates that we've exhausted all bytes, or
%% it's an error
fold_hintfile_loop(<<0:?TSTAMPFIELD, 0:?KEYSIZEFIELD,
_ExpectCRC:?TOTALSIZEFIELD, (?MAXOFFSET):?OFFSETFIELD>>,
_ExpectCRC:?TOTALSIZEFIELD,
_TombInt:?TOMBSTONEFIELD_V2, (?MAXOFFSET_V2):?OFFSETFIELD_V2>>,
_Fun, Acc, Consumed, _Args, EOI) when EOI =:= true ->
{done, Acc, Consumed + ?HINT_RECORD_SZ};
%% main work loop here, containing the full match of hint record and key.
%% if it gets a match, it proceeds to recurse over the rest of the big
%% binary
fold_hintfile_loop(<<Tstamp:?TSTAMPFIELD, KeySz:?KEYSIZEFIELD,
TotalSz:?TOTALSIZEFIELD, Offset:?OFFSETFIELD,
TotalSz:?TOTALSIZEFIELD,
TombInt:?TOMBSTONEFIELD_V2, Offset:?OFFSETFIELD_V2,
Key:KeySz/bytes, Rest/binary>>,
Fun, Acc0, Consumed0, {DataSize, HintFile} = Args,
EOI) ->
case Offset + TotalSz =< DataSize + 1 of
true ->
PosInfo = {Offset, TotalSz},
Acc = Fun(Key, Tstamp, PosInfo, Acc0),
KeyPlus = if TombInt == 1 -> {tombstone, Key};
true -> Key
end,
Acc = Fun(KeyPlus, Tstamp, PosInfo, Acc0),
Consumed = KeySz + ?HINT_RECORD_SZ + Consumed0,
fold_hintfile_loop(Rest, Fun, Acc,
Consumed, Args, EOI);
Expand Down Expand Up @@ -577,7 +591,6 @@ fold_hintfile_loop(_Bytes, _Fun, Acc0, Consumed0, Args, _EOI) ->
{more, any(), integer(), any()} |
{done, any()} |
{done, any(), integer()} |
{skip, any(), integer(), any()} |
{error, any()}),
fun(), any(), any()) ->
{error, any()} | any().
Expand All @@ -590,14 +603,8 @@ fold_file_loop(Fd, FoldFn, IntFoldFn, Acc0, Args0, Prev0, ChunkSz0) ->
%% in datafile folds and key folds
{Prev, ChunkSz}
= case Prev0 of
none -> {<<>>, ChunkSz0};
%if we're skipping around, we're likely too big
skip ->
CS = case ChunkSz0 >= (?MIN_CHUNK_SIZE * 2) of
true -> ChunkSz0 div 2;
false -> ?MIN_CHUNK_SIZE
end,
{<<>>, CS};
none ->
{<<>>, ChunkSz0};
Other ->
CS = case byte_size(Other) of
%% to avoid having to rescan the same
Expand Down Expand Up @@ -627,19 +634,6 @@ fold_file_loop(Fd, FoldFn, IntFoldFn, Acc0, Args0, Prev0, ChunkSz0) ->
end,
fold_file_loop(Fd, FoldFn, IntFoldFn,
Acc, Args, Rest, ChunkSz);
%% foldfuns should return skip when they have no need for
%% the rest of the binary that they've been handed.
%% see fold_int_loop for proper usage.
{skip, Acc, SkipTo, Args} ->
case bitcask_io:file_position(Fd, SkipTo) of
{ok, SkipTo} ->
fold_file_loop(Fd, FoldFn, IntFoldFn,
Acc, Args, skip, ChunkSz);
{error, Reason} ->
{error, Reason};
Other1 ->
{error, {file_fold_error, Other1}}
end;
%% the done two tuple is returned when we want to be
%% unconditionally successfully finished, i.e. trailing data
%% is a non-fatal error
Expand Down Expand Up @@ -677,10 +671,10 @@ open_hint_file(Filename, FinalOpts, Count) ->
open_hint_file(Filename, FinalOpts, Count - 1)
end.

hintfile_entry(Key, Tstamp, {Offset, TotalSz}) ->
hintfile_entry(Key, Tstamp, TombInt, Offset, TotalSz) ->
KeySz = size(Key),
[<<Tstamp:?TSTAMPFIELD>>, <<KeySz:?KEYSIZEFIELD>>,
<<TotalSz:?TOTALSIZEFIELD>>, <<Offset:?OFFSETFIELD>>, Key].
[<<Tstamp:?TSTAMPFIELD, KeySz:?KEYSIZEFIELD, TotalSz:?TOTALSIZEFIELD,
TombInt:?TOMBSTONEFIELD_V2, Offset:?OFFSETFIELD_V2>>, Key].

%% ===================================================================
%% file/filelib avoidance code.
Expand Down Expand Up @@ -745,7 +739,7 @@ get_efile_port() ->
Key = bitcask_efile_port,
case get(Key) of
undefined ->
case prim_file_drv_open(efile, [binary]) of
case prim_file_drv_open("efile", [binary]) of
{ok, Port} ->
put(Key, Port),
get_efile_port();
Expand Down
8 changes: 8 additions & 0 deletions src/bitcask_io.erl
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,10 @@
-module(bitcask_io).
-compile(export_all).

-ifdef(PULSE).
-compile({parse_transform, pulse_instrument}).
-endif.

file_open(Filename, Opts) ->
M = file_module(),
M:file_open(Filename, Opts).
Expand Down Expand Up @@ -58,6 +62,10 @@ file_position(Ref, Position) ->
M = file_module(),
M:file_position(Ref, Position).

file_truncate(Ref) ->
M = file_module(),
M:file_truncate(Ref).

file_module() ->
case get(bitcask_file_mod) of
undefined ->
Expand Down
Loading