Add repair ability
Add the framework needed for services on top of core to provide a
"repair" mechanism.  At this point repair was built with the sole
purpose of repairing data for KV and Search.

The key insight behind repair is that, since Riak is a replicated
data store, one partition can be rebuilt from the replicas on other
partitions.  Specifically, the adjacent (i.e. before and after)
partitions on the ring, together, contain all the replicas that are
stored on the partition to be repaired.  The rub is that adjacent
partitions also contain replicas that are _not_ meant to be on the
partition under repair.  This means a filter function must be used
while folding over the source partitions to transfer only the data
that belongs.  This is done as efficiently as possible by generating a
hash range for every bucket up front, avoiding a preflist calculation
for each key.  Each key is simply hashed, its range looked up in a
bucket->range map, and the hash checked against that range.  The
service under repair (i.e. Search or KV) must provide callback
functions to generate these ranges since they are specific to the
service.
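
As a rough sketch of that filtering idea (module and function names
here are hypothetical, not the actual riak_core code, and the range is
assumed not to wrap around the ring), the fold filter built from a
service-provided bucket->range map could look like this:

    %% Hypothetical sketch of the fold filter.  RangeMap maps each
    %% bucket to the hash range owned by the partition under repair;
    %% HashFun hashes a {Bucket, Key} pair to an integer on the ring.
    -module(repair_filter_sketch).
    -export([make_filter/2]).

    -type range()     :: {non_neg_integer(), non_neg_integer()}.
    -type range_map() :: [{binary(), range()}].
    -type bkey()      :: {binary(), binary()}.

    -spec make_filter(range_map(), fun((bkey()) -> non_neg_integer())) ->
              fun((bkey()) -> boolean()).
    make_filter(RangeMap, HashFun) ->
        fun({Bucket, _Key} = BKey) ->
                case lists:keyfind(Bucket, 1, RangeMap) of
                    {Bucket, {Start, End}} ->
                        %% One hash per key; no preflist calculation.
                        Hash = HashFun(BKey),
                        Hash >= Start andalso Hash =< End;
                    false ->
                        %% Unknown bucket: nothing to send.
                        false
                end
        end.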

To avoid repeating code and to reuse handoff's concurrency control,
the repair mechanism is currently built as an extension of handoff.
This creates some awkwardness because repair is _not_ handoff.  Some
of the differences
include:

1. It needs to filter data during the fold

2. All vnodes involved are primary and thus repair _cannot_ block

3. Repair involves 3 distinct partitions and 3 distinct vnodes (see
the sketch after this list)

4. Repair does not imply a change of responsibility from one vnode to
another, but is more a sharing of data
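
To make the 3-partition relationship concrete, here is a sketch (names
are hypothetical; this is not the riak_core implementation) of how the
two source partitions relate to the target: they are simply the
target's predecessor and successor in the sorted list of partition
indexes on the ring.

    %% Hypothetical sketch: find the two source partitions (the ones
    %% immediately before and after the target) given the full, sorted
    %% list of partition indexes on the ring.  Assumes Target is a
    %% member of Indexes.
    -module(repair_adjacent_sketch).
    -export([adjacent/2]).

    -spec adjacent([integer()], integer()) -> {integer(), integer()}.
    adjacent(Indexes, Target) ->
        N = length(Indexes),
        Pos = index_of(Target, Indexes),
        Before = lists:nth(((Pos - 2 + N) rem N) + 1, Indexes),
        After  = lists:nth((Pos rem N) + 1, Indexes),
        {Before, After}.

    %% 1-based position of X in list L.
    index_of(X, L) ->
        length(lists:takewhile(fun(E) -> E =/= X end, L)) + 1.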

In the future the handoff subsystem should be rewritten around the
notion of "transfers" in which repair, handoff, and ownership are all
different logical operations but use the transfer mechanism
underneath.  Now that repair is in, it should be clearer what that
system needs to look like to meet the goals of all three.

The idea of repair lives in the vnode manager as it is very much a
vnode semantic.  This is important because other things like handoff
and ownership also affect vnodes and thus affect repair.  The vnode
manager controls the logical repair, but the handoff/transfer
mechanism controls the physical movement of replicas from source to
target.

For now, it seemed easiest and smartest to _not_ allow ownership
change and repair to run concurrently.  If an ownership change is
detected, all repairs will be hard-killed, regardless of their
status.
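
A minimal sketch of that policy, assuming the vnode manager tracks
repairs as {Partition, TransferPid} pairs and is handed the list of
pending ownership changes from the ring (the module and function names
are hypothetical):

    %% Hypothetical sketch: if any ownership changes are pending,
    %% hard-kill every in-flight repair, regardless of how far along
    %% it is.
    -module(repair_kill_sketch).
    -export([maybe_kill_repairs/2]).

    maybe_kill_repairs([], Repairs) ->
        %% No ownership change in progress; leave repairs alone.
        Repairs;
    maybe_kill_repairs(_PendingChanges, Repairs) ->
        [exit(Pid, kill) || {_Partition, Pid} <- Repairs, is_pid(Pid)],
        [].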

In the case where the low-level repair transfer is killed because of
the `handoff_concurrency` limit, a message is _not_ sent back to the
vnode manager to indicate a failure for that transfer.  Instead, the
vnode manager has a periodic tick.  On each tick it checks the status
of all its repairs and retries any transfers that have since died for
a reason other than completion.
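
A sketch of that tick-driven retry, with hypothetical record and
function names (the real state kept by the vnode manager is richer
than this, and a real implementation would monitor remote transfer
processes rather than assume they are local):

    %% Hypothetical sketch: on each periodic tick the vnode manager
    %% walks its repairs and restarts any transfer whose process died
    %% before completing.
    -module(repair_tick_sketch).
    -export([check_repairs/1]).

    -record(repair, {mod_src_tgt,
                     status,          %% complete | in_progress
                     transfer_pid,    %% pid of the low-level transfer
                     restart_fun}).   %% fun that re-launches the transfer

    check_repairs(Repairs) ->
        [maybe_retry(R) || R <- Repairs].

    maybe_retry(#repair{status = complete} = R) ->
        R;                                      %% finished; nothing to do
    maybe_retry(#repair{transfer_pid = Pid, restart_fun = Restart} = R) ->
        %% Assumes a local transfer process.
        case is_pid(Pid) andalso is_process_alive(Pid) of
            true  -> R;                         %% transfer still running
            false -> R#repair{transfer_pid = Restart()}  %% died early; retry
        end.
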
rzezeski committed Jun 12, 2012
1 parent b3d564a commit 036e409
Showing 5 changed files with 650 additions and 158 deletions.
23 changes: 23 additions & 0 deletions include/riak_core_handoff.hrl
@@ -13,3 +13,26 @@
}).

-type ho_stats() :: #ho_stats{}.
-type ho_type() :: ownership_handoff | hinted_handoff.
-type predicate() :: fun((any()) -> boolean()).

-type index() :: integer().
-type mod_src_tgt() :: {module(), index(), index()}.
-type mod_partition() :: {module(), index()}.

-record(handoff_status,
{ mod_src_tgt :: mod_src_tgt(),
src_node :: node(),
target_node :: node(),
direction :: inbound | outbound,
transport_pid :: pid(),
timestamp :: tuple(),
status :: any(),
stats :: dict(),
vnode_pid :: pid() | undefined,
type :: ownership | hinted_handoff | repair,
req_origin :: node(),
filter_mod_fun :: {module(), atom()}
}).
-type handoff() :: #handoff_status{}.
-type handoffs() :: [handoff()].
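
Purely for illustration (this example is not part of the commit; the
node name, status value, and the filter module/function are
hypothetical), a repair transfer described with the new fields might
be built like this:

    %% Hypothetical example of a repair-typed handoff_status record.
    -module(repair_status_example).
    -include("riak_core_handoff.hrl").
    -export([example/3]).

    example(SrcIdx, TargetIdx, SenderPid) ->
        #handoff_status{
           mod_src_tgt    = {riak_kv_vnode, SrcIdx, TargetIdx},
           src_node       = node(),
           target_node    = 'dev2@127.0.0.1',      %% hypothetical node
           direction      = outbound,
           transport_pid  = SenderPid,
           timestamp      = os:timestamp(),
           status         = pending,               %% illustrative value
           stats          = dict:new(),
           vnode_pid      = undefined,
           type           = repair,
           req_origin     = node(),
           filter_mod_fun = {my_service_vnode, repair_filter}  %% hypothetical callback
          }.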