Add repair ability
Add the framework needed for services on top of core to provide a
"repair" mechanism.  At this point repair was built with the sole
purpose of repairing data for KV and Search.

The key insight behind repair is that, since Riak is a replicated
data store, one partition can be rebuilt from the replicas on other
partitions.  Specifically, the adjacent (i.e. before and after)
partitions on the ring, together, contain all the replicas that are
stored on the partition to be repaired.  The rub is that adjacent
partitions also contain replicas that are _not_ meant to be on the
partition under repair.  This means a filter function must be used
while folding over the source partitions to transfer only the data
that belongs.  This is done as efficiently as possible by generating a
hash range for every bucket up front, avoiding a preflist calculation
for each key.  Each key is simply hashed, its range looked up in a
bucket->range map, and the hash checked against that range.  The
service under repair (i.e. Search or KV) must provide callback
functions to generate these ranges since they are specific to the
service.
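
As a rough sketch of that filtering idea (module and function names
here are hypothetical, not the actual riak_core code, and the range is
assumed not to wrap around the ring), the fold filter built from a
service-provided bucket->range map could look like this:

    %% Hypothetical sketch of the fold filter.  RangeMap maps each
    %% bucket to the hash range owned by the partition under repair;
    %% HashFun hashes a {Bucket, Key} pair to an integer on the ring.
    -module(repair_filter_sketch).
    -export([make_filter/2]).

    -type range()     :: {non_neg_integer(), non_neg_integer()}.
    -type range_map() :: [{binary(), range()}].
    -type bkey()      :: {binary(), binary()}.

    -spec make_filter(range_map(), fun((bkey()) -> non_neg_integer())) ->
              fun((bkey()) -> boolean()).
    make_filter(RangeMap, HashFun) ->
        fun({Bucket, _Key} = BKey) ->
                case lists:keyfind(Bucket, 1, RangeMap) of
                    {Bucket, {Start, End}} ->
                        %% One hash per key; no preflist calculation.
                        Hash = HashFun(BKey),
                        Hash >= Start andalso Hash =< End;
                    false ->
                        %% Unknown bucket: nothing to send.
                        false
                end
        end.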

To avoid repeating code and to reuse handoff's concurrency control,
the repair mechanism is currently built as an extension of handoff.
This creates some awkwardness because repair is _not_ handoff.  Some
of the differences
include:

1. It needs to filter data during the fold

2. All vnodes involved are primary and thus repair _cannot_ block

3. Repair involves 3 distinct partitions and 3 distinct vnodes (see
the sketch after this list)

4. Repair does not imply a change of responsibility from one vnode to
another, but is more a sharing of data
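
To make the 3-partition relationship concrete, here is a sketch (names
are hypothetical; this is not the riak_core implementation) of how the
two source partitions relate to the target: they are simply the
target's predecessor and successor in the sorted list of partition
indexes on the ring.

    %% Hypothetical sketch: find the two source partitions (the ones
    %% immediately before and after the target) given the full, sorted
    %% list of partition indexes on the ring.  Assumes Target is a
    %% member of Indexes.
    -module(repair_adjacent_sketch).
    -export([adjacent/2]).

    -spec adjacent([integer()], integer()) -> {integer(), integer()}.
    adjacent(Indexes, Target) ->
        N = length(Indexes),
        Pos = index_of(Target, Indexes),
        Before = lists:nth(((Pos - 2 + N) rem N) + 1, Indexes),
        After  = lists:nth((Pos rem N) + 1, Indexes),
        {Before, After}.

    %% 1-based position of X in list L.
    index_of(X, L) ->
        length(lists:takewhile(fun(E) -> E =/= X end, L)) + 1.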

In the future the handoff subsystem should be rewritten around the
notion of "transfers" in which repair, handoff, and ownership are all
different logical operations but use the transfer mechanism
underneath.  Now that repair is in, it should be clearer what that
system needs to look like to meet the goals of all three.

The idea of repair lives in the vnode manager as it is very much a
vnode semantic.  This is important because other things like handoff
and ownership also affect vnodes and thus affect repair.  The vnode
manager controls the logical repair, but the handoff/transfer
mechanism controls the physical movement of replicas from source to
target.

For now, it seemed easiest and smartest to _not_ allow ownership
change and repair to run concurrently.  If an ownership change is
detected, all repairs will be hard-killed, regardless of their
status.
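
A minimal sketch of that policy, assuming the vnode manager tracks
repairs as {Partition, TransferPid} pairs and is handed the list of
pending ownership changes from the ring (the module and function names
are hypothetical):

    %% Hypothetical sketch: if any ownership changes are pending,
    %% hard-kill every in-flight repair, regardless of how far along
    %% it is.
    -module(repair_kill_sketch).
    -export([maybe_kill_repairs/2]).

    maybe_kill_repairs([], Repairs) ->
        %% No ownership change in progress; leave repairs alone.
        Repairs;
    maybe_kill_repairs(_PendingChanges, Repairs) ->
        [exit(Pid, kill) || {_Partition, Pid} <- Repairs, is_pid(Pid)],
        [].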

In the case where the low-level repair transfer is killed because of
the `handoff_concurrency` limit, a message is _not_ sent back to the
vnode manager to indicate a failure for that transfer.  Instead, the
vnode manager has a periodic tick.  On each tick it checks the status
of all its repairs and retries any transfers that have since died for
a reason other than completion.
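
A sketch of that tick-driven retry, with hypothetical record and
function names (the real state kept by the vnode manager is richer
than this, and a real implementation would monitor remote transfer
processes rather than assume they are local):

    %% Hypothetical sketch: on each periodic tick the vnode manager
    %% walks its repairs and restarts any transfer whose process died
    %% before completing.
    -module(repair_tick_sketch).
    -export([check_repairs/1]).

    -record(repair, {mod_src_tgt,
                     status,          %% complete | in_progress
                     transfer_pid,    %% pid of the low-level transfer
                     restart_fun}).   %% fun that re-launches the transfer

    check_repairs(Repairs) ->
        [maybe_retry(R) || R <- Repairs].

    maybe_retry(#repair{status = complete} = R) ->
        R;                                      %% finished; nothing to do
    maybe_retry(#repair{transfer_pid = Pid, restart_fun = Restart} = R) ->
        %% Assumes a local transfer process.
        case is_pid(Pid) andalso is_process_alive(Pid) of
            true  -> R;                         %% transfer still running
            false -> R#repair{transfer_pid = Restart()}  %% died early; retry
        end.
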
rzezeski committed Jun 12, 2012
1 parent b3d564a commit 036e409
Showing 5 changed files with 650 additions and 158 deletions.
23 changes: 23 additions & 0 deletions include/riak_core_handoff.hrl
@@ -13,3 +13,26 @@
}).

-type ho_stats() :: #ho_stats{}.
-type ho_type() :: ownership_handoff | hinted_handoff.
-type predicate() :: fun((any()) -> boolean()).

-type index() :: integer().
-type mod_src_tgt() :: {module(), index(), index()}.
-type mod_partition() :: {module(), index()}.

-record(handoff_status,
{ mod_src_tgt :: mod_src_tgt(),
src_node :: node(),
target_node :: node(),
direction :: inbound | outbound,
transport_pid :: pid(),
timestamp :: tuple(),
status :: any(),
stats :: dict(),
vnode_pid :: pid() | undefined,
type :: ownership | hinted_handoff | repair,
req_origin :: node(),
filter_mod_fun :: {module(), atom()}
}).
-type handoff() :: #handoff_status{}.
-type handoffs() :: [handoff()].
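
Purely for illustration (this example is not part of the commit; the
node name, status value, and the filter module/function are
hypothetical), a repair transfer described with the new fields might
be built like this:

    %% Hypothetical example of a repair-typed handoff_status record.
    -module(repair_status_example).
    -include("riak_core_handoff.hrl").
    -export([example/3]).

    example(SrcIdx, TargetIdx, SenderPid) ->
        #handoff_status{
           mod_src_tgt    = {riak_kv_vnode, SrcIdx, TargetIdx},
           src_node       = node(),
           target_node    = 'dev2@127.0.0.1',      %% hypothetical node
           direction      = outbound,
           transport_pid  = SenderPid,
           timestamp      = os:timestamp(),
           status         = pending,               %% illustrative value
           stats          = dict:new(),
           vnode_pid      = undefined,
           type           = repair,
           req_origin     = node(),
           filter_mod_fun = {my_service_vnode, repair_filter}  %% hypothetical callback
          }.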