Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Background Manager with Handoff And AAE Tree Rebuilds #769

Merged
merged 4 commits into from
Dec 20, 2013

Conversation

buddhisthead
Copy link
Contributor

NOTE: DO NOT MERGE YET (this PR requires code from #475 or basho/riak_core#484)

The riak_core handoff portion is here: basho/riak_core#484 (includes #475)
The riak_core background manager PR is here: basho/riak_core#475
The riak_repl PR is here: basho/riak_repl#488

The background manager allows subsystems to coordinate with locks (and tokens) in a non-blocking API when desired. In our case, we've seen several customer support issues related to a production cluster being overloaded by the combined weight of MDC fullsync replication in combination with either handoff, or AAE tree rebuilds (or all three!).

Each of these subsystems is being taught, for 2.0, to "take" a lock on the kv vnode/partition before it does a fold over that partition. Locks are returned when the process dies or exists normally. Each subsystem (at this point in time) is coded to keep working on other partitions if one partition is denied a lock via the return value of max_concurrency, which means progress is made even when one partition is busy.

Use of this feature can be bypassed in production. Besides a global skill switch on the background manager itself, the riak_core configuration supports two new parameters: aae_skip_bg_manager and handoff_skip_bg_manager, which defaults to false. Setting it to true will not call the background manager and proceed like it used to.

I chose not to separate the AAE and Handoff into two separate PRs because we already have too many, but if we determine that we want to exclude the handoff portion, I'll spit them up.

%% turns off all use of bg-mgr. 'aae_skip_background_manager' only stops
%% AAE tree rebuilds from using bg-mgr.
-spec use_bg_mgr() -> boolean().
use_bg_mgr() ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is duplicated in both riak_kv and riak_repl. Can't this be in core, and take a parameter instead of what property to inspect in the environment, if required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the global switch, you have a good point. I'll add this entry to the background manager.

try_set_concurrency_limit(Lock, Limit, riak_core_bg_manager:use_bg_mgr()).

try_set_concurrency_limit(_Lock, _Limit, false) ->
lager:info("Skipping background manager."),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably going to get verbose as an info statement since its per-partition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh heck. I meant to make that a debug. I wonder if we even need it all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we need it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by a78aee5

@jrwest
Copy link
Contributor

jrwest commented Dec 20, 2013

+1 merge after fixing the handoff_skip_background_manager config and removing the overly verbose logging commented on above.

I tested this mostly by hand because unfortunately there aren't many riak_test's that cover this. I verified that AAE rebuilds and handoff's don't stomp on each other via redbug (max_concurrency returned as expected, only one ?FOLD_REQ handled, etc) in addition to grabbing some of the partition locks myself and ensuring that neither subsystem could do any work (a nice trick btw). Also checked that turning off the subsystems' use of the background manager or global use of it allowed them to proceed anyways.

Final Comments:

  • A riak_test that does some of the above would be a nice addition to the test suite
  • I think try_set_concurrency_limit is exactly why we should move away from the table manager in the future and use a public ets table. Having one of those in each subsystem (or some general version) is going to get pretty annoying.
  • Along w/ that, I think we should re-consider my comments above about moving the subsystem enabled checking into a get_lock/4. Otherwise, each subsystem we add code to is going to need a similar maybe_lock_something function.
  • we still need cuttlefish mappings for the various new configs added

buddhisthead added a commit that referenced this pull request Dec 20, 2013
Integrate Background Manager with Handoff And AAE Tree Rebuilds
@buddhisthead buddhisthead merged commit d54b1e9 into develop Dec 20, 2013
@seancribbs seancribbs deleted the cet-bg-mgr-vnode-lock branch April 1, 2015 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants