feat(dup): support shell set fail_mode and collector duplication ops #520

Merged
merged 18 commits into from
Apr 26, 2020

@neverchanje neverchanje commented Apr 16, 2020

What problem does this PR solve?

  • Support shell command set_dup_fail_mode. See the following shell example for details.
  • Add perf-counters for QPS of duplication shipping and handling.

New perf-counters

  • collector*app.pegasus*app.stat.duplicate_qps#<app_name>
  • collector*app.pegasus*app.stat.dup_shipped_ops#<app_name>
  • collector*app.pegasus*app.stat.dup_failed_shipping_ops#<app_name>

Check List

Tests

  • Manual test (add detailed scripts or steps below)
>>> add_dup temp onebox2
adding duplication succeed [app: temp, remote: onebox2, appid: 2, dupid: 1587111546, freeze: false]

>>> set_dup_fail_mode temp 1587111546 skip
set duplication(1587111546) fail_mode (skip) for app temp succeed

>>> query_dup temp -d
duplications of app [temp] in detail:
{"1":{"create_ts":"2020-04-17 16:19:06","dupid":1587111546,"fail_mode":"FAIL_SKIP","remote":"onebox2","status":"DS_START"},"appid":2}

>>> set_dup_fail_mode temp 1587111546 slow
set duplication(1587111546) fail_mode (slow) for app temp succeed

>>> query_dup temp -d
duplications of app [temp] in detail:
{"1":{"create_ts":"2020-04-17 16:19:06","dupid":1587111546,"fail_mode":"FAIL_SLOW","remote":"onebox2","status":"DS_START"},"appid":2}
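
The session above suggests that the shell argument (skip, slow) maps onto the values printed by query_dup (FAIL_SKIP, FAIL_SLOW). A minimal sketch of that mapping follows; the enum and both function names are hypothetical and not taken from this PR, and only the two values visible in the transcript are covered.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical enum mirroring the two values visible in the query_dup output.
enum class FailMode { FAIL_SLOW, FAIL_SKIP };

// Map the shell argument ("slow" / "skip") to the enum.
// Returns std::nullopt for anything else so the caller can report an error.
std::optional<FailMode> parse_fail_mode(const std::string &arg)
{
    if (arg == "slow") {
        return FailMode::FAIL_SLOW;
    }
    if (arg == "skip") {
        return FailMode::FAIL_SKIP;
    }
    return std::nullopt;
}

// Render the enum the way it appears in the JSON printed by query_dup.
std::string fail_mode_to_string(FailMode mode)
{
    switch (mode) {
    case FailMode::FAIL_SLOW:
        return "FAIL_SLOW";
    case FailMode::FAIL_SKIP:
        return "FAIL_SKIP";
    }
    assert(false && "unhandled FailMode value");
    return "";
}
```

Returning an empty optional instead of a default keeps a mistyped mode from silently changing the behaviour of a running duplication.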

"onebox2" is the remote cluster to duplicate to. To test whether the perf-counters work, we use pegasus-YCSB to load writes into the local onebox.

> ./start.sh load
YCSB Client 0.12.0-SNAPSHOT
Command line: -db com.yahoo.ycsb.db.PegasusClient -p pegasus.config=file://./pegasus.properties -s -P ./workload_pegasus -load
Loading workload...
Starting test.
...
2020-04-20 10:38:42:570 30 sec: 57019 operations; 1575.9 current ops/sec; est completion in 14 hours 36 minutes [INSERT: Count=15759, Max=1591295, Min=158, Avg=663.11, 90=488, 99=973, 99.9=100991, 99.99=371967] 
  • Case 1: onebox2 is not set up.

image

  • Case 2: onebox2 is set up.

image

On the receiver side (onebox2), the duplicated writes it handles are displayed like this:

image

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation
  • Need to be included in the release note

@neverchanje neverchanje changed the title feat(dup): support shell set_dup_fail_mode and collector aggregated duplication ops feat(dup): support shell set fail_mode and collector duplication ops Apr 16, 2020
@neverchanje neverchanje marked this pull request as ready for review April 17, 2020 08:16
@neverchanje neverchanje added component/duplication cluster duplication type/perf-counter PR that made modification on perf-counter, which should be noted in release note. labels Apr 17, 2020
@acelyc111

This comment has been minimized.

if (rpc_code == dsn::apps::RPC_RRDB_RRDB_DUPLICATE) {
    // ignore if it is a DUPLICATE
    continue;
Contributor

I can't understand your updates in this function. Could you please explain them to me?

Contributor Author

@neverchanje neverchanje Apr 24, 2020

Sure. I left a comment here. It was a bug. DUPLICATE cannot participate in duplication like other normal writes, because duplication is designed to be a directed edge flowing from master to slave. Forwarding a DUPLICATE would unintentionally add edges to this topology.
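
To illustrate the point, here is a minimal sketch of the filtering step, assuming a simplified model of a write batch. RpcCode, Write and select_writes_to_ship are hypothetical stand-ins for illustration only, not the types or functions used in this PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real task codes; only DUPLICATE matters here.
enum class RpcCode { PUT, REMOVE, DUPLICATE };

struct Write
{
    RpcCode code;
    std::string key;
};

// Return the writes that should be shipped to the remote cluster.
// A DUPLICATE write already arrived from another cluster, so shipping it
// again would add an extra edge to the master -> slave topology.
std::vector<Write> select_writes_to_ship(const std::vector<Write> &batch)
{
    std::vector<Write> out;
    for (const Write &w : batch) {
        if (w.code == RpcCode::DUPLICATE) {
            continue; // ignore if it is a DUPLICATE
        }
        out.push_back(w);
    }
    // Filtering can only drop writes, never add them.
    assert(out.size() <= batch.size());
    return out;
}
```

With a batch of PUT, DUPLICATE, REMOVE, only the PUT and the REMOVE are returned, in their original order.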

@@ -87,6 +90,9 @@ class info_collector
::dsn::perf_counter_wrapper check_and_set_qps;
::dsn::perf_counter_wrapper check_and_mutate_qps;
::dsn::perf_counter_wrapper scan_qps;
::dsn::perf_counter_wrapper duplicate_qps;
Member

Rename to dup_qps? Then the three metrics would share the same prefix and be easier to identify.

Contributor Author

@neverchanje neverchanje Apr 24, 2020

It's not trivial, because on the replica side the perf-counter is called replica*app.pegasus*duplicate_qps@<pid>. The collector side must therefore use the same name.
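
A minimal sketch of why the names are coupled, assuming the collector derives its counter name from the metric part of the replica-side name. Both helper functions are hypothetical; only the two name patterns come from this PR.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Extract the metric name from a replica-side counter of the form
// "replica*app.pegasus*<metric>@<pid>". Returns std::nullopt if the
// input does not have that shape.
std::optional<std::string> replica_metric_name(const std::string &counter)
{
    const std::string prefix = "replica*app.pegasus*";
    if (counter.compare(0, prefix.size(), prefix) != 0) {
        return std::nullopt;
    }
    const std::size_t at = counter.find('@', prefix.size());
    if (at == std::string::npos || at == prefix.size()) {
        return std::nullopt;
    }
    return counter.substr(prefix.size(), at - prefix.size());
}

// Build the collector-side name, which reuses the replica-side metric name
// and is keyed by app name instead of partition id.
std::string collector_counter_name(const std::string &metric, const std::string &app_name)
{
    assert(!metric.empty() && !app_name.empty());
    return "collector*app.pegasus*app.stat." + metric + "#" + app_name;
}
```

Under this assumption, renaming the collector-side counter to dup_qps would need either a matching rename on the replica side or a special-case mapping between the two.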

@hycdong hycdong merged commit 4b1f5eb into apache:master Apr 26, 2020
@neverchanje neverchanje deleted the dup-hint branch April 27, 2020 03:46
@neverchanje neverchanje mentioned this pull request May 14, 2020
@neverchanje neverchanje mentioned this pull request Jun 10, 2020
Labels
component/duplication cluster duplication type/perf-counter PR that made modification on perf-counter, which should be noted in release note. v2.0.0
3 participants