ddl: improve the reorg task scheduling #38646

Merged on Nov 10, 2022 (16 commits)


@tangenta tangenta commented Oct 26, 2022

What problem does this PR solve?

Issue Number: ref #35983

Problem Summary:

In the data reorganization stage of adding an index, we split the table ranges into a few tasks and send them to several workers. These workers, called "add index workers" or "backfill workers", run in parallel to speed up the creation of index records.

The main thread organizes the tasks in batches. Each batch contains @@tidb_ddl_reorg_worker_cnt tasks. The tasks in a batch are sent to the backfill workers one by one:

tidb/ddl/backfilling.go

Lines 393 to 396 in ac0d36b

func (dc *ddlCtx) sendTasksAndWait(sessPool *sessionPool, reorgInfo *reorgInfo, totalAddedCount *int64, workers []*backfillWorker, batchTasks []*reorgBackfillTask) error {
	for i, task := range batchTasks {
		workers[i].taskCh <- task
	}

We cannot proceed with the next batch until all the workers finish:

tidb/ddl/backfilling.go

Lines 367 to 369 in ac0d36b

for i := 0; i < taskCnt; i++ {
	worker := workers[i]
	result := <-worker.resultCh

As a result, the time spent on each batch is determined by its slowest backfill worker. Workers that finish early sit idle until the batch completes, so CPU utilization is poor.
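To make the cost of batch-synchronized scheduling concrete, here is a minimal sketch that sums the slowest task of each batch. The function name `batchedMakespan` and the task costs are hypothetical illustrations, not part of TiDB; costs are in arbitrary time units.

```go
package main

import "fmt"

// batchedMakespan models the scheduling described above: tasks are grouped
// into batches of workerCnt, and the next batch cannot start until the
// slowest task of the current batch has finished.
// costs[i] is a hypothetical duration of task i in arbitrary units.
func batchedMakespan(costs []int, workerCnt int) int {
	if workerCnt <= 0 {
		return 0 // nothing can run without workers
	}
	total := 0
	for start := 0; start < len(costs); start += workerCnt {
		end := start + workerCnt
		if end > len(costs) {
			end = len(costs)
		}
		// The batch takes as long as its slowest task.
		slowest := 0
		for _, c := range costs[start:end] {
			if c > slowest {
				slowest = c
			}
		}
		total += slowest
	}
	return total
}

func main() {
	// Two batches of four tasks, each containing one slow task.
	costs := []int{1, 1, 1, 5, 5, 1, 1, 1}
	fmt.Println(batchedMakespan(costs, 4)) // prints 10
}
```

In this made-up example the total work is 16 units, so four fully busy workers would need only 4 units each, yet the batched schedule takes 10 because three workers wait for the slow task in every batch.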

What is changed and how it works?

This PR proposes a better model: all backfill workers share the same task channel and the same result channel. Once a worker finishes a task, it can pick up the next one immediately, without waiting for the other workers.
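The shared-channel model can be sketched in a few lines of Go. This is a simplified illustration, not the code from this PR: the `task` and `result` types and the `runShared` function are hypothetical stand-ins, and error handling and cancellation are omitted. It assumes at least one worker.

```go
package main

import (
	"fmt"
	"sync"
)

// task and result are hypothetical stand-ins for the real backfill types.
type task struct{ id int }

type result struct {
	taskID   int
	workerID int
}

// runShared starts workerCnt goroutines that all receive from one task
// channel and all send to one result channel. A worker that finishes a
// task immediately receives the next one.
func runShared(tasks []task, workerCnt int) []result {
	taskCh := make(chan task)
	resultCh := make(chan result)

	var wg sync.WaitGroup
	for w := 0; w < workerCnt; w++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			for t := range taskCh {
				// The real worker would backfill index records here.
				resultCh <- result{taskID: t.id, workerID: workerID}
			}
		}(w)
	}

	// Feed tasks, then close the channel so the workers exit their loops.
	go func() {
		for _, t := range tasks {
			taskCh <- t
		}
		close(taskCh)
	}()

	// Close the result channel once every worker has returned.
	go func() {
		wg.Wait()
		close(resultCh)
	}()

	var results []result
	for r := range resultCh {
		results = append(results, r)
	}
	return results
}

func main() {
	tasks := make([]task, 8)
	for i := range tasks {
		tasks[i] = task{id: i}
	}
	results := runShared(tasks, 4)
	fmt.Println(len(results)) // prints 8; the order of results is not deterministic
}
```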

However, this change breaks the order of task execution. For example, task 6 may be handled earlier than task 5. We therefore need another way to determine the "next handle", which is persisted to storage as a checkpoint. In this PR, doneTaskKeeper solves this problem.
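One way to keep a safe checkpoint when tasks finish out of order is to advance it only across a contiguous prefix of finished tasks. The sketch below illustrates that idea under stated assumptions: tasks carry sequential IDs starting at 0, and each task knows the key at which the next task begins. The `doneTracker` type and its methods are hypothetical and are not the doneTaskKeeper implementation from this PR.

```go
package main

import "fmt"

// doneTracker is a hypothetical illustration of out-of-order completion
// tracking. It is not TiDB's doneTaskKeeper.
type doneTracker struct {
	nextID  int            // smallest task ID that has not finished yet
	nextKey string         // checkpoint: every key before it has been handled
	pending map[int]string // tasks finished ahead of nextID: ID -> end key
}

func newDoneTracker(startKey string) *doneTracker {
	return &doneTracker{nextKey: startKey, pending: make(map[int]string)}
}

// finish records that task id, which covers keys up to endKey, is done.
// The checkpoint moves forward only while the next expected task has
// already finished, so it never skips over unfinished work.
func (d *doneTracker) finish(id int, endKey string) {
	d.pending[id] = endKey
	for {
		key, ok := d.pending[d.nextID]
		if !ok {
			return
		}
		delete(d.pending, d.nextID)
		d.nextKey = key
		d.nextID++
	}
}

func main() {
	tr := newDoneTracker("a")
	tr.finish(1, "c")       // task 1 finishes before task 0
	fmt.Println(tr.nextKey) // prints a: task 0 is still running
	tr.finish(0, "b")
	fmt.Println(tr.nextKey) // prints c: tasks 0 and 1 are both done
}
```

If the job restarts, resuming from `nextKey` may redo tasks that finished ahead of the checkpoint, but it cannot miss a range that was never handled.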

Check List

Tests

  • Unit test
  • Integration test

  • Manual test (add detailed scripts or steps below)

    • Local environment
    • Sysbench table sbtest1
      sbtest1 | CREATE TABLE `sbtest1` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `k` int(11) NOT NULL DEFAULT '0',
      `c` char(120) NOT NULL DEFAULT '',
      `pad` char(60) NOT NULL DEFAULT '',
      PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
      KEY `k_1` (`k`),
      KEY `idx` (`k`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=10224220
    • 10 million records
    • @@tidb_ddl_enable_fast_reorg = 1
    • @@tidb_ddl_reorg_worker_cnt = 4
    • @@tidb_ddl_reorg_batch_size = 256

Before this PR:

mysql> alter table sbtest1 add index idx(k);
Query OK, 0 rows affected (36.71 sec)
[2022/10/26 13:54:49.549 +08:00] [INFO] [backfilling.go:713] ["[ddl] start backfill workers to reorg record"] [type="add index"] [workerCnt=4] [regionCnt=65] [startKey=7480000000000000485f728000000000000001] [endKey=7480000000000000485f72800000000098f84d]
[2022/10/26 13:54:49.549 +08:00] [INFO] [backfilling.go:289] ["[ddl] backfill worker start"] [type="add index"] [workerID=3]
...
[2022/10/26 13:55:10.579 +08:00] [INFO] [reorg.go:237] ["[ddl] run reorg job done"] ["handled rows"=10000000]
[2022/10/26 13:55:10.579 +08:00] [INFO] [backfilling.go:326] ["[ddl] backfill worker exit"] [type="add index"] [workerID=0]

It takes 21 seconds to finish the backfilling stage.

After this PR:

mysql> alter table sbtest1 add index idx(k);
Query OK, 0 rows affected (30.44 sec)
[2022/10/26 13:52:41.490 +08:00] [INFO] [backfilling.go:736] ["[ddl] start backfill workers to reorg record"] [type="add index"] [workerCnt=4] [regionCnt=65] [startKey=7480000000000000485f728000000000000001] [endKey=7480000000000000485f72800000000098f84d]
[2022/10/26 13:52:41.490 +08:00] [INFO] [backfilling.go:296] ["[ddl] backfill worker start"] [type="add index"] [workerID=3]
...
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=3]
[2022/10/26 13:52:56.673 +08:00] [INFO] [reorg.go:237] ["[ddl] run reorg job done"] ["handled rows"=10000000]
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=0]
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=1]
[2022/10/26 13:52:56.673 +08:00] [INFO] [backfilling.go:343] ["[ddl] backfill worker exit"] [type="add index"] [workerID=2]

It takes 15 seconds to finish the backfilling stage.


  • No code
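The backfilling-stage durations quoted in the manual test can be checked from the first and last log timestamps of each run. The following is a small sketch; the helper name `elapsed` is an assumption, and the layout string is written to match the timestamp format shown in the logs.

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches timestamps such as "2022/10/26 13:54:49.549 +08:00".
const logLayout = "2006/01/02 15:04:05.000 -07:00"

// elapsed parses two log timestamps and returns the time between them.
func elapsed(start, end string) (time.Duration, error) {
	s, err := time.Parse(logLayout, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(logLayout, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// Timestamps copied from the "before" and "after" logs in the manual test.
	before, err := elapsed("2022/10/26 13:54:49.549 +08:00", "2022/10/26 13:55:10.579 +08:00")
	if err != nil {
		panic(err)
	}
	after, err := elapsed("2022/10/26 13:52:41.490 +08:00", "2022/10/26 13:52:56.673 +08:00")
	if err != nil {
		panic(err)
	}
	fmt.Println(before, after) // prints 21.03s 15.183s
}
```

On this single local run, the backfilling stage is roughly 28% shorter after the change.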

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None


ti-chi-bot commented Oct 26, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Defined2014
  • zimulala

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 26, 2022
@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 26, 2022
@Benjamin2037 Benjamin2037 self-requested a review October 27, 2022 01:08
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Nov 8, 2022

@zimulala zimulala left a comment

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Nov 9, 2022
@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 9, 2022

tangenta commented Nov 9, 2022

/hold because the unit test failed.

@ti-chi-bot ti-chi-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2022
@ti-chi-bot ti-chi-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed status/can-merge Indicates a PR has been approved by a committer. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 9, 2022

tangenta commented Nov 9, 2022

/merge

@ti-chi-bot

This pull request has been accepted and is ready to merge.

Commit hash: cb06517

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 9, 2022
@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Nov 9, 2022
@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 9, 2022
@hawkingrei

/merge

@ti-chi-bot

This pull request has been accepted and is ready to merge.

Commit hash: 8099780

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 9, 2022
@tangenta

/unhold

@ti-chi-bot ti-chi-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2022
@ti-chi-bot ti-chi-bot merged commit cfbe3c9 into pingcap:master Nov 10, 2022

sre-bot commented Nov 10, 2022

TiDB MergeCI notify

🔴 Bad News! New failing [1] after this pr merged.
These new failed integration tests seem to be caused by the current PR, please try to fix these new failed integration tests, thanks!

CI Name | Result | Duration | Compare with Parent commit
idc-jenkins-ci-tidb/integration-compatibility-test | 🟥 failed 1, success 0, total 1 | 2 min 25 sec | New failing
idc-jenkins-ci-tidb/integration-ddl-test | 🔴 failed 1, success 5, total 6 | 44 min | Existing failure
idc-jenkins-ci-tidb/mybatis-test | 🔴 failed 1, success 0, total 1 | 11 min | Existing failure
idc-jenkins-ci/integration-cdc-test | 🟢 all 39 tests passed | 25 min | Existing passed
idc-jenkins-ci-tidb/tics-test | 🟢 all 1 tests passed | 16 min | Existing passed
idc-jenkins-ci-tidb/integration-common-test | 🟢 all 17 tests passed | 12 min | Existing passed
idc-jenkins-ci-tidb/common-test | 🟢 all 11 tests passed | 11 min | Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1 | 🟢 all 26 tests passed | 5 min 33 sec | Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2 | 🟢 all 28 tests passed | 5 min 14 sec | Existing passed
idc-jenkins-ci-tidb/plugin-test | 🟢 build success, plugin test success | 4 min | Existing passed

Labels
release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.

7 participants