
drainer/: Reduce memory usage (#735) #737

Merged: 23 commits into pingcap:master, Sep 25, 2019

Conversation

@lichunzhu (Contributor) commented Sep 2, 2019:

What problem does this PR solve?

Cherry pick of #735.
TOOL-1480
The current drainer uses buffered channels to cache instances, which may cause OOM when it encounters binlogs that consume too much memory.

What is changed and how it works?

  1. Remove buffering on the upstream side to avoid excessive memory usage, without decreasing performance.
  2. Add a txnManager in the loader to manage the memory used by cached txns; without it, the loader's performance would suffer. (A rough sketch of the idea follows this list.)
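
For illustration only, here is a minimal sketch of the idea behind such a memory-bounded txn cache. The names txnManager, run, pop, cacheChan, and maxCacheSize come from the discussion in this PR, but the fields, signatures, and bodies below are assumptions, not the actual pkg/loader code:

package loader

import "sync"

// Txn is a stand-in for a batch of binlog changes; only its size matters here.
type Txn struct {
	size int
}

func (t *Txn) Size() int { return t.size }

// txnManager sketch: it admits txns from input until their total size exceeds
// maxCacheSize, then blocks the producer side on a sync.Cond until consumers
// call pop() to release memory.
type txnManager struct {
	mu           sync.Mutex
	cond         *sync.Cond
	cachedSize   int // bytes currently admitted
	maxCacheSize int // soft upper bound on cached bytes
	input        chan *Txn
	cacheChan    chan *Txn
}

func newTxnManager(maxCacheSize int, input chan *Txn) *txnManager {
	m := &txnManager{
		maxCacheSize: maxCacheSize,
		input:        input,
		cacheChan:    make(chan *Txn, 1024),
	}
	m.cond = sync.NewCond(&m.mu)
	return m
}

// run forwards txns from input to the returned channel, waiting while the
// cache is full; it returns immediately and does the work in a goroutine.
func (m *txnManager) run() chan *Txn {
	go func() {
		defer close(m.cacheChan)
		for txn := range m.input {
			size := txn.Size()
			m.mu.Lock()
			// A txn bigger than maxCacheSize is still admitted once the cache
			// drains to zero, so it cannot block forever.
			for m.cachedSize > 0 && m.cachedSize+size > m.maxCacheSize {
				m.cond.Wait()
			}
			m.cachedSize += size
			m.mu.Unlock()
			m.cacheChan <- txn
		}
	}()
	return m.cacheChan
}

// pop releases the memory accounted to a consumed txn and wakes the producer.
func (m *txnManager) pop(txn *Txn) {
	m.mu.Lock()
	m.cachedSize -= txn.Size()
	m.cond.Signal()
	m.mu.Unlock()
}

The key point is that admission is bounded by total cached bytes rather than a fixed item count, so a handful of 60 MB binlogs cannot accumulate the way 512 buffered items could.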

To check the effectiveness, we ran two tests.

In the following:
v1 means commit 097db42, the latest master branch at the time.
v2 means commit 1e51514, the current branch.
In v1, maxBinlogItemCount=512
In v2, maxBinlogItemCount=0

TEST 1: Small binlogs, to validate the efficiency of drainer

The tests of v1 and v2 are shown in one picture: 15:30 ~ 15:50 is the test of v1, while 15:53 ~ 16:13 is the test of v2. As we can see, the memory usage and drainer events of both branches are similar.
(Screenshots: drainer event, memory usage, and execution time graphs.)

TEST 2: upstream data with a single binlog close to 60 MB, to validate the memory usage of drainer

For v1, and for configurations where the loader.input, loader.success, and dsyncer.success buffers are set to 1024, drainer runs into an OOM problem: memory usage rises to more than 100 GB.
For v2, memory usage stays around 2 GB. The drainer event metric is around 2k for DDL and around 14k for DML.
(Screenshots: drainer event, memory usage, and execution time graphs.)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Code changes

Side effects

  • Possible performance regression

Related changes

  • Need to update the documentation
  • Need to update the tidb-ansible repository
  • Need to be included in the release note

@lichunzhu lichunzhu changed the title Czli/drainer/reduce memory usage drainer/: Reduce memory usage Sep 3, 2019
@lichunzhu lichunzhu changed the title drainer/: Reduce memory usage drainer/: Reduce memory usage https://github.com/pingcap/tidb-binlog/pull/735 Sep 3, 2019
@lichunzhu lichunzhu changed the title drainer/: Reduce memory usage https://github.com/pingcap/tidb-binlog/pull/735 drainer/: Reduce memory usage Sep 3, 2019
@lichunzhu lichunzhu changed the title drainer/: Reduce memory usage drainer/: Reduce memory usage (#735) Sep 3, 2019
@WangXiangUSTC (Contributor): LGTM

suzaku previously approved these changes Sep 6, 2019
@suzaku (Contributor) left a comment: LGTM

@lichunzhu (Contributor, Author): DNM for now. This PR had better be merged after the release-2.1 version.

@july2993 (Contributor) commented Sep 8, 2019:

Did you run it in a case where there is lag?

Regarding "In v1, maxBinlogItemCount=512": with this setting the memory usage will not stay below 1 GB; 60 MB * 512 is at least 30 GB.

You may also need to change pkg/loader's input buffer size.
Keep in mind that what you need to do is prove drainer will not take too much memory when there is no lag.

@IANTHEREAL (Collaborator): @suzaku PTAL

suzaku previously approved these changes Sep 25, 2019
@lichunzhu (Contributor, Author): DNM for now; I will add some unit tests for txnManager.

@IANTHEREAL (Collaborator): Please complete it today @lichunzhu.

… when blocked at cond.Wait() or input is closed; add one unit test to check whether txnManager can handle a txn whose size is bigger than maxCacheSize
@lichunzhu (Contributor, Author): /run-all-tests tidb=release-3.0 tikv=release-3.0 pd=release-3.0

@lichunzhu (Contributor, Author): @GregoryIan PTAL

select {
case input <- txn:
case <-time.After(50 * time.Microsecond):
c.Fatal("txnManager gets blocked while receiving txns")
Contributor: If the 5th txn gets blocked sending to input, why doesn't this test fail?

@lichunzhu (Contributor, Author) Sep 25, 2019: The 5th txn was picked from input but can't be added to txnManager.cacheChan. Although it gets blocked at cond.Wait(), it is still picked out of input. By the way, we can't know a txn's size before it is picked out of input.

@lichunzhu (Contributor, Author): The check c.Assert(output, check.HasLen, 4) makes sure it's not added.

pkg/loader/load_test.go (outdated, resolved)
case t := <-output:
txnManager.pop(t)
c.Assert(t, check.DeepEquals, txn)
case <-time.After(50 * time.Microsecond):
Contributor: No need to wait? Just use default, because if we can't get a txn at this point, something must be wrong.

@lichunzhu (Contributor, Author): default can be used here.

output := txnManager.run()
select {
case input <- txnSmall:
case <-time.After(50 * time.Microsecond):
Contributor: Use default?

@lichunzhu (Contributor, Author): txnManager.run() starts a new goroutine that then begins watching the input channel, which takes some time. Using default here may cause a spurious failure.
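
To make this timing issue concrete, here is a small standalone sketch (not the PR's test code; the int channel and the 50µs grace period are only illustrative) of why a non-blocking send with default can fail spuriously right after a consumer goroutine is started:

package main

import (
	"fmt"
	"time"
)

func main() {
	input := make(chan int) // unbuffered, standing in for the test's input channel

	// The consumer starts asynchronously, like the goroutine inside txnManager.run().
	go func() {
		for v := range input {
			_ = v
		}
	}()

	// A non-blocking send may hit default before the consumer is even scheduled,
	// reporting a failure although nothing is wrong.
	select {
	case input <- 1:
		fmt.Println("sent immediately")
	default:
		fmt.Println("spurious failure: consumer not ready yet")
	}

	// A short grace period lets the consumer start, while still catching a
	// consumer that is genuinely blocked.
	select {
	case input <- 2:
		fmt.Println("sent within the grace period")
	case <-time.After(50 * time.Microsecond):
		fmt.Println("timed out: consumer appears blocked")
	}
}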

@lichunzhu (Contributor, Author): /run-unit-tests

@@ -413,13 +413,13 @@ func (s *txnManagerSuite) TestRunTxnManager(c *check.C) {
 	case t := <-output:
 		txnManager.pop(t)
 		c.Assert(t, check.DeepEquals, txn)
-	case <-time.After(50 * time.Microsecond):
+	default:
 		c.Fatal("Fail to pick txn from txnManager")
 	}
Contributor:

select {
case t := <-output:
case <-time.After...
}

This pattern appears a lot in this test; is it possible to extract it as a helper function to simplify the test?
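
For reference, one possible shape for such a helper, assuming the gocheck-style c *check.C and the txnManager/Txn names used in this test (illustrative only, not the code that was eventually committed):

// mustPopTxn is a hypothetical helper: wait briefly for a txn on output,
// release it from the manager, and fail the test if nothing arrives in time.
func mustPopTxn(c *check.C, mgr *txnManager, output <-chan *Txn, expected *Txn) {
	select {
	case t := <-output:
		mgr.pop(t)
		c.Assert(t, check.DeepEquals, expected)
	case <-time.After(50 * time.Microsecond):
		c.Fatal("fail to pick txn from txnManager")
	}
}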

@lichunzhu (Contributor, Author): @suzaku PTAL

@lichunzhu (Contributor, Author): /run-unit-tests

@IANTHEREAL (Collaborator): Rest LGTM

Co-Authored-By: satoru <satorulogic@gmail.com>
@lichunzhu (Contributor, Author): /run-unit-tests

@july2993 (Contributor): /run-all-tests tidb=release-3.0 tikv=release-3.0 pd=release-3.0

@IANTHEREAL (Collaborator) left a comment: LGTM

@lichunzhu (Contributor, Author): /run-integration-tests tidb=release-3.0 tikv=release-3.0 pd=release-3.0

@lichunzhu lichunzhu merged commit bbe5d6e into pingcap:master Sep 25, 2019
@lichunzhu lichunzhu deleted the czli/drainer/reduceMemoryUsage branch September 25, 2019 13:52
@july2993 july2993 mentioned this pull request Oct 16, 2019