Skip to content
This repository has been archived by the owner on Dec 8, 2021. It is now read-only.

Replace CSV Ragel parser by a hand-written parser copied from encoding/csv #275

Merged
merged 11 commits into from
Mar 16, 2020

Conversation

kennytm
Copy link
Collaborator

@kennytm kennytm commented Mar 2, 2020

What problem does this PR solve?

The Ragel-based CSV parser is much slower than the standard encoding/csv parser.

As explained in #111, the parser was using Ragel to reuse existing framework for simplicity. However as more and more customer start to use Lightning with CSV input, this is the time we need to optimize the things.

What is changed and how it works?

The new parser was inspired by encoding/csv but almost nothing remained except the recordBuffer/fieldIndexes members to reduce the amount of allocations. We cannot reuse encoding/csv because:

  • encoding/csv does not allow us to track the current read position
  • encoding/csv does not recognize backslash-escaped fields (required for MySQL-generated CSV)
  • encoding/csv does not support disabling quoting

Implementing these make the new parser still worse than encoding/csv, but much better than the original one.

Parser TPC-C "CUSTOMER" Row
encoding/csv 1898 ns/op
Original parser 6930 ns/op
New parser 2426 ns/op

So the new parser is 35% of the original parser and encoding/csv is 80% of the new parser. We can investigate how to squeeze out the remaining 20% later.

Check List

Tests

  • Unit test

Side effects

  • Increased code complexity

Related changes

  • Need to be included in the release note
    • Improved CSV parsing speed by (TO BE FILLED IN)%

@kennytm kennytm force-pushed the kennytm/stop-using-ragel-csv-parser branch from 1ff6963 to 31ff030 Compare March 2, 2020 18:32
@kennytm kennytm marked this pull request as ready for review March 2, 2020 18:56
@kennytm
Copy link
Collaborator Author

kennytm commented Mar 2, 2020

/run-all-tests

@kennytm kennytm force-pushed the kennytm/stop-using-ragel-csv-parser branch from 9d6f8e2 to 82c3067 Compare March 2, 2020 21:20
@kennytm kennytm force-pushed the kennytm/stop-using-ragel-csv-parser branch from 82c3067 to fa277ed Compare March 2, 2020 21:39
@siddontang
Copy link
Member

do we have a benchmark test for lightning, e.g. 10GB data, 3 TiKV, running /bench, @zhouqiang-cl

@zhouqiang-cl
Copy link
Contributor

do we have a benchmark test for lightning, e.g. 10GB data, 3 TiKV, running /bench, @zhouqiang-cl

Now. we did not have benchmark for lightning. will scheduler for it

@kennytm kennytm added priority/normal status/PTAL This PR is ready for review. Add this label back after committing new changes type/enhancement Performance improvement or refactoring labels Mar 5, 2020
@kennytm
Copy link
Collaborator Author

kennytm commented Mar 11, 2020

PTAL again @XuHuaiyu

before recycling:

PASS: csv_parser_test.go:690: benchCSVParserSuite.BenchmarkReadRowUsingMydumpCSVParser   1000000              2875 ns/op            2144 B/op          3 allocs/op

after recycling:

PASS: csv_parser_test.go:690: benchCSVParserSuite.BenchmarkReadRowUsingMydumpCSVParser    500000              2415 ns/op             899 B/op          3 allocs/op

Copy link
Contributor

@july2993 july2993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

no CI to make sure the static checker runs now

lightning/mydump/csv_parser.go Outdated Show resolved Hide resolved
lightning/mydump/csv_parser_test.go Show resolved Hide resolved
lightning/mydump/csv_parser_test.go Show resolved Hide resolved
@kennytm
Copy link
Collaborator Author

kennytm commented Mar 13, 2020

/run-all-tests tidb=release-3.1 pd=release-3.1 tikv=release-3.1

@july2993
Copy link
Contributor

fail to start tidb? 🤔

[2020-03-13T21:46:01.463Z] [2020/03/14 05:45:58.739 +08:00] [WARN] [base_client.go:170] ["[pd] failed to get cluster id"] [url=http://127.0.0.1:2379] [error="tls: found a certificate rather than a key in the PEM for the private key"] [errorVerbose="tls: found a certificate rather than a key in the PEM for the private key\ngithub.com/pingcap/pd/pkg/grpcutil.SecurityConfig.ToTLSConfig\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/pkg/grpcutil/grpcutil.go:49\ngithub.com/pingcap/pd/client.(*baseClient).getOrCreateGRPCConn\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:264\ngithub.com/pingcap/pd/client.(*baseClient).getMembers\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:202\ngithub.com/pingcap/pd/client.(*baseClient).initClusterID\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:167\ngithub.com/pingcap/pd/client.(*baseClient).initRetry\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:103\ngithub.com/pingcap/pd/client.newBaseClient\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:84\ngithub.com/pingcap/pd/client.NewClientWithContext\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/client.go:136\ngithub.com/pingcap/pd/client.NewClient\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/client.go:130\ngithub.com/pingcap/tidb/store/tikv.Driver.Open\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/tikv/kv.go:82\ngithub.com/pingcap/tidb/store.newStoreWithRetry.func1\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/store.go:70\ngithub.com/pingcap/tidb/util.RunWithRetry\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/util/misc.go:51\ngithub.com/pingcap/tidb/store.newStoreWithRetry\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/store.go:68\ngithub.com/pingcap/tidb/store.New\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/store.go:52\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:207\nmain.main\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:172\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\ngithub.com/pingcap/pd/client.(*baseClient).getOrCreateGRPCConn\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:266\ngithub.com/pingcap/pd/client.(*baseClient).getMembers\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:202\ngithub.com/pingcap/pd/client.(*baseClient).initClusterID\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:167\ngithub.com/pingcap/pd/client.(*baseClient).initRetry\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:103\ngithub.com/pingcap/pd/client.newBaseClient\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/base_client.go:84\ngithub.com/pingcap/pd/client.NewClientWithContext\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/client.go:136\ngithub.com/pingcap/pd/client.NewClient\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/pkg/mod/github.com/pingcap/pd@v1.1.0-beta.0.20200213133706-fbbe75e180e6/client/client.go:130\ngithub.com/pingcap/tidb/store/tikv.Driver.Open\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/tikv/kv.go:82\ngithub.com/pingcap/tidb/store.newStoreWithRetry.func1\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/store.go:70\ngithub.com/pingcap/tidb/util.RunWithRetry\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/util/misc.go:51\ngithub.com/pingcap/tidb/store.newStoreWithRetry\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/store.go:68\ngithub.com/pingcap/tidb/store.New\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/store/store.go:52\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:207\nmain.main\n\t/home/jenkins/agent/workspace/tidb_release-3.1/go/src/github.com/pingcap/tidb/tidb-```

@july2993 july2993 added status/LGT1 One reviewer already commented LGTM (LGTM1) and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Mar 14, 2020
@july2993
Copy link
Contributor

/run-all-tests

@3pointer
Copy link
Contributor

this seek seems unnecessary after this PR changed

Copy link
Contributor

@3pointer 3pointer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM except #275 (comment)

@july2993
Copy link
Contributor

@kennytm pls resolve conflicts

@july2993
Copy link
Contributor

/run-all-test
seems tests/check_requirements/ fail by chance

@july2993
Copy link
Contributor

/run-all-tests

@kennytm
Copy link
Collaborator Author

kennytm commented Mar 16, 2020

PTAL @3pointer

@july2993 july2993 merged commit f53609a into master Mar 16, 2020
@july2993 july2993 deleted the kennytm/stop-using-ragel-csv-parser branch March 16, 2020 08:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
status/LGT1 One reviewer already commented LGTM (LGTM1) type/enhancement Performance improvement or refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants