Fix parsing issues in models.scanKey #4841

e-dard · 2015-11-19T00:52:53Z

The PR refactors models.scanKey, which is used to parse the measurements and tags. It:

Simplifies the scanKey logic by broadly modelling it as a state machine;
Fixes issues line protocol does not allow for some edge cases with backslash and escaping #3070 and Incorrect error when inserting cpu, #4770 (and some other non-documented incorrect error messages);
Refactors some tests to add coverage and improve robustness, e.g., checking correct errors returned, rather than an error returned;
Removes redundant conditionals in the scanKey function.

Initially I planned on simply fixing #3070 and #4770, but I felt that the scanKey function was pretty complicated, and had had extra conditionals and checks added over time, some of which were redundant, or unreachable. It seemed like it was worth refactoring this function into something easier to understand.

I've ran benchmarks before and after and performance is broadly inline with the previous implementation, if not a little faster for some benchmarks.

(Updated)

name                         old time/op    new time/op    delta
Marshal-4                      6.08µs ± 1%    6.10µs ± 1%     ~     (p=0.132 n=6+6)
ParsePointNoTags-4              838ns ± 0%     828ns ± 1%   -1.23%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4       1.16µs ± 2%    1.14µs ± 0%   -1.95%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4       1.52µs ± 1%    1.42µs ± 0%   -6.22%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4      2.28µs ± 1%    2.07µs ± 0%   -9.45%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4     1.42µs ± 0%    1.41µs ± 1%   -1.18%  (p=0.010 n=4+6)
ParsePointsTagsUnSorted5-4     1.55µs ± 1%    1.44µs ± 0%   -7.24%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4    3.87µs ± 1%    3.59µs ± 1%   -7.22%  (p=0.002 n=6+6)

name                         old speed      new speed      delta
ParsePointNoTags-4           27.4MB/s ± 0%  27.8MB/s ± 1%   +1.24%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4     43.8MB/s ± 2%  44.6MB/s ± 0%   +1.98%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4     54.7MB/s ± 1%  58.3MB/s ± 0%   +6.62%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4    62.6MB/s ± 1%  69.1MB/s ± 0%  +10.44%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4   35.8MB/s ± 0%  36.2MB/s ± 1%   +1.29%  (p=0.004 n=5+6)
ParsePointsTagsUnSorted5-4   53.6MB/s ± 1%  57.8MB/s ± 0%   +7.79%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4  36.9MB/s ± 1%  39.8MB/s ± 1%   +7.77%  (p=0.002 n=6+6)

CHANGELOG.md updated
Rebased/mergable
Tests pass
Sign CLA (if not already signed)

e-dard · 2015-11-19T00:58:15Z

I also plan on fixing the lint issues in the models package, but will wait until the current changes have been reviewed.

jwilder · 2015-11-19T16:00:25Z

This parsing code has been pretty highly tuned because it's used in a lot of places (HTTP/UDP inputs, hinted handoff, WAL, cluster writes, exports, etc..). The performance decrease is fairly significant because it reduces throughput by 20-45% in common cases.

name                         old time/op    new time/op    delta
Marshal-4                      6.11µs ± 0%    6.14µs ± 0%   +0.62%
ParsePointNoTags-4              844ns ± 0%     838ns ± 0%   -0.71%
ParsePointsTagsSorted2-4       1.45µs ± 0%    1.48µs ± 0%   +2.00%
ParsePointsTagsSorted5-4       1.70µs ± 0%    2.07µs ± 0%  +21.76%
ParsePointsTagsSorted10-4      2.29µs ± 0%    3.32µs ± 0%  +44.72%
ParsePointsTagsUnSorted2-4     1.47µs ± 0%    1.76µs ± 0%  +20.26%
ParsePointsTagsUnSorted5-4     1.55µs ± 0%    2.17µs ± 0%  +39.54%
ParsePointsTagsUnSorted10-4    3.93µs ± 0%    4.99µs ± 0%  +26.91%

name                         old speed      new speed      delta
ParsePointNoTags-4           27.2MB/s ± 0%  27.4MB/s ± 0%   +0.62%
ParsePointsTagsSorted2-4     35.1MB/s ± 0%  34.4MB/s ± 0%   -1.91%
ParsePointsTagsSorted5-4     48.8MB/s ± 0%  40.1MB/s ± 0%  -17.85%
ParsePointsTagsSorted10-4    62.4MB/s ± 0%  43.1MB/s ± 0%  -30.89%
ParsePointsTagsUnSorted2-4   34.8MB/s ± 0%  28.9MB/s ± 0%  -16.90%
ParsePointsTagsUnSorted5-4   53.4MB/s ± 0%  38.3MB/s ± 0%  -28.34%
ParsePointsTagsUnSorted10-4  36.4MB/s ± 0%  28.7MB/s ± 0%  -21.18%

The UnSorted cases would affect ingestion endpoints that parse line protocol such as HTTP/UDP. The Sorted cases affect all the other parts of the system that are using the the line protocol as a serialization format after already being parsed by an input plugin.

I have not profiled this change, but I suspect the performance decrease is due to that closures and calling a func per char whereas before it would just increment an integer. The code doesn't seem much clearer to me either unfortunately, but I see where you were going with it. I'd suggest reworking this to have scanKey internally call a scanMeasurement and scanTags. scanTags could call a scanTagKey and scanTagValue, etc. if needed. Each of those would have slice passed in similar to how the funcs in the package work. That is the pattern used in the other funcs (at times).

I'm 👎 on this current PR, but I think it could be reworked to address these concerns and fix the original parser bugs.

e-dard · 2015-11-19T16:08:15Z

Hi @jwilder, your points seem completely reasonable to me.

This is my first contribution to the project and as you can imagine I have a shallow understanding of the entire codebase, and the impact of scanKeys on it, so your feedback is useful.

I will profile and see what I can do to get the benchmarks up to their previous levels 👍

If that's unsuccessful, then I have another branch waiting that will simply fix the outstanding issues.

…ata#4770

e-dard · 2015-11-20T01:46:40Z

@jwilder,

I've modified the approach to bring performance in line with expectation, while keeping separation of the different parts of the scanning process in different functions.

And of course the bugs should still be fixed 😄

Results of six runs of the following benchmark command: $ go test -benchtime=3s -bench . for master and this PR.

name                         old time/op    new time/op    delta
Marshal-4                      6.08µs ± 1%    6.10µs ± 1%     ~     (p=0.132 n=6+6)
ParsePointNoTags-4              838ns ± 0%     828ns ± 1%   -1.23%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4       1.16µs ± 2%    1.14µs ± 0%   -1.95%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4       1.52µs ± 1%    1.42µs ± 0%   -6.22%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4      2.28µs ± 1%    2.07µs ± 0%   -9.45%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4     1.42µs ± 0%    1.41µs ± 1%   -1.18%  (p=0.010 n=4+6)
ParsePointsTagsUnSorted5-4     1.55µs ± 1%    1.44µs ± 0%   -7.24%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4    3.87µs ± 1%    3.59µs ± 1%   -7.22%  (p=0.002 n=6+6)

name                         old speed      new speed      delta
ParsePointNoTags-4           27.4MB/s ± 0%  27.8MB/s ± 1%   +1.24%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4     43.8MB/s ± 2%  44.6MB/s ± 0%   +1.98%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4     54.7MB/s ± 1%  58.3MB/s ± 0%   +6.62%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4    62.6MB/s ± 1%  69.1MB/s ± 0%  +10.44%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4   35.8MB/s ± 0%  36.2MB/s ± 1%   +1.29%  (p=0.004 n=5+6)
ParsePointsTagsUnSorted5-4   53.6MB/s ± 1%  57.8MB/s ± 0%   +7.79%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4  36.9MB/s ± 1%  39.8MB/s ± 1%   +7.77%  (p=0.002 n=6+6)

So performance is now on a par or slightly better than before.

Thoughts?

jwilder · 2015-11-20T06:21:17Z

models/points_test.go

-		t.Errorf(`ParsePoints("%s") mismatch. got nil, exp error`, "cpu, value=1")
-	}
-
-	_, err = models.ParsePointsString("cpu,,, value=1")


This test case appears to be missing. Assuming this is covered by the cpu, case?

Yes it is covered by the cpu, case, but I didn't intend to leave it out. I've added it back (you can never have enough test cases 😆).

I gave them all the once over and added a couple more.

jwilder · 2015-11-20T06:29:01Z

@e-dard Nice work! Not only fixed the bugs, but code is simpler and faster.

One question, but this looks great. 👍

e-dard · 2015-11-20T11:23:09Z

@jwilder couple of things:

I will add a commit that lints the package.
I've found other places in the file where there are potentially escape sequence bugs, along the same lines as the ones I've fixed. Obviously the suite is passing so it's not clear if they can get triggered. Example below:

Consider scanTo for example, if it's given an input of a literal backslash followed by an escaped stop character it will not return the expected token (at least as I understand the purpose of scanTo).

Playground

I think the file could do with a review of all the functions that are incorrectly skipping buf when they see a literal slash. Maybe I should do this in another issue/PR? What do you think?

otoolep · 2015-11-20T14:14:25Z

CHANGELOG.md

@@ -26,6 +26,7 @@ There are breaking changes in this release:
  - Scripts are now located in `/usr/lib/influxdb/scripts` (previously `/opt/influxdb`)

 ### Features
+- [#4841](https://github.com/influxdb/influxdb/pull/4841): Improve parsing of measurements and tags


Can we be a little bit more specific here? The user-visible improvement would be less CPU consumption?, is that right?

Sorry, yes. I wrote that before I made the performance improvements. Not added to the CHANGELOG before, how about:

Improve point parsing speed

Also, I just realised I put it under the features section when it started off as a bug fix. Should I move it or leave it where it is?

I don't have a strong opinion, it's neither one or other.

OK I'll leave it where it is. Are you happy with:

Improve point parsing speed

Any better suggestions?

Fine with me -- thanks. You can mention you linted the code as well.

otoolep · 2015-11-20T14:15:44Z

@e-dard - have you signed the CLA?

e-dard · 2015-11-20T14:19:49Z

@otoolep I did when I contributed to influxdb/influxdb.com. Do I need to sign a different one or is it the same CLA?

otoolep · 2015-11-20T14:21:15Z

As long as you signed https://influxdb.com/community/cla.html we're good.

Fix parsing issues in models.scanKey

e-dard added 4 commits November 20, 2015 01:26

Improve robustness of tests, highlighting influxdata#3070 and influxd…

98b8d7d

…ata#4770

Add extra test coverage; fix test typo

c6536b6

Fixes influxdata#3070 and influxdata#4770

bbbb751

Update CHANGELOG

e995fba

e-dard force-pushed the points-refactor branch from c113d99 to e995fba Compare November 20, 2015 01:39

Improve scanning performance

7ce385f

jwilder reviewed Nov 20, 2015
View reviewed changes

Add extra test cases.

4b5eb30

Fix golint warnings

04eea17

e-dard force-pushed the points-refactor branch from c7f113e to 04eea17 Compare November 20, 2015 12:19

otoolep reviewed Nov 20, 2015
View reviewed changes

Update CHANGELOG

9e506a2

jwilder added a commit that referenced this pull request Nov 20, 2015

Merge pull request #4841 from e-dard/points-refactor

d2c94fc

Fix parsing issues in models.scanKey

jwilder merged commit d2c94fc into influxdata:master Nov 20, 2015

e-dard deleted the points-refactor branch November 22, 2015 13:31

e-dard mentioned this pull request Dec 2, 2015

Incorrect error when inserting cpu, #4770

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing issues in models.scanKey #4841

Fix parsing issues in models.scanKey #4841

e-dard commented Nov 19, 2015

e-dard commented Nov 19, 2015

jwilder commented Nov 19, 2015

e-dard commented Nov 19, 2015

e-dard commented Nov 20, 2015

jwilder Nov 20, 2015

e-dard Nov 20, 2015

jwilder commented Nov 20, 2015

e-dard commented Nov 20, 2015

otoolep Nov 20, 2015

e-dard Nov 20, 2015

otoolep Nov 20, 2015

e-dard Nov 20, 2015

otoolep Nov 20, 2015

otoolep commented Nov 20, 2015

e-dard commented Nov 20, 2015

otoolep commented Nov 20, 2015

Fix parsing issues in models.scanKey #4841

Fix parsing issues in models.scanKey #4841

Conversation

e-dard commented Nov 19, 2015

e-dard commented Nov 19, 2015

jwilder commented Nov 19, 2015

e-dard commented Nov 19, 2015

e-dard commented Nov 20, 2015

jwilder Nov 20, 2015

Choose a reason for hiding this comment

e-dard Nov 20, 2015

Choose a reason for hiding this comment

jwilder commented Nov 20, 2015

e-dard commented Nov 20, 2015

otoolep Nov 20, 2015

Choose a reason for hiding this comment

e-dard Nov 20, 2015

Choose a reason for hiding this comment

otoolep Nov 20, 2015

Choose a reason for hiding this comment

e-dard Nov 20, 2015

Choose a reason for hiding this comment

otoolep Nov 20, 2015

Choose a reason for hiding this comment

otoolep commented Nov 20, 2015

e-dard commented Nov 20, 2015

otoolep commented Nov 20, 2015