
Fix parsing issues in models.scanKey #4841

Merged: 8 commits, Nov 20, 2015

Conversation

@e-dard (Contributor) commented Nov 19, 2015

This PR refactors models.scanKey, which is used to parse measurements and tags.

Initially I planned to simply fix #3070 and #4770, but I felt that the scanKey function was quite complicated and had accumulated extra conditionals and checks over time, some of which were redundant or unreachable. It seemed worth refactoring this function into something easier to understand.
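For context, a minimal sketch of the public entry point that exercises scanKey; the import path reflects the repository at the time of this PR, and the sample line is illustrative rather than taken from the PR itself:

```go
package main

import (
	"fmt"

	"github.com/influxdb/influxdb/models"
)

func main() {
	// The "key" that scanKey handles is everything before the first unescaped
	// space: the measurement name plus the (comma-separated) tag set.
	line := "cpu,host=serverA,region=us-west value=0.64 1434055562000000000"

	pts, err := models.ParsePointsString(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("parsed", len(pts), "point(s)")
}
```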

I've run benchmarks before and after, and performance is broadly in line with the previous implementation, if not a little faster in some benchmarks.

(Updated)

name                         old time/op    new time/op    delta
Marshal-4                      6.08µs ± 1%    6.10µs ± 1%     ~     (p=0.132 n=6+6)
ParsePointNoTags-4              838ns ± 0%     828ns ± 1%   -1.23%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4       1.16µs ± 2%    1.14µs ± 0%   -1.95%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4       1.52µs ± 1%    1.42µs ± 0%   -6.22%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4      2.28µs ± 1%    2.07µs ± 0%   -9.45%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4     1.42µs ± 0%    1.41µs ± 1%   -1.18%  (p=0.010 n=4+6)
ParsePointsTagsUnSorted5-4     1.55µs ± 1%    1.44µs ± 0%   -7.24%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4    3.87µs ± 1%    3.59µs ± 1%   -7.22%  (p=0.002 n=6+6)

name                         old speed      new speed      delta
ParsePointNoTags-4           27.4MB/s ± 0%  27.8MB/s ± 1%   +1.24%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4     43.8MB/s ± 2%  44.6MB/s ± 0%   +1.98%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4     54.7MB/s ± 1%  58.3MB/s ± 0%   +6.62%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4    62.6MB/s ± 1%  69.1MB/s ± 0%  +10.44%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4   35.8MB/s ± 0%  36.2MB/s ± 1%   +1.29%  (p=0.004 n=5+6)
ParsePointsTagsUnSorted5-4   53.6MB/s ± 1%  57.8MB/s ± 0%   +7.79%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4  36.9MB/s ± 1%  39.8MB/s ± 1%   +7.77%  (p=0.002 n=6+6)
  • CHANGELOG.md updated
  • Rebased/mergeable
  • Tests pass
  • Sign CLA (if not already signed)

@e-dard (Contributor Author) commented Nov 19, 2015

I also plan on fixing the lint issues in the models package, but will wait until the current changes have been reviewed.

@jwilder (Contributor) commented Nov 19, 2015

This parsing code has been pretty highly tuned because it's used in a lot of places (HTTP/UDP inputs, hinted handoff, WAL, cluster writes, exports, etc.). The performance decrease is fairly significant: it reduces throughput by 20-45% in common cases.

name                         old time/op    new time/op    delta
Marshal-4                      6.11µs ± 0%    6.14µs ± 0%   +0.62%
ParsePointNoTags-4              844ns ± 0%     838ns ± 0%   -0.71%
ParsePointsTagsSorted2-4       1.45µs ± 0%    1.48µs ± 0%   +2.00%
ParsePointsTagsSorted5-4       1.70µs ± 0%    2.07µs ± 0%  +21.76%
ParsePointsTagsSorted10-4      2.29µs ± 0%    3.32µs ± 0%  +44.72%
ParsePointsTagsUnSorted2-4     1.47µs ± 0%    1.76µs ± 0%  +20.26%
ParsePointsTagsUnSorted5-4     1.55µs ± 0%    2.17µs ± 0%  +39.54%
ParsePointsTagsUnSorted10-4    3.93µs ± 0%    4.99µs ± 0%  +26.91%

name                         old speed      new speed      delta
ParsePointNoTags-4           27.2MB/s ± 0%  27.4MB/s ± 0%   +0.62%
ParsePointsTagsSorted2-4     35.1MB/s ± 0%  34.4MB/s ± 0%   -1.91%
ParsePointsTagsSorted5-4     48.8MB/s ± 0%  40.1MB/s ± 0%  -17.85%
ParsePointsTagsSorted10-4    62.4MB/s ± 0%  43.1MB/s ± 0%  -30.89%
ParsePointsTagsUnSorted2-4   34.8MB/s ± 0%  28.9MB/s ± 0%  -16.90%
ParsePointsTagsUnSorted5-4   53.4MB/s ± 0%  38.3MB/s ± 0%  -28.34%
ParsePointsTagsUnSorted10-4  36.4MB/s ± 0%  28.7MB/s ± 0%  -21.18%

The UnSorted cases would affect ingestion endpoints that parse line protocol, such as HTTP/UDP. The Sorted cases affect all the other parts of the system that use the line protocol as a serialization format after points have already been parsed by an input plugin.

I have not profiled this change, but I suspect the performance decrease is due to the closures and to calling a func per char, whereas before it would just increment an integer. The code doesn't seem much clearer to me either, unfortunately, but I see where you were going with it. I'd suggest reworking this to have scanKey internally call a scanMeasurement and a scanTags; scanTags could call a scanTagKey and scanTagValue, etc. if needed. Each of those would have a slice passed in, similar to how the other funcs in the package work (at times that is the pattern already used).
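A rough sketch of the suggested shape, with illustrative names and signatures rather than the actual influxdb code (escaping and error details are omitted for brevity):

```go
package sketch

import "fmt"

// scanMeasurement advances i past the measurement name, stopping at the
// first ',' (tags follow) or ' ' (fields follow).
func scanMeasurement(buf []byte, i int) (int, error) {
	start := i
	for i < len(buf) && buf[i] != ',' && buf[i] != ' ' {
		i++
	}
	if i == start {
		return i, fmt.Errorf("missing measurement")
	}
	return i, nil
}

// scanTagPair advances i past a single key=value pair.
func scanTagPair(buf []byte, i int) (int, error) {
	start := i
	for i < len(buf) && buf[i] != ',' && buf[i] != ' ' {
		i++
	}
	if i == start {
		return i, fmt.Errorf("missing tag pair")
	}
	return i, nil
}

// scanTags consumes a comma-separated list of tag pairs.
func scanTags(buf []byte, i int) (int, error) {
	for i < len(buf) && buf[i] == ',' {
		var err error
		if i, err = scanTagPair(buf, i+1); err != nil {
			return i, err
		}
	}
	return i, nil
}

// scanKey ties the pieces together: the measurement first, then any tags.
// Each helper takes the buffer plus a start index and returns the index it
// stopped at, so the hot path is a plain loop with no closures or
// per-character function calls.
func scanKey(buf []byte, i int) (int, []byte, error) {
	start := i
	i, err := scanMeasurement(buf, i)
	if err != nil {
		return i, nil, err
	}
	if i, err = scanTags(buf, i); err != nil {
		return i, nil, err
	}
	return i, buf[start:i], nil
}
```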

I'm 👎 on this current PR, but I think it could be reworked to address these concerns and fix the original parser bugs.

@e-dard (Contributor Author) commented Nov 19, 2015

Hi @jwilder, your points seem completely reasonable to me.

This is my first contribution to the project, and as you can imagine I have a shallow understanding of the entire codebase and of the impact of scanKey on it, so your feedback is useful.

I will profile and see what I can do to get the benchmarks up to their previous levels 👍

If that's unsuccessful, then I have another branch waiting that will simply fix the outstanding issues.

@e-dard (Contributor Author) commented Nov 20, 2015

@jwilder,

I've modified the approach to bring performance in line with expectations, while keeping the different parts of the scanning process separated into different functions.

And of course the bugs should still be fixed 😄

Results of six runs of the following benchmark command, for master and for this PR: `$ go test -benchtime=3s -bench .`

name                         old time/op    new time/op    delta
Marshal-4                      6.08µs ± 1%    6.10µs ± 1%     ~     (p=0.132 n=6+6)
ParsePointNoTags-4              838ns ± 0%     828ns ± 1%   -1.23%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4       1.16µs ± 2%    1.14µs ± 0%   -1.95%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4       1.52µs ± 1%    1.42µs ± 0%   -6.22%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4      2.28µs ± 1%    2.07µs ± 0%   -9.45%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4     1.42µs ± 0%    1.41µs ± 1%   -1.18%  (p=0.010 n=4+6)
ParsePointsTagsUnSorted5-4     1.55µs ± 1%    1.44µs ± 0%   -7.24%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4    3.87µs ± 1%    3.59µs ± 1%   -7.22%  (p=0.002 n=6+6)

name                         old speed      new speed      delta
ParsePointNoTags-4           27.4MB/s ± 0%  27.8MB/s ± 1%   +1.24%  (p=0.004 n=5+6)
ParsePointsTagsSorted2-4     43.8MB/s ± 2%  44.6MB/s ± 0%   +1.98%  (p=0.010 n=6+4)
ParsePointsTagsSorted5-4     54.7MB/s ± 1%  58.3MB/s ± 0%   +6.62%  (p=0.004 n=6+5)
ParsePointsTagsSorted10-4    62.6MB/s ± 1%  69.1MB/s ± 0%  +10.44%  (p=0.004 n=6+5)
ParsePointsTagsUnSorted2-4   35.8MB/s ± 0%  36.2MB/s ± 1%   +1.29%  (p=0.004 n=5+6)
ParsePointsTagsUnSorted5-4   53.6MB/s ± 1%  57.8MB/s ± 0%   +7.79%  (p=0.002 n=6+6)
ParsePointsTagsUnSorted10-4  36.9MB/s ± 1%  39.8MB/s ± 1%   +7.77%  (p=0.002 n=6+6)

So performance is now on a par with, or slightly better than, before.
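For reference, the benchmarks being compared have roughly this shape. The benchmark name comes from the tables above, but the input line, the `b.SetBytes` call, and the import path are assumptions rather than the project's actual benchmark code; `b.SetBytes` is what turns ns/op into the MB/s columns.

```go
package models_test

import (
	"testing"

	"github.com/influxdb/influxdb/models"
)

// A sketch of a ParsePoints benchmark with five sorted tags. The sample
// line is illustrative; SetBytes reports throughput in MB/s alongside ns/op.
func BenchmarkParsePointsTagsSorted5(b *testing.B) {
	line := []byte("cpu,env=prod,host=serverA,rack=r1,region=us-west,zone=a value=1 1434055562000000000")
	b.SetBytes(int64(len(line)))
	for i := 0; i < b.N; i++ {
		if _, err := models.ParsePoints(line); err != nil {
			b.Fatal(err)
		}
	}
}
```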

Thoughts?

t.Errorf(`ParsePoints("%s") mismatch. got nil, exp error`, "cpu, value=1")
}

_, err = models.ParsePointsString("cpu,,, value=1")

A Contributor commented on the test diff above:

This test case appears to be missing. Assuming this is covered by the `cpu,` case?

The Contributor Author replied:
Yes, it is covered by the `cpu,` case, but I didn't intend to leave it out. I've added it back (you can never have enough test cases 😆).

I gave them all the once-over and added a couple more.
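A condensed sketch of how the malformed-key cases discussed in this thread could be grouped in a table-driven test; the inputs and the error message mirror the diff context above, but this is illustrative rather than the PR's actual test code:

```go
package models_test

import (
	"testing"

	"github.com/influxdb/influxdb/models"
)

func TestParsePoints_MalformedKeys(t *testing.T) {
	// Both inputs come from the review thread above; each should be rejected.
	for _, in := range []string{"cpu, value=1", "cpu,,, value=1"} {
		if _, err := models.ParsePointsString(in); err == nil {
			t.Errorf(`ParsePoints("%s") mismatch. got nil, exp error`, in)
		}
	}
}
```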

@jwilder (Contributor) commented Nov 20, 2015

@e-dard Nice work! Not only are the bugs fixed, but the code is simpler and faster.

One question, but this looks great. 👍

@e-dard (Contributor Author) commented Nov 20, 2015

@jwilder, a couple of things:

  1. I will add a commit that lints the package.
  2. I've found other places in the file where there are potential escape-sequence bugs, along the same lines as the ones I've fixed. Obviously the test suite is passing, so it's not clear whether they can actually be triggered. Example below:

Consider scanTo, for example: if it's given an input containing a literal backslash followed by an escaped stop character, it will not return the expected token (at least as I understand the purpose of scanTo).

Playground

I think the file could do with a review of all the functions that incorrectly skip over buf when they see a literal backslash. Maybe I should do this in another issue/PR? What do you think?
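Since the Playground link above isn't reproduced here, the failure mode can be sketched with a simplified scanner that unconditionally skips the byte after every backslash; this is an illustration of the class of bug being described, not the actual scanTo implementation:

```go
package main

import "fmt"

// naiveScanTo is a stand-in for the kind of scanner being discussed: on a
// backslash it skips the next byte, then stops at the first byte it still
// considers an "unescaped" stop character.
func naiveScanTo(buf []byte, i int, stop byte) (int, []byte) {
	start := i
	for i < len(buf) {
		if buf[i] == '\\' {
			i += 2 // assume the next byte is escaped, whatever came before
			continue
		}
		if buf[i] == stop {
			break
		}
		i++
	}
	return i, buf[start:i]
}

func main() {
	// A literal backslash followed by an escaped comma. Under the reading
	// described above, the comma is data, so the whole slice should come
	// back as one token...
	in := []byte(`dir\\,name rest`)
	_, tok := naiveScanTo(in, 0, ',')
	// ...but the scanner consumes both backslashes as a single escape pair,
	// treats the comma as a delimiter, and returns only the bytes before it.
	fmt.Printf("%q\n", tok)
}
```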

@@ -26,6 +26,7 @@ There are breaking changes in this release:
- Scripts are now located in `/usr/lib/influxdb/scripts` (previously `/opt/influxdb`)

### Features
- [#4841](https://github.com/influxdb/influxdb/pull/4841): Improve parsing of measurements and tags
A Contributor commented on the CHANGELOG diff above:
Can we be a little bit more specific here? The user-visible improvement would be less CPU consumption, is that right?

The Contributor Author replied:

Sorry, yes. I wrote that before I made the performance improvements. I've not added to the CHANGELOG before; how about:

Improve point parsing speed

Also, I just realised I put it under the features section when it started off as a bug fix. Should I move it or leave it where it is?

The Contributor replied:

I don't have a strong opinion; it's neither one nor the other.

The Contributor Author replied:

OK, I'll leave it where it is. Are you happy with:

Improve point parsing speed

Any better suggestions?

The Contributor replied:

Fine with me -- thanks. You can mention you linted the code as well.

@otoolep (Contributor) commented Nov 20, 2015

@e-dard - have you signed the CLA?

@e-dard (Contributor Author) commented Nov 20, 2015

@otoolep I did when I contributed to influxdb/influxdb.com. Do I need to sign a different one or is it the same CLA?

@otoolep (Contributor) commented Nov 20, 2015

As long as you signed https://influxdb.com/community/cla.html, we're good.

jwilder added a commit that referenced this pull request Nov 20, 2015
Fix parsing issues in models.scanKey
@jwilder merged commit d2c94fc into influxdata:master Nov 20, 2015
@e-dard deleted the points-refactor branch on November 22, 2015 at 13:31