Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise MemoryParserState #235

Open
5 tasks
TotalVerb opened this issue Jan 29, 2018 · 4 comments
Open
5 tasks

Revise MemoryParserState #235

TotalVerb opened this issue Jan 29, 2018 · 4 comments

Comments

@TotalVerb
Copy link
Collaborator

cc @ScottPJones, @stevengj

MemoryParserState needs a big revision as it has aged over the years. I merged #234 in the meantime as the performance has not been benchmarked in a while and it is probably not worth the extra effort to continue to maintain the current code. I expect to have some time this week to take on this project, and so far have the following ideas in mind, alongside the Julia v0.6 and v0.7 modernization:

  • Instead of using Vector{UInt8}, String can be used directly in many situations.
  • Instead of totally falling back on the slow parsing if any escapes are found, we should copy as much as we can directly, and then append the escape, and continue. This should help with strings with few escapes and long strings of memcpy-able memory.
  • Audit and profile to see what is actually in need of improvement (this has not been done in a while).
  • Hopefully simplify the code, overall.
  • Hopefully also address some of the issues like JSON.parse fails for cweb/url-testing/master/urls.json #232.
@ScottPJones
Copy link
Contributor

As far as the fallback on slow parsing, I think instead, the approach of calculating the correct length first (which can be very fast, simply scan for \, sum up how many bytes less will be needed in the output, and then return that.
If that value is 0, (i.e. no escape sequences), then you can simply do a memcpy. You might also return a tuple, with a count of the number of escape sequences, the number of bytes fewer to output, and the offset of the first escape sequence found. That will allow optimizing for the case where an escape sequence is near the beginning or near the end (because once the last escape sequence is processed, then the rest of the string can be copied with memcpy).
If you'd like, I could write a PR for that performance enhancement, if you want to take care of the changes such as using String, profiling, etc.

@stevengj
Copy link
Member

stevengj commented Feb 1, 2018

See #236, which addresses several of these points, although further optimization is of course possible.

@samoconnor
Copy link

samoconnor commented Feb 4, 2018

FYI, I've been experimenting with a lazy JSON parser: https://github.com/samoconnor/jsonhack

[Update: a simpler and lazyer variation LazyJSON.jl improves speed of access near the end of a large file. See benchmark output below. This variant uses iterators that work directly with the JSON bytes and represents JSON values using just the byte index of the start of the value. This approach uses 0.001% of the memory used by JSON.jl ]

test/benchmark.jl uses a 1MB AWS API definition JSON file to compare performance vs JSON.jl. When accessing a value close to the start of the file the lazy parser is ~2000 times faster than JSON.jl, for 2 values near then end of the file, the lazyer parser is ~6 times faster and allocates 0.07MiB vs 913 MiB.

Access value close to start:
LazyJSON.jl:  0.000558 seconds (3.23 k allocations: 121.719 KiB)
JSON.jl:      1.025800 seconds (9.34 M allocations: 913.682 MiB, 12.53% gc time)


Access 2 values close to end:
LazyJSON.jl:  0.162446 seconds (2.66 k allocations: 71.250 KiB)
JSON.jl:      1.019581 seconds (9.34 M allocations: 913.682 MiB, 12.22% gc time)

test/runtests.jl uses https://github.com/nst/JSONTestSuite to check parser corner case behaviour.

@ScottPJones
Copy link
Contributor

I wasn't aware of the test suite, that's great. I found some bugs (which I've fixed in my parsing rewrite PR that I'm reading in), I'm sure that test suite would have caught them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants