perf: Optimize UTF8/ASCII byte offset index #3439
Merged
This PR improves the memory consumption and performance of our line/column-based `Location` to byte offset conversion. It also introduces two new zero-cost wrappers so that the functionality can be implemented as methods rather than free-floating functions.

The first improvement is that the implementation no longer performs two passes over the string: it now assumes ASCII and falls back to UTF-8 (converting the intermediate state) when it encounters the first non-ASCII character while building the index.
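The single-pass fallback can be sketched as follows. This is a minimal illustration, not Ruff's actual implementation; the `LineIndex`/`IndexKind` names are hypothetical:

```rust
// Sketch: assume ASCII and downgrade to the UTF-8 variant the moment a
// non-ASCII byte shows up, without restarting the scan.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IndexKind {
    Ascii,
    Utf8,
}

struct LineIndex {
    kind: IndexKind,
    // Byte offset where each line starts; `u32` halves the footprint vs `usize`.
    line_starts: Vec<u32>,
}

fn build_index(text: &str) -> LineIndex {
    let mut line_starts = vec![0u32];
    let mut kind = IndexKind::Ascii;
    for (offset, byte) in text.bytes().enumerate() {
        if !byte.is_ascii() {
            // First non-ASCII byte: switch strategies mid-scan.
            kind = IndexKind::Utf8;
        }
        if byte == b'\n' {
            line_starts.push(offset as u32 + 1);
        }
    }
    LineIndex { kind, line_starts }
}
```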
The second optimization is to use `u32` to store the offsets (has anyone ever tried running a 4 GB+ source document through Python?).

The last optimization avoids the nested `Vec<Vec<usize>>` for the `Utf8Index`. I tried two different approaches:

Line -> Char and Char -> Byte Index
The idea is to use two vectors instead of a nested vector: one mapping each line to the index of its first character, and one mapping each character to its byte offset. You can then compute the `Location`'s byte offset with two lookups (commit 0b39da5).
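A sketch of this two-vector layout; the field and function names (`line_to_char`, `char_to_byte`, `build_utf8_index`) are illustrative assumptions, not Ruff's actual API:

```rust
// Two flat vectors replace the nested Vec<Vec<usize>>:
// `line_to_char` maps a line to the index of its first character,
// `char_to_byte` maps a character index to its byte offset.
struct Utf8Index {
    line_to_char: Vec<u32>,
    char_to_byte: Vec<u32>,
}

fn build_utf8_index(text: &str) -> Utf8Index {
    let mut line_to_char = vec![0u32];
    let mut char_to_byte = Vec::new();
    for (char_idx, (byte_offset, c)) in text.char_indices().enumerate() {
        char_to_byte.push(byte_offset as u32);
        if c == '\n' {
            line_to_char.push(char_idx as u32 + 1);
        }
    }
    Utf8Index { line_to_char, char_to_byte }
}

impl Utf8Index {
    /// Byte offset of `column` (counted in characters) on zero-based `row`:
    /// two plain array lookups instead of indexing into a nested vector.
    fn byte_offset(&self, row: usize, column: usize) -> u32 {
        let char_idx = self.line_to_char[row] as usize + column;
        self.char_to_byte[char_idx]
    }
}
```

Both vectors stay contiguous in memory, so a lookup is just two bounds-checked indexing operations.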
Lazy column computation
This implementation assumes that we only query a few locations and that computing the column offset for a single line is cheap (i.e., no minified documents where everything sits on a single line).
Under this assumption, it is sufficient to store only the byte offsets of the line starts and compute the column offset lazily.
Computing the index lazily has the added benefit of reducing the memory footprint.
(commit 7527612)
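A minimal sketch of the lazy variant, again with hypothetical names rather than Ruff's real types:

```rust
// Store only the byte offset of each line start; resolve the column by
// walking the characters of that single line when a location is queried.
struct LazyIndex {
    line_starts: Vec<u32>,
}

fn build_lazy_index(text: &str) -> LazyIndex {
    let mut line_starts = vec![0u32];
    for (offset, byte) in text.bytes().enumerate() {
        if byte == b'\n' {
            line_starts.push(offset as u32 + 1);
        }
    }
    LazyIndex { line_starts }
}

impl LazyIndex {
    /// Byte offset of `column` (counted in characters) on zero-based `row`.
    /// Only the queried line is scanned, so no per-character table is stored.
    fn byte_offset(&self, text: &str, row: usize, column: usize) -> u32 {
        let line_start = self.line_starts[row] as usize;
        let in_line: usize = text[line_start..]
            .chars()
            .take(column)
            .map(char::len_utf8)
            .sum();
        (line_start + in_line) as u32
    }
}
```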
Benchmark

Memory usage

I used `/usr/bin/time` to get the max resident memory when linting CPython:

- `main`: ~324 MB
- `two`: ~298 MB
- `lazy`: ~287 MB

Verdict
This PR implements the lazy computation, as it performs similarly to storing all character positions but requires significantly less memory (it only stores the line mappings, not an additional mapping for every character).