Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Optimize UTF8/ASCII byte offset index #3439

Merged
merged 5 commits into from
Mar 11, 2023
Merged

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Mar 10, 2023

This PR improves the memory consumption and performance of our line/column based Locations to byte offsets. The PR also introduces two new zero-cost wrappers so that the functionality can be implemented as methods rather than free floating functions.

The first improvement is that the implementation no longer performs two passes over the string:

  1. To determine if its ASCII
  2. To build up the index

The implementation now assumes ASCII and falls back (and converts the intermediate state) to UTF8 when it sees the first non-ascii character while building the index.

The second optimization is to use u32 to store the offsets (has anyone ever tried running a 4GB+ source document with Python?)

The last optimization avoids the nested Vec<Vec<usize>> for the Utf8Index. I tried two different approaches

Line -> Char and Char -> Byte Index

The idea is to use two vectors instead of a nested vector:

  • The first vector stores the character offset at which a new line starts
  • The second vector stores the byte offsets for each character

You can then compute the [Location] offset by:

// retrieving the start character on that line and add the column (character offset)
let char_offset = lines[location.row() - 1] + location.column();
let byte_offset = char_to_byte_offsets[char_offset]

0b39da5

Lazy column computation

This implementation makes the assumption that we're only querying a few locations and that computing the column offset for a single line is cheap (no minified documents where everything is on a single line).

Based on this assumption, it is sufficient to only store the byte offsets for every line and then lazily compute the column offset.

Computing the index lazily has the added benefit of reducing the memory footprint.

7527612

Benchmark

❯ hyperfine --ignore-failure --warmup 10 \
        "./ruff_two ./crates/ruff/resources/test/cpython/ --no-cache" \
        "./ruff_main ./crates/ruff/resources/test/cpython/ --no-cache" \
        "./ruff_lazy ./crates/ruff/resources/test/cpython/ --no-cache"
Benchmark 1: ./ruff_two ./crates/ruff/resources/test/cpython/ --no-cache
  Time (mean ± σ):     186.3 ms ±   3.5 ms    [User: 3177.1 ms, System: 91.2 ms]
  Range (min … max):   180.5 ms … 192.1 ms    15 runs
 
  Warning: Ignoring non-zero exit code.
 
Benchmark 2: ./ruff_main ./crates/ruff/resources/test/cpython/ --no-cache
  Time (mean ± σ):     187.8 ms ±   3.5 ms    [User: 3215.8 ms, System: 99.4 ms]
  Range (min … max):   183.9 ms … 195.6 ms    15 runs
 
  Warning: Ignoring non-zero exit code.
 
Benchmark 3: ./ruff_lazy ./crates/ruff/resources/test/cpython/ --no-cache
  Time (mean ± σ):     184.7 ms ±   4.1 ms    [User: 3188.5 ms, System: 97.6 ms]
  Range (min … max):   179.5 ms … 195.8 ms    16 runs
 
  Warning: Ignoring non-zero exit code.
 
Summary
  './ruff_lazy ./crates/ruff/resources/test/cpython/ --no-cache' ran
    1.01 ± 0.03 times faster than './ruff_two ./crates/ruff/resources/test/cpython/ --no-cache'
    1.02 ± 0.03 times faster than './ruff_main ./crates/ruff/resources/test/cpython/ --no-cache'

Memory usage

I used /usr/bin/time to get the max resident memory when linting cpython

  • main: ~324MB
  • two: ~298MB
  • lazy: ~287MB

Verdict

This PR implements the lazy computation as it has a similar performance as storing all character positions but requires significantly less memory (only stores line mappings, not also mappings for every line)

@MichaReiser MichaReiser force-pushed the perf/byte-index branch 3 times, most recently from a01adac to 7527612 Compare March 10, 2023 16:20
@MichaReiser MichaReiser marked this pull request as ready for review March 10, 2023 16:30
Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are nice improvements!

crates/ruff_python_ast/src/source_code/locator.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/source_code/locator.rs Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

✅ ecosystem check detected no changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants