perf: Optimize UTF8/ASCII byte offset index #3439
Merged
This PR improves the memory consumption and performance of our line/column-based `Location` to byte offset conversion. It also introduces two new zero-cost wrappers so that the functionality can be implemented as methods rather than free-floating functions.

The first improvement is that the implementation no longer performs two passes over the string: it now assumes ASCII and falls back to UTF-8 (converting the intermediate state) when it encounters the first non-ASCII character while building the index.
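The single-pass fallback can be sketched as follows. This is a minimal illustration, not Ruff's actual implementation; the `LineIndex`/`IndexKind` names are hypothetical:

```rust
// Sketch: assume ASCII and downgrade to the UTF-8 variant the moment a
// non-ASCII byte shows up, without restarting the scan.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IndexKind {
    Ascii,
    Utf8,
}

struct LineIndex {
    kind: IndexKind,
    // Byte offset where each line starts; `u32` halves the footprint vs `usize`.
    line_starts: Vec<u32>,
}

fn build_index(text: &str) -> LineIndex {
    let mut line_starts = vec![0u32];
    let mut kind = IndexKind::Ascii;
    for (offset, byte) in text.bytes().enumerate() {
        if !byte.is_ascii() {
            // First non-ASCII byte: switch strategies mid-scan.
            kind = IndexKind::Utf8;
        }
        if byte == b'\n' {
            line_starts.push(offset as u32 + 1);
        }
    }
    LineIndex { kind, line_starts }
}
```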
The second optimization is to use `u32` to store the offsets (has anyone ever tried running a 4 GB+ source document through Python?).

The last optimization avoids the nested `Vec<Vec<usize>>` for the `Utf8Index`. I tried two different approaches:

Line -> Char and Char -> Byte Index
The idea is to use two vectors instead of a nested vector: one mapping each line to the index of its first character, and one mapping each character to its byte offset. You can then compute the `Location`'s byte offset with two lookups (commit 0b39da5).
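A sketch of this two-vector layout; the field and function names (`line_to_char`, `char_to_byte`, `build_utf8_index`) are illustrative assumptions, not Ruff's actual API:

```rust
// Two flat vectors replace the nested Vec<Vec<usize>>:
// `line_to_char` maps a line to the index of its first character,
// `char_to_byte` maps a character index to its byte offset.
struct Utf8Index {
    line_to_char: Vec<u32>,
    char_to_byte: Vec<u32>,
}

fn build_utf8_index(text: &str) -> Utf8Index {
    let mut line_to_char = vec![0u32];
    let mut char_to_byte = Vec::new();
    for (char_idx, (byte_offset, c)) in text.char_indices().enumerate() {
        char_to_byte.push(byte_offset as u32);
        if c == '\n' {
            line_to_char.push(char_idx as u32 + 1);
        }
    }
    Utf8Index { line_to_char, char_to_byte }
}

impl Utf8Index {
    /// Byte offset of `column` (counted in characters) on zero-based `row`:
    /// two plain array lookups instead of indexing into a nested vector.
    fn byte_offset(&self, row: usize, column: usize) -> u32 {
        let char_idx = self.line_to_char[row] as usize + column;
        self.char_to_byte[char_idx]
    }
}
```

Both vectors stay contiguous in memory, so a lookup is just two bounds-checked indexing operations.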
Lazy column computation
This implementation assumes that we only query a few locations and that computing the column offset for a single line is cheap (i.e., no minified documents where everything sits on a single line).
Under this assumption, it is sufficient to store only the byte offsets of the line starts and compute the column offset lazily.
Computing the index lazily has the added benefit of reducing the memory footprint.
(commit 7527612)
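A minimal sketch of the lazy variant, again with hypothetical names rather than Ruff's real types:

```rust
// Store only the byte offset of each line start; resolve the column by
// walking the characters of that single line when a location is queried.
struct LazyIndex {
    line_starts: Vec<u32>,
}

fn build_lazy_index(text: &str) -> LazyIndex {
    let mut line_starts = vec![0u32];
    for (offset, byte) in text.bytes().enumerate() {
        if byte == b'\n' {
            line_starts.push(offset as u32 + 1);
        }
    }
    LazyIndex { line_starts }
}

impl LazyIndex {
    /// Byte offset of `column` (counted in characters) on zero-based `row`.
    /// Only the queried line is scanned, so no per-character table is stored.
    fn byte_offset(&self, text: &str, row: usize, column: usize) -> u32 {
        let line_start = self.line_starts[row] as usize;
        let in_line: usize = text[line_start..]
            .chars()
            .take(column)
            .map(char::len_utf8)
            .sum();
        (line_start + in_line) as u32
    }
}
```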
Benchmark

Memory usage

I used `/usr/bin/time` to get the max resident memory when linting CPython:

- `main`: ~324 MB
- `two`: ~298 MB
- `lazy`: ~287 MB

Verdict
This PR implements the lazy computation, as it performs similarly to storing all character positions but requires significantly less memory (it only stores the line mappings, not an additional mapping for every character).