This repository has been archived by the owner on May 24, 2022. It is now read-only.

Use csv's zero-copy API #3

Closed
wants to merge 2 commits into from



@Dr-Emann Dr-Emann commented Feb 4, 2017

While our use is not zero-copy, it allows us to avoid some re-allocations by reusing existing vectors.

This achieves a larger speedup in Rust 1.15 because of rust-lang/rust#38182, which specializes Vec<T>::extend when T: Copy.

In Rust 1.14, extend is specialized to extend_from_slice if passed a
slice.
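
For readers who want to see the reuse pattern in isolation, here is a minimal sketch. It is not the code from this PR; the helper name `refill` is hypothetical. It only illustrates that `Vec::clear` keeps the existing allocation, so a later `extend_from_slice` that fits within the old capacity does not reallocate.

```rust
/// Copy `field` into `buf`, reusing whatever capacity `buf` already has.
/// `refill` is a hypothetical helper name, not part of scrubcsv or csv.
fn refill(buf: &mut Vec<u8>, field: &[u8]) {
    buf.clear();                  // length becomes 0, capacity is kept
    buf.extend_from_slice(field); // byte copy into the existing allocation
}
```

Calling `refill` once per CSV field on a long-lived `Vec<u8>` only allocates when a field is longer than anything the buffer has held before.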
@emk
Contributor

emk commented Feb 6, 2017

Great idea! Do you have any idea how much speedup this produces? It would be fine to just capture the performance numbers printed by scrubcsv before and after your change.

I actually have some ideas on how we can eliminate those remaining allocations, and improve our handling of interior quotes, but it will require a bit more code. Basically we could have a single Vec<u8> as a working buffer, and a second Vec<&[u8]> which keeps track of slices into it. Whenever we found a column, we'd append the relevant information to both vectors. This would require a fast copy but no memory allocation in the inner loop. We'd also need to track what's going on with quotes in each CSV cell.
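
A rough sketch of that two-vector idea, under stated assumptions: the type `RowBuffer` and its methods are hypothetical names, and the second vector stores byte ranges (`Vec<Range<usize>>`) instead of `Vec<&[u8]>`, since holding borrowed slices into a buffer that is still being appended to is awkward under Rust's borrow rules. Quote tracking is left out entirely.

```rust
use std::ops::Range;

/// One row's worth of field data, stored back to back in a single buffer.
/// Hypothetical type for illustration only.
struct RowBuffer {
    data: Vec<u8>,             // bytes of every field, concatenated
    fields: Vec<Range<usize>>, // where each field lives inside `data`
}

impl RowBuffer {
    fn new() -> Self {
        RowBuffer { data: Vec::new(), fields: Vec::new() }
    }

    /// Forget the previous row but keep both allocations.
    fn clear(&mut self) {
        self.data.clear();
        self.fields.clear();
    }

    /// Append one field: a byte copy plus one range push.
    fn push_field(&mut self, field: &[u8]) {
        let start = self.data.len();
        self.data.extend_from_slice(field);
        self.fields.push(start..self.data.len());
    }

    /// Borrow field `i` as a slice into the working buffer.
    fn field(&self, i: usize) -> &[u8] {
        &self.data[self.fields[i].clone()]
    }
}
```

Once both vectors have grown to fit the largest row seen so far, `clear` followed by a series of `push_field` calls performs only copies in the inner loop.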

This should probably double or maybe even triple the speed of scrubcsv, with any luck. But your patch is a great first step!

@Dr-Emann
Author

Dr-Emann commented Feb 6, 2017

On my box (Arch Linux in a VirtualBox VM, Rust 1.15), the speed averaged 75.51 MiB/s before the change and 90.75 MiB/s after it.
Each figure is the average of 3 runs.

@rossmeissl
Member

20% increase! Nice!

@Dr-Emann
Author

Dr-Emann commented Feb 6, 2017

I don't think switching to a single Vec<u8> would considerably affect performance.

This implementation only allocates:

  • For the first row, one Vec is allocated per column (at least 1KiB in size)
  • When a column whose index is less than the number of columns in the first row is longer than the current capacity for that column

Whereas a single Vec implementation allocates:

  • Piecemeal, up to the sum of the column lengths of the first row
  • For every row in which the sum of the lengths of all columns (up to the number of columns in the first row) exceeds that sum for every previous row

I would guess that a single Vec implementation would be within 5-10% of this implementation.
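
To make the allocation pattern of the per-column approach concrete, here is a sketch under assumptions: `ColumnBuffers`, `load_row` and `INITIAL_CAPACITY` are hypothetical names, not the PR's code. It counts how many column buffers are created or grown for each row; growth of the outer vector itself is not counted, and the sketch adds a buffer whenever a row is wider than any earlier row, which may differ from the actual implementation.

```rust
/// Starting size for each column buffer (the "at least 1KiB" figure above).
const INITIAL_CAPACITY: usize = 1024;

/// Per-column scratch buffers that are reused from one row to the next.
/// Hypothetical type for illustration only.
struct ColumnBuffers {
    cols: Vec<Vec<u8>>,
}

impl ColumnBuffers {
    fn new() -> Self {
        ColumnBuffers { cols: Vec::new() }
    }

    /// Copy one row's fields into the buffers. Returns how many column
    /// buffers were created or grown while loading this row.
    fn load_row(&mut self, fields: &[&[u8]]) -> usize {
        let mut allocations = 0;
        for (i, field) in fields.iter().enumerate() {
            if i == self.cols.len() {
                // First time this column index is seen: allocate its buffer.
                self.cols.push(Vec::with_capacity(INITIAL_CAPACITY));
                allocations += 1;
            }
            let col = &mut self.cols[i];
            let before = col.capacity();
            col.clear();                  // keeps the existing allocation
            col.extend_from_slice(field); // grows only if the field is too long
            if col.capacity() != before {
                allocations += 1;
            }
        }
        allocations
    }
}
```

With this sketch, a first row of short fields reports one allocation per column, and later rows report zero unless a field outgrows its column's buffer.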

@emk
Contributor

emk commented Jan 26, 2020

This was a brilliant idea, but it would need to be rethought for use with the stable csv API.

@emk emk closed this Jan 26, 2020
3 participants