Use csv's zero-copy API #3

Dr-Emann · 2017-02-04T05:42:00Z

While our use is not zero-copy, it allows us to avoid some re-allocations by reusing existing vectors.

This achieves a larger speedup in Rust 1.15 because of rust-lang/rust#38182, which specializes Vec<T>::extend for T: Copy.

In Rust 1.14, extend is specialized to extend_from_slice if passed a slice.
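The buffer reuse described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `fill_row` is a hypothetical helper, and the `fields` iterator stands in for whatever the csv reader yields. The 1024-byte initial capacity is taken from the later discussion in this thread.

```rust
/// Copy the fields of one record into `row`, reusing the Vec<u8> buffers
/// left over from the previous record instead of allocating new ones.
/// Returns the number of fields written. (Hypothetical helper.)
fn fill_row<'a, I>(row: &mut Vec<Vec<u8>>, fields: I) -> usize
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut count = 0;
    for field in fields {
        if count == row.len() {
            // First time this column index is seen: allocate its buffer once.
            row.push(Vec::with_capacity(1024));
        }
        let buf = &mut row[count];
        buf.clear(); // drops the old bytes but keeps the allocation
        buf.extend(field); // Extend<&u8> for Vec<u8>, available because u8: Copy
        count += 1;
    }
    count
}
```

Buffers past the returned count still hold data from an earlier, longer record, so a caller would only read `row[..count]`.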

emk · 2017-02-06T14:46:25Z

Great idea! Do you have any idea how much speedup this produces? It would be fine to just capture the performance numbers printed by scrubcsv before and after your change.

I actually have some ideas on how we can eliminate those remaining allocations and improve our handling of interior quotes, but it will require a bit more code. Basically, we could have a single Vec<u8> as a working buffer and a second Vec<&[u8]> that keeps track of slices into it. Whenever we find a column, we'd append the relevant information to both vectors. This would require a fast copy but no memory allocation in the inner loop. We'd also need to track what's going on with quotes in each CSV cell.
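A rough sketch of that two-vector layout, under some assumptions: `RowBuf` and its methods are hypothetical names, and the second vector stores byte ranges rather than `&[u8]` slices, since holding borrowed slices into a `Vec<u8>` that is still being appended to would not pass the borrow checker. Quote tracking is left out entirely.

```rust
use std::ops::Range;

/// Hypothetical row buffer: all field bytes live in one Vec<u8>, and a
/// second vector records where each field starts and ends.
#[derive(Default)]
struct RowBuf {
    bytes: Vec<u8>,
    spans: Vec<Range<usize>>,
}

impl RowBuf {
    /// Forget the current row but keep both allocations for the next one.
    fn clear(&mut self) {
        self.bytes.clear();
        self.spans.clear();
    }

    /// Append one field: a byte copy plus one span entry.
    fn push_field(&mut self, field: &[u8]) {
        let start = self.bytes.len();
        self.bytes.extend_from_slice(field);
        self.spans.push(start..self.bytes.len());
    }

    /// Borrow field `i` of the current row, if it exists.
    fn field(&self, i: usize) -> Option<&[u8]> {
        self.spans.get(i).map(|r| &self.bytes[r.clone()])
    }

    /// Number of fields in the current row.
    fn len(&self) -> usize {
        self.spans.len()
    }
}
```

Once both vectors have grown to fit the largest row seen so far, `push_field` only copies bytes and neither vector needs to reallocate.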

This should probably double or maybe even triple the speed of scrubcsv, with any luck. But your patch is a great first step!

Dr-Emann · 2017-02-06T15:54:19Z

On my box (Arch Linux in a VirtualBox VM, Rust 1.15), the speed averaged 75.51 MiB/s before the change and 90.75 MiB/s after it.
Each figure is the average of 3 runs.

rossmeissl · 2017-02-06T15:55:27Z

20% increase! Nice!
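For reference, the 20% figure follows from the two averages reported above. A small sketch of the arithmetic (`speedup_percent` is a hypothetical helper name):

```rust
/// Relative throughput gain in percent, given before/after rates
/// in the same unit (here MiB/s).
fn speedup_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With `before = 75.51` and `after = 90.75`, this comes to roughly 20.2%.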

Dr-Emann · 2017-02-06T18:24:07Z

I don't think switching to a single Vec<u8> would considerably affect performance.

This implementation only allocates:

For the first row, one Vec per column (each at least 1 KiB in size)
Whenever a column whose index is less than the number of columns in the first row holds a value longer than that column's current capacity

Whereas a single-Vec implementation allocates:

Piecemeal, up to the sum of the column lengths of the first row
For every row in which the sum of the lengths of all columns (up to the length of the first row) exceeds that sum for every previous row

I would guess that a single Vec implementation would be within 5-10% of this implementation.
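The comparison above can be explored with a small model that counts how often a buffer's capacity changes under each scheme. This is only a sketch: both function names are hypothetical, it counts capacity growth rather than measuring time, and it ignores the initial `with_capacity` allocation for each column.

```rust
/// Growth count when every column has its own reusable Vec<u8>.
fn per_column_growths(rows: &[Vec<&[u8]>], initial_capacity: usize) -> usize {
    let mut cols: Vec<Vec<u8>> = Vec::new();
    let mut growths = 0;
    for row in rows {
        for (i, field) in row.iter().enumerate() {
            if i == cols.len() {
                // New column index: one up-front allocation, not counted below.
                cols.push(Vec::with_capacity(initial_capacity));
            }
            let buf = &mut cols[i];
            let before = buf.capacity();
            buf.clear();
            buf.extend_from_slice(field);
            if buf.capacity() != before {
                growths += 1; // the field did not fit in the existing buffer
            }
        }
    }
    growths
}

/// Growth count when all columns of a row share one reusable Vec<u8>.
fn single_buffer_growths(rows: &[Vec<&[u8]>]) -> usize {
    let mut buf: Vec<u8> = Vec::new();
    let mut growths = 0;
    for row in rows {
        buf.clear();
        for field in row {
            let before = buf.capacity();
            buf.extend_from_slice(field);
            if buf.capacity() != before {
                growths += 1; // the row so far outgrew the shared buffer
            }
        }
    }
    growths
}
```

In this model, both counts stop increasing once the buffers have grown to fit the largest data seen so far, which is consistent with the guess that the two approaches would perform similarly on long inputs.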

emk · 2020-01-26T13:47:16Z

This was a brilliant idea, but it would need to be rethought for use with the stable csv API.

Dr-Emann added 2 commits February 4, 2017 00:19

Make use of csv's zero-copy API

a12e81c

While our use is not zero-copy, it allows us to avoid some re-allocations by reusing existing vectors.

Use Vec::extend over Vec::extend_from_slice

8d6a6a4

In rust 1.14, extend is specialized to extend_from_slice if passed a slice.

emk closed this Jan 26, 2020
