Read whole leaf-level arrays at once when possible, rather than growing lists #295

adamreeve · 2022-08-08T09:52:41Z

This PR changes how array values are read to avoid growing lists dynamically, and instead create arrays directly from spans of the buffered logical values.

This is based on the work done by @philjdf in philjdf#4 and philjdf#5. The new benchmark has mostly been copied as-is, but the array reading changes have been adapted to work with the latest master code.

On my machine (AMD Ryzen 9 5900 X with 64 GB RAM running Fedora Linux 36 with dotnet SDK 6.0.107), I get the following benchmark results:

Test case	Time taken (s)	Time taken with server GC (s)
Parquet.Net	10.8	10.1
ParquetSharp 7.0	20.4	20.9
ParquetSharp 8.0 beta	12.5	10.9
ParquetSharp with this PR	4.5	2.6

The speed up from ParquetSharp 7.0 to 8.0 looks to be due to changes in Arrow rather than any of the changes we've made in ParquetSharp. I've also included benchmark results when enabling the dotnet server GC as this can improve performance with ParquetSharp significantly.

…lists

philjdf · 2022-08-08T23:32:51Z

Nice speed improvement, thanks Adam!

marcin-krystianc

Nice improvement. See my comment, maybe it is possible to make it even better?

csharp/LogicalColumnReader.cs

marcin-krystianc · 2022-08-11T10:45:40Z

I've experimented with some further optimisations, but I didn't see an actual improvement.
The most promising approach is https://github.com/G-Research/ParquetSharp/compare/master...marcin-krystianc:ParquetSharp:marcink-20220810-arraypool?expand=1&w=1 where I used ArrayPool to reduce allocations for temporary arrays. Unfortunately in benchmark it shows no actual improvement, I think it is mainly because temporary arrays are not used that often as most of the times there is a single chunk being created and returned.

adamreeve added 2 commits August 8, 2022 11:39

Add float array timeseries read benchmark

5f20ffc

Read whole leaf-level arrays at once when possible, rather than grow …

ab5799f

…lists

adamreeve requested a review from philjdf August 8, 2022 09:52

philjdf approved these changes Aug 8, 2022

View reviewed changes

marcin-krystianc reviewed Aug 9, 2022

View reviewed changes

csharp/LogicalColumnReader.cs Show resolved Hide resolved

adamreeve merged commit 1519bc0 into G-Research:master Aug 10, 2022

adamreeve deleted the array_read_perf branch August 10, 2022 08:25

adamreeve mentioned this pull request Aug 29, 2022

Support round-tripping nested values #301

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read whole leaf-level arrays at once when possible, rather than growing lists #295

Read whole leaf-level arrays at once when possible, rather than growing lists #295

adamreeve commented Aug 8, 2022

philjdf commented Aug 8, 2022

marcin-krystianc left a comment

marcin-krystianc commented Aug 11, 2022

Read whole leaf-level arrays at once when possible, rather than growing lists #295

Read whole leaf-level arrays at once when possible, rather than growing lists #295

Conversation

adamreeve commented Aug 8, 2022

philjdf commented Aug 8, 2022

marcin-krystianc left a comment

Choose a reason for hiding this comment

marcin-krystianc commented Aug 11, 2022