Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read whole leaf-level arrays at once when possible, rather than growing lists #295

Merged
merged 2 commits into from
Aug 10, 2022

Conversation

adamreeve
Copy link
Contributor

This PR changes how array values are read to avoid growing lists dynamically, and instead create arrays directly from spans of the buffered logical values.

This is based on the work done by @philjdf in philjdf#4 and philjdf#5. The new benchmark has mostly been copied as-is, but the array reading changes have been adapted to work with the latest master code.

On my machine (AMD Ryzen 9 5900 X with 64 GB RAM running Fedora Linux 36 with dotnet SDK 6.0.107), I get the following benchmark results:

Test case Time taken (s) Time taken with server GC (s)
Parquet.Net 10.8 10.1
ParquetSharp 7.0 20.4 20.9
ParquetSharp 8.0 beta 12.5 10.9
ParquetSharp with this PR 4.5 2.6

The speed up from ParquetSharp 7.0 to 8.0 looks to be due to changes in Arrow rather than any of the changes we've made in ParquetSharp. I've also included benchmark results when enabling the dotnet server GC as this can improve performance with ParquetSharp significantly.

@adamreeve adamreeve requested a review from philjdf August 8, 2022 09:52
@philjdf
Copy link
Contributor

philjdf commented Aug 8, 2022

Nice speed improvement, thanks Adam!

Copy link
Contributor

@marcin-krystianc marcin-krystianc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement. See my comment, maybe it is possible to make it even better?

csharp/LogicalColumnReader.cs Show resolved Hide resolved
@adamreeve adamreeve merged commit 1519bc0 into G-Research:master Aug 10, 2022
@adamreeve adamreeve deleted the array_read_perf branch August 10, 2022 08:25
@marcin-krystianc
Copy link
Contributor

I've experimented with some further optimisations, but I didn't see an actual improvement.
The most promising approach is https://github.com/G-Research/ParquetSharp/compare/master...marcin-krystianc:ParquetSharp:marcink-20220810-arraypool?expand=1&w=1 where I used ArrayPool to reduce allocations for temporary arrays. Unfortunately in benchmark it shows no actual improvement, I think it is mainly because temporary arrays are not used that often as most of the times there is a single chunk being created and returned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants