Read whole leaf-level arrays at once when possible, rather than growing lists #295
This PR changes how array values are read: instead of growing lists dynamically, it creates arrays directly from spans of the buffered logical values.
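To illustrate the difference, here is a minimal sketch of the two approaches. It is not the actual ParquetSharp code: the class name, method names and the `start`/`length` parameters are hypothetical, and it assumes the length of a leaf-level array is already known before its values are copied.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; these are not ParquetSharp types or methods.
static class ArrayReadSketch
{
    // Before: append values one at a time to a list, then copy the list into an array.
    public static int[] ReadByGrowingList(ReadOnlySpan<int> buffered, int start, int length)
    {
        var list = new List<int>();
        for (var i = start; i < start + length; i++)
        {
            list.Add(buffered[i]);
        }
        return list.ToArray();
    }

    // After: the array length is known up front (an assumption of this sketch),
    // so slice the buffered values and copy them into a new array in one step.
    public static int[] ReadFromSpan(ReadOnlySpan<int> buffered, int start, int length)
    {
        return buffered.Slice(start, length).ToArray();
    }
}
```

Both methods return the same values. The span version allocates the result array once at its final size, while the list version may reallocate its backing storage as it grows and then copies again in `ToArray()`.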
This is based on the work done by @philjdf in philjdf#4 and philjdf#5. The new benchmark has mostly been copied as-is, but the array reading changes have been adapted to work with the latest master code.
On my machine (AMD Ryzen 9 5900X with 64 GB RAM running Fedora Linux 36 with dotnet SDK 6.0.107), I get the following benchmark results:
The speedup from ParquetSharp 7.0 to 8.0 appears to come from changes in Arrow rather than from any of the changes we've made in ParquetSharp. I've also included benchmark results with the dotnet server GC enabled, as this can improve performance with ParquetSharp significantly.
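When reproducing the server GC numbers, it can help to confirm which GC mode the process is actually using. The sketch below is only a suggestion and is not part of this PR; the `GcModeCheck` class is a hypothetical helper. Server GC is commonly enabled by setting `<ServerGarbageCollector>true</ServerGarbageCollector>` in the project file or by setting the `DOTNET_gcServer=1` environment variable; how it was enabled for the results above is not stated here.

```csharp
using System;
using System.Runtime;

// Hypothetical helper, not part of ParquetSharp or its benchmark project.
static class GcModeCheck
{
    // Reports which garbage collector mode the current process is running with.
    public static string Describe()
    {
        return GCSettings.IsServerGC ? "server" : "workstation";
    }
}
```

Calling `GcModeCheck.Describe()` at the start of a benchmark run and logging the result makes it clear which set of numbers a given run belongs to.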