Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async ParquetExec #1617

Closed
wants to merge 4 commits into from
Closed

Async ParquetExec #1617

wants to merge 4 commits into from

Conversation

tustvold
Copy link
Contributor

I've not thoroughly tested this yet, it may represent a performance regression, and will likely need more memory

Which issue does this PR close?

Closes #TBD.

Rationale for this change

Avoids needing to use a blocking threadpool for reading Parquet data, and opens the door for fetching data from object storage.

What changes are included in this PR?

Updates ParquetExec to use the POC async ParquetRecordBatchStream in apache/arrow-rs#1154

Are there any user-facing changes?

Yes, I had to alter the ObjectReader trait

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jan 19, 2022
@tustvold tustvold marked this pull request as draft January 19, 2022 19:12
@alamb
Copy link
Contributor

alamb commented Feb 13, 2022

This can probably be refreshed after #1775 is merged

@tustvold
Copy link
Contributor Author

tustvold commented Feb 15, 2022

I'm going to wait until after #1738 is merged and I have time to do some comparative investigation. Async is certainly not free as the block structure of parquet is not particularly amenable to streaming. Prior to merge I'd like a better handle on what the trade-offs here are, I'd expect memory usage to be higher, but performance may be better, not sure 😄

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

#1738 is merged Edit: approved

@tustvold
Copy link
Contributor Author

Running the benchmarks against a backported apache/arrow-rs#1418 show the performance of this to be significantly worse than the current Sync reader (~2x). It is possible this is just a simple oversight with a simple fix, but I'm unlikely to have time to look into it this week. Async is far from free and so I'd expected the performance to be somewhat worse, but not that much worse...

Hopefully I will have some time next week, otherwise I'll keep this pootling along on the backburner until it is ready for prime-time 😄

@alamb
Copy link
Contributor

alamb commented May 2, 2022

Marking PR as stale - will close in a while to clean up the PR list. Feel free to reopen if more work is planned

@tustvold tustvold closed this May 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants