More efficient Dictionary / constant encoding for partition values in ListingFileProvider #1931

Open
alamb opened this issue Mar 5, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Mar 5, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The ListingFileProvider after #1860 uses UInt16 for the indexes of the DictionaryArray type.

This means that the number of partitions is limited to 2^16 = 65,536. It also means that when scanning files from a source that has fewer than 256 distinct values (which could have fit in UInt8), space and time are wasted on larger-than-needed dictionary columns (which will all have the same value).
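To make the arithmetic concrete, here is a rough sketch in plain Rust (no arrow dependency). The helper names `max_distinct_values` and `key_buffer_bytes` are made up for illustration and are not part of DataFusion or arrow; the sketch only counts the bytes of the dictionary key buffer and ignores the values array and any validity bitmap.

```rust
/// Number of distinct dictionary entries addressable by an unsigned key
/// that is `key_bytes` bytes wide (1 for UInt8, 2 for UInt16, ...).
/// Hypothetical helper, for illustration only.
fn max_distinct_values(key_bytes: u32) -> u128 {
    1u128 << (8 * key_bytes)
}

/// Bytes taken by the key buffer alone for a column of `rows` rows.
/// Hypothetical helper, for illustration only.
fn key_buffer_bytes(rows: usize, key_bytes: usize) -> usize {
    rows * key_bytes
}
```

For example, `max_distinct_values(2)` is 65,536 while `max_distinct_values(1)` is 256, and for a batch of 8,192 rows `key_buffer_bytes(8192, 2)` is twice `key_buffer_bytes(8192, 1)`, even though every key in a partition column of a single file is identical.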

Describe the solution you'd like
Ideally the partition column would be a constant (or a DictionaryArray with UInt8 indexes), and the various upstream operations would create DictionaryArrays with larger index sizes as needed.
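One way to picture "larger index sizes as needed" is a small selection function that picks the narrowest representation for a given number of distinct partition values. This is only a sketch under assumptions: `Encoding`, `KeyWidth` and `choose_encoding` are hypothetical names, not existing DataFusion or arrow types.

```rust
/// Hypothetical key widths, mirroring Arrow's unsigned integer key types.
#[derive(Debug, PartialEq)]
enum KeyWidth {
    U8,
    U16,
    U32,
    U64,
}

/// Hypothetical description of how a partition column could be represented.
#[derive(Debug, PartialEq)]
enum Encoding {
    /// A single value repeated for every row.
    Constant,
    /// A dictionary whose keys use the given width.
    Dictionary(KeyWidth),
}

/// Pick the narrowest encoding that can hold `distinct_values` entries.
fn choose_encoding(distinct_values: u64) -> Encoding {
    if distinct_values == 1 {
        Encoding::Constant
    } else if distinct_values <= (u8::MAX as u64) + 1 {
        Encoding::Dictionary(KeyWidth::U8)
    } else if distinct_values <= (u16::MAX as u64) + 1 {
        Encoding::Dictionary(KeyWidth::U16)
    } else if distinct_values <= (u32::MAX as u64) + 1 {
        Encoding::Dictionary(KeyWidth::U32)
    } else {
        Encoding::Dictionary(KeyWidth::U64)
    }
}
```

Under this sketch a single file's partition column would come out as `Encoding::Constant`, and an operation that merges many partitions would widen the keys only once the distinct count crosses 256 or 65,536.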

Additional context
Suggested by @rdettai on #1860, at #1860 (comment) and #1860 (comment)

@rdettai
Contributor

rdettai commented Mar 7, 2022

Thanks @alamb for tracking this. Am I correct in my understanding that #1248 would be another solution (and probably a better one) to this issue?

@alamb
Contributor Author

alamb commented Mar 7, 2022

> Thanks @alamb for tracking this. Am I correct in my understanding that #1248 would be another solution (and probably a better one) to this issue?

Probably @rdettai -- though in that case we'll have to figure out what happens when RecordBatches with constants are concatenated 🤔 -- https://docs.rs/arrow/9.1.0/arrow/compute/kernels/concat/index.html
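To illustrate the open question, here is one possible behaviour sketched in plain Rust. It is an assumption for discussion, not how the arrow `concat` kernel works today: constants with the same value stay constant, and anything else is materialized into a dictionary. `PartitionColumn` and `concat_constants` are hypothetical names.

```rust
/// Hypothetical column representation, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum PartitionColumn {
    /// One value repeated `len` times.
    Constant { value: String, len: usize },
    /// Distinct values plus one key per row.
    Dictionary { values: Vec<String>, keys: Vec<u32> },
}

/// Concatenate constant columns given as `(value, row_count)` pairs.
/// Assumed behaviour: stay constant if all values match, otherwise
/// fall back to a dictionary. An empty input yields an empty dictionary.
fn concat_constants(parts: &[(&str, usize)]) -> PartitionColumn {
    if let Some((first, _)) = parts.first() {
        if parts.iter().all(|(v, _)| v == first) {
            return PartitionColumn::Constant {
                value: first.to_string(),
                len: parts.iter().map(|(_, n)| n).sum(),
            };
        }
    }
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u32> = Vec::new();
    for (value, len) in parts {
        // Reuse the key of a value seen before, otherwise append it.
        let key = match values.iter().position(|v| v.as_str() == *value) {
            Some(i) => i,
            None => {
                values.push(value.to_string());
                values.len() - 1
            }
        };
        keys.extend(std::iter::repeat(key as u32).take(*len));
    }
    PartitionColumn::Dictionary { values, keys }
}
```

The sketch uses `u32` keys for simplicity; choosing the key width at this point is exactly the part that would need to be decided.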
