Runtime-adaptive data representation #12720

Open
findepi opened this issue Oct 2, 2024 · 0 comments

findepi commented Oct 2, 2024

Originally posted by @andygrove in #11513 (comment)

We are running into the RecordBatches with same logical type but different physical types issue in DataFusion Comet. For a single table, a column may be dictionary-encoded in some Parquet files, and not in others, so we are forced to cast them all to the same type, which introduces unnecessary dictionary encoding (or decoding) overhead.

The result of DataFusion physical planning mandates a particular Arrow type (DataType) for each processed column.
This doesn't reflect the reality of modern systems, though.

  • source data may be naturally representable in several different Arrow types (DataTypes), and forcing a single common representation is inefficient
  • adaptive execution of certain operations (such as pre-aggregation) would benefit from being able to adjust data processing in response to characteristics of the incoming data observed at runtime
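To make the second point concrete, here is a rough sketch of a pre-aggregation step that watches the ratio of distinct groups to rows and stops aggregating when the data turns out to be high-cardinality. This is a toy model using only the standard library, not DataFusion code: `AdaptivePreAgg`, `Output`, `probe_rows`, and `max_ratio` are hypothetical names invented for illustration.

```rust
use std::collections::HashMap;

/// What the toy operator emits for one input batch.
#[derive(Debug, PartialEq)]
enum Output {
    /// Rows were absorbed into the hash table; nothing is emitted yet.
    Buffered,
    /// Rows are forwarded unaggregated as (key, partial_count) pairs.
    PassThrough(Vec<(String, u64)>),
}

/// Hypothetical adaptive partial COUNT(*) ... GROUP BY key.
struct AdaptivePreAgg {
    counts: HashMap<String, u64>,
    rows_seen: u64,
    probe_rows: u64, // assumed knob: rows to observe before deciding
    max_ratio: f64,  // assumed knob: max tolerated groups/rows ratio
    pass_through: bool,
}

impl AdaptivePreAgg {
    fn new(probe_rows: u64, max_ratio: f64) -> Self {
        Self {
            counts: HashMap::new(),
            rows_seen: 0,
            probe_rows,
            max_ratio,
            pass_through: false,
        }
    }

    fn is_pass_through(&self) -> bool {
        self.pass_through
    }

    fn push(&mut self, batch: &[&str]) -> Output {
        if self.pass_through {
            // Aggregation was judged not worthwhile: forward rows as-is.
            let rows = batch.iter().map(|k| ((*k).to_string(), 1)).collect();
            return Output::PassThrough(rows);
        }
        for key in batch {
            *self.counts.entry((*key).to_string()).or_insert(0) += 1;
        }
        self.rows_seen += batch.len() as u64;
        // Once enough rows were observed, re-check the ratio after every batch.
        if self.rows_seen >= self.probe_rows {
            let ratio = self.counts.len() as f64 / self.rows_seen as f64;
            if ratio > self.max_ratio {
                self.pass_through = true;
            }
        }
        Output::Buffered
    }

    /// Emits the groups buffered before any switch, sorted by key.
    fn finish(self) -> Vec<(String, u64)> {
        let mut groups: Vec<_> = self.counts.into_iter().collect();
        groups.sort();
        groups
    }
}
```

The relevant part for this issue is that the switch changes the shape of the operator's output mid-stream, which a plan with a fixed per-column contract cannot easily express.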

Example 1:
A plain table scan reading Parquet files. The same column may be represented differently in individual files (plain array vs. RLE/REE vs. dictionary), and forcing a particular data layout on the output of the table scan is not optimally efficient.
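As an illustration of what "not forcing a layout" could look like, the sketch below models one logical string column in three physical layouts and evaluates an equality count on each layout directly, without casting to a common type first. It is a stdlib-only toy: `StrColumn`, `scan_files`, and `count_across_files` are hypothetical names, not Arrow or DataFusion APIs.

```rust
/// Toy stand-ins for three physical layouts of one logical string column.
/// Hypothetical types for illustration only; not Arrow or DataFusion APIs.
#[derive(Debug, Clone, PartialEq)]
enum StrColumn {
    /// One value per row.
    Plain(Vec<String>),
    /// Row `i` holds `values[keys[i]]`.
    Dictionary { keys: Vec<usize>, values: Vec<String> },
    /// Run `j` holds `values[j]` and ends (exclusively) at row `run_ends[j]`.
    RunEnd { run_ends: Vec<usize>, values: Vec<String> },
}

impl StrColumn {
    /// Number of logical rows, regardless of layout.
    fn len(&self) -> usize {
        match self {
            StrColumn::Plain(values) => values.len(),
            StrColumn::Dictionary { keys, .. } => keys.len(),
            StrColumn::RunEnd { run_ends, .. } => run_ends.last().copied().unwrap_or(0),
        }
    }

    /// Counts rows equal to `needle` without converting to a common layout.
    fn count_eq(&self, needle: &str) -> usize {
        match self {
            StrColumn::Plain(values) => {
                values.iter().filter(|v| v.as_str() == needle).count()
            }
            StrColumn::Dictionary { keys, values } => {
                // Compare each dictionary entry once, then only look at keys.
                let hits: Vec<bool> = values.iter().map(|v| v.as_str() == needle).collect();
                keys.iter().filter(|&&k| hits[k]).count()
            }
            StrColumn::RunEnd { run_ends, values } => {
                // Compare once per run and add the whole run length on a match.
                let mut start = 0;
                let mut count = 0;
                for (&end, value) in run_ends.iter().zip(values) {
                    if value.as_str() == needle {
                        count += end - start;
                    }
                    start = end;
                }
                count
            }
        }
    }
}

fn strings(items: &[&str]) -> Vec<String> {
    items.iter().map(|s| (*s).to_string()).collect()
}

/// Three "files" that all encode the rows a, a, b, a in different layouts.
fn scan_files() -> Vec<StrColumn> {
    vec![
        StrColumn::Plain(strings(&["a", "a", "b", "a"])),
        StrColumn::Dictionary { keys: vec![0, 0, 1, 0], values: strings(&["a", "b"]) },
        StrColumn::RunEnd { run_ends: vec![2, 3, 4], values: strings(&["a", "b", "a"]) },
    ]
}

/// Evaluates the predicate per file, in whatever layout the file arrived in.
fn count_across_files(files: &[StrColumn], needle: &str) -> usize {
    files.iter().map(|col| col.count_eq(needle)).sum()
}
```

In this toy, the dictionary and run-end branches do fewer string comparisons than the plain branch, which is the kind of saving that is lost when every batch is first cast to one layout.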

Example 2:
A UNION ALL query may union data from multiple sources, which can naturally produce data in different data types.
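A minimal sketch of how such a union could behave if it validated inputs against a logical type only and forwarded batches in whatever physical type they arrived in. Everything here is a hypothetical stdlib-only model: `LogicalType`, `PhysicalType`, `Batch`, and `union_all` are invented for illustration and do not correspond to existing DataFusion types.

```rust
/// Hypothetical logical types, i.e. what the query author sees.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogicalType {
    Int64,
    Utf8,
}

/// Hypothetical physical layouts a source might produce.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalType {
    Int64,
    Utf8,
    DictionaryUtf8,
    RunEndUtf8,
}

impl PhysicalType {
    /// Maps each physical layout to the logical type it represents.
    fn logical(self) -> LogicalType {
        match self {
            PhysicalType::Int64 => LogicalType::Int64,
            PhysicalType::Utf8
            | PhysicalType::DictionaryUtf8
            | PhysicalType::RunEndUtf8 => LogicalType::Utf8,
        }
    }
}

/// A single-column batch reduced to its physical type and row count.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Batch {
    physical: PhysicalType,
    rows: usize,
}

/// Concatenates the inputs, checking only the logical type of each batch.
/// Batches are forwarded as-is; no cast to a common physical type happens.
fn union_all(inputs: Vec<Vec<Batch>>, expected: LogicalType) -> Result<Vec<Batch>, String> {
    let mut out = Vec::new();
    for (index, input) in inputs.into_iter().enumerate() {
        for batch in input {
            if batch.physical.logical() != expected {
                return Err(format!(
                    "input {} produced {:?}, expected logical type {:?}",
                    index, batch.physical, expected
                ));
            }
            out.push(batch);
        }
    }
    Ok(out)
}
```

Under this model, downstream operators would have to accept a stream whose physical type varies from batch to batch.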

Context

  • not prescribing a particular Arrow type (DataType) requires some higher-level notion of types in DataFusion, which is to be delivered in [Proposal] Decouple logical from physical types #11513
    • that issue explicitly lists runtime adaptivity ("RecordBatches with same logical type but different physical types") as a non-goal, so there is no overlap
  • formal signatures of functions inserted into the plan need to operate on a higher-level notion of types; this relates to Simple Functions #12635