
Support Partitioning Data by Dictionary Encoded String Array Types #7896

Merged (7 commits into apache:main on Oct 28, 2023)

Conversation


@devinjdangelo devinjdangelo commented Oct 21, 2023

Which issue does this PR close?

Closes #7891

Rationale for this change

The initial implementation of the hive-style partition demux code only supported plain Utf8 arrays. This PR adds support for dictionary encoded Utf8 arrays.

What changes are included in this PR?

  • Extracts string values from dictionary encoded UTF8 arrays to determine hive partitions
  • Adds additional cases to schema semantic equivalence checking: Dictionary(_, dtype) is now considered semantically equivalent to dtype.

Are these changes tested?

Yes, using dictionary encoded values from the parquet-testing submodule.

Are there any user-facing changes?

No, just (hopefully) more robust support for inserting inferred schema string data as a partition column.

},
DataType::Dictionary(key_type, _) => {
    match **key_type {
        DataType::UInt16 => {
@devinjdangelo (author) commented Oct 21, 2023

It would be nice not to have to enumerate every possible key_type here. Ideally there would be a way to cast any Dictionary(_, V) -> V, or at least to iterate over any Dictionary(_, V) and pull out the V values in order without caring what the key type is.
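
For illustration, here is a minimal sketch (not code from this PR) of what that could look like with arrow's cast kernel, which I believe can unpack a dictionary array into its value type regardless of the key type:

// Sketch only: unpack any Dictionary(_, Utf8) column into a plain StringArray
// via the arrow cast kernel, so the key type never needs to be matched on.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn dictionary_values_as_strings(col: &ArrayRef) -> Vec<String> {
    // cast(Dictionary(_, Utf8) -> Utf8) materializes the values in row order.
    let utf8 = cast(col, &DataType::Utf8).expect("dictionary of strings casts to Utf8");
    let utf8 = utf8.as_any().downcast_ref::<StringArray>().unwrap();
    (0..utf8.len()).map(|i| utf8.value(i).to_string()).collect()
}

fn main() {
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let col: ArrayRef = Arc::new(dict);
    assert_eq!(dictionary_values_as_strings(&col), vec!["a", "b", "a"]);
}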

@suremarc (Contributor)

Hey Devin, there is a downcast_dictionary_array! macro that I encountered while attempting to fix this issue. (I was going to submit a PR but was figuring out if/how to add tests!) It lets you write code once that works for all key types, if I am not mistaken. I saw it used elsewhere in DataFusion code: https://github.com/apache/arrow-datafusion/blob/9fde5c4282fd9f0e3332fb40998bf1562c17fcda/datafusion/common/src/hash_utils.rs#L326-L329

You can see the code I used here:
https://github.com/polygon-io/arrow-datafusion/blob/8c955891a6e5b6002faa07e54a67a6ec93347ade/datafusion/core/src/datasource/file_format/write/demux.rs#L316-L328

            DataType::Dictionary(_, value_type)
                if value_type.as_ref() == &DataType::Utf8 =>
            {
                downcast_dictionary_array!(
                    col_array =>  {
                        let array = col_array.downcast_dict::<StringArray>().unwrap();
                        for i in 0..rb.num_rows() {
                            partition_values.push(array.value(i));
                        }
                    },
                    _ => unreachable!(),
                )
            }

@devinjdangelo (author)

Thanks @suremarc for the pointer to that macro! I updated this PR and it seems to work as expected!

I also added a new test based on a dictionary encoded parquet file from one of the testing submodules. It is working without error, but the values look a bit off. I'm not sure at this point whether that is related to the downcasting or something else.

.ok_or(DataFusionError::NotImplemented(format!("It is not yet supported to write to hive partitioned with datatype {}", dtype)))?;
for val in array.into_iter() {
partition_values.push(
val.ok_or(DataFusionError::Execution("Partition values cannot be null!".into()))?
@devinjdangelo (author) commented Oct 21, 2023


Traditionally, null values are sent to __HIVE_DEFAULT_PARTITION__. IMO it is also a reasonable choice to throw an error instead; if the user desires the traditional behavior, they can do COALESCE(part_col, '__HIVE_DEFAULT_PARTITION__') as part_col.

https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label Oct 22, 2023
select * from dictionary_encoded_parquet_partitioned order by (id);
----
0 true 0 0 0 0 0 0 01%2F01%2F09 0 2009-01-01T00:00:00
1 false 1 1 1 10 1.1 10.1 01%2F01%2F09 1 2009-01-01T00:01:00
@devinjdangelo (author)

date_string_col and string_col look off. I'm not sure if that is due to how the values are being downcast or something else.

@devinjdangelo (author)

datafusion-cli shows the values slightly differently, but they also look odd:

select * from './parquet-testing/data/alltypes_dictionary.parquet';
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
| id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col  | string_col | timestamp_col       |
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
| 0  | true     | 0           | 0            | 0       | 0          | 0.0       | 0.0        | 30312f30312f3039 | 30         | 2009-01-01T00:00:00 |
| 1  | false    | 1           | 1            | 1       | 10         | 1.1       | 10.1       | 30312f30312f3039 | 31         | 2009-01-01T00:01:00 |
+----+----------+-------------+--------------+---------+------------+-----------+------------+------------------+------------+---------------------+
2 rows in set. Query took 0.001 seconds.
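
As an aside, the odd-looking values above appear to be the raw column bytes rendered as hex, presumably because the column is read as binary rather than Utf8: 30312f30312f3039 is the ASCII encoding of 01/01/09, and the %2F in the partitioned output earlier is just the percent-encoded /. A quick, purely illustrative check:

// Illustrative only: decode the hex shown by datafusion-cli back into text.
fn hex_to_string(hex: &str) -> String {
    let bytes: Vec<u8> = (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).unwrap())
        .collect();
    String::from_utf8(bytes).unwrap()
}

fn main() {
    assert_eq!(hex_to_string("30312f30312f3039"), "01/01/09"); // date_string_col
    assert_eq!(hex_to_string("30"), "0"); // string_col, row 0
    assert_eq!(hex_to_string("31"), "1"); // string_col, row 1
}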

Contributor

This might be an easier way to create a table with dictionary encoded data:

❯ create table test as values (arrow_cast('foo', 'Dictionary(Int32, Utf8)')), (arrow_cast('bar', 'Dictionary(Int32, Utf8)'));
0 rows in set. Query took 0.002 seconds.

❯ select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8)    |
| Dictionary(Int32, Utf8)    |
+----------------------------+
2 rows in set. Query took 0.001 seconds.

❯ select * from test;
+---------+
| column1 |
+---------+
| foo     |
| bar     |
+---------+
2 rows in set. Query took 0.000 seconds.

@devinjdangelo (author)

Thanks, this worked great for the test.

@alamb (Contributor) left a comment

Thank you very much @devinjdangelo and @suremarc -- it is really appreciated to get this case working 🙏


statement ok
CREATE EXTERNAL TABLE dictionary_encoded_parquet
Contributor

I don't see any "dictionary" here -- why is it called dictionary_encoded_parquet?

@devinjdangelo (author)

I assumed (incorrectly it seems) that alltypes_dictionary.parquet in the parquet-testing submodule was dictionary encoded. https://github.com/apache/parquet-testing/tree/e45cd23f784aab3d6bf0701f8f4e621469ed3be7/data

I will use your suggestion to generate dictionary encoded values instead.

@@ -420,6 +420,11 @@ impl DFSchema {
Self::datatype_is_semantically_equal(k1.as_ref(), k2.as_ref())
&& Self::datatype_is_semantically_equal(v1.as_ref(), v2.as_ref())
}
// The next two cases allow for the possibility that one schema has a dictionary encoded array
Contributor

I am somewhat worried about this change (as it effectively changes the meaning of the function to also include logical types).

I think we should either update the comments to reflect this change, or (my preference) make a new function that is explicit. Perhaps something like:

    /// Returns true if two [`DataType`]s have the same logical data type:
    /// either the same name and type, or, if one is dictionary encoded,
    /// the same name and type of the values.
    /// E.g. Dictionary(_, Utf8) is semantically equivalent to Utf8 since both represent an array of strings
    fn datatype_has_same_logical_type(dt1: &DataType, dt2: &DataType) -> bool {
        match (dt1, dt2) {
            (DataType::Dictionary(_, v1), othertype) => v1.as_ref() == othertype,
            (othertype, DataType::Dictionary(_, v1)) => v1.as_ref() == othertype,
            _ => Self::datatype_is_semantically_equal(dt1, dt2),
        }
    }

@devinjdangelo (author)

I think your concern is justified, as the optimizer also relies on this function and might have stricter equivalence requirements. I created a separate method for logical equivalence which allows for different dictionary encodings as long as the values can ultimately be resolved to the same type.

The optimizer continues to use the original semantic equivalence method, and I've updated the insert_into methods to use the softer logical equivalence check.
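
A rough sketch of what such a logical equivalence check could look like (the name and exact structure are illustrative, not necessarily what this PR adds):

// Sketch only: treat a dictionary type as logically equal to its value type,
// and two dictionaries as logically equal when their value types are,
// regardless of the key (encoding) types.
use arrow::datatypes::DataType;

fn datatype_is_logically_equal(dt1: &DataType, dt2: &DataType) -> bool {
    match (dt1, dt2) {
        (DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
            datatype_is_logically_equal(v1, v2)
        }
        (DataType::Dictionary(_, v1), other) => datatype_is_logically_equal(v1, other),
        (other, DataType::Dictionary(_, v2)) => datatype_is_logically_equal(other, v2),
        // The real implementation would defer to the existing semantic
        // equivalence check here; plain equality keeps the sketch self-contained.
        _ => dt1 == dt2,
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8));
    assert!(datatype_is_logically_equal(&dict, &DataType::Utf8));
    assert!(!datatype_is_logically_equal(&dict, &DataType::Int32));
}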

@devinjdangelo (author)

Checks are currently failing, but I don't think it is related to the changes in this PR:

Run cargo check --no-default-features -p datafusion
    Updating crates.io index
error: failed to select a version for the requirement `ahash = "^0.7.0"`
candidate versions found which didn't match: 0.8.5, 0.8.4, 0.4.1, ...
location searched: crates.io index
required by package `hashbrown v0.11.0`
    ... which satisfies dependency `hashbrown = "^0.11"` of package `rkyv v0.7.0`
    ... which satisfies dependency `rkyv = "^0.7"` of package `rust_decimal v1.27.0`
    ... which satisfies dependency `rust_decimal = "^1.27.0"` of package `datafusion v32.0.0 (/__w/arrow-datafusion/arrow-datafusion/datafusion/core)`
    ... which satisfies path dependency `datafusion` of package `datafusion-benchmarks v32.0.0 (/__w/arrow-datafusion/arrow-datafusion/benchmarks)`
perhaps a crate was updated and forgotten to be re-vendored?

@alamb (Contributor) commented Oct 23, 2023

I think ahash 0.8.3 just got yanked... Maybe we also need to update hashbrown 🤔

@alamb (Contributor) left a comment


Thank you @devinjdangelo -- this looks great now

@alamb merged commit 250e716 into apache:main Oct 28, 2023
22 checks passed
@andygrove added the enhancement (New feature or request) label Nov 5, 2023
Dandandan added a commit to coralogix/arrow-datafusion that referenced this pull request Nov 9, 2023
(The referenced commit is a large sync of upstream apache/arrow-datafusion changes; the part relevant to this PR is reproduced below.)

* Support Partitioning Data by Dictionary Encoded String Array Types (apache#7896)

* support dictionary encoded string columns for partition cols

* remove debug prints

* cargo fmt

* generic dictionary cast and dict encoded test

* updates from review

* force retry checks

* try checks again
Labels: core (Core DataFusion crate), enhancement (New feature or request), sqllogictest (SQL Logic Tests (.slt))