Materialize Dictionaries in Group Keys #7647

tustvold · 2023-09-25T15:45:05Z

Is your feature request related to a problem or challenge?

Currently grouping on a dictionary column will return dictionary-encoded group keys. Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns. Additionally following #7587 computing the dictionary is a non-trivial operation that could be eliminated

Describe the solution you'd like

When grouping on a dictionary column, e.g. Dictionary(DataType::Int32, DataType::Utf8), the returned schema should be the underlying value type, i.e. DataType::Utf8.

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

qrilka · 2023-10-07T09:50:41Z

Hi @tustvold
I wanted to look into this ticket, I see the TODO you've left in #7587. Could you advise me some easy way to to simulate this situation e.g. in datafusion-cli or maybe there is some similar test in the code base?

alamb · 2023-10-07T11:26:14Z

You can great dictionary encoded columns using arrow_cast

~~Here is an example~~

Here is a better example (use string dictionaries):
https://github.com/apache/arrow-datafusion/blob/e23d34bae60bb2f9c496241e218bab795af3af83/datafusion/sqllogictest/test_files/topk.slt#L223-L232

qrilka · 2023-10-14T12:54:22Z

@tustvold I've spent some time getting familiar with the code for aggregations and I have a couple of questions. Maybe you could help me here?
I see that for aggregates in an initial plan we have in the base case 2 nested AggregateExecs: Partial and Final(Partitioned). And every AggregateExec uses RowConverter to convert arrays into rows for its input and back from rows into arrays for its output.
I assume that these conversions are basically inevitable because RecordBatches are sent between execution stages and aggregation internally operates on rows, not arrays, is my understanding correct here?
It looks to me that it makes sense to convert from Dictionary to Utf8 on the first AggregateExec but it doesn't look like there's no mode like AggregateInitial. What could be the proper way out here? Maybe there could be a conversion step before aggregation added if any of the fields has a dictionary type? What would you advise?

qrilka · 2023-11-15T09:37:26Z

@alamb maybe you could help me here?

tustvold · 2023-11-15T09:45:20Z

You should not need to do any conversion, the conversion is already being done by RowConverter. Currently there is a cast annotated with a TODO linking back to this ticket. It should just be a case of removing this cast and updating the schema logic within the plan to account for this

qrilka · 2023-11-15T10:04:44Z

@tustvold I saw the conversion in RowConverter but as I understand it happens multiple times: i.e. we convert dictionary into utf8 in AggregatePartial, then cast results back to dictionary and then we do the same in AggregateFinal.

Please let me know if I get your idea right: in the case of AggregatePartial followed by AggregateFinal, the schema for the first one should have Utf8 as its output and the second aggregate will then use Utf8 both as its input and as output.

Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns

alamb · 2024-01-05T19:28:20Z

TLDR I recommend we revert this change and reopen this ticket while we reconsider how to handle this case better.

Background

This ticket caused a functional regression for us downstream in #8738 (a query that used to run started erroring).

The cause of the issue is that the LogicalPlan and ExecutionPlans schemas no longer match. I started exploring making them match in #8766

However while working on that issue, I thought more about this change and I am not sure the change described in this issue is good for multi column groupings at all.

Rationale

Dictionary Encoding on single column group keys is bad because there is no repetition in the data and therefore the Dictionary encoding is pure over head. For example, given

SELECT ... GROUP BY a

Dictionary Encoding is worse than native encoding:

┌──────┐                   ┌──────┐      ┌────┐     
│ foo  │                   │ foo  │      │ 0  │     
├──────┤                   ├──────┤      ├────┤     
│ bar  │                   │ bar  │      │ 1  │     
├──────┤                   ├──────┤      ├────┤     
│ baz  │  ────────▶        │ baz  │      │ 2  │     
├──────┤                   ├──────┤      ├────┤     
│  ff  │                   │  ff  │      │ 3  │     
├──────┤                   ├──────┤      ├────┤     
│  ..  │                   │  ..  │      │ .. │     
├──────┤                   ├──────┤      ├────┤     
│ aaz  │                   │ aaz  │      │9999│     
└──────┘                   └──────┘      └────┘     
                                                    
 Group Values            values array   keys array  
   (distinct                                        
 values of a)

However, the story is different when there is a multi column group key, as in that case, dictionary encoding each column can be a significant performance improvement as they are applied to each column individually and each column may have substantial redundancy. For example, given this query

SELECT ... GROUP BY a,b

┌──────┐  ┌──────┐               ┌──────┐      ┌────┐         ┌──────┐      ┌────┐      
│ foo  │  │ tag1 │               │ foo  │      │ 0  │         │ tag1 │      │ 0  │      
├──────┤  ├──────┤               ├──────┤      ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag2 │               │ baz  │      │ 0  │         │ tag2 │      │ 1  │      
├──────┤  ├──────┤               └──────┘      ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag3 │ ────────▶                   │ 0  │         │ tag3 │      │ 2  │      
├──────┤  ├──────┤                             ├────┤         ├──────┤      ├────┤      
│ foo  │  │ tag4 │                             │ 0  │         │ tag4 │      │ 3  │      
├──────┤  ├──────┤                             ├────┤         └──────┘      ├────┤      
│  ..  │  │  ..  │                             │ .. │                       │ .. │      
├──────┤  ├──────┤                             ├────┤                       ├────┤      
│ baz  │  │ tag4 │                             │ 1  │                       │ 3  │      
└──────┘  └──────┘                             └────┘                       └────┘      
                                                                                        
       Group Values            values array   keys array    values array   keys array   
    (distinct values of             (a)           (a)            (b)           (b)      
           a,b)

This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns.

In fact we think we may have seen a performance regression when picking up this change downstream as well, which could also be explained by the observation above

Thus I recommend we revert this change via #8740 while we reconsider how to handle this case (maybe just for single column group by? Maybe do the dictionary encoding within the RowEncoder to avoid generating many redundant strings?

tustvold · 2024-01-05T20:04:23Z

This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns.

If anything this is precisely the case where the old behaviour was especially inefficient. It will compute the rows, expanding the dictionary, perform the grouping, convert the rows back to strings, convert the strings back to dictionaries, convert back to strings in the row format, again expanding the dictionaries, and repeat for each stage

alamb · 2024-01-05T20:23:14Z

I want to be clear that I have no particular evidence one way or the other about the performance implications of this particular change (and I probably confused the issue with speculation in #7647 (comment))

So what I would like to do is:

Revert the change
Reopen this ticket
Work on getting a PR ready to put the change back in that both 1) doesn't cause a functional regression and 2) we have evidence improves performance

qrilka · 2024-01-06T18:58:43Z

@alamb what could be some evidence for 3.2? Is there anything in the code base or maybe in some other ticket? The plan you show makes total sense

alamb · 2024-01-08T12:31:07Z

Thanks @qrilka

@alamb what could be some evidence for 3.2?

I think we need to create a benchmark that does aggregation queries on dictionary encoded string columns (I know the existing end to end TPCH, and ClickBench benchmarks do not do this). I will file a ticket shortly about this

alamb · 2024-01-08T18:44:15Z

I filed #8791 to track adding additional test coverage, and I plan to work on that in the next few days

tustvold added the enhancement New feature or request label Sep 25, 2023

tustvold mentioned this issue Sep 25, 2023

Update arrow 47.0.0 in DataFusion #7587

Merged

alamb mentioned this issue Sep 25, 2023

[EPIC] A list of performance improvement tickets #5546

Open

29 tasks

qrilka mentioned this issue Nov 16, 2023

Remove unnecessary reassignment #8232

Merged

qrilka mentioned this issue Nov 21, 2023

Materialize dictionaries in group keys #8291

Merged

alamb closed this as completed in #8291 Dec 1, 2023

alamb mentioned this issue Jan 5, 2024

Make schema calculations for LogicalPlan::Aggregate and LogicalPlan::Distinct consistent with the HashExec #8766

Closed

This was referenced Jan 5, 2024

Regression "RowConverter column schema mismatch, expected Utf8 got Dictionary(Int32, Utf8)" after upgrade #8738

Closed

Fix regression by reverting Materialize dictionaries in group keys #8740

Merged

alamb reopened this Jan 8, 2024

alamb mentioned this issue Jan 8, 2024

Add test coverage for grouping on dictionary encoded columns #8791

Open

matthewgapp mentioned this issue Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

alamb mentioned this issue Jan 27, 2024

[WIP] Add timeseries / observability benchmarks #9017

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Materialize Dictionaries in Group Keys #7647

Materialize Dictionaries in Group Keys #7647

tustvold commented Sep 25, 2023

qrilka commented Oct 7, 2023

alamb commented Oct 7, 2023 •

edited

Loading

qrilka commented Oct 14, 2023

qrilka commented Nov 15, 2023

tustvold commented Nov 15, 2023 •

edited

Loading

qrilka commented Nov 15, 2023

alamb commented Jan 5, 2024 •

edited

Loading

tustvold commented Jan 5, 2024

alamb commented Jan 5, 2024

qrilka commented Jan 6, 2024

alamb commented Jan 8, 2024

alamb commented Jan 8, 2024

Materialize Dictionaries in Group Keys #7647

Materialize Dictionaries in Group Keys #7647

Comments

tustvold commented Sep 25, 2023

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

qrilka commented Oct 7, 2023

alamb commented Oct 7, 2023 • edited Loading

qrilka commented Oct 14, 2023

qrilka commented Nov 15, 2023

tustvold commented Nov 15, 2023 • edited Loading

qrilka commented Nov 15, 2023

alamb commented Jan 5, 2024 • edited Loading

Background

Rationale

tustvold commented Jan 5, 2024

alamb commented Jan 5, 2024

qrilka commented Jan 6, 2024

alamb commented Jan 8, 2024

alamb commented Jan 8, 2024

alamb commented Oct 7, 2023 •

edited

Loading

tustvold commented Nov 15, 2023 •

edited

Loading

alamb commented Jan 5, 2024 •

edited

Loading