Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft v2] Another Multi group by optimization #10976

Closed
wants to merge 2 commits into from

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jun 18, 2024

Which issue does this PR close?

Closes #.

Related to #9403

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@jayzhan211
Copy link
Contributor Author

Screenshot 2024-06-18 at 3 14 28 PM

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jun 18, 2024

Great news! Although this approach is slightly slow on small string, but it is significant better if string is large.

THIS PR

Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2873.6 ms and returned 10 rows
Query 0 iteration 1 took 2595.7 ms and returned 10 rows
Query 0 iteration 2 took 2320.4 ms and returned 10 rows
Query 0 iteration 3 took 2269.0 ms and returned 10 rows
Query 0 iteration 4 took 2321.2 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 2280.9 ms and returned 10 rows
Query 1 iteration 1 took 2679.1 ms and returned 10 rows
Query 1 iteration 2 took 2683.2 ms and returned 10 rows
Query 1 iteration 3 took 2534.8 ms and returned 10 rows
Query 1 iteration 4 took 2804.4 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 28927.7 ms and returned 10 rows
Query 2 iteration 1 took 27570.2 ms and returned 10 rows
Query 2 iteration 2 took 28776.2 ms and returned 10 rows
Query 2 iteration 3 took 32869.2 ms and returned 10 rows
Query 2 iteration 4 took 35559.9 ms and returned 10 rows
Done

Main

Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2010.2 ms and returned 10 rows
Query 0 iteration 1 took 1896.5 ms and returned 10 rows
Query 0 iteration 2 took 1707.0 ms and returned 10 rows
Query 0 iteration 3 took 1776.9 ms and returned 10 rows
Query 0 iteration 4 took 1685.8 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 1695.3 ms and returned 10 rows
Query 1 iteration 1 took 1727.3 ms and returned 10 rows
Query 1 iteration 2 took 1704.5 ms and returned 10 rows
Query 1 iteration 3 took 1832.4 ms and returned 10 rows
Query 1 iteration 4 took 1709.8 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 57890.4 ms and returned 10 rows
Query 2 iteration 1 took 60489.2 ms and returned 10 rows
Query 2 iteration 2 took 59249.8 ms and returned 10 rows
Query 2 iteration 3 took 57315.8 ms and returned 10 rows
Query 2 iteration 4 took 53942.7 ms and returned 10 rows
Done

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool @jayzhan211 -- exactly the right idea I think

SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000;
SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever way to quickly iterate for benchmark tests 👍

);


let u64_vec: Vec<u64> = groups.iter().map(|&x| x as u64).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be faster to write into u64_vec directly rather than write to groups and then copy over



// Index Array: [0, 1, 1, 0]
// Data Array: ['a', 'c']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code copies all the strings again, which likely takes significant time

One way to make this faster is likey to special case the "has no nulls" path (so we can avoid if let Some(.) check each row

The only other ways I can think to make this faster is to avoid copying the strings all together. The only way to do so I can figure are:

  1. Return a DictionaryArray as output
  2. Return a StringViewArray (similar to DictionayArray) [Epic] Implement support for StringView in DataFusion #10918

🤔

Maybe we can try to hack in the ability to have the HashAggregateExec return Dictionary(Int32, String) for these multi-column groups and see if we can show significant performance improvements

If so then we can figure out how to thread that ability through the engine.

@alamb
Copy link
Contributor

alamb commented Jun 21, 2024

BTW I hope/plan to use a plane trip on Sunday to prototype the "add a physical optimizer pass that sets types to StringView" approach

@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

I am starting to play with this PR again

@alamb alamb closed this Jun 28, 2024
@alamb alamb reopened this Jun 28, 2024
alamb and others added 2 commits June 28, 2024 16:19
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Aug 28, 2024
@github-actions github-actions bot closed this Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants