Optimize regex_replace for scalar patterns #3614

Merged
merged 2 commits into from
Sep 27, 2022

@isidentical (Contributor) commented Sep 25, 2022

Which issue does this PR close?

Closes #3613.

Rationale for this change

@Dandandan noticed in #3518 that regex_replace with a known pattern takes an extremely long time during the ClickBench suite. Several factors contribute, but the main one is how generic the regex_replace implementation is: it handles all 2⁴ combinations of scalar and array arguments. Having a generic version is good for compatibility, but it makes us pay that overhead even in common cases (like the example in #3518, where the pattern is static).
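To make the 2⁴ figure concrete, here is a small Python sketch that enumerates the combinations. The four argument names (`source`, `pattern`, `replacement`, `flags`) are an assumption for illustration; the text above only states the count.

```python
from itertools import product

# Assumed argument names; the PR text only says there are 2^4 combinations.
ARGS = ("source", "pattern", "replacement", "flags")


def scalarity_combinations():
    """Enumerate every scalar/array assignment for the four arguments."""
    kinds = ("scalar", "array")
    return [dict(zip(ARGS, combo)) for combo in product(kinds, repeat=len(ARGS))]


combos = scalarity_combinations()
# Combinations in which the pattern is a scalar, i.e. known up front.
static_pattern = [c for c in combos if c["pattern"] == "scalar"]

print(len(combos), len(static_pattern))  # prints "16 8"
```

Half of the combinations have a scalar pattern, which is the case the example in #3518 falls into.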

What changes are included in this PR?

This PR adds a scalarity-based (not sure if this is a real word) specialization system: at runtime, the best regex_replace variant is picked and executed for the given set of inputs. This system is just a start; if the gains are large enough, we might add a third case where the replacement is also known.
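The idea can be sketched in Python as follows. This is only an illustration of the dispatch, not the Rust code in this PR: the function names are hypothetical, flags are left out, Python's `re` stands in for the real regex engine, and replacing only the first match (`count=1`) is an assumption about the default behaviour.

```python
import re
from typing import List, Optional, Sequence, Union

Column = Sequence[Optional[str]]
Arg = Union[str, Column]  # a scalar literal or a per-row array


def _broadcast(arg: Arg, n: int) -> Column:
    """Expand a scalar to n rows; pass arrays through unchanged."""
    return [arg] * n if isinstance(arg, str) else arg


def replace_generic(sources: Column, patterns: Column,
                    replacements: Column) -> List[Optional[str]]:
    """Generic variant: the pattern may differ per row, so it is looked up per row."""
    cache = {}
    out: List[Optional[str]] = []
    for s, p, r in zip(sources, patterns, replacements):
        if s is None or p is None or r is None:
            out.append(None)
            continue
        if p not in cache:
            cache[p] = re.compile(p)
        out.append(cache[p].sub(r, s, count=1))
    return out


def replace_static_pattern(sources: Column, pattern: str,
                           replacements: Column) -> List[Optional[str]]:
    """Specialized variant: the pattern is a scalar, so compile it exactly once."""
    compiled = re.compile(pattern)
    return [
        None if s is None or r is None else compiled.sub(r, s, count=1)
        for s, r in zip(sources, replacements)
    ]


def regex_replace(source: Arg, pattern: Arg, replacement: Arg) -> List[Optional[str]]:
    """Pick the variant at runtime based on whether the pattern is a scalar."""
    arrays = [a for a in (source, pattern, replacement) if not isinstance(a, str)]
    n = len(arrays[0]) if arrays else 1
    sources = _broadcast(source, n)
    replacements = _broadcast(replacement, n)
    if isinstance(pattern, str):
        return replace_static_pattern(sources, pattern, replacements)
    return replace_generic(sources, _broadcast(pattern, n), replacements)
```

Both variants return the same rows for the same inputs; the specialized one simply skips the per-row pattern lookup.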

Are there any user-facing changes?

This is mainly an optimization, and there shouldn't be any user-facing changes.

Benchmarks

New benchmarks are in #3614 (comment). Overall, they show a speed-up in the range of 20-35x, depending on the query and input.

Old benchmarks

All benchmarks were run in --release mode (using the datafusion-cli crate with the -f option).

The first benchmark is Query 28 from ClickHouse's ClickBench suite:

SELECT
    REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length("Referer")) AS l,
    COUNT(*) AS c,
    MIN("Referer")
FROM hits_1
    WHERE "Referer" <> ''
    GROUP BY k
    HAVING COUNT(*) > 100000
    ORDER BY l DESC
LIMIT 25;
| | Master | This Branch | Factor |
|---|---|---|---|
| Cold Run | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
| Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
| Average | 2.408 seconds | 0.277 seconds | 8.69x speed-up |

(Note: I don't have the full ClickBench data, only a partition of it [1/100 scale], so these numbers might not be very representative.)
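For readers unfamiliar with the query, the REGEXP_REPLACE call in Query 28 reduces each referer URL to its host, which then becomes the grouping key. A rough Python equivalent is sketched below. It assumes the pattern's dot after `www` is escaped and that the replacement is a backreference to the first capture group; the sample URLs are made up, and Python's `re` is only a stand-in for the engine's regex dialect.

```python
import re
from collections import Counter

# Static pattern from the query, compiled once.
PATTERN = re.compile(r"^https?://(?:www\.)?([^/]+)/.*$")


def referer_key(referer: str) -> str:
    """Return the host for URLs that match; leave other strings unchanged."""
    return PATTERN.sub(r"\1", referer, count=1)


# Hypothetical sample data, not taken from the ClickBench dataset.
referers = [
    "https://www.example.com/a",
    "http://example.com/b/c",
    "https://example.org/",
    "not-a-url",
]

# Mirrors GROUP BY k ... COUNT(*) on a tiny scale.
counts = Counter(referer_key(r) for r in referers)
print(counts["example.com"], counts["example.org"], counts["not-a-url"])  # prints "2 1 1"
```

Because the pattern is the same for every row, it only has to be compiled once, which is the case this PR specializes.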

The second benchmark passes both the source and the replacement as arrays, and shows a speed-up factor of 1.7x.

-- Generate data
--
-- import secrets
-- import random
--
-- rows = 1_000_000
--
-- data = {"user_id": [], "website": []}
-- for _ in range(rows):
--     data["user_id"].append(secrets.token_hex(8))
--
--     # Sometimes it is proper URL, and sometimes it is not.
--     data["website"].append(
--         random.choice(["http", "https", "unknown", ""])
--         + random.choice([":", "://"])
--         + random.choice(["google", "facebook"])
--         + random.choice([".com", ".org", ""])
--     )
--
-- import pandas as pd
-- df = pd.DataFrame(data)
-- df.to_parquet("data.parquet")

CREATE EXTERNAL TABLE generated_data
STORED AS PARQUET
LOCATION 'data.parquet';

-- Query 1
EXPLAIN ANALYZE
SELECT
    REGEXP_REPLACE("website", '^https?://(?:www.)?([^/]+)$', "user_id") AS encoded_website
FROM generated_data;
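As a rough illustration of what Query 1 computes per row, the Python sketch below applies the same static pattern with a per-row replacement. The helper name and sample rows are hypothetical, and Python's `re` stands in for the engine's regex dialect. A function is used as the replacement so that the `user_id` value is inserted literally.

```python
import re

# Static pattern copied from Query 1, compiled once for all rows.
PATTERN = re.compile(r"^https?://(?:www.)?([^/]+)$")


def encode_website(website: str, user_id: str) -> str:
    """Replace a matching website with the row's user_id; keep others as-is."""
    return PATTERN.sub(lambda _match: user_id, website, count=1)


# Hypothetical rows in the shape produced by the generator: (user_id, website).
rows = [
    ("ab12", "https://google.com"),
    ("cd34", "http:facebook.org"),
    ("ef56", "unknown://google"),
    ("0a1b", "://facebook.com"),
]

encoded = [encode_website(website, user_id) for user_id, website in rows]
print(encoded)
# prints "['ab12', 'http:facebook.org', 'unknown://google', '://facebook.com']"
```

Since the pattern is anchored at both ends, a matching website is replaced entirely by its `user_id`. With the generator above, only rows whose scheme is `http` or `https` and whose separator is `://` can match, which is about a quarter of the rows.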

@github-actions bot added the physical-expr (Physical Expressions) label on Sep 25, 2022
@codecov-commenter commented Sep 26, 2022

Codecov Report

Merging #3614 (bbb8c8b) into master (ebb28f5) will decrease coverage by 0.07%.
The diff coverage is 85.23%.

❗ Current head bbb8c8b differs from pull request most recent head d0f1020. Consider uploading reports for the commit d0f1020 to get more accurate results

@@            Coverage Diff             @@
##           master    #3614      +/-   ##
==========================================
- Coverage   86.07%   85.99%   -0.08%     
==========================================
  Files         300      300              
  Lines       56314    56449     +135     
==========================================
+ Hits        48473    48546      +73     
- Misses       7841     7903      +62     
| Impacted Files | Coverage Δ |
|---|---|
| datafusion/physical-expr/src/functions.rs | 92.66% <50.00%> (-0.10%) ⬇️ |
| datafusion/optimizer/src/simplify_expressions.rs | 82.67% <82.60%> (-0.01%) ⬇️ |
| datafusion/physical-expr/src/regex_expressions.rs | 65.76% <86.88%> (-17.27%) ⬇️ |
| datafusion/common/src/scalar.rs | 85.18% <0.00%> (-0.07%) ⬇️ |
| datafusion/expr/src/logical_plan/plan.rs | 77.10% <0.00%> (ø) |
| datafusion/core/src/physical_plan/metrics/value.rs | 87.56% <0.00%> (+0.49%) ⬆️ |


@Dandandan Dandandan merged commit 15c19c3 into apache:master Sep 27, 2022
@Dandandan (Contributor) commented

Thank you, @isidentical!

@ursabot commented Sep 27, 2022

Benchmark runs are scheduled for baseline = ea3dbb6 and contender = 15c19c3. 15c19c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Development

Successfully merging this pull request may close these issues.

Optimize regex_replace with a known pattern / replacement
4 participants