Optimize `regex_replace` for scalar patterns #3614

isidentical · 2022-09-25T23:07:29Z

Which issue does this PR close?

Closes #3613.

Rationale for this change

@Dandandan noticed in #3518 that regex_replace with a known pattern takes an extremely long time during the ClickBench suite. Several factors contribute, but the main one is how generic the regex_replace implementation is (it can handle 2⁴ combinations of scalar and array arguments). Having a generic version is good for compatibility, but it makes us pay that overhead in common cases (like the example in #3518, where the pattern is static).
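To make the 2⁴ figure concrete, here is a small sketch that enumerates the argument shapes. It assumes the four arguments are source, pattern, replacement and flags; the names are illustrative and not taken from the implementation.

```python
from itertools import product

# Hypothetical argument names, for illustration only.
ARGS = ("source", "pattern", "replacement", "flags")


def shape_combinations():
    """Return every scalar/array assignment for the four arguments."""
    kinds = ("scalar", "array")
    return [dict(zip(ARGS, combo)) for combo in product(kinds, repeat=len(ARGS))]


combos = shape_combinations()
print(len(combos))  # 16
```

A generic kernel has to be correct for all of these shapes, which is why the all-array shape tends to dictate how the code is written.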

What changes are included in this PR?

This PR adds a scalarity-based (not sure if this is a real word) specialization system: at runtime, the best regex_replace variant is picked and executed for the given set of inputs. The system here is just the start, and if the gains are large enough we might add a third case where the replacement is also known.
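The idea can be sketched in a few lines of Python. This is not the PR's Rust code, only an illustration under simplifying assumptions: the source is always a list, a plain `str` stands for a scalar, and nulls and flags are ignored. All function names are hypothetical.

```python
import re


def is_scalar(arg):
    """In this sketch a plain string is a scalar and a list is an array."""
    return isinstance(arg, str)


def replace_generic(sources, patterns, replacements):
    """Generic path: each row may carry its own pattern, so one is resolved per row."""
    return [
        re.compile(pat).sub(rep, src)
        for src, pat, rep in zip(sources, patterns, replacements)
    ]


def replace_static_pattern(sources, pattern, replacements):
    """Specialized path: the pattern is compiled once and reused for every row."""
    compiled = re.compile(pattern)
    return [compiled.sub(rep, src) for src, rep in zip(sources, replacements)]


def regex_replace(sources, pattern, replacement):
    """Pick a variant at call time based on which arguments are scalars."""
    if is_scalar(replacement):
        replacement = [replacement] * len(sources)
    if is_scalar(pattern):
        return replace_static_pattern(sources, pattern, replacement)
    return replace_generic(sources, pattern, replacement)


print(regex_replace(["a1", "b22"], r"\d+", "#"))  # ['a#', 'b#']
```

Both paths return the same values; the specialized one simply avoids resolving the pattern once per row.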

Are there any user-facing changes?

This is mainly an optimization, and there shouldn't be any user-facing changes.

Benchmarks

The new benchmarks are in #3614 (comment); overall they show a speed-up in the range of 20-35X depending on the query and input.

Old benchmarks

All benchmarks were run in --release mode (using the datafusion-cli crate with the -f option).

The initial benchmark is Query 28 from ClickHouse's ClickBench:

SELECT
    REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length("Referer")) AS l,
    COUNT(*) AS c,
    MIN("Referer")
FROM hits_1
    WHERE "Referer" <> ''
    GROUP BY k
    HAVING COUNT(*) > 100000
    ORDER BY l DESC
LIMIT 25;

| | Master | This Branch | Factor |
|---|---|---|---|
| Cold Run | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
| Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
| Average | 2.408 seconds | 0.277 seconds | 8.69x speed-up |

(Note: I don't have the full ClickBench data, just have a partition of it [1/100 scale] so this might not be very reflective)
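For readers unfamiliar with the query, this is roughly what the `k` column computes. The sketch assumes the dot after `www` is escaped and that the replacement is a backreference to the first capture group; the sample URLs are made up.

```python
import re

# Pattern as used for Query 28, under the assumptions stated above.
PATTERN = re.compile(r"^https?://(?:www\.)?([^/]+)/.*$")


def referer_key(referer):
    """Reduce a referer URL to its host; non-matching input is returned unchanged."""
    return PATTERN.sub(r"\1", referer)


print(referer_key("https://www.example.com/search?q=1"))  # example.com
print(referer_key("http://example.org/"))                 # example.org
print(referer_key("ftp://example.com/x"))                 # ftp://example.com/x
```

Because the pattern is the same for every row, it is exactly the kind of input the scalar-pattern path targets.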

The second benchmark has both the source and the replacement as arrays, and shows a speed-up factor of 1.7X.

-- Generate data
--
-- import secrets
-- import random
--
-- rows = 1_000_000
--
-- data = {"user_id": [], "website": []}
-- for _ in range(rows):
--     data["user_id"].append(secrets.token_hex(8))
--
--     # Sometimes it is proper URL, and sometimes it is not.
--     data["website"].append(
--         random.choice(["http", "https", "unknown", ""])
--         + random.choice([":", "://"])
--         + random.choice(["google", "facebook"])
--         + random.choice([".com", ".org", ""])
--     )
--
-- import pandas as pd
-- df = pd.DataFrame(data)
-- df.to_parquet("data.parquet")

CREATE EXTERNAL TABLE generated_data
STORED AS PARQUET
LOCATION 'data.parquet';

-- Query 1
EXPLAIN ANALYZE
SELECT
    REGEXP_REPLACE("website", '^https?://(?:www\.)?([^/]+)$', "user_id") AS encoded_website
FROM generated_data;

codecov-commenter · 2022-09-26T00:03:01Z

Codecov Report

Merging #3614 (bbb8c8b) into master (ebb28f5) will decrease coverage by 0.07%.
The diff coverage is 85.23%.

❗ Current head bbb8c8b differs from pull request most recent head d0f1020. Consider uploading reports for the commit d0f1020 to get more accurate results

@@            Coverage Diff             @@
##           master    #3614      +/-   ##
==========================================
- Coverage   86.07%   85.99%   -0.08%     
==========================================
  Files         300      300              
  Lines       56314    56449     +135     
==========================================
+ Hits        48473    48546      +73     
- Misses       7841     7903      +62

| Impacted Files | Coverage Δ | |
|---|---|---|
| datafusion/physical-expr/src/functions.rs | `92.66% <50.00%> (-0.10%)` | ⬇️ |
| datafusion/optimizer/src/simplify_expressions.rs | `82.67% <82.60%> (-0.01%)` | ⬇️ |
| datafusion/physical-expr/src/regex_expressions.rs | `65.76% <86.88%> (-17.27%)` | ⬇️ |
| datafusion/common/src/scalar.rs | `85.18% <0.00%> (-0.07%)` | ⬇️ |
| datafusion/expr/src/logical_plan/plan.rs | `77.10% <0.00%> (ø)` | |
| datafusion/core/src/physical_plan/metrics/value.rs | `87.56% <0.00%> (+0.49%)` | ⬆️ |


Dandandan · 2022-09-27T06:32:50Z

Thank you @isidentical !

ursabot · 2022-09-27T06:42:23Z

Benchmark runs are scheduled for baseline = ea3dbb6 and contender = 15c19c3. 15c19c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the physical-expr Physical Expressions label Sep 25, 2022

isidentical force-pushed the gh-3613 branch 2 times, most recently from 9ae012e to fe85ff8 on September 25, 2022 23:22

Optimize regex_replace for scalar patterns

d0f1020

isidentical force-pushed the gh-3613 branch from fe85ff8 to d0f1020 on September 26, 2022 15:50

isidentical marked this pull request as ready for review September 26, 2022 15:51

Dandandan reviewed Sep 26, 2022

datafusion/physical-expr/src/regex_expressions.rs (outdated, resolved)

Change the hot-path on regexp_replace to only variadic source (#2)

b542898

isidentical force-pushed the gh-3613 branch from a7d139e to b542898 on September 26, 2022 23:16

isidentical requested a review from Dandandan September 26, 2022 23:16

Dandandan approved these changes Sep 27, 2022

Dandandan merged commit 15c19c3 into apache:master Sep 27, 2022
