Optimize regex_replace for scalar patterns #3614
Codecov Report
@@ Coverage Diff @@
## master #3614 +/- ##
==========================================
- Coverage 86.07% 85.99% -0.08%
==========================================
Files 300 300
Lines 56314 56449 +135
==========================================
+ Hits 48473 48546 +73
- Misses 7841 7903 +62
Thank you @isidentical!
Benchmark runs are scheduled for baseline = ea3dbb6 and contender = 15c19c3. 15c19c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3613.
Rationale for this change
@Dandandan noticed that regex_replace with a known pattern takes an extremely long time during the ClickBench suite in #3518. Several factors contribute to this, but the main one is how generic the regex_replace implementation is (it can handle 2⁴ combinations when it comes to scalars/arrays). Having a generic version ready is good for compatibility, but it also makes us pay the overhead in common cases (like the example in #3518, where the pattern is static).
What changes are included in this PR?
This PR adds a scalarity-based (not sure if this is a real word) specialization system where, at runtime, the best regex_replace variation can be picked and executed for the given set of inputs. The system here is just the start, and if there are enough gains we might add a third case where the replacement is also known.
Are there any user-facing changes?
This is mainly an optimization, and there shouldn't be any user-facing changes.
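To illustrate the scalarity-based specialization described under "What changes are included in this PR?", here is a minimal sketch and not the code from this PR. The names `Arg`, `Compiled`, and `replace_rows` are hypothetical, and `Compiled` is a stand-in that matches a literal substring instead of a real regular expression, so the example needs only the standard library. It returns the number of pattern compilations alongside the result to show where the saving comes from: a scalar pattern is compiled once per batch, while a deliberately naive generic path compiles once per row.

```rust
/// An argument is either one value for the whole batch or one value per row.
/// (Hypothetical type, used only for this illustration.)
enum Arg {
    Scalar(String),
    Array(Vec<String>),
}

/// Stand-in for a compiled regex: it only matches a literal substring.
/// A real implementation would hold a compiled regular expression here.
struct Compiled {
    needle: String,
}

impl Compiled {
    /// Pretend "compilation" step; with a real regex this is the costly part.
    fn new(pattern: &str) -> Self {
        Compiled { needle: pattern.to_string() }
    }

    /// Replace every occurrence of the pattern in `input`.
    fn replace_all(&self, input: &str, replacement: &str) -> String {
        input.replace(self.needle.as_str(), replacement)
    }
}

/// Pick a code path based on whether the pattern is a scalar or an array.
/// Returns the replaced rows and how many times a pattern was compiled.
fn replace_rows(source: &[String], pattern: &Arg, replacement: &str) -> (Vec<String>, usize) {
    match pattern {
        // Specialized path: the pattern is known up front, so compile it once
        // and reuse it for every row.
        Arg::Scalar(p) => {
            let compiled = Compiled::new(p);
            let out: Vec<String> = source
                .iter()
                .map(|s| compiled.replace_all(s, replacement))
                .collect();
            (out, 1)
        }
        // Naive generic path: the pattern may differ per row, so this sketch
        // compiles one pattern for each row.
        Arg::Array(ps) => {
            let out: Vec<String> = source
                .iter()
                .zip(ps)
                .map(|(s, p)| Compiled::new(p).replace_all(s, replacement))
                .collect();
            let compiles = out.len();
            (out, compiles)
        }
    }
}
```

Calling `replace_rows` with `Arg::Scalar` performs a single compilation regardless of the number of rows, which is the case this PR targets (a static pattern, as in #3518).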
Benchmarks
The new benchmarks are in #3614 (comment); overall they show a speed-up in the range of 20-35X, depending on the query and input.
Old benchmarks
Running all benchmarks in --release mode (using the datafusion-cli crate with the -f option).
The initial benchmark is Query 28 from ClickHouse
(Note: I don't have the full ClickBench data, just a partition of it [1/100 scale], so this might not be very representative)
A second benchmark is the one where we have both the source and the replacements as arrays, which shows a speed-up factor of 1.7X.