-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite some regex conditions to use index-compatible conditions #7495
Conversation
hawt |
} | ||
|
||
// Handle regex-based condition. | ||
var matched bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is only one case for matched so why take the hit on the defer, surely a straight inline the switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no measurable hit on a defer
in this context. Query parsing is not a bottleneck when executing a query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That as may be but its actually unnecessary so should be avoided as it will add overhead all be it small and every little bit helps, but mainly as complicates the flow of the code,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason it's there is in preparation for when I add further regex rewrites along side this exact match rewrite. Once they're added the defer
block will be cleaner since there will be multiple matched return points. Happy to remove it until then though, if it's complicating things 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes more sense then, thanks for explaining 👍
if !matched { | ||
return | ||
} | ||
switch be.Op { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be optimised to:
if be.Op == EQREGEX {
be.op = EQ
} else {
be.op = NEQ
}
Any reason for not using regexp/syntax to inspect the regular expression, rather than inspecting the string? That would make it more straightforward to handle some of the other one-of-many-exact matches, e.g. |
rhs := be.RHS.(*RegexLiteral) // This must be a regex. | ||
val := rhs.Val.String() | ||
|
||
if strings.ContainsRune(val, '\\') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other regex special chars? Can you add some tests for things like /^foo[12]$/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed the following, which deals with most basic regexps, may be useful: https://play.golang.org/p/EwEb9rXfX1
All you need to do is rewrite if you get a none empty slice back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @stevenh I'll take a look!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be aware its intentionally limited to only do the basics, specifically:
- It only deals with anchored i.e /^....$/
- It only deals with captures at the base level i.e. /^(...)$/
- It only deals with concats containing only literals and char classes
- It only deals with alternatives containing literals and concats
So it doesn't deal with all cases as that would open it up to potential issues but here's some that it does deal with
/^1test$/
/^(tester|wibble|blip)$/
/^(1test[1-2][4-5]|wibble[g-hk])$/
/^(2test1|3test2|3test3|3test5|3test7|3test9|3testx)$/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenh I'm currently working on an alternative approach which will just involve rejecting all regexes with meta characters in them (with the exception of beginning and ending with ^
and $
of course). It will be quite restrictive (probably more restrictive than the playground snippet you linked to?), but easier to review and test.
We can improve on it for 1.2
.
@jwilder @jsternberg updated. @mark-rushakoff I'll admit I'm pretty clueless about |
This looks like a great idea to achieve a free lunch! This may be out of the scope of this PR, but a really huge win here (that I briefly chatted to Paul about a few weeks ago) would be to add a more optimal "begins with" comparison, and filter those out. 99.9% of our regex's are host =~ /^server1/ style (may be expressed at host =~ /^server.$/ or host =~ /^server./) There are cases where the rex drops down to cases covered by this PR, but its not common (for us!) Edit: Github has screwed up my regex's, but you get the idea. Many of these go via Grafana, so the users dont even realize they are doing this. |
@daviesalex we have your use-case on our minds. I think the main problem preventing us solving the prefix issue with our current in-memory index, is that we can only do direct lookups for series keys. To do prefix lookups we would need to store series in-order. The new |
@e-dard |
Example of some code that I haven't finished yet: func literalMatches(re *regexp.Regexp) ([]string, error) {
s, err := syntax.Parse(re.String(), 0)
if err != nil {
return nil, err
}
// We are looking for EXACT matches which means we need
// a concatenation of a ^, some text, then $.
if s.Op != syntax.OpConcat && len(s.Sub) < 3 {
return nil, nil
}
start := s.Sub[0]
if !(start.Op == syntax.OpBeginLine || start.Op == syntax.OpBeginText) {
return nil, nil
}
end := s.Sub[len(s.Sub)-1]
if !(end.Op == syntax.OpEndLine || end.Op == syntax.OpEndText) {
return nil, nil
}
// Look at the inside to try and put together a string.
subs := s.Sub[1 : len(s.Sub)-1]
if len(subs) == 0 {
return []string{""}, nil
}
// now process the inside of the regular expression to try and find exact matches
return nil, nil
} |
@jsternberg though it's important to remember that in the scope of this PR we can't support character classes, since they don't map to literal strings. In the future (beyond the scope of this PR) it would be cool if we could take |
+1 for the code @jsternberg showed for this, and strong -1 from me on attempting to parse stringified regular expressions, with regular expressions. I think adding a check for @stevenh linked to this playground link in a now-hidden comment, which seems like a good example of a fuller solution, perhaps for 1.2. |
I believe the original use case came from us identifying that grafana generates its queries that reference vars with regexp so you can't (from the gui) get: mytag = '$var' instead you get: mytag = /^$var$/ However if $var is multiple then this results in: mytag =/^(value1|value2)$/ This seems like a simple case however syntax reduces this on parse to: mytag = /^value[1-2]$/ This is why I prototyped out a bit more in https://play.golang.org/p/EwEb9rXfX1 so we can actually deal with what grafana generates commonly. |
@e-dard this is what my prototype code actually did it would turn this into []string{"server-a","server-b"} which can then be easily translated into an OR. |
6eba0e7
to
b8c1398
Compare
} | ||
|
||
// We can rewrite this regex. | ||
res = make([]byte, 0, len(middle.Rune)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do string(middle.Run)
. It seems to work for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you can
|
||
// matchExactRegex matches regexes that have the following form: /^foo$/. It | ||
// considers /^$/ to be a matching regex. | ||
func (s *SelectStatement) matchExactRegex(v string) (string, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably just be a free function not attached to the SelectStatement
.
be.RHS = &StringLiteral{Val: val} | ||
|
||
// Update the condition operator. | ||
if be.Op == EQREGEX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be:
if be.Op == EQREGEX {
be.Op = EQ
} else {
be.Op = NEQ
}
return "", false | ||
} | ||
|
||
if len(re.Sub) < 2 || len(re.Sub) > 3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you mean len(re.Sub) != 3 then yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I mean if there are fewer than 2 or more than 3 subexpressions. We will want to catch one regex with two subexpressions, namely /^$/
, because that would be more efficiently represented as = ''
than by using a regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh right, I see that's good :)
return "", false | ||
} | ||
|
||
end := re.Sub[len(re.Sub)-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this must be 3 in len you could just use literal 2 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
} | ||
|
||
end := re.Sub[len(re.Sub)-1] | ||
if !(end.Op == syntax.OpEndLine || end.Op == syntax.OpEndText) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think technically the combination of this and the start check is wrong as it could be unbalanced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow? Could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe @stevenh is pointing out the code currently would accept OpBeginLine with OpEndText, or OpBeginText with OpEndLine, instead of only accepting Line/Line and Text/Text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's likely fine. I'm not 100% sure, but OpBeginText
is likely \A
and that's fully compatible with OpEndLine
which is $
. I don't believe they have to match. If we want to normalize it, I think we can use the flag syntax.OneLine
so it's always OpBeginText
and OpEndText
instead of checking for both.
} | ||
|
||
var res []byte | ||
if len(re.Sub) == 3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must be the case given the check above
if len(re.Sub) < 2 || len(re.Sub) > 3 {
} | ||
|
||
// We can rewrite this regex. | ||
res = make([]byte, 0, len(middle.Rune)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you can
re, err := syntax.Parse(v, syntax.Perl) | ||
if err != nil { | ||
// Nothing we can do or log. | ||
return "", false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For repeat returns of all Zero values like this, declaring the variables in the return declaration so you can just do "return" could be considered cleaner.
202efad
to
b19cda2
Compare
return "", false | ||
} | ||
|
||
if len(re.Sub) < 2 || len(re.Sub) > 3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh right, I see that's good :)
This commit adds support for replacing regexes with non-regex conditions when possible. Currently the following regexes are supported: - host =~ /^foo$/ will be converted into host = 'foo' - host !~ /^foo$/ will be converted into host != 'foo' Note: if the regex expression contains character classes, grouping, repetition or similar, it may not be rewritten. For example, the condition: name =~ /^foo|bar$/ will not be rewritten. Support for this may arrive in the future. Regexes that can be converted into simpler expression will be able to take advantage of the tsdb index, making them significantly faster.
Required for all non-trivial PRs
Fixes #7486.
This PR adds support for replacing regexes with non-regex conditions
when possible. Currently the following regexes are supported:
host =~ /^server1$/
will be converted intohost = 'server1'
host !~ /^server1$/
will be converted intohost != 'server1'
Note: if the regex expression contains repetition, groupings or character classes, then rewriting may not take place. For example, the condition:
name =~ /^foo.$/
will notbe rewritten as it cannot be rewritten as an exact match.
Support for some of these cases may arrive in the future.
Regexes that can be converted into simpler expression will be able to
take advantage of the tsdb index, making them significantly faster.
Benchmarks
The following benchmarks run a query along the lines of:
for the non-regex ones, and:
for the regex ones.
master
6fd74a6This PR produces the following results:
So you can see that the regexes have been rewritten and are now effectively running the same benchmarks.
Ratings repo
A real world query on the ratings dataset:
1.0.2
PR
@jwilder @jsternberg