Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite some regex conditions to use index-compatible conditions #7495

Merged
merged 2 commits into from
Oct 25, 2016

Conversation

e-dard
Copy link
Contributor

@e-dard e-dard commented Oct 20, 2016

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated

Fixes #7486.

This PR adds support for replacing regexes with non-regex conditions
when possible. Currently the following regexes are supported:

  • host =~ /^server1$/ will be converted into host = 'server1'
  • host !~ /^server1$/ will be converted into host != 'server1'

Note: if the regex expression contains repetition, groupings or character classes, then rewriting may not take place. For example, the condition: name =~ /^foo.$/ will not
be rewritten as it cannot be rewritten as an exact match.

Support for some of these cases may arrive in the future.

Regexes that can be converted into simpler expression will be able to
take advantage of the tsdb index, making them significantly faster.

Benchmarks

The following benchmarks run a query along the lines of:

SELECT count(value) FROM cpu WHERE host = 'server1'

for the non-regex ones, and:

SELECT count(value) FROM cpu WHERE host =~ /^server1$/

for the regex ones.

master 6fd74a6

BenchmarkServer_Query_Count_Where_500-4                 1000       1392067 ns/op      909527 B/op        572 allocs/op
BenchmarkServer_Query_Count_Where_1K-4                  1000       1349177 ns/op      909332 B/op        571 allocs/op
BenchmarkServer_Query_Count_Where_10K-4                 1000       1564794 ns/op      908493 B/op        569 allocs/op
BenchmarkServer_Query_Count_Where_100K-4                1000       2161792 ns/op      906965 B/op        566 allocs/op
BenchmarkServer_Query_Count_Where_Regex_500-4            100      21228299 ns/op     1717204 B/op        642 allocs/op
BenchmarkServer_Query_Count_Where_Regex_1K-4             100      22877996 ns/op     1717317 B/op        642 allocs/op
BenchmarkServer_Query_Count_Where_Regex_10K-4             50      20535616 ns/op     1717303 B/op        642 allocs/op
BenchmarkServer_Query_Count_Where_Regex_100K-4           100      18272231 ns/op     1717068 B/op        641 allocs/op
PASS
ok      github.com/influxdata/influxdb/cmd/influxd/run  21.132s

This PR produces the following results:

BenchmarkServer_Query_Count_Where_500-4                 1000       1272334 ns/op      909657 B/op        572 allocs/op
BenchmarkServer_Query_Count_Where_1K-4                  1000       1316998 ns/op      909420 B/op        571 allocs/op
BenchmarkServer_Query_Count_Where_10K-4                 1000       1770097 ns/op      908471 B/op        569 allocs/op
BenchmarkServer_Query_Count_Where_100K-4                1000       2368152 ns/op      906952 B/op        566 allocs/op
BenchmarkServer_Query_Count_Where_Regex_500-4           1000       2053715 ns/op      913136 B/op        635 allocs/op
BenchmarkServer_Query_Count_Where_Regex_1K-4            1000       2823591 ns/op      913161 B/op        635 allocs/op
BenchmarkServer_Query_Count_Where_Regex_10K-4            500       2273258 ns/op      913110 B/op        635 allocs/op
BenchmarkServer_Query_Count_Where_Regex_100K-4          1000       2327778 ns/op      913072 B/op        635 allocs/op
PASS
ok      github.com/influxdata/influxdb/cmd/influxd/run  24.178s

So you can see that the regexes have been rewritten and are now effectively running the same benchmarks.

Ratings repo

A real world query on the ratings dataset:

1.0.2

⇒  time influx -database ML1M -execute "SELECT mean(rating) FROM movielens WHERE userid =~ /^48$/"
name: movielens
---------------
time    mean
0   3.068561872909699

influx -database ML1M -execute   0.00s user 0.01s system 0% cpu 1.945 total

PR

⇒  time influx -database ML1M -execute "SELECT mean(rating) FROM movielens WHERE userid =~ /^48$/"
name: movielens
---------------
time    mean
0   3.068561872909699

influx -database ML1M -execute   0.00s user 0.01s system 0% cpu 1.399 total

@jwilder @jsternberg

@e-dard e-dard added this to the 1.1.0 milestone Oct 20, 2016
@pauldix
Copy link
Member

pauldix commented Oct 20, 2016

hawt

}

// Handle regex-based condition.
var matched bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one case for matched so why take the hit on the defer, surely a straight inline the switch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no measurable hit on a defer in this context. Query parsing is not a bottleneck when executing a query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That as may be but its actually unnecessary so should be avoided as it will add overhead all be it small and every little bit helps, but mainly as complicates the flow of the code,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason it's there is in preparation for when I add further regex rewrites along side this exact match rewrite. Once they're added the defer block will be cleaner since there will be multiple matched return points. Happy to remove it until then though, if it's complicating things 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes more sense then, thanks for explaining 👍

if !matched {
return
}
switch be.Op {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be optimised to:

if be.Op == EQREGEX {
   be.op = EQ
} else {
   be.op = NEQ
}

@mark-rushakoff
Copy link
Contributor

Any reason for not using regexp/syntax to inspect the regular expression, rather than inspecting the string? That would make it more straightforward to handle some of the other one-of-many-exact matches, e.g. /^(foo|bar)$/.

rhs := be.RHS.(*RegexLiteral) // This must be a regex.
val := rhs.Val.String()

if strings.ContainsRune(val, '\\') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other regex special chars? Can you add some tests for things like /^foo[12]$/?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the following, which deals with most basic regexps, may be useful: https://play.golang.org/p/EwEb9rXfX1

All you need to do is rewrite if you get a none empty slice back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stevenh I'll take a look!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be aware its intentionally limited to only do the basics, specifically:

  • It only deals with anchored i.e /^....$/
  • It only deals with captures at the base level i.e. /^(...)$/
  • It only deals with concats containing only literals and char classes
  • It only deals with alternatives containing literals and concats

So it doesn't deal with all cases as that would open it up to potential issues but here's some that it does deal with

/^1test$/
/^(tester|wibble|blip)$/
/^(1test[1-2][4-5]|wibble[g-hk])$/
/^(2test1|3test2|3test3|3test5|3test7|3test9|3testx)$/ 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenh I'm currently working on an alternative approach which will just involve rejecting all regexes with meta characters in them (with the exception of beginning and ending with ^ and $ of course). It will be quite restrictive (probably more restrictive than the playground snippet you linked to?), but easier to review and test.

We can improve on it for 1.2.

@e-dard
Copy link
Contributor Author

e-dard commented Oct 21, 2016

@jwilder @jsternberg updated.

@mark-rushakoff I'll admit I'm pretty clueless aboutregexp/syntax. In the end I just went for a simple regex to identify strings like ^foo$ and then a simple rejection of any regex with a meta character. Can regexp/syntax do this for us in as simple a way?

@daviesalex
Copy link
Contributor

daviesalex commented Oct 21, 2016

This looks like a great idea to achieve a free lunch!

This may be out of the scope of this PR, but a really huge win here (that I briefly chatted to Paul about a few weeks ago) would be to add a more optimal "begins with" comparison, and filter those out. 99.9% of our regex's are host =~ /^server1/ style (may be expressed at host =~ /^server.$/ or host =~ /^server./)

There are cases where the rex drops down to cases covered by this PR, but its not common (for us!)

Edit: Github has screwed up my regex's, but you get the idea. Many of these go via Grafana, so the users dont even realize they are doing this.

@e-dard
Copy link
Contributor Author

e-dard commented Oct 21, 2016

@daviesalex we have your use-case on our minds. I think the main problem preventing us solving the prefix issue with our current in-memory index, is that we can only do direct lookups for series keys. To do prefix lookups we would need to store series in-order.

The new tsi index approach over in https://github.com/influxdata/influxdb/tree/tsi could allow us to search through series keys in this way. Assuming that we add query support for prefix matches, we could then potentially rewrite /^server1/, /server1./, /server1.+$/ etc to use the index-compatible prefix search.

@jsternberg
Copy link
Contributor

jsternberg commented Oct 21, 2016

@e-dard regexp/syntax actually seems to be pretty simple. I can work on it with you later today if you want. It would certainly be better and would allow us to support more types of optimizations such as character classes more easily.

@jsternberg
Copy link
Contributor

jsternberg commented Oct 21, 2016

Example of some code that I haven't finished yet:

func literalMatches(re *regexp.Regexp) ([]string, error) {
    s, err := syntax.Parse(re.String(), 0)
    if err != nil {
        return nil, err
    }

    // We are looking for EXACT matches which means we need
    // a concatenation of a ^, some text, then $.
    if s.Op != syntax.OpConcat && len(s.Sub) < 3 {
        return nil, nil
    }

    start := s.Sub[0]
    if !(start.Op == syntax.OpBeginLine || start.Op == syntax.OpBeginText) {
        return nil, nil
    }
    end := s.Sub[len(s.Sub)-1]
    if !(end.Op == syntax.OpEndLine || end.Op == syntax.OpEndText) {
        return nil, nil
    }

    // Look at the inside to try and put together a string.
    subs := s.Sub[1 : len(s.Sub)-1]
    if len(subs) == 0 {
        return []string{""}, nil
    }

    // now process the inside of the regular expression to try and find exact matches
    return nil, nil
}

@e-dard
Copy link
Contributor Author

e-dard commented Oct 21, 2016

@jsternberg though it's important to remember that in the scope of this PR we can't support character classes, since they don't map to literal strings. In the future (beyond the scope of this PR) it would be cool if we could take host =~ /^server-[ab]$/ and convert that into host = 'server-a' OR host = 'server-b' 😃 .

@mark-rushakoff
Copy link
Contributor

+1 for the code @jsternberg showed for this, and strong -1 from me on attempting to parse stringified regular expressions, with regular expressions.

I think adding a check for len(subs) == 1 and subs[0].Op == syntax.OpLiteral would more or less implement the issue as written. I think we'd want to run (syntax.Regexp).Simplify at the start of that code, too.

@stevenh linked to this playground link in a now-hidden comment, which seems like a good example of a fuller solution, perhaps for 1.2.

@stevenh
Copy link
Contributor

stevenh commented Oct 21, 2016

I believe the original use case came from us identifying that grafana generates its queries that reference vars with regexp so you can't (from the gui) get:

mytag = '$var'

instead you get:

mytag = /^$var$/

However if $var is multiple then this results in:

mytag =/^(value1|value2)$/

This seems like a simple case however syntax reduces this on parse to:

mytag = /^value[1-2]$/

This is why I prototyped out a bit more in https://play.golang.org/p/EwEb9rXfX1 so we can actually deal with what grafana generates commonly.

@stevenh
Copy link
Contributor

stevenh commented Oct 21, 2016

it would be cool if we could take host =~ /^server-[ab]$/ and convert that into host = 'server-a' OR host = 'server-b'

@e-dard this is what my prototype code actually did it would turn this into []string{"server-a","server-b"} which can then be easily translated into an OR.

@e-dard e-dard force-pushed the er-reg-match branch 2 times, most recently from 6eba0e7 to b8c1398 Compare October 21, 2016 18:19
}

// We can rewrite this regex.
res = make([]byte, 0, len(middle.Rune))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do string(middle.Run). It seems to work for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you can


// matchExactRegex matches regexes that have the following form: /^foo$/. It
// considers /^$/ to be a matching regex.
func (s *SelectStatement) matchExactRegex(v string) (string, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably just be a free function not attached to the SelectStatement.

be.RHS = &StringLiteral{Val: val}

// Update the condition operator.
if be.Op == EQREGEX {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be:

if be.Op == EQREGEX {
    be.Op = EQ
} else {
    be.Op = NEQ
}

return "", false
}

if len(re.Sub) < 2 || len(re.Sub) > 3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean len(re.Sub) != 3 then yes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I mean if there are fewer than 2 or more than 3 subexpressions. We will want to catch one regex with two subexpressions, namely /^$/, because that would be more efficiently represented as = '' than by using a regex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right, I see that's good :)

return "", false
}

end := re.Sub[len(re.Sub)-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this must be 3 in len you could just use literal 2 here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

}

end := re.Sub[len(re.Sub)-1]
if !(end.Op == syntax.OpEndLine || end.Op == syntax.OpEndText) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think technically the combination of this and the start check is wrong as it could be unbalanced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow? Could you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @stevenh is pointing out the code currently would accept OpBeginLine with OpEndText, or OpBeginText with OpEndLine, instead of only accepting Line/Line and Text/Text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's likely fine. I'm not 100% sure, but OpBeginText is likely \A and that's fully compatible with OpEndLine which is $. I don't believe they have to match. If we want to normalize it, I think we can use the flag syntax.OneLine so it's always OpBeginText and OpEndText instead of checking for both.

}

var res []byte
if len(re.Sub) == 3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be the case given the check above

if len(re.Sub) < 2 || len(re.Sub) > 3 {

}

// We can rewrite this regex.
res = make([]byte, 0, len(middle.Rune))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you can

re, err := syntax.Parse(v, syntax.Perl)
if err != nil {
// Nothing we can do or log.
return "", false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For repeat returns of all Zero values like this, declaring the variables in the return declaration so you can just do "return" could be considered cleaner.

@e-dard e-dard force-pushed the er-reg-match branch 2 times, most recently from 202efad to b19cda2 Compare October 22, 2016 21:52
return "", false
}

if len(re.Sub) < 2 || len(re.Sub) > 3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right, I see that's good :)

This commit adds support for replacing regexes with non-regex conditions
when possible. Currently the following regexes are supported:

 - host =~ /^foo$/ will be converted into host = 'foo'
 - host !~ /^foo$/ will be converted into host != 'foo'

Note: if the regex expression contains character classes, grouping,
repetition or similar, it may not be rewritten.

For example, the condition: name =~ /^foo|bar$/ will not be rewritten.
Support for this may arrive in the future.

Regexes that can be converted into simpler expression will be able to
take advantage of the tsdb index, making them significantly faster.
@e-dard e-dard merged commit dcd6b29 into master Oct 25, 2016
@e-dard e-dard deleted the er-reg-match branch October 25, 2016 10:41
@rbetts rbetts modified the milestones: 1.3.0, 1.1.0 Mar 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants