Rewrite some regex conditions to use index-compatible conditions #7495

e-dard · 2016-10-20T16:45:49Z

Required for all non-trivial PRs

Rebased/mergable
Tests pass
CHANGELOG.md updated

This PR adds support for replacing regexes with non-regex conditions
when possible. Currently the following regexes are supported:

host =~ /^server1$/ will be converted into host = 'server1'
host !~ /^server1$/ will be converted into host != 'server1'

Note: if the regex expression contains repetition, groupings or character classes, then rewriting may not take place. For example, the condition: name =~ /^foo.$/ will not
be rewritten as it cannot be rewritten as an exact match.

Support for some of these cases may arrive in the future.

Regexes that can be converted into simpler expression will be able to
take advantage of the tsdb index, making them significantly faster.

Benchmarks

The following benchmarks run a query along the lines of:

SELECT count(value) FROM cpu WHERE host = 'server1'

for the non-regex ones, and:

SELECT count(value) FROM cpu WHERE host =~ /^server1$/

for the regex ones.

master 6fd74a6

BenchmarkServer_Query_Count_Where_500-4                 1000       1392067 ns/op      909527 B/op        572 allocs/op
BenchmarkServer_Query_Count_Where_1K-4                  1000       1349177 ns/op      909332 B/op        571 allocs/op
BenchmarkServer_Query_Count_Where_10K-4                 1000       1564794 ns/op      908493 B/op        569 allocs/op
BenchmarkServer_Query_Count_Where_100K-4                1000       2161792 ns/op      906965 B/op        566 allocs/op
BenchmarkServer_Query_Count_Where_Regex_500-4            100      21228299 ns/op     1717204 B/op        642 allocs/op
BenchmarkServer_Query_Count_Where_Regex_1K-4             100      22877996 ns/op     1717317 B/op        642 allocs/op
BenchmarkServer_Query_Count_Where_Regex_10K-4             50      20535616 ns/op     1717303 B/op        642 allocs/op
BenchmarkServer_Query_Count_Where_Regex_100K-4           100      18272231 ns/op     1717068 B/op        641 allocs/op
PASS
ok      github.com/influxdata/influxdb/cmd/influxd/run  21.132s

This PR produces the following results:

BenchmarkServer_Query_Count_Where_500-4                 1000       1272334 ns/op      909657 B/op        572 allocs/op
BenchmarkServer_Query_Count_Where_1K-4                  1000       1316998 ns/op      909420 B/op        571 allocs/op
BenchmarkServer_Query_Count_Where_10K-4                 1000       1770097 ns/op      908471 B/op        569 allocs/op
BenchmarkServer_Query_Count_Where_100K-4                1000       2368152 ns/op      906952 B/op        566 allocs/op
BenchmarkServer_Query_Count_Where_Regex_500-4           1000       2053715 ns/op      913136 B/op        635 allocs/op
BenchmarkServer_Query_Count_Where_Regex_1K-4            1000       2823591 ns/op      913161 B/op        635 allocs/op
BenchmarkServer_Query_Count_Where_Regex_10K-4            500       2273258 ns/op      913110 B/op        635 allocs/op
BenchmarkServer_Query_Count_Where_Regex_100K-4          1000       2327778 ns/op      913072 B/op        635 allocs/op
PASS
ok      github.com/influxdata/influxdb/cmd/influxd/run  24.178s

So you can see that the regexes have been rewritten and are now effectively running the same benchmarks.

Ratings repo

A real world query on the ratings dataset:

1.0.2

⇒  time influx -database ML1M -execute "SELECT mean(rating) FROM movielens WHERE userid =~ /^48$/"
name: movielens
---------------
time    mean
0   3.068561872909699

influx -database ML1M -execute   0.00s user 0.01s system 0% cpu 1.945 total

PR

⇒  time influx -database ML1M -execute "SELECT mean(rating) FROM movielens WHERE userid =~ /^48$/"
name: movielens
---------------
time    mean
0   3.068561872909699

influx -database ML1M -execute   0.00s user 0.01s system 0% cpu 1.399 total

@jwilder @jsternberg

pauldix · 2016-10-20T17:21:45Z

hawt

stevenh · 2016-10-20T18:16:53Z

influxql/ast.go

+		}
+
+		// Handle regex-based condition.
+		var matched bool


There is only one case for matched so why take the hit on the defer, surely a straight inline the switch.

There is no measurable hit on a defer in this context. Query parsing is not a bottleneck when executing a query.

That as may be but its actually unnecessary so should be avoided as it will add overhead all be it small and every little bit helps, but mainly as complicates the flow of the code,

The reason it's there is in preparation for when I add further regex rewrites along side this exact match rewrite. Once they're added the defer block will be cleaner since there will be multiple matched return points. Happy to remove it until then though, if it's complicating things 👍

That makes more sense then, thanks for explaining 👍

stevenh · 2016-10-20T18:17:47Z

influxql/ast.go

+			if !matched {
+				return
+			}
+			switch be.Op {


This can be optimised to:

if be.Op == EQREGEX { be.op = EQ } else { be.op = NEQ }

mark-rushakoff · 2016-10-20T18:35:05Z

Any reason for not using regexp/syntax to inspect the regular expression, rather than inspecting the string? That would make it more straightforward to handle some of the other one-of-many-exact matches, e.g. /^(foo|bar)$/.

jwilder · 2016-10-20T18:40:55Z

influxql/ast.go

+		rhs := be.RHS.(*RegexLiteral) // This must be a regex.
+		val := rhs.Val.String()
+
+		if strings.ContainsRune(val, '\\') {


What about other regex special chars? Can you add some tests for things like /^foo[12]$/?

Indeed the following, which deals with most basic regexps, may be useful: https://play.golang.org/p/EwEb9rXfX1

All you need to do is rewrite if you get a none empty slice back

Thanks @stevenh I'll take a look!

Be aware its intentionally limited to only do the basics, specifically:

It only deals with anchored i.e /^....$/

It only deals with captures at the base level i.e. /^(...)$/

It only deals with concats containing only literals and char classes

It only deals with alternatives containing literals and concats

So it doesn't deal with all cases as that would open it up to potential issues but here's some that it does deal with

/^1test$/ /^(tester|wibble|blip)$/ /^(1test[1-2][4-5]|wibble[g-hk])$/ /^(2test1|3test2|3test3|3test5|3test7|3test9|3testx)$/

@stevenh I'm currently working on an alternative approach which will just involve rejecting all regexes with meta characters in them (with the exception of beginning and ending with ^ and $ of course). It will be quite restrictive (probably more restrictive than the playground snippet you linked to?), but easier to review and test.

We can improve on it for 1.2.

e-dard · 2016-10-21T10:34:16Z

@jwilder @jsternberg updated.

@mark-rushakoff I'll admit I'm pretty clueless aboutregexp/syntax. In the end I just went for a simple regex to identify strings like ^foo$ and then a simple rejection of any regex with a meta character. Can regexp/syntax do this for us in as simple a way?

daviesalex · 2016-10-21T13:10:33Z

This looks like a great idea to achieve a free lunch!

This may be out of the scope of this PR, but a really huge win here (that I briefly chatted to Paul about a few weeks ago) would be to add a more optimal "begins with" comparison, and filter those out. 99.9% of our regex's are host =~ /^server1/ style (may be expressed at host =~ /^server.$/ or host =~ /^server./)

There are cases where the rex drops down to cases covered by this PR, but its not common (for us!)

Edit: Github has screwed up my regex's, but you get the idea. Many of these go via Grafana, so the users dont even realize they are doing this.

e-dard · 2016-10-21T14:13:49Z

@daviesalex we have your use-case on our minds. I think the main problem preventing us solving the prefix issue with our current in-memory index, is that we can only do direct lookups for series keys. To do prefix lookups we would need to store series in-order.

The new tsi index approach over in https://github.com/influxdata/influxdb/tree/tsi could allow us to search through series keys in this way. Assuming that we add query support for prefix matches, we could then potentially rewrite /^server1/, /server1./, /server1.+$/ etc to use the index-compatible prefix search.

jsternberg · 2016-10-21T14:15:25Z

@e-dard regexp/syntax actually seems to be pretty simple. I can work on it with you later today if you want. It would certainly be better and would allow us to support more types of optimizations such as character classes more easily.

jsternberg · 2016-10-21T14:19:27Z

Example of some code that I haven't finished yet:

func literalMatches(re *regexp.Regexp) ([]string, error) {
    s, err := syntax.Parse(re.String(), 0)
    if err != nil {
        return nil, err
    }

    // We are looking for EXACT matches which means we need
    // a concatenation of a ^, some text, then $.
    if s.Op != syntax.OpConcat && len(s.Sub) < 3 {
        return nil, nil
    }

    start := s.Sub[0]
    if !(start.Op == syntax.OpBeginLine || start.Op == syntax.OpBeginText) {
        return nil, nil
    }
    end := s.Sub[len(s.Sub)-1]
    if !(end.Op == syntax.OpEndLine || end.Op == syntax.OpEndText) {
        return nil, nil
    }

    // Look at the inside to try and put together a string.
    subs := s.Sub[1 : len(s.Sub)-1]
    if len(subs) == 0 {
        return []string{""}, nil
    }

    // now process the inside of the regular expression to try and find exact matches
    return nil, nil
}

e-dard · 2016-10-21T14:26:48Z

@jsternberg though it's important to remember that in the scope of this PR we can't support character classes, since they don't map to literal strings. In the future (beyond the scope of this PR) it would be cool if we could take host =~ /^server-[ab]$/ and convert that into host = 'server-a' OR host = 'server-b' 😃 .

mark-rushakoff · 2016-10-21T15:31:05Z

+1 for the code @jsternberg showed for this, and strong -1 from me on attempting to parse stringified regular expressions, with regular expressions.

I think adding a check for len(subs) == 1 and subs[0].Op == syntax.OpLiteral would more or less implement the issue as written. I think we'd want to run (syntax.Regexp).Simplify at the start of that code, too.

@stevenh linked to this playground link in a now-hidden comment, which seems like a good example of a fuller solution, perhaps for 1.2.

stevenh · 2016-10-21T16:03:25Z

I believe the original use case came from us identifying that grafana generates its queries that reference vars with regexp so you can't (from the gui) get:

mytag = '$var'

instead you get:

mytag = /^$var$/

However if $var is multiple then this results in:

mytag =/^(value1|value2)$/

This seems like a simple case however syntax reduces this on parse to:

mytag = /^value[1-2]$/

This is why I prototyped out a bit more in https://play.golang.org/p/EwEb9rXfX1 so we can actually deal with what grafana generates commonly.

stevenh · 2016-10-21T17:26:26Z

it would be cool if we could take host =~ /^server-[ab]$/ and convert that into host = 'server-a' OR host = 'server-b'

@e-dard this is what my prototype code actually did it would turn this into []string{"server-a","server-b"} which can then be easily translated into an OR.

jsternberg · 2016-10-21T18:29:35Z

influxql/ast.go

+		}
+
+		// We can rewrite this regex.
+		res = make([]byte, 0, len(middle.Rune))


I think you can just do string(middle.Run). It seems to work for me.

yes you can

jsternberg · 2016-10-21T18:30:32Z

influxql/ast.go

+
+// matchExactRegex matches regexes that have the following form: /^foo$/. It
+// considers /^$/ to be a matching regex.
+func (s *SelectStatement) matchExactRegex(v string) (string, bool) {


This should probably just be a free function not attached to the SelectStatement.

jsternberg · 2016-10-21T18:31:19Z

influxql/ast.go

+		be.RHS = &StringLiteral{Val: val}
+
+		// Update the condition operator.
+		if be.Op == EQREGEX {


I think this should be:

if be.Op == EQREGEX { be.Op = EQ } else { be.Op = NEQ }

stevenh · 2016-10-21T22:21:13Z

influxql/ast.go

+		return "", false
+	}
+
+	if len(re.Sub) < 2 || len(re.Sub) > 3 {


So you mean len(re.Sub) != 3 then yes?

No I mean if there are fewer than 2 or more than 3 subexpressions. We will want to catch one regex with two subexpressions, namely /^$/, because that would be more efficiently represented as = '' than by using a regex.

Ahh right, I see that's good :)

stevenh · 2016-10-21T22:23:08Z

influxql/ast.go

+		return "", false
+	}
+
+	end := re.Sub[len(re.Sub)-1]


As this must be 3 in len you could just use literal 2 here.

See comment above.

stevenh · 2016-10-21T22:25:27Z

influxql/ast.go

+	}
+
+	end := re.Sub[len(re.Sub)-1]
+	if !(end.Op == syntax.OpEndLine || end.Op == syntax.OpEndText) {


I think technically the combination of this and the start check is wrong as it could be unbalanced.

Not sure I follow? Could you elaborate?

I believe @stevenh is pointing out the code currently would accept OpBeginLine with OpEndText, or OpBeginText with OpEndLine, instead of only accepting Line/Line and Text/Text.

That's likely fine. I'm not 100% sure, but OpBeginText is likely \A and that's fully compatible with OpEndLine which is $. I don't believe they have to match. If we want to normalize it, I think we can use the flag syntax.OneLine so it's always OpBeginText and OpEndText instead of checking for both.

stevenh · 2016-10-21T22:26:09Z

influxql/ast.go

+	}
+
+	var res []byte
+	if len(re.Sub) == 3 {


This must be the case given the check above

if len(re.Sub) < 2 || len(re.Sub) > 3 {

stevenh · 2016-10-21T22:26:25Z

influxql/ast.go

+		}
+
+		// We can rewrite this regex.
+		res = make([]byte, 0, len(middle.Rune))


yes you can

stevenh · 2016-10-21T22:28:40Z

influxql/ast.go

+	re, err := syntax.Parse(v, syntax.Perl)
+	if err != nil {
+		// Nothing we can do or log.
+		return "", false


For repeat returns of all Zero values like this, declaring the variables in the return declaration so you can just do "return" could be considered cleaner.

stevenh · 2016-10-25T00:18:20Z

influxql/ast.go

+		return "", false
+	}
+
+	if len(re.Sub) < 2 || len(re.Sub) > 3 {


Ahh right, I see that's good :)

This commit adds support for replacing regexes with non-regex conditions when possible. Currently the following regexes are supported: - host =~ /^foo$/ will be converted into host = 'foo' - host !~ /^foo$/ will be converted into host != 'foo' Note: if the regex expression contains character classes, grouping, repetition or similar, it may not be rewritten. For example, the condition: name =~ /^foo|bar$/ will not be rewritten. Support for this may arrive in the future. Regexes that can be converted into simpler expression will be able to take advantage of the tsdb index, making them significantly faster.

e-dard added this to the 1.1.0 milestone Oct 20, 2016

e-dard force-pushed the er-reg-match branch from 8b317a4 to 91f8257 Compare October 20, 2016 16:59

stevenh reviewed Oct 20, 2016

View reviewed changes

jwilder suggested changes Oct 20, 2016

View reviewed changes

e-dard force-pushed the er-reg-match branch from 91f8257 to 21ddcd0 Compare October 21, 2016 10:30

e-dard force-pushed the er-reg-match branch from 21ddcd0 to f999132 Compare October 21, 2016 10:36

e-dard force-pushed the er-reg-match branch 2 times, most recently from 6eba0e7 to b8c1398 Compare October 21, 2016 18:19

jsternberg suggested changes Oct 21, 2016

View reviewed changes

stevenh suggested changes Oct 21, 2016

View reviewed changes

e-dard force-pushed the er-reg-match branch 2 times, most recently from 202efad to b19cda2 Compare October 22, 2016 21:52

jwilder approved these changes Oct 24, 2016

View reviewed changes

jwilder assigned e-dard Oct 24, 2016

stevenh approved these changes Oct 25, 2016

View reviewed changes

influxql/ast.go

return "", false

}

if len(re.Sub) < 2 || len(re.Sub) > 3 {

Copy link

Contributor

stevenh Oct 25, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right, I see that's good :)

jsternberg approved these changes Oct 25, 2016

View reviewed changes

e-dard added 2 commits October 25, 2016 11:10

Add regex benchmarks and fix existing approach

b12b0d1

e-dard force-pushed the er-reg-match branch from b19cda2 to b12b0d1 Compare October 25, 2016 10:10

e-dard merged commit dcd6b29 into master Oct 25, 2016

e-dard deleted the er-reg-match branch October 25, 2016 10:41

rbetts modified the milestones: 1.3.0, 1.1.0 Mar 15, 2017

Rewrite some regex conditions to use index-compatible conditions #7495

Rewrite some regex conditions to use index-compatible conditions #7495

Conversation

e-dard commented Oct 20, 2016 • edited Loading

Required for all non-trivial PRs

Benchmarks

Ratings repo

pauldix commented Oct 20, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mark-rushakoff commented Oct 20, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

e-dard commented Oct 21, 2016

daviesalex commented Oct 21, 2016 • edited Loading

e-dard commented Oct 21, 2016 • edited Loading

jsternberg commented Oct 21, 2016 • edited Loading

jsternberg commented Oct 21, 2016 • edited Loading

e-dard commented Oct 21, 2016

mark-rushakoff commented Oct 21, 2016

stevenh commented Oct 21, 2016

stevenh commented Oct 21, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

e-dard commented Oct 20, 2016 •

edited

Loading

daviesalex commented Oct 21, 2016 •

edited

Loading

e-dard commented Oct 21, 2016 •

edited

Loading

jsternberg commented Oct 21, 2016 •

edited

Loading

jsternberg commented Oct 21, 2016 •

edited

Loading

stevenh commented Oct 21, 2016 •

edited

Loading