Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: add conditional runtime parsing through semantic predicates #111995

Merged
merged 7 commits into from
Aug 27, 2024

Conversation

costin
Copy link
Member

@costin costin commented Aug 20, 2024

WIP branch for introducing condition parsing (and versioning) in the grammar (both lexer and parser).

That is make sure that the underlying grammar gets parsed only if certain conditions are met such as :

  • in development code is picked up only in the snapshot branch
  • new commands are not available on older versions (clusters with old nodes)
  • deprecated grammar is no longer allowed once removed

The goal of this branch is to test the concept and see whether it work both on the backend (Java) and front-end (Javascript) since the underlying predicates(aka conditions) are language specific.

Comment on lines +10 to +17
@header {
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated improvement to add the copyright in the generated code.

}

options {
superClass=LexerConfig;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main option that matters here is the superClass which is used to hook up the conditional code - the generated parser and lexer will invoke this code which is defined in the LexerConfig (for the lexer) and ParserConfig for the parser.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

Comment on lines -71 to +96
: [A-Za-z]
: [a-z]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file is case-insensitive, no need to define upper case (doing that throws an error which is a nice benefit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding this as a comment.

@@ -80,7 +105,7 @@ fragment UNESCAPED_CHARS
;

fragment EXPONENT
: [Ee] [+-]? DIGIT+
: [e] [+-]? DIGIT+
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - the file is case-insensitive, no need to define upper case (doing that throws an error which is a nice benefit).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tokens are auto-generated which means changing their declaration order changes their assigned number.
Right now we don't use these numbers however I won't be surprised if in the future, different tools pick this up hence why I propose to own this file and not rely on the auto-generation.

A nice side-effect of that is in-development commands can be assigned their own range (e.g. 999_xxx) making them easy to skip inside the parser.

| showCommand
| metaCommand
// in development
| {devVersion()}? metricsCommand
Copy link
Member Author

@costin costin Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gate metrics command by a semantic predicate.
Note this can extended to include language versioning in the future as well:
such as {devVersion() && hasFeature("inlinestats") .
There's a question on whether using hasFeature with strings is manageable versus adding some kind of versioning only for the language (whether that gets tied to an internal counter, ES public versioning or something else is a separate discussion):
{devVersion() && languageVersion() >= 12345}

Since this declaration is used verbatim in the generated parser/lexer code, the declaration should be portable and minimal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small problem - some things are gated behind feature flags and those flags can be enabled by setting a jvm parameter OR using a snapshot build. I don't remember precisely what is gated, but it's pretty close to this list.

Doing it with a feature flag is a little more convenient that the SNAPSHOT version directly because, on a release, folks can set the jvm parameter and test against that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the mechanism in place, it's easy to extend this to include feature flags - the main issue is handling this outside the server (e.g. Kibana).

Comment on lines 57 to 60
// in development
| {devVersion()}? inlinestatsCommand
| {devVersion()}? lookupCommand
| {devVersion()}? matchCommand
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gate also the in-development commands.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these need to be gated in the lexer and parser? Wouldn't one of the two suffice?

Comment on lines -157 to -161
inlinestatsCommand
: INLINESTATS stats=fields (BY grouping=fields)?
;


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move them to the end under the "// in development" section

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above regarding the token file.

Comment on lines +4 to +9
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And now the copyright appears.

Comment on lines 12 to 24
class EsqlConfig {

// versioning information
boolean devVersion = Build.current().isSnapshot();

public void setDevVersion(boolean dev) {
this.devVersion = dev;
}

// not great because other grammar parser (Kibana) need the implement this method
boolean hasFeature(String featureName) {
return false;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this class is to hook the dev boolean flag and programatically set it for testing.
I've also added a potential hasFeature method to check if the underlying cluster supports certain command or not and disable it accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errr - do we expose feature flags through an API? How would Kibana determine if a feature flag is set on a given cluster?

/**
* Base class for hooking versioning information into the ANTLR parser.
*/
public abstract class LexerConfig extends Lexer {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base class used now by the generated lexer.

Comment on lines 26 to 28
boolean devVersion() {
return config.devVersion;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Java method called at runtime during lexing - it's just checking a boolean that is passed on through the EsqlConfig (to make it easier to share updates of the primitives across classes).

Comment on lines 30 to 32
boolean releaseVersion() {
return devVersion() == false;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative method used to keep the semantic predicate inside the grammar to a minimum.

import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.TokenStream;

public abstract class ParserConfig extends Parser {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equivalent base class for the parser .


import static org.hamcrest.Matchers.containsString;

public class GrammarInDevelopmentParsingTest extends ESTestCase {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test that checks the in-development commands fail, if the devVersion is set to false - that is, if the build is not a snapshot.

@@ -50,11 +60,14 @@ private <T> T invokeParser(
BiFunction<AstBuilder, ParserRuleContext, T> result
) {
try {
EsqlBaseLexer lexer = new EsqlBaseLexer(new CaseChangingCharStream(CharStreams.fromString(query)));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to wrap the inputstream to CaseChangingCharStream - ANTLR already handles the case insensitivity for us.

@costin
Copy link
Member Author

costin commented Aug 27, 2024

@alex-spies

Why do these need to be gated in the lexer and parser? Wouldn't one of the two suffice?

If there's no gating in the lexer, any new keyword introduced can and will clash with other parts of the grammar such as field name. See for example the error message that reports 'match' and the rest of the dev commands in case of an error.

Comment on lines 23 to 56
/*
* Before modifying this file, please read the section above as changes here
* have significant impact in the ANTLR generated code and its consumption upstream
* (including Kibana).
*
* A. To add a new (production-ready) token
*
* Since tokens types (numbers) are generated by ANTLR in a continuous fashion,
* it is desirable to avoid changing these values.
* However the use of lexing modes prevents this since any addition to a mode
* (regardless where it occurs) shifts all the declarations that follow in other modes.
*
* B. To add a development token (only available behind in snapshot/dev builds)
*
* Since the tokens/modes are in development, simply define them under the
* "// in development section" and follow the section comments in that section.
* That is use the DEV_ prefix and use the {isDevVersion()}? conditional.
* Make sure to remove the prefix and conditional before promoting the tokens in
* production.
* They are defined at the end of the file, to minimize the impact on the existing
* token types.
*
* C. Renaming a token
*
* Avoid renaming the token. But if you really have to, please check with the
* Kibana team as they might be using consuming the dictionary.
*
* D. To remove a token
*
* If the tokens haven't made it to production (and make sure to double check),
* simply remove them from the grammar.
* If the tokens have made it, check with the Kibana team the impact the new
* tokens on it.
*/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added section on how to evolve the file moving forward - it introduces certain conventions in order to drop the DEV_ tokens from error messages (since the conditional cannot prevent the declaration itself).

Comment on lines +85 to +88
DEV_INLINESTATS : {isDevVersion()}? 'inlinestats' -> pushMode(EXPRESSION_MODE);
DEV_LOOKUP : {isDevVersion()}? 'lookup' -> pushMode(LOOKUP_MODE);
DEV_MATCH : {isDevVersion()}? 'match' -> pushMode(EXPRESSION_MODE);
DEV_METRICS : {isDevVersion()}? 'metrics' -> pushMode(METRICS_MODE);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DEV_ prefix is used to drop the token in a non-dev environment.

Comment on lines -49 to -57
//
// Explain
//
mode EXPLAIN_MODE;
EXPLAIN_OPENING_BRACKET : OPENING_BRACKET -> type(OPENING_BRACKET), pushMode(DEFAULT_MODE);
EXPLAIN_PIPE : PIPE -> type(PIPE), popMode;
EXPLAIN_WS : WS -> channel(HIDDEN);
EXPLAIN_LINE_COMMENT : LINE_COMMENT -> channel(HIDDEN);
EXPLAIN_MULTILINE_COMMENT : MULTILINE_COMMENT -> channel(HIDDEN);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this down since EXPLAIN mode is still in limbo - no need to have it high in the file, before the EXPRESSION mode.
The modes were initially done in alphabetical order however I think it's best to do them in chronological form.

Comment on lines +210 to +212
// move it in the main section if the feature gets promoted
DEV_MATCH_OP : {isDevVersion()}? DEV_MATCH -> type(DEV_MATCH);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ioanatia @carlosdelest identical tokens should be aliased as mentioned above (through the type() function) instead of being repeated.

Comment on lines -70 to +82
: qualifiedName MATCH_OPERATOR queryString=string
: valueExpression DEV_MATCH queryString=string
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ioanatia try to stick with the existing grammar pattern; use valueExpression instead of qualifiedName. Perform validation if necessary in the parser .

/**
* Utility class for filtering/processing TokenSource.
*/
abstract class DelegatingTokenSource implements TokenSource {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted this base class since I had another TokenSource implementation - I've removed it but kept it for future reuse (or maybe ANTLR will offer it).


import org.elasticsearch.Build;

class EsqlConfig {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified the base class - no check for features.

Comment on lines +124 to +128
if (recognizer instanceof EsqlBaseParser parser && parser.isDevVersion() == false) {
Matcher m = REPLACE_DEV.matcher(message);
message = m.replaceAll(StringUtils.EMPTY);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hacky way to remove the DEV_ tokens - if anybody else has better ideas, let me know.
FTR, I've tried removing the token from the stream, moving them to a hidden channel however it doesn't work since the error is generated by the recognizer who's inspecting the ATN (the transition between states) and the DEV_ token show up.
The code above is much easier than manipulating the ATN, which gets generated by ANTLR - though it's probably quite fragile.

@costin costin removed the WIP label Aug 27, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 27, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

@costin
Copy link
Member Author

costin commented Aug 27, 2024

@drewdaemon I've updated the PR with the flags and added some code on how to drop the DEV_ tokens when running on the stable branch when encountering a syntax error ("... expected on of {'from', 'eval' ...DEV_INLINE...").

@alex-spies
Copy link
Contributor

@alex-spies

Why do these need to be gated in the lexer and parser? Wouldn't one of the two suffice?

If there's no gating in the lexer, any new keyword introduced can and will clash with other parts of the grammar such as field name. See for example the error message that reports 'match' and the rest of the dev commands in case of an error.

I understand the gating in the lexer, but wonder about the gating in the parser.

I just removed the isDevVersion predicate for inlinestats from the parser, locally, and ran an inlinestats query; still resulted in the expected parsing exception:

{"error":{"root_cause":[{"type":"parsing_exception","reason":"line 1:20: mismatched input 'inlinestats' expecting {'dissect', 'drop', 'enrich', 'eval', 'grok', 'keep', 'limit', 'mv_expand', 'rename', 'sort', 'stats', 'where'}"}],"type":"parsing_exception","reason":"line 1:20: mismatched input 'inlinestats' expecting {'dissect', 'drop', 'enrich', 'eval', 'grok', 'keep', 'limit', 'mv_expand', 'rename', 'sort', 'stats', 'where'}","caused_by":{"type":"input_mismatch_exception","reason":null}},"status":400}% 

If the lexer doesn't accept the token, there's no reason for additional special treatment in the parser, no?

* C. Renaming a token
*
* Avoid renaming the token. But if you really have to, please check with the
* Kibana team as they might be using consuming the dictionary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo: might be using consuming ....

* If the tokens haven't made it to production (and make sure to double check),
* simply remove them from the grammar.
* If the tokens have made it, check with the Kibana team the impact the new
* tokens on it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence doesn't sound right, I think it's missing a verb...

check with the Kibana team the impact the new tokens have on it

// DEV_MYCOMMAND : {isDevVersion()}? 'mycommand' -> ...
//
// Once the command has been stabilized, remove the DEV_ prefix and the {}? conditional and move the command to the
// main section while preserving alphabetical order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you added an example with DEV_MYCOMMAND I think it's worth adding another example for the situation when the command is promoted to production.

@costin
Copy link
Member Author

costin commented Aug 27, 2024

I understand the gating in the lexer, but wonder about the gating in the parser.

The conditionals make sense for consistency - the rules can use 'released' tokens in which case they'll match. While in this PR, the tokens and rules go hand in hand, the goal is to provide a template that works in all cases (such as introducing a new rule that gets applied for existing tokens or that enables a feature only from a certain version onwards).

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines -71 to +96
: [A-Za-z]
: [a-z]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding this as a comment.


import org.elasticsearch.Build;

class EsqlConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be a record?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because records are immutable.

Copy link
Member Author

@costin costin Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding this as a comment.

No need, if one tries to define upper case characters, the grammar will complain.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@costin costin merged commit fff5c8f into elastic:main Aug 27, 2024
2 of 15 checks passed
@costin costin deleted the esql/semantic-predicates branch August 27, 2024 20:51
drewdaemon added a commit to elastic/kibana that referenced this pull request Sep 4, 2024
## Summary

Pulls in the changes ES team introduced in the ES|QL grammar in
elastic/elasticsearch#111995. Nothing should
change as far as our functionality except that commands the ES team
marks with the `DEV_` prefix will no longer show up in our validation
errors.

**Before**
<img width="859" alt="Screenshot 2024-09-03 at 3 11 21 PM"
src="https://github.com/user-attachments/assets/69dee5f1-dd26-4d85-b83b-a0b4689a3c09">

**After**
<img width="848" alt="Screenshot 2024-09-03 at 3 10 35 PM"
src="https://github.com/user-attachments/assets/31c07a0a-4e59-4e11-af72-a1eb7b7f1235">

Successful ES|QL grammar sync run:
https://buildkite.com/elastic/kibana-es-ql-grammar-sync/builds/53
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
…astic#111995)

Introducing condition parsing (and versioning) in the grammar (both lexer 
 and parser). 
That is make sure that the underlying grammar gets parsed only if certain 
 conditions are met such as :
- in development code is picked up only in the snapshot branch
- new commands are not available on older versions (clusters with old 
 nodes)
- deprecated grammar is no longer allowed once removed
The goal of this branch is to test the concept and see whether it work both 
 on the backend (Java) and front-end (Javascript) since the underlying 
 predicates(aka conditions) are language specific.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI >refactoring Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants