-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EQL: Replace ?"..." with """...""" for unescaped strings #62539
EQL: Replace ?"..." with """...""" for unescaped strings #62539
Conversation
Use triple doulbe quotes enclosing a string literal to interpret it as unescaped, in order to use `?` for marking query params and avoid user confusion. Relates to elastic#61659
Pinging @elastic/es-ql (:Query Languages/EQL) |
/cc @rw-access |
Currently the usage of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some parser tests to validate the error being thrown with the old and new format.
@@ -122,10 +122,18 @@ public static String unquoteString(Source source) { | |||
return null; | |||
} | |||
|
|||
// unescaped strings can be interpreted directly | |||
// catch old method of ?" and ?' to define unescaped strings | |||
if (text.startsWith("?")) { | |||
checkForSingleQuotedString(source, text, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect - in case of ?'
the error message is invalid - it tells that double quotes should be used which is not the case.
In fact I would argue that since the string is always quoted, if it starts with ? an exception should be thrown regardless of what follows it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, It's a 2-step error for the user: ?'
-> ?"
-> """
. I just thought that explicitly handling the single quotes (no matter of the preceding ?
) was a "better" approach. I'll gladly change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this has been decided as a new syntax so I would like to hold off on merging and run this by the EQL rules team.
@@ -195,6 +195,7 @@ STRING | |||
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"' | |||
| '?"' ('\\"' |~["\r\n])* '"' | |||
| '?\'' ('\\\'' |~['\r\n])* '\'' | |||
| '"""' ('\\\'' |~['\r\n])* '"""' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like '
needs to be escaped under this format.
I also have some questions about the desired behavior of triple double quotes:
How would each of these be interpreted? I think we should know our edge cases up front. I'm wondering what the implication is for "
at the beginning or at the end of a string. And how many of them are supported
""""""
"""""x"""
"""x"""""
""""""""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like
'
needs to be escaped under this format.
Correct, I rushed and overlooked that.
I also have some questions about the desired behavior of triple double quotes:
How would each of these be interpreted? I think we should know our edge cases up front. I'm wondering what the implication is for
"
at the beginning or at the end of a string. And how many of them are supported
""""""
"""""x"""
"""x"""""
""""""""
The extra "
in the beginning or end of the string would be part of the final string:
- `` -> empty string
""x
x""
""
there is no grammatical ambiguity for a lone sure there may be an ambiguity from the user's expectation of what |
That's true, It's only to have a clear user of |
…-triple-doublequotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -199,6 +199,7 @@ STRING | |||
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"' | |||
| '?"' ('\\"' |~["\r\n])* '"' | |||
| '?\'' ('\\\'' |~['\r\n])* '\'' | |||
| '"""' ('\\"' |~['\r\n])* '"""' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sttill doesn't seem right and needs more tests.
Under this regex, you start and end with """
, but you can't have any '
characters in between. In addition, you can have an infinite number of "
characters. This would mean """""""""""""""
is valid syntax. But I'm not sure that's good behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this particular implementation needs some more tests and an updated regex.
Regardless of the implementation though, I think we still need to hold off on merging this PR. I think the turnaround was impressive, but that we skipped a few crucial steps. Using """
for quoting literal strings went from proposal straight to implementation, but I think we still need the opportunity for a discussion, ironing out the behavior for edge cases, and ultimately agreement.
The specification going forward
I'm thinking that any changes to the EQL specification, like this example, should be collectively deliberated and agreed upon. I'm not completely sure if this syntax is desirable, and would like some more time to explore options. And when things merge, it's harder to back them out. Especially for 7.10+ changes or whenever we land as beta.
I think that all of the interested bodies should have weight, and I'm wondering if our existing syncs are sufficient to discuss changes to the specification. An idea was pitched on this PR elastic/beats#20994 (comment) to separate out the specification into a separate repo. I think this could help draw clearer lines between implementation and specification.
Thoughts @costin @philkra @paulewing @tsg?
Please see my comment here |
…-triple-doublequotes
e.getMessage()); | ||
} | ||
|
||
public void testTripleDoubleQuotedUnescapedString() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is """hello""" == """world"""
interpreted?
it should be equivalent to this expression Equals("hello", "world)
:
"hello" == "world"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a test for that
@@ -200,6 +200,7 @@ STRING | |||
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"' | |||
| '?"' ('\\"' |~["\r\n])* '"' | |||
| '?\'' ('\\\'' |~['\r\n])* '\'' | |||
| '"""' (~['\r\n])* '"""' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't need to forbid '
inside the string.
| '"""' (~['\r\n])* '"""' | |
| '"""' (~[\r\n])* '"""' |
@rw-access Was right about the greediness, so in my previous implementation an expression like:
would be parsed as a literal:
So instead I've pushed the python approach that within the Please advice for any more tests than the current ones. /cc @jrodewig |
@@ -200,6 +200,7 @@ STRING | |||
| '"' ('\\' [btnfr"'\\] | ~[\r\n"\\])* '"' | |||
| '?"' ('\\"' |~["\r\n])* '"' | |||
| '?\'' ('\\\'' |~['\r\n])* '\'' | |||
| '"""' ('\\"' |~[\r\n])*? '"""' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could really come to bite us. Mostly because it means that a raw string is not a raw string. With this, there's no way to express a literal \"
anywhere in the string.
This branch allows a literal \"
:
matriv/elasticsearch@replace-unescaped-triple-doublequotes...rw-access:raw-string-changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This grammar has other issues:
"""""hello\"""world!\"""" == """"foo"\""bar""\""""
=>
org.elasticsearch.xpack.eql.parser.ParsingException: line 1:21: token recognition error at: '!\'
and with the proposed solution you can have a \"
literal if you double escape it \\"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could replace \"
with "
only if part of an escaped triple double quote sequence, but I think that produces more confusion for a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it should give that syntax error because it's invalid syntax. the string was completed after seeing """
.
that sounds like the correct behavior. If we introduce any escapes at all we defeat the purpose of this being a raw string, and require some potentially complex parsing logic and strange edge cases. although I don't think we should stop immediately when seeing """
because then a string can't end with """
. we could process a few more quotes after without issue. then a string will be able to end with quotes, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""""hello\"""world!\""""
it's a valid syntax, it has a closing """
.
The previous ?"
couldn't accept any "
and every "
should be escaped.
With this approach we only need to escape a """
which is very rare to occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a number of issues at hand:
"
is a common character. To avoid escaping it too often, the raw string relies on 3 triple quotes"""
. This however can become a problem at the end of the raw string where a trailing"
can mess up quoting"
are allowed inside and can be escaped to avoid closing a string"""
through\
. This again is it occurs at the end of the string.
To wit:
""" dir "c:\dir""""
<-- 4 ending quotes""" dir c:\dir\"""
<-- \ escapes the"""
Due to file paths on windows 2, aka \"
is likely a common occurrence.
I see several options to solve this:
A. Require raw strings to not end with "
and advice folks to add a space for example:
""" dir "c:\dir" """
B. Use a different escape character from \
, say |
. It solves 2 but not 1.
C. Do not allow escaping inside raw strings. This solves 2 but not 1.
D. Use a different set for quoting characters. Say "?""and
""?`. However this does not solve the problem, rather just minimizes the occurrences in which they appear. Which might be a win in itself.
Considering \
is traditionally used for escaping and in Window paths, I have a preference for C over B. That's because introducing a new character for escaping is surprising and might backfire again depending on the raw string .
B however mean that one cannot escape """
inside a raw string but then for that there's the regular string definition. Unfortunately this doesn't solve 1 either...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A. I thing introducing a mandatory whitespace can lead to other issues, and is it only one space? what happens with more spaces, we keep all but one?, or with tabs?
B. I also don't like a new escape char, doesn't address the main issue no1.
D. I definitely don't like D
since it still contains the ?
char which usually refers to regex or query param.
C. I'm ok with C so if a user needs to escape chars or needs a """
, he/she has to resort to the normal "....."
syntax and properly escape what's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
option E:
Allow for two optional additional trailing quotes with no escape sequences. then you can solve the trailing quotes issue without escapes.
The string can't contain """
anywhere inside. If you ever need to write a string with that sequence, a raw string is insufficient, so users will need to switch to a traditional escaped "
string.
"""[^\r\n]*?""""?"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rw-access I'm also happy with that, pushed the change and fixed all relevant tests.
…-triple-doublequotes
…-triple-doublequotes
@elasticmachine run elasticsearch-ci/packaging-sample-windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
thanks for your endurance!
Thank you too from my side! |
@elasticmachine run elasticsearch-ci/packaging-sample-windows |
@jrodewig So finally the change is that we replace |
…3174) Use triple double quotes enclosing a string literal to interpret it as unescaped, in order to use `?` for marking query params and avoid user confusion. `?` also usually implies regex expressions. Any character inside the `"""` beginning-closing markings is considered raw and the only thing that is not permitted is the `"""` sequence itself. If a user wants to use that, needs to resort to the normal `"` string literal and use proper escaping. Relates to #61659 (cherry picked from commit d87c2ca)
Woohoo |
Use triple double quotes enclosing a string literal to interpret it
as unescaped, in order to use
?
for marking query params and avoiduser confusion.
Relates to #61659