-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement caseInsensitive option #3399
Conversation
Hiya. As you know, my philosophy at this point is that any change to the parsing strategy really should be an emergency. Wouldn't this all be simpler if they passed in an uppercase or lowercase version of the input and then set the character source to the original unmodified stream for creating the token objects? This wouldn't require much in the way of changes at all... In fact it's probably just an idiom how to use a tool. You need two copies of the input stream but that usually won't be a dealbreaker. |
It's quite emergency because a lot of users still report about not working grammars despite the fact they actually work but users have to use an additional runtime class that normalizes the input stream. Now there are around 30 questions related to case insensitivity: https://github.com/antlr/grammars-v4/issues?q=label%3Acase-insensitive And two recently postponed merge requests: antlr/grammars-v4#2400, antlr/grammars-v4#2417
It's not simpler than using the suggested option. Also, maybe it's not fully correct because different runtimes work in different ways. For instance, in Java
It's not a big change but it's still a change that should be ideally eliminated to make ANTLR using simpler. BTW, the new option does not affect the parsing strategy and everything is fully back compatible with previous versions. All tests are passing (except for not stable Swift and C++). |
Also, I've tested the performance of grammar and runtimes case changing approaches and found out there is almost no difference between them: #2046 (comment) (but grammar approach is a bit better). |
i'll take a closer look. thanks. |
I support the idea of handling case insensitivity directly in ANTLR4, however, it should either be limited to ANSI characters (where simple up case / down case operations work) or the full Unicode case mapping process must be implemented. See also: https://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180 IMO, all we need is case insensitive keywords, which are always (?) written using ASCII letters (but the full ANSI script would work too). We don't need to match any possible string. So, a better solution is probably to add the case sensitivity flag to a lexer rule (an option, to make only this rule case insensitive) and ANTLR4 can check if the case mapping round trip works ( |
I think not only to ANSI characters. As I've already written, there are case insensitive languages with non-ANSI characters, for instance, well-known in Russia 1C. But characters with working round trip as you suggested ( The parser from the following grammar: lexer grammar L;
options { caseInsensitive = true; }
ENGLISH_TOKEN: [a-z]+;
GERMAN_TOKEN: [äéöüß]+;
FRENCH_TOKEN: [àâæ-ëîïôœùûüÿ]+;
CROATIAN_TOKEN: [ćčđšž]+;
ITALIAN_TOKEN: [àèéìòù]+;
SPANISH_TOKEN: [áéíñóúü¡¿]+;
GREEK_TOKEN: [α-ω]+;
RUSSIAN_TOKEN: [а-я]+;
WS: [ ]+ -> skip; Matches the following sequence of words:
I can extend this test with your suggested symbols.
Almost any other string does not have lower or UPPER case at all:
One exception is a declaration of STRING:
But I don't think it's a problem because in all languages strings include both lower and upper characters. In the worst case, the grammar can be rewritten without options { caseInsensitive=true; }
STRING: [a-zA-Z]+; Throws
I don't think it's a better solution because it's excess. In most languages, all tokens are case insensitive, at least modes. Also, it requires ANTLR grammar changing. All SQL dialects (MySql, T-Sql, PlSql, SQLite) are fully case insensitive, Pascal-based languages (Delphi). Only PHP has different modes that may be marked with different case sensitivity options (because it may contain JavaScript, HTML, and other "islands"): https://github.com/antlr/grammars-v4/blob/master/php/PhpLexer.g4
I think the solution with checking the length of the resulting char is better because it guarantees length equality of input and actual input stream. Also, it looks like Java (that is responsible for grammar conversion) works as expected without such tricks. And an error is also not okay, just ignoring such symbols (rare and maybe even not the existing case):
I'm basing on practical considerations from my grammars development experience (actually not only mine). And I think ANTLR is not a tool for natural language processing that should consider subtle case mapping nuances, but a tool for formal language processing with a straightforward and common case-insensitivity mechanism. |
Great discussion and thoughts, @mike-lischke and @KvanTTT. Thanks! Hmm...ok, let me review the implementation here. From user point of view, I like simple Just so I'm clear, we change auto convert in grammar 'a' -> [aA] and leave input char stream alone, right? |
Adding note that we should update doc and/or remove https://github.com/antlr/antlr4/blob/master/doc/case-insensitive-lexing.md if we merge this. Looks like you've updated doc I see. |
I think we need to delete |
This will require enhancing every target. I’m supportive but just raising awareness.
Envoyé de mon iPhone
… Le 24 déc. 2021 à 19:33, Terence Parr ***@***.***> a écrit :
Great discussion and thoughts, @mike-lischke and @KvanTTT. Thanks! Hmm...ok, let me review the implementation here. From user point of view, I like simple options { caseInsensitive=true; } idea if it covers almost all cases.
Just so I'm clear, we change auto convert in grammar 'a' -> [aA] and leave input char stream alone, right?
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.
|
My hope is to avoid changing targets at all. If grammar gets converted ala |
runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java
Outdated
Show resolved
Hide resolved
Code is looking go so far...tool tests all pass. checking other stuff. |
…code related to caseInsensitive option fixes antlr#3433
Fix `else` formatting for current pull request
23406e5
to
2f15b9a
Compare
@KvanTTT Looking good, but i think |
It's not true because this method calls itself and pass another value, see https://github.com/antlr/antlr4/pull/3399/files#diff-44489d61ffc263da31b35a8ca18d048cffb043b290fc929e79c5a3202262c487R579 |
Ok, I'm ready to merge if you're happy with it also now. |
I'm happy with merging, some tests are failing but it looks like yet another problem with CircleCI. |
Yeah, i'm fiddling with the tests to fix. |
woohoo! thanks, @KvanTTT :) |
### What changes were proposed in this pull request? This pr is aims upgrade `antlr4` from 4.9.3 to 4.13.1 ### Why are the changes needed? After 4.10, antlr4 is using Java 11 for the source code and the compiled .class files for the ANTLR tool. There are some bug fix and Improvements after 4.9.3: - antlr/antlr4#3399 - antlr/antlr4#1105 - antlr/antlr4#2788 - antlr/antlr4#3957 - antlr/antlr4#4394 The full release notes as follows: - https://github.com/antlr/antlr4/releases/tag/4.13.1 - https://github.com/antlr/antlr4/releases/tag/4.13.0 - https://github.com/antlr/antlr4/releases/tag/4.12.0 - https://github.com/antlr/antlr4/releases/tag/4.11.1 - https://github.com/antlr/antlr4/releases/tag/4.11.0 - https://github.com/antlr/antlr4/releases/tag/4.10.1 - https://github.com/antlr/antlr4/releases/tag/4.10 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43075 from LuciferYang/antlr4-4131. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request? This pr is aims upgrade `antlr4` from 4.9.3 to 4.13.1 ### Why are the changes needed? After 4.10, antlr4 is using Java 11 for the source code and the compiled .class files for the ANTLR tool. There are some bug fix and Improvements after 4.9.3: - antlr/antlr4#3399 - antlr/antlr4#1105 - antlr/antlr4#2788 - antlr/antlr4#3957 - antlr/antlr4#4394 The full release notes as follows: - https://github.com/antlr/antlr4/releases/tag/4.13.1 - https://github.com/antlr/antlr4/releases/tag/4.13.0 - https://github.com/antlr/antlr4/releases/tag/4.12.0 - https://github.com/antlr/antlr4/releases/tag/4.11.1 - https://github.com/antlr/antlr4/releases/tag/4.11.0 - https://github.com/antlr/antlr4/releases/tag/4.10.1 - https://github.com/antlr/antlr4/releases/tag/4.10 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#43075 from LuciferYang/antlr4-4131. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 13cd291)
## Summary Up till now, we had to define our own lexer rules for our client-side ES|QL validation. This was because we were using an unofficial ANTLR package (before the official ANTLR had typescript support). Now that we are using the official ANTLR library (as of #177211), we no longer have to encode case insensitivity into the lexer rules themselves because the [`caseInsensitive` option](antlr/antlr4#3399) is now available to us. This means we can adopt the very [same definitions](https://github.com/elastic/elasticsearch/blob/343b1ae1ba74fbf2e75c29adddb2790312dd680b/x-pack/plugin/esql/src/main/antlr/EsqlBaseLexer.g4) that Elasticsearch uses as long as we set `caseInsensitive` (Elasticsearch handles case insensitivity at runtime). ### Checklist - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
Yet another try to suggest
caseInsensitive
option (the old is here).There are a lot of issues related to case insensitiveness in grammars-v4 repository. Actually, it's very inconvenient to write clear grammars without fragment rules and use CaseChangingCharStream for each runtime. Because it's yet another dependency that even is not available from ANTLR runtime. Moreover, different runtimes may have their own rules of changing input streams. It leads to the situation when a lot of users don't know how to do it and create useless and duplicated issues on GitHub.
I suggest a new option that resolves all the abovementioned problems. I tried to implement it in a less intrusive way to make reviewing easier (grammar is not affected). Yet another advantage of the new option: it allows to use ANTLR semantics checks, unlike runtime approach.
In short, just one line could resolve a lot of problems:
Later we can implement a case insensitive option for each mode if it will be required.
@parrt please take a look at this again, it's very important for grammars community :)