Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement caseInsensitive option #3399

Merged
merged 5 commits into from
Dec 25, 2021
Merged

Conversation

KvanTTT
Copy link
Member

@KvanTTT KvanTTT commented Dec 9, 2021

Yet another try to suggest caseInsensitive option (the old is here).

There are a lot of issues related to case insensitiveness in grammars-v4 repository. Actually, it's very inconvenient to write clear grammars without fragment rules and use CaseChangingCharStream for each runtime. Because it's yet another dependency that even is not available from ANTLR runtime. Moreover, different runtimes may have their own rules of changing input streams. It leads to the situation when a lot of users don't know how to do it and create useless and duplicated issues on GitHub.

I suggest a new option that resolves all the abovementioned problems. I tried to implement it in a less intrusive way to make reviewing easier (grammar is not affected). Yet another advantage of the new option: it allows to use ANTLR semantics checks, unlike runtime approach.

In short, just one line could resolve a lot of problems:

options { caseInsensitive=true; }

Later we can implement a case insensitive option for each mode if it will be required.

@parrt please take a look at this again, it's very important for grammars community :)

@parrt
Copy link
Member

parrt commented Dec 12, 2021

Hiya. As you know, my philosophy at this point is that any change to the parsing strategy really should be an emergency. Wouldn't this all be simpler if they passed in an uppercase or lowercase version of the input and then set the character source to the original unmodified stream for creating the token objects? This wouldn't require much in the way of changes at all... In fact it's probably just an idiom how to use a tool. You need two copies of the input stream but that usually won't be a dealbreaker.

@KvanTTT
Copy link
Member Author

KvanTTT commented Dec 12, 2021

my philosophy at this point is that any change to the parsing strategy really should be an emergency

It's quite emergency because a lot of users still report about not working grammars despite the fact they actually work but users have to use an additional runtime class that normalizes the input stream. Now there are around 30 questions related to case insensitivity: https://github.com/antlr/grammars-v4/issues?q=label%3Acase-insensitive And two recently postponed merge requests: antlr/grammars-v4#2400, antlr/grammars-v4#2417

Wouldn't this all be simpler if they passed in an uppercase or lowercase version of the input and then set the character source to the original unmodified stream for creating the token objects?

It's not simpler than using the suggested option. Also, maybe it's not fully correct because different runtimes work in different ways. For instance, in Java Character.toUpperCase('ß') returns just ß and it's correct. But in JavaScript 'ß'.toUpperCase() returns SS and it's not okay. Moreover, the runtime approach does not allow checking additional diagnostics such as CHARACTERS_COLLISION_IN_SET. Anyway, it will be still possible to use input stream normalization if you want but it's not recommended.

This wouldn't require much in the way of changes at all...

It's not a big change but it's still a change that should be ideally eliminated to make ANTLR using simpler.

BTW, the new option does not affect the parsing strategy and everything is fully back compatible with previous versions. All tests are passing (except for not stable Swift and C++).

@KvanTTT
Copy link
Member Author

KvanTTT commented Dec 12, 2021

Also, I've tested the performance of grammar and runtimes case changing approaches and found out there is almost no difference between them: #2046 (comment) (but grammar approach is a bit better).

@parrt
Copy link
Member

parrt commented Dec 13, 2021

i'll take a closer look. thanks.

@mike-lischke
Copy link
Member

I support the idea of handling case insensitivity directly in ANTLR4, however, it should either be limited to ANSI characters (where simple up case / down case operations work) or the full Unicode case mapping process must be implemented. See also: https://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

IMO, all we need is case insensitive keywords, which are always (?) written using ASCII letters (but the full ANSI script would work too). We don't need to match any possible string.

So, a better solution is probably to add the case sensitivity flag to a lexer rule (an option, to make only this rule case insensitive) and ANTLR4 can check if the case mapping round trip works (a.toLower().toUpper().toLower() == a.toLower()). If it does not then an error is shown. Otherwise we can take both upper and lower variants into the ATN transition's label set.

@KvanTTT
Copy link
Member Author

KvanTTT commented Dec 24, 2021

it should either be limited to ANSI characters (where simple up case / down case operations work) or the full Unicode case mapping process must be implemented.

I think not only to ANSI characters. As I've already written, there are case insensitive languages with non-ANSI characters, for instance, well-known in Russia 1C. But characters with working round trip as you suggested (a.toLower().toUpper().toLower() == a.toLower()) or I suggested (do not change case if length of resulting char (actually string) more than 1). Also, in this PR request I've written tests for different cultures:

The parser from the following grammar:

lexer grammar L;
options { caseInsensitive = true; }
ENGLISH_TOKEN:   [a-z]+;
GERMAN_TOKEN:    [äéöüß]+;
FRENCH_TOKEN:    [àâæ-ëîïôœùûüÿ]+;
CROATIAN_TOKEN:  [ćčđšž]+;
ITALIAN_TOKEN:   [àèéìòù]+;
SPANISH_TOKEN:   [áéíñóúü¡¿]+;
GREEK_TOKEN:     [α-ω]+;
RUSSIAN_TOKEN:   [а-я]+;
WS:              [ ]+ -> skip;

Matches the following sequence of words:

abcXYZ äéöüßÄÉÖÜß àâæçÙÛÜŸ ćčđĐŠŽ àèéÌÒÙ áéÚÜ¡¿ αβγΧΨΩ абвЭЮЯ

I can extend this test with your suggested symbols.

We don't need to match any possible string.

Almost any other string does not have lower or UPPER case at all:

PLUS: '+';
MULT: '*';
...

One exception is a declaration of STRING:

STRING: [a-z]+;

But I don't think it's a problem because in all languages strings include both lower and upper characters. In the worst case, the grammar can be rewritten without caseInsensitive option (rare case). Moreover, with the current implementation the following declaration:

options { caseInsensitive=true; }
STRING: [a-zA-Z]+;

Throws CHARACTERS_COLLISION_IN_SET warning.

So, a better solution is probably to add the case sensitivity flag to a lexer rule (an option, to make only this rule case insensitive)

I don't think it's a better solution because it's excess. In most languages, all tokens are case insensitive, at least modes. Also, it requires ANTLR grammar changing. All SQL dialects (MySql, T-Sql, PlSql, SQLite) are fully case insensitive, Pascal-based languages (Delphi). Only PHP has different modes that may be marked with different case sensitivity options (because it may contain JavaScript, HTML, and other "islands"): https://github.com/antlr/grammars-v4/blob/master/php/PhpLexer.g4

if the case mapping round trip works (a.toLower().toUpper().toLower() == a.toLower()). If it does not then an error is shown

I think the solution with checking the length of the resulting char is better because it guarantees length equality of input and actual input stream. Also, it looks like Java (that is responsible for grammar conversion) works as expected without such tricks. And an error is also not okay, just ignoring such symbols (rare and maybe even not the existing case):

ß !-> ss, ß -> ß, ю -> Ю.

I'm basing on practical considerations from my grammars development experience (actually not only mine). And I think ANTLR is not a tool for natural language processing that should consider subtle case mapping nuances, but a tool for formal language processing with a straightforward and common case-insensitivity mechanism.

@parrt
Copy link
Member

parrt commented Dec 24, 2021

Great discussion and thoughts, @mike-lischke and @KvanTTT. Thanks! Hmm...ok, let me review the implementation here. From user point of view, I like simple options { caseInsensitive=true; } idea if it covers almost all cases.

Just so I'm clear, we change auto convert in grammar 'a' -> [aA] and leave input char stream alone, right?

@parrt
Copy link
Member

parrt commented Dec 24, 2021

Adding note that we should update doc and/or remove https://github.com/antlr/antlr4/blob/master/doc/case-insensitive-lexing.md if we merge this. Looks like you've updated doc I see.

@parrt
Copy link
Member

parrt commented Dec 24, 2021

I think we need to delete * [Case-Insensitive Lexing](case-insensitive-lexing.md) from index.md too.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Dec 24, 2021 via email

@parrt
Copy link
Member

parrt commented Dec 24, 2021

My hope is to avoid changing targets at all. If grammar gets converted ala 'a' -> [aA] everywhere then targets should be ok as-is.

@parrt
Copy link
Member

parrt commented Dec 24, 2021

Code is looking go so far...tool tests all pass. checking other stuff.

@KvanTTT KvanTTT force-pushed the case-insensitive-option branch from 23406e5 to 2f15b9a Compare December 25, 2021 19:53
@parrt
Copy link
Member

parrt commented Dec 25, 2021

@KvanTTT Looking good, but i think checkRangeAndAddToSet() also has arg we can remove.

@KvanTTT
Copy link
Member Author

KvanTTT commented Dec 25, 2021

It's not true because this method calls itself and pass another value, see https://github.com/antlr/antlr4/pull/3399/files#diff-44489d61ffc263da31b35a8ca18d048cffb043b290fc929e79c5a3202262c487R579

@parrt
Copy link
Member

parrt commented Dec 25, 2021

Ok, I'm ready to merge if you're happy with it also now.

@KvanTTT
Copy link
Member Author

KvanTTT commented Dec 25, 2021

I'm happy with merging, some tests are failing but it looks like yet another problem with CircleCI.

@parrt
Copy link
Member

parrt commented Dec 25, 2021

Yeah, i'm fiddling with the tests to fix.

@parrt parrt merged commit 7bc8257 into antlr:master Dec 25, 2021
@parrt
Copy link
Member

parrt commented Dec 25, 2021

woohoo! thanks, @KvanTTT :)

@KvanTTT KvanTTT deleted the case-insensitive-option branch December 25, 2021 20:39
srowen pushed a commit to apache/spark that referenced this pull request Sep 26, 2023
### What changes were proposed in this pull request?
This pr is aims upgrade `antlr4` from 4.9.3 to 4.13.1

### Why are the changes needed?
After 4.10, antlr4 is using Java 11 for the source code and the compiled .class files for the ANTLR tool. There are some bug fix and Improvements after 4.9.3:
- antlr/antlr4#3399
- antlr/antlr4#1105
- antlr/antlr4#2788
- antlr/antlr4#3957
- antlr/antlr4#4394

The full release notes as follows:

- https://github.com/antlr/antlr4/releases/tag/4.13.1
- https://github.com/antlr/antlr4/releases/tag/4.13.0
- https://github.com/antlr/antlr4/releases/tag/4.12.0
- https://github.com/antlr/antlr4/releases/tag/4.11.1
- https://github.com/antlr/antlr4/releases/tag/4.11.0
- https://github.com/antlr/antlr4/releases/tag/4.10.1
- https://github.com/antlr/antlr4/releases/tag/4.10

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43075 from LuciferYang/antlr4-4131.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
doruchiulan pushed a commit to BlueprintTechnologies/spark-lhm that referenced this pull request Feb 2, 2024
### What changes were proposed in this pull request?
This pr is aims upgrade `antlr4` from 4.9.3 to 4.13.1

### Why are the changes needed?
After 4.10, antlr4 is using Java 11 for the source code and the compiled .class files for the ANTLR tool. There are some bug fix and Improvements after 4.9.3:
- antlr/antlr4#3399
- antlr/antlr4#1105
- antlr/antlr4#2788
- antlr/antlr4#3957
- antlr/antlr4#4394

The full release notes as follows:

- https://github.com/antlr/antlr4/releases/tag/4.13.1
- https://github.com/antlr/antlr4/releases/tag/4.13.0
- https://github.com/antlr/antlr4/releases/tag/4.12.0
- https://github.com/antlr/antlr4/releases/tag/4.11.1
- https://github.com/antlr/antlr4/releases/tag/4.11.0
- https://github.com/antlr/antlr4/releases/tag/4.10.1
- https://github.com/antlr/antlr4/releases/tag/4.10

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#43075 from LuciferYang/antlr4-4131.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 13cd291)
drewdaemon added a commit to elastic/kibana that referenced this pull request Mar 8, 2024
## Summary

Up till now, we had to define our own lexer rules for our client-side
ES|QL validation. This was because we were using an unofficial ANTLR
package (before the official ANTLR had typescript support).

Now that we are using the official ANTLR library (as of
#177211), we no longer have to
encode case insensitivity into the lexer rules themselves because the
[`caseInsensitive` option](antlr/antlr4#3399) is
now available to us.

This means we can adopt the very [same
definitions](https://github.com/elastic/elasticsearch/blob/343b1ae1ba74fbf2e75c29adddb2790312dd680b/x-pack/plugin/esql/src/main/antlr/EsqlBaseLexer.g4)
that Elasticsearch uses as long as we set `caseInsensitive`
(Elasticsearch handles case insensitivity at runtime).

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants