Implement caseInsensitive option #3399

KvanTTT · 2021-12-09T20:58:07Z

Yet another attempt to propose a caseInsensitive option (the old one is here).

There are a lot of issues related to case insensitivity in the grammars-v4 repository. It's very inconvenient to write clear grammars without fragment rules and to use CaseChangingCharStream for each runtime, because it's yet another dependency that isn't even available in the ANTLR runtime. Moreover, different runtimes may have their own rules for changing input streams. As a result, a lot of users don't know how to do it and create useless, duplicated issues on GitHub.

I suggest a new option that resolves all the above-mentioned problems. I tried to implement it in a less intrusive way to make reviewing easier (the grammar is not affected). Another advantage of the new option: it allows ANTLR's semantic checks to be used, unlike the runtime approach.

In short, just one line could resolve a lot of problems:

options { caseInsensitive=true; }

Later we can implement a case-insensitive option for each mode if it is required.

@parrt please take a look at this again, it's very important for the grammars community :)

parrt · 2021-12-12T17:28:51Z

Hiya. As you know, my philosophy at this point is that any change to the parsing strategy really should be an emergency. Wouldn't this all be simpler if they passed in an uppercase or lowercase version of the input and then set the character source to the original unmodified stream for creating the token objects? This wouldn't require much in the way of changes at all... In fact it's probably just an idiom how to use a tool. You need two copies of the input stream but that usually won't be a dealbreaker.
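The two-copy idiom described above can be illustrated without any ANTLR dependency. The following is only a sketch under stated assumptions: `TwoStreamSketch`, `normalize` and `findKeyword` are hypothetical names, and a plain keyword search stands in for a real lexer. Matching runs against an upper-cased copy, while the token text is cut out of the original, unmodified input.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoStreamSketch {
    /** Upper-cases char by char, so the copy always has the same length as the original. */
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            sb.append(Character.toUpperCase(input.charAt(i)));
        }
        return sb.toString();
    }

    /** Finds an upper-case keyword in the normalized copy and returns the original spellings. */
    static List<String> findKeyword(String original, String keyword) {
        List<String> tokens = new ArrayList<>();
        if (keyword.isEmpty()) {
            return tokens; // nothing to match; also avoids an endless loop below
        }
        String matchCopy = normalize(original);
        int from = 0;
        int idx;
        while ((idx = matchCopy.indexOf(keyword, from)) >= 0) {
            // Positions are identical in both copies, so slice the original text.
            tokens.add(original.substring(idx, idx + keyword.length()));
            from = idx + keyword.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findKeyword("select a; SeLeCt b", "SELECT"));
    }
}
```

The sketch relies on the normalized copy having exactly the same length as the original; a case mapping that changes the length would break the shared positions.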

KvanTTT · 2021-12-12T19:20:49Z

> my philosophy at this point is that any change to the parsing strategy really should be an emergency

It is quite an emergency: a lot of users still report grammars as not working even though they actually work, because users have to use an additional runtime class that normalizes the input stream. There are now around 30 questions related to case insensitivity: https://github.com/antlr/grammars-v4/issues?q=label%3Acase-insensitive And two recently postponed merge requests: antlr/grammars-v4#2400, antlr/grammars-v4#2417

> Wouldn't this all be simpler if they passed in an uppercase or lowercase version of the input and then set the character source to the original unmodified stream for creating the token objects?

It's not simpler than using the suggested option. It may also not be fully correct, because different runtimes work in different ways. For instance, in Java Character.toUpperCase('ß') returns just ß, which is correct, but in JavaScript 'ß'.toUpperCase() returns SS, which is not okay. Moreover, the runtime approach does not allow additional diagnostics such as CHARACTERS_COLLISION_IN_SET to be checked. Anyway, it will still be possible to use input stream normalization if you want, but it's not recommended.
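The difference between one-to-one and full case mapping can be reproduced inside Java itself. The sketch below (class and method names are illustrative only) contrasts `Character.toUpperCase`, which leaves `ß` unchanged, with `String.toUpperCase`, which expands it to `SS` in the same way as the JavaScript call mentioned above.

```java
import java.util.Locale;

public class SharpSDemo {
    /** One-to-one mapping: always returns exactly one char. */
    static char simpleUpper(char c) {
        return Character.toUpperCase(c);
    }

    /** Full mapping: the result may be longer than the input. */
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF'; // ß
        System.out.println(simpleUpper(sharpS) == sharpS);               // true
        System.out.println(fullUpper(String.valueOf(sharpS)).length()); // 2, i.e. "SS"
    }
}
```

A runtime that normalizes its input with a full mapping therefore changes the stream length, which is one reason the results can differ between targets.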

> This wouldn't require much in the way of changes at all...

It's not a big change, but it's still a change that should ideally be eliminated to make using ANTLR simpler.

BTW, the new option does not affect the parsing strategy, and everything is fully backward compatible with previous versions. All tests are passing (except for the unstable Swift and C++ ones).

KvanTTT · 2021-12-12T19:38:33Z

Also, I've tested the performance of the grammar-based and runtime case-changing approaches and found almost no difference between them: #2046 (comment) (the grammar approach is slightly better).

parrt · 2021-12-13T00:01:06Z

i'll take a closer look. thanks.

mike-lischke · 2021-12-24T10:52:51Z

I support the idea of handling case insensitivity directly in ANTLR4, however, it should either be limited to ANSI characters (where simple up case / down case operations work) or the full Unicode case mapping process must be implemented. See also: https://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

IMO, all we need is case insensitive keywords, which are always (?) written using ASCII letters (but the full ANSI script would work too). We don't need to match any possible string.

So, a better solution is probably to add the case sensitivity flag to a lexer rule (an option, to make only this rule case insensitive) and ANTLR4 can check if the case mapping round trip works (a.toLower().toUpper().toLower() == a.toLower()). If it does not then an error is shown. Otherwise we can take both upper and lower variants into the ATN transition's label set.
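The round-trip check proposed above can be sketched in a few lines of Java. This is only an illustration of the idea, not code from the pull request: `RoundTripCheck` and `roundTrips` are hypothetical names, and `Locale.ROOT` is an assumption made to keep the result independent of the machine's locale.

```java
import java.util.Locale;

public class RoundTripCheck {
    /** True if a.toLower().toUpper().toLower() equals a.toLower() for the given code point. */
    static boolean roundTrips(int codePoint) {
        String a = new String(Character.toChars(codePoint));
        String lower = a.toLowerCase(Locale.ROOT);
        String again = lower.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        return again.equals(lower);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips('a'));      // true
        System.out.println(roundTrips('\u044E')); // ю: true
        System.out.println(roundTrips('\u00DF')); // ß: false, because ß -> SS -> ss
    }
}
```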

KvanTTT · 2021-12-24T11:19:56Z

> it should either be limited to ANSI characters (where simple up case / down case operations work) or the full Unicode case mapping process must be implemented.

I think it should not be limited to ANSI characters. As I've already written, there are case-insensitive languages with non-ANSI characters, for instance 1C, which is well known in Russia. It could instead be limited to characters with a working round trip, as you suggested (a.toLower().toUpper().toLower() == a.toLower()), or follow the rule I suggested (do not change case if the length of the resulting char (actually string) is more than 1). Also, in this pull request I've written tests for different cultures:

The parser from the following grammar:

lexer grammar L;
options { caseInsensitive = true; }
ENGLISH_TOKEN:   [a-z]+;
GERMAN_TOKEN:    [äéöüß]+;
FRENCH_TOKEN:    [àâæ-ëîïôœùûüÿ]+;
CROATIAN_TOKEN:  [ćčđšž]+;
ITALIAN_TOKEN:   [àèéìòù]+;
SPANISH_TOKEN:   [áéíñóúü¡¿]+;
GREEK_TOKEN:     [α-ω]+;
RUSSIAN_TOKEN:   [а-я]+;
WS:              [ ]+ -> skip;

Matches the following sequence of words:

abcXYZ äéöüßÄÉÖÜß àâæçÙÛÜŸ ćčđĐŠŽ àèéÌÒÙ áéÚÜ¡¿ αβγΧΨΩ абвЭЮЯ

I can extend this test with your suggested symbols.

> We don't need to match any possible string.

Almost no other strings have lower or UPPER case at all:

PLUS: '+';
MULT: '*';
...

One exception is a declaration of STRING:

STRING: [a-z]+;

But I don't think it's a problem, because in all languages strings include both lowercase and uppercase characters. In the worst case (which is rare), the grammar can be rewritten without the caseInsensitive option. Moreover, with the current implementation the following declaration:

options { caseInsensitive=true; }
STRING: [a-zA-Z]+;

Produces a CHARACTERS_COLLISION_IN_SET warning.

> So, a better solution is probably to add the case sensitivity flag to a lexer rule (an option, to make only this rule case insensitive)

I don't think it's a better solution because it's excessive. In most languages, all tokens are case insensitive, or at least entire modes are. Also, it requires changing the ANTLR grammar. All SQL dialects (MySql, T-Sql, PlSql, SQLite) are fully case insensitive, as are Pascal-based languages (Delphi). Only PHP has different modes that may be marked with different case sensitivity options (because it may contain JavaScript, HTML, and other "islands"): https://github.com/antlr/grammars-v4/blob/master/php/PhpLexer.g4

> if the case mapping round trip works (a.toLower().toUpper().toLower() == a.toLower()). If it does not then an error is shown

I think the solution that checks the length of the resulting char is better because it guarantees that the original and the normalized input streams have equal length. Also, it looks like Java (which is responsible for the grammar conversion) works as expected without such tricks. An error is not okay either; it's better to just ignore such symbols (a rare and maybe even nonexistent case):

ß !-> ss, ß -> ß, ю -> Ю.
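The length rule can be sketched as follows. This is a hedged illustration rather than the code merged in the pull request: `LengthRuleSketch` and `upperIfSingleChar` are hypothetical names, and `Locale.ROOT` is an assumption. The case is changed only when the full mapping yields exactly one code point; otherwise the character is left alone.

```java
import java.util.Locale;

public class LengthRuleSketch {
    /** Upper-cases a code point unless the result would be longer than one code point. */
    static String upperIfSingleChar(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        String upper = original.toUpperCase(Locale.ROOT);
        boolean single = upper.codePointCount(0, upper.length()) == 1;
        return single ? upper : original;
    }

    public static void main(String[] args) {
        System.out.println(upperIfSingleChar('\u00DF')); // ß stays ß (not SS)
        System.out.println(upperIfSingleChar('\u044E')); // ю becomes Ю
    }
}
```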

I'm basing this on practical considerations from my grammar development experience (actually not only mine). And I think ANTLR is not a tool for natural language processing that should consider subtle case mapping nuances, but a tool for formal language processing with a straightforward and common case-insensitivity mechanism.

parrt · 2021-12-24T18:33:19Z

Great discussion and thoughts, @mike-lischke and @KvanTTT. Thanks! Hmm...ok, let me review the implementation here. From user point of view, I like simple options { caseInsensitive=true; } idea if it covers almost all cases.

Just so I'm clear, we auto-convert 'a' -> [aA] in the grammar and leave the input char stream alone, right?

parrt · 2021-12-24T18:48:55Z

Adding note that we should update doc and/or remove https://github.com/antlr/antlr4/blob/master/doc/case-insensitive-lexing.md if we merge this. Looks like you've updated doc I see.

parrt · 2021-12-24T18:53:02Z

I think we need to delete * [Case-Insensitive Lexing](case-insensitive-lexing.md) from index.md too.

ericvergnaud · 2021-12-24T18:53:15Z

This will require enhancing every target. I'm supportive but just raising awareness.

parrt · 2021-12-24T18:54:41Z

My hope is to avoid changing targets at all. If grammar gets converted ala 'a' -> [aA] everywhere then targets should be ok as-is.
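To make the 'a' -> [aA] conversion concrete, here is a text-level sketch. It is an assumption-laden illustration only: the tool performs the conversion while building the lexer ATN, not by rewriting grammar text, and `LiteralExpansionSketch` and `expand` are hypothetical names. Characters without a distinct upper or lower form are emitted as plain quoted literals, and escaping of special characters is ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class LiteralExpansionSketch {
    /** Rewrites each cased character of a literal as a two-element set, e.g. a -> [aA]. */
    static String expand(String literal) {
        List<String> elements = new ArrayList<>();
        for (int cp : literal.codePoints().toArray()) {
            int lower = Character.toLowerCase(cp);
            int upper = Character.toUpperCase(cp);
            StringBuilder sb = new StringBuilder();
            if (lower == upper) {
                // No case counterpart (digits, punctuation, ß): keep as a literal.
                sb.append('\'').appendCodePoint(cp).append('\'');
            } else {
                sb.append('[').appendCodePoint(lower).appendCodePoint(upper).append(']');
            }
            elements.add(sb.toString());
        }
        return String.join(" ", elements);
    }

    public static void main(String[] args) {
        System.out.println(expand("select")); // [sS] [eE] [lL] [eE] [cC] [tT]
    }
}
```

Because the expansion happens on the grammar side, the input stream is untouched and the generated lexer needs no target-specific support.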

tool/src/org/antlr/v4/automata/LexerATNFactory.java

runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java

parrt · 2021-12-24T19:16:13Z

Code is looking good so far...tool tests all pass. Checking other stuff.

parrt · 2021-12-25T20:22:11Z

@KvanTTT Looking good, but I think checkRangeAndAddToSet() also has an arg we can remove.

KvanTTT · 2021-12-25T20:25:01Z

That's not the case, because this method calls itself and passes another value, see https://github.com/antlr/antlr4/pull/3399/files#diff-44489d61ffc263da31b35a8ca18d048cffb043b290fc929e79c5a3202262c487R579

parrt · 2021-12-25T20:26:36Z

Ok, I'm ready to merge if you're happy with it also now.

KvanTTT · 2021-12-25T20:29:01Z

I'm happy with merging, some tests are failing but it looks like yet another problem with CircleCI.

parrt · 2021-12-25T20:33:39Z

Yeah, i'm fiddling with the tests to fix.

parrt · 2021-12-25T20:36:20Z

woohoo! thanks, @KvanTTT :)

### What changes were proposed in this pull request?
This PR aims to upgrade `antlr4` from 4.9.3 to 4.13.1.

### Why are the changes needed?
After 4.10, antlr4 uses Java 11 for the source code and the compiled .class files of the ANTLR tool.

There are some bug fixes and improvements after 4.9.3:
- antlr/antlr4#3399
- antlr/antlr4#1105
- antlr/antlr4#2788
- antlr/antlr4#3957
- antlr/antlr4#4394

The full release notes are as follows:
- https://github.com/antlr/antlr4/releases/tag/4.13.1
- https://github.com/antlr/antlr4/releases/tag/4.13.0
- https://github.com/antlr/antlr4/releases/tag/4.12.0
- https://github.com/antlr/antlr4/releases/tag/4.11.1
- https://github.com/antlr/antlr4/releases/tag/4.11.0
- https://github.com/antlr/antlr4/releases/tag/4.10.1
- https://github.com/antlr/antlr4/releases/tag/4.10

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43075 from LuciferYang/antlr4-4131.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

## Summary

Up till now, we had to define our own lexer rules for our client-side ES|QL validation. This was because we were using an unofficial ANTLR package (before the official ANTLR had typescript support). Now that we are using the official ANTLR library (as of #177211), we no longer have to encode case insensitivity into the lexer rules themselves because the [`caseInsensitive` option](antlr/antlr4#3399) is now available to us. This means we can adopt the very [same definitions](https://github.com/elastic/elasticsearch/blob/343b1ae1ba74fbf2e75c29adddb2790312dd680b/x-pack/plugin/esql/src/main/antlr/EsqlBaseLexer.g4) that Elasticsearch uses as long as we set `caseInsensitive` (Elasticsearch handles case insensitivity at runtime).

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

parrt added lexers options type:improvement labels Dec 24, 2021

parrt added this to the 4.9.4 milestone Dec 24, 2021

parrt reviewed Dec 24, 2021

tool/src/org/antlr/v4/automata/LexerATNFactory.java

parrt reviewed Dec 24, 2021

tool/src/org/antlr/v4/automata/LexerATNFactory.java

parrt reviewed Dec 24, 2021

runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java

KvanTTT added 2 commits December 25, 2021 16:54

Implement caseInsensitive option

d0c14f4

Remove caseInsensitive function argument where it's not necessary

8bf1039

KvanTTT mentioned this pull request Dec 25, 2021

Introduce MIX_OF_LOWER_AND_UPPER_CHAR_CASE_IN_RANGE warning #3433

Closed

KvanTTT added 3 commits December 25, 2021 22:43

Add RANGE_PROBABLY_CONTAINS_NOT_IMPLIED_CHARACTERS warning, refactor …

7fa59a4

…code related to caseInsensitive option fixes antlr#3433

Allow \r\n for runtime test descriptor files (Windows)

a8ef690

Add ij_java_else_on_new_line = true to .editorconfig

2f15b9a

Fix `else` formatting for current pull request

KvanTTT force-pushed the case-insensitive-option branch from 23406e5 to 2f15b9a Compare December 25, 2021 19:53

parrt merged commit 7bc8257 into antlr:master Dec 25, 2021

KvanTTT deleted the case-insensitive-option branch December 25, 2021 20:39

This was referenced Dec 25, 2021

ORCL PL/SQL grammar converted to case-insensitive; lexer simplified; antlr/grammars-v4#2400

Closed

Ignore case new syntax #1002

Closed

piacenti mentioned this pull request May 20, 2022

Implement new caseInsensitive Options Strumenta/antlr-kotlin#79

Closed

xerial mentioned this pull request Jun 2, 2022

Update antlr4, antlr4-runtime to 4.10.1 wvlet/airframe#2216

Merged

posth mentioned this pull request Jul 19, 2022

Incorporate the >4.10 release tunnelvisionlabs/antlr4ts#535

Open

Commodore68 mentioned this pull request Sep 3, 2022

Remove broken case insensitive lexing doc links from grammar readmes antlr/grammars-v4#2795

Closed

br0nstein mentioned this pull request Oct 13, 2022

Various fixes to the lexer generation ftomassetti/JavaCC2ANTLR#8

Merged

adangel mentioned this pull request Feb 16, 2023

Add support for T-SQL using Antlr4 lexer pmd/pmd#4390

Merged

LuciferYang mentioned this pull request Sep 24, 2023

[SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1 apache/spark#43075

Closed

drewdaemon mentioned this pull request Mar 7, 2024

[ES|QL] use lexer from elasticsearch elastic/kibana#178257

Merged
