Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new CharStream that converts the symbols to upper or lower case. #2046

Closed
wants to merge 4 commits into from

Conversation

bramp
Copy link
Contributor

@bramp bramp commented Oct 6, 2017

This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at antlr/antlr4test-maven-plugin#1

It would be used like so:

input, _ := antlr.NewFileStream("filename")

in = antlr.NewCaseChangingStream(is, true) // true forces upper case symbols, false forces lower case.

lexer := parser.NewAbnfLexer(in)

While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

I would love for the grammar to have a option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

Copy link
Member

@KvanTTT KvanTTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you also add such streams for other runtimes (Java, C#, JavaScript, Python2, Python3, C++, Swift). At least for a few of them.

return in
}
if is.upper {
return int(unicode.ToUpper(rune(in)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to convert case at the beginning one time and use upper/lower values from a prepared array for the performance reason? LA function can be called multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could cache the result, but I didn't want this to work by reading the full input up front, because this data could be streamed, or extremely large.

However, I don't think caching will save much, as this is quite a cheap operation for the CPU. But I'm happy to benchmark it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it. Maybe it's right.

@bramp
Copy link
Contributor Author

bramp commented Oct 6, 2017

Sure, I'd be happy to contribute, Java, C++ and Python[23] (I don't know / have access to C# or Swift). Let's resolve this one first, then I'll work on the other languages.

@bramp
Copy link
Contributor Author

bramp commented Oct 7, 2017

I just added a similar version for Java #2048

@parrt
Copy link
Member

parrt commented Oct 21, 2017

hi gang. How is this different than simply running tolower() or whatever on the input string?

@bramp
Copy link
Contributor Author

bramp commented Oct 21, 2017

In my example, I want to use a NewFileStream, which gives me no opportunity to tolower the input. Basically this just wraps an existing stream and does that for me.

@parrt
Copy link
Member

parrt commented Oct 21, 2017

It seems to me that what one really wants is insensitivity during the lexing process but to leave the actual input as is, right?

@bramp
Copy link
Contributor Author

bramp commented Oct 21, 2017

Sure, if the lexer could treat the input in a case insensitive way, that would be ideal. I don't like sticking an extra stream in front, I'd rather the author of the lexer hide this detail from me. But currently there are multiple grammars that require lower or upper case input, and this helps me use them.

Alternatively as a step in that direction, this stream could be kept inside the lexer, and put there by antlr during the code generation.

@parrt
Copy link
Member

parrt commented Oct 21, 2017

It seems like we should simply tweak LA() so input.LA(1) returns toupper or change getExistingTargetState() and computeTargetState() and any other places that test characters in the LexerATNSimulator. of course that is a lot scarier to change

@bramp
Copy link
Contributor Author

bramp commented Oct 21, 2017

And there would be a change to the grammar file to indicate that input should always be uppercased? Because presumably some grammars would want to be case sensitive.

@parrt
Copy link
Member

parrt commented Oct 21, 2017

That's a good question. Is it a function of the grammar purely or could the same grammar be used in both case-sensitive and insensitive situations?

@bramp
Copy link
Contributor Author

bramp commented Oct 21, 2017

The samples I've seen on https://github.com/antlr/grammars-v4 have all been written assuming the input would all be one case, and not something you'd want to selectively turn on and off. They do this because it makes the grammar easier to write. So in those cases it would be a function of the grammar. However, I can't speak for all exotic grammars.

@parrt
Copy link
Member

parrt commented Oct 21, 2017

Hmm...yeah,I can imagine it being a function of the grammar. That said, given my limited time I think we have a simple solution which is to manually alter the string going into the upper or lower case. The programmer can do this easily with their own wrapper class or a simple function call if they have the input as a string. In that case, I'm going to close. Sorry but thanks for the effort.

@bramp
Copy link
Contributor Author

bramp commented Oct 21, 2017

Sure, the programmer can write their own wrapper, but providing one in the core library makes that so much easier. When I first started using antlr, I took a pre-written grammar (the mysql one in fact), and it took me a while to realise I had to uppercase everything, then it took me another while to understand the Stream interface, and figure out how to adapt the LA function to do what I need... Lets avoid that burden on a first time user, and make this easy for them with this pre-written and tested wrapper class.

@parrt
Copy link
Member

parrt commented Oct 21, 2017

Well,I have a lot of targets that we need to be updated and documentation and tests and so on. Probably not worth it when I personally just call toupper after reading in a string and before passing to ANTLR.

@bramp
Copy link
Contributor Author

bramp commented Oct 21, 2017

Ok sounds reasonable. I am willing to provide implementations for some of the other targets, but not all of them. However, I'll maintain this code outside of the core library, until there is an appetite for it. Thanks.

@KvanTTT
Copy link
Member

KvanTTT commented Oct 30, 2017

@parrt, this stream save chars case. It only influences on lexer, but not on the parser or parse tree bypassing. It's roughly just to call "toupper" for an input string.

On my opinion, it's very demanded feature. See the list of issues in grammars and main repository:
antlr/grammars-v4#854
antlr/grammars-v4#878
antlr/grammars-v4#919

@parrt
Copy link
Member

parrt commented Oct 30, 2017

hiya. I can see how it's useful occasionally but it seems like we would need to do this for all targets not just Go before adding it. Maybe I can come up with a Java version to see how it feels.

@bramp
Copy link
Contributor Author

bramp commented Oct 30, 2017

I sent a PR for a Java implementation here: #2048

@KvanTTT
Copy link
Member

KvanTTT commented Oct 30, 2017

I've implemented such streams for C#, C# optimized, Go, Java, JavaScript, Python2, Python3 runtimes. See AntlrCaseInsensitiveInputStream in respective runtime directories: https://github.com/KvanTTT/DAGE/tree/master/AntlrGrammarEditor/Runtimes. С++ and Swift are lacking for now.

Also, grammar tests can be found here: https://github.com/antlr/antlr4test-maven-plugin/tree/master/src/test/resources

@KvanTTT
Copy link
Member

KvanTTT commented Oct 31, 2017

Yet another case insensitive issue: antlr/grammars-v4#921

@parrt
Copy link
Member

parrt commented Nov 1, 2017

Ok, so this wrapper implementation that @bramp has provided seems flexible actually. It works with existing streams but we just wrap that stream in an object that upper or lower case is things as needed without changing the actual stream. If we could bring other targets in line with #2048 and bring it all together as a PR, I think that would be acceptable.

@parrt
Copy link
Member

parrt commented Dec 4, 2017

btw, @sharwell used a stream approach which makes me think it's also not a hack. His approach does toUpperCase() once on copy of stream which is faster and also supports beta becoming two char 'SS'. Disadvantage: must replicate across streams and also that grammar doesn't indicate case-insensitivity.

@davesisson
Copy link
Contributor

davesisson commented Dec 4, 2017 via email

@parrt
Copy link
Member

parrt commented Dec 4, 2017

yep, streams leave text alone (in functionality) but make the lexer think everything is upper or lower.

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2017

Given a unicode string, shouldn't toUpperCase() (without locale) work correctly?

The Character.toUpperCase('а')); is not depend on the users locale.

Disadvantage: must replicate across streams and also that grammar doesn't indicate case-insensitivity.

Other runtimes also have a culture-invariant character case changing methods. I checked it at least for Java, C#, JavaScript, Go (I can attach confirming links later). We also can introduce case insensitive option in grammar and use such streams if it's set up.

Maybe it makes sense to add not only case sensitivity option, but also cultureor locale option (invariant by default, Russian ru, Turkish tr, Deutch de etc.) . But I'm afraid it may not work across all runtimes.

btw, @sharwell used a stream approach which makes me think it's also not a hack.

Yes, this is a working solution :)

@parrt
Copy link
Member

parrt commented Dec 4, 2017

So Character.toUpperCase(x)) will work for any unicode x? Why do they support a locale I wonder?

@davesisson
Copy link
Contributor

davesisson commented Dec 4, 2017 via email

@parrt
Copy link
Member

parrt commented Dec 4, 2017

Yeah, but the unicode dotted i should upper case properly as it's not the english i right? toUpperCase should handle it

@parrt
Copy link
Member

parrt commented Dec 4, 2017

Actually, doesn't work. Turkish reuses english 'i' I think. i get same upper case char here when it should be two diff ones.

class T {
     public static void main(String []args) {
        System.out.printf("%d\n", (int)Character.toUpperCase('\u0069'));
        System.out.printf("%d\n", (int)Character.toUpperCase('\u0131'));
     }
}
beast:/tmp $ javac T.java; java T
73
73

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2017

So Character.toUpperCase(x)) will work for any unicode x? Why do they support a locale I wonder?

The Locale influence only on string toUpperCase method, but not on the character toUpperCase method (see my code above). I can prepare a table for all Unicode characters.

For an example where uppercasing will fail, try the Turkish dotted i:

It's failed because of culture dependece, see https://ideone.com/IvtDOd

Invariant: 
UPPER_CHAR(i) == I
UPPER_STR(i) == I

Turkish: 
LOWER_CHAR(I) == i
LOWER_STR(I) == ı
UPPER_CHAR(i) == I
UPPER_STR(i) == İ

Deutsch: 
UPPER_CHAR(ß) == ß
UPPER_STR(ß) == SS

Russian: 
UPPER_CHAR(а) == А
UPPER_STR(а) == А

@parrt
Copy link
Member

parrt commented Dec 4, 2017

thanks! So it would seem option { caseInsensitive=TR; } or whatever would be needed, not just case insensitive. that makes sense as grammar should indicate the exact characters/sets it is matching. A stream approach hides all of that and makes it easy to mismatch stream / grammar I'd say.

would we want the lower char or lower string version? ß or SS?

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2017

Maybe something like this?

options
{
   caseInsensitive = true; // or false by default
   culture = TR; // by default en or invariant but other options are also available
}

In theory, culture can be also used in other cases.

@parrt
Copy link
Member

parrt commented Dec 4, 2017

I wonder if the simpler solution is just have antlr's IDE alter a rule inplace or have a trivial script that does it.

@parrt
Copy link
Member

parrt commented Dec 4, 2017

Referring to Mike's:

fragment A: 'A' | 'a';
fragment B: 'B' | 'b';
...
KEYWORD1: K E Y W O R D '1';

All that is really needed is the fragment rules because:

KEYWORD1: K E Y W O R D '1';

is no more burden than

KEYWORD1: 'keyword1' ;

I would say. I can easily add the fragment generator to the IDE and/or we can simply add that list to the doc for cut/paste. And we can refer people to these PR for streams and include them in the doc if we want. I think that satisfies everything. I just can't see beta to two chars right + Turkish etc...

@KvanTTT
Copy link
Member

KvanTTT commented Dec 4, 2017

KEYWORD1: K E Y W O R D '1';

A new unreachable token warning from #2066 won't work with such fragment rules.

Ok, you have the freedom to do what you think right. But I won't transform tokens in existing T-SQL, PL/SQL, and other case insensitive grammars. Maybe somebody else...

@parrt
Copy link
Member

parrt commented Dec 4, 2017

Ok, I'll sleep on it. I think maybe doc is the right way to go on this rather than enforcing a specific vision in the tool.

@mike-lischke
Copy link
Member

mike-lischke commented Dec 5, 2017

A word about locale: don't mix that with charset. A charset describes a mapping of numbers to charactes. Unicode itself is a charset (however, usually ones speaks of "encoding" instead). A locale is a collection of rules that describe various aspects about a culture/language and has nothing to do how characters are mapped. A locale determines time + number formats, sorting rules (collation) and others (here's the C++11 locale description with more details). This is what makes this so complicated with Unicode. And my question (and suggestion) about focusing on keywords tried to reflect that.

By focusing on keywords we don't need to worry about locales at all. We don't touch text that doesn't need to be case folded. We usually don't have trouble with special cases like the turkish I or the german ß (which is called: ess-zet or sharp s btw.), since they rarely are part of keywords. In general we can simplify things a lot by just focusing on keywords. We don't need comments or strings to be case insensitive, as we never have to match them in a lexer. Converting them in the input stream is just a big waste of CPU cycles and memory slots and therefore also Sam's solution is not good either.

@KvanTTT: by using a new caseInsensitive option you wouldn't need to have letter fragments. The lexer internally would handle that. As I wrote before it would probably boil down to implicitly extending the interval sets used to match a character. So, from an implementation standpoint it should be relatively easy to do (and can be done fully in the Java code, no target involved then).

@parrt One thing remains to be clarified then: what is a keyword? That's a weak point in my approach. It's tricky to say this is a keyword and that is not, in a lexer grammar. You would need to limit where the new option can be applied to. In any case, I would leave out culture/locale processing, as ANTLR is not an NLP tool, but made for programming languages, where some common rules exist (like: keywords use the ASCII charset) that simplify processing. So, one way could be to say: case insensitivity can only be applied to lexer rules that only match ASCII input. That should still cover most usage scenarios, but keeps implementation work small.

@parrt
Copy link
Member

parrt commented Dec 5, 2017

Thanks everybody! I think I'm going to try out the documentation cut/paste angle ala @mike-lischke's approach and send in a pull request that incorporates streams in the doc provided by the other folks as options. then we can comment on that before I pull it in.

@parrt parrt mentioned this pull request Dec 5, 2017
@parrt
Copy link
Member

parrt commented Dec 5, 2017

For your review gentlemen, #2146 I have tried to incorporate with cherry picking the resources created by @bramp so that he can get credit for the contribution. It is a purely documentation based solution to this problem but requires no source code changes to ANTLR etc...

@balahsen
Copy link

balahsen commented Dec 6, 2017

Hi,
I agree with mike-lischke:
It's about lexer characters for reserved (or key) word for some grammar that is case insensitive like Oracle (PL/SQL).
The lexer should accept upper or lower key word without syntax error for case insensitive grammar.
it's very important: do not change case of client code ( the parsed code).

@parrt
Copy link
Member

parrt commented Dec 6, 2017

Closing As I'm going the documentation route #2146

@KvanTTT
Copy link
Member

KvanTTT commented Sep 12, 2021

@mike-lischke

Converting them in the input stream is just a big waste of CPU cycles and memory slots and therefore also Sam's solution is not good either.

I've made a benchmark (C# Runtime) for comparison of grammar and runtime case comparison approaches. It shows that grammar based approach a bit faster:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.17763.2114 (1809/October2018Update/Redstone5)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
  DefaultJob : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT


|                     Method |     Mean |     Error |    StdDev | Ratio |
|--------------------------- |---------:|----------:|----------:|------:|
|   CaseInsensitiveLexerTest | 6.769 ms | 0.0282 ms | 0.0236 ms |  1.00 |
| CaseInsensitiveGrammarTest | 6.258 ms | 0.1018 ms | 0.0903 ms |  0.92 |

But, if grammar approach is used, the generated parser is a bit bigger.

So, one way could be to say: case insensitivity can only be applied to lexer rules that only match ASCII input.

It's not completely true. I know well-known language in Russia (1С) that have case insensitive cyrillic lexemes (А а Б б В в ...). But I've found a simple solution for not dealing with complex locale case comparison: just do not change case if length of resulting char (actually string) more than 1 or use culture inveriant case changing if possible. It keeps equal length of input and normalized string. For instance, the following string:

A a ß Д д

is being transformed to

A A ß Д Д

You can take a look at my implementation.

Python, JavaScript, PHP require one-length character check, but C#, Java, Go, Dart do correct transfomation using system functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants