-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new CharStream that converts the symbols to upper or lower case. #2046
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if you also add such streams for other runtimes (Java, C#, JavaScript, Python2, Python3, C++, Swift). At least for a few of them.
return in | ||
} | ||
if is.upper { | ||
return int(unicode.ToUpper(rune(in))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to convert case at the beginning one time and use upper/lower values from a prepared array for the performance reason? LA
function can be called multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could cache the result, but I didn't want this to work by reading the full input up front, because this data could be streamed, or extremely large.
However, I don't think caching will save much, as this is quite a cheap operation for the CPU. But I'm happy to benchmark it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it. Maybe it's right.
Sure, I'd be happy to contribute, Java, C++ and Python[23] (I don't know / have access to C# or Swift). Let's resolve this one first, then I'll work on the other languages. |
I just added a similar version for Java #2048 |
hi gang. How is this different than simply running tolower() or whatever on the input string? |
In my example, I want to use a NewFileStream, which gives me no opportunity to tolower the input. Basically this just wraps an existing stream and does that for me. |
It seems to me that what one really wants is insensitivity during the lexing process but to leave the actual input as is, right? |
Sure, if the lexer could treat the input in a case insensitive way, that would be ideal. I don't like sticking an extra stream in front, I'd rather the author of the lexer hide this detail from me. But currently there are multiple grammars that require lower or upper case input, and this helps me use them. Alternatively as a step in that direction, this stream could be kept inside the lexer, and put there by antlr during the code generation. |
It seems like we should simply tweak |
And there would be a change to the grammar file to indicate that input should always be uppercased? Because presumably some grammars would want to be case sensitive. |
That's a good question. Is it a function of the grammar purely or could the same grammar be used in both case-sensitive and insensitive situations? |
The samples I've seen on https://github.com/antlr/grammars-v4 have all been written assuming the input would all be one case, and not something you'd want to selectively turn on and off. They do this because it makes the grammar easier to write. So in those cases it would be a function of the grammar. However, I can't speak for all exotic grammars. |
Hmm...yeah,I can imagine it being a function of the grammar. That said, given my limited time I think we have a simple solution which is to manually alter the string going into the upper or lower case. The programmer can do this easily with their own wrapper class or a simple function call if they have the input as a string. In that case, I'm going to close. Sorry but thanks for the effort. |
Sure, the programmer can write their own wrapper, but providing one in the core library makes that so much easier. When I first started using antlr, I took a pre-written grammar (the mysql one in fact), and it took me a while to realise I had to uppercase everything, then it took me another while to understand the Stream interface, and figure out how to adapt the LA function to do what I need... Lets avoid that burden on a first time user, and make this easy for them with this pre-written and tested wrapper class. |
Well,I have a lot of targets that we need to be updated and documentation and tests and so on. Probably not worth it when I personally just call toupper after reading in a string and before passing to ANTLR. |
Ok sounds reasonable. I am willing to provide implementations for some of the other targets, but not all of them. However, I'll maintain this code outside of the core library, until there is an appetite for it. Thanks. |
@parrt, this stream save chars case. It only influences on lexer, but not on the parser or parse tree bypassing. It's roughly just to call "toupper" for an input string. On my opinion, it's very demanded feature. See the list of issues in grammars and main repository: |
hiya. I can see how it's useful occasionally but it seems like we would need to do this for all targets not just Go before adding it. Maybe I can come up with a Java version to see how it feels. |
I sent a PR for a Java implementation here: #2048 |
I've implemented such streams for C#, C# optimized, Go, Java, JavaScript, Python2, Python3 runtimes. See Also, grammar tests can be found here: https://github.com/antlr/antlr4test-maven-plugin/tree/master/src/test/resources |
Yet another case insensitive issue: antlr/grammars-v4#921 |
Ok, so this wrapper implementation that @bramp has provided seems flexible actually. It works with existing streams but we just wrap that stream in an object that upper or lower case is things as needed without changing the actual stream. If we could bring other targets in line with #2048 and bring it all together as a PR, I think that would be acceptable. |
btw, @sharwell used a stream approach which makes me think it's also not a hack. His approach does toUpperCase() once on copy of stream which is faster and also supports beta becoming two char 'SS'. Disadvantage: must replicate across streams and also that grammar doesn't indicate case-insensitivity. |
How would the stream approach work with embedded string content (I'm
presuming you want to preserve case in text)?
…On Mon, Dec 4, 2017 at 9:36 AM Terence Parr ***@***.***> wrote:
btw, @sharwell <https://github.com/sharwell> used a stream approach which
makes me think it's also not a hack. His approach does toUpperCase() once
on copy of stream which is faster and also supports beta becoming two char
'SS'. Disadvantage: must replicate across streams and also that grammar
doesn't indicate case-insensitivity.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2046 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ALUC_SpBDQFMb7pKtiHI3Qk0qU1cSpFTks5s9C2sgaJpZM4Pwmzt>
.
|
yep, streams leave text alone (in functionality) but make the lexer think everything is upper or lower. |
The
Other runtimes also have a culture-invariant character case changing methods. I checked it at least for Java, C#, JavaScript, Go (I can attach confirming links later). We also can introduce case insensitive option in grammar and use such streams if it's set up. Maybe it makes sense to add not only case sensitivity option, but also
Yes, this is a working solution :) |
So |
For an example where uppercasing will fail, try the Turkish dotted i:
http://www.i18nguy.com/unicode/turkish-i18n.html
…On Mon, Dec 4, 2017 at 9:49 AM Terence Parr ***@***.***> wrote:
So Character.toUpperCase(x)) will work for any unicode x? Why do they
support a locale I wonder?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2046 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ALUC_SV2I4wJUtikC-ZUe-EeSaOCB1T4ks5s9DCegaJpZM4Pwmzt>
.
|
Yeah, but the unicode dotted i should upper case properly as it's not the english i right? toUpperCase should handle it |
Actually, doesn't work. Turkish reuses english 'i' I think. i get same upper case char here when it should be two diff ones.
|
The Locale influence only on string
It's failed because of culture dependece, see https://ideone.com/IvtDOd
|
thanks! So it would seem would we want the lower char or lower string version? ß or SS? |
Maybe something like this?
In theory, culture can be also used in other cases. |
I wonder if the simpler solution is just have antlr's IDE alter a rule inplace or have a trivial script that does it. |
Referring to Mike's:
All that is really needed is the fragment rules because:
is no more burden than
I would say. I can easily add the fragment generator to the IDE and/or we can simply add that list to the doc for cut/paste. And we can refer people to these PR for streams and include them in the doc if we want. I think that satisfies everything. I just can't see beta to two chars right + Turkish etc... |
KEYWORD1: K E Y W O R D '1'; A new unreachable token warning from #2066 won't work with such fragment rules. Ok, you have the freedom to do what you think right. But I won't transform tokens in existing T-SQL, PL/SQL, and other case insensitive grammars. Maybe somebody else... |
Ok, I'll sleep on it. I think maybe doc is the right way to go on this rather than enforcing a specific vision in the tool. |
A word about By focusing on keywords we don't need to worry about locales at all. We don't touch text that doesn't need to be case folded. We usually don't have trouble with special cases like the turkish I or the german @KvanTTT: by using a new @parrt One thing remains to be clarified then: what is a keyword? That's a weak point in my approach. It's tricky to say this is a keyword and that is not, in a lexer grammar. You would need to limit where the new option can be applied to. In any case, I would leave out culture/locale processing, as ANTLR is not an NLP tool, but made for programming languages, where some common rules exist (like: keywords use the ASCII charset) that simplify processing. So, one way could be to say: case insensitivity can only be applied to lexer rules that only match ASCII input. That should still cover most usage scenarios, but keeps implementation work small. |
Thanks everybody! I think I'm going to try out the documentation cut/paste angle ala @mike-lischke's approach and send in a pull request that incorporates streams in the doc provided by the other folks as options. then we can comment on that before I pull it in. |
Hi, |
Closing As I'm going the documentation route #2146 |
I've made a benchmark (C# Runtime) for comparison of grammar and runtime case comparison approaches. It shows that grammar based approach a bit faster:
But, if grammar approach is used, the generated parser is a bit bigger.
It's not completely true. I know well-known language in Russia (1С) that have case insensitive cyrillic lexemes (А а Б б В в ...). But I've found a simple solution for not dealing with complex locale case comparison: just do not change case if length of resulting char (actually string) more than 1 or use culture inveriant case changing if possible. It keeps equal length of input and normalized string. For instance, the following string:
is being transformed to
You can take a look at my implementation. Python, JavaScript, PHP require one-length character check, but C#, Java, Go, Dart do correct transfomation using system functions. |
This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at antlr/antlr4test-maven-plugin#1
It would be used like so:
While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.
I would love for the grammar to have a option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).