diff --git a/doc/case-insensitive-lexing.md b/doc/case-insensitive-lexing.md deleted file mode 100644 index 043de7eb479..00000000000 --- a/doc/case-insensitive-lexing.md +++ /dev/null @@ -1,80 +0,0 @@ -# Case-Insensitive Lexing - -In some languages, keywords are case insensitive meaning that `BeGiN` means the same thing as `begin` or `BEGIN`. ANTLR has two mechanisms to support building grammars for such languages: - -1. Build lexical rules that match either upper or lower case. - * **Advantage**: no changes required to ANTLR, makes it clear in the grammar that the language in this case insensitive. - * **Disadvantage**: might have a small efficiency cost and grammar is a more verbose and more of a hassle to write. - -2. Build lexical rules that match keywords in all uppercase and then parse with a custom [character stream](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/CharStream.java) that converts all characters to uppercase before sending them to the lexer (via the `LA()` method). Care must be taken not to convert all characters in the stream to uppercase because characters within strings and comments should be unaffected. All we really want is to trick the lexer into thinking the input is all uppercase. - * **Advantage**: Could have a speed advantage depending on implementation, no change required to the grammar. - * **Disadvantage**: Requires that the case-insensitive stream and grammar are used in correctly in conjunction with each other, makes all characters appear as uppercase/lowercase to the lexer but some grammars are case sensitive outside of keywords, errors new case insensitive streams and language output targets (java, C#, C++, ...). - -For the 4.7.1 release, we discussed both approaches in [detail](https://github.com/antlr/antlr4/pull/2046) and even possibly altering the ANTLR metalanguage to directly support case-insensitive lexing. 
We discussed including the case insensitive streams into the runtime but not all would be immediately supported. I decided to simply make documentation that clearly states how to handle this and include the appropriate snippets that people can cut-and-paste into their grammars. - -## Case-insensitive grammars - -As a prime example of a grammar that specifically describes case insensitive keywords, see the -[SQLite grammar](https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4). To match a case insensitive keyword, there are rules such as - -``` -K_UPDATE : U P D A T E; -``` - -that will match `UpdaTE` and `upDATE` etc... as the `update` keyword. This rule makes use of some generically useful fragment rules that you can cut-and-paste into your grammars: - -``` -fragment A : [aA]; // match either an 'a' or 'A' -fragment B : [bB]; -fragment C : [cC]; -fragment D : [dD]; -fragment E : [eE]; -fragment F : [fF]; -fragment G : [gG]; -fragment H : [hH]; -fragment I : [iI]; -fragment J : [jJ]; -fragment K : [kK]; -fragment L : [lL]; -fragment M : [mM]; -fragment N : [nN]; -fragment O : [oO]; -fragment P : [pP]; -fragment Q : [qQ]; -fragment R : [rR]; -fragment S : [sS]; -fragment T : [tT]; -fragment U : [uU]; -fragment V : [vV]; -fragment W : [wW]; -fragment X : [xX]; -fragment Y : [yY]; -fragment Z : [zZ]; -``` - -No special streams are required to use this mechanism for case insensitivity. - -## Custom character streams approach - -The other approach is to use lexical rules that match either all uppercase or all lowercase, such as: - -``` -K_UPDATE : 'UPDATE'; -``` - -Then, when creating the character stream to parse from, we need a custom class that overrides methods used by the lexer. 
Below you will find custom character streams for a number of the targets that you can copy into your projects, but here is how to use the streams in Java as an example: - -```java -CharStream s = CharStreams.fromPath(Paths.get("test.sql")); -CaseChangingCharStream upper = new CaseChangingCharStream(s, true); -Lexer lexer = new SomeSQLLexer(upper); -``` - -Here are implementations of `CaseChangingCharStream` in various target languages: - -* [C#](https://github.com/antlr/antlr4/blob/master/doc/resources/CaseChangingCharStream.cs) -* [Dart](https://github.com/antlr/antlr4/blob/master/doc/resources/case_changing_char_stream.dart) -* [Go](https://github.com/antlr/antlr4/blob/master/doc/resources/case_changing_stream.go) -* [Java](https://github.com/antlr/antlr4/blob/master/doc/resources/CaseChangingCharStream.java) -* [JavaScript](https://github.com/antlr/antlr4/blob/master/doc/resources/CaseChangingStream.js) -* [Python2/3](https://github.com/antlr/antlr4/blob/master/doc/resources/CaseChangingStream.py) diff --git a/doc/options.md b/doc/options.md index 7ce277551ef..b0471f3b63c 100644 --- a/doc/options.md +++ b/doc/options.md @@ -12,7 +12,10 @@ where a value can be an identifier, a qualified identifier (for example, a.b.c), All grammars can use the following options. In combined grammars, all options except language pertain only to the generated parser. Options may be set either within the grammar file using the options syntax (described above) or when invoking ANTLR on the command line, using the `-D` option. (see Section 15.9, [ANTLR Tool Command Line Options](tool-options.md).) The following examples demonstrate both mechanisms; note that `-D` overrides options within the grammar. -* `superClass`. Set the superclass of the generated parser or lexer. For combined grammars, it sets the superclass of the parser. +### `superClass` + +Set the superclass of the generated parser or lexer. For combined grammars, it sets the superclass of the parser. 
+ ``` $ cat Hi.g4 grammar Hi; @@ -23,12 +26,20 @@ public class HiParser extends XX { $ grep 'public class' HiLexer.java public class HiLexer extends Lexer { ``` -* `language` Generate code in the indicated language, if ANTLR is able to do so. Otherwise, you will see an error message like this: + +### `language` + +Generate code in the indicated language, if ANTLR is able to do so. Otherwise, you will see an error message like this: + ``` $ antlr4 -Dlanguage=C MyGrammar.g4 error(31): ANTLR cannot generate C code as of version 4.0 ``` -* `tokenVocab` ANTLR assigns token type numbers to the tokens as it encounters them in a file. To use different token type values, such as with a separate lexer, use this option to have ANTLR pull in the tokens file. ANTLR generates a tokens file from each grammar. + +### `tokenVocab` + +ANTLR assigns token type numbers to the tokens as it encounters them in a file. To use different token type values, such as with a separate lexer, use this option to have ANTLR pull in the tokens file. ANTLR generates a tokens file from each grammar. + ``` $ cat SomeLexer.g4 lexer grammar SomeLexer; @@ -48,7 +59,11 @@ B=3 C=4 ID=1 ``` -* `TokenLabelType` ANTLR normally uses type Token when it generates variables referencing tokens. If you have passed a TokenFactory to your parser and lexer so that they create custom tokens, you should set this option to your specific type. This ensures that the context objects know your type for fields and method return values. + +### `TokenLabelType` + +ANTLR normally uses type Token when it generates variables referencing tokens. If you have passed a TokenFactory to your parser and lexer so that they create custom tokens, you should set this option to your specific type. This ensures that the context objects know your type for fields and method return values. + ``` $ cat T2.g4 grammar T2; @@ -58,8 +73,40 @@ $ antlr4 T2.g4 $ grep MyToken T2Parser.java public MyToken x; ``` -* `contextSuperClass`. 
Specify the super class of parse tree internal nodes. Default is `ParserRuleContext`. Should derive from ultimately `RuleContext` at minimum. -Java target can use `contextSuperClass=org.antlr.v4.runtime.RuleContextWithAltNum` for convenience. It adds a backing field for `altNumber`, the alt matched for the associated rule node. + +### `contextSuperClass` + +Specify the super class of parse tree internal nodes. Default is `ParserRuleContext`. Should ultimately derive from `RuleContext` at minimum. +The Java target can use `contextSuperClass=org.antlr.v4.runtime.RuleContextWithAltNum` for convenience. It adds a backing field for `altNumber`, the alt matched for the associated rule node. + +### `caseInsensitive` + +Ignore the character case of the input stream. + +The lexer generated from the following grammar: + +```g4 +lexer grammar L; +options { caseInsensitive = true; } +ENGLISH_TOKEN: [a-z]+; +GERMAN_TOKEN: [äéöüß]+; +FRENCH_TOKEN: [àâæ-ëîïôœùûüÿ]+; +CROATIAN_TOKEN: [ćčđšž]+; +ITALIAN_TOKEN: [àèéìòù]+; +SPANISH_TOKEN: [áéíñóúü¡¿]+; +GREEK_TOKEN: [α-ω]+; +RUSSIAN_TOKEN: [а-я]+; +WS: [ ]+ -> skip; +``` + +matches the following sequence of words: + +``` +abcXYZ äéöüßÄÉÖÜß àâæçÙÛÜŸ ćčđĐŠŽ àèéÌÒÙ áéÚÜ¡¿ αβγΧΨΩ абвЭЮЯ +``` + +ANTLR considers only one-to-one case mappings between single characters. +For instance, German lowercase `ß` is not treated as uppercase `SS`, and vice versa. ## Rule Options diff --git a/doc/resources/CaseChangingCharStream.cs b/doc/resources/CaseChangingCharStream.cs deleted file mode 100644 index 9f73a038264..00000000000 --- a/doc/resources/CaseChangingCharStream.cs +++ /dev/null @@ -1,105 +0,0 @@ -/* Copyright (c) 2012-2017 The ANTLR Project. All rights reserved. - * Use of this file is governed by the BSD 3-clause license that - * can be found in the LICENSE.txt file in the project root. 
- */ -using System; -using Antlr4.Runtime.Misc; - -namespace Antlr4.Runtime -{ - /// - /// This class supports case-insensitive lexing by wrapping an existing - /// and forcing the lexer to see either upper or - /// lowercase characters. Grammar literals should then be either upper or - /// lower case such as 'BEGIN' or 'begin'. The text of the character - /// stream is unaffected. Example: input 'BeGiN' would match lexer rule - /// 'BEGIN' if constructor parameter upper=true but getText() would return - /// 'BeGiN'. - /// - public class CaseChangingCharStream : ICharStream - { - private ICharStream stream; - private bool upper; - - /// - /// Constructs a new CaseChangingCharStream wrapping the given forcing - /// all characters to upper case or lower case. - /// - /// The stream to wrap. - /// If true force each symbol to upper case, otherwise force to lower. - public CaseChangingCharStream(ICharStream stream, bool upper) - { - this.stream = stream; - this.upper = upper; - } - - public int Index - { - get - { - return stream.Index; - } - } - - public int Size - { - get - { - return stream.Size; - } - } - - public string SourceName - { - get - { - return stream.SourceName; - } - } - - public void Consume() - { - stream.Consume(); - } - - [return: NotNull] - public string GetText(Interval interval) - { - return stream.GetText(interval); - } - - public int LA(int i) - { - int c = stream.LA(i); - - if (c <= 0) - { - return c; - } - - char o = (char)c; - - if (upper) - { - return (int)char.ToUpperInvariant(o); - } - - return (int)char.ToLowerInvariant(o); - } - - public int Mark() - { - return stream.Mark(); - } - - public void Release(int marker) - { - stream.Release(marker); - } - - public void Seek(int index) - { - stream.Seek(index); - } - } -} diff --git a/doc/resources/CaseChangingCharStream.java b/doc/resources/CaseChangingCharStream.java deleted file mode 100644 index d069d0188aa..00000000000 --- a/doc/resources/CaseChangingCharStream.java +++ /dev/null @@ 
-1,81 +0,0 @@ -package org.antlr.v4.runtime; - -import org.antlr.v4.runtime.misc.Interval; - -/** - * This class supports case-insensitive lexing by wrapping an existing - * {@link CharStream} and forcing the lexer to see either upper or - * lowercase characters. Grammar literals should then be either upper or - * lower case such as 'BEGIN' or 'begin'. The text of the character - * stream is unaffected. Example: input 'BeGiN' would match lexer rule - * 'BEGIN' if constructor parameter upper=true but getText() would return - * 'BeGiN'. - */ -public class CaseChangingCharStream implements CharStream { - - final CharStream stream; - final boolean upper; - - /** - * Constructs a new CaseChangingCharStream wrapping the given {@link CharStream} forcing - * all characters to upper case or lower case. - * @param stream The stream to wrap. - * @param upper If true force each symbol to upper case, otherwise force to lower. - */ - public CaseChangingCharStream(CharStream stream, boolean upper) { - this.stream = stream; - this.upper = upper; - } - - @Override - public String getText(Interval interval) { - return stream.getText(interval); - } - - @Override - public void consume() { - stream.consume(); - } - - @Override - public int LA(int i) { - int c = stream.LA(i); - if (c <= 0) { - return c; - } - if (upper) { - return Character.toUpperCase(c); - } - return Character.toLowerCase(c); - } - - @Override - public int mark() { - return stream.mark(); - } - - @Override - public void release(int marker) { - stream.release(marker); - } - - @Override - public int index() { - return stream.index(); - } - - @Override - public void seek(int index) { - stream.seek(index); - } - - @Override - public int size() { - return stream.size(); - } - - @Override - public String getSourceName() { - return stream.getSourceName(); - } -} diff --git a/doc/resources/CaseChangingStream.js b/doc/resources/CaseChangingStream.js deleted file mode 100644 index 3af1ad61277..00000000000 --- 
a/doc/resources/CaseChangingStream.js +++ /dev/null @@ -1,65 +0,0 @@ -// -/* Copyright (c) 2012-2017 The ANTLR Project. All rights reserved. - * Use of this file is governed by the BSD 3-clause license that - * can be found in the LICENSE.txt file in the project root. - */ -// - -function CaseChangingStream(stream, upper) { - this._stream = stream; - this._upper = upper; -} - -CaseChangingStream.prototype.LA = function(offset) { - var c = this._stream.LA(offset); - if (c <= 0) { - return c; - } - return String.fromCodePoint(c)[this._upper ? "toUpperCase" : "toLowerCase"]().codePointAt(0); -}; - -CaseChangingStream.prototype.reset = function() { - return this._stream.reset(); -}; - -CaseChangingStream.prototype.consume = function() { - return this._stream.consume(); -}; - -CaseChangingStream.prototype.LT = function(offset) { - return this._stream.LT(offset); -}; - -CaseChangingStream.prototype.mark = function() { - return this._stream.mark(); -}; - -CaseChangingStream.prototype.release = function(marker) { - return this._stream.release(marker); -}; - -CaseChangingStream.prototype.seek = function(_index) { - return this._stream.seek(_index); -}; - -CaseChangingStream.prototype.getText = function(start, stop) { - return this._stream.getText(start, stop); -}; - -CaseChangingStream.prototype.toString = function() { - return this._stream.toString(); -}; - -Object.defineProperty(CaseChangingStream.prototype, "index", { - get: function() { - return this._stream.index; - } -}); - -Object.defineProperty(CaseChangingStream.prototype, "size", { - get: function() { - return this._stream.size; - } -}); - -exports.CaseChangingStream = CaseChangingStream; diff --git a/doc/resources/CaseChangingStream.py b/doc/resources/CaseChangingStream.py deleted file mode 100644 index 6d2815de418..00000000000 --- a/doc/resources/CaseChangingStream.py +++ /dev/null @@ -1,13 +0,0 @@ -class CaseChangingStream(): - def __init__(self, stream, upper): - self._stream = stream - self._upper = upper - - 
def __getattr__(self, name): - return self._stream.__getattribute__(name) - - def LA(self, offset): - c = self._stream.LA(offset) - if c <= 0: - return c - return ord(chr(c).upper() if self._upper else chr(c).lower()) diff --git a/doc/resources/case_changing_char_stream.dart b/doc/resources/case_changing_char_stream.dart deleted file mode 100644 index b7133ded495..00000000000 --- a/doc/resources/case_changing_char_stream.dart +++ /dev/null @@ -1,64 +0,0 @@ -// @dart=2.12 - -import '../../runtime/Dart/lib/antlr4.dart'; -import '../../runtime/Dart/lib/src/interval_set.dart'; - -/// This class supports case-insensitive lexing by wrapping an existing -/// {@link CharStream} and forcing the lexer to see either upper or -/// lowercase characters. Grammar literals should then be either upper or -/// lower case such as 'BEGIN' or 'begin'. The text of the character -/// stream is unaffected. Example: input 'BeGiN' would match lexer rule -/// 'BEGIN' if constructor parameter upper=true but getText() would return -/// 'BeGiN'. -class CaseChangingCharStream extends CharStream { - final CharStream stream; - final bool upper; - - /// Constructs a new CaseChangingCharStream wrapping the given [stream] forcing - /// all characters to upper case or lower case depending on [upper]. - CaseChangingCharStream(this.stream, this.upper); - - @override - int? LA(int i) { - int? c = stream.LA(i); - if (c == null || c <= 0) { - return c; - } - String newCaseStr; - if (upper) { - newCaseStr = String.fromCharCode(c).toUpperCase(); - } else { - newCaseStr = String.fromCharCode(c).toLowerCase(); - } - // Skip changing case if length changes (e.g., ß -> SS). 
- if (newCaseStr.length != 1) { - return c; - } else { - return newCaseStr.codeUnitAt(0); - } - } - - @override - String get sourceName => stream.sourceName; - - @override - void consume() => stream.consume(); - - @override - String getText(Interval interval) => stream.getText(interval); - - @override - int get index => stream.index; - - @override - int mark() => stream.mark(); - - @override - void release(int marker) => stream.release(marker); - - @override - void seek(int index) => stream.seek(index); - - @override - int get size => stream.size; -} diff --git a/doc/resources/case_changing_stream.go b/doc/resources/case_changing_stream.go deleted file mode 100644 index 5b510fa3211..00000000000 --- a/doc/resources/case_changing_stream.go +++ /dev/null @@ -1,37 +0,0 @@ -package antlr_resource - -import ( - "unicode" - - "github.com/antlr/antlr4/runtime/Go/antlr" -) - -// CaseChangingStream wraps an existing CharStream, but upper cases, or -// lower cases the input before it is tokenized. -type CaseChangingStream struct { - antlr.CharStream - - upper bool -} - -// NewCaseChangingStream returns a new CaseChangingStream that forces -// all tokens read from the underlying stream to be either upper case -// or lower case based on the upper argument. -func NewCaseChangingStream(in antlr.CharStream, upper bool) *CaseChangingStream { - return &CaseChangingStream{in, upper} -} - -// LA gets the value of the symbol at offset from the current position -// from the underlying CharStream and converts it to either upper case -// or lower case. 
-func (is *CaseChangingStream) LA(offset int) int { - in := is.CharStream.LA(offset) - if in < 0 { - // Such as antlr.TokenEOF which is -1 - return in - } - if is.upper { - return int(unicode.ToUpper(rune(in))) - } - return int(unicode.ToLower(rune(in))) -} diff --git a/runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java b/runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java index 7aedfc44a6f..7024a667546 100644 --- a/runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java +++ b/runtime/Java/src/org/antlr/v4/runtime/atn/CodePointTransitions.java @@ -23,13 +23,8 @@ public abstract class CodePointTransitions { * If {@code codePoint} is <= U+FFFF, returns a new {@link AtomTransition}. * Otherwise, returns a new {@link SetTransition}. */ - public static Transition createWithCodePoint(ATNState target, int codePoint) { - if (Character.isSupplementaryCodePoint(codePoint)) { - return new SetTransition(target, IntervalSet.of(codePoint)); - } - else { - return new AtomTransition(target, codePoint); - } + public static Transition createWithCodePoint(ATNState target, int codePoint, boolean caseInsensitive) { + return createWithCodePointRange(target, codePoint, codePoint, caseInsensitive); } /** @@ -40,13 +35,30 @@ public static Transition createWithCodePoint(ATNState target, int codePoint) { public static Transition createWithCodePointRange( ATNState target, int codePointFrom, - int codePointTo) { - if (Character.isSupplementaryCodePoint(codePointFrom) || - Character.isSupplementaryCodePoint(codePointTo)) { - return new SetTransition(target, IntervalSet.of(codePointFrom, codePointTo)); - } - else { - return new RangeTransition(target, codePointFrom, codePointTo); + int codePointTo, + boolean caseInsensitive) { + if (caseInsensitive) { + int lowerCodePointFrom = Character.toLowerCase(codePointFrom); + int upperCodePointFrom = Character.toUpperCase(codePointFrom); + int lowerCodePointTo = Character.toLowerCase(codePointTo); + int 
upperCodePointTo = Character.toUpperCase(codePointTo); + if (lowerCodePointFrom == upperCodePointFrom && lowerCodePointTo == upperCodePointTo) { + return createWithCodePointRange(target, lowerCodePointFrom, lowerCodePointTo, false); + } else { + IntervalSet intervalSet = new IntervalSet(); + intervalSet.add(lowerCodePointFrom, lowerCodePointTo); + intervalSet.add(upperCodePointFrom, upperCodePointTo); + return new SetTransition(target, intervalSet); + } + } else { + if (Character.isSupplementaryCodePoint(codePointFrom) || + Character.isSupplementaryCodePoint(codePointTo)) { + return new SetTransition(target, IntervalSet.of(codePointFrom, codePointTo)); + } else { + return codePointFrom == codePointTo + ? new AtomTransition(target, codePointFrom) + : new RangeTransition(target, codePointFrom, codePointTo); + } } } } diff --git a/tool-testsuite/test/org/antlr/v4/test/tool/TestATNLexerInterpreter.java b/tool-testsuite/test/org/antlr/v4/test/tool/TestATNLexerInterpreter.java index 4cd34deec66..5b4d985ac73 100644 --- a/tool-testsuite/test/org/antlr/v4/test/tool/TestATNLexerInterpreter.java +++ b/tool-testsuite/test/org/antlr/v4/test/tool/TestATNLexerInterpreter.java @@ -380,6 +380,109 @@ public void testSetUp() throws Exception { checkLexerMatches(lg, "a", expecting); } + @Test public void testLexerCaseInsensitive() throws Exception { + LexerGrammar lg = new LexerGrammar( + "lexer grammar L;\n" + + "\n" + + "options { caseInsensitive = true; }\n" + + "\n" + + "WS: [ \\t\\r\\n] -> skip;\n" + + "\n" + + "SIMPLE_TOKEN: 'and';\n" + + "TOKEN_WITH_SPACES: 'as' 'd' 'f';\n" + + "TOKEN_WITH_DIGITS: 'INT64';\n" + + "TOKEN_WITH_UNDERSCORE: 'TOKEN_WITH_UNDERSCORE';\n" + + "BOOL: 'true' | 'FALSE';\n" + + "SPECIAL: '==';\n" + + "SET: [a-z0-9]+;\n" + // [a-zA-Z0-9] + "RANGE: ('а'..'я')+;" + ); + + String inputString = + "and AND aND\n" + + "asdf ASDF\n" + + "int64\n" + + "token_WITH_underscore\n" + + "TRUE FALSE\n" + + "==\n" + + "A0bcDE93\n" + + "АБВабв\n"; + + String expecting = 
Utils.join(new String[] { + "SIMPLE_TOKEN", "SIMPLE_TOKEN", "SIMPLE_TOKEN", + "TOKEN_WITH_SPACES", "TOKEN_WITH_SPACES", + "TOKEN_WITH_DIGITS", + "TOKEN_WITH_UNDERSCORE", + "BOOL", "BOOL", + "SPECIAL", + "SET", + "RANGE", + "EOF" + }, + ", WS, "); + + checkLexerMatches(lg, inputString, expecting); + } + + @Test public void testLexerCaseInsensitiveWithNegation() throws Exception { + String grammar = + "lexer grammar L;\n" + + "options { caseInsensitive = true; }\n" + + "TOKEN_WITH_NOT: ~'f';\n"; // ~('f' | 'F') + execLexer("L.g4", grammar, "L", "F"); + + assertEquals("line 1:0 token recognition error at: 'F'\n", getParseErrors()); + } + + @Test public void testLexerCaseInsensitiveFragments() throws Exception { + LexerGrammar lg = new LexerGrammar( + "lexer grammar L;\n" + + "options { caseInsensitive = true; }\n" + + "TOKEN_0: FRAGMENT 'd'+;\n" + + "TOKEN_1: FRAGMENT 'e'+;\n" + + "FRAGMENT: 'abc';\n"); + + String inputString = + "ABCDDD"; + + String expecting = "TOKEN_0, EOF"; + + checkLexerMatches(lg, inputString, expecting); + } + + @Test public void testLexerCaseInsensitiveWithDifferentCultures() throws Exception { + // From http://www.periodni.com/unicode_utf-8_encoding.html + LexerGrammar lg = new LexerGrammar( + "lexer grammar L;\n" + + "options { caseInsensitive = true; }\n" + + "ENGLISH_TOKEN: [a-z]+;\n" + + "GERMAN_TOKEN: [äéöüß]+;\n" + + "FRENCH_TOKEN: [àâæ-ëîïôœùûüÿ]+;\n" + + "CROATIAN_TOKEN: [ćčđšž]+;\n" + + "ITALIAN_TOKEN: [àèéìòù]+;\n" + + "SPANISH_TOKEN: [áéíñóúü¡¿]+;\n" + + "GREEK_TOKEN: [α-ω]+;\n" + + "RUSSIAN_TOKEN: [а-я]+;\n" + + "WS: [ ]+ -> skip;" + ); + + String inputString = "abcXYZ äéöüßÄÉÖÜß àâæçÙÛÜŸ ćčđĐŠŽ àèéÌÒÙ áéÚÜ¡¿ αβγΧΨΩ абвЭЮЯ "; + + String expecting = Utils.join(new String[] { + "ENGLISH_TOKEN", + "GERMAN_TOKEN", + "FRENCH_TOKEN", + "CROATIAN_TOKEN", + "ITALIAN_TOKEN", + "SPANISH_TOKEN", + "GREEK_TOKEN", + "RUSSIAN_TOKEN", + "EOF" }, + ", WS, "); + + checkLexerMatches(lg, inputString, expecting); + } + protected void 
checkLexerMatches(LexerGrammar lg, String inputString, String expecting) { ATN atn = createATN(lg, true); CharStream input = CharStreams.fromString(inputString); diff --git a/tool-testsuite/test/org/antlr/v4/test/tool/TestParserExec.java b/tool-testsuite/test/org/antlr/v4/test/tool/TestParserExec.java index 0d0265d8267..8ebe137ad08 100644 --- a/tool-testsuite/test/org/antlr/v4/test/tool/TestParserExec.java +++ b/tool-testsuite/test/org/antlr/v4/test/tool/TestParserExec.java @@ -160,4 +160,25 @@ public void testSetUp() throws Exception { assertEquals("6\n", found); assertNull(getParseErrors()); } + + @Test public void testCaseInsensitiveInCombinedGrammar() throws Exception { + String grammar = + "grammar CaseInsensitiveGrammar;\n" + + "options { caseInsensitive = true; }\n" + + "e\n" + + " : ID\n" + + " | 'not' e\n" + + " | e 'and' e\n" + + " | 'new' ID '(' e ')'\n" + + " ;\n" + + "ID: [a-z_][a-z_0-9]*;\n" + + "WS: [ \\t\\n\\r]+ -> skip;"; + + String input = "NEW Abc (Not a AND not B)"; + execParser( + "CaseInsensitiveGrammar.g4", grammar, + "CaseInsensitiveGrammarParser", "CaseInsensitiveGrammarLexer", + null, null, "e", input, false); + assertNull(getParseErrors()); + } } diff --git a/tool-testsuite/test/org/antlr/v4/test/tool/TestParserInterpreter.java b/tool-testsuite/test/org/antlr/v4/test/tool/TestParserInterpreter.java index b424fc02df7..7b001f31901 100644 --- a/tool-testsuite/test/org/antlr/v4/test/tool/TestParserInterpreter.java +++ b/tool-testsuite/test/org/antlr/v4/test/tool/TestParserInterpreter.java @@ -330,6 +330,30 @@ public void testSetUp() throws Exception { testInterp(lg, g, "e", "a+a*a", "(e (e a) + (e (e a) * (e a)))"); } + @Test public void testCaseInsensitiveTokensInParser() throws Exception { + LexerGrammar lg = new LexerGrammar( + "lexer grammar L;\n" + + "options { caseInsensitive = true; }\n" + + "NOT: 'not';\n" + + "AND: 'and';\n" + + "NEW: 'new';\n" + + "LB: '(';\n" + + "RB: ')';\n" + + "ID: [a-z_][a-z_0-9]*;\n" + + "WS: [ \\t\\n\\r]+ -> 
skip;"); + Grammar g = new Grammar( + "parser grammar T;\n" + + "options { caseInsensitive = true; }\n" + + "e\n" + + " : ID\n" + + " | 'not' e\n" + + " | e 'and' e\n" + + " | 'new' ID '(' e ')'\n" + + " ;", lg); + + testInterp(lg, g, "e", "NEW Abc (Not a AND not B)", "(e NEW Abc ( (e (e Not (e a)) AND (e not (e B))) ))"); + } + ParseTree testInterp(LexerGrammar lg, Grammar g, String startRule, String input, String expectedParseTree) diff --git a/tool-testsuite/test/org/antlr/v4/test/tool/TestSymbolIssues.java b/tool-testsuite/test/org/antlr/v4/test/tool/TestSymbolIssues.java index ad53c0360a7..cb3100510a6 100644 --- a/tool-testsuite/test/org/antlr/v4/test/tool/TestSymbolIssues.java +++ b/tool-testsuite/test/org/antlr/v4/test/tool/TestSymbolIssues.java @@ -392,9 +392,9 @@ public void testLabelsForTokensWithMixedTypesLRWithoutLabels() { "TOKEN_RANGE_WITHOUT_COLLISION: '_' | [a-zA-Z];\n" + "TOKEN_RANGE_WITH_ESCAPED_CHARS: [\\n-\\r] | '\\n'..'\\r';", - "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:2:18: chars 'a'..'f' used multiple times in set [aa-f]\n" + - "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:3:18: chars 'D'..'J' used multiple times in set [A-FD-J]\n" + - "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:4:13: chars 'O'..'V' used multiple times in set 'Z' | 'K'..'R' | 'O'..'V'\n" + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:2:18: chars a-f used multiple times in set [aa-f]\n" + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:3:18: chars D-J used multiple times in set [A-FD-J]\n" + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:4:13: chars O-V used multiple times in set 'Z' | 'K'..'R' | 'O'..'V'\n" + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4::: chars 'g' used multiple times in set 'g'..'l'\n" + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4::: chars '\\n' used multiple times in set 
'\\n'..'\\r'\n" }; @@ -402,6 +402,22 @@ public void testLabelsForTokensWithMixedTypesLRWithoutLabels() { testErrors(test, false); } + @Test public void testCaseInsensitiveCharsCollision() throws Exception { + String[] test = { + "lexer grammar L;\n" + + "options { caseInsensitive = true; }\n" + + "TOKEN_RANGE: [a-fA-F0-9];\n" + + "TOKEN_RANGE_2: 'g'..'l' | 'G'..'L';\n", + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:3:18: chars a-f used multiple times in set [a-fA-F0-9]\n" + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:3:18: chars A-F used multiple times in set [a-fA-F0-9]\n" + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:4:13: chars g-l used multiple times in set 'g'..'l' | 'G'..'L'\n" + + "warning(" + ErrorType.CHARACTERS_COLLISION_IN_SET.code + "): L.g4:4:13: chars G-L used multiple times in set 'g'..'l' | 'G'..'L'\n" + }; + + testErrors(test, false); + } + @Test public void testUnreachableTokens() { String[] test = { "lexer grammar Test;\n" + @@ -436,4 +452,16 @@ public void testLabelsForTokensWithMixedTypesLRWithoutLabels() { testErrors(test, false); } + + @Test public void testIllegalModeOption() throws Exception { + String[] test = { + "lexer grammar L;\n" + + "options { caseInsensitive = badValue; }\n" + + "DEFAULT_TOKEN: [A-F]+;\n", + + "warning(" + ErrorType.ILLEGAL_OPTION_VALUE.code + "): L.g4:2:28: unsupported option value caseInsensitive=badValue\n" + }; + + testErrors(test, false); + } } diff --git a/tool/src/org/antlr/v4/automata/ATNOptimizer.java b/tool/src/org/antlr/v4/automata/ATNOptimizer.java index 6c720167dc1..a3f9a636854 100644 --- a/tool/src/org/antlr/v4/automata/ATNOptimizer.java +++ b/tool/src/org/antlr/v4/automata/ATNOptimizer.java @@ -124,11 +124,11 @@ private static void optimizeSets(Grammar g, ATN atn) { Transition newTransition; if (matchSet.getIntervals().size() == 1) { if (matchSet.size() == 1) { - newTransition = 
CodePointTransitions.createWithCodePoint(blockEndState, matchSet.getMinElement()); + newTransition = CodePointTransitions.createWithCodePoint(blockEndState, matchSet.getMinElement(), false); } else { Interval matchInterval = matchSet.getIntervals().get(0); - newTransition = CodePointTransitions.createWithCodePointRange(blockEndState, matchInterval.a, matchInterval.b); + newTransition = CodePointTransitions.createWithCodePointRange(blockEndState, matchInterval.a, matchInterval.b, false); } } else { diff --git a/tool/src/org/antlr/v4/automata/LexerATNFactory.java b/tool/src/org/antlr/v4/automata/LexerATNFactory.java index 44c9eacb119..2676cdaf231 100644 --- a/tool/src/org/antlr/v4/automata/LexerATNFactory.java +++ b/tool/src/org/antlr/v4/automata/LexerATNFactory.java @@ -76,6 +76,8 @@ public class LexerATNFactory extends ParserATNFactory { private List ruleCommands = new ArrayList(); + private boolean caseInsensitive; + /** * Maps from an action index to a {@link LexerAction} object. */ @@ -89,6 +91,8 @@ public LexerATNFactory(LexerGrammar g) { super(g); // use codegen to get correct language templates for lexer commands String language = g.getOptionString("language"); + String caseInsensitiveOption = g.getOptionString("caseInsensitive"); + caseInsensitive = caseInsensitiveOption != null && caseInsensitiveOption.equals("true"); CodeGenerator gen = new CodeGenerator(g.tool, null, language); codegenTemplates = gen.getTemplates(); } @@ -257,7 +261,7 @@ public Handle range(GrammarAST a, GrammarAST b) { int t1 = CharSupport.getCharValueFromGrammarCharLiteral(a.getText()); int t2 = CharSupport.getCharValueFromGrammarCharLiteral(b.getText()); checkRange(a, b, t1, t2); - left.addTransition(CodePointTransitions.createWithCodePointRange(right, t1, t2)); + left.addTransition(CodePointTransitions.createWithCodePointRange(right, t1, t2, caseInsensitive)); a.atnState = left; b.atnState = left; return new Handle(left, right); @@ -272,9 +276,8 @@ public Handle set(GrammarAST 
associatedAST, List alts, boolean inver
 			if ( t.getType()==ANTLRParser.RANGE ) {
 				int a = CharSupport.getCharValueFromGrammarCharLiteral(t.getChild(0).getText());
 				int b = CharSupport.getCharValueFromGrammarCharLiteral(t.getChild(1).getText());
-				if (checkRange((GrammarAST) t.getChild(0), (GrammarAST) t.getChild(1), a, b)) {
-					checkSetCollision(associatedAST, set, a, b);
-					set.add(a,b);
+				if (checkRange((GrammarAST)t.getChild(0), (GrammarAST)t.getChild(1), a, b)) {
+					checkRangeAndAddToSet(associatedAST, set, a, b, caseInsensitive);
 				}
 			}
 			else if ( t.getType()==ANTLRParser.LEXER_CHAR_SET ) {
@@ -283,8 +286,7 @@ else if ( t.getType()==ANTLRParser.LEXER_CHAR_SET ) {
 			else if ( t.getType()==ANTLRParser.STRING_LITERAL ) {
 				int c = CharSupport.getCharValueFromGrammarCharLiteral(t.getText());
 				if ( c != -1 ) {
-					checkSetCollision(associatedAST, set, c);
-					set.add(c);
+					checkCharAndAddToSet(associatedAST, set, c, caseInsensitive);
 				}
 				else {
 					g.tool.errMgr.grammarError(ErrorType.INVALID_LITERAL_IN_LEXER_SET,
@@ -303,7 +305,7 @@ else if ( t.getType()==ANTLRParser.TOKEN_REF ) {
 			Transition transition;
 			if (set.getIntervals().size() == 1) {
 				Interval interval = set.getIntervals().get(0);
-				transition = CodePointTransitions.createWithCodePointRange(right, interval.a, interval.b);
+				transition = CodePointTransitions.createWithCodePointRange(right, interval.a, interval.b, caseInsensitive);
 			}
 			else {
 				transition = new SetTransition(right, set);
@@ -340,6 +342,8 @@ protected boolean checkRange(GrammarAST leftNode, GrammarAST rightNode, int left
 	 * "fog" is treated as 'f' 'o' 'g' not as a single transition in
 	 * the DFA. Machine== o-'f'->o-'o'->o-'g'->o and has n+1 states
 	 * for n characters.
+	 * if "caseInsensitive" option is enabled, "fog" will be treated as
+	 * o-('f'|'F') -> o-('o'|'O') -> o-('g'|'G')
 	 */
 	@Override
 	public Handle stringLiteral(TerminalAST stringLiteralAST) {
@@ -358,7 +362,7 @@ public Handle stringLiteral(TerminalAST stringLiteralAST) {
 		for (int i = 0; i < n; ) {
 			right = newState(stringLiteralAST);
 			int codePoint = s.codePointAt(i);
-			prev.addTransition(CodePointTransitions.createWithCodePoint(right, codePoint));
+			prev.addTransition(CodePointTransitions.createWithCodePoint(right, codePoint, caseInsensitive));
 			prev = right;
 			i += Character.charCount(codePoint);
 		}
@@ -466,10 +470,10 @@ public IntervalSet getSetFromCharSetLiteral(GrammarAST charSetAST) {
 						state = CharSetParseState.ERROR;
 						break;
 					case CODE_POINT:
-						state = applyPrevStateAndMoveToCodePoint(charSetAST, set, state, escapeParseResult.codePoint);
+						state = applyPrevStateAndMoveToCodePoint(charSetAST, set, state, escapeParseResult.codePoint, caseInsensitive);
 						break;
 					case PROPERTY:
-						state = applyPrevStateAndMoveToProperty(charSetAST, set, state, escapeParseResult.propertyIntervalSet);
+						state = applyPrevStateAndMoveToProperty(charSetAST, set, state, escapeParseResult.propertyIntervalSet, caseInsensitive);
 						break;
 				}
 				offset = escapeParseResult.parseLength;
@@ -485,7 +489,7 @@ else if (c == '-' && !state.inRange && i != 0 && i != n - 1 && state.mode != Cha
 				}
 			}
 			else {
-				state = applyPrevStateAndMoveToCodePoint(charSetAST, set, state, c);
+				state = applyPrevStateAndMoveToCodePoint(charSetAST, set, state, c, caseInsensitive);
 			}
 			i += offset;
 		}
@@ -493,7 +497,7 @@ else if (c == '-' && !state.inRange && i != 0 && i != n - 1 && state.mode != Cha
 			return new IntervalSet();
 		}
 		// Whether or not we were in a range, we'll add the last code point found to the set.
-		applyPrevState(charSetAST, set, state);
+		applyPrevState(charSetAST, set, state, caseInsensitive);
 		return set;
 	}
 
@@ -501,7 +505,8 @@ private CharSetParseState applyPrevStateAndMoveToCodePoint(
 		GrammarAST charSetAST,
 		IntervalSet set,
 		CharSetParseState state,
-		int codePoint) {
+		int codePoint,
+		boolean caseInsensitive) {
 		if (state.inRange) {
 			if (state.prevCodePoint > codePoint) {
 				g.tool.errMgr.grammarError(
@@ -510,12 +515,11 @@ private CharSetParseState applyPrevStateAndMoveToCodePoint(
 					charSetAST.getToken(),
 					CharSupport.getRangeEscapedString(state.prevCodePoint, codePoint));
 			}
-			checkSetCollision(charSetAST, set, state.prevCodePoint, codePoint);
-			set.add(state.prevCodePoint, codePoint);
+			checkRangeAndAddToSet(charSetAST, set, state.prevCodePoint, codePoint, caseInsensitive);
 			state = CharSetParseState.NONE;
 		}
 		else {
-			applyPrevState(charSetAST, set, state);
+			applyPrevState(charSetAST, set, state, caseInsensitive);
 			state = new CharSetParseState(
 				CharSetParseState.Mode.PREV_CODE_POINT,
 				false,
@@ -529,14 +533,15 @@ private CharSetParseState applyPrevStateAndMoveToProperty(
 		GrammarAST charSetAST,
 		IntervalSet set,
 		CharSetParseState state,
-		IntervalSet property) {
+		IntervalSet property,
+		boolean caseInsensitive) {
 		if (state.inRange) {
 			g.tool.errMgr.grammarError(ErrorType.UNICODE_PROPERTY_NOT_ALLOWED_IN_RANGE,
 				g.fileName, charSetAST.getToken(), charSetAST.getText());
 			return CharSetParseState.ERROR;
 		}
 		else {
-			applyPrevState(charSetAST, set, state);
+			applyPrevState(charSetAST, set, state, caseInsensitive);
 			state = new CharSetParseState(
 				CharSetParseState.Mode.PREV_PROPERTY,
 				false,
@@ -546,14 +551,13 @@ private CharSetParseState applyPrevStateAndMoveToProperty(
 		return state;
 	}
 
-	private void applyPrevState(GrammarAST charSetAST, IntervalSet set, CharSetParseState state) {
+	private void applyPrevState(GrammarAST charSetAST, IntervalSet set, CharSetParseState state, boolean caseInsensitive) {
 		switch (state.mode) {
 			case NONE:
 			case ERROR:
 				break;
 			case PREV_CODE_POINT:
-				checkSetCollision(charSetAST, set, state.prevCodePoint);
-				set.add(state.prevCodePoint);
+				checkCharAndAddToSet(charSetAST, set, state.prevCodePoint, caseInsensitive);
 				break;
 			case PREV_PROPERTY:
 				set.addAll(state.prevProperty);
@@ -561,37 +565,50 @@ private void applyPrevState(GrammarAST charSetAST, IntervalSet set, CharSetParse
 		}
 	}
 
-	protected void checkSetCollision(GrammarAST ast, IntervalSet set, int el) {
-		checkSetCollision(ast, set, el, el);
+	private void checkCharAndAddToSet(GrammarAST ast, IntervalSet set, int c, boolean caseInsensitive) {
+		checkRangeAndAddToSet(ast, set, c, c, caseInsensitive);
 	}
 
-	protected void checkSetCollision(GrammarAST ast, IntervalSet set, int a, int b) {
-		for (int i = a; i <= b; i++) {
-			if (set.contains(i)) {
-				String setText;
-				if (ast.getChildren() == null) {
-					setText = ast.getText();
-				}
-				else {
-					StringBuilder sb = new StringBuilder();
-					for (Object child : ast.getChildren()) {
-						if (child instanceof RangeAST) {
-							sb.append(((RangeAST) child).getChild(0).getText());
-							sb.append("..");
-							sb.append(((RangeAST) child).getChild(1).getText());
-						}
-						else {
-							sb.append(((GrammarAST)child).getText());
+	private void checkRangeAndAddToSet(GrammarAST ast, IntervalSet set, int a, int b, boolean caseInsensitive) {
+		if (caseInsensitive) {
+			int lowerA = Character.toLowerCase(a);
+			int upperA = Character.toUpperCase(a);
+			int lowerB = Character.toLowerCase(b);
+			int upperB = Character.toUpperCase(b);
+			if (lowerA == upperA && upperB == lowerB) {
+				checkRangeAndAddToSet(ast, set, a, b, false);
+			} else {
+				checkRangeAndAddToSet(ast, set, lowerA, lowerB, false);
+				checkRangeAndAddToSet(ast, set, upperA, upperB, false);
+			}
+		} else {
+			for (int i = a; i <= b; i++) {
+				if (set.contains(i)) {
+					String setText;
+					if (ast.getChildren() == null) {
+						setText = ast.getText();
+					} else {
+						StringBuilder sb = new StringBuilder();
+						for (Object child : ast.getChildren()) {
+							if (child instanceof RangeAST) {
+								sb.append(((RangeAST)
child).getChild(0).getText());
+								sb.append("..");
+								sb.append(((RangeAST) child).getChild(1).getText());
+							}
+							else {
+								sb.append(((GrammarAST)child).getText());
+							}
+							sb.append(" | ");
 						}
-						sb.append(" | ");
+						sb.replace(sb.length() - 3, sb.length(), "");
+						setText = sb.toString();
 					}
-					sb.replace(sb.length() - 3, sb.length(), "");
-					setText = sb.toString();
+					g.tool.errMgr.grammarError(ErrorType.CHARACTERS_COLLISION_IN_SET, g.fileName, ast.getToken(),
+							(char)a + "-" + (char)b, setText);
+					break;
 				}
-				g.tool.errMgr.grammarError(ErrorType.CHARACTERS_COLLISION_IN_SET, g.fileName, ast.getToken(),
-						CharSupport.getRangeEscapedString(a, b), setText);
-				break;
 			}
+			set.add(a, b);
 		}
 	}
 
diff --git a/tool/src/org/antlr/v4/semantics/BasicSemanticChecks.java b/tool/src/org/antlr/v4/semantics/BasicSemanticChecks.java
index 4f76b5bd00b..17ff09e1479 100644
--- a/tool/src/org/antlr/v4/semantics/BasicSemanticChecks.java
+++ b/tool/src/org/antlr/v4/semantics/BasicSemanticChecks.java
@@ -224,6 +224,13 @@ public void blockOption(GrammarAST ID, GrammarAST valueAST) {
 	public void grammarOption(GrammarAST ID, GrammarAST valueAST) {
 		boolean ok = checkOptions(g.ast, ID.token, valueAST);
 		//if ( ok ) g.ast.setOption(ID.getText(), value);
+		if (ID.getText().equals("caseInsensitive")) {
+			String valueText = valueAST.getText();
+			if (!valueText.equals("true") && !valueText.equals("false")) {
+				g.tool.errMgr.grammarError(ErrorType.ILLEGAL_OPTION_VALUE, g.fileName, valueAST.getToken(),
+						ID.getText(), valueText);
+			}
+		}
 	}
 
 	@Override
diff --git a/tool/src/org/antlr/v4/tool/Grammar.java b/tool/src/org/antlr/v4/tool/Grammar.java
index fc98fcf4a77..cbd4af76add 100644
--- a/tool/src/org/antlr/v4/tool/Grammar.java
+++ b/tool/src/org/antlr/v4/tool/Grammar.java
@@ -83,6 +83,7 @@ public class Grammar implements AttributeResolver {
 		parserOptions.add("language");
 		parserOptions.add("accessLevel");
 		parserOptions.add("exportMacro");
+		parserOptions.add("caseInsensitive");
 	}
 
 	public static final Set lexerOptions =
parserOptions;
@@ -435,7 +436,6 @@ public boolean defineRule(Rule r) {
 		if ( rules.get(r.name)!=null ) {
 			return false;
 		}
-		rules.put(r.name, r);
 		r.index = ruleNumber++;
 		indexToRule.add(r);