Cannot split a grammar into two files #4059

Korporal · 2023-01-05T22:28:52Z

This has consumed three hours of my time so far, I'm absolutely stumped.

I want to split a grammar file into two. From the few articles scattered around the web and from examples of real grammars (like the ones for Ada), it seems we can just put the parser rules and the lexer rules into two files, and name the grammar inside each file the same as the file.

I have named the lexer file: ImperiumLexer.g4 and the parser file ImperiumParser.g4.

ImperiumParser.g4 starts with parser grammar ImperiumParser;

ImperiumLexer.g4 starts with lexer grammar ImperiumLexer;

But I cannot get this to work. Am I expected to pass both file names to the ANTLR4 tool? One name? Just the raw grammar name with no "Parser" or "Lexer" part? I've tried all of these and more, and nothing works.

I tried putting an import into the parser to import the lexer and that too caused immense confusion.

Look:

specifying the files in the opposite order:

Here are the two files (they are in the same folder):

I even found a Stack Overflow post suggesting that one just specify *.g4, but nope, that too leads to problems...

This naming pattern seems to be used all over the place in the ANTLR4 example grammars: Ada, as I said, but also CSharp. They split the grammar and name the files and grammars exactly as I do, so what am I doing wrong?

As I play with this, I wonder why we don't name these files <grammar>.<parser/lexer>.g4. That would isolate the grammar name nicely as a distinct component of the file name. The files themselves could then (internally) name the grammar <grammar>, and the "lexer" and "parser" parts could be left out of the grammar name altogether...

Thx

ericvergnaud · 2023-01-06T09:26:28Z

Maintainer

Hi,
you need to generate the lexer first, then the parser.
If required, add a tokenVocab option to your parser grammar.

Korporal · 2023-01-06T16:16:23Z

Author

@ericvergnaud - I made some progress but this is quite confusing, very non-intuitive to me.

I got my two files sorted: ImperiumParser.g4 has parser grammar ImperiumParser; and ImperiumLexer.g4 has lexer grammar ImperiumLexer;.

In ImperiumParser.g4 I have:

options { tokenVocab = ImperiumLexer; }

Then I run the tool with ImperiumLexer.g4 and ImperiumParser.g4, specified in that order, and it all works.
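
For reference, a minimal sketch of the arrangement described above could look like the following two files. Only the grammar names and the tokenVocab option come from this thread; the rules themselves (IF, THEN, IDENTIFIER, WS, ifStatement) are hypothetical placeholders, not taken from the real Imperium grammar.

```antlr
// ImperiumLexer.g4: run through the tool first. This also produces the
// ImperiumLexer.tokens file that the parser's tokenVocab option refers to.
lexer grammar ImperiumLexer;

IF         : 'if' ;                       // hypothetical keyword rule
THEN       : 'then' ;                     // hypothetical keyword rule
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]* ;    // hypothetical catch-all, listed after the keywords
WS         : [ \t\r\n]+ -> skip ;         // hypothetical whitespace rule
```

```antlr
// ImperiumParser.g4: processed after the lexer grammar.
parser grammar ImperiumParser;

// Take token names and types from the lexer grammar instead of defining them here.
options { tokenVocab = ImperiumLexer; }

// Hypothetical parser rule that refers to the lexer's tokens by name.
ifStatement : IF IDENTIFIER THEN IDENTIFIER EOF ;
```

The grammar name in each file matches its file name, which is the convention the example grammars mentioned earlier appear to follow.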

Now, the last step of my change is to split the lexer grammar file itself into two. I want ImperiumLexer.g4 to import ImperiumKeywords.g4 so that I can use a tool (that I will write) to generate the ImperiumKeywords.g4 file.

This is just not behaving nicely though. It is very frustrating, and I can't find definitive documentation on how to use and understand all the ways of combining multiple files into a single grammar.

You can see the parser file here and the lexer file here. This works, but I'd like to extract a subset of the lexer file into a separate file too.

The stuff I'd like to extract (and import, into the lexer grammar) is currently demarcated by comments inside the lexer grammar file:

// BEGIN - Culture dependent keywords

CALL: ('call');
GOTO: ('goto'); //{langcode=="en"}? ('goto') | {langcode=="fr"}? ('goto'); 
GO: ('go');
TO: ('to');
PROCEDURE: ('procedure' | 'proc');
PROC: ('proc');
END: ('end');
DECLARE: ('declare' | 'dcl');
ARGUMENT: ('argument' | 'arg');
DEFINE: ('define' | 'def');
BINARY: ('binary' | 'bin'); //{langcode=="en"}? ('binary' | 'bin') | {langcode=="fr"}? ('binaire' | 'bin');
DECIMAL: ('decimal' | 'dec');
AUTOMATIC: ('automatic' | 'auto');
BUILTIN: ('builtin');
INTRINSIC: ('intrinsic');
STATIC: ('static');
VARIABLE: ('variable');
BASED: ('based');
DEFINED: ('defined');
INTERNAL: ('internal');
EXTERNAL: ('external');
RETURN: ('return');
IF: ('if');
THEN: ('then');
ELSE: ('else');
ELIF: ('elif');
RETURNS: ('returns');
POINTER: ('pointer' | 'ptr');
BIT: ('bit');
CHARACTER: ('character' | 'char');
ENTRY: ('entry');
FIXED: ('fixed');
FLOAT: ('float');
OFFSET: ('offset' | 'ofx');
STRING: ('string');
VARYING: ('varying' | 'var');
COROUTINE: ('coroutine' | 'cor');
COFUNCTION: ('cofunction' | 'cof');
LOOP: ('loop');
WHILE: ('while');
UNTIL: ('until');
ENDLOOP: ('endloop');
RELOOP: ('reloop');
INCLUDE: ('include');
INC: ('inc');

// END - Culture dependent keywords

In case you're wondering why, I want to auto generate the file to do this kind of thing:

BINARY: {langcode=="en"}? ('binary' | 'bin') | {langcode=="fr"}? ('binaire' | 'bin');

The file would be generated from a JSON file that contains all of the keywords in multiple cultures. This example shows just English and French for the English keyword "binary", but ultimately the tool would generate such a rule for every keyword and for many languages: English, French, Dutch, German...

2 replies

ericvergnaud Jan 6, 2023
Maintainer

May I suggest that you read the book? I believe there is a lot to learn from it.
It should be quite straightforward to import ImperiumKeywords in ImperiumLexer.
You just need to be aware that imported rules will be added after the local rules, which affects precedence.
Notably, the greedy rules need to be in the imported lexer, not the main lexer.
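
To illustrate the precedence point: an ANTLR lexer prefers the longest match and, on a tie, the rule that is listed first. A sketch of a layout that respects this, assuming a hypothetical catch-all IDENTIFIER rule (no such rule is shown anywhere in this thread), might keep that rule in the imported grammar, after the keywords:

```antlr
// ImperiumKeywords.g4: the imported grammar.
lexer grammar ImperiumKeywords;

IF   : 'if' ;
THEN : 'then' ;
ELSE : 'else' ;

// Hypothetical catch-all rule. It sits after the keywords so that, when both
// match the same text, the keyword rule (listed first) wins the tie.
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]* ;
```

```antlr
// ImperiumLexer.g4: the main grammar.
lexer grammar ImperiumLexer;
import ImperiumKeywords;

// Local rules end up ahead of the imported ones, so only rules that cannot
// shadow a keyword are kept here (both are hypothetical).
SEMICOLON : ';' ;
WS        : [ \t\r\n]+ -> skip ;
```

Given the ordering described above, declaring IDENTIFIER in ImperiumLexer.g4 instead would put it ahead of the imported keyword rules, and input such as if would then come out as IDENTIFIER rather than IF.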

Korporal Jan 6, 2023
Author

> May I suggest that you read the book? I believe there is a lot to learn from it. It should be quite straightforward to import ImperiumKeywords in ImperiumLexer. You just need to be aware that imported rules will be added after the local rules, which affects precedence. Notably, the greedy rules need to be in the imported lexer, not the main lexer.

I do have the book, and I can of course revisit that, thanks.

ericvergnaud · 2023-01-06T16:20:11Z

Maintainer

@parrt FYI, an occurrence of the topic of our recent discussion

1 reply

parrt Jan 6, 2023
Maintainer

haha. indeed.

Korporal · 2023-01-06T16:31:44Z

Author

@ericvergnaud - OK, that fixed it. I was unaware that the imported text appears after the local text; switching things around fixed it!

Thanks

Korporal · 2023-01-06T17:53:09Z

Author

It could be helpful to have a new way to include, namely an "include" term rather than only having "import". The "include" could be used to insert the text right into the place where the "include" directive is positioned, literally replace that line with all of the lines from the included file.

Then one could include text much more flexibly...

kaby76 · 2023-01-06T20:48:09Z

Splitting a grammar isn't particularly hard:

1. Copy all parser rules to a new file named <combined-grammar-name>Parser.g4, and all lexer rules to <combined-grammar-name>Lexer.g4. It's important to preserve the order of the lexer rules in the new file.
2. Add the grammarDecl parser grammar <combined-grammar-name>Parser; to the parser .g4 file and lexer grammar <combined-grammar-name>Lexer; to the lexer .g4 file.
3. Add options { tokenVocab=<combined-grammar-name>Lexer; } to the parser .g4 file.
4. Make sure every string literal has an explicit lexer rule. Otherwise, the Antlr tool fails with "cannot create implicit token for string literal in non-combined grammar". Antlr does not require that you fold the string literal (i.e., replace the string literal in the parser with the LHS symbol of the lexer rule that has that string literal on the RHS).
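
As a rough sketch of the string-literal point, consider a hypothetical grammar named Demo (all grammar and rule names below are made up for illustration). Once each literal has its own lexer rule, the parser grammar can keep using the literal form:

```antlr
// DemoLexer.g4 (hypothetical)
lexer grammar DemoLexer;

// Explicit rules for what used to be implicit string-literal tokens
// in the combined grammar.
IF     : 'if' ;
LPAREN : '(' ;
RPAREN : ')' ;
ID     : [a-z]+ ;               // listed after IF so that 'if' stays a keyword
WS     : [ \t\r\n]+ -> skip ;
```

```antlr
// DemoParser.g4 (hypothetical)
parser grammar DemoParser;
options { tokenVocab = DemoLexer; }

// 'if', '(' and ')' can stay as literals because DemoLexer declares a rule
// for each of them. The "folded" form would be: IF LPAREN ID RPAREN EOF
condition : 'if' '(' ID ')' EOF ;
```

Removing the IF rule from DemoLexer.g4 while leaving 'if' in the parser rule is the situation that triggers the "cannot create implicit token" error quoted above.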

There's definitely more to the refactoring, but that usually suffices for simple grammars (that don't have options). See trsplit. There currently is no list of Antlr lexer rules that implement a full Unicode character table (e.g., Latin_Capital_letter_N : 'N';). But, it is on my "to do" list. Combined with an "XPath grep", one could extract out all the string literals and select the rules in the Unicode table, and define rules for string literals of more than one code point, so you don't have to type all these rules manually.

Korporal · 2023-01-06T21:43:25Z

Author

I'm going to have to stop, I cannot continue just now, it has become so bewildering that I don't even know what questions to ask any more.

Everything has fallen apart and I was making excellent progress too, or so it seemed.

All three .g4 files are processed without complaint, yet when I run the parser I now get errors for input that previously parsed fine.

I need to step away, it's been gruelling...

1 reply

kaby76 Jan 6, 2023

(Always output the tokens generated before looking at the parse.)

parrt · 2023-01-06T21:45:01Z

Maintainer

Personally, I don't think I've ever used import, so perhaps you don't need it. Leaving it out definitely simplifies your life.

Korporal · 2023-01-06T22:05:34Z

Author

OK I'm back in the game!

I decided (as one often does in this business) to step back to an earlier point, and it's good.

I reverted to the original single .g4 file and now simply use my tool to generate a set of ANTLR4 token defs like this:

/* These Antlr4 keyword token definitions were generated by a utility on 1/6/2023 at 3:16 PM */

ALIGNED: {langcode=="en"}? ('aligned') | {langcode=="fr"}? ('aligned') | {langcode=="he"}? ('aligned') ;
ARGUMENT: {langcode=="en"}? ('argument' | 'arg') | {langcode=="fr"}? ('argument') | {langcode=="he"}? ('argument') ;
AUTOMATIC: {langcode=="en"}? ('automatic') ;
BASED: {langcode=="en"}? ('based') ;
BINARY: {langcode=="en"}? ('binary' | 'bin') | {langcode=="fr"}? ('binaire') | {langcode=="he"}? ('binary') ;
BIT: {langcode=="en"}? ('bit') ;
BOOLEAN: {langcode=="en"}? ('boolean') | {langcode=="fr"}? ('booléenne') | {langcode=="he"}? ('boolean') ;
BUILTIN: {langcode=="en"}? ('builtin') | {langcode=="fr"}? ('intégré') | {langcode=="he"}? ('builtin') ;
BY: {langcode=="en"}? ('by') | {langcode=="fr"}? ('by') | {langcode=="he"}? ('by') ;
BYPASS: {langcode=="en"}? ('bypass') | {langcode=="fr"}? ('bypass') ;
CALL: {langcode=="en"}? ('call') | {langcode=="fr"}? ('appeler') | {langcode=="he"}? ('call') ;
CHARACTER: {langcode=="en"}? ('character' | 'char') ;
COFUNCTION: {langcode=="en"}? ('cofunction') ;
COROUTINE: {langcode=="en"}? ('coroutine') | {langcode=="fr"}? ('coroutine') | {langcode=="he"}? ('coroutine') ;
DECIMAL: {langcode=="en"}? ('decimal' | 'dec') | {langcode=="fr"}? ('décimal') | {langcode=="he"}? ('decimal') ;
DECLARE: {langcode=="en"}? ('declare' | 'dcl') | {langcode=="fr"}? ('déclarer') | {langcode=="he"}? ('declare') ;
DEFINE: {langcode=="en"}? ('define' | 'def') | {langcode=="fr"}? ('define') | {langcode=="he"}? ('define') ;
DEFINED: {langcode=="en"}? ('defined') ;
ELIF: {langcode=="en"}? ('elif') ;
ELSE: {langcode=="en"}? ('else') | {langcode=="fr"}? ('else') | {langcode=="he"}? ('אחרת') ;
END: {langcode=="en"}? ('end') | {langcode=="fr"}? ('fin') | {langcode=="he"}? ('end') ;
ENDLOOP: {langcode=="en"}? ('endloop') ;
ENTRY: {langcode=="en"}? ('entry') ;
ENUM: {langcode=="en"}? ('enum') | {langcode=="fr"}? ('enum') | {langcode=="he"}? ('enum') ;
FIXED: {langcode=="en"}? ('fixed') | {langcode=="fr"}? ('fixe') | {langcode=="he"}? ('fixed') ;
FLOAT: {langcode=="en"}? ('float') | {langcode=="fr"}? ('flottant') | {langcode=="he"}? ('float') ;
FUNCTION: {langcode=="en"}? ('function' | 'func') | {langcode=="fr"}? ('fonction') | {langcode=="he"}? ('function') ;
GO: {langcode=="en"}? ('go') ;
GOTO: {langcode=="en"}? ('goto') | {langcode=="fr"}? ('goto') | {langcode=="he"}? ('goto') ;
IF: {langcode=="en"}? ('if') | {langcode=="fr"}? ('si') | {langcode=="he"}? ('אם') ;
INC: {langcode=="en"}? ('inc') ;
INCLUDE: {langcode=="en"}? ('include' | 'inc') ;
INTERNAL: {langcode=="en"}? ('internal') | {langcode=="fr"}? ('interne') | {langcode=="he"}? ('internal') ;
INTERRUPT: {langcode=="en"}? ('interrupt') | {langcode=="fr"}? ('interrompre') | {langcode=="he"}? ('interrupt') ;
INTRINSIC: {langcode=="en"}? ('intrinsic') ;
LANGUAGE: {langcode=="en"}? ('lingua') | {langcode=="fr"}? ('lingua') | {langcode=="he"}? ('lingua') ;
LOOP: {langcode=="en"}? ('loop') | {langcode=="fr"}? ('boucle') | {langcode=="he"}? ('loop') ;
NAMESPACE: {langcode=="en"}? ('namespace') | {langcode=="fr"}? ('namespace') | {langcode=="he"}? ('namespace') ;
OFFSET: {langcode=="en"}? ('offset') | {langcode=="fr"}? ('offset') | {langcode=="he"}? ('offset') ;
OUT: {langcode=="en"}? ('out') | {langcode=="fr"}? ('depuis') | {langcode=="he"}? ('out') ;
PAD: {langcode=="en"}? ('pad') | {langcode=="fr"}? ('pad') | {langcode=="he"}? ('pad') ;
POINTER: {langcode=="en"}? ('pointer' | 'ptr') | {langcode=="fr"}? ('pointer') | {langcode=="he"}? ('pointer') ;
PRIVATE: {langcode=="en"}? ('private') | {langcode=="fr"}? ('privé') | {langcode=="he"}? ('private') ;
PROCEDURE: {langcode=="en"}? ('procedure' | 'proc') | {langcode=="fr"}? ('procédé') | {langcode=="he"}? ('procedure') ;
PUBLIC: {langcode=="en"}? ('public') | {langcode=="fr"}? ('public') | {langcode=="he"}? ('public') ;
READONLY: {langcode=="en"}? ('readonly') | {langcode=="fr"}? ('readonly') | {langcode=="he"}? ('readonly') ;
REF: {langcode=="en"}? ('ref') | {langcode=="fr"}? ('ref') | {langcode=="he"}? ('ref') ;
RELOOP: {langcode=="en"}? ('reloop') ;
RETURN: {langcode=="en"}? ('return') | {langcode=="fr"}? ('retour') | {langcode=="he"}? ('return') ;
RETURNS: {langcode=="en"}? ('returns') ;
RETURNON: {langcode=="en"}? ('returnon') | {langcode=="fr"}? ('retour si') | {langcode=="he"}? ('returnon') ;
SINGLET: {langcode=="en"}? ('singlet') | {langcode=="fr"}? ('singlet') | {langcode=="he"}? ('singlet') ;
STATIC: {langcode=="en"}? ('static') | {langcode=="fr"}? ('static') | {langcode=="he"}? ('static') ;
STRING: {langcode=="en"}? ('string') | {langcode=="fr"}? ('chaîne') | {langcode=="he"}? ('string') ;
STRUCTURE: {langcode=="en"}? ('structure' | 'struct') | {langcode=="fr"}? ('structure') | {langcode=="he"}? ('structure') ;
THEN: {langcode=="en"}? ('then') ;
TO: {langcode=="en"}? ('to') | {langcode=="fr"}? ('to') | {langcode=="he"}? ('to') ;
TYPE: {langcode=="en"}? ('type') | {langcode=="fr"}? ('type') | {langcode=="he"}? ('type') ;
UNALIGNED: {langcode=="en"}? ('unaligned') | {langcode=="fr"}? ('unaligned') | {langcode=="he"}? ('unaligned') ;
UNTIL: {langcode=="en"}? ('until') | {langcode=="fr"}? ('avant') | {langcode=="he"}? ('until') ;
USING: {langcode=="en"}? ('using') | {langcode=="fr"}? ('using') | {langcode=="he"}? ('using') ;
WHILE: {langcode=="en"}? ('while') | {langcode=="fr"}? ('tandis que') | {langcode=="he"}? ('כלעוד') ;
VARIABLE: {langcode=="en"}? ('variable') ;
VARYING: {langcode=="en"}? ('varying') | {langcode=="fr"}? ('varying') | {langcode=="he"}? ('varying') ;
YIELD: {langcode=="en"}? ('yield') | {langcode=="fr"}? ('donner') | {langcode=="he"}? ('yield') ;

/* End of generated Antlr4 keyword token definitions. */

and just copy/paste that into the .g4, overwriting the earlier version of that block of statements. It works!

This is the initial proof. I can now easily maintain a JSON file of keywords per (human) language, generate the above block, and copy/paste it into the .g4. The effort is 1% more than it would be if I were generating a .g4 file and importing it.

(This may not be the most efficient way to do what I want, but it does work, and that's fine for the time being.)

1 reply

KvanTTT Feb 12, 2023

BTW, your lexer is very slow because the semantic predicates are located at the beginning of the lexer rules. See details in #3633.
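
One common way to act on this kind of advice is to move each predicate to the end of its alternative, so that it is only evaluated once the keyword text has been matched. The fragment below is a sketch of that arrangement using two of the rules from the generated block above; it assumes langcode is defined elsewhere in the lexer, as in the original rules, and it is not necessarily the exact fix discussed in #3633.

```antlr
// Fragment to place inside a lexer grammar; langcode is assumed to be a
// member defined elsewhere, exactly as in the generated rules above.

// Each predicate trails its literal, so it is checked only after the
// keyword text itself has matched.
BINARY : ('binary' | 'bin') {langcode=="en"}?
       | 'binaire'          {langcode=="fr"}?
       | 'binary'           {langcode=="he"}?
       ;

IF     : 'if' {langcode=="en"}?
       | 'si' {langcode=="fr"}?
       | 'אם' {langcode=="he"}?
       ;
```

Since the rules are generated by a tool, a change like this would only need to be made in the generator's output template rather than by hand in every rule.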
