-
I'm currently working on a small project for parsing PDF files. It first filters out all objects using a regular expression terminal; Lark then returns the whole match as a single token.

My question now is: does Lark provide a way to access the capturing groups of regular expressions? I couldn't find any information on this in the documentation. As far as I am aware, Lark uses Python's built-in regular expressions (the `re` module). Currently, I just apply the same regular expression to the token to extract the stream's content.
Replies: 2 comments 2 replies
-
No, Lark doesn't provide a way to do that, and probably can't with the current infrastructure. This is because we are not using each regex in isolation; for better performance, all terminals are combined into one compiled regex. This means that the group references of an individual terminal are shifted and unusable.
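To illustrate why group references stop being usable once patterns are combined, here is a minimal sketch using only the standard `re` module. The two terminals and their patterns are hypothetical, and the way they are joined is an assumption made for illustration, not Lark's actual internals.

```python
import re

# Hypothetical terminals for illustration; these are not Lark's internals.
TERMINALS = {
    "NUMBER": r"(\d+)(\.\d+)?",
    "STREAM": r"stream(.*?)endstream",
}

text = "streamDATAendstream"

# In isolation, the stream content is simply group 1.
alone = re.compile(TERMINALS["STREAM"], re.DOTALL).match(text)
content_alone = alone.group(1)

# Combined into one alternation, each terminal is wrapped in a named group,
# so every inner group is renumbered across the whole pattern.
combined_re = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TERMINALS.items()),
    re.DOTALL,
)
combined = combined_re.match(text)

token_type = combined.lastgroup        # name of the terminal that matched
shifted = combined.group(1)            # now belongs to NUMBER, which did not match
content_combined = combined.group(5)   # the stream content moved to index 5
```

The index of the content group now depends on how many groups the preceding terminals contain, so it changes whenever the set of terminals changes.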
-
Lark already uses named groups to retrieve the token type. Unfortunately, the `re` module doesn't have support for nested groups.

My suggestion is to run the regex again on the token, post lex (or post parse), to extract the named groups. That might add around 10% to the parse time, which isn't that bad.

You could also write your own lexer, if you need to. Provide it to Lark with the lexer argument (https://lark-parser.readthedocs.io/en/latest/classes.html#lark.Lark).

@MegaIng We could, in theory, add this as a feature to Lark. It would allow you to specify in the grammar which regexes should be considered with their groups, and we could re-evaluate them whenever they get matched. But I don't know if it's worth the trouble. (We could also just search for them separately, but that would probably perform even worse.)
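The suggestion to run the regex again on the token after lexing could look roughly like the sketch below. The pattern, the group name `content`, and the helper `stream_content` are hypothetical and only stand in for whatever terminal the grammar actually defines. Since Lark's `Token` is a subclass of `str`, a token can be passed to the helper directly; a plain string is used here to keep the example free of dependencies.

```python
import re

# Hypothetical pattern; it should mirror the terminal used in the grammar.
STREAM_RE = re.compile(
    r"stream\r?\n(?P<content>.*?)\r?\nendstream",
    re.DOTALL,
)


def stream_content(token: str) -> str:
    """Re-apply the terminal's regex to an already lexed token.

    The token is expected to contain exactly one full match of the pattern,
    so fullmatch is used instead of search.
    """
    match = STREAM_RE.fullmatch(token)
    if match is None:
        raise ValueError("token does not match the stream pattern")
    return match.group("content")


# A plain str stands in for the token value here.
example = stream_content("stream\nhello world\nendstream")
```

Such a helper could be called from a transformer callback or from any other post-processing step, which keeps the grammar unchanged and limits the extra matching to the tokens that need it.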