Accessing capture groups of regular expressions? #905

cubinator · 2021-06-03T14:03:19Z

I'm currently working on a small project for parsing PDF files, which first filters out all objects using re.finditer and then applies lark to parse those objects. One particular type of object, called a stream object, consists of binary data enclosed between the tokens "stream" and "endstream". Because of the binary format of a PDF, I need to declare those stream objects as a terminal with a regular expression that looks like this:

STREAM_LITERAL: /(?<=...)stream\r?\n(.*?)endstream(?=...)/s

Lark then returns a token like this:

stream
Hello Worldendstream

My question now is: Does lark provide a way to access the capturing groups of regular expressions? I couldn't find any information on this in the documentation. As far as I am aware, lark uses Python's built-in regular expressions (re). Is it possible to access the Match object of a token whose underlying rule is a regular expression?

Currently, I just apply the same regular expression to the token to extract the stream's content.
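For illustration, that workaround might look roughly like the sketch below. It uses only Python's re module. The names STREAM_RE and stream_content and the group name "content" are made up for this example, and the lookbehind/lookahead parts of the terminal are left out because they are elided above.

```python
import re

# Simplified stand-in for the STREAM_LITERAL terminal (assumption: the
# elided lookarounds are omitted, and the capture group is given a name).
STREAM_RE = re.compile(r"stream\r?\n(?P<content>.*?)endstream", re.DOTALL)


def stream_content(token_text):
    """Re-apply the terminal's regex to the token text and return the payload."""
    match = STREAM_RE.fullmatch(token_text)
    if match is None:
        raise ValueError("text does not look like a stream object")
    return match.group("content")


# A lark Token is a str subclass, so it can be passed in directly.
content = stream_content("stream\nHello Worldendstream")
```

Using fullmatch here means the whole token has to be a stream object, so a payload that itself contains "endstream" is still captured up to the final delimiter.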

Answered by erezsh

Jun 3, 2021

Lark already uses named groups to retrieve the token type. Unfortunately, the re module doesn't have support for nested groups.

My suggestion is to run the regex again on the token, post lex (or post parse), to extract the named groups. That might add around 10% to the parse time, which isn't that bad.

You could also write your own lexer, if you needed. Provide it to Lark with the lexer argument (https://lark-parser.readthedocs.io/en/latest/classes.html#lark.Lark).

@MegaIng

We could, in theory, add this as a feature to Lark. It will allow you to specify in the grammar which regexes should be considered with their groups, and we can re-evaluate them whenever they get matched. But I don't k…

View full answer

MegaIng (Collaborator) · 2021-06-03T14:15:26Z

No, Lark doesn't provide something to do that, and probably can't with the current infrastructure. This is because we don't use each regex in isolation; for better performance, it is combined with all other terminals into a single compiled regex. As a result, group references no longer point where you'd expect and are unusable.
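A rough illustration of the effect, using plain re rather than Lark's actual internals: the terminals NUMBER and PAIR are hypothetical, and the way the patterns are joined here is only an assumption about the general approach (one named group per terminal, joined with "|").

```python
import re

# Hypothetical terminals; only STREAM resembles the one from the question.
terminals = {
    "NUMBER": r"\d+",
    "PAIR": r"(\w+)=(\w+)",
    "STREAM": r"stream\r?\n(.*?)endstream",
}

# Join all terminals into one alternation, one named group per terminal.
combined = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in terminals.items()),
    re.DOTALL,
)

text = "stream\nHello Worldendstream"
standalone = re.compile(terminals["STREAM"], re.DOTALL).match(text)
merged = combined.match(text)

# The outer named group still identifies which terminal matched.
terminal_name = merged.lastgroup

# In isolation, group 1 is the stream payload.
standalone_group_1 = standalone.group(1)

# In the combined pattern, group 1 belongs to NUMBER, which did not take part.
merged_group_1 = merged.group(1)

# The payload group has been pushed back by the groups of earlier terminals.
stream_group_index = combined.groupindex["STREAM"]
payload = merged.group(stream_group_index + 1)
```

The index of the payload group depends on how many groups the preceding terminals contain, so a number written in the grammar cannot be used as-is against the combined match.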

1 reply

cubinator Jun 3, 2021
Author

Would it be possible with named groups, i.e. (?P<name>.*?)?

erezsh (Maintainer) · 2021-06-03T15:23:10Z

Lark already uses named groups to retrieve the token type. Unfortunately, the re module doesn't have support for nested groups.

My suggestion is to run the regex again on the token, post lex (or post parse), to extract the named groups. That might add around 10% to the parse time, which isn't that bad.

You could also write your own lexer, if you need to, and provide it to Lark with the lexer argument (https://lark-parser.readthedocs.io/en/latest/classes.html#lark.Lark).

@MegaIng

We could, in theory, add this as a feature to Lark. It will allow you to specify in the grammar which regexes should be considered with their groups, and we can re-evaluate them whenever they get matched. But I don't know if it's worth the trouble.

(We could also just search for them separately, but that would probably perform even worse)

1 reply

cubinator Jun 3, 2021
Author

Well, I'm fine with applying the regular expression a second time; parse time is not of great importance for my project. And considering all the nice projects that already use lark without the need for this feature, I don't think this feature is worth the trouble. Thanks to both of you for your time :)
