WIP: Various updates to the Regex HOWTO #107825

Open
wants to merge 26 commits into
base: main

Conversation

akuchling
Member

@akuchling akuchling commented Aug 9, 2023

As people sent me comments over the years, I've been collecting user feedback on the Regex HOWTO. This PR will contain the resulting set of changes. It is currently still work-in-progress; I have a lengthy list of changes that I'm making.

I'll try very hard to keep each commit completely and logically separated, so you may want to proofread commit-by-commit. Feel free to cherry-pick particular commits into main if you like while other commits get worked on; I can rebase or merge and try to keep things coherent.


📚 Documentation preview 📚: https://cpython-previews--107825.org.readthedocs.build/

@bedevere-bot bedevere-bot added awaiting review docs Documentation in the Doc dir skip news labels Aug 9, 2023
@akuchling akuchling changed the title Various updates to the Regex HOWTO WIP: Various updates to the Regex HOWTO Aug 10, 2023
Doc/howto/regex.rst
Comment on lines 556 to 558
To specify them in the pattern, you can write them as an embedded
modifier at the start of the pattern that uses the short one-letter
form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.
Member

It is worth mentioning "modifier spans" like (?i:...). They are more powerful than global flags and modifiers.
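
To illustrate the difference being discussed, here is a minimal sketch contrasting a global inline flag with a modifier span. The pattern text and the names `global_flag` and `scoped_flag` are made up for illustration and do not come from the HOWTO.

```python
import re

# Global inline flag: (?i) at the very start applies to the whole pattern.
global_flag = re.compile(r'(?i)python howto')

# Modifier span: (?i:...) makes only the enclosed part case-insensitive.
scoped_flag = re.compile(r'(?i:python) HOWTO')

# The whole pattern ignores case.
print(bool(global_flag.fullmatch('PYTHON HowTo')))
# Only the 'python' part ignores case.
print(bool(scoped_flag.fullmatch('PyThOn HOWTO')))
# 'HOWTO' sits outside the span, so it is still case-sensitive.
print(bool(scoped_flag.fullmatch('python howto')))
```

The span form is what allows case-insensitivity (or any other flag) to be limited to one part of a larger pattern, which a flag argument or a leading `(?i)` cannot express.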

Member

I think so, but it's also okay to do that in a separate PR. We can iterate and work incrementally.


For example, the following RE detects doubled words in a string. ::

- >>> p = re.compile(r'\b(\w+)\s+\1\b')
+ >>> p = re.compile(r'\b(\w+)\b\s+\1\b')
Member

The second \b was removed intentionally. It is not needed here.

It is also worth using possessive qualifiers here.

Member

But it's fine to keep the second \b, and when modifying the example for some other context it might be useful. So I'd be fine with keeping it too. (Note that it's mentioned in the text below also.)

(Also, what's a possessive qualifier?)

Member

Not exactly this example, but see the conversation in #21420 about redundant \b.

This example was fixed in #4443. It was incorrect without \b at the end, but \b between \w and \s is redundant by definition.
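
A small check of both claims, as a sketch: the sample strings and variable names below are invented for illustration. It compares the pattern with and without the inner `\b`, and shows what goes wrong when the trailing `\b` is dropped.

```python
import re

with_inner_b = re.compile(r'\b(\w+)\b\s+\1\b')     # extra \b between \w+ and \s+
without_inner_b = re.compile(r'\b(\w+)\s+\1\b')    # the form without the inner \b
no_trailing_b = re.compile(r'\b(\w+)\s+\1')        # missing the final \b


def found(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None


samples = ['Paris in the the spring', 'then the end',
           'the theory', 'no doubles here']

# On these samples the inner \b makes no difference to the result.
same_results = all(found(with_inner_b, s) == found(without_inner_b, s)
                   for s in samples)
print(same_results)

# Without the trailing \b, 'the theory' is wrongly reported as a doubled word.
print(found(without_inner_b, 'the theory'))
print(found(no_trailing_b, 'the theory'))
```

The inner `\b` is redundant because `\w+` is greedy and `\s+` must start on a whitespace character, so the position between them is always a word boundary anyway.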

Sorry, not "possessive qualifier" but "possessive quantifier" (although some documents call them "qualifiers"). A possessive quantifier is a quantifier that does not backtrack. It is written by adding + to the quantifier (just as non-greedy quantifiers are written by adding ?). For example, when matching the greedy pattern \b(\w+)\s+\1\b against "then the", a naive backtracking engine will try to match "then then", fail, then backtrack and successively try "the ", "th ", and "t " until it gives up. With possessive quantifiers, \b(\w++)\s++\1\b will not backtrack and so fails sooner. This is a new feature in Python 3.11. Although most modern RE engines support it, it is relatively little known because older RE engines did not.

See https://www.regular-expressions.info/possessive.html
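
As a rough sketch of the comparison described above: the possessive form is compiled only on Python 3.11 or later, since earlier versions reject `++` as a syntax error. The names `greedy`, `possessive`, and `found` are illustrative, and the snippet only checks that both patterns report the same result; it does not measure the speed difference.

```python
import re
import sys

greedy = re.compile(r'\b(\w+)\s+\1\b')

# Possessive quantifiers (++, *+, ?+) need Python 3.11 or later.
if sys.version_info >= (3, 11):
    possessive = re.compile(r'\b(\w++)\s++\1\b')
else:
    possessive = None  # older versions raise re.error on '++'


def found(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None


for text in ('then the', 'the the'):
    print(text, '->', found(greedy, text))
    if possessive is not None:
        # Same answer as the greedy pattern, reached with less backtracking.
        print(text, '->', found(possessive, text))
```

For this particular pattern the two forms accept the same strings; the possessive version just gives up sooner on inputs such as "then the" because `\w++` never hands back characters it has consumed.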

Member Author

OK, I've removed the second \b and edited the text below a bit.

@serhiy-storchaka
Member

It would be nice to add more about possessive qualifiers and atomic grouping. Modifier spans are also underrated.

Member

@gvanrossum gvanrossum left a comment

Hi Andrew! Here are some small suggestions. I recommend merging this rather than sitting on it for much longer. If there are improvements you're still planning to make but don't feel you have time for right now, feel free to open another PR. I promise to review and merge quickly -- this looks like almost everything is uncontroversial.

akuchling and others added 8 commits September 24, 2024 21:43
Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@akuchling akuchling marked this pull request as ready for review September 25, 2024 02:00
@akuchling
Member Author

OK, I've applied a bunch of suggested revisions, and also added comments listing future topics such as the possessive quantifiers and spanning modifiers. Let's work on those in future PRs, since this one has already taken long enough! 🕙

5 participants