Minor request: \v for vertical spacing #477

dchaplinsky · 2022-08-15T22:16:48Z

Hi!

I'm using the regex lib to port the language tool libs (originally Java) for sentence and word tokenization.
They rely heavily on \v and \h. Some of the rules ship in XML files full of regexes, and I'd rather not alter them so that I don't have to maintain a separate copy. I can sort of work around it by replacing \v with VERTICAL_SPACE: str = "\u000a\u000b\u000c\u000d\u0085\u2028\u2029", but that's another tiny nightmare, because those regexes come in different forms: \v*, [\v\t]*, etc.

Please consider adding the \v escape for vertical whitespace.

dchaplinsky · 2022-08-15T23:40:23Z

I can see that the code suggests a \v pseudo-class, but I can't understand why it doesn't work then:

In [3]: import regex as re

In [8]: re.search(r"\v", "\n") is None
Out[8]: True

In [9]: re.search(r"\v", "\n", flags=re.M | re.U | re.V1) is None
Out[9]: True

mrabarnett · 2022-08-16T00:13:21Z

\v already exists in Python as being short for \x0b (LINE TABULATION):

>>> '\v'
'\x0b'
>>> '\v' == '\N{LINE TABULATION}'
True

dchaplinsky · 2022-08-16T07:41:40Z

Thanks for the prompt reply!

Any ideas on the matching of vertical space?

mrabarnett · 2022-08-16T14:18:34Z

There are far fewer characters that need to match: [\x0A\x0B\x0C\x0D\x85\u2028\u2029] or [\x0A-\x0D\x85\u2028\u2029].
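
As a rough illustration of how that character class could be substituted into existing patterns, here is a minimal sketch using only the standard re module. The helper name expand_vertical_space and the constant VERT_RANGE are made up for this example and are not part of re or regex. It assumes simple patterns: nested classes, \Q...\E quoting and a literal ] at the start of a class are not handled.

```python
import re

# Code points from the comment above: LF, VT, FF, CR, NEL, LS, PS.
VERT_RANGE = r"\x0A-\x0D\x85\u2028\u2029"


def expand_vertical_space(pattern: str) -> str:
    # Hypothetical helper: rewrite Java-style "\v" into explicit code points.
    # Outside a character class it becomes a bracketed class; inside one,
    # only the bare ranges are spliced in.
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "v":
                out.append(VERT_RANGE if in_class else "[" + VERT_RANGE + "]")
            else:
                # Copy any other escape pair untouched (this also keeps "\\\\v" safe).
                out.append(ch + nxt)
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


# Both shapes mentioned earlier in the thread: \v* and [\v\t]*
line_breaks = re.compile(expand_vertical_space(r"\v*"))
breaks_or_tabs = re.compile(expand_vertical_space(r"[\v\t]*"))
```

With this, the XML rule files can stay untouched and the rewrite happens when the patterns are loaded.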

Maybe it could be added as \V, although that would be inconsistent with \h, and there are pairs of lowercase/uppercase escape codes where the uppercase one is the negative of the lowercase one, e.g. \d and \D. On the other hand, those implementations that have \h and \v don't have \H and \V.

Also, I don't want to add something that the re module might do differently if it were added later.

That's why it hasn't been added already.

dchaplinsky · 2022-08-16T15:16:30Z

Okay, makes perfect sense (still sad for my downstream task).


mrabarnett · 2022-08-16T18:38:59Z

I've come across a mention of \H and \V, so using \V would be a bad idea.

dchaplinsky · 2022-08-16T18:51:31Z

Maybe something like a pseudo-character class, like [:blank:]?


mrabarnett · 2022-08-16T20:28:59Z

Now I'm thinking about \y and \Y, which look a little like \v and \V. PostgreSQL uses them instead of \b and \B, which every other implementation that I know of uses, possibly because \b normally represents \x08 outside regex, and still does within character classes.

I want the regex module to remain compatible with the re module, and just in case they ever get added there in the future, I'm soliciting opinions on python-dev.

mrabarnett · 2022-08-17T19:30:27Z

I've added \p{HorizSpace} (\p{H}) and \p{VertSpace} (\p{V}) in regex 2022.8.17, which is currently being built on GitHub and should arrive on PyPI soon.

dchaplinsky · 2022-08-17T20:27:38Z

Wow, many thanks!

dchaplinsky · 2022-10-11T08:02:55Z

Well, you can probably add it to V1? It's already somewhat beyond the original re :)


mrabarnett · 2022-10-11T17:14:16Z

Given the feedback on python-dev, I won't be adding \y and \Y. What I've already added should suffice.

dchaplinsky · 2022-10-11T17:27:54Z

Totally works for me! Thanks a lot for looking into it again!


mrabarnett closed this as completed Aug 17, 2022
