Only validate single emails #315

dwt · 2018-07-11T16:18:06Z

Hi there,

On our deployment we noticed a (new?) bug: suddenly, multiple email addresses entered into an email field were accepted, and of course they triggered errors in subsequent tools.

The problem is emails like foo,bar@baz.quoox, which technically are two email addresses: foo@localhost and bar@baz.quoox.

By my analysis this is triggered by the email regex, which has a whole bunch of special characters in the first [] group that do not need to be escaped there. Well, all but the -, which of course needs to be escaped.

Which is what this pull request fixes. What do you think?
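To make the report concrete, here is a minimal sketch that compares the two patterns from the diff on the address mentioned above. The helper name `is_accepted` is made up for illustration, and both patterns are written as raw strings here so they can be pasted as-is; the library's own `Email` validator is not used.

```python
import re

# Pattern before the fix: "+-/" inside the class is a range from "+" to "/".
OLD_EMAIL_RE = r"(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
# Pattern after the fix: the hyphen is escaped, so it is a literal character.
NEW_EMAIL_RE = r"(?i)^[A-Z0-9._%!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"


def is_accepted(pattern, value):
    """Hypothetical helper: True if the whole value matches the pattern."""
    return re.match(pattern, value) is not None


address = "foo,bar@baz.quoox"
old_result = is_accepted(OLD_EMAIL_RE, address)  # comma slips through the range
new_result = is_accepted(NEW_EMAIL_RE, address)  # comma is rejected
print(old_result, new_result)
```

A literal hyphen in the local part, as in foo-bar@baz.quoox, is still accepted by the fixed pattern.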

… separated by comma

tisdall · 2018-07-11T17:05:04Z

Spelling mistake on "characterst", but everything else looks good! I'm surprised no one noticed the unescaped - in the regex before, but it seems like the only character it allowed accidentally was the ,.
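The claim that only the comma was let in by accident can be checked mechanically. The sketch below assumes it is enough to compare the two character classes over printable ASCII; the names `OLD_CLASS`, `NEW_CLASS` and `accidental` are illustrative only.

```python
import re

# The first character class of the email regex, before and after the fix.
OLD_CLASS = re.compile(r"(?i)[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]")
NEW_CLASS = re.compile(r"(?i)[A-Z0-9._%!#$%&'*+\-/=?^_`{|}~()]")

# Printable ASCII characters, from space (32) to tilde (126).
printable = [chr(code) for code in range(32, 127)]

# Characters the old class accepted that the fixed class rejects.
accidental = [
    ch for ch in printable
    if OLD_CLASS.fullmatch(ch) and not NEW_CLASS.fullmatch(ch)
]
print(accidental)
```

The range "+" to "/" covers the five characters + , - . / and all of them except the comma are also listed (or meant to be listed) explicitly in the class.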

stevepiercy · 2018-07-11T21:37:36Z

CHANGES.rst

@@ -4,6 +4,10 @@ unreleased
 - Drop Python 3.3 support. Add PyPy3 and Python 3.7 as allowed failures.
  See https://github.com/Pylons/colander/pull/309

+- Fix email validation to not allow all ascii characterst between + and /.


"ASCII characters"

stevepiercy · 2018-07-11T22:04:35Z

colander/__init__.py

@@ -349,7 +349,7 @@ def __call__(self, node, value):
        if self.match_object.match(value) is None:
            raise Invalid(node, self.msg)

-EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
+EMAIL_RE = r"(?i)^[A-Z0-9._%!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"


This is an improvement.
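A side effect of the new line is the r prefix. In a normal string literal, "\." is an unrecognized escape sequence, which recent Python versions warn about; a raw string keeps the backslash without any escaping. A small sketch, using only the tail of the pattern and made-up variable names:

```python
import re

# Raw string: the backslash reaches the regex engine unchanged.
raw_pattern = r"\.[A-Z]{2,22}$"
# Equivalent normal string: the backslash has to be doubled.
escaped_pattern = "\\.[A-Z]{2,22}$"

same = raw_pattern == escaped_pattern

# The fragment matches a dot followed by a 2 to 22 letter suffix.
tld_match = re.search(raw_pattern, "example.org", re.IGNORECASE)
print(same, tld_match.group(0))
```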

stevepiercy

Minor typo in the change log, otherwise it looks good to me. There was a build failure, but it appears to have been due to a temporary network issue in Travis; it passed after I restarted the one task.

stevepiercy · 2018-07-11T22:07:10Z

After careful scrutiny, I found a couple more issues with the regex for email. Is it worth creating new issues for these?

I see that % is repeated. Suggest removing the first %.
It should allow ., but disallow consecutive dots in the local part of the email and disallow a dot in the start or end position, according to RFC 3696. Those parts are beyond my regex powers.

dwt · 2018-07-12T13:10:33Z

Well, I'm not sure anything but a full parser can really validate email addresses, so the regex is always going to be an approximation. Still, glaring errors like allowing something that will be interpreted as multiple emails by subsequent tools should be fixed.

Regarding disallowing . in the start and end position and .., that would require adding specific character classes for the first and last character that are missing the dot. Not pretty. I'd have to say, I've seen some regexes that try to validate emails, and they are not pretty. If you want some comparison: I quite like the Perl REGEX posted here

Maybe something like this would do: (very much python3 pseudo code)

local_start_end = r"..."
local_middle = r"..."
email_regex = rf"{local_start_end}(?:[{local_middle}]*\.)?(?:[{local_middle}]+\.)*{local_start_end}@..."

Not sure this even works, but something along those lines…
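One way the idea above could be made runnable is to describe the local part as dot-separated runs of non-dot characters, which rules out leading, trailing and consecutive dots without separate start and end classes. This is only a sketch under assumptions: the character set is copied from the PR's class minus the dot, the domain part is taken unchanged from the existing regex, and the names `LOCAL_ATOM`, `LOCAL_PART`, `DOMAIN` and `EMAIL_SKETCH` are hypothetical. It is not the pattern that was merged.

```python
import re

# Assumed: one or more allowed local-part characters, excluding the dot.
LOCAL_ATOM = r"[A-Z0-9_!#$%&'*+\-/=?^`{|}~()]+"
# Runs of such characters separated by single dots.
LOCAL_PART = rf"{LOCAL_ATOM}(?:\.{LOCAL_ATOM})*"
# Domain part as in the existing regex.
DOMAIN = r"[A-Z0-9]+(?:[.-][A-Z0-9]+)*\.[A-Z]{2,22}"

EMAIL_SKETCH = re.compile(rf"{LOCAL_PART}@{DOMAIN}", re.IGNORECASE)

samples = [
    "a.b@example.com",    # single dot between two runs
    ".a@example.com",     # leading dot
    "a.@example.com",     # trailing dot
    "a..b@example.com",   # consecutive dots
    "foo,bar@baz.quoox",  # comma-separated pair from the bug report
]
# fullmatch requires the whole string to match, so no ^ and $ are needed.
accepted = [s for s in samples if EMAIL_SKETCH.fullmatch(s)]
print(accepted)
```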

dwt · 2018-07-12T13:10:52Z

Oh, and I hope I addressed those comments in the pull request.

stevepiercy

You got all but one, "ASCII" should be upper-case.

stevepiercy · 2018-07-12T20:50:54Z

CHANGES.rst

@@ -4,6 +4,10 @@ unreleased
 - Drop Python 3.3 support. Add PyPy3 and Python 3.7 as allowed failures.
  See https://github.com/Pylons/colander/pull/309

+- Fix email validation to not allow all ascii characters between + and /.


Uppercase "ASCII".

stevepiercy · 2018-07-12T20:56:35Z

Well, I'm not sure anything but a full parser can really validate email addresses.

Agreed. I'm not positive how rigorous the check should be, and ultimately apps should send an email with a link for the user to click and verify their email. I think what you've submitted are improvements and other refinements should not obstruct the approval of this PR.

I've asked the maintainers to review, and with their blessing, it can be merged.

tseaver · 2018-07-12T22:22:07Z

colander/__init__.py

@@ -349,7 +349,7 @@ def __call__(self, node, value):
        if self.match_object.match(value) is None:
            raise Invalid(node, self.msg)

-EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
+EMAIL_RE = r"(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"


Can we please convert this into a verbose regex with comments? No human is going to be able to reason about it effectively in this form.
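For readers who have not used the verbose form: with the x flag, Python ignores whitespace and # comments in the pattern, except inside a character class, where # stays a literal character. That detail matters here because the email class contains #. The toy pattern below is invented for illustration and is unrelated to the email regex.

```python
import re

# Compact form of a toy pattern: letters or '#', a hyphen, two digits.
COMPACT = re.compile(r"(?i)^[A-Z#]+-[0-9]{2}$")

# The same pattern in verbose form with one comment per part.
VERBOSE = re.compile(r"""(?ix)
    ^           # start of string
    [A-Z#]+     # one or more letters or literal hash signs
    -           # a literal hyphen (outside a class it needs no escape)
    [0-9]{2}    # exactly two digits
    $           # end of string
""")

samples = ["ab#-12", "AB-7", "x-00", "#-99 "]
results = [(s, bool(COMPACT.match(s)), bool(VERBOSE.match(s))) for s in samples]
for sample, compact_ok, verbose_ok in results:
    print(repr(sample), compact_ok, verbose_ok)
```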

Fortunately, there is a tool that can help with that.
https://regex101.com/r/N05IX8/1

"(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" gm

(?i) match the remainder of the pattern with the following effective flags: gmi
    i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
^ asserts position at start of a line
Match a single character present in the list below [A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
    0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case insensitive)
    ._!#$%&'*+ matches a single character in the list ._!#$%&'*+ (case insensitive)
    \- matches the character - literally (case insensitive)
    /=?^_`{|}~() matches a single character in the list /=?^_`{|}~() (case insensitive)
@ matches the character @ literally (case insensitive)
Match a single character present in the list below [A-Z0-9]+
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
    0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case insensitive)
1st Capturing Group ([.-][A-Z0-9]+)*
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
    Match a single character present in the list below [.-]
        .- matches a single character in the list .- (case insensitive)
    Match a single character present in the list below [A-Z0-9]+
\. matches the character . literally (case insensitive)
Match a single character present in the list below [A-Z]{2,22}
    {2,22} Quantifier: matches between 2 and 22 times, as many times as possible, giving back as needed (greedy)
    A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
$ asserts position at the end of a line
Global pattern flags
    g modifier: global. All matches (don't return after first match)
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

Nice, but not what I was asking for: I want the intent documented explicitly, not inferred by some tool, so that one can argue whether what it does is what we mean.

While I support the notion of rewriting this regex in a more readable form, I would also like to separate such a refactoring from this (quite critical) bugfix.

Just as an example, this lets users get around our requirement that email addresses be unique across accounts. Not sure how to exploit that just yet, but it is making me nervous.

Which is why I would really like to get this merged and released soon.

@dwt Making the regex verbose isn't really a refactoring: without it, I've got no confidence that the change is correct.

@tseaver If I create a new issue that refers to this discussion to add a new feature, specifically applying verbose regex syntax to the email regex (and possibly others) and modifying the regex class to handle verbose patterns, would that be acceptable for merging this bug fix?

@stevepiercy I think so, but @tseaver should be ok with it as well

@tseaver: So how about this, can this be merged? It seems that a separate pull request for refactoring that regex would be acceptable?

@dwt Reformatting the regex to document its sections is not a "feature": it is a maintainability requirement. The bug which the current URL contains exists precisely because understanding an 80-character regex is too hard.

@tseaver I think there's agreement about making the regex easier to understand. However, that should not be blocking making it "less wrong". This change contains a test case, fixes a bug and changes behavior.

Refactoring the regex is orthogonal and should not change behavior.

dwt · 2018-08-20T14:35:11Z

@tseaver I hope that this addresses all your worries.

tisdall · 2018-08-20T15:45:50Z

@dwt - I'd remove that "FIXME" component unless you want to comment that whole regex too... ;) Also, while that regex may be more accurate, there's no way anyone will maintain that.

mmerickel · 2018-08-21T06:47:28Z

I think the changes here are great, especially with the nice new comments (thank you @dwt) but I also do not understand the point of the FIXME line. That should be removed and turned into an issue in the tracker ... or just removed.

dwt · 2018-08-21T07:12:05Z

FIXME removed

stevepiercy

Please make the requested changes.

stevepiercy · 2018-08-21T08:37:14Z

colander/__init__.py

-EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
+EMAIL_RE = r"""(?ix) # matches case invariant with spaces and comments ignored
+^ # matches the start of string
+[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+\-/=?^_`{|}~()


This was a copy-paste error. The comment should drop the backslash that escapes the hyphen (\- should be just -):

# matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()

stevepiercy · 2018-08-21T08:43:54Z

colander/__init__.py

@@ -349,7 +349,15 @@ def __call__(self, node, value):
        if self.match_object.match(value) is None:
            raise Invalid(node, self.msg)

-EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
+EMAIL_RE = r"""(?ix) # matches case invariant with spaces and comments ignored


"invariant" should be "insensitive"

Also please append " for the entire expression" so that it is more clear that A-Z includes a-z.

stevepiercy · 2018-08-21T08:49:41Z

colander/__init__.py

+[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+\-/=?^_`{|}~()
+@ # matches the @ sign
+[A-Z0-9]+ # matches multiple of the characters A-Z0-9
+([.-][A-Z0-9]+)* # matches .- followed by at least one of A-Z0-9 - multiple times


# matches one of . or - followed by at least one of A-Z0-9, zero to unlimited times

stevepiercy · 2018-08-21T08:51:00Z

colander/__init__.py

+@ # matches the @ sign
+[A-Z0-9]+ # matches multiple of the characters A-Z0-9
+([.-][A-Z0-9]+)* # matches .- followed by at least one of A-Z0-9 - multiple times
+\.[A-Z]{2,22} # matches two to twenty two of A-Z


# matches a period, followed by two to twenty-two of A-Z

dwt · 2018-08-21T12:51:26Z

@stevepiercy Are you fine with these changes?

stevepiercy

I missed a few items on my first pass. Please make the requested changes, and then I think we're good to go.

Thank you for your patience and diligence on this PR. I appreciate it.

stevepiercy · 2018-08-21T18:50:01Z

colander/__init__.py

+[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()
+@ # matches the @ sign
+[A-Z0-9]+ # matches multiples of the characters A-Z0-9
+([.-][A-Z0-9]+)* # matches one of . or - followed by at least one of A-Z0-9, zero to unlimited


Sorry, I messed up. Please append " times".

"zero to unlimited times"

stevepiercy · 2018-08-21T18:54:18Z

colander/__init__.py

@@ -349,7 +349,15 @@ def __call__(self, node, value):
        if self.match_object.match(value) is None:
            raise Invalid(node, self.msg)

-EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
+EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments ignored for the entire expression


Let's wrap this to 79 columns.

EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments
                     # ignored for the entire expression

stevepiercy · 2018-08-21T18:56:31Z

colander/__init__.py

-EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
+EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments ignored for the entire expression
+^ # matches the start of string
+[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()


I overlooked the phrasing here on the first pass. Let's use this for clarity:

# matches any of the characters A-Z0-9._!#$%&'*+-/=?^_`{|}~() one or more times
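Putting the review suggestions in this thread together gives roughly the following verbose pattern. This is a reconstruction from the diff fragments and comments quoted above, not a copy of the merged file: the closing $ line and its comment are assumed, long comments are wrapped here for width, and `ONE_LINE` and `VERBOSE` are illustrative names. The sketch also checks that the verbose form behaves like the one-line form on a few samples.

```python
import re

# One-line form from the diff, after the fix and the duplicate % removal.
ONE_LINE = r"(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"

# Verbose form assembled from the review comments in this thread.
VERBOSE = r"""(?ix) # matches case insensitive with spaces and comments
                    # ignored for the entire expression
^ # matches the start of string
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches any of the characters
                                  # A-Z0-9._!#$%&'*+-/=?^_`{|}~()
                                  # one or more times
@ # matches the @ sign
[A-Z0-9]+ # matches multiples of the characters A-Z0-9
([.-][A-Z0-9]+)* # matches one of . or - followed by at least one of A-Z0-9,
                 # zero to unlimited times
\.[A-Z]{2,22} # matches a period, followed by two to twenty-two of A-Z
$ # assumed closing line: matches the end of string
"""

samples = [
    "user@example.com",
    "foo,bar@baz.quoox",
    "first.last+tag@sub-domain.example.org",
    "no-at-sign.example.com",
    "a@b.c",
]
comparison = {
    s: (re.match(ONE_LINE, s) is not None, re.match(VERBOSE, s) is not None)
    for s in samples
}
for sample, (one_line_ok, verbose_ok) in comparison.items():
    print(sample, one_line_ok, verbose_ok)
```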

dwt · 2018-08-22T06:16:14Z

@stevepiercy like this?

I'd appreciate it if this could be merged soon. And I have to say, if there are more ideas on how to improve the comments, how about you merge first and just add them yourself? That seems to be a much more time-saving workflow for everyone involved.

Considering that I initially wanted to contribute a one (!!!!!) character fix to this project.

stevepiercy · 2018-08-22T11:18:33Z

Thank you, @dwt. @tseaver, is this OK to merge?

I think a release is also in order.

rbu · 2018-08-23T13:10:15Z

Thank you for merging. I would appreciate it if you pushed a release.

I have a somewhat off-topic question about your taste in regexes: do you find this one easier to read with these changes? Personally, I don't find the commented regex easier to understand than the previous one-liner, due to the duplication and whitespace. Maybe it comes down to taste, but I would like to hear your opinion on this.

stevepiercy · 2018-08-23T13:27:43Z

@rbu In this case, because an email regex is a nasty beast as defined in its RFC, where the developer's intent is highly technical and difficult to put into layperson's terms, my personal preference is to just punt to my favorite online tool to explain a regex in almost English terms.

For other regexes where the developer's intent can be expressed in layperson's terms, I'm inclined to go the verbose route.

For the sake of best practices and consistency, I can compromise.

dwt added 2 commits July 11, 2018 18:08

Fix validation of email addresses to not recognize multiple addresses…

577b910

… separated by comma

Add changenotes

591d549

stevepiercy reviewed Jul 11, 2018


stevepiercy requested changes Jul 11, 2018


dwt added 2 commits July 12, 2018 14:14

Typo

c81c1d6

Remove duplicate %

7b0d0a5

stevepiercy requested changes Jul 12, 2018


stevepiercy requested review from miohtama, mcdonc, tseaver, ericof and mmerickel July 12, 2018 20:55

Uppercase ASCII

b28e80a

stevepiercy approved these changes Jul 12, 2018


tseaver reviewed Jul 12, 2018


dwt added 2 commits August 20, 2018 16:34

Switch to multiline regex with comments

5c9c688

Add more correct regex.

246b552

Remove FIXME as per request

9fb1e4b

stevepiercy requested changes Aug 21, 2018


Update comments

3ec22d7

stevepiercy requested changes Aug 21, 2018


Work in @stevepercies comments

9478e96

stevepiercy approved these changes Aug 22, 2018


tseaver merged commit 4ce7659 into Pylons:master Aug 22, 2018

stevepiercy mentioned this pull request Feb 1, 2019

Fix: email regex #324

Merged

pyup-bot mentioned this pull request Jun 30, 2020

Pin colander to latest version 1.7.0 camptocamp/c2cgeoportal#6618

Closed
