Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only validate single emails #315

Merged
merged 10 commits into from
Aug 22, 2018
Merged

Conversation

dwt
Copy link
Contributor

@dwt dwt commented Jul 11, 2018

Hi there,

on our deployment we noticed a (new?) bug that suddenly multiple email addresses entered into an email field where accepted - and of course triggered errors in subsequent tools.

The problem are emails like foo,bar@baz.quoox which technically are two email addresses foo@localhost and bar@baz.quoox.

By my analysis this is triggered by the email regex, which has a whole bunch of special characters in the first [] group, that thus do not need to be escaped. Well, all but the - - which of course needs to be escaped.

Which is what this pull request fixes. What do you think?

@tisdall
Copy link
Contributor

tisdall commented Jul 11, 2018

Spelling mistake on "characterst", but everything else looks good! I'm surprised no one noticed the unescaped - in the regex before, but it seems like the only character it allowed accidentally was the ,.

CHANGES.rst Outdated
@@ -4,6 +4,10 @@ unreleased
- Drop Python 3.3 support. Add PyPy3 and Python 3.7 as allowed failures.
See https://github.com/Pylons/colander/pull/309

- Fix email validation to not allow all ascii characterst between + and /.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ASCII characters"

@@ -349,7 +349,7 @@ def __call__(self, node, value):
if self.match_object.match(value) is None:
raise Invalid(node, self.msg)

EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
EMAIL_RE = r"(?i)^[A-Z0-9._%!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an improvement.

Copy link
Member

@stevepiercy stevepiercy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo in the change log, otherwise it looks good to me. There was build failure, but it appears to have been due to a temporary network issue in Travis after I restarted the one task.

@stevepiercy
Copy link
Member

After careful scrutiny, I found a couple more issues with the regex for email. Is it worth creating new issues for these?

  1. I see that % is repeated. Suggest removing the first %.
  2. It should allow ., but disallow consecutive . in the local part of the email and disallow it in the start or end position, according to rfc3696. Those parts are beyond my regex powers.

@dwt
Copy link
Contributor Author

dwt commented Jul 12, 2018

Well, I'm not sure anything but a full parser can really validate email addresses. So the Regex is always going to be an approximation. Still glaring errors like allowing something that will be interpreted as multiple emails by subsequent tools should be fixed.

Regarding disallowing . in the start and end position and .., that would require adding specific character classes for the first and last character that are missing the dot. Not pretty. I'd have to say, I've seen some regexes that try to validate emails, and they are not pretty. If you want some comparison: I quite like the Perl REGEX posted here

Maybe something like this would do: (very much python3 pseudo code)

local_start_end = r"..."
local_middle = r"..."
email_regex = rf"{local_start_end}(?:[{local_middle}]*\.)?(?:[{local_middle}]+\.)*{local_start_end}@..."

Not sure this even works, but something along those lines…

@dwt
Copy link
Contributor Author

dwt commented Jul 12, 2018

Oh, and I hope I addressed those comments in the pull request.

Copy link
Member

@stevepiercy stevepiercy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got all but one, "ASCII" should be upper-case.

CHANGES.rst Outdated
@@ -4,6 +4,10 @@ unreleased
- Drop Python 3.3 support. Add PyPy3 and Python 3.7 as allowed failures.
See https://github.com/Pylons/colander/pull/309

- Fix email validation to not allow all ascii characters between + and /.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uppercase "ASCII".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bien sure!

@stevepiercy
Copy link
Member

Well, I'm not sure anything but a full parser can really validate email addresses.

Agreed. I'm not positive how rigorous the check should be, and ultimately apps should send an email with a link for the user to click and verify their email. I think what you've submitted are improvements and other refinements should not obstruct the approval of this PR.

I've asked the maintainers to review, and with their blessing, it can be merged.

@@ -349,7 +349,7 @@ def __call__(self, node, value):
if self.match_object.match(value) is None:
raise Invalid(node, self.msg)

EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
EMAIL_RE = r"(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please convert this into a verbose regex with comments? No human is going to be able to reason about it effectively in this form.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, there is a tool that can help with that.
https://regex101.com/r/N05IX8/1

"
(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$
"
gm
(?i) match the remainder of the pattern with the following effective flags: gmi
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
^ asserts position at start of a line
Match a single character present in the list below [A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case insensitive)
._!#$%&'*+ matches a single character in the list ._!#$%&'*+ (case insensitive)
\- matches the character - literally (case insensitive)
/=?^_`{|}~() matches a single character in the list /=?^_`{|}~() (case insensitive)
@ matches the character @ literally (case insensitive)
Match a single character present in the list below [A-Z0-9]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case insensitive)
1st Capturing Group ([.-][A-Z0-9]+)*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
Match a single character present in the list below [.-]
.- matches a single character in the list .- (case insensitive)
Match a single character present in the list below [A-Z0-9]+
\. matches the character . literally (case insensitive)
Match a single character present in the list below [A-Z]{2,22}
{2,22} Quantifier — Matches between 2 and 22 times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, but not what I was asking for: I want the intent documented explicitly, not inferred by some tool, so that one can argue whether what it does is what we mean.

Copy link
Contributor Author

@dwt dwt Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I Support the notion of rewriting this regex in a more readable form, I would also like to separate such a refactoring from this (quite critical) bugfix.

Just an example, this lets users get around our uniqueness of email addresses requirement for accounts. Not sure how to exploit that just yet, but it is making me nervous.

Which is why I would really like to get this merged and released soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwt Making the regex verbose isn't really a refactoring: without it, I've got no confidence that the change is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tseaver If I create a new issue that refers to this discussion to add a new feature, specifically apply verbose regex syntax to the email regex (and possibly others) and modifying the regex class to parse verbosity, would that be acceptable to merging this bug fix?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevepiercy I think so, but @tseaver should be ok with it as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tseaver: So how about this, can this be merged? It seems that another pull request for a refactoring of that regex seems to be acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwt Reformatting the regex to document its sections not a "feature": it is a maintainability requirement. The bug which the current URL contains exists precisely because understanding an 80-character regex is too hard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tseaver I think there's agreement about making the regex easier to understand. However, that should not be blocking making it "less wrong". This change contains a test case, fixes a bug and changes behavior.

Refactoring the regex is orthogonal and should not change behavior.

@dwt
Copy link
Contributor Author

dwt commented Aug 20, 2018

@tseaver I hope that this addresses all your worries.

@tisdall
Copy link
Contributor

tisdall commented Aug 20, 2018

@dwt - I'd remove that "FIXME" component unless you want to comment that whole regex too... ;) Also, while that regex may be more accurate, there's no way anyone will maintain that.

@mmerickel
Copy link
Member

I think the changes here are great, especially with the nice new comments (thank you @dwt) but I also do not understand the point of the FIXME line. That should be removed and turned into an issue in the tracker ... or just removed.

@dwt
Copy link
Contributor Author

dwt commented Aug 21, 2018

FIXME removed

Copy link
Member

@stevepiercy stevepiercy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the requested changes.

EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
EMAIL_RE = r"""(?ix) # matches case invariant with spaces and comments ignored
^ # matches the start of string
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+\-/=?^_`{|}~()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a copy-paste error. The comment should remove the escaped backslash for the hyphen (\- should be just -):

# matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()

@@ -349,7 +349,15 @@ def __call__(self, node, value):
if self.match_object.match(value) is None:
raise Invalid(node, self.msg)

EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
EMAIL_RE = r"""(?ix) # matches case invariant with spaces and comments ignored
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"invariant" should be "insensitive"

Also please append " for the entire expression" so that it is more clear that A-Z includes a-z.

[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+\-/=?^_`{|}~()
@ # matches the @ sign
[A-Z0-9]+ # matches multiple of the characters A-Z0-9
([.-][A-Z0-9]+)* # matches .- followed by at least one of A-Z0-9 - multiple times
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# matches one of . or - followed by at least one of A-Z0-9, zero to unlimited times

@ # matches the @ sign
[A-Z0-9]+ # matches multiple of the characters A-Z0-9
([.-][A-Z0-9]+)* # matches .- followed by at least one of A-Z0-9 - multiple times
\.[A-Z]{2,22} # matches two to twenty two of A-Z
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# matches a period, followed by two to twenty-two of A-Z

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@dwt
Copy link
Contributor Author

dwt commented Aug 21, 2018

@stevepiercy Are you fine with these changes?

Copy link
Member

@stevepiercy stevepiercy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed a few items on my first pass. Please make the requested changes, and then I think we're good to go.

Thank you for your patience and diligence on this PR. I appreciate it.

[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()
@ # matches the @ sign
[A-Z0-9]+ # matches multiples of the characters A-Z0-9
([.-][A-Z0-9]+)* # matches one of . or - followed by at least one of A-Z0-9, zero to unlimited
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I messed up. Please append " times".

"zero to unlimited times"

@@ -349,7 +349,15 @@ def __call__(self, node, value):
if self.match_object.match(value) is None:
raise Invalid(node, self.msg)

EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments ignored for the entire expression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wrap this to 79 columns.

EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments
                     # ignored for the entire expression

EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$"
EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments ignored for the entire expression
^ # matches the start of string
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked the phrasing here on the first pass. Let's use this for clarity:

# matches any of the characters A-Z0-9._!#$%&'*+-/=?^_`{|}~() one or more times

@dwt
Copy link
Contributor Author

dwt commented Aug 22, 2018

@stevepiercy like this?

I'd appreciate if this can be merged soon - and I have to say, if there are more ideas how to improve the comments, how about you merge first and just add them yourself? That seems to be a much more time saving workflow for everyone involved.

Considering that I initially wanted to contribute a one (!!!!!) character fix to this project.

@stevepiercy
Copy link
Member

Thank you, @dwt. @tseaver, is this OK to merge?

I think a release is also in order.

@tseaver tseaver merged commit 4ce7659 into Pylons:master Aug 22, 2018
@rbu
Copy link
Contributor

rbu commented Aug 23, 2018

Thank you for merging. I would appreciate if you pushed a release.

I have a somewhat off-topic question regarding your taste regarding the regex: Do you find it easier to read with these changes? Personally, I don't find the commented regex easier to understand than the previous one-liner due to duplication and whitespace. Maybe it comes down to taste, but I would like to hear your opinion on this.

@stevepiercy
Copy link
Member

@rbu in this case, because email regex is a nasty beast as defined in its RFC where the developer's intent is highly technical and difficult to put into layperson terms, my personal preference is to just punt to my favorite online tool to explain a regex in almost English terms.

For other regex's where the developer's intent can be expressed in layperson terms, I'm inclined to go the verbose route.

For the sake of best practices and consistency, I can compromise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants