-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only validate single emails #315
Conversation
Spelling mistake on "characterst", but everything else looks good! I'm surprised no one noticed the unescaped |
CHANGES.rst
Outdated
@@ -4,6 +4,10 @@ unreleased | |||
- Drop Python 3.3 support. Add PyPy3 and Python 3.7 as allowed failures. | |||
See https://github.com/Pylons/colander/pull/309 | |||
|
|||
- Fix email validation to not allow all ascii characterst between + and /. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ASCII characters"
colander/__init__.py
Outdated
@@ -349,7 +349,7 @@ def __call__(self, node, value): | |||
if self.match_object.match(value) is None: | |||
raise Invalid(node, self.msg) | |||
|
|||
EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" | |||
EMAIL_RE = r"(?i)^[A-Z0-9._%!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo in the change log, otherwise it looks good to me. There was build failure, but it appears to have been due to a temporary network issue in Travis after I restarted the one task.
After careful scrutiny, I found a couple more issues with the regex for email. Is it worth creating new issues for these?
|
Well, I'm not sure anything but a full parser can really validate email addresses. So the Regex is always going to be an approximation. Still glaring errors like allowing something that will be interpreted as multiple emails by subsequent tools should be fixed. Regarding disallowing Maybe something like this would do: (very much python3 pseudo code) local_start_end = r"..."
local_middle = r"..."
email_regex = rf"{local_start_end}(?:[{local_middle}]*\.)?(?:[{local_middle}]+\.)*{local_start_end}@..." Not sure this even works, but something along those lines… |
Oh, and I hope I addressed those comments in the pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You got all but one, "ASCII" should be upper-case.
CHANGES.rst
Outdated
@@ -4,6 +4,10 @@ unreleased | |||
- Drop Python 3.3 support. Add PyPy3 and Python 3.7 as allowed failures. | |||
See https://github.com/Pylons/colander/pull/309 | |||
|
|||
- Fix email validation to not allow all ascii characters between + and /. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uppercase "ASCII".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bien sure!
Agreed. I'm not positive how rigorous the check should be, and ultimately apps should send an email with a link for the user to click and verify their email. I think what you've submitted are improvements and other refinements should not obstruct the approval of this PR. I've asked the maintainers to review, and with their blessing, it can be merged. |
colander/__init__.py
Outdated
@@ -349,7 +349,7 @@ def __call__(self, node, value): | |||
if self.match_object.match(value) is None: | |||
raise Invalid(node, self.msg) | |||
|
|||
EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" | |||
EMAIL_RE = r"(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please convert this into a verbose regex with comments? No human is going to be able to reason about it effectively in this form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortunately, there is a tool that can help with that.
https://regex101.com/r/N05IX8/1
"
(?i)^[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$
"
gm
(?i) match the remainder of the pattern with the following effective flags: gmi
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
^ asserts position at start of a line
Match a single character present in the list below [A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case insensitive)
._!#$%&'*+ matches a single character in the list ._!#$%&'*+ (case insensitive)
\- matches the character - literally (case insensitive)
/=?^_`{|}~() matches a single character in the list /=?^_`{|}~() (case insensitive)
@ matches the character @ literally (case insensitive)
Match a single character present in the list below [A-Z0-9]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case insensitive)
1st Capturing Group ([.-][A-Z0-9]+)*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
Match a single character present in the list below [.-]
.- matches a single character in the list .- (case insensitive)
Match a single character present in the list below [A-Z0-9]+
\. matches the character . literally (case insensitive)
Match a single character present in the list below [A-Z]{2,22}
{2,22} Quantifier — Matches between 2 and 22 times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, but not what I was asking for: I want the intent documented explicitly, not inferred by some tool, so that one can argue whether what it does is what we mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I Support the notion of rewriting this regex in a more readable form, I would also like to separate such a refactoring from this (quite critical) bugfix.
Just an example, this lets users get around our uniqueness of email addresses requirement for accounts. Not sure how to exploit that just yet, but it is making me nervous.
Which is why I would really like to get this merged and released soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dwt Making the regex verbose isn't really a refactoring: without it, I've got no confidence that the change is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tseaver If I create a new issue that refers to this discussion to add a new feature, specifically apply verbose regex syntax to the email regex (and possibly others) and modifying the regex class to parse verbosity, would that be acceptable to merging this bug fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevepiercy I think so, but @tseaver should be ok with it as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tseaver: So how about this, can this be merged? It seems that another pull request for a refactoring of that regex seems to be acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dwt Reformatting the regex to document its sections not a "feature": it is a maintainability requirement. The bug which the current URL contains exists precisely because understanding an 80-character regex is too hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tseaver I think there's agreement about making the regex easier to understand. However, that should not be blocking making it "less wrong". This change contains a test case, fixes a bug and changes behavior.
Refactoring the regex is orthogonal and should not change behavior.
@tseaver I hope that this addresses all your worries. |
@dwt - I'd remove that "FIXME" component unless you want to comment that whole regex too... ;) Also, while that regex may be more accurate, there's no way anyone will maintain that. |
I think the changes here are great, especially with the nice new comments (thank you @dwt) but I also do not understand the point of the FIXME line. That should be removed and turned into an issue in the tracker ... or just removed. |
FIXME removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make the requested changes.
colander/__init__.py
Outdated
EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" | ||
EMAIL_RE = r"""(?ix) # matches case invariant with spaces and comments ignored | ||
^ # matches the start of string | ||
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+\-/=?^_`{|}~() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a copy-paste error. The comment should remove the escaped backslash for the hyphen (\-
should be just -
):
# matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~()
colander/__init__.py
Outdated
@@ -349,7 +349,15 @@ def __call__(self, node, value): | |||
if self.match_object.match(value) is None: | |||
raise Invalid(node, self.msg) | |||
|
|||
EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" | |||
EMAIL_RE = r"""(?ix) # matches case invariant with spaces and comments ignored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"invariant" should be "insensitive"
Also please append " for the entire expression" so that it is more clear that A-Z
includes a-z
.
colander/__init__.py
Outdated
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+\-/=?^_`{|}~() | ||
@ # matches the @ sign | ||
[A-Z0-9]+ # matches multiple of the characters A-Z0-9 | ||
([.-][A-Z0-9]+)* # matches .- followed by at least one of A-Z0-9 - multiple times |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# matches one of . or - followed by at least one of A-Z0-9, zero to unlimited times
colander/__init__.py
Outdated
@ # matches the @ sign | ||
[A-Z0-9]+ # matches multiple of the characters A-Z0-9 | ||
([.-][A-Z0-9]+)* # matches .- followed by at least one of A-Z0-9 - multiple times | ||
\.[A-Z]{2,22} # matches two to twenty two of A-Z |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# matches a period, followed by two to twenty-two of A-Z
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@stevepiercy Are you fine with these changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed a few items on my first pass. Please make the requested changes, and then I think we're good to go.
Thank you for your patience and diligence on this PR. I appreciate it.
colander/__init__.py
Outdated
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~() | ||
@ # matches the @ sign | ||
[A-Z0-9]+ # matches multiples of the characters A-Z0-9 | ||
([.-][A-Z0-9]+)* # matches one of . or - followed by at least one of A-Z0-9, zero to unlimited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I messed up. Please append " times".
"zero to unlimited times"
colander/__init__.py
Outdated
@@ -349,7 +349,15 @@ def __call__(self, node, value): | |||
if self.match_object.match(value) is None: | |||
raise Invalid(node, self.msg) | |||
|
|||
EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" | |||
EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments ignored for the entire expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wrap this to 79 columns.
EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments
# ignored for the entire expression
colander/__init__.py
Outdated
EMAIL_RE = "(?i)^[A-Z0-9._%!#$%&'*+-/=?^_`{|}~()]+@[A-Z0-9]+([.-][A-Z0-9]+)*\.[A-Z]{2,22}$" | ||
EMAIL_RE = r"""(?ix) # matches case insensitive with spaces and comments ignored for the entire expression | ||
^ # matches the start of string | ||
[A-Z0-9._!#$%&'*+\-/=?^_`{|}~()]+ # matches multiples of the characters: A-Z0-9._!#$%&'*+-/=?^_`{|}~() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I overlooked the phrasing here on the first pass. Let's use this for clarity:
# matches any of the characters A-Z0-9._!#$%&'*+-/=?^_`{|}~() one or more times
@stevepiercy like this? I'd appreciate if this can be merged soon - and I have to say, if there are more ideas how to improve the comments, how about you merge first and just add them yourself? That seems to be a much more time saving workflow for everyone involved. Considering that I initially wanted to contribute a one (!!!!!) character fix to this project. |
Thank you for merging. I would appreciate if you pushed a release. I have a somewhat off-topic question regarding your taste regarding the regex: Do you find it easier to read with these changes? Personally, I don't find the commented regex easier to understand than the previous one-liner due to duplication and whitespace. Maybe it comes down to taste, but I would like to hear your opinion on this. |
@rbu in this case, because email regex is a nasty beast as defined in its RFC where the developer's intent is highly technical and difficult to put into layperson terms, my personal preference is to just punt to my favorite online tool to explain a regex in almost English terms. For other regex's where the developer's intent can be expressed in layperson terms, I'm inclined to go the verbose route. For the sake of best practices and consistency, I can compromise. |
Hi there,
on our deployment we noticed a (new?) bug that suddenly multiple email addresses entered into an email field where accepted - and of course triggered errors in subsequent tools.
The problem are emails like
foo,bar@baz.quoox
which technically are two email addressesfoo@localhost
andbar@baz.quoox
.By my analysis this is triggered by the email regex, which has a whole bunch of special characters in the first
[]
group, that thus do not need to be escaped. Well, all but the-
- which of course needs to be escaped.Which is what this pull request fixes. What do you think?