-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite egg_info_matches with canonicalize_name #5875
Conversation
33fea2e
to
8673ab4
Compare
src/pip/_internal/index.py
Outdated
return None | ||
return full_match[(len(search_name) + 1):] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a case to consider if the original match didn't occur at the beginning of the string? For example, it seems like a test case like ("pip-pip-18.0", "pip", "18.0")
would yield version pip-18.0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python packaging tools (Distutils, maybe Setuptools? I am not sure) historically allow packages to specify an arbitrary string as version, so if a package named pip
has a version pip-18.0
, pip-pip-18.0
is exactly the sdist name to expect.
Links on a simple API page are guaranteed to come from a specific package (e.g. pip
if the URL is /simple/pip/
, or pip-pip
if /simple/pip-pip/
), so the name should match the beginning. Ambiguities occur for flat HTML pages (find-links values), but this has been the case since as far as I can recall, and there’s really no way to work around it either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what I meant to ask is if there are any real-world cases where _egg_info_re.search(egg_info)
doesn't match the full provided egg_info
string (e.g. starts after the beginning and/or ends before getting to the end) and whether it would make sense to add such test cases, because it looks like all of the test cases match the full string. (The example I provided wasn't such a case after all.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I never thought about that. Python package names (as defined in PEP 508) must return a match to
re.match(r'^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$', re.IGNORECASE)
Currently _egg_info_re
is r'([a-z0-9_.]+)-([a-z0-9_.!+-]+)'
, which always matches the whole string for any validly-named links, so I don’t think we need to consider cases that it doesn’t match the whole thing.
I have a feeling like the implementation of I also feel like it should be discussed whether (1) the |
I can’t really guess what the original intention was, but right now the function is used in two similar, but kind of distinctive ways:
I fully agree the function can (and should) be simplified for 1., but I am not confident about 2. Maybe it is a good idea to separate the two usages, and simplify 1. specifically for |
Making two functions sounds like a good approach worth considering. For 2, |
For |
Excellent! That's what I was hoping. Thanks for thinking it through. Would you be willing to try out that PR? |
Sure, should I just overwrite this one or open a new one? |
I would just overwrite this one, and save it locally for now.
But if you’d rather keep this public, you could close this PR and open a
new one. It’s up to you.
…On Fri, Oct 12, 2018 at 4:04 AM Tzu-ping Chung ***@***.***> wrote:
Sure, should I just overwrite this one or open a new one?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5875 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAVt7nrnFNgv2Rl5STNh6dLz8GvhvgHcks5ukHdIgaJpZM4XXEwP>
.
|
8673ab4
to
b5e8dd9
Compare
I found my implementation is flawed. There is no guarantee the canonicalised string is as long as the input, since the logic collapses continuous dashes, dots, and underlines. A (totally valid according to PEP 508) package name More elaborated parsing would be needed, but I am not sure if this is even worth the effort. I am certainly not aware of a real-world package having a name and canonicalised name with different lengths. |
Which function version / code path are you saying is flawed: 1 or 2 in the previous comments? Also, I'm not sure I understand. Doesn't the canonicalization do the collapsing you're talking about, or is there separate collapsing going on somewhere else? |
Also, was the flaw you're referring to present in the current implementation, too, or only in the approach as you tried implementing? |
src/pip/_internal/wheel.py
Outdated
@@ -620,6 +620,17 @@ def supported(self, tags=None): | |||
return bool(set(tags).intersection(self.file_tags)) | |||
|
|||
|
|||
def _is_like_egg_info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this better be named contains_egg_info
, since the regex can be matching on any substring?
src/pip/_internal/wheel.py
Outdated
|
||
:param s: The string to parse. E.g. foo-2.1 | ||
""" | ||
if not _egg_info_re.search(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I would reverse the True and False cases for simplicity, or even cast to boolean.
So to paraphrase, you need a function which strips from the beginning of |
b5e8dd9
to
557b552
Compare
Sorry I didn’t describe it clearly. I was talking about 1. (in index.py), if the package name is something like The old code fails to do so since it requires the argument to almost match the actual package name. The new code still won’t be able to match it, since the string slicing method does not work if the canonicalised name has a different length from the actual package name. To overcome this, the implementation would need to know how to use |
557b552
to
f0f715c
Compare
Yes, I agree with your assessment. So this would be addressing yet another bug in the current implementation.
Roughly speaking, one option would be to look at successively larger prefixes (using an increasing index |
I added an implementation using Since this uses a duplicate regex pattern (and shares the canonicalisation logic) with |
dc1fc99
to
57777c7
Compare
This is one reason why the approach I suggested I feel is simpler and less error prone. It depends only on |
Also, with the approach that I described, if the index you start with is |
57777c7
to
9801eb7
Compare
I thought about this again. Instead of checking the name first, and then try to find the dash, we can find all dashes first, and check which one of them actually yields the correct canonical name (by calling I haven’t seen any packages with three or more dashes in their names, so practically this would be able to find the match within three tries (and always once with modern setuptools). |
Okay, great. That sounds okay. |
Oh, sorry, I only saw your comment -- not that you had already updated the PR. It's looking a lot better. I was thinking more about your earlier implementation, and I think I might be okay with using the One of my thoughts on your earlier version is that I do think it can be made simpler. For example, I think there are approaches that don't involve calling the lower-level |
("foo-_bar-1.0", "foo-bar", 8), | ||
], | ||
) | ||
def test_find_name_version_sep(egg_info, canonical_name, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also add a test of how this function behaves if the version specifier is missing (e.g. "zope-interface"
) -- as well as for _egg_info_matches()
below if not already present.
Another edge case I think would be worth adding is something like "zope.interface.-4.5.0"
(where the prefix you check winds up ending with a non-hyphen separator character). Two dashes in the separator is another case: "zope.interface--4.5.0"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to PEP 508, the package name must match ^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$
, so zope.interface.-4.5.0
is invalid. I’ll add zope.interface--4.5.0
(the first dash would be the separator).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know in advance that all of the egg_info
strings passed to the function will be valid? Or is there the possibility of mistaken input by the user? If the latter, I think it would be worth having some test cases to make sure e.g. the function doesn't treat invalid input as valid. (More generally, what guarantees do we have about what will be passed to the function?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The egg info strings come from an index or find links page. They can be invalid only if the page is formatted incorrectly. I think it is reasonable to either expect invalid inputs or not, depending on how you think pip should handle non-standard-conforming inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, what I was getting at is that I think it would be worth adding a few more test cases that came to mind, even if invalid -- not because the additional cases should be considered valid, but because it provides more visibility into how the two functions behave on various edge cases. If it's possible for these inputs to happen in real life, then that's even more reason.
For example, I noticed that none of the _find_name_version_sep()
test cases hit the ValueError
line (though some of the _egg_info_matches()
test cases hit that code path by virtue of calling _find_name_version_sep()
).
Here are additional test cases that come to mind (for both functions):
("zope.interface4.5.0", "zope-interface"),
("zope.interface.4.5.0", "zope-interface"),
("zope.interface.-4.5.0", "zope-interface"),
("zope.interface--4.5.0", "zope-interface"),
("zope.interface--", "zope-interface"),
("zope.interface-", "zope-interface"),
("zope.interface", "zope-interface"),
(You may want to put the test cases for _find_name_version_sep()
that cause a ValueError
into its own test function.)
src/pip/_internal/index.py
Outdated
8 | ||
""" | ||
# Project name and version must be separated by one single dash. Find all | ||
# occurances of dashes; if the string in front of it matches the canonical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"occurrences" btw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch!
c4fba95
to
1ead7da
Compare
The idea is to use regex to scan for character sequences that would be canonicalized into a dash, and skip the equal amount of sequences to the given canonical name.
1ead7da
to
5d0a576
Compare
Regarding |
Added test cases as suggested. I also modified the implementation a little to make the return value of |
Okay, awesome! I'll take a look tomorrow and hopefully push. And then you can move onto whatever you have planned next.. :) |
Thanks for all your great work on this, @uranusjr! |
@uranusjr By the way, it occurred to me while reviewing the patch that it would also be good to have, as a separate PR, a higher-level regression test for #5870. This would show, for example, that |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Fix #5870. I also took the chance to clean up repeated calls to
match.group(0)
in the function.