Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite egg_info_matches with canonicalize_name #5875

Merged
merged 4 commits into from
Oct 30, 2018

Conversation

uranusjr
Copy link
Member

Fix #5870. I also took the chance to clean up repeated calls to match.group(0) in the function.

return None
return full_match[(len(search_name) + 1):]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case to consider if the original match didn't occur at the beginning of the string? For example, it seems like a test case like ("pip-pip-18.0", "pip", "18.0") would yield version pip-18.0.

Copy link
Member Author

@uranusjr uranusjr Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python packaging tools (Distutils, maybe Setuptools? I am not sure) historically allow packages to specify an arbitrary string as version, so if a package named pip has a version pip-18.0, pip-pip-18.0 is exactly the sdist name to expect.

Links on a simple API page are guaranteed to come from a specific package (e.g. pip if the URL is /simple/pip/, or pip-pip if /simple/pip-pip/), so the name should match the beginning. Ambiguities occur for flat HTML pages (find-links values), but this has been the case since as far as I can recall, and there’s really no way to work around it either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what I meant to ask is if there are any real-world cases where _egg_info_re.search(egg_info) doesn't match the full provided egg_info string (e.g. starts after the beginning and/or ends before getting to the end) and whether it would make sense to add such test cases, because it looks like all of the test cases match the full string. (The example I provided wasn't such a case after all.)

Copy link
Member Author

@uranusjr uranusjr Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I never thought about that. Python package names (as defined in PEP 508) must return a match to

re.match(r'^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$', re.IGNORECASE)

Currently _egg_info_re is r'([a-z0-9_.]+)-([a-z0-9_.!+-]+)', which always matches the whole string for any validly-named links, so I don’t think we need to consider cases that it doesn’t match the whole thing.

@cjerdonek
Copy link
Member

I have a feeling like the implementation of egg_info_matches() can be made a lot simpler. For starters, it seems like it would be better from an overall code perspective if it's documented that the second argument has to be a canonicalized name. Then egg_info_matches() can be passed in search.canonical for the second argument in the one spot in the code that a non-None value is passed, instead of search.supplied.

I also feel like it should be discussed whether (1) the _egg_info_re regex has to be applied at all at the beginning of the function, and (2) if it's applied, whether it can be required that it has to match at the beginning and end. For example, if it has to match at the beginning, then perhaps the regex doesn't have to be applied at all at least in the case where the second argument is non-None, because startswith() can be used after canonicalizing the egg_info argument. Also, is there any case where it would actually be important to truncate possible invalid characters from the end of the egg_info value?

@uranusjr
Copy link
Member Author

uranusjr commented Oct 12, 2018

I can’t really guess what the original intention was, but right now the function is used in two similar, but kind of distinctive ways:

  1. In index.py, when parsing an HTML page, to get the version part of an sdist.
  2. In wheel.py, when building a wheel, to tell whether the source came from an sdist or local directory. (?? I am not really sure. git blame shows this line was from 08acb66)

I fully agree the function can (and should) be simplified for 1., but I am not confident about 2. Maybe it is a good idea to separate the two usages, and simplify 1. specifically for index.py.

@cjerdonek
Copy link
Member

cjerdonek commented Oct 12, 2018

Making two functions sounds like a good approach worth considering. For 2, search_name is always None, which would make that version of the function much simpler (and supports the idea of making two functions). Are you able to answer the questions I had in the first use case of index.py? That would reveal to what extent the implementation of 1. can be made simpler.

@uranusjr
Copy link
Member Author

For index.py specifically, I don’t think the regex needs to be applied at all. Assuming the input is always valid (should be), it would be enough to simply canonicalise, check with .startswith(), and slice out the version.

@cjerdonek
Copy link
Member

cjerdonek commented Oct 12, 2018

Excellent! That's what I was hoping. Thanks for thinking it through. Would you be willing to try out that PR?

@uranusjr
Copy link
Member Author

Sure, should I just overwrite this one or open a new one?

@cjerdonek
Copy link
Member

cjerdonek commented Oct 12, 2018 via email

@cjerdonek cjerdonek added type: refactor Refactoring code C: wheel The wheel format and 'pip wheel' command T: bugfix C: finder PackageFinder and index related code labels Oct 12, 2018
@cjerdonek cjerdonek added this to the 19.0 milestone Oct 12, 2018
@uranusjr
Copy link
Member Author

I found my implementation is flawed. There is no guarantee the canonicalised string is as long as the input, since the logic collapses continuous dashes, dots, and underlines. A (totally valid according to PEP 508) package name pip..pip would have an sdist named pip..pip-18.1.tar.gz, but that wouldn’t correctly return 18.1 if the specified name is pip-pip.

More elaborated parsing would be needed, but I am not sure if this is even worth the effort. I am certainly not aware of a real-world package having a name and canonicalised name with different lengths.

@cjerdonek
Copy link
Member

Which function version / code path are you saying is flawed: 1 or 2 in the previous comments? Also, I'm not sure I understand. Doesn't the canonicalization do the collapsing you're talking about, or is there separate collapsing going on somewhere else?

@cjerdonek
Copy link
Member

Also, was the flaw you're referring to present in the current implementation, too, or only in the approach as you tried implementing?

@@ -620,6 +620,17 @@ def supported(self, tags=None):
return bool(set(tags).intersection(self.file_tags))


def _is_like_egg_info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this better be named contains_egg_info, since the regex can be matching on any substring?


:param s: The string to parse. E.g. foo-2.1
"""
if not _egg_info_re.search(s):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I would reverse the True and False cases for simplicity, or even cast to boolean.

@cjerdonek
Copy link
Member

More elaborated parsing would be needed

So to paraphrase, you need a function which strips from the beginning of egg_info, the prefix that canonicalizes to the given canonical_name? It seems like that's pretty doable via different approaches (and could be tested separately).

@uranusjr
Copy link
Member Author

Sorry I didn’t describe it clearly. I was talking about 1. (in index.py), if the package name is something like pip..pip. Ideally, pip install pip-pip should match this, since pip-pip is the canonicalised form of pip..pip.

The old code fails to do so since it requires the argument to almost match the actual package name. The new code still won’t be able to match it, since the string slicing method does not work if the canonicalised name has a different length from the actual package name.

To overcome this, the implementation would need to know how to use pip-pip to strip pip..pip-18.0 into 18.0. I’m quite sure it should be doable, but haven’t figured out how to do this exactly (and also not sure if it is worthwhile).

@cjerdonek
Copy link
Member

The old code fails to do so since it requires the argument to almost match the actual package name.

Yes, I agree with your assessment. So this would be addressing yet another bug in the current implementation.

but haven’t figured out how to do this exactly

Roughly speaking, one option would be to look at successively larger prefixes (using an increasing index i), canonicalize the prefix, and then stop if it matches. Or you could start from the full string and go down. (With either approach, you would also have to decide how to handle the boundary case of e.g. multiple hyphens separating the project name portion from the version portion.) This could be optimized in various ways to avoid checking all i's. But even if not optimized at all, I think it would be fine (it would be an O(n) algorithm).

@uranusjr
Copy link
Member Author

I added an implementation using re.finditer(). Since canonicalize_name() only collapses [-_.] into -, this finds sequences that would have been collapsed/replaced by a dash, and skip them until the supposed end of the package name part. The next match should be the dash separating name and version.

Since this uses a duplicate regex pattern (and shares the canonicalisation logic) with canonicalize_name(), I feel more strongly this should probably be added to packaging. (pypa/packaging#136)

@cjerdonek
Copy link
Member

cjerdonek commented Oct 13, 2018

Since this uses a duplicate regex pattern

This is one reason why the approach I suggested I feel is simpler and less error prone. It depends only on canonicalize_name() and doesn't involve duplicating logic or creating new regexes (and so is more DRY). Either way, I think the implementation should be in its own function with separate tests to test various edge cases (e.g. combinations of characters at the boundary). The function can be called something like find_name_end_index(egg_info, canonical_name) and return the index of the end of the portion corresponding to the given name.

@cjerdonek
Copy link
Member

Also, with the approach that I described, if the index you start with is len(canonical_name) (the smallest index that can possibly work), then in the common case the match will occur with the first slice you look at.

@uranusjr
Copy link
Member Author

I thought about this again. Instead of checking the name first, and then try to find the dash, we can find all dashes first, and check which one of them actually yields the correct canonical name (by calling canonicalize_name repeatedly, as suggested).

I haven’t seen any packages with three or more dashes in their names, so practically this would be able to find the match within three tries (and always once with modern setuptools).

@cjerdonek
Copy link
Member

Okay, great. That sounds okay.

@cjerdonek
Copy link
Member

Oh, sorry, I only saw your comment -- not that you had already updated the PR. It's looking a lot better.

I was thinking more about your earlier implementation, and I think I might be okay with using the [-_.]+ regular expression (or similar) after all. It wouldn't be the worst thing in the world to duplicate that pattern. What do you think? (This would only mean changing _find_name_version_sep() in your latest iteration, of course.)

One of my thoughts on your earlier version is that I do think it can be made simpler. For example, I think there are approaches that don't involve calling the lower-level next() and handling StopIteration explicitly. For example, you could use enumerate() in conjunction with finditer(), or perhaps zip() in some way with a finite range of integers. Another option that occurred to me is using the re module's split(), and passing a maxsplit with the expected / proper count. The advantage with that approach is that you wouldn't have to deal with match objects directly. You could just look at split()'s return value, and use the length of the last string to compute the index return value of _find_name_version_sep() (after doing whatever checks you need to do).

("foo-_bar-1.0", "foo-bar", 8),
],
)
def test_find_name_version_sep(egg_info, canonical_name, expected):
Copy link
Member

@cjerdonek cjerdonek Oct 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add a test of how this function behaves if the version specifier is missing (e.g. "zope-interface") -- as well as for _egg_info_matches() below if not already present.

Another edge case I think would be worth adding is something like "zope.interface.-4.5.0" (where the prefix you check winds up ending with a non-hyphen separator character). Two dashes in the separator is another case: "zope.interface--4.5.0".

Copy link
Member Author

@uranusjr uranusjr Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to PEP 508, the package name must match ^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$, so zope.interface.-4.5.0 is invalid. I’ll add zope.interface--4.5.0 (the first dash would be the separator).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know in advance that all of the egg_info strings passed to the function will be valid? Or is there the possibility of mistaken input by the user? If the latter, I think it would be worth having some test cases to make sure e.g. the function doesn't treat invalid input as valid. (More generally, what guarantees do we have about what will be passed to the function?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The egg info strings come from an index or find links page. They can be invalid only if the page is formatted incorrectly. I think it is reasonable to either expect invalid inputs or not, depending on how you think pip should handle non-standard-conforming inputs.

Copy link
Member

@cjerdonek cjerdonek Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, what I was getting at is that I think it would be worth adding a few more test cases that came to mind, even if invalid -- not because the additional cases should be considered valid, but because it provides more visibility into how the two functions behave on various edge cases. If it's possible for these inputs to happen in real life, then that's even more reason.

For example, I noticed that none of the _find_name_version_sep() test cases hit the ValueError line (though some of the _egg_info_matches() test cases hit that code path by virtue of calling _find_name_version_sep()).

Here are additional test cases that come to mind (for both functions):

("zope.interface4.5.0", "zope-interface"),
("zope.interface.4.5.0", "zope-interface"),
("zope.interface.-4.5.0", "zope-interface"),
("zope.interface--4.5.0", "zope-interface"),
("zope.interface--", "zope-interface"),
("zope.interface-", "zope-interface"),
("zope.interface", "zope-interface"),

(You may want to put the test cases for _find_name_version_sep() that cause a ValueError into its own test function.)

8
"""
# Project name and version must be separated by one single dash. Find all
# occurances of dashes; if the string in front of it matches the canonical
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"occurrences" btw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch!

The idea is to use regex to scan for character sequences that would
be canonicalized into a dash, and skip the equal amount of sequences
to the given canonical name.
@uranusjr
Copy link
Member Author

Regarding _find_name_version_sep(), I am actually quite satisfied with the current implementation, so there is no need to duplicate the -_. regex. I considered using zip in the previous implementation (when I was working on it), but the end result was less straightforward than I’d like. Maybe the implementation in the future, especially now it’s properly unit-tested, but for now I feel the current one is actually the simplest to understand/

@uranusjr
Copy link
Member Author

Added test cases as suggested. I also modified the implementation a little to make the return value of _egg_info_matches more consistent (always return None if the version is not found, instead of an empty string sometimes).

@cjerdonek
Copy link
Member

Okay, awesome! I'll take a look tomorrow and hopefully push. And then you can move onto whatever you have planned next.. :)

@cjerdonek cjerdonek merged commit 5c93250 into pypa:master Oct 30, 2018
@cjerdonek
Copy link
Member

Thanks for all your great work on this, @uranusjr!

@cjerdonek
Copy link
Member

@uranusjr By the way, it occurred to me while reviewing the patch that it would also be good to have, as a separate PR, a higher-level regression test for #5870. This would show, for example, that _egg_info_matches() is being called correctly, and in particular that canonical is being supplied rather than supplied (pun intended). Would you be able to do / have any interest in doing that, too?

@uranusjr uranusjr deleted the egg-info-canonical branch November 1, 2018 08:33
@lock
Copy link

lock bot commented May 31, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label May 31, 2019
@lock lock bot locked as resolved and limited conversation to collaborators May 31, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation C: finder PackageFinder and index related code C: wheel The wheel format and 'pip wheel' command type: refactor Refactoring code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pip is unable to install source-only packages via canonical names (e.g. "zope-interface" for "zope.interface")
2 participants