Rewrite egg_info_matches with canonicalize_name #5875

uranusjr · 2018-10-11T08:53:36Z

Fix #5870. I also took the chance to clean up repeated calls to match.group(0) in the function.

cjerdonek · 2018-10-11T10:32:28Z

src/pip/_internal/index.py

        return None
+    return full_match[(len(search_name) + 1):]


Is there a case to consider if the original match didn't occur at the beginning of the string? For example, it seems like a test case like ("pip-pip-18.0", "pip", "18.0") would yield version pip-18.0.

Python packaging tools (Distutils, maybe Setuptools? I am not sure) historically allow packages to specify an arbitrary string as version, so if a package named pip has a version pip-18.0, pip-pip-18.0 is exactly the sdist name to expect.

Links on a simple API page are guaranteed to come from a specific package (e.g. pip if the URL is /simple/pip/, or pip-pip if /simple/pip-pip/), so the name should match the beginning. Ambiguities occur for flat HTML pages (find-links values), but this has been the case since as far as I can recall, and there’s really no way to work around it either.

I guess what I meant to ask is if there are any real-world cases where _egg_info_re.search(egg_info) doesn't match the full provided egg_info string (e.g. starts after the beginning and/or ends before getting to the end) and whether it would make sense to add such test cases, because it looks like all of the test cases match the full string. (The example I provided wasn't such a case after all.)

Hmm, I never thought about that. Python package names (as defined in PEP 508) must return a match to

re.match(r'^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$', re.IGNORECASE)

Currently _egg_info_re is r'([a-z0-9_.]+)-([a-z0-9_.!+-]+)', which always matches the whole string for any validly-named links, so I don’t think we need to consider cases that it doesn’t match the whole thing.

cjerdonek · 2018-10-12T08:40:04Z

I have a feeling like the implementation of egg_info_matches() can be made a lot simpler. For starters, it seems like it would be better from an overall code perspective if it's documented that the second argument has to be a canonicalized name. Then egg_info_matches() can be passed in search.canonical for the second argument in the one spot in the code that a non-None value is passed, instead of search.supplied.

I also feel like it should be discussed whether (1) the _egg_info_re regex has to be applied at all at the beginning of the function, and (2) if it's applied, whether it can be required that it has to match at the beginning and end. For example, if it has to match at the beginning, then perhaps the regex doesn't have to be applied at all at least in the case where the second argument is non-None, because startswith() can be used after canonicalizing the egg_info argument. Also, is there any case where it would actually be important to truncate possible invalid characters from the end of the egg_info value?

uranusjr · 2018-10-12T09:21:09Z

I can’t really guess what the original intention was, but right now the function is used in two similar, but kind of distinctive ways:

In index.py, when parsing an HTML page, to get the version part of an sdist.
In wheel.py, when building a wheel, to tell whether the source came from an sdist or local directory. (?? I am not really sure. git blame shows this line was from 08acb66)

I fully agree the function can (and should) be simplified for 1., but I am not confident about 2. Maybe it is a good idea to separate the two usages, and simplify 1. specifically for index.py.

cjerdonek · 2018-10-12T09:29:22Z

Making two functions sounds like a good approach worth considering. For 2, search_name is always None, which would make that version of the function much simpler (and supports the idea of making two functions). Are you able to answer the questions I had in the first use case of index.py? That would reveal to what extent the implementation of 1. can be made simpler.

uranusjr · 2018-10-12T09:40:34Z

For index.py specifically, I don’t think the regex needs to be applied at all. Assuming the input is always valid (should be), it would be enough to simply canonicalise, check with .startswith(), and slice out the version.

cjerdonek · 2018-10-12T09:44:38Z

Excellent! That's what I was hoping. Thanks for thinking it through. Would you be willing to try out that PR?

uranusjr · 2018-10-12T11:04:31Z

Sure, should I just overwrite this one or open a new one?

cjerdonek · 2018-10-12T17:35:39Z

I would just overwrite this one, and save it locally for now. But if you’d rather keep this public, you could close this PR and open a new one. It’s up to you.

…

On Fri, Oct 12, 2018 at 4:04 AM Tzu-ping Chung ***@***.***> wrote: Sure, should I just overwrite this one or open a new one? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#5875 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAVt7nrnFNgv2Rl5STNh6dLz8GvhvgHcks5ukHdIgaJpZM4XXEwP> .

uranusjr · 2018-10-12T19:38:12Z

I found my implementation is flawed. There is no guarantee the canonicalised string is as long as the input, since the logic collapses continuous dashes, dots, and underlines. A (totally valid according to PEP 508) package name pip..pip would have an sdist named pip..pip-18.1.tar.gz, but that wouldn’t correctly return 18.1 if the specified name is pip-pip.

More elaborated parsing would be needed, but I am not sure if this is even worth the effort. I am certainly not aware of a real-world package having a name and canonicalised name with different lengths.

cjerdonek · 2018-10-12T19:52:34Z

Which function version / code path are you saying is flawed: 1 or 2 in the previous comments? Also, I'm not sure I understand. Doesn't the canonicalization do the collapsing you're talking about, or is there separate collapsing going on somewhere else?

cjerdonek · 2018-10-12T19:55:48Z

Also, was the flaw you're referring to present in the current implementation, too, or only in the approach as you tried implementing?

cjerdonek · 2018-10-12T19:58:04Z

src/pip/_internal/wheel.py

@@ -620,6 +620,17 @@ def supported(self, tags=None):
        return bool(set(tags).intersection(self.file_tags))


+def _is_like_egg_info(


Should this better be named contains_egg_info, since the regex can be matching on any substring?

cjerdonek · 2018-10-12T19:59:50Z

src/pip/_internal/wheel.py

+
+    :param s: The string to parse. E.g. foo-2.1
+    """
+    if not _egg_info_re.search(s):


Also, I would reverse the True and False cases for simplicity, or even cast to boolean.

cjerdonek · 2018-10-12T21:44:18Z

More elaborated parsing would be needed

So to paraphrase, you need a function which strips from the beginning of egg_info, the prefix that canonicalizes to the given canonical_name? It seems like that's pretty doable via different approaches (and could be tested separately).

uranusjr · 2018-10-13T09:06:06Z

Sorry I didn’t describe it clearly. I was talking about 1. (in index.py), if the package name is something like pip..pip. Ideally, pip install pip-pip should match this, since pip-pip is the canonicalised form of pip..pip.

The old code fails to do so since it requires the argument to almost match the actual package name. The new code still won’t be able to match it, since the string slicing method does not work if the canonicalised name has a different length from the actual package name.

To overcome this, the implementation would need to know how to use pip-pip to strip pip..pip-18.0 into 18.0. I’m quite sure it should be doable, but haven’t figured out how to do this exactly (and also not sure if it is worthwhile).

cjerdonek · 2018-10-13T09:32:45Z

The old code fails to do so since it requires the argument to almost match the actual package name.

Yes, I agree with your assessment. So this would be addressing yet another bug in the current implementation.

but haven’t figured out how to do this exactly

Roughly speaking, one option would be to look at successively larger prefixes (using an increasing index i), canonicalize the prefix, and then stop if it matches. Or you could start from the full string and go down. (With either approach, you would also have to decide how to handle the boundary case of e.g. multiple hyphens separating the project name portion from the version portion.) This could be optimized in various ways to avoid checking all i's. But even if not optimized at all, I think it would be fine (it would be an O(n) algorithm).

uranusjr · 2018-10-13T12:35:15Z

I added an implementation using re.finditer(). Since canonicalize_name() only collapses [-_.] into -, this finds sequences that would have been collapsed/replaced by a dash, and skip them until the supposed end of the package name part. The next match should be the dash separating name and version.

Since this uses a duplicate regex pattern (and shares the canonicalisation logic) with canonicalize_name(), I feel more strongly this should probably be added to packaging. (pypa/packaging#136)

cjerdonek · 2018-10-13T22:43:08Z

Since this uses a duplicate regex pattern

This is one reason why the approach I suggested I feel is simpler and less error prone. It depends only on canonicalize_name() and doesn't involve duplicating logic or creating new regexes (and so is more DRY). Either way, I think the implementation should be in its own function with separate tests to test various edge cases (e.g. combinations of characters at the boundary). The function can be called something like find_name_end_index(egg_info, canonical_name) and return the index of the end of the portion corresponding to the given name.

cjerdonek · 2018-10-14T01:51:36Z

Also, with the approach that I described, if the index you start with is len(canonical_name) (the smallest index that can possibly work), then in the common case the match will occur with the first slice you look at.

uranusjr · 2018-10-15T08:07:36Z

I thought about this again. Instead of checking the name first, and then try to find the dash, we can find all dashes first, and check which one of them actually yields the correct canonical name (by calling canonicalize_name repeatedly, as suggested).

I haven’t seen any packages with three or more dashes in their names, so practically this would be able to find the match within three tries (and always once with modern setuptools).

cjerdonek · 2018-10-17T21:39:01Z

Okay, great. That sounds okay.

cjerdonek · 2018-10-20T21:15:23Z

Oh, sorry, I only saw your comment -- not that you had already updated the PR. It's looking a lot better.

I was thinking more about your earlier implementation, and I think I might be okay with using the [-_.]+ regular expression (or similar) after all. It wouldn't be the worst thing in the world to duplicate that pattern. What do you think? (This would only mean changing _find_name_version_sep() in your latest iteration, of course.)

One of my thoughts on your earlier version is that I do think it can be made simpler. For example, I think there are approaches that don't involve calling the lower-level next() and handling StopIteration explicitly. For example, you could use enumerate() in conjunction with finditer(), or perhaps zip() in some way with a finite range of integers. Another option that occurred to me is using the re module's split(), and passing a maxsplit with the expected / proper count. The advantage with that approach is that you wouldn't have to deal with match objects directly. You could just look at split()'s return value, and use the length of the last string to compute the index return value of _find_name_version_sep() (after doing whatever checks you need to do).

cjerdonek · 2018-10-20T22:21:35Z

tests/unit/test_index.py

+        ("foo-_bar-1.0", "foo-bar", 8),
+    ],
+)
+def test_find_name_version_sep(egg_info, canonical_name, expected):


I would also add a test of how this function behaves if the version specifier is missing (e.g. "zope-interface") -- as well as for _egg_info_matches() below if not already present.

Another edge case I think would be worth adding is something like "zope.interface.-4.5.0" (where the prefix you check winds up ending with a non-hyphen separator character). Two dashes in the separator is another case: "zope.interface--4.5.0".

According to PEP 508, the package name must match ^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$, so zope.interface.-4.5.0 is invalid. I’ll add zope.interface--4.5.0 (the first dash would be the separator).

Do we know in advance that all of the egg_info strings passed to the function will be valid? Or is there the possibility of mistaken input by the user? If the latter, I think it would be worth having some test cases to make sure e.g. the function doesn't treat invalid input as valid. (More generally, what guarantees do we have about what will be passed to the function?)

The egg info strings come from an index or find links page. They can be invalid only if the page is formatted incorrectly. I think it is reasonable to either expect invalid inputs or not, depending on how you think pip should handle non-standard-conforming inputs.

Okay, what I was getting at is that I think it would be worth adding a few more test cases that came to mind, even if invalid -- not because the additional cases should be considered valid, but because it provides more visibility into how the two functions behave on various edge cases. If it's possible for these inputs to happen in real life, then that's even more reason.

For example, I noticed that none of the _find_name_version_sep() test cases hit the ValueError line (though some of the _egg_info_matches() test cases hit that code path by virtue of calling _find_name_version_sep()).

Here are additional test cases that come to mind (for both functions):

("zope.interface4.5.0", "zope-interface"), ("zope.interface.4.5.0", "zope-interface"), ("zope.interface.-4.5.0", "zope-interface"), ("zope.interface--4.5.0", "zope-interface"), ("zope.interface--", "zope-interface"), ("zope.interface-", "zope-interface"), ("zope.interface", "zope-interface"),

(You may want to put the test cases for _find_name_version_sep() that cause a ValueError into its own test function.)

cjerdonek · 2018-10-21T08:26:17Z

src/pip/_internal/index.py

+    8
+    """
+    # Project name and version must be separated by one single dash. Find all
+    # occurances of dashes; if the string in front of it matches the canonical


"occurrences" btw

Thanks for the catch!

The idea is to use regex to scan for character sequences that would be canonicalized into a dash, and skip the equal amount of sequences to the given canonical name.

uranusjr · 2018-10-23T08:00:08Z

Regarding _find_name_version_sep(), I am actually quite satisfied with the current implementation, so there is no need to duplicate the -_. regex. I considered using zip in the previous implementation (when I was working on it), but the end result was less straightforward than I’d like. Maybe the implementation in the future, especially now it’s properly unit-tested, but for now I feel the current one is actually the simplest to understand/

uranusjr · 2018-10-29T07:51:26Z

Added test cases as suggested. I also modified the implementation a little to make the return value of _egg_info_matches more consistent (always return None if the version is not found, instead of an empty string sometimes).

cjerdonek · 2018-10-29T07:57:33Z

Okay, awesome! I'll take a look tomorrow and hopefully push. And then you can move onto whatever you have planned next.. :)

cjerdonek · 2018-10-30T08:30:57Z

Thanks for all your great work on this, @uranusjr!

cjerdonek · 2018-10-30T08:55:06Z

@uranusjr By the way, it occurred to me while reviewing the patch that it would also be good to have, as a separate PR, a higher-level regression test for #5870. This would show, for example, that _egg_info_matches() is being called correctly, and in particular that canonical is being supplied rather than supplied (pun intended). Would you be able to do / have any interest in doing that, too?

lock · 2019-05-31T14:48:42Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

uranusjr force-pushed the egg-info-canonical branch from 33fea2e to 8673ab4 Compare October 11, 2018 08:59

cjerdonek reviewed Oct 11, 2018

View reviewed changes

cjerdonek mentioned this pull request Oct 11, 2018

Correctly normalize package names. Fixes #2956 and #3002 pypa/pipenv#3008

Closed

cjerdonek added type: refactor Refactoring code C: wheel The wheel format and 'pip wheel' command T: bugfix C: finder PackageFinder and index related code labels Oct 12, 2018

cjerdonek added this to the 19.0 milestone Oct 12, 2018

uranusjr force-pushed the egg-info-canonical branch from 8673ab4 to b5e8dd9 Compare October 12, 2018 19:30

cjerdonek reviewed Oct 12, 2018

View reviewed changes

uranusjr force-pushed the egg-info-canonical branch from b5e8dd9 to 557b552 Compare October 13, 2018 08:47

uranusjr force-pushed the egg-info-canonical branch from 557b552 to f0f715c Compare October 13, 2018 09:12

uranusjr force-pushed the egg-info-canonical branch from dc1fc99 to 57777c7 Compare October 13, 2018 12:36

uranusjr mentioned this pull request Oct 14, 2018

Add file name parsing logic pypa/packaging#136

Closed

uranusjr force-pushed the egg-info-canonical branch from 57777c7 to 9801eb7 Compare October 15, 2018 08:00

cjerdonek reviewed Oct 20, 2018

View reviewed changes

cjerdonek reviewed Oct 21, 2018

View reviewed changes

Split, simplify, and fix egg_info_matches

3368819

uranusjr force-pushed the egg-info-canonical branch from c4fba95 to 1ead7da Compare October 22, 2018 08:47

uranusjr added 2 commits October 22, 2018 16:48

More robust egg-info/name match

ad9cbf5

The idea is to use regex to scan for character sequences that would be canonicalized into a dash, and skip the equal amount of sequences to the given canonical name.

Add (invalid) edge case inputs to tests

5d0a576

uranusjr force-pushed the egg-info-canonical branch from 1ead7da to 5d0a576 Compare October 22, 2018 08:48

More test cases

033a47c

cjerdonek merged commit 5c93250 into pypa:master Oct 30, 2018

uranusjr deleted the egg-info-canonical branch November 1, 2018 08:33

lock bot added the auto-locked Outdated issues that have been locked by automation label May 31, 2019

lock bot locked as resolved and limited conversation to collaborators May 31, 2019

		@@ -620,6 +620,17 @@ def supported(self, tags=None):
		return bool(set(tags).intersection(self.file_tags))


		def _is_like_egg_info(

Rewrite egg_info_matches with canonicalize_name #5875

Rewrite egg_info_matches with canonicalize_name #5875

Conversation

uranusjr commented Oct 11, 2018

Choose a reason for hiding this comment

uranusjr Oct 11, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

uranusjr Oct 11, 2018 • edited Loading

Choose a reason for hiding this comment

cjerdonek commented Oct 12, 2018

uranusjr commented Oct 12, 2018 • edited Loading

cjerdonek commented Oct 12, 2018 • edited Loading

uranusjr commented Oct 12, 2018

cjerdonek commented Oct 12, 2018 • edited Loading

uranusjr commented Oct 12, 2018

cjerdonek commented Oct 12, 2018 via email

uranusjr commented Oct 12, 2018

cjerdonek commented Oct 12, 2018

cjerdonek commented Oct 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjerdonek commented Oct 12, 2018

uranusjr commented Oct 13, 2018

cjerdonek commented Oct 13, 2018

uranusjr commented Oct 13, 2018

cjerdonek commented Oct 13, 2018 • edited Loading

cjerdonek commented Oct 14, 2018

uranusjr commented Oct 15, 2018

cjerdonek commented Oct 17, 2018

cjerdonek commented Oct 20, 2018

cjerdonek Oct 20, 2018 • edited Loading

Choose a reason for hiding this comment

uranusjr Oct 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjerdonek Oct 25, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

uranusjr commented Oct 23, 2018

uranusjr commented Oct 29, 2018

cjerdonek commented Oct 29, 2018

cjerdonek commented Oct 30, 2018

cjerdonek commented Oct 30, 2018

lock bot commented May 31, 2019

uranusjr Oct 11, 2018 •

edited

Loading

uranusjr Oct 11, 2018 •

edited

Loading

uranusjr commented Oct 12, 2018 •

edited

Loading

cjerdonek commented Oct 12, 2018 •

edited

Loading

cjerdonek commented Oct 12, 2018 •

edited

Loading

cjerdonek commented Oct 13, 2018 •

edited

Loading

cjerdonek Oct 20, 2018 •

edited

Loading

uranusjr Oct 22, 2018 •

edited

Loading

cjerdonek Oct 25, 2018 •

edited

Loading