Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get correct module info for zipimport modules (to then correctly classify them). #47

Merged
merged 1 commit into from
Feb 27, 2020

Conversation

shea-parkes
Copy link
Contributor

@shea-parkes shea-parkes commented Feb 26, 2020

Do this because there are scenarios where zipped up packages end up being imported, and the import classification logic was marking those as part of the standard library.

Trying to do full TDD. So opening with just a commit to add tests to hopefully get a nice CI failure first.

Resolves #46

@shea-parkes
Copy link
Contributor Author

Looks like the CI is running back on Python 2.7 as well. I likely did something in the tests that isn't Python2.7 compatible (likely assuming path-likes). I'll see if I can decipher the CI return before trying to get a Python2.7 environment setup locally.

@shea-parkes
Copy link
Contributor Author

There we go; CI looks happy now. How's this patch @asottile ? I tried to keep it small and to the point.

I left the commit history above so you could see the TDD evidence. Happy to squash if you'd prefer. I also left you push rights as well to the branch I believe.

Thanks for considering (and for making a very useful tool).

Copy link
Owner

@asottile asottile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might get better mileage if you run reorder-python-imports without access to your site-packages?

aspy/refactor_imports/classify.py Outdated Show resolved Hide resolved
aspy/refactor_imports/classify.py Outdated Show resolved Hide resolved
tests/classify_test.py Outdated Show resolved Hide resolved
@shea-parkes
Copy link
Contributor Author

Well that is annoying. Coverage complains about skipping that assert for Python 2.7. I've got about 15 min left to try and get a Python 2.7 environment up to see about making that test pass for Python 2.7. I'll see what I can do.

@shea-parkes
Copy link
Contributor Author

I got the Python2.7 environment stood up, but I was having some trouble even getting into a debugger with the fzip module ~findable.

While editing, I noticed how you did Python version-specific coverage disables elsewhere, so I went ahead and just used one of those in the tests file. CI is happy now, but I understand you may not be.

I've got to go do my day job now, but happy to discuss further or try other things if you like.

@asottile
Copy link
Owner

just because I'm curious (and this patch has gotten way complicated) -- what happens if you invoke reorder-python-imports without PYTHONPATH set? it should do the right thing in this case (identify that module as third party)

@shea-parkes
Copy link
Contributor Author

Well, it would be an un-importable module, so I believe this would be the result:

python -i .\aspy\refactor_imports\classify.py
>>> classify_import('notamodule')
'THIRD_PARTY'
>>>

I think you'd always want things importable in a pre-commit/CI scenario though. Numerous other dev tools need to do module inspection.

@shea-parkes
Copy link
Contributor Author

And as for the complexity of this pull request, I can pull back out the embedded code path (and tests) to keep it simpler if you prefer.

@asottile
Copy link
Owner

actually it's designed for the opposite, pre-commit creates isolated environments :)

@shea-parkes
Copy link
Contributor Author

Fair enough on pre-commit, although I am a bit confused on how you would do a pylint score based pre-commit. I see a few solutions out on github (e.g. https://github.com/Botpy/pre-commit-pylint ). I'll have to read a bit more on those; I would naively expect them to vomit import-error warnings. I'm guessing they just expect you to disable those or the like.

In our own pre-commit usage we aspire to cover the scenario where the PYTHONPATH has been set and where it has not. It all matches except for this oddball zipimport that is part of pyspark. I should go read the default pre-commit hook and see if it wipes the PythonPath...

Separate from pre-release, we're using this solution (and black and pylint) in "check" mode in CI as well.

I still think having zipimport support would be a positive here, but I respect your right to decline.

@asottile
Copy link
Owner

yeah pylint intentionally doesn't really fit into the model as it does dynamic analysis -- there's a bit of nudge in that direction here and here in the docs -- as well as a large write up here

@shea-parkes
Copy link
Contributor Author

Thanks for that context; I definitely missed those hints in my read-thrus thus far. We've been using various CI checks internally for years, and we're just easing into some pre-commit to compliment (and get feedback quicker), so there are definitely some quirks in our knowledge to work through. For pre-commit we're just using this and black right now. CI is still the only thing running pylint (and py.test), but we did get this and black added to CI concurrently so we get dev-side and server-side consistency in our checks.

So what additional information can I provide to help you evaluate this proposed change? Happy to just give you some space and let you think about it for a day or two.

@shea-parkes
Copy link
Contributor Author

Also, after a bit more thought, I think this issue could be viewed as an accidental dynamic inspection bleed into what is meant to be a static inspection tool.

@asottile
Copy link
Owner

yeah the main thing is it's not specific to zipimports -- any special importer / sys.path / PYTHONPATH manipulation will end up with the same problem:

$ mkdir -p x && touch x/t.py
$ python3 -i aspy/refactor_imports/classify.py
>>> import sys
>>> sys.path.append('x')
>>> classify_import('t')
'BUILTIN'

@shea-parkes
Copy link
Contributor Author

Well, in our case, it's specific to zipimports. I'll do some more testing here in a bit, but other than zipimport we're getting the same sort with and without pythonpath set.

@asottile
Copy link
Owner

hmmm, is there an open source reproduction I can look at, I'd be interested in the not-PYTHONPATH case since it should be found = False I would think 🤔

@shea-parkes
Copy link
Contributor Author

So when you say "not-PYTHONPATH" case, you mean an example of doing sys.path manipulation at runtime? I'll see about getting one together, although I think that's much less realistic than a PYTHONPATH example.

In the meantime, here's a PythonPath example (using an internal library):

First, before setting PythonPath:

 python -i .\aspy\refactor_imports\classify.py
>>> classify_import('streaming_pandas')
'THIRD_PARTY'
>>> import streaming_pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'streaming_pandas'

After setting PythonPath:

$env:pythonpath = 'C:\Users\shea.parkes\repos\streaming-pandas\python'
 python -i .\aspy\refactor_imports\classify.py
>>> classify_import('streaming_pandas')
'THIRD_PARTY'
>>> import streaming_pandas
>>> exit()

It identifies streaming_pandas as THIRD_PARTY whether it is importable or not.

@shea-parkes
Copy link
Contributor Author

Here's the PythonPath example with https://github.com/aio-libs/yarl (import failed due to multi-dict, but you can see it at least found the library).

(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> python -i .\aspy\refactor_imports\classify.py
>>> classify_import('yarl')
'THIRD_PARTY'
>>> import yarl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'yarl'
>>> exit()
(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> $env:pythonpath = 'C:\Users\shea.parkes\repos\yarl'
(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> python -i .\aspy\refactor_imports\classify.py
>>> classify_import('yarl')
'THIRD_PARTY'
>>> import yarl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\shea.parkes\repos\yarl\yarl\__init__.py", line 7, in <module>
    from multidict import MultiDict, MultiDictProxy
ModuleNotFoundError: No module named 'multidict'
>>>

Before and after setting PythonPath it was identified as "THIRD_PARTY".

Now I'll see about doing a sys.path amendment version and see if the answer changes.

@shea-parkes
Copy link
Contributor Author

And here's showing that modifying sys.path results in "BUILTIN":

(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> python -i .\aspy\refactor_imports\classify.py
>>> import sys
>>> sys.path.append('c:/users/shea.parkes/repos/yarl')
>>> classify_import('yarl')
'BUILTIN'
>>>

@shea-parkes
Copy link
Contributor Author

I'll deconstruct the above scenarios a bit more in another comment or two and call it a night then.

@shea-parkes
Copy link
Contributor Author

Alright, one step up the chain, here are the outputs of _get_module_info() for each of the three scenarios:

(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> python -i .\aspy\refactor_imports\classify.py
>>> _get_module_info('yarl', ('.', ))
(False, 'yarl.notlocal', False)
>>> exit()

(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> python -i .\aspy\refactor_imports\classify.py
>>> sys.path.append('c:/users/shea.parkes/repos/yarl')
>>> _get_module_info('yarl', ('.', ))
(True, 'c:/users/shea.parkes/repos/yarl\\yarl', False)
>>> exit()

(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> $env:pythonpath = 'C:\Users\shea.parkes\repos\yarl'
(base) C:\Users\shea.parkes\repos\aspy.refactor_imports [master ≡]> python -i .\aspy\refactor_imports\classify.py
>>> _get_module_info('yarl', ('.', ))
(True, 'C:\\Users\\shea.parkes\\repos\\yarl\\yarl', False)
>>> exit()

Actually, other than some path formatting the sys.path and $pythonpath scenarios seem to return the same from _get_module_info(), so I'm not actually sure why they return different from classify_imports(). I'll dig a bit further into that.

@shea-parkes
Copy link
Contributor Author

Alright, so there is an explicit _due_to_pythonpath check that explains the different results when setting $pythonpath versus manipulating sys.path I was showing in the prior comment.

After digging through all this, I still think it's a reasonable change to improve the detection of zipimports. With the changes in this pull request the results of _get_module_info() are more "truthful" when encountering a zipimport, so downstream logic can be reasoned about better. Once the _get_module_info() reports back a useful module_path, then the existing logic in classify_import and _due_to_pythonpath correctly classifies between BUILTIN and THIRD_PARTY. Maintaining a coherent program state under more scenarios is a positive change to me.

Is there any further information I can provide or tests/scenarios you'd like me to explore?

@asottile
Copy link
Owner

hmmmm! that's a good point -- maybe if we return True, spec.origin, False in the case we detect zipimporter then it would just work? since it would get caught by _due_to_pythonpath

@shea-parkes
Copy link
Contributor Author

I think that's a reasonable solution. I'll go ahead and add an "embedded test" that puts a zipimport on sys.path without putting it on pythonpath. It should return BUILTIN in that scenario. Lemme see what I can do here.

@shea-parkes
Copy link
Contributor Author

Alright, I believe I got that switch made. I did enough research I believe I can do the zipimport detection in the python2.7 version of _get_module_info, but I need to take an ~hour pause. I'll see about doing that during my commute.

@shea-parkes
Copy link
Contributor Author

Here's a stackoverflow post with the general pattern for find_module and zipimports to work together in Python 2.7: https://stackoverflow.com/questions/28962344/imp-find-module-which-supports-zipped-eggs

I got it working in a REPL. I'll see about getting it into classify.py.

@asottile
Copy link
Owner

would it make it easier if I droped py27? I was planning to do that in the next few weeks anyway

@shea-parkes
Copy link
Contributor Author

Hah, well, I just spent the hour to wrangle Python 2.7, so it's a moot point to me now. But for future devs sake I'd suggest dropping it.

The difference in the import engines between 2.7 and 3.5+ is indeed jarring. Let's see if CI is happy with the commit I just pushed.

@shea-parkes
Copy link
Contributor Author

Alright, CI seems happy. There's parity between Python 2.7 and 3 (although it took far more code lines for Python 2.7).

The change is now limited to detecting zipimport objects in _get_module_info and adds a couple tests to show common uses of zipimports (e.g. py4j with Apache Spark and stdlib for embedded Python).

I personally think this is good to go again. Let me know if you have any more requests/comments/concerns.

pass
# Sweep for zipimports on sys.path
for path_entry in sys.path:
path_entry_ext = os.path.splitext(path_entry)[-1].lower()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that zipimport.zipimporter would likely validate that it's a zip file, so we could switch this to "Easier to Ask Forgiveness". Although, for performance reasons, likely faster to check the string suffix since I'm assuming zipimport.zipimporter will do some file I/O before failing. Let me know if you'd like to remove this to be more pythonic though.

@shea-parkes shea-parkes changed the title Classify "zipimports" as Third Party Libraries. Get correct module info for zipimport modules (to then correctly classify them). Feb 27, 2020
@asottile
Copy link
Owner

I'll rebase this for ya since #48 is going to conflict -- sorry about that

@shea-parkes
Copy link
Contributor Author

Thanks.

Copy link
Owner

@asottile asottile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks again for your EXTREME patience!

@asottile asottile merged commit 9140145 into asottile:master Feb 27, 2020
@shea-parkes
Copy link
Contributor Author

Your rebase and cleanup looks nice and clean. Thanks for being patient with me as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aspy.refactor_imports.classify - _get_module_info fails for zipped third party libraries
2 participants