[Python] Type checking support #32609
Comments
Antoine Pitrou / @pitrou: (also I'm assuming that Cython is compatible with type annotations, but I'm not 100% sure)
Jorrick Sleijster: Thanks for reaching out. I checked out the code indeed and I have to say, it's super clean and I was overwhelmed. I have not worked with Cython code bases before, but what if we just changed the signatures?
Let's take this line for example (line 96 in 9391951):
`def ls(self, path, detail=False):`
That would become:
`def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:`
Would that change require recompilation?
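For concreteness, a minimal sketch of what such an annotated method could look like (the class and types here are illustrative, not the actual PyArrow signature; Cython `def` methods do accept standard annotations, but editing a `.pyx` file still requires recompiling the extension):

```python
from typing import Any, Dict, List


class FileSystemSketch:
    # Hypothetical stand-in for the method discussed above.
    def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:
        ...
```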
Jorrick Sleijster: The difference lies in the fact that that ticket is focused on performing type checking on the PyArrow code base and ensuring all the types are valid inside the library. My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that use PyArrow, by adding type annotations to the functions defined inside the PyArrow codebase.
I think PySpark 3.2.2 was a nice example of having stubs: https://github.com/apache/spark/tree/v3.2.2/python/pyspark I'm pretty sure they created them manually though (and note: these are Java bindings and not C, but I don't think that makes a lot of difference in terms of stubs). However, they ditched the pyi files in their latest version. I think this is because a much larger percentage of their code is in Python compared to Java.
Joris Van den Bossche / @jorisvandenbossche: (but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)
Joris Van den Bossche / @jorisvandenbossche:
It's indeed not exactly the same. But in practice, I think both aspects are very much related and we could (should?) do those at the same time. If we start adding type annotations so that pyarrow can be used by other projects that are type-checked, it would be good that at the same time we also check that the type annotations we are adding are correct (although, based on my limited experience with this, just running mypy on the code base is always a bit limited, I suppose, as it doesn't guarantee the type annotations are actually correct; it only might find some incorrect ones).
Jorrick Sleijster: I think we will therefore have to wait (or take action ourselves upstream) until mypy or Cython implements decent support for Python stub generation. Hence, I think it's better to treat them separately for now and start off with stub generation, which can then later be replaced by a better implementation once available.
Jorrick Sleijster: As far as I can see from this page, these are only used for external projects checking their types against this project, and not for internal project checking.
Jorrick Sleijster: Looking at other projects, this seems like a very cumbersome and tedious thing to work on. For example, this one took a very long time to get right as well: pysam-developers/pysam#1008. I suppose the best option is to postpone until python/mypy#7542 is implemented.
Not planning on looking at this soon. But if someone does, one approach would be to use this branch of stubgen (python/mypy#13284), which includes the docstrings, and then fill in the types based on them. I've heard from @mariusvniekerk that this workflow helped when he generated type stubs for our Flight submodule for his own project. On detecting regressions in type stubs for Cython, I think we should be able to catch these as long as we are running type checking on our test suites.
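As a rough illustration of that workflow (the command and fragment below are assumptions; stubgen ships with mypy and can introspect compiled extension modules):

```python
# Generated with something like `stubgen -p pyarrow.fs -o stubs/` and then
# hand-edited; this fragment is an illustrative sketch, not the real output.
# With the docstring-preserving branch, the docstrings would be carried along
# into the .pyi, making it easier to fill in the types by hand.
from typing import Optional

class FileInfo:
    @property
    def path(self) -> str: ...
    @property
    def size(self) -> Optional[int]: ...
```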
Great to see the discussion here. I would be in favor of having type annotations. It will make the code more robust, and it also helps the user to see what arguments can be passed in. I'm working on #33974 and figured that type checking will also help to make sure that the docstrings are up to date (if you update a type, you should update the docstring as well).
Working with the pyarrow library, this really sticks out like a sore thumb. Good type stubs and docstrings are IMO just as valuable as API documentation (which the project does very well) because you don't need to leave your IDE and open 10 tabs to find the information. A couple thoughts:
@jorisvandenbossche I'm sorry to ping you about this but since we've interacted before you felt like the least worst person to personally annoy. Could we get some thoughts from the team on this? I really think it would not be that hard to get started and improve the typing stubs over time. Based on the 👍🏻 in my comment above there's a good amount of interest and it's easy pickings for 3rd party contributors.
I think a good place to start would be to choose a submodule with a decently limited API (maybe …).
I'm happy to try to make an initial PR. My approach would be to add a …
IMO the best way is to force typing into tests. That's also good dogfooding. But it can result in a lot of code churn (I imagine a lot of the tests would need to be rewritten). Type checkers do allow whitelisting files, so it would probably make sense to pick a very small test file to whitelist and either fix all the stubs it needs or add …
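A hedged sketch of what "typing in the tests" could look like (the test name is made up): if the stubs declared a wrong return type for `pa.table`, mypy would flag the annotated assignment.

```python
import pyarrow as pa


def test_table_from_dict_is_typed() -> None:
    # If pa.table() were stubbed with an incorrect return type, mypy would
    # report the assignment below, so running it over the test suite doubles
    # as a regression test for the stubs.
    table: pa.Table = pa.table({"a": [1, 2, 3]})
    assert table.num_rows == 3
```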
FWIW: I recently added a method to PySpark to return a DataFrame as a PyArrow Table (apache/spark#45481). Now I'm trying to add support for going in the other direction (apache/spark#46529) but I'm stymied by type checking problems, including the problem described at #24376 (comment). |
I've experimented a little with this in nanoarrow (where I'd at least like to get autocomplete for methods on some objects that have to be implemented in Cython) and found a few things that might be helpful:
I would lean towards some kind of programmatic approach where type hinting is specified in a docstring or something, to minimize the pain of keeping .pxi files synced. Off the top of my head, I would probably start with the mypy-generated stubs, parse them into an AST, and do some transformation to add type hints based on some pattern in the docstring or parsing of the argument list (Cython methods provide the file name and line number of the definition). That is almost certainly a can of worms (but so are any alternatives I know about). The PR adding very basic mypy stubs to nanoarrow is here: apache/arrow-nanoarrow#468
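Purely as a sketch of that idea (the helper and the mapping are hypothetical; a real implementation would derive the types from the docstrings rather than take them as an argument):

```python
import ast
from pathlib import Path


def annotate_returns(stub_path: str, return_types: dict[str, str]) -> str:
    """Fill in missing return annotations in a stubgen-generated .pyi file."""
    tree = ast.parse(Path(stub_path).read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.returns is None:
            if node.name in return_types:
                node.returns = ast.Name(id=return_types[node.name], ctx=ast.Load())
    return ast.unparse(tree)


# e.g. annotate_returns("stubs/pyarrow/fs.pyi", {"ls": "list[FileInfo]"})
```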
Looking through this discussion I don't think anyone is opposed to an initial PR and it would be welcome. Picking a small module would be good to prove out the actual workflow. I think the next step then is to provide typings for …
Seems like someone already did some groundwork a few years ago: https://github.com/zen-xu/pyarrow-stubs
I've used https://github.com/typeddjango/pytest-mypy-plugins before and it works rather well.
In the past few days, I rewrote pyarrow-stubs instead of generating them through …
@zen-xu thanks a lot for providing that package!
Yes, that's a good reminder. But just to be sure, my understanding from reading https://typing.readthedocs.io/en/latest/spec/distributing.html#import-resolution-ordering is that if we would start to add some type hints gradually and already add a `py.typed` marker, … But if some type checker would vendor those stubs (although no idea if that happens, and pyarrow is not in https://github.com/python/typeshed or https://github.com/microsoft/python-type-stubs), those would no longer get picked up?
Yes, agreed, and I will try to do such an initial PR next week to get things started. And I also assume that the priority would be to add return types (so that at least the use case of autocomplete in IDEs would work). Is that a correct analysis? However, the most relevant parts are of course in Cython. For those we need to add …
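To make the return-types-first idea concrete, a hypothetical stub fragment with loose parameters but filled-in return types (the module path and exact signatures are illustrative, not copied from the sources):

```python
# pyarrow/_fs.pyi (hypothetical) -- return types first, so IDE autocomplete
# and basic checking work even before the parameters are fully annotated.
class FileInfo: ...

class FileSystem:
    def get_file_info(self, paths_or_selector) -> list[FileInfo]: ...
    def create_dir(self, path: str, *, recursive: bool = ...) -> None: ...
```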
The end goal should be that we have rather complete type annotations for the users of pyarrow, i.e. also covering the large part of pyarrow that is implemented in Cython. Thinking through some options for how to get there, and how to maintain and distribute those type stubs:
Thoughts on this? Preferences? Other options you can think of? Personally, I think that if we have a decent solution for auto-generation that produces "good enough" stubs, that would be my preference.
Thank you for your recognition. I accidentally discovered that the package I created a long time ago in my free time could be helpful to you, and I'm happy to contribute code to your project.
Additionally, if pyarrow needs to maintain its own type annotations, I recommend using the wrapper pattern. For example, if there is a function …
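The original example is cut off, but one common reading of the wrapper pattern (an assumption on my part) is a thin, fully annotated pure-Python function that simply delegates to the compiled, untyped implementation, for instance:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def read_table(source: str, *, use_threads: bool = True) -> pa.Table:
    """Annotated facade; type checkers and IDEs see this signature, while the
    actual work still happens in the compiled pyarrow.parquet code."""
    return pq.read_table(source, use_threads=use_threads)
```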
mypy and static type checking
As of Python 3.6, it has been possible to introduce typing information in the code. This became immensely popular in a short period of time. Shortly after, the tool mypy arrived, and it has become the industry standard for static type checking in Python. It is able to check for invalid types very quickly, which makes it possible to run it as a pre-commit hook. It has raised many bugs that I did not see myself and has been a very valuable tool.
Now what does this mean for PyArrow?
When we run mypy on code that uses PyArrow, we get an error message like the following:
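With a recent mypy, the message looks roughly like this (the exact wording varies by version):

```
example.py:1: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
example.py:1: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
```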
More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker
You can solve this in three ways:
1. Ignore the message. This, however, will turn all types from PyArrow into `Any`, making it impossible to find user errors against the PyArrow library.
2. Create Python stub files. This used to be the standard, but it is no longer a popular option, because stubs are extra files next to the source code, while you can also inline the type hints with the code, which brings me to the third option.
3. Create a `py.typed` file and use inline type hints. This is the most popular option today because it requires no extra files (except for the `py.typed` file), keeps all the type hints with the code (like now in the documentation), and provides not only your users but also the developers of the library themselves with type hints (and hinting of issues inside your IDE).
My personal opinion already shines through the options: it is option 3, as this has quickly become the industry standard since its introduction.
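As a sketch of option 3 (the function and class below are made up, not actual PyArrow code): the package ships an empty `py.typed` marker file alongside its modules, and the hints live directly in the source, e.g.:

```python
from typing import Optional


class Dataset:
    """Hypothetical result type, standing in for a real PyArrow class."""


def open_dataset(path: str, *, batch_size: Optional[int] = None) -> Dataset:
    # With an (empty) py.typed file shipped in the package, type checkers
    # pick these inline annotations up directly from the installed wheel.
    return Dataset()
```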
What should we do?
I'd very much like to work on this; however, I don't feel like wasting time. Therefore, I am raising this ticket to see if this has been considered before or if we just didn't get to it yet.
I'd like to open the discussion here:
Do you agree with option 3 for the type hints?
Should we remove the type annotations from the docstrings, given they will be in the function signatures? Or should we keep them and also specify them in the code, which would duplicate the information?
Reporter: Jorrick Sleijster
Note: This issue was originally created as ARROW-17335. Please see the migration documentation for further details.