-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use strict type hinting for namedarray #8241
Conversation
In reviewing pydata#8241, I realize that we actually want `check-untyped-defs`, which is a bit less strict, but lets us add some more modules on. I did have to add a couple of ignores, think it's a reasonable tradeoff to add big modules like `computation` on. Errors with this enabled are actual type errors, not just `mypy` pedanticness, so would be good to get as much as possible into this list...
Current errors:
|
for more information, see https://pre-commit.ci
…xarray into disallow_untyped_defs
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…xarray into disallow_untyped_defs
for more information, see https://pre-commit.ci
…xarray into disallow_untyped_defs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mypy and (most of) CI is passing now! I'm stopping here.
fill_value: np.typing.ArrayLike | Default = _default, | ||
) -> Self: | ||
""" | ||
use sparse-array as backend. | ||
""" | ||
import sparse | ||
|
||
# TODO: what to do if dask-backended? | ||
if fill_value is dtypes.NA: | ||
if fill_value is _default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype.NA was only used as a default value. So why not use the _default we already have?
class CustomArray(CustomArrayBase): | ||
def __array__(self) -> np.ndarray[Any, np.dtype[np.generic]]: | ||
return np.array(self.array) | ||
|
||
class CustomArrayIndexable(CustomArrayBase, xr.core.indexing.ExplicitlyIndexed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExplicitlyIndexed has a non-standard __array__
method. This made inheriting from it difficult.
def is_chunked_duck_array( | ||
x: T_DuckArray, | ||
) -> TypeGuard[_ChunkedArray[np.dtype[np.generic]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using _ChunkedArray here, though I think it should be the TypeVar T_ChunkedArray.
I can see you're point, and I gave it a try in the last commits but I couldn't figure it out. Since I'm running out of steam with this PR, I think I'll merge this as is. Figuring out the exact grammar on |
* upstream/main: (46 commits) xfail flaky test (pydata#8299) Most of mypy 1.6.0 passing (pydata#8296) Rename `reset_encoding` to `drop_encoding` (pydata#8287) Enable `.rolling_exp` to work on dask arrays (pydata#8284) Fix `GroupBy` import (pydata#8286) Ask bug reporters to confirm they're using a recent version of xarray (pydata#8283) Add pyright type checker (pydata#8279) Improved typing of align & broadcast (pydata#8234) Update ci-additional.yaml (pydata#8280) Fix time encoding regression (pydata#8272) Allow a function in `.sortby` method (pydata#8273) make more args kw only (except 'dim') (pydata#6403) Use duck array ops in more places (pydata#8267) Don't raise rename warning if it is a no operation (pydata#8266) Mandate kwargs on `to_zarr` (pydata#8257) copy the `dtypes` module to the `namedarray` package. (pydata#8250) Add xarray-regrid to ecosystem.rst (pydata#8270) Use strict type hinting for namedarray (pydata#8241) Update type annotation for center argument of dataaray_plot methods (pydata#8261) [pre-commit.ci] pre-commit autoupdate (pydata#8262) ...
def is_dask_collection(x: object) -> TypeGuard[DaskCollection]: | ||
if module_available("dask"): | ||
from dask.base import is_dask_collection | ||
from dask.typing import DaskCollection | ||
|
||
return is_dask_collection(x) | ||
return isinstance(x, DaskCollection) | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a related note, i encountered a strange issue while trying to consolidate the two implementations of is_dask_collection
in the utils.py
and pycompat.py
files:
xarray/xarray/namedarray/utils.py
Lines 55 to 60 in c9ba2be
def is_dask_collection(x: object) -> TypeGuard[DaskCollection]: | |
if module_available("dask"): | |
from dask.typing import DaskCollection | |
return isinstance(x, DaskCollection) | |
return False |
and
xarray/xarray/core/pycompat.py
Lines 89 to 94 in c9ba2be
def is_dask_collection(x): | |
if module_available("dask"): | |
from dask.base import is_dask_collection | |
return is_dask_collection(x) | |
return False |
when using pint
arrays, i observed different behaviors. Here is an example:
In [22]: from dask.typing import DaskCollection
In [23]: from dask.base import is_dask_collection
In [24]: import xarray as xr, pint_xarray
In [25]: ds = xr.Dataset({"a": ("x", [0, 1, 2]), "b": ("y", [-3, 5, 1], {"units": "m"})}).pint.quantify(a="s")
In [27]: ds.b
Out[27]:
<xarray.DataArray 'b' (y: 3)>
<Quantity([-3 5 1], 'meter')>
Dimensions without coordinates: y
In [34]: isinstance(ds.b.data, DaskCollection)
Out[34]: True
In [35]: is_dask_collection(ds.b.data)
Out[35]: False
@Illviljan, i'm curious to hear your thoughts on how to address this divergence in implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isinstance(x, DaskCollection)
This doesn't seem right. It just asserts that dask protocols are present on x
.
This is always true for xarray objects and pint objects. But these things are technically dask collections if and only if they wrap an underlying dask array. is_dask_collection
recognizes this (https://docs.dask.org/en/latest/_modules/dask/base.html#is_dask_collection)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just learned that this is a well known issue,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcherian, thank you for chiming in and the pointer to the doc page. i'm going to revert to using is_dask_collection()
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function just checked for .__dask_graph__
in october 2023. Using isinstance(x, DaskCollection)
correctly narrows down the typing and made sense back then.
Compare the old version: dask/dask#10676
Dask's definition of a DaskCollection
requires:
__dask_graph__
__dask_keys__
__dask_postcompute__
__dask_postpersist__
__dask_tokenize__
compute
persist
visualize
which is_dask_collection
ignores.
This is also strange and another reason why I preferred the more modern DaskCollection
-protocol.
Towards the strict goal in #8239.