Error using dask dataframe with incompatible column dtypes #1235

Closed
ianthomas23 opened this issue Jun 16, 2023 · 0 comments · Fixed by #1236

Consider datashading a dask dataframe that contains columns of incompatible dtypes, where some of those columns are not actually used in the datashade operation:

import dask.dataframe as dd
import datashader as ds
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=dict(
        x=[0, 1, 2],
        y=[0, 1, 2],
        dates=np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64'),
    )
)
ddf = dd.from_pandas(df, npartitions=2)

canvas = ds.Canvas(2, 2)
agg = canvas.points(ddf, 'x', 'y', ds.count())

Note the dates column is not used in the canvas.points call. Running this gives the following error:

Traceback (most recent call last):
  File "/Users/iant/github_temp/datashader_temp/dask_dtypes.py", line 16, in <module>
    agg = canvas.points(ddf, 'x', 'y', ds.count())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/core.py", line 220, in points
    return bypixel(source, self, glyph, agg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/core.py", line 1257, in bypixel
    return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/utils.py", line 109, in __call__
    return lk[typ](head, *rest, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/data_libraries/dask.py", line 22, in dask_pipeline
    dsk, name = glyph_dispatch(glyph, df, schema, canvas, summary, antialias=antialias, cuda=cuda)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/utils.py", line 112, in __call__
    return lk[cls](head, *rest, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/data_libraries/dask.py", line 122, in default
    dtype = np.result_type(*dtypes)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in result_type
TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[int64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[int64]'>, <class 'numpy.dtype[int64]'>, <class 'numpy.dtype[datetime64]'>)

Internally, the code that handles dask dataframes tries to find a single dtype that is compatible with every column of the dataframe (the np.result_type(*dtypes) call at the bottom of the traceback). This is unnecessary: we only need to consider the x and y columns here, so we can ignore the others.
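
The difference between promoting every column and promoting only the columns in use can be sketched with NumPy alone. This is an illustration of the idea, not the datashader code or the change made in #1236; the `column_dtypes` mapping and the `common_dtype` helper are hypothetical names introduced for this example.

```python
import numpy as np

# Hypothetical stand-in for the dataframe's dtypes (column name -> dtype),
# mirroring the x, y and dates columns of the example above.
column_dtypes = {
    "x": np.dtype("int64"),
    "y": np.dtype("int64"),
    "dates": np.dtype("datetime64[ns]"),
}


def common_dtype(dtypes_by_column, columns=None):
    """Promote the dtypes of the selected columns (all columns if None)."""
    if columns is None:
        columns = list(dtypes_by_column)
    return np.result_type(*(dtypes_by_column[name] for name in columns))


# Promoting every column fails, because int64 and datetime64 share no
# common dtype. This is the TypeError seen in the traceback.
try:
    common_dtype(column_dtypes)
    promoted_all_columns = True
except TypeError:
    promoted_all_columns = False

# Promoting only the columns the glyph uses succeeds.
xy_dtype = common_dtype(column_dtypes, ["x", "y"])
```

On the user side, passing only the needed columns (for example `ddf[['x', 'y']]`) would presumably avoid the error for the same reason, although that workaround has not been verified here.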

First reported by @hoxbro.
