Commit
Fix `dask_cudf.read_parquet` regression for legacy timestamp data (rapidsai#15929)

cudf does not currently support timezone-aware datetime columns. For example:

```python
import cudf
import pandas as pd

pdf = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["1996-01-02", "1996-12-01"],
            utc=True,
        ),
        "x": [1, 2],
    }
)
cudf.DataFrame.from_pandas(pdf)
```

```
NotImplementedError: cuDF does not yet support timezone-aware datetimes
```

However, `cudf.read_parquet` **does** allow you to read this same data from a Parquet file. This PR adds a simple fix to allow the same data to be read with `dask_cudf`. The dask_cudf version was previously "broken" because it relies on upstream pyarrow logic to construct `meta` as a pandas DataFrame (and then we just convert `meta` from pandas to cudf). As illustrated in the example above, this direct conversion is not allowed when one or more columns contain timezone information.

**Important Context**

The actual motivation for this PR is to fix a **regression** in 24.06+ for older Parquet files containing "legacy" timestamp types (e.g. `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS`). In `pyarrow 14.0.2` (used by cudf-24.04), these legacy types were not automatically translated to timezone-aware dtypes by pyarrow. In `pyarrow 16.1.0` (used by cudf-24.06+), the legacy types **ARE** automatically translated. Therefore, in moving from cudf-24.04 to cudf-24.06+, some `dask_cudf` users will find that they can no longer read the same Parquet file containing legacy timestamp data.

I'm not entirely sure if cudf should always allow users to read Parquet data with timezone-aware dtypes (e.g. if the timezone is **not** UTC), but it definitely makes sense for cudf to ignore automatic/unnecessary timezone translations.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#15929
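The `meta` problem described in this commit message can be illustrated with pandas alone. The sketch below is not the code from the PR; `strip_timezones` is a hypothetical helper that shows one way a timezone-aware pandas `meta` could be made timezone-naive before being handed to a library that rejects timezone-aware dtypes. It assumes that normalizing to UTC and then dropping the timezone annotation is acceptable, which matches the legacy-timestamp case described above but may not hold for other timezones.

```python
import pandas as pd


def strip_timezones(meta: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `meta` with timezone-aware columns made timezone-naive.

    Hypothetical helper for illustration only; not part of cudf or dask_cudf.
    """
    out = meta.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pd.DatetimeTZDtype):
            # Normalize to UTC first, then drop the timezone annotation.
            out[name] = out[name].dt.tz_convert("UTC").dt.tz_localize(None)
    return out


pdf = pd.DataFrame(
    {
        "time": pd.to_datetime(["1996-01-02", "1996-12-01"], utc=True),
        "x": [1, 2],
    }
)

# An empty frame with the same schema stands in for a dask-style `meta`.
meta = strip_timezones(pdf.iloc[:0])

# The same helper also works on a frame that holds data.
naive = strip_timezones(pdf)
```

After this step, `meta["time"]` has a plain timezone-naive datetime dtype, so a direct pandas-to-cudf conversion of `meta` would no longer hit the timezone check shown in the error above.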