KFold cross validation fails with dask dataframes #956

phobson · 2023-01-10T07:06:57Z

Describe the issue:

KFold.split doesn't support Dask DataFrames. Given the recent Dask integrations in, e.g., xgboost and optuna, it would be very useful if it did. The error message acknowledges that DataFrames are not supported and suggests converting them to Dask arrays. For modern ML workflows this isn't ideal, since datasets commonly contain fields of many types (float, int, bool, categorical).

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, KFold
ddf = dd.demo.make_timeseries()
train, test = train_test_split(ddf)  # works

k_folder = KFold(n_splits=5)
for train, test in k_folder.split(ddf):  # fails
    pass

Traceback:

TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 for train, test in k_folder.split(ddf):
      2     pass

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/model_selection/_split.py:241, in KFold.split(self, X, y, groups)
    240 def split(self, X, y=None, groups=None):
--> 241     X = check_array(X)
    242     n_samples = X.shape[0]
    243     n_splits = self.n_splits

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/utils.py:197, in check_array(array, accept_dask_array, accept_dask_dataframe, accept_unknown_chunks, accept_multiple_blocks, preserve_pandas_dataframe, remove_zero_chunks, *args, **kwargs)
    195 elif isinstance(array, dd.DataFrame):
    196     if not accept_dask_dataframe:
--> 197         raise TypeError(
    198             "This estimator does not support dask dataframes. "
    199             "This might be resolved with one of\n\n"
    200             "    1. ddf.to_dask_array(lengths=True)\n"
    201             "    2. ddf.to_dask_array()  # may cause other issues because "
    202             "of unknown chunk sizes"
    203         )
    204     # TODO: sample?
    205     return array

TypeError: This estimator does not support dask dataframes. This might be resolved with one of

    1. ddf.to_dask_array(lengths=True)
    2. ddf.to_dask_array()  # may cause other issues because of unknown chunk sizes

Anything else we need to know?:

We recently worked around this limitation with the following:

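def _make_cv(df, num_folds):
    # Split the Dask DataFrame into num_folds random pieces of roughly equal size.
    frac = [1 / num_folds] * num_folds
    splits = df.random_split(frac, shuffle=True)
    for i in range(num_folds):
        # train is a list of the remaining pieces; dd.concat(train) merges them.
        train = [splits[j] for j in range(num_folds) if j != i]
        test = splits[i]
        yield train, test

n_splits = 5  # not defined in the original snippet; matches the KFold example above
for i, (train, test) in enumerate(_make_cv(ddf, n_splits)):
    pass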
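
The fold-building step of this workaround doesn't depend on Dask, so it can be sketched on its own. The following is a minimal illustration, not dask-ml code: iter_folds and its combine parameter are hypothetical names introduced here, and plain Python lists stand in for the pieces that random_split returns.

```python
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_folds(
    parts: Sequence[T], combine: Callable[[list], T]
) -> Iterator[Tuple[T, T]]:
    """Yield (train, test) pairs, holding out each part exactly once."""
    for i, test in enumerate(parts):
        # Everything except the held-out part goes into the training set.
        train = combine([p for j, p in enumerate(parts) if j != i])
        yield train, test


# Plain lists stand in for the pieces returned by random_split.
parts = [[0, 1], [2, 3], [4, 5]]
flatten = lambda chunks: [x for chunk in chunks for x in chunk]
folds = list(iter_folds(parts, flatten))
```

With Dask, passing the output of ddf.random_split(frac, shuffle=True) as parts and dd.concat as combine should give a single training DataFrame per fold instead of a list of pieces. That pairing is an assumption and hasn't been tested against dask-ml estimators.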

jrbourbeau · 2023-01-10T17:18:42Z

Thanks @phobson. I agree it would be nice if Dask DataFrames were supported here (this would also match scikit-learn's behavior).

cc @mmccarty for visibility in case you, or folks around you, have bandwidth to look into this

mmccarty · 2023-01-11T18:19:04Z

Thanks @jrbourbeau and @phobson, I'll take a look.
