KFold cross validation fails with dask dataframes #956

Open
phobson opened this issue Jan 10, 2023 · 2 comments
phobson commented Jan 10, 2023

Describe the issue:

KFold.split doesn't support Dask DataFrames. With the recent Dask integrations in libraries such as xgboost and optuna, it would be very useful if it did. The error message acknowledges that DataFrames are not supported and suggests converting them to Dask arrays. For modern ML workflows this isn't ideal, since datasets commonly contain fields of many types (float, int, bool, categorical), which a single-dtype array cannot hold without casting.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, KFold
ddf = dd.demo.make_timeseries()
train, test = train_test_split(ddf)  # works

k_folder = KFold(n_splits=5)
for train, test in k_folder.split(ddf):  # fails
    pass
Traceback:
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 for train, test in k_folder.split(ddf):
      2     pass

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/model_selection/_split.py:241, in KFold.split(self, X, y, groups)
    240 def split(self, X, y=None, groups=None):
--> 241     X = check_array(X)
    242     n_samples = X.shape[0]
    243     n_splits = self.n_splits

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/utils.py:197, in check_array(array, accept_dask_array, accept_dask_dataframe, accept_unknown_chunks, accept_multiple_blocks, preserve_pandas_dataframe, remove_zero_chunks, *args, **kwargs)
    195 elif isinstance(array, dd.DataFrame):
    196     if not accept_dask_dataframe:
--> 197         raise TypeError(
    198             "This estimator does not support dask dataframes. "
    199             "This might be resolved with one of\n\n"
    200             "    1. ddf.to_dask_array(lengths=True)\n"
    201             "    2. ddf.to_dask_array()  # may cause other issues because "
    202             "of unknown chunk sizes"
    203         )
    204     # TODO: sample?
    205     return array

TypeError: This estimator does not support dask dataframes. This might be resolved with one of

    1. ddf.to_dask_array(lengths=True)
    2. ddf.to_dask_array()  # may cause other issues because of unknown chunk sizes

Anything else we need to know?:

We recently worked around this limitation with the following:

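def _make_cv(df, num_folds):
    # Randomly split the dataframe into num_folds pieces of roughly equal size
    frac = [1 / num_folds] * num_folds
    splits = df.random_split(frac, shuffle=True)
    for i in range(num_folds):
        # train is a list of the remaining pieces, not a single dataframe
        train = [splits[j] for j in range(num_folds) if j != i]
        test = splits[i]
        yield train, test

n_splits = 5
for i, (train, test) in enumerate(_make_cv(ddf, n_splits)):
    pass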
@jrbourbeau
Member

Thanks @phobson. I agree it would be nice if Dask DataFrames were supported here (this would also match scikit-learn's behavior).

cc @mmccarty for visibility in case you, or folks around you, have bandwidth to look into this
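For comparison, a minimal sketch of the scikit-learn behavior mentioned in the comment above, assuming a small in-memory pandas DataFrame (the column names are made up for illustration). scikit-learn's `KFold.split` accepts the DataFrame directly and yields positional index arrays, so mixed dtypes are not a problem.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical frame with mixed dtypes (float and bool).
pdf = pd.DataFrame({"x": np.arange(10, dtype=float), "flag": np.arange(10) % 2 == 0})

fold_sizes = []
for train_idx, test_idx in KFold(n_splits=5).split(pdf):
    # Indices are positional, so iloc selects the matching rows.
    train, test = pdf.iloc[train_idx], pdf.iloc[test_idx]
    fold_sizes.append((len(train), len(test)))

print(fold_sizes)
```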

@mmccarty
Member

Thanks @jrbourbeau and @phobson, I'll take a look.
