Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] dask_cudf read_csv does not support dask.dataframe's blocksize argument (vs. chunksize) #11847

Closed
beckernick opened this issue Oct 3, 2022 · 1 comment · Fixed by #12394
Labels
bug Something isn't working dask Dask issue good first issue Good for newcomers Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Oct 3, 2022

dask_cudf's read_csv uses the parameter chunksize to control the size of each chunk when reading a csv file, while dask.dataframe uses blocksize. This can cause issues when using Dask-SQL and flipping between CPU and GPU workloads.

dask_cudf silently ignores this parameter rather than throwing a ValueError (or a warning).

We should support blocksize.

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask_cudf

n = 100000
k = 10
df = pd.DataFrame({f"c{x}": np.arange(n) for x in range(k)})
df.to_csv("temp.csv", index=False)
dd.read_csv("temp.csv", chunksize="1 MB")
---------------------------------------------------------------------------
ValueError 
...

dd.read_csv("temp.csv", blocksize="1 MB")
Dask DataFrame Structure:
                  c0     c1     c2     c3     c4     c5     c6     c7     c8     c9
npartitions=5                                                                      
               int64  int64  int64  int64  int64  int64  int64  int64  int64  int64
                 ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
...              ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
                 ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
                 ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
Dask Name: read-csv, 5 tasks
print(dask_cudf.read_csv("temp.csv", blocksize="1 MB"))
<dask_cudf.DataFrame | 1 tasks | 1 npartitions>

print(dask_cudf.read_csv("temp.csv", chunksize="1 MB"))
<dask_cudf.DataFrame | 6 tasks | 6 npartitions>
conda list | grep "rapids\|dask"
# packages in environment at /home/nicholasb/miniconda3/envs/rapids-22.10-dasksql:
cucim                     22.10.00a220930 cuda_11_py39_gf4229e3_51    rapidsai-nightly
cudf                      22.10.00a220929 cuda_11_py39_g5fad28942e_286    rapidsai-nightly
cudf_kafka                22.10.00a220929 py39_g920b58f948_288    rapidsai-nightly
cugraph                   22.10.00a220930 cuda11_py39_g91598080_88    rapidsai-nightly
cuml                      22.10.00a220930 cuda11_py39_g96da84cc1_50    rapidsai-nightly
cusignal                  22.10.00a220930 py39_gd075e87_12    rapidsai-nightly
cuspatial                 22.10.00a220930 py39_g6922ef5_55    rapidsai-nightly
custreamz                 22.10.00a220929 py39_g920b58f948_288    rapidsai-nightly
cuxfilter                 22.10.00a220930 py39_ge1aa0b2_17    rapidsai-nightly
dask                      2022.7.1           pyhd8ed1ab_0    conda-forge
dask-core                 2022.7.1           pyhd8ed1ab_0    conda-forge
dask-cuda                 22.10.00a220930 py39_gc0ae66c_20    rapidsai-nightly
dask-cudf                 22.10.00a220929 cuda_11_py39_g920b58f948_288    rapidsai-nightly
dask-sql                  2022.9.1a220928 py39_ga7583b5_13    dask/label/dev
datashader                0.13.1a                    py_0    rapidsai-nightly
libcucim                  22.10.00a220930 cuda11_gf4229e3_51    rapidsai-nightly
libcudf                   22.10.00a220929 cuda11_g920b58f948_288    rapidsai-nightly
libcudf_kafka             22.10.00a220929 g920b58f948_288    rapidsai-nightly
libcugraph                22.10.00a220930 cuda11_g91598080_88    rapidsai-nightly
libcugraph_etl            22.10.00a220930 cuda11_g91598080_88    rapidsai-nightly
libcugraphops             22.10.00a220930 cuda11_g553bacf_29    rapidsai-nightly
libcuml                   22.10.00a220930 cuda11_g96da84cc1_50    rapidsai-nightly
libcumlprims              22.10.00a220804 cuda11_g2adfe69_0    rapidsai-nightly
libcuspatial              22.10.00a220930 cuda11_g6922ef5_55    rapidsai-nightly
libraft-distance          22.10.00a220930 cuda11_g2e98138c_57    rapidsai-nightly
libraft-headers           22.10.00a220930 cuda11_g2e98138c_57    rapidsai-nightly
libraft-nn                22.10.00a220930 cuda11_g2e98138c_57    rapidsai-nightly
librmm                    22.10.00a220929 cuda11_g8a3a552e_28    rapidsai-nightly
libxgboost                1.6.2dev.rapidsai22.10       cuda_11_0    rapidsai-nightly
ptxcompiler               0.6.0           cuda_11_py39_g455bc7f_2    rapidsai-nightly
py-xgboost                1.6.2dev.rapidsai22.10  cuda_11_py39_0    rapidsai-nightly
pylibcugraph              22.10.00a220930 cuda11_py39_g91598080_88    rapidsai-nightly
pylibraft                 22.10.00a220930 cuda11_py39_g2e98138c_57    rapidsai-nightly
raft-dask                 22.10.00a220930 cuda11_py39_g2e98138c_57    rapidsai-nightly
rapids                    22.10.00a220930 cuda11_py39_gbce77c5_69    rapidsai-nightly
rapids-xgboost            22.10.00a220930 cuda11_py39_gbce77c5_69    rapidsai-nightly
rmm                       22.10.00a220929 cuda11_py39_g8a3a552e_28    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.28.00a220927  py39_g2ab6070_27    rapidsai-nightly
xgboost                   1.6.2dev.rapidsai22.10  cuda_11_py39_0    rapidsai-nightly
@beckernick beckernick added bug Something isn't working Python Affects Python cuDF API. dask Dask issue labels Oct 3, 2022
@beckernick beckernick changed the title [BUG] dask_cudf read_csv does not align with dask.dataframe's blocksize argument (vs. chunksize) [BUG] dask_cudf read_csv does not support dask.dataframe's blocksize argument (vs. chunksize) Oct 3, 2022
@beckernick
Copy link
Member Author

With the merge of #11920 , this now also applies when users want to toggle between CPU and GPU:

with dask.config.set({"dataframe.backend": "cudf"}):
    ddf = dd.read_csv("/home/nicholasb/dev/data/HIGGS.csv", chunksize="200 MB")

with dask.config.set({"dataframe.backend": "pandas"}):
    ddf = dd.read_csv("/home/nicholasb/dev/data/HIGGS.csv", blocksize="25 MB")

@beckernick beckernick added the good first issue Good for newcomers label Oct 24, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 13, 2023
Deprecates `chunksize` in favor of `blocksize` to be consistent with `dask.dataframe.read_csv`.

Closes #11847

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #12394
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dask Dask issue good first issue Good for newcomers Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant