Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handler csv reader options in cudf-polars #16211

Merged

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jul 8, 2024

Description

Previously we were just relying on the default cudf read_csv options which doesn't do the right thing if the user has configured things.

Now that polars passes through the information to us, we can handle things properly, and raise for unsupported cases.

While here, update to new polars release and adapt tests to bug fixes that have been made upstream.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested review from a team as code owners July 8, 2024 11:32
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Jul 8, 2024
@wence- wence- added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jul 8, 2024
# because polars will already have handled them.
path = Path(p)
with path.open() as f:
while f.readline() == "\n":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that read_csv should eventually implement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point (maybe soon?), cudf-polars is not going to use cudf.read_csv, but rather the libcudf layer directly, so we'll probably need to do this here anyway.

@@ -206,24 +208,104 @@ def __post_init__(self) -> None:
if self.file_options.n_rows is not None:
raise NotImplementedError("row limit in scan")
if self.typ not in ("csv", "parquet"):
raise NotImplementedError(f"Unhandled scan type: {self.typ}")
if self.cloud_options is not None and any(
self.cloud_options[k] is not None for k in ("aws", "azure", "gcp")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming that these strings are guaranteed to be in the dict at this point? If this option isn't supported at all would it be worthwhile just to only accept None here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

csv reader has non-None cloud_options but the values are None, so I can't punt for cloud_options is not None.

Convert internal PySeries to public Series in translation.
If one passes an empty list for the names, things should work.
Possibly some pieces are missing, but this is much closer to complete.
@wence- wence- force-pushed the wence/fea/polars-csv-options branch from a13836a to 184979c Compare July 9, 2024 08:40
@wence-
Copy link
Contributor Author

wence- commented Jul 9, 2024

/merge

@rapids-bot rapids-bot bot merged commit b693e79 into rapidsai:branch-24.08 Jul 9, 2024
80 checks passed
@wence- wence- deleted the wence/fea/polars-csv-options branch July 9, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants