[BUG] ORC writer incorrectly encodes a column of empty or null strings #7620

Closed
vuule opened this issue Mar 16, 2021 · 2 comments · Fixed by #7656

Comments


vuule commented Mar 16, 2021

The following code reproduces the issue:

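    from io import BytesIO

    import cudf
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.orc  # makes pa.orc available
    from cudf.tests.utils import assert_eq  # cuDF test helper; import path depends on the cuDF version

    # Write a string column that holds only null/empty values to an in-memory ORC file.
    buffer = BytesIO()
    df = pd.DataFrame()
    df["String"] = np.array([None, ""])  # also fails with np.array(["", ""])
    df = cudf.from_pandas(df)
    df.to_orc(buffer)

    # Read the file back with pyarrow.
    orcfile = pa.orc.ORCFile(buffer)
    expect = orcfile.read().to_pandas()
    assert_eq(expect, df.reset_index(drop=True).to_pandas())  # fails

    # Read the file back with cuDF.
    assert_eq(cudf.read_orc(buffer), df.reset_index(drop=True))  # fails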

It looks like the issue is in the writer, since it reproduces with both the cuDF reader and the pyarrow reader.

vuule added the bug, libcudf, and cuIO labels Mar 16, 2021
vuule self-assigned this Mar 16, 2021

vuule commented Mar 18, 2021

The output streams seem correct:
["", ""] -> present: null, length: 64 1 0 (two zeros)
[None, ""] -> present: 255 64 (false, true), length: 64 0 0 (one zero)
["", None] -> present: 255 128 (true, false), length: 64 0 0 (one zero)
So this might still be a reader issue (perhaps the reader assumes that the DATA stream must be present for a string column not to be entirely null).
@rgsl888prabhu any idea what could be the issue in the reader to cause this?
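
The byte values listed above can be checked by hand. The following is a minimal sketch, assuming the PRESENT stream uses ORC byte run-length encoding over MSB-first bits and the LENGTH stream holds a single unsigned RLEv2 "direct" run, as described in the ORC specification. The function names `decode_present` and `decode_direct_lengths` are hypothetical helpers written for this illustration; they are not part of cuDF or pyarrow.

```python
def decode_present(stream, num_rows):
    """Decode an ORC PRESENT stream: byte RLE wrapping MSB-first validity bits."""
    raw = bytearray()
    i = 0
    while i < len(stream):
        header = stream[i]
        i += 1
        if header < 128:
            # Run: the next byte repeats (header + 3) times.
            raw.extend([stream[i]] * (header + 3))
            i += 1
        else:
            # Literals: (256 - header) bytes follow verbatim.
            count = 256 - header
            raw.extend(stream[i:i + count])
            i += count
    bits = [bool((byte >> (7 - k)) & 1) for byte in raw for k in range(8)]
    return bits[:num_rows]


def decode_direct_lengths(stream):
    """Decode one unsigned RLEv2 'direct' run (assumes bit widths of 1 to 24)."""
    header = stream[0]
    if header >> 6 != 0b01:
        raise ValueError("only the direct sub-encoding is handled in this sketch")
    width_code = (header >> 1) & 0x1F
    if width_code > 23:
        raise ValueError("bit widths above 24 are not handled in this sketch")
    width = width_code + 1  # codes 0..23 map to widths 1..24 bits
    count = (((header & 1) << 8) | stream[1]) + 1  # 9-bit length, stored minus one
    payload = int.from_bytes(stream[2:], "big")
    total_bits = 8 * (len(stream) - 2)
    mask = (1 << width) - 1
    # Values are packed back to back, most significant bit first.
    return [(payload >> (total_bits - width * (n + 1))) & mask for n in range(count)]


print(decode_present(bytes([255, 64]), 2))    # [False, True]
print(decode_present(bytes([255, 128]), 2))   # [True, False]
print(decode_direct_lengths(bytes([64, 1, 0])))  # [0, 0]
print(decode_direct_lengths(bytes([64, 0, 0])))  # [0]
```

Under these assumptions the decoded values match the annotations in the comment: the validity bits and zero lengths are what a correct writer would emit for these inputs.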


vuule commented Mar 18, 2021

Based on the spec, it could just be that the DATA stream needs to be present for string columns.

rapids-bot bot pushed a commit that referenced this issue Mar 19, 2021
The reader had a [condition that skipped updating the stream pointer when the data size was zero](https://github.com/rapidsai/cudf/blob/8773a40f4c8ce63f56ed6eb67b4eaf959106939f/cpp/src/io/orc/reader_impl.cu#L538).
For `["", ""]`, which is valid data of size 0, this caused the column to be read as `[null, null]`. The condition that caused this issue has been removed.
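
To illustrate the failure mode, here is a toy model in pure Python. It is not the cuDF reader code: the function `read_string_column` and its `skip_when_data_empty` flag are hypothetical, and the model only assumes that validity comes from the PRESENT bits while string contents are sliced from the DATA bytes using the LENGTH values.

```python
def read_string_column(present, lengths, data, skip_when_data_empty):
    """Toy string-column reader.

    present: one bool per row (all True when the PRESENT stream is absent)
    lengths: one length per non-null row
    data:    concatenated string bytes (may be empty)
    """
    if skip_when_data_empty and len(data) == 0:
        # Models the faulty shortcut: an empty DATA stream is treated
        # as "nothing to decode", so every row comes back as null.
        return [None] * len(present)

    out = []
    pos = 0
    next_length = 0
    for valid in present:
        if not valid:
            out.append(None)
            continue
        n = lengths[next_length]
        next_length += 1
        out.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return out


# ["", ""]: no nulls, two zero lengths, empty DATA stream.
print(read_string_column([True, True], [0, 0], b"", skip_when_data_empty=True))   # [None, None]
print(read_string_column([True, True], [0, 0], b"", skip_when_data_empty=False))  # ['', '']
```

In this model, nullness is decided only by the PRESENT bits, so a DATA stream of size zero is still valid input when every non-null length is zero.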

I have also added test cases to validate.

closes #7620

Authors:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

Approvers:
  - Devavret Makkar (@devavret)
  - Vukasin Milovanovic (@vuule)
  - Keith Kraus (@kkraus14)

URL: #7656