Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the test data for pylibcudf I/O tests #16247

Merged
merged 3 commits into from
Jul 12, 2024

Conversation

lithomas1
Copy link
Contributor

@lithomas1 lithomas1 commented Jul 10, 2024

Description

Don't just use random integers for every data type.

Decided not to use hypothesis since I don't think there's a good way to re-use the table across calls
(and I would like to keep the runtime of pylibcudf tests down).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lithomas1 lithomas1 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 10, 2024
@lithomas1 lithomas1 requested a review from a team as a code owner July 10, 2024 23:21
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 10, 2024
@@ -182,20 +182,6 @@ def test_read_json_basic(
source_or_sink, pa_table, lines=lines, compression=compression_type
)

request.applymarker(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the data is more well-formed I guess, this isn't crashing anymore.

I think there still might be an issue in libcudf's JSON reader, though.
(will open a followup issue if I can still reproduce, but it's a little hard to reproduce)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Generate random ASCII strings
strs = []
for _ in range(length):
chrs = np.random.randint(33, 128, length)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't start from 0 since 0-33 is ASCII control characters, and that can interfere with some of the text formats like CSV/JSON.

"""
if pa_type == pa.int64():
half = length // 2
negs = np.random.randint(-length, 0, half, dtype=np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best practice to use a generator with a seed for reproducibility rather than using np.randint

@pytest.fixture(
params=set(CompressionType).difference(unsupported_text_compression_types)
)
def text_compression_type(request):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this since most formats don't support all of the compression types, so it probably makes sense to break it out into text (CSV/JSON) vs binary (ORC/Parquet)

@lithomas1
Copy link
Contributor Author

OK, comments should be addressed now. I also brought over some of my changes from the CSV PR in here as well.
(mainly just pulling out the utilities that I added for testing the JSON reader)

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lithomas1
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 1ff7461 into rapidsai:branch-24.08 Jul 12, 2024
79 checks passed
@lithomas1 lithomas1 deleted the pylibcudf-io-tests branch July 12, 2024 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants