-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the test data for pylibcudf I/O tests #16247
Improve the test data for pylibcudf I/O tests #16247
Conversation
@@ -182,20 +182,6 @@ def test_read_json_basic( | |||
source_or_sink, pa_table, lines=lines, compression=compression_type | |||
) | |||
|
|||
request.applymarker( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the data is more well-formed I guess, this isn't crashing anymore.
I think there still might be an issue in libcudf's JSON reader, though.
(will open a followup issue if I can still reproduce, but it's a little hard to reproduce)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Generate random ASCII strings | ||
strs = [] | ||
for _ in range(length): | ||
chrs = np.random.randint(33, 128, length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't start from 0 since 0-33 is ASCII control characters, and that can interfere with some of the text formats like CSV/JSON.
""" | ||
if pa_type == pa.int64(): | ||
half = length // 2 | ||
negs = np.random.randint(-length, 0, half, dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's best practice to use a generator with a seed for reproducibility rather than using np.randint
@pytest.fixture( | ||
params=set(CompressionType).difference(unsupported_text_compression_types) | ||
) | ||
def text_compression_type(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this since most formats don't support all of the compression types, so it probably makes sense to break it out into text (CSV/JSON) vs binary (ORC/Parquet)
OK, comments should be addressed now. I also brought over some of my changes from the CSV PR in here as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/merge |
Description
Don't just use random integers for every data type.
Decided not to use hypothesis since I don't think there's a good way to re-use the table across calls
(and I would like to keep the runtime of pylibcudf tests down).
Checklist