-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable batched multi-source reading of JSONL files with large records #16687
Enable batched multi-source reading of JSONL files with large records #16687
Conversation
…f into json-multi-source-reading-bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the solution!
Looks good, just a few small questions/suggestions.
@@ -171,3 +172,77 @@ TEST_F(JsonReaderTest, ByteRange_MultiSource) | |||
CUDF_TEST_EXPECT_TABLES_EQUIVALENT(current_reader_table.tbl->view(), result->view()); | |||
} | |||
} | |||
|
|||
TEST_F(JsonReaderTest, ByteRangeWithRealloc_MultiSource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this test fail without the fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this out - this test is not good and actually passes without the fix since split_byte_range_reading
sort of aligns the byte ranges to the row offsets in the file.
We want to design a test as follows: Construct an input such that passing a small byte range of it to the reader will end up reading a large line. Since the size_per_subchunk
is the geometric mean of byte range size and 10kb, as long as the length of the line is sufficiently greater than size_per_subchunk * num_subchunks_prealloced
, the buffer will have to be reallocated. Since the previous code does not perform any reallocation, it will fail such a test.
I've added the above test to json_test.cpp
since we are not using split_byte_range_reading
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart solution 🚀
{ | ||
std::string long_string = "haha"; | ||
std::size_t log_repetitions = 12; | ||
long_string.reserve(long_string.size() * (1UL << log_repetitions)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit better than std::pow
(that I suggested) :D
/merge |
Description
Addresses #16664
Implements reallocate-and-retry logic when the initial buffer size estimate fails for byte range reading.
Chunked reader test checks for correct reallocation for different chunk sizes.
Checklist