Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable batched multi-source reading of JSONL files with large records #16687

Merged

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Aug 29, 2024

Description

Addresses #16664

Implements reallocate-and-retry logic when the initial buffer size estimate fails for byte range reading.
Chunked reader test checks for correct reallocation for different chunk sizes.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 29, 2024
@shrshi shrshi added bug Something isn't working cuIO cuIO issue improvement Improvement / enhancement to an existing function labels Aug 29, 2024
@shrshi shrshi added non-breaking Non-breaking change and removed improvement Improvement / enhancement to an existing function labels Aug 29, 2024
@shrshi shrshi marked this pull request as ready for review August 29, 2024 06:40
@shrshi shrshi requested a review from a team as a code owner August 29, 2024 06:40
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the solution!
Looks good, just a few small questions/suggestions.

cpp/src/io/json/read_json.cu Show resolved Hide resolved
cpp/tests/io/json/json_chunked_reader.cu Outdated Show resolved Hide resolved
@@ -171,3 +172,77 @@ TEST_F(JsonReaderTest, ByteRange_MultiSource)
CUDF_TEST_EXPECT_TABLES_EQUIVALENT(current_reader_table.tbl->view(), result->view());
}
}

TEST_F(JsonReaderTest, ByteRangeWithRealloc_MultiSource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this test fail without the fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out - this test is not good and actually passes without the fix since split_byte_range_reading sort of aligns the byte ranges to the row offsets in the file.
We want to design a test as follows: Construct an input such that passing a small byte range of it to the reader will end up reading a large line. Since the size_per_subchunk is the geometric mean of byte range size and 10kb, as long as the length of the line is sufficiently greater than size_per_subchunk * num_subchunks_prealloced, the buffer will have to be reallocated. Since the previous code does not perform any reallocation, it will fail such a test.

I've added the above test to json_test.cpp since we are not using split_byte_range_reading.

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart solution 🚀

cpp/tests/io/json/json_chunked_reader.cu Outdated Show resolved Hide resolved
@shrshi shrshi requested a review from vuule August 30, 2024 17:22
{
std::string long_string = "haha";
std::size_t log_repetitions = 12;
long_string.reserve(long_string.size() * (1UL << log_repetitions));
Copy link
Contributor

@vuule vuule Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit better than std::pow (that I suggested) :D

@shrshi shrshi added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Aug 30, 2024
@shrshi
Copy link
Contributor Author

shrshi commented Aug 30, 2024

/merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants