GH-36990: [R] Expose Parquet ReaderProperties #36992
Conversation
I've opened #36997 for the follow-up work of exposing this functionality (and other functionality) in the …
Looking good! Just a few questions!
r/tests/testthat/test-parquet.R
Outdated
on.exit(file$close())

# We get an error if we set the Thrift string size limit too small
expect_error(ParquetFileReader$create(file, reader_props = reader_props))
I worry that `expect_error()` without a regex will also possibly swallow an unrelated error (if not now, then possibly in the future when somebody updates some piece of relevant code).
Good point. I know we didn't want to test C++ errors generally, but perhaps it makes sense to do so in this test, as it seems strange to write a custom error handler just for the sake of testing, right? Maybe we need to relax the "don't test C++ errors" assumption in some circumstances - what do you reckon is best?
I agree that it's not ideal to test an error message from C++; however, we do it a lot and I think it's much, much better than `expect_error()` without a message (which can swallow unrelated errors or even completely invalid code). `expect_error()` without any constraining information has bitten me a number of times, and our existing tests that use this pattern don't seem to cause us maintenance effort.
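As a sketch, the constrained pattern being advocated here could look like the following. Note that the `ParquetReaderProperties` class name and setter shown are assumptions pieced together from the diffs in this PR, not a confirmed API:

```r
library(arrow)
library(testthat)

# Hypothetical sketch: constrain expect_error() with a regex so that an
# unrelated failure (e.g. a typo elsewhere in the code under test) cannot
# accidentally satisfy the expectation.
reader_props <- ParquetReaderProperties$create()  # name assumed from this PR's diff
reader_props$set_thrift_string_size_limit(1)

expect_error(
  ParquetFileReader$create(file, reader_props = reader_props),
  "Exceeded size limit"  # regex ties the expectation to the specific Thrift failure
)
```

With the regex in place, an unrelated error such as "object not found" would fail the test loudly instead of being silently swallowed.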
Yeah, agreed. I had a think about this (see comment below) and I think that, given we are already taking responsibility for integration testing of libarrow and the R package, it makes sense. If the libarrow error message changes, it'll break the tests on a PR anyway, which justifies this approach more than it would if libarrow were some external library.
r/tests/testthat/test-parquet.R
Outdated
@@ -484,3 +484,30 @@ test_that("Can read Parquet files from a URL", {
  expect_true(tibble::is_tibble(pu))
  expect_identical(dim(pu), c(10L, 11L))
})

test_that("thrift string size can be specified when reading Parquet files", {
Should this test also include the container size limit?
Updated
# We get an error if we set the Thrift string size limit too small
expect_error(ParquetFileReader$create(file, reader_props = reader_props), "TProtocolException: Exceeded size limit")

# Increase the size and we can read successfully
reader_props$set_thrift_string_size_limit(10000)
reader <- ParquetFileReader$create(file, reader_props = reader_props)
data <- reader$ReadTable()
expect_identical(collect.ArrowTabular(data), example_data)
Given this isn't urgent to merge, I'd like to slow down here and think at a more general level about what we are testing. Part of the problem here, which has forced us into testing C++ messages, is the fact that we're testing things we don't control.
I started thinking we should only test that we can get/set these values, but not test the outcomes of running the code. However, this was because I'd assumed that setting `thrift_string_size_limit` would be tested in the C++ layer, but after looking at the original PR, #13275, I see that it is not, and is instead tested in PyArrow, so we should probably follow that pattern and consider both the unit tests (i.e. are our R objects working as expected?) and integration tests (i.e. are the returned values as expected?) our responsibility.
I did have thoughts about splitting out tests into separate files for integration tests versus unit tests, but I think this would be more of an exercise in philosophical pedantry than adding actual value or making our lives easier.
With respect to C++ messages, there is an issue that you or I could tackle to make that slightly better ( #34826 ); however, we test parts of error messages in dozens of places and it doesn't cause any maintenance effort that I am aware of. Testing the C++ message is less about testing C++ behaviour and 100% about making sure we are testing the code path that we think we are testing. `expect_error(with_a_tpyo())` still passes, as would a change in the underlying code that produces a completely unrelated error (e.g., a typo in the thrift setting/getting that results in "object not found" or "closure not subsettable"). It is very important that we do not swallow those errors, and the C++ error messages almost never change (when they do, C++ developers have historically been very good at updating our tests to fix the failing CI).
> I started thinking we should only test that we can get/set these values, but not test the outcomes of running the code.
I think that is the 100% correct way to go about it! The connection between pyarrow and Arrow C++ is closer and it is sometimes easier to write the tests in Python...I would not be surprised if many pieces of Arrow C++ relied on test coverage from Python. I don't think duplicating those tests in this PR is needed (you could open an issue for the tests to be added to Arrow C++...in the discussion for that ticket, it may come about that R is a good choice for those tests or it may not, or we might not have bandwidth to get there).
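A get/set-only unit test of the kind being discussed might look roughly like this sketch. The class name, the setter, and the getter-style access are all assumptions based on the setter shown in the diffs above; the real R6 interface may differ:

```r
library(arrow)
library(testthat)

# Sketch: unit-test only that the R object stores the value we set,
# without exercising the C++ Parquet-reading code path at all.
reader_props <- ParquetReaderProperties$create()  # name assumed from this PR
reader_props$set_thrift_string_size_limit(10000)

# Getter access is hypothetical; shown here to illustrate the idea of a
# pure get/set round-trip check.
expect_identical(reader_props$thrift_string_size_limit, 10000L)
```

A test like this covers "are our R objects working as expected?" while leaving the "are the returned values as expected?" outcome checks to the integration-style tests shown elsewhere in this PR.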
> I did have thoughts about splitting out tests into separate files for integration tests versus unit tests, but I think this would be more of an exercise in philosophical pedantry than adding actual value or making our lives easier.
I think this does have value and would make our lives easier; however, I don't think either of us have time to do it right now. To a certain extent we already do this via benchmarks and the extra-tests folder, which don't run on CRAN but do run on CI (I think).
> Testing the C++ message is less about testing C++ behaviour and 100% about making sure we are testing the code path that we think we are testing.
I don't get this; can you expand? I was thinking that, as we're testing the outcome of setting the thrift string value ridiculously low and the fact that it does cause an error, that's testing the C++ behaviour, and so is more integration testing (but honestly, probably so are most of our tests then). We can also take this chat offline for our catch-up if you like, as while it's interesting, it's not necessarily all that relevant to us merging this PR ;)
That's a good point...we don't have a separate code path that an error would trigger (there are some places where we use the fact that an error was generated to do something else, so to get 100% test coverage we need to trigger an error).
I think it's reasonable to add that test here because it verifies that the changes in this PR do, in fact, fix the user issue, and, in the dataset case, it makes sure that the value actually makes it from `open_dataset()` to its final C++ home. You could also make the argument that we're testing C++ and so we should remove that test...I think either is fine (but `expect_error()` without a regex or class argument is not!).
This looks great! Thanks!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit cd8830b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Expose the ReaderProperties class in R so that the thrift size settings can be altered.
What changes are included in this PR?
Add R6 class, link it up to the C++ class, use it when reading Parquet files.
Are these changes tested?
Yes
Are there any user-facing changes?
Nope
Closes: apache#36990
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
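Putting the pieces above together, end-to-end usage of the exposed reader properties might look roughly like this sketch. The file name is illustrative, and the class/method names are assumptions inferred from the test diffs in this PR:

```r
library(arrow)

# Raise the Thrift string size limit before opening a Parquet file whose
# metadata contains unusually large strings (the user issue this PR fixes).
props <- ParquetReaderProperties$create()  # name assumed from this PR's diffs
props$set_thrift_string_size_limit(100000)

# "metadata_heavy.parquet" is a hypothetical example file
reader <- ParquetFileReader$create("metadata_heavy.parquet", reader_props = props)
tbl <- reader$ReadTable()
head(as.data.frame(tbl))
```

Without the raised limit, opening such a file would fail with the "TProtocolException: Exceeded size limit" error discussed in the review thread above.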