Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36990: [R] Expose Parquet ReaderProperties #36992

Merged
merged 11 commits into from
Aug 14, 2023

Conversation

thisisnic
Copy link
Member

@thisisnic thisisnic commented Aug 2, 2023

Rationale for this change

Expose the ReaderProperties class in R so that the thrift size settings can be altered.

What changes are included in this PR?

Add R6 class, link it up to the C++ class, use it when reading Parquet files.

Are these changes tested?

Yes

Are there any user-facing changes?

Nope

@thisisnic
Copy link
Member Author

I've opened #36997 for the follow-up work of exposing this functionality (and other functionality) in the read_parquet() function signature.

@thisisnic thisisnic marked this pull request as ready for review August 2, 2023 20:50
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just a few questions!

on.exit(file$close())

# We get an error if we set the Thrift string size limit too small
expect_error(ParquetFileReader$create(file, reader_props = reader_props))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that expect_error() without a regex will also possibly swallow an unrelated error (if not now than possibly in the future when somebody updates some piece of relevant code).

Copy link
Member Author

@thisisnic thisisnic Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I know we didn't want to test C++ errors generally, but perhaps it makes sense to in this test as it seems strange to write a custom error handler just for the sake of testing, right? Maybe we need to relax the "don't test C++ errors" assumption in some circumstances - what do you reckon is best?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's not ideal to test an error message from C++; however, we do it a lot and I think it's much much better than expect_error() without a message (which can swallow unrelated errors or even completely invalid code). expect_error() without any constraining information has bitten me a number of times and our existing tests that use this pattern don't seem to cause us maintenance effort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. Had a think about this and (see comment below), I think that given that we are already taking responsibility for integration testing of libarrow and the R package, it makes sense. If the libarrow error message changes, it'll break the tests on a PR anyway, so that justifies this approach more than compared to if libarrow was some external library.

@@ -484,3 +484,30 @@ test_that("Can read Parquet files from a URL", {
expect_true(tibble::is_tibble(pu))
expect_identical(dim(pu), c(10L, 11L))
})

test_that("thrift string size can be specified when reading Parquet files", {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this test also include the container size limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

r/R/parquet.R Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 9, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 10, 2023
Comment on lines +502 to +509
# We get an error if we set the Thrift string size limit too small
expect_error(ParquetFileReader$create(file, reader_props = reader_props), "TProtocolException: Exceeded size limit")

# Increase the size and we can read successfully
reader_props$set_thrift_string_size_limit(10000)
reader <- ParquetFileReader$create(file, reader_props = reader_props)
data <- reader$ReadTable()
expect_identical(collect.ArrowTabular(data), example_data)
Copy link
Member Author

@thisisnic thisisnic Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this isn't urgent to merge, I'd like to slow down here and think at a more general level about what we are testing. Part of the problem here which has forced us into testing C++ messages is the fact that we're testing things we don't control.

I started thinking we should only test that we can get/set these values, but not test the outcomes of running the code. However, this was because I'd assumed that setting thrift_string_size_limit would be tested in the C++ layer, but after looking at the original PR, #13275, I see that it is not, and is testing in PyArrow, so we should probably follow that pattern and consider both the unit tests (i.e. are our R objects working as expected?) and integration tests (i.e. are the returned values as expected?) our responsibility.

I did have thoughts about splitting out tests into separate files for integration tests versus unit tests, but I think this would be more of an exercise in philosophical pedantry than adding actual value or making our lives easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With respect to C++ messages, there is an issue that you or I could tackle to make that slightly better ( #34826 ); however, we test parts of error messages in dozens of places and it doesn't cause any maintenance effort that I am aware of. Testing the C++ message is less about testing C++ behaviour and 100% about making sure we are testing the code path that we think we are testing. expect_error(with_a_tpyo()) still passes, as would a change in the underlying code that produces a completely unrelated error (e.g., a typo in the thrift setting/getting that results in "object not found" of "closure not subsettable"). It is very important that we do not swallow those errors and the C++ error messages almost never change (when they do, C++ developers have historically been very good at updating our tests to fix the failing CI).

I started thinking we should only test that we can get/set these values, but not test the outcomes of running the code.

I think that is the 100% correct way to go about it! The connection between pyarrow and Arrow C++ is closer and it is sometimes easier to write the tests in Python...I would not be surprised if many pieces of Arrow C++ relied on test coverage from Python. I don't think duplicating those tests in this PR is needed (you could open an issue for the tests to be added to Arrow C++...in the discussion for that ticket, it may come about that R is a good choice for those tests or it may not, or we might not have bandwidth to get there).

I did have thoughts about splitting out tests into separate files for integration tests versus unit tests, but I think this would be more of an exercise in philosophical pedantry than adding actual value or making our lives easier.

I think this does have value and would make our lives easier; however, I don't think either of us have time to do it right now. To a certain extent we already do this via benchmarks and the extra-tests folder, which don't run on CRAN but do run on CI (I think).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing the C++ message is less about testing C++ behaviour and 100% about making sure we are testing the code path that we think we are testing

I don't get this, can you expand? I was thinking that as we're testing the outcome of setting the thrift string value ridiculously low and the fact that it does cause an error, then that's testing the C++ behaviour, and so is more integration testing (but honestly, probably so are most of our tests then). We can also take this chat offline for our catchup if you like, as while it's interesting, it's not necessarily all that relevant to us merging this PR ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point...we don't have a separate code path that an error would trigger (there are some places where we use the fact that an error was generated to do something else, so to get 100% test coverage we need to trigger an error).

I think it's reasonable to add that test here because it verifies that the changes in this PR do, in fact fix the user issue, and, in the dataset case, it makes sure that the value actually makes it from open_dataset() to its final C++ home. You could also make the argument that we're testing C++ and so we should remove that test...I think either are fine (but expect_error() without a regex or class argument is not!).

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 10, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 10, 2023
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Aug 10, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 10, 2023
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Aug 10, 2023
@thisisnic thisisnic merged commit cd8830b into apache:main Aug 14, 2023
12 checks passed
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit cd8830b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
### Rationale for this change

Expose the ReaderProperties class in R so that the thrift size settings can be altered.

### What changes are included in this PR?

Add R6 class, link it up to the C++ class, use it when reading Parquet files.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Nope
* Closes: apache#36990

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] Expose Parquet ReaderProperties
2 participants