GH-36990: [R] Expose Parquet ReaderProperties #36992
Conversation
I've opened #36997 for the follow-up work of exposing this functionality (and other functionality) in the …
Looking good! Just a few questions!
r/tests/testthat/test-parquet.R
Outdated
on.exit(file$close())

# We get an error if we set the Thrift string size limit too small
expect_error(ParquetFileReader$create(file, reader_props = reader_props))
I worry that `expect_error()` without a regex will also possibly swallow an unrelated error (if not now, then possibly in the future when somebody updates some piece of relevant code).
Good point. I know we didn't want to test C++ errors generally, but perhaps it makes sense to do so in this test, as it seems strange to write a custom error handler just for the sake of testing, right? Maybe we need to relax the "don't test C++ errors" assumption in some circumstances - what do you reckon is best?
I agree that it's not ideal to test an error message from C++; however, we do it a lot and I think it's much, much better than `expect_error()` without a message (which can swallow unrelated errors or even completely invalid code). `expect_error()` without any constraining information has bitten me a number of times, and our existing tests that use this pattern don't seem to cause us maintenance effort.
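As a sketch, the constrained pattern being advocated here could look like the following. Note that the `ParquetReaderProperties` class name and setter shown are assumptions pieced together from the diffs in this PR, not a confirmed API:

```r
library(arrow)
library(testthat)

# Hypothetical sketch: constrain expect_error() with a regex so that an
# unrelated failure (e.g. a typo elsewhere in the code under test) cannot
# accidentally satisfy the expectation.
reader_props <- ParquetReaderProperties$create()  # name assumed from this PR's diff
reader_props$set_thrift_string_size_limit(1)

expect_error(
  ParquetFileReader$create(file, reader_props = reader_props),
  "Exceeded size limit"  # regex ties the expectation to the specific Thrift failure
)
```

With the regex in place, an unrelated error such as "object not found" would fail the test loudly instead of being silently swallowed.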
Yeah, agreed. I had a think about this (see comment below) and I think that, given we are already taking responsibility for integration testing of libarrow and the R package, it makes sense. If the libarrow error message changes, it'll break the tests on a PR anyway, which justifies this approach more than it would if libarrow were some external library.
r/tests/testthat/test-parquet.R
Outdated
@@ -484,3 +484,30 @@ test_that("Can read Parquet files from a URL", {
  expect_true(tibble::is_tibble(pu))
  expect_identical(dim(pu), c(10L, 11L))
})

test_that("thrift string size can be specified when reading Parquet files", {
Should this test also include the container size limit?
Updated
# We get an error if we set the Thrift string size limit too small
expect_error(ParquetFileReader$create(file, reader_props = reader_props), "TProtocolException: Exceeded size limit")

# Increase the size and we can read successfully
reader_props$set_thrift_string_size_limit(10000)
reader <- ParquetFileReader$create(file, reader_props = reader_props)
data <- reader$ReadTable()
expect_identical(collect.ArrowTabular(data), example_data)
Given this isn't urgent to merge, I'd like to slow down here and think at a more general level about what we are testing. Part of the problem here, which has forced us into testing C++ messages, is the fact that we're testing things we don't control.
I started thinking we should only test that we can get/set these values, but not test the outcomes of running the code. However, this was because I'd assumed that setting `thrift_string_size_limit` would be tested in the C++ layer, but after looking at the original PR, #13275, I see that it is not, and is instead tested in PyArrow, so we should probably follow that pattern and consider both the unit tests (i.e. are our R objects working as expected?) and integration tests (i.e. are the returned values as expected?) our responsibility.
I did have thoughts about splitting out tests into separate files for integration tests versus unit tests, but I think this would be more of an exercise in philosophical pedantry than adding actual value or making our lives easier.
With respect to C++ messages, there is an issue that you or I could tackle to make that slightly better ( #34826 ); however, we test parts of error messages in dozens of places and it doesn't cause any maintenance effort that I am aware of. Testing the C++ message is less about testing C++ behaviour and 100% about making sure we are testing the code path that we think we are testing. `expect_error(with_a_tpyo())` still passes, as would a change in the underlying code that produces a completely unrelated error (e.g., a typo in the thrift setting/getting that results in "object not found" or "closure not subsettable"). It is very important that we do not swallow those errors, and the C++ error messages almost never change (when they do, C++ developers have historically been very good at updating our tests to fix the failing CI).
> I started thinking we should only test that we can get/set these values, but not test the outcomes of running the code.
I think that is the 100% correct way to go about it! The connection between pyarrow and Arrow C++ is closer and it is sometimes easier to write the tests in Python...I would not be surprised if many pieces of Arrow C++ relied on test coverage from Python. I don't think duplicating those tests in this PR is needed (you could open an issue for the tests to be added to Arrow C++...in the discussion for that ticket, it may come about that R is a good choice for those tests or it may not, or we might not have bandwidth to get there).
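A get/set-only unit test of the kind being discussed might look roughly like this sketch. The class name, the setter, and the getter-style access are all assumptions based on the setter shown in the diffs above; the real R6 interface may differ:

```r
library(arrow)
library(testthat)

# Sketch: unit-test only that the R object stores the value we set,
# without exercising the C++ Parquet-reading code path at all.
reader_props <- ParquetReaderProperties$create()  # name assumed from this PR
reader_props$set_thrift_string_size_limit(10000)

# Getter access is hypothetical; shown here to illustrate the idea of a
# pure get/set round-trip check.
expect_identical(reader_props$thrift_string_size_limit, 10000L)
```

A test like this covers "are our R objects working as expected?" while leaving the "are the returned values as expected?" outcome checks to the integration-style tests shown elsewhere in this PR.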
> I did have thoughts about splitting out tests into separate files for integration tests versus unit tests, but I think this would be more of an exercise in philosophical pedantry than adding actual value or making our lives easier.
I think this does have value and would make our lives easier; however, I don't think either of us have time to do it right now. To a certain extent we already do this via benchmarks and the extra-tests folder, which don't run on CRAN but do run on CI (I think).
> Testing the C++ message is less about testing C++ behaviour and 100% about making sure we are testing the code path that we think we are testing.
I don't get this; can you expand? I was thinking that, as we're testing the outcome of setting the thrift string value ridiculously low and the fact that it does cause an error, that's testing the C++ behaviour, and so is more integration testing (but honestly, probably so are most of our tests then). We can also take this chat offline for our catch-up if you like, as while it's interesting, it's not necessarily all that relevant to us merging this PR ;)
That's a good point...we don't have a separate code path that an error would trigger (there are some places where we use the fact that an error was generated to do something else, so to get 100% test coverage we need to trigger an error).
I think it's reasonable to add that test here because it verifies that the changes in this PR do, in fact, fix the user issue, and, in the dataset case, it makes sure that the value actually makes it from `open_dataset()` to its final C++ home. You could also make the argument that we're testing C++ and so we should remove that test...I think either is fine (but `expect_error()` without a regex or class argument is not!).
This looks great! Thanks!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit cd8830b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Expose the ReaderProperties class in R so that the thrift size settings can be altered.
What changes are included in this PR?
Add R6 class, link it up to the C++ class, use it when reading Parquet files.
Are these changes tested?
Yes
Are there any user-facing changes?
Nope
Closes: apache#36990
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
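Putting the pieces above together, end-to-end usage of the exposed reader properties might look roughly like this sketch. The file name is illustrative, and the class/method names are assumptions inferred from the test diffs in this PR:

```r
library(arrow)

# Raise the Thrift string size limit before opening a Parquet file whose
# metadata contains unusually large strings (the user issue this PR fixes).
props <- ParquetReaderProperties$create()  # name assumed from this PR's diffs
props$set_thrift_string_size_limit(100000)

# "metadata_heavy.parquet" is a hypothetical example file
reader <- ParquetFileReader$create("metadata_heavy.parquet", reader_props = props)
tbl <- reader$ReadTable()
head(as.data.frame(tbl))
```

Without the raised limit, opening such a file would fail with the "TProtocolException: Exceeded size limit" error discussed in the review thread above.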