GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader #38371
Conversation
Great work, @vibhatha!
private CompressionCodec.Factory compressionFactory;
private CompressionUtil.CodecType codecType;
private Optional<Integer> compressionLevel;
Should these be marked private final and grouped with the other private final vars?
+1, way neater that way.
VectorUnloader unloader = new VectorUnloader(dictRoot, /*includeNullCount*/ true,
    this.compressionLevel.isPresent() ?
        this.compressionFactory.createCodec(this.codecType, this.compressionLevel.get()) :
        this.compressionFactory.createCodec(this.codecType),
    /*alignBuffers*/ true);
Maybe add a helper function for creating a codec?
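A helper along these lines could look like the sketch below. `CodecFactory` and `Codec` here are illustrative stand-ins, not Arrow's real `CompressionCodec.Factory` API; only the branching pattern matters: pick the factory overload based on whether an optional level was supplied.

```java
import java.util.Optional;

// Hedged sketch of the suggested helper. CodecFactory/Codec are stand-ins,
// not Arrow's real classes; the point is centralizing the level-present
// ternary in one method instead of repeating it at each call site.
public class CodecHelperSketch {
    interface Codec { String describe(); }

    static class CodecFactory {
        Codec createCodec(String type) { return () -> type; }
        Codec createCodec(String type, int level) { return () -> type + ":" + level; }
    }

    private final CodecFactory compressionFactory = new CodecFactory();
    private final String codecType;
    private final Optional<Integer> compressionLevel;

    CodecHelperSketch(String codecType, Optional<Integer> level) {
        this.codecType = codecType;
        this.compressionLevel = level;
    }

    // The helper: callers no longer inline the level-present ternary.
    Codec getCodec() {
        return compressionLevel.isPresent()
            ? compressionFactory.createCodec(codecType, compressionLevel.get())
            : compressionFactory.createCodec(codecType);
    }

    public static void main(String[] args) {
        System.out.println(new CodecHelperSketch("ZSTD", Optional.of(3)).getCodec().describe());
        System.out.println(new CodecHelperSketch("LZ4", Optional.empty()).getCodec().describe());
    }
}
```

With the helper in place, the VectorUnloader construction above would shrink to a single `getCodec()` call.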
Can use getCodec() here, too!
dictionaryVector1.close();
allocator.close();
}

@Test
public void testArrowFileZstdRoundTrip() throws Exception {
Optional: Somewhat unrelated to the issue, but should we parameterize the tests to use all the different types of compression? See TestCompressionCodec.java as an example.
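The suggested parameterization, sketched with a plain loop to stay self-contained (in JUnit 5 this would be a @ParameterizedTest over a codec source, as TestCompressionCodec.java does). The codec names and the roundTrip stub are illustrative, not Arrow's actual test harness:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: run the same round-trip body once per codec type instead of
// hand-writing one test per codec. The round trip itself is stubbed out.
public class RoundTripMatrix {
    static final List<String> CODECS = Arrays.asList("NO_COMPRESSION", "LZ4_FRAME", "ZSTD");

    // Stand-in for write-then-read-and-compare with the given codec.
    static boolean roundTrip(String codec) {
        return codec != null && !codec.isEmpty();
    }

    public static void main(String[] args) {
        for (String codec : CODECS) {
            if (!roundTrip(codec)) {
                throw new AssertionError("round trip failed for " + codec);
            }
            System.out.println(codec + ": ok");
        }
    }
}
```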
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
provider.put(dictionary1);

final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
I think you can get rid of these in favor of this.allocator if you keep the @Before/@After functions.
Yes, I agree. I refactored the code to use this approach.
@Before
If the @Before/@After functionality isn't used for all @Test methods, it might be better to move this to a helper function instead.
Good point. I will update.
Should we use @BeforeEach so that the allocator and root are reset for each test? It might make the tests slower, but I'm not sure if it's better to have a fresh allocator for each test.
Good point. I think that is better and safer.
This is unaddressed.
Btw @lidavidm, I updated the JUnit annotations for consistency and compatibility. The @BeforeEach annotation from JUnit 5 was being used in conjunction with @Test from JUnit 4, causing the setup method not to run as expected before each test.
Furthermore, do we need to make the usage of JUnit consistent across tests?
Is this change okay?
}

@Test
public void testArrowFileZstdRoundTripWithDictionary() throws Exception {
Do you think we still need the original test function testArrowFileZstdRoundTrip? Is this new test case possibly testing the same code path + dictionaries?
It is the same path + dictionaries. I refactored and reorganized the test cases; does that make sense and seem useful? I think my previous code had duplication.
@danepitkin Thanks a lot for the review comments; I will address them.
Great work! The tests look very clean now. Left a few small additional comments.
if (expectSuccess) {
  Assert.assertTrue(reader.loadNextBatch());
  Assert.assertTrue(root.equals(reader.getVectorSchemaRoot()));
  Assert.assertFalse(reader.loadNextBatch());
} else {
  Exception exception = Assert.assertThrows(IllegalArgumentException.class, reader::loadNextBatch);
  Assert.assertEquals(expectedErrorMessage, exception.getMessage());
optional nit: I think the readArrowFile() and readArrowStream() functions are not needed. While they reduce code duplication, I think they also decrease code readability. Personally, I like to quickly see what is asserted in the test functions themselves. I do like how you've separated out other functionality into their own functions like createAndWriteArrowFile().
I had doubts after the cleanup 😄 Let me update the PR.
@danepitkin I am not sure if this one is practical though 🤔
I'm okay with leaving it as-is if there are issues with using it.

Overall, LGTM! I think it's ready for final review from a committer. Excellent job.
@lidavidm appreciate your feedback.
@@ -174,6 +178,12 @@ public long bytesWritten() {
  return out.getCurrentPosition();
}

private CompressionCodec getCodec() {
It seems we should be able to initialize the codec once and reuse it in the constructor, rather than add all the new fields?
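A rough sketch of that suggestion, with illustrative names rather than Arrow's actual writer API: resolve the codec once in the constructor so the writer stores a single codec field instead of factory, codec type, and level.

```java
import java.util.Optional;

// Sketch only: Codec is a stand-in, not Arrow's CompressionCodec. The idea
// is to decide the factory overload once, up front, in the constructor.
public class EagerCodecWriter {
    interface Codec { String name(); }

    private final Codec codec; // the only compression-related field

    public EagerCodecWriter(String codecType, Optional<Integer> level) {
        this.codec = level.isPresent()
            ? () -> codecType + "(level=" + level.get() + ")"
            : () -> codecType;
    }

    public Codec codec() { return codec; }

    public static void main(String[] args) {
        System.out.println(new EagerCodecWriter("ZSTD", Optional.of(3)).codec().name());
    }
}
```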
Again: why did we pull this out? It's called only once. Why are we adding a bunch of new fields? We don't use them.
@lidavidm I updated the PR. Appreciate another round of reviews.
@vibhatha can you file a follow-up to add a new integration test to cover this scenario?
Map<String, String> metaData, IpcOption option, CompressionCodec.Factory compressionFactory,
    CompressionUtil.CodecType codecType, Optional<Integer> compressionLevel) {
  super(root, provider, out, option, compressionFactory, codecType, compressionLevel);
Map<String, String> metaData, IpcOption option, CompressionCodec codec) {
Isn't this removing a public constructor?
You can delegate constructors without removing public ones (which breaks API).
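Constructor delegation that preserves the old public signature could be sketched like this; the names are hypothetical, not the real ArrowWriter signature:

```java
import java.util.Optional;

// Sketch of the compatibility pattern: keep the old public constructor and
// have it build the codec and delegate to the new codec-based one, so
// callers compiled against the old API keep working.
public class WriterCompat {
    private final String codec;

    // New, preferred constructor taking an already-created codec.
    public WriterCompat(String codec) {
        this.codec = codec;
    }

    // Old public constructor retained for API compatibility; delegates.
    public WriterCompat(String codecType, Optional<Integer> level) {
        this(level.isPresent() ? codecType + "(level=" + level.get() + ")" : codecType);
    }

    public String codec() { return codec; }

    public static void main(String[] args) {
        System.out.println(new WriterCompat("ZSTD", Optional.of(7)).codec());
    }
}
```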
@lidavidm does the updated change make sense?
@github-actions crossbow submit java
Revision: 907195a
Submitted crossbow builds: ursacomputing/crossbow @ actions-27dab1a4d2
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f9b7ac2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
…factory from the ArrowReader (apache#38371)
Closes: apache#37841
Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
This PR addresses #37841.
What changes are included in this PR?
Adding compression-based write and read for Dictionary data.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.