GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader #38371
Conversation
Great work, @vibhatha!
private CompressionCodec.Factory compressionFactory;
private CompressionUtil.CodecType codecType;
private Optional<Integer> compressionLevel;
Should these be marked private final and grouped with the other private final vars?
+1, way neater that way.
VectorUnloader unloader = new VectorUnloader(dictRoot, /*includeNullCount*/ true,
    this.compressionLevel.isPresent() ?
        this.compressionFactory.createCodec(this.codecType, this.compressionLevel.get()) :
        this.compressionFactory.createCodec(this.codecType),
    /*alignBuffers*/ true);
Maybe add a helper function for creating a codec?
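A helper along these lines could look like the sketch below. `CodecFactory` and `Codec` here are illustrative stand-ins, not Arrow's real `CompressionCodec.Factory` API; only the branching pattern matters: pick the factory overload based on whether an optional level was supplied.

```java
import java.util.Optional;

// Hedged sketch of the suggested helper. CodecFactory/Codec are stand-ins,
// not Arrow's real classes; the point is centralizing the level-present
// ternary in one method instead of repeating it at each call site.
public class CodecHelperSketch {
    interface Codec { String describe(); }

    static class CodecFactory {
        Codec createCodec(String type) { return () -> type; }
        Codec createCodec(String type, int level) { return () -> type + ":" + level; }
    }

    private final CodecFactory compressionFactory = new CodecFactory();
    private final String codecType;
    private final Optional<Integer> compressionLevel;

    CodecHelperSketch(String codecType, Optional<Integer> level) {
        this.codecType = codecType;
        this.compressionLevel = level;
    }

    // The helper: callers no longer inline the level-present ternary.
    Codec getCodec() {
        return compressionLevel.isPresent()
            ? compressionFactory.createCodec(codecType, compressionLevel.get())
            : compressionFactory.createCodec(codecType);
    }

    public static void main(String[] args) {
        System.out.println(new CodecHelperSketch("ZSTD", Optional.of(3)).getCodec().describe());
        System.out.println(new CodecHelperSketch("LZ4", Optional.empty()).getCodec().describe());
    }
}
```

With the helper in place, the VectorUnloader construction above would shrink to a single `getCodec()` call.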
Can use getCodec() here, too!
dictionaryVector1.close();
allocator.close();
}

@Test
public void testArrowFileZstdRoundTrip() throws Exception {
Optional: Somewhat unrelated to the issue, but should we parameterize the tests to use all the different types of compression? See TestCompressionCodec.java as an example.
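The suggested parameterization, sketched with a plain loop to stay self-contained (in JUnit 5 this would be a @ParameterizedTest over a codec source, as TestCompressionCodec.java does). The codec names and the roundTrip stub are illustrative, not Arrow's actual test harness:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: run the same round-trip body once per codec type instead of
// hand-writing one test per codec. The round trip itself is stubbed out.
public class RoundTripMatrix {
    static final List<String> CODECS = Arrays.asList("NO_COMPRESSION", "LZ4_FRAME", "ZSTD");

    // Stand-in for write-then-read-and-compare with the given codec.
    static boolean roundTrip(String codec) {
        return codec != null && !codec.isEmpty();
    }

    public static void main(String[] args) {
        for (String codec : CODECS) {
            if (!roundTrip(codec)) {
                throw new AssertionError("round trip failed for " + codec);
            }
            System.out.println(codec + ": ok");
        }
    }
}
```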
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
provider.put(dictionary1);

final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
I think you can get rid of these in favor of this.allocator if you keep the @Before/@After functions.
Yes, I agree. I refactored the code to use this approach.
@Before
If the @Before/@After functionality isn't used for all @Test methods, it might be better to move this to a helper function instead.
Good point. I will update.
Should we use @BeforeEach so that the allocator and root are reset for each test? It might make the tests slower, but I'm not sure if it's better to have a fresh allocator for each test.
Good point. I think that is better and safer.
This is unaddressed.
Btw @lidavidm, I updated the JUnit annotations for consistency and compatibility. The @BeforeEach annotation from JUnit 5 was being used in conjunction with @Test from JUnit 4, causing the setup method not to run as expected before each test.
Furthermore, do we need to make the usage of JUnit consistent across tests?
Is this change okay?
}

@Test
public void testArrowFileZstdRoundTripWithDictionary() throws Exception {
Do you think we still need the original test function testArrowFileZstdRoundTrip? Is this new test case possibly testing the same code path + dictionaries?
It is the same path + dictionaries. I refactored and reorganized the test cases; does that make sense and seem useful? I think my previous code had duplication.
@danepitkin Thanks a lot for the review comments; I will address them.
Great work! The tests look very clean now. Left a few small additional comments.
if (expectSuccess) {
  Assert.assertTrue(reader.loadNextBatch());
  Assert.assertTrue(root.equals(reader.getVectorSchemaRoot()));
  Assert.assertFalse(reader.loadNextBatch());
} else {
  Exception exception = Assert.assertThrows(IllegalArgumentException.class, reader::loadNextBatch);
  Assert.assertEquals(expectedErrorMessage, exception.getMessage());
optional nit: I think the readArrowFile() and readArrowStream() functions are not needed. While they reduce code duplication, I think they also decrease code readability. Personally, I like to quickly see what is asserted in the test functions themselves. I do like how you've separated out other functionality into their own functions like createAndWriteArrowFile().
I had doubts after the cleanup 😄 Let me update the PR.
@danepitkin I am not sure if this one is practical though 🤔
I'm okay with leaving it as-is if there are issues with using it.

Overall, LGTM! I think it's ready for final review from a committer. Excellent job.
@lidavidm appreciate your feedback.
@@ -174,6 +178,12 @@ public long bytesWritten() {
  return out.getCurrentPosition();
}

private CompressionCodec getCodec() {
It seems we should be able to initialize the codec once and reuse it in the constructor, rather than add all the new fields?
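A rough sketch of that suggestion, with illustrative names rather than Arrow's actual writer API: resolve the codec once in the constructor so the writer stores a single codec field instead of factory, codec type, and level.

```java
import java.util.Optional;

// Sketch only: Codec is a stand-in, not Arrow's CompressionCodec. The idea
// is to decide the factory overload once, up front, in the constructor.
public class EagerCodecWriter {
    interface Codec { String name(); }

    private final Codec codec; // the only compression-related field

    public EagerCodecWriter(String codecType, Optional<Integer> level) {
        this.codec = level.isPresent()
            ? () -> codecType + "(level=" + level.get() + ")"
            : () -> codecType;
    }

    public Codec codec() { return codec; }

    public static void main(String[] args) {
        System.out.println(new EagerCodecWriter("ZSTD", Optional.of(3)).codec().name());
    }
}
```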
Again: why did we pull this out? It's called only once. Why are we adding a bunch of new fields? We don't use them.
@lidavidm I updated the PR. Appreciate another round of reviews.
@vibhatha can you file a follow-up to add a new integration test to cover this scenario?
Map<String, String> metaData, IpcOption option, CompressionCodec.Factory compressionFactory,
    CompressionUtil.CodecType codecType, Optional<Integer> compressionLevel) {
  super(root, provider, out, option, compressionFactory, codecType, compressionLevel);
Map<String, String> metaData, IpcOption option, CompressionCodec codec) {
Isn't this removing a public constructor?
You can delegate constructors without removing public ones (which breaks API).
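Constructor delegation that preserves the old public signature could be sketched like this; the names are hypothetical, not the real ArrowWriter signature:

```java
import java.util.Optional;

// Sketch of the compatibility pattern: keep the old public constructor and
// have it build the codec and delegate to the new codec-based one, so
// callers compiled against the old API keep working.
public class WriterCompat {
    private final String codec;

    // New, preferred constructor taking an already-created codec.
    public WriterCompat(String codec) {
        this.codec = codec;
    }

    // Old public constructor retained for API compatibility; delegates.
    public WriterCompat(String codecType, Optional<Integer> level) {
        this(level.isPresent() ? codecType + "(level=" + level.get() + ")" : codecType);
    }

    public String codec() { return codec; }

    public static void main(String[] args) {
        System.out.println(new WriterCompat("ZSTD", Optional.of(7)).codec());
    }
}
```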
@lidavidm does the updated change make sense?
@github-actions crossbow submit java
Revision: 907195a
Submitted crossbow builds: ursacomputing/crossbow @ actions-27dab1a4d2
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f9b7ac2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
…factory from the ArrowReader (apache#38371)
Closes: apache#37841
Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
This PR addresses #37841.
What changes are included in this PR?
Adding compression-based write and read for Dictionary data.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.