Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Excessive resizing in VectorAppender leads to avoidable OversizedAllocationException #37829

Closed
hrishisd opened this issue Sep 22, 2023 · 1 comment · Fixed by #37844
Closed

Comments

@hrishisd
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

Arrow version

Built from main branch as of 9/21/2023

Problem description

When appending two variable length vectors, VectorAppender repeatedly resizes the validity and offset buffers of the target vector until they can hold the combined elements. While doing so, it also resizes the data buffer which can cause the data buffer to exceed the max allocation limit when we append a large number of small elements to a vector with a single large element.

// Body of VectorAppender::visit

// make sure there is enough capacity
while (targetVector.getValueCapacity() < newValueCount) {
  targetVector.reAlloc(); // should only realloc validity and offset buffers.
}
while (targetVector.getDataBuffer().capacity() < newValueCapacity) {
  ((BaseVariableWidthVector) targetVector).reallocDataBuffer();
}

Steps to reproduce

The error can be reproduced using the snippet below.

@Test
public void testResizingBug() {
  var allocator = new RootAllocator();
  System.err.println("max allocation size: " + BaseValueVector.MAX_ALLOCATION_SIZE);
  // arrow.vector.max_allocation_bytes is set to 1048576 (1 MiB)
  // create a vector with a single 256 KiB string
  VarCharVector target = makeVec(1, 256 * 1024, allocator);
  // create a vector with a total of 1 KiB
  VarCharVector delta = makeVec(1024, 1, allocator);
  // we should be able to fit all the strings into a single vector using less than 1 MiB.
  // this works
  new VectorAppender(delta).visit(target, null);
  // this fails
  new VectorAppender(target).visit(delta, null);
}

private static VarCharVector makeVec(int nElements, int bytesPerElement, BufferAllocator allocator) {
  var v = new VarCharVector("test", allocator);
  v.allocateNew(nElements);
  for (int i = 0; i < nElements; i++) {
    v.setSafe(i, "A".repeat(bytesPerElement).getBytes(StandardCharsets.UTF_8));
  }
  v.setValueCount(nElements);
  return v;
}

The example produces the following error

org.apache.arrow.vector.util.OversizedAllocationException: Memory required for vector is (2097152), which is overflow or more than max allowed (1048576). You could consider using LargeVarCharVector/LargeVarBinaryVector for large strings/large bytes types

	at org.apache.arrow.vector.BaseVariableWidthVector.checkDataBufferSize(BaseVariableWidthVector.java:435)
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:542)
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:520)
	at org.apache.arrow.vector.BaseVariableWidthVector.reAlloc(BaseVariableWidthVector.java:497)
	at org.apache.arrow.vector.util.VectorAppender.visit(VectorAppender.java:119)
	at org.apache.arrow.vector.util.TestVectorSchemaRootAppender.testResizingBug(TestVectorSchemaRootAppender.java:68)

It looks like the issue is also present when appending other variable-length vector types.


I'm happy to post a PR if this looks reasonable.

Component(s)

Java

@lidavidm
Copy link
Member

Is the proposed change to make reAlloc not reallocate the data buffer? That seems reasonable, but I worry about it possibly being a backwards compatibility problem. A PR would be welcome and we can discuss there.

hrishisd pushed a commit to hrishisd/arrow that referenced this issue Sep 24, 2023
hrishisd pushed a commit to hrishisd/arrow that referenced this issue Sep 24, 2023
lidavidm pushed a commit that referenced this issue Sep 25, 2023
…able length vectors (#37844)

### Rationale for this change
This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. 

When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. 

### What changes are included in this PR?
The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements.

The data buffer is resized based on the total required data size of the combined elements.

### Are these changes tested?
Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. 

### Are there any user-facing changes?
No.

* Closes: #37829

Authored-by: hrishisd <hdharam@berkeley.edu>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm lidavidm added this to the 14.0.0 milestone Sep 25, 2023
etseidl pushed a commit to etseidl/arrow that referenced this issue Sep 28, 2023
…g variable length vectors (apache#37844)

### Rationale for this change
This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. 

When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. 

### What changes are included in this PR?
The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements.

The data buffer is resized based on the total required data size of the combined elements.

### Are these changes tested?
Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. 

### Are there any user-facing changes?
No.

* Closes: apache#37829

Authored-by: hrishisd <hdharam@berkeley.edu>
Signed-off-by: David Li <li.davidm96@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…g variable length vectors (apache#37844)

### Rationale for this change
This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. 

When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. 

### What changes are included in this PR?
The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements.

The data buffer is resized based on the total required data size of the combined elements.

### Are these changes tested?
Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. 

### Are there any user-facing changes?
No.

* Closes: apache#37829

Authored-by: hrishisd <hdharam@berkeley.edu>
Signed-off-by: David Li <li.davidm96@gmail.com>
dongjoon-hyun pushed a commit to apache/spark that referenced this issue Nov 4, 2023
### What changes were proposed in this pull request?
This pr upgrade Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.
‎
In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars([GH-38076](apache/arrow#38076)).
‎
The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class([GH-37246](apache/arrow#37246)).
‎
The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages([GH-37702](apache/arrow#37702)).
‎
Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String`([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data([GH-37863](apache/arrow#37863)).

The full release notes as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…g variable length vectors (apache#37844)

### Rationale for this change
This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. 

When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. 

### What changes are included in this PR?
The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements.

The data buffer is resized based on the total required data size of the combined elements.

### Are these changes tested?
Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. 

### Are there any user-facing changes?
No.

* Closes: apache#37829

Authored-by: hrishisd <hdharam@berkeley.edu>
Signed-off-by: David Li <li.davidm96@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…g variable length vectors (apache#37844)

### Rationale for this change
This change prevents avoidable `OversizedAllocationException`s when appending a variable-length vector with many small elements to a variable-length vector with a few large elements. 

When appending variable-length vectors, `VectorAppender` iteratively doubles the offset and validity buffers until they can accommodate the combined elements. In the previous implementation, each iteration would also double the data buffer's capacity. This behavior is appropriate for vectors of fixed-size types but can result in an oversized data buffers when appending many small elements to a variable length vector with a large data buffer. 

### What changes are included in this PR?
The new behavior only resizes the offset and validity buffers when resizing the target vector's buffers to ensure they can hold the total number of combined elements.

The data buffer is resized based on the total required data size of the combined elements.

### Are these changes tested?
Yes. I added a unit test that results in an `OversizedAllocationException` when run against the previous version of the code. 

### Are there any user-facing changes?
No.

* Closes: apache#37829

Authored-by: hrishisd <hdharam@berkeley.edu>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants