In the first part of the series (see part 1), I showed that for large enough binary blob types, Arrow's Java implementation can deliver 160+ Gbps (about ~10 Gbps/core) on my 16-core machine. That is not perfect scalability from the single-core performance of ~28 Gbps/core, but we leave it at that for now.
In this part 2 of the blog post series, I focus on the performance of a single-column integer schema. In contrast to binary blobs, which can mask many overheads in the implementation, an integer schema cannot mask overheads: it has to materialize an integer out of every 4 bytes.
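To make that per-value cost concrete, here is a toy sketch (not the benchmark code itself) of what materializing integers out of a raw byte buffer looks like; this per-value work is exactly what the binary-blob benchmark in part 1 could amortize away:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IntMaterialize {
    public static void main(String[] args) {
        // A toy 16-byte buffer holding four little-endian ints,
        // the same layout Arrow uses for an int column.
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) buf.putInt(i * 10);
        buf.flip();

        // Every 4 bytes must be turned into a Java int before it can be consumed.
        long sum = 0;
        while (buf.remaining() >= 4) sum += buf.getInt();
        System.out.println(sum); // 0 + 10 + 20 + 30 = 60
    }
}
```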
As usual, the benchmark code is available at https://github.com/animeshtrivedi/benchmarking-arrow. Its README.md might not be up to date yet, so please have a look at the source code when in doubt. This blog post uses commit 9a71596ba5ec186ac4ef449c1437763b394d9dac from October 9, 2018.
- Recap
- Single core integer schema performance
- Core scalability of the performance
- Conclusions
- Next Steps
Let's do a quick recap of the recommendations from part 1:
- Make sure that the NUMA settings are right on your machine. I am going to restrict the single-thread benchmarking to a single core and its NUMA memory domain, using the `numactl` tool.
- Make sure that the JVM has enough young-generation heap and the right GC settings. I will be using `-XX:+UseG1GC` (G1 GC), `-Xmn120G` (young heap size), and `-Xmx256G` (max heap size).
- Use the holder API. The code now takes a parameter `-e`, which you can set to `-e holder`.
- Keep the batch size between ~1-10 MB. This is still a bit of black magic for me, so we choose an Arrow batch size of 1M rows, which for integers (4 bytes) gives us a ~4 MB/thread footprint.
So our benchmarking command becomes:
$numactl -C 0 -m 0 \
java -XX:+UseG1GC -Xmn120G -Xmx256G \
-cp ./4j/:./benchmark-arrow-1.0.jar:./dependency/* com.github.animeshtrivedi.benchmark.Main ...
We use one of the data generators in the benchmark (the others are `binary` and `long` types) to generate integers. To generate and benchmark, we use this command line:
numactl -C 0 -m 0 \
java -XX:+UseG1GC -Xmn200G -Xmx256G \
-cp ./4j/:./benchmark-arrow-1.0.jar:./dependency/* com.github.animeshtrivedi.benchmark.Main \
-t ArrowMemBench -p 1 -r (num_rows/per_thread) -c 1 -g (arrow_batch_size) -n int
We start with a single thread, generating and consuming 1 billion integers with an Arrow batch size of 1 million rows (giving about 4 MB/thread of footprint). Here are our command line and results:
numactl -C 0 -m 0 \
java -XX:+UseG1GC -Xmn200G -Xmx256G \
-cp ./4j/:./benchmark-arrow-1.0.jar:./dependency/* com.github.animeshtrivedi.benchmark.Main \
-t ArrowMemBench -p 1 -r 1000000000 -c 1 -g 1000000 -n int
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73GiB) bandwidth 5.42 Gbps
-----------------------------------------------------------------------
This establishes our baseline performance at 5.42 Gbps.
Let's have a detailed look at what is going on in the profile using perf:
93.86% java perf-9949.map [.] Lcom/github/animeshtrivedi/benchmark/ArrowReader;::consumeInt4
1.51% java libjvm.so [.] acl_CopyRight
1.02% perf [kernel.kallsyms] [k] __mod_node_page_state
[...]
This does not tell us much, because the JIT compiler optimizes away many details and only the top CPU-cycle-consuming function is left. The trick here is to disable inlining; then we will see the profile with all functions. You can do this by setting:
java -XX:InlineSmallCode=0 -XX:MaxInlineSize=0 -XX:FreqInlineSize=0 [...]
Mind that this will slow down performance immensely, but it will give you the right CPU profile. With these settings in place, we do another round of CPU profiling and get:
22.59% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::refCnt
12.08% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::checkIndexD
8.77% java perf-21415.map [.] Lio/netty/buffer/AbstractByteBuf;::ensureAccessible
7.96% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent;::getInt
6.92% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent;::getByte
5.69% java perf-21415.map [.] Lcom/github/animeshtrivedi/benchmark/ArrowHolderReader;::consumeInt4
4.70% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::getByte
4.35% java perf-21415.map [.] Ljava/util/concurrent/atomic/AtomicInteger;::get
4.23% java perf-21415.map [.] Lorg/apache/arrow/vector/IntVector;::get
3.47% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::getInt
3.04% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent0;::getInt
2.76% java perf-21415.map [.] Lorg/apache/arrow/vector/BaseFixedWidthVector;::isSet
2.69% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::capacity
2.19% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent0;::getByte
1.92% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::chk
1.03% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::addr
0.71% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent;::putByte
0.57% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::setByte
0.53% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent0;::putByte
0.44% perf [kernel.kallsyms] [k] __slab_free
[...]
This is more like it. We can clearly see that reference counting and bounds checks are the key contributors. Looking at the Arrow code, we find that the `BoundsChecking` class enables or disables the checks, and this can be controlled by setting the `drill.enable_unsafe_memory_access` flag. Very intuitive, guys! That's exactly what we want: setting a flag for Drill to enable fast Arrow accesses (duh!) *facepalm*. Anyway, let's disable the checks for the Arrow and Netty code. Our Java command line now looks like:
java -XX:+UseG1GC -Xmn200G -Xmx256G \
-Ddrill.enable_unsafe_memory_access=true \
-Dio.netty.buffer.bytebuf.checkAccessible=false
Let's enable the inlining optimizations again and run our benchmark:
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 9.84 Gbps
-----------------------------------------------------------------------
Whoa, a good improvement of almost 80%.
Another round of profiling of the reader that gives us 9.84 Gbps yields:
17.38% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::getInt
13.36% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::addr
13.28% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::getByte
9.61% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent;::getInt
9.41% java perf-10645.map [.] Lcom/github/animeshtrivedi/benchmark/ArrowReaderHolder;::consumeInt4
6.63% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent;::getByte
5.96% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent0;::getByte
4.68% java perf-10645.map [.] Lorg/apache/arrow/vector/BaseFixedWidthVector;::isSet
3.97% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent0;::getInt
3.70% java perf-10645.map [.] Lorg/apache/arrow/vector/IntVector;::get
3.59% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::setByte
1.60% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent;::putByte
1.30% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::chk
0.76% java libjvm.so [.] acl_CopyRight
[...]
The first thing that pops out is how many function calls it takes just to get an integer and a byte (for the bitmap). Second, it looked odd to me that cycles are being spent on set/putByte. Looking at the call graph, the calls come from the `BitVectorHelper.loadValidityBuffer()` function. A closer look at that function shows that it tries to optimize for the cases where all values are either null or set, and generates a corresponding bitmask. In our benchmark all values are indeed set, which is why this shows up in the profile. But I do not get the reasoning behind this logic (Q1): at this point the bitmap from storage is already loaded, so why not just return that?
Later I had a closer look at the `isSet` function in the `BaseFixedWidthVector` class:
public abstract class BaseFixedWidthVector extends BaseValueVector
implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
//...
public int isSet(int index) {
final int byteIndex = index >> 3;
final byte b = validityBuffer.getByte(byteIndex);
final int bitIndex = index & 7;
return Long.bitCount(b & (1L << bitIndex));
}
}
As far as I can tell, it does the bit-index arithmetic and masking (that is fine), but then counts how many bits are set in the resulting long; why (Q2)? If the result is zero, the index bit was not set; otherwise it was. Why do the extra work of counting bits in the result?
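For comparison, a check that skips the bit counting could look like this sketch (mine, not Arrow's code; the `byte[]` bitmap here stands in for the validity buffer):

```java
public class IsSetSketch {
    // Returns 1 if bit `index` is set in the validity bitmap, 0 otherwise.
    // Same semantics as BaseFixedWidthVector.isSet, but a plain shift-and-mask:
    // since at most one bit can survive the mask, counting bits is redundant.
    static int isSet(byte[] validity, int index) {
        final byte b = validity[index >> 3];
        return (b >> (index & 7)) & 1;
    }

    public static void main(String[] args) {
        byte[] validity = { (byte) 0b00000101 }; // bits 0 and 2 set
        System.out.println(isSet(validity, 0)); // 1
        System.out.println(isSet(validity, 1)); // 0
        System.out.println(isSet(validity, 2)); // 1
    }
}
```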
At this point, I was tempted to write my own simple integer column reader to establish the peak performance. I hacked together an Arrow integer reader that does not do much except read and materialize integers as fast as possible. The code is in the `ArrowIntReader` class (it is a bit messy). You can run it by passing `-x`; when I do, I get:
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 14.78 Gbps
-----------------------------------------------------------------------
and the profile looks something like this:
42.48% java perf-28294.map [.] Lcom/github/animeshtrivedi/arrow/ArrowIntReader;::isNull
28.66% java perf-28294.map [.] Lcom/github/animeshtrivedi/arrow/ArrowIntReader;::consumeIntBatchDirectNull
10.64% java perf-28294.map [.] Lcom/github/animeshtrivedi/benchmark/Platform;::getByte
8.30% java perf-28294.map [.] Lcom/github/animeshtrivedi/benchmark/Platform;::getInt
3.29% swapper [kernel.kallsyms] [k] poll_idle
0.78% perf [kernel.kallsyms] [k] unmap_page_range
So the profile looks normal, as one would expect, and the performance looks sensible. One could definitely optimize these four functions further, but we leave that for now.
Based on the observations so far, I decided to write a low-level Arrow reader that is a slightly optimized version of Arrow's implementation. The key optimizations are:
- Instead of calling `getInt` on `IntVector`, my code consumes and materializes values directly from the value and validity buffers. You can get the addresses of these buffers from `getValidityBufferAddress()` and `getValueBufferAddress()`.
- It has my own optimized version of the `isSet` function.
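A minimal sketch of the direct-buffer idea, using `sun.misc.Unsafe` on a toy off-heap buffer. The buffer setup here is a stand-in: the real reader obtains its addresses from `getValueBufferAddress()`/`getValidityBufferAddress()` on the vector.

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class UnsafeIntRead {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton via reflection (the standard trick).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        final int n = 4;
        // Stand-in for the Arrow value buffer: n ints off-heap.
        long valueAddr = unsafe.allocateMemory(n * 4L);
        for (int i = 0; i < n; i++) unsafe.putInt(valueAddr + i * 4L, i * 10);

        // Stand-in for the validity bitmap: one byte covers 8 values.
        long validityAddr = unsafe.allocateMemory(1);
        unsafe.putByte(validityAddr, (byte) 0b00001011); // rows 0, 1, 3 valid

        // Consume: one getByte for the validity bit, one getInt for the value --
        // no per-access refCnt or bounds checks in between.
        long sum = 0;
        for (int i = 0; i < n; i++) {
            int valid = (unsafe.getByte(validityAddr + (i >> 3)) >> (i & 7)) & 1;
            if (valid == 1) sum += unsafe.getInt(valueAddr + i * 4L);
        }
        System.out.println(sum); // 0 + 10 + 30 = 40

        unsafe.freeMemory(valueAddr);
        unsafe.freeMemory(validityAddr);
    }
}
```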
The complete code is in the `ArrowReaderUnsafe` class and can be selected by setting `-e unsafe`. Here are the results when using this code:
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 13.13 Gbps
-----------------------------------------------------------------------
When I additionally disable the bitmap optimization in the Arrow code, the performance bumps up a little more, to 13.61 Gbps. Although this is a small change, it shows the importance of writing efficient code in high-performance environments.
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 13.61 Gbps
-----------------------------------------------------------------------
Recall that my initial (optimized and customized) Arrow reader delivered 14.78 Gbps, and this new generic Arrow reader is at 13.61 Gbps. We are off the mark by less than 10%, but I can live with that; those last 10% would take 90% of the effort.
Having established the best-case performance for a single-column integer schema at 13.61 Gbps, we now look at how the performance scales with the number of cores. Scaling the benchmark across 1, 2, 4, 8, and 16 cores on my system, this is what I get:
Table 1 - core scalability
| Cores | Bandwidth |
|---|---|
| 1 | 13.61 Gbps |
| 2 | 25.40 Gbps |
| 4 | 36.24 Gbps |
| 8 | 59.80 Gbps |
| 16 | 112.91 Gbps |
There are some hiccups in the scaling, but at 16 cores we cross 100 Gbps, which is good. Keep in mind this is still just in-memory Arrow performance; we have not hooked up any networking or storage devices yet.
I have yet to take a detailed look at what stops Arrow from scaling perfectly.
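To quantify those scaling hiccups, we can divide each measured bandwidth from Table 1 by ideal linear scaling of the 13.61 Gbps single-core number:

```java
import java.util.Locale;

public class ScalingEfficiency {
    public static void main(String[] args) {
        int[] cores = { 1, 2, 4, 8, 16 };
        double[] gbps = { 13.61, 25.40, 36.24, 59.80, 112.91 }; // Table 1
        for (int i = 0; i < cores.length; i++) {
            // Parallel efficiency: measured / (cores * single-core bandwidth).
            double efficiency = 100.0 * gbps[i] / (cores[i] * 13.61);
            System.out.printf(Locale.ROOT, "%2d cores: %5.1f%%%n", cores[i], efficiency);
        }
    }
}
```

The efficiency drops from about 93% at two cores to about 52% at sixteen, which is the gap I still need to investigate.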
We managed to increase the performance of basic integer reader from 5.42 Gbps to 13.61 Gbps, a 2.5x improvement.
Recommendations:
- Switch off the validity checks (`-Ddrill.enable_unsafe_memory_access=true -Dio.netty.buffer.bytebuf.checkAccessible=false`) and make sure that assertions are disabled in the JVM (no `-ea` flag).
- Check whether you can write your own low-level reader implementation using the Java Unsafe API; it depends on the data type/schema you are using. See `ArrowReaderUnsafe` for an example.
- Check whether you can optimize the bitmap operations.
- I am going to follow up on the bitmap operations in a discussion on the mailing list
- Have a detailed look at the performance scalability
- Evaluate TCP-DS table performance
- Integrate Crail (with high-performance networking and storage devices)