In the first part of the series (see part 1), I showed that for large enough binary blob types, Arrow's Java implementation can deliver 160+ Gbps (about ~10 Gbps/core) on my 16-core machine. That is not perfect scalability from the single-core performance of ~28 Gbps/core, but we leave it at that for now.
In this part 2 of the blog post series, I focus on the performance of a single-column integer schema. In contrast to binary blobs, which can mask many overheads in the implementation, an integer schema cannot mask overheads: it has to materialize an integer out of every 4 bytes.
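To make that per-value cost concrete, here is a toy sketch (not the benchmark code itself) of what materializing integers out of a raw byte buffer looks like; this per-value work is exactly what the binary-blob benchmark in part 1 could amortize away:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IntMaterialize {
    public static void main(String[] args) {
        // A toy 16-byte buffer holding four little-endian ints,
        // the same layout Arrow uses for an int column.
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) buf.putInt(i * 10);
        buf.flip();

        // Every 4 bytes must be turned into a Java int before it can be consumed.
        long sum = 0;
        while (buf.remaining() >= 4) sum += buf.getInt();
        System.out.println(sum); // 0 + 10 + 20 + 30 = 60
    }
}
```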
As usual, the benchmark code is available at https://github.com/animeshtrivedi/benchmarking-arrow. Its README.md might not be up to date yet, so please have a look at the source code when in doubt. This blog post uses commit 9a71596ba5ec186ac4ef449c1437763b394d9dac from October 9, 2018.
- Recap
- Single core integer schema performance
- Core scalability of the performance
- Conclusions
- Next Steps
Let's do a quick recap of the recommendations from part 1:
- Make sure that the NUMA settings are right on your machine. I am going to restrict the single-thread benchmarking to a single core and its NUMA memory domain, using the `numactl` tool.
- Make sure that the JVM has enough young-generation heap and the right GC settings. I will be using `-XX:+UseG1GC` (G1 GC), `-Xmn120G` (young heap size), and `-Xmx256G` (max heap size).
- Use the holder API. The code now takes a parameter `-e`, which you can set to `-e holder`.
- Keep the batch size between ~1-10 MB. This is still a bit of black magic for me, so we choose an Arrow batch size of 1M rows, which for integers (4 bytes) gives us a ~4 MB/thread footprint.
So our benchmarking command becomes:
$numactl -C 0 -m 0 \
java -XX:+UseG1GC -Xmn120G -Xmx256G \
-cp ./4j/:./benchmark-arrow-1.0.jar:./dependency/* com.github.animeshtrivedi.benchmark.Main ...
We use one of the data generators in the benchmark (the others are `binary` and `long` types) to generate integers. To generate and benchmark, we use this command line:
numactl -C 0 -m 0 \
java -XX:+UseG1GC -Xmn200G -Xmx256G \
-cp ./4j/:./benchmark-arrow-1.0.jar:./dependency/* com.github.animeshtrivedi.benchmark.Main \
-t ArrowMemBench -p 1 -r (num_rows/per_thread) -c 1 -g (arrow_batch_size) -n int
We start with a single thread, generating and consuming 1 billion integers with an Arrow batch size of 1 million rows (giving about 4 MB/thread of footprint). Here are our command line and results:
numactl -C 0 -m 0 \
java -XX:+UseG1GC -Xmn200G -Xmx256G \
-cp ./4j/:./benchmark-arrow-1.0.jar:./dependency/* com.github.animeshtrivedi.benchmark.Main \
-t ArrowMemBench -p 1 -r 1000000000 -c 1 -g 1000000 -n int
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73GiB) bandwidth 5.42 Gbps
-----------------------------------------------------------------------
This establishes our baseline performance at 5.42 Gbps.
Let's have a detailed look at what is going on in the profile using perf:
93.86% java perf-9949.map [.] Lcom/github/animeshtrivedi/benchmark/ArrowReader;::consumeInt4
1.51% java libjvm.so [.] acl_CopyRight
1.02% perf [kernel.kallsyms] [k] __mod_node_page_state
[...]
This does not tell us much, because the JIT compiler optimizes away many details and only the top CPU-cycle-consuming function is left. The trick here is to disable inlining; then we will see the profile with all functions. You can do this by setting:
java -XX:InlineSmallCode=0 -XX:MaxInlineSize=0 -XX:FreqInlineSize=0 [...]
Mind that this will slow down performance immensely, but it will give you the right CPU profile. With these settings in place, we do another round of CPU profiling and get:
22.59% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::refCnt
12.08% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::checkIndexD
8.77% java perf-21415.map [.] Lio/netty/buffer/AbstractByteBuf;::ensureAccessible
7.96% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent;::getInt
6.92% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent;::getByte
5.69% java perf-21415.map [.] Lcom/github/animeshtrivedi/benchmark/ArrowHolderReader;::consumeInt4
4.70% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::getByte
4.35% java perf-21415.map [.] Ljava/util/concurrent/atomic/AtomicInteger;::get
4.23% java perf-21415.map [.] Lorg/apache/arrow/vector/IntVector;::get
3.47% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::getInt
3.04% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent0;::getInt
2.76% java perf-21415.map [.] Lorg/apache/arrow/vector/BaseFixedWidthVector;::isSet
2.69% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::capacity
2.19% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent0;::getByte
1.92% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::chk
1.03% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::addr
0.71% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent;::putByte
0.57% java perf-21415.map [.] Lio/netty/buffer/ArrowBuf;::setByte
0.53% java perf-21415.map [.] Lio/netty/util/internal/PlatformDependent0;::putByte
0.44% perf [kernel.kallsyms] [k] __slab_free
[...]
This is more like it. We can clearly see that reference counting and bounds checks are the key contributors. Looking at the Arrow code, we find that the `BoundsChecking` class enables or disables the checks, and this can be controlled by setting the `drill.enable_unsafe_memory_access` flag. Very intuitive, guys! That's exactly what we want: setting a flag for Drill to enable fast Arrow accesses (duh!) *facepalm*. Anyway, let's disable the checks for the Arrow and Netty code. Our Java command line now looks like:
java -XX:+UseG1GC -Xmn200G -Xmx256G \
-Ddrill.enable_unsafe_memory_access=true \
-Dio.netty.buffer.bytebuf.checkAccessible=false
Let's enable the inlining optimizations again and run our benchmark:
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 9.84 Gbps
-----------------------------------------------------------------------
Whoa, a good improvement of almost 80%.
Another round of profiling of the reader that gives us 9.84 Gbps yields:
17.38% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::getInt
13.36% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::addr
13.28% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::getByte
9.61% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent;::getInt
9.41% java perf-10645.map [.] Lcom/github/animeshtrivedi/benchmark/ArrowReaderHolder;::consumeInt4
6.63% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent;::getByte
5.96% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent0;::getByte
4.68% java perf-10645.map [.] Lorg/apache/arrow/vector/BaseFixedWidthVector;::isSet
3.97% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent0;::getInt
3.70% java perf-10645.map [.] Lorg/apache/arrow/vector/IntVector;::get
3.59% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::setByte
1.60% java perf-10645.map [.] Lio/netty/util/internal/PlatformDependent;::putByte
1.30% java perf-10645.map [.] Lio/netty/buffer/ArrowBuf;::chk
0.76% java libjvm.so [.] acl_CopyRight
[...]
The first thing that pops out is how many function calls it takes just to get an integer and a byte (for the bitmap). Second, it looked odd to me that cycles are being spent on set/putByte. Looking at the call graph, the calls come from the `BitVectorHelper.loadValidityBuffer()` function. A closer look at that function shows that it tries to optimize for the cases where all values are either null or set, and generates a corresponding bitmask. In our benchmark all values are indeed set, which is why this shows up in the profile. But I do not get the reasoning behind this logic (Q1): at this point the bitmap from storage is already loaded, so why not just return that?
Later I had a closer look at the `isSet` function in the `BaseFixedWidthVector` class:
public abstract class BaseFixedWidthVector extends BaseValueVector
implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
//...
public int isSet(int index) {
final int byteIndex = index >> 3;
final byte b = validityBuffer.getByte(byteIndex);
final int bitIndex = index & 7;
return Long.bitCount(b & (1L << bitIndex));
}
}
As far as I can tell, it does the bit-index arithmetic and masking (that is fine), but then counts how many bits are set in the resulting long; why (Q2)? If the result is zero, the index bit was not set; otherwise it was. Why do the extra work of counting bits in the result?
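For comparison, a check that skips the bit counting could look like this sketch (mine, not Arrow's code; the `byte[]` bitmap here stands in for the validity buffer):

```java
public class IsSetSketch {
    // Returns 1 if bit `index` is set in the validity bitmap, 0 otherwise.
    // Same semantics as BaseFixedWidthVector.isSet, but a plain shift-and-mask:
    // since at most one bit can survive the mask, counting bits is redundant.
    static int isSet(byte[] validity, int index) {
        final byte b = validity[index >> 3];
        return (b >> (index & 7)) & 1;
    }

    public static void main(String[] args) {
        byte[] validity = { (byte) 0b00000101 }; // bits 0 and 2 set
        System.out.println(isSet(validity, 0)); // 1
        System.out.println(isSet(validity, 1)); // 0
        System.out.println(isSet(validity, 2)); // 1
    }
}
```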
At this point, I was tempted to write my own simple integer column reader to establish the peak performance. I hacked together an Arrow integer reader that does not do much except read and materialize integers as fast as possible. The code is in the `ArrowIntReader` class (it is a bit messy). You can run it by passing `-x`; when I do, I get:
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 14.78 Gbps
-----------------------------------------------------------------------
and the profile looks something like this:
42.48% java perf-28294.map [.] Lcom/github/animeshtrivedi/arrow/ArrowIntReader;::isNull
28.66% java perf-28294.map [.] Lcom/github/animeshtrivedi/arrow/ArrowIntReader;::consumeIntBatchDirectNull
10.64% java perf-28294.map [.] Lcom/github/animeshtrivedi/benchmark/Platform;::getByte
8.30% java perf-28294.map [.] Lcom/github/animeshtrivedi/benchmark/Platform;::getInt
3.29% swapper [kernel.kallsyms] [k] poll_idle
0.78% perf [kernel.kallsyms] [k] unmap_page_range
So the profile looks normal, as one would expect, and the performance looks sensible. One could definitely optimize these four functions further, but we leave that for now.
Based on the observations so far, I decided to write a low-level Arrow reader that is a slightly optimized version of Arrow's implementation. The key optimizations are:
- Instead of calling `getInt` on `IntVector`, my code consumes and materializes values directly from the value and validity buffers. You can get the addresses of these buffers from `getValidityBufferAddress()` and `getValueBufferAddress()`.
- It has my own optimized version of the `isSet` function.
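A minimal sketch of the direct-buffer idea, using `sun.misc.Unsafe` on a toy off-heap buffer. The buffer setup here is a stand-in: the real reader obtains its addresses from `getValueBufferAddress()`/`getValidityBufferAddress()` on the vector.

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class UnsafeIntRead {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton via reflection (the standard trick).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        final int n = 4;
        // Stand-in for the Arrow value buffer: n ints off-heap.
        long valueAddr = unsafe.allocateMemory(n * 4L);
        for (int i = 0; i < n; i++) unsafe.putInt(valueAddr + i * 4L, i * 10);

        // Stand-in for the validity bitmap: one byte covers 8 values.
        long validityAddr = unsafe.allocateMemory(1);
        unsafe.putByte(validityAddr, (byte) 0b00001011); // rows 0, 1, 3 valid

        // Consume: one getByte for the validity bit, one getInt for the value --
        // no per-access refCnt or bounds checks in between.
        long sum = 0;
        for (int i = 0; i < n; i++) {
            int valid = (unsafe.getByte(validityAddr + (i >> 3)) >> (i & 7)) & 1;
            if (valid == 1) sum += unsafe.getInt(valueAddr + i * 4L);
        }
        System.out.println(sum); // 0 + 10 + 30 = 40

        unsafe.freeMemory(valueAddr);
        unsafe.freeMemory(validityAddr);
    }
}
```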
The complete code is in the `ArrowReaderUnsafe` class and can be selected by setting `-e unsafe`. Here are the results when using this code:
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 13.13 Gbps
-----------------------------------------------------------------------
When I additionally disable the bitmap optimization in the Arrow code, the performance bumps up a little more, to 13.61 Gbps. Although this is a small change, it shows the importance of writing efficient code in high-performance environments.
[...]
-----------------------------------------------------------------------
Total bytes: 4000000000(3.73 GiB) bandwidth 13.61 Gbps
-----------------------------------------------------------------------
Recall that my initial (optimized and customized) Arrow reader delivered 14.78 Gbps, and this new generic Arrow reader is at 13.61 Gbps. We are off the mark by less than 10%, but I can live with that; those last 10% would take 90% of the effort.
Having established the best-case performance for a single-column integer schema at 13.61 Gbps, we now look at how the performance scales with the number of cores. Scaling the benchmark across 1, 2, 4, 8, and 16 cores on my system, this is what I get:
Table 1 - core scalability
| Cores | Bandwidth |
|---|---|
| 1 | 13.61 Gbps |
| 2 | 25.40 Gbps |
| 4 | 36.24 Gbps |
| 8 | 59.80 Gbps |
| 16 | 112.91 Gbps |
There are some hiccups in the scaling, but at 16 cores we cross 100 Gbps, which is good. Keep in mind this is still just in-memory Arrow performance; we have not hooked up any networking or storage devices yet.
I have yet to take a detailed look at what stops Arrow from scaling perfectly.
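To quantify those scaling hiccups, we can divide each measured bandwidth from Table 1 by ideal linear scaling of the 13.61 Gbps single-core number:

```java
import java.util.Locale;

public class ScalingEfficiency {
    public static void main(String[] args) {
        int[] cores = { 1, 2, 4, 8, 16 };
        double[] gbps = { 13.61, 25.40, 36.24, 59.80, 112.91 }; // Table 1
        for (int i = 0; i < cores.length; i++) {
            // Parallel efficiency: measured / (cores * single-core bandwidth).
            double efficiency = 100.0 * gbps[i] / (cores[i] * 13.61);
            System.out.printf(Locale.ROOT, "%2d cores: %5.1f%%%n", cores[i], efficiency);
        }
    }
}
```

The efficiency drops from about 93% at two cores to about 52% at sixteen, which is the gap I still need to investigate.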
We managed to increase the performance of basic integer reader from 5.42 Gbps to 13.61 Gbps, a 2.5x improvement.
Recommendations:
- Switch off the validity checks (`-Ddrill.enable_unsafe_memory_access=true -Dio.netty.buffer.bytebuf.checkAccessible=false`) and make sure that assertions are disabled in the JVM (no `-ea` flag).
- Check whether you can write your own low-level reader implementation using the Java Unsafe API; it depends on the data type/schema you are using. See `ArrowReaderUnsafe` for an example.
- Check whether you can optimize the bitmap operations.
- I am going to follow up on the bitmap operations in a discussion on the mailing list
- Have a detailed look at the performance scalability
- Evaluate TCP-DS table performance
- Integrate Crail (with high-performance networking and storage devices)