Down the Rabbit Hole
This is where we give away the recipe to the secret sauce. When you come in with benchmarks like ours, there is a certain amount of skepticism that must be addressed. If you think of performance, and of connection pools, you might be tempted into thinking that the pool is the most important part of the performance equation. Not so, at least not clearly so. The number of `getConnection()` operations is small in comparison to other JDBC operations. A large amount of the performance gain comes from optimizing the "delegates" that wrap `Connection`, `Statement`, etc.
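To make the term concrete, here is a minimal, illustrative sketch of the wrapping pattern in question. The class name `DelegatingConnection` is hypothetical; HikariCP's real delegates (`ProxyConnection` and friends) are generated and considerably more involved.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative only: every JDBC call the application makes passes through a
// wrapper like this, which is why the wrapper's own overhead matters so much.
public abstract class DelegatingConnection implements Connection {
    protected final Connection delegate;   // the real driver Connection

    protected DelegatingConnection(Connection delegate) {
        this.delegate = delegate;
    }

    @Override
    public PreparedStatement prepareStatement(String sql) throws SQLException {
        // a pool would also track the returned statement here
        return delegate.prepareStatement(sql);
    }

    // ...the remaining Connection methods delegate in the same way
}
```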
In order to make HikariCP as fast as it is, we went down to bytecode-level engineering, and beyond. We pulled out every trick we know to help the JIT help you. We studied the bytecode output of the compiler, and even the assembly output of the JIT, to limit key routines to less than the JIT inline-threshold. We flattened inheritance hierarchies, shadowed member variables, and eliminated casts.
HikariCP contains many micro-optimizations that individually are barely measurable, but together combine as a boost to overall performance. Some of these optimizations are measured in fractions of a millisecond amortized over millions of invocations.
One non-trivial (performance-wise) optimization was eliminating the use of an `ArrayList<Statement>` instance in the `ProxyConnection` used to track open `Statement` instances. When a `Statement` is closed, it must be removed from this collection, and when the `Connection` is closed it must iterate the collection, close any open `Statement` instances, and finally clear the collection. The Java `ArrayList`, wisely for general-purpose use, performs a range check upon every `get(int index)` call. However, because we can provide guarantees about our ranges, this check is merely overhead.

Additionally, the `remove(Object)` implementation performs a scan from head to tail; however, common patterns in JDBC programming are to close Statements immediately after use, or in reverse order of opening. For these cases, a scan that starts at the tail will perform better. Therefore, `ArrayList<Statement>` was replaced with a custom class `FastList` which eliminates range checking and performs removal scans from tail to head.
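The sketch below is not HikariCP's actual `FastList`; it is a simplified, hypothetical `SimpleFastList` meant only to illustrate the two ideas: an unchecked `get()` and a `remove()` that scans from tail to head, under the assumption that the caller guarantees valid indexes.

```java
import java.util.Arrays;

// Simplified sketch of the FastList idea (names and details are illustrative).
public final class SimpleFastList<T> {
    private Object[] elements;
    private int size;

    public SimpleFastList(int capacity) {
        this.elements = new Object[capacity];
    }

    public void add(T element) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2);   // grow only when full
        }
        elements[size++] = element;
    }

    // No range check: the caller guarantees the index is valid, so the
    // ArrayList-style bounds check would be pure overhead here.
    @SuppressWarnings("unchecked")
    public T get(int index) {
        return (T) elements[index];
    }

    // Scan from tail to head: statements are typically closed immediately
    // after use or in reverse order of opening, so the match is near the tail.
    public boolean remove(Object element) {
        for (int i = size - 1; i >= 0; i--) {
            if (element == elements[i]) {   // identity comparison is sufficient here
                System.arraycopy(elements, i + 1, elements, i, size - i - 1);
                elements[--size] = null;
                return true;
            }
        }
        return false;
    }

    public int size() {
        return size;
    }
}
```

In a connection wrapper, such a list is appended to on every statement creation and scanned from the tail when a `Statement` is closed, which is exactly the access pattern the tail-first scan favors.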
HikariCP contains a custom lock-free collection called a `ConcurrentBag`. The idea was borrowed from the C# .NET ConcurrentBag class, but the internal implementation is quite different. The ConcurrentBag provides...
- A lock-free design
- ThreadLocal caching
- Queue-stealing
- Direct hand-off optimizations
...resulting in a high degree of concurrency, extremely low latency, and minimized occurrences of false-sharing.
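As an illustration only, the stripped-down `SimpleBag` below shows how two of those ingredients fit together: a `ThreadLocal` cache consulted first, and CAS-claimed entries "stolen" from a shared lock-free list. HikariCP's real `ConcurrentBag` additionally handles waiters, weak references, and direct hand-off to parked threads, so treat this purely as a sketch of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the ConcurrentBag idea; not the real implementation.
public final class SimpleBag<T> {
    public static final class Entry<E> {
        static final int FREE = 0, IN_USE = 1;
        final E value;
        final AtomicInteger state = new AtomicInteger(FREE);
        Entry(E value) { this.value = value; }
    }

    private final CopyOnWriteArrayList<Entry<T>> sharedList = new CopyOnWriteArrayList<>();
    private final ThreadLocal<List<Entry<T>>> threadList = ThreadLocal.withInitial(ArrayList::new);

    public void add(T value) {
        sharedList.add(new Entry<>(value));
    }

    // Try the entries this thread used most recently (ThreadLocal caching),
    // then "steal" from the shared list; a CAS claims an entry without locking.
    public Entry<T> borrow() {
        List<Entry<T>> cached = threadList.get();
        for (int i = cached.size() - 1; i >= 0; i--) {
            Entry<T> entry = cached.remove(i);
            if (entry.state.compareAndSet(Entry.FREE, Entry.IN_USE)) {
                return entry;
            }
        }
        for (Entry<T> entry : sharedList) {
            if (entry.state.compareAndSet(Entry.FREE, Entry.IN_USE)) {
                return entry;
            }
        }
        return null;   // the real bag waits/parks and supports direct hand-off instead
    }

    // Return the entry and remember it in this thread's cache so the same
    // thread is likely to get it back, still warm, on its next borrow().
    public void requite(Entry<T> entry) {
        entry.state.set(Entry.FREE);
        threadList.get().add(entry);
    }
}
```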
In order to generate proxies for `Connection`, `Statement`, and `ResultSet` instances, HikariCP was initially using a singleton factory, held in the case of `ProxyConnection` in a static field (`PROXY_FACTORY`). There were a dozen or so methods resembling the following:
```java
public final PreparedStatement prepareStatement(String sql, String[] columnNames) throws SQLException
{
    return PROXY_FACTORY.getProxyPreparedStatement(this, delegate.prepareStatement(sql, columnNames));
}
```
Using the original singleton factory, the generated bytecode looked like this:
```
public final java.sql.PreparedStatement prepareStatement(java.lang.String, java.lang.String[]) throws java.sql.SQLException;
    flags: ACC_PRIVATE, ACC_FINAL
    Code:
      stack=5, locals=3, args_size=3
         0: getstatic       #59   // Field PROXY_FACTORY:Lcom/zaxxer/hikari/proxy/ProxyFactory;
         3: aload_0
         4: aload_0
         5: getfield        #3    // Field delegate:Ljava/sql/Connection;
         8: aload_1
         9: aload_2
        10: invokeinterface #74, 3 // InterfaceMethod java/sql/Connection.prepareStatement:(Ljava/lang/String;[Ljava/lang/String;)Ljava/sql/PreparedStatement;
        15: invokevirtual   #69   // Method com/zaxxer/hikari/proxy/ProxyFactory.getProxyPreparedStatement:(Lcom/zaxxer/hikari/proxy/ConnectionProxy;Ljava/sql/PreparedStatement;)Ljava/sql/PreparedStatement;
        18: areturn
```
You can see that first there is a `getstatic` call to get the value of the static field `PROXY_FACTORY`, as well as (lastly) an `invokevirtual` call to `getProxyPreparedStatement()` on the `ProxyFactory` instance.
We eliminated the singleton factory (which was generated by Javassist) and replaced it with a final class having `static` methods (whose bodies are generated by Javassist). The Java code became:
```java
public final PreparedStatement prepareStatement(String sql, String[] columnNames) throws SQLException
{
    return ProxyFactory.getProxyPreparedStatement(this, delegate.prepareStatement(sql, columnNames));
}
```
Where `getProxyPreparedStatement()` is a `static` method defined in the `ProxyFactory` class. The resulting bytecode is:
```
private final java.sql.PreparedStatement prepareStatement(java.lang.String, java.lang.String[]) throws java.sql.SQLException;
    flags: ACC_PRIVATE, ACC_FINAL
    Code:
      stack=4, locals=3, args_size=3
         0: aload_0
         1: aload_0
         2: getfield        #3    // Field delegate:Ljava/sql/Connection;
         5: aload_1
         6: aload_2
         7: invokeinterface #72, 3 // InterfaceMethod java/sql/Connection.prepareStatement:(Ljava/lang/String;[Ljava/lang/String;)Ljava/sql/PreparedStatement;
        12: invokestatic    #67   // Method com/zaxxer/hikari/proxy/ProxyFactory.getProxyPreparedStatement:(Lcom/zaxxer/hikari/proxy/ConnectionProxy;Ljava/sql/PreparedStatement;)Ljava/sql/PreparedStatement;
        15: areturn
```
There are three things of note here:

- The `getstatic` call is gone.
- The `invokevirtual` call is replaced with an `invokestatic` call that is more easily optimized by the JVM.
- Lastly, possibly not noticed at first glance, the stack size is reduced from 5 elements to 4. This is because in the case of `invokevirtual` there is an implicit passing of the `ProxyFactory` instance on the stack (i.e. `this`), and there is an additional (unseen) pop of that value from the stack when `getProxyPreparedStatement()` is called.
In all, this change removed a static field access, a push and pop from the stack, and made the invocation easier for the JIT to optimize because the callsite is guaranteed not to change.
In our benchmark, we are obviously running against a stub JDBC driver implementation, so the JIT is doing a lot of inlining. However, the same inlining at the stub-level is occurring for other pools in the benchmark. So, no inherent advantage to us.
But inlining is certainly a big part of the equation even when real drivers are in use, which brings us to another topic...
Some light reading.
TL;DR Obviously, when you're running 400 threads "at once", you aren't really running them "at once" unless you have 400 cores. The operating system, using N CPU cores, switches between your threads giving each a small "slice" of time to run called a quantum.
With a lot of threads running, as in many applications, when your time-slice runs out (as a thread) it may be a "long time" before the scheduler gives you a chance to run again. It is therefore crucial that a thread get as much as possible done during its time-slice, and avoid locks that force it to give up that time-slice, otherwise there is a performance penalty to be paid. And not a small one.
Which brings us to...
Another big hit incurred when you can't get your work done within a quantum is CPU cache-line invalidation. If your thread is preempted by the scheduler, when it does get a chance to run again, all of the data it was frequently accessing is likely no longer in the core's L1 or core-pair L2 cache. This is even more likely because you have no control over which core you will be scheduled on next.