Ultra low latency tuning summary
Baoying Wang edited this page Jan 19, 2018 · 12 revisions
The latest update of this page is at https://github.com/baoyingwang/OrderBook/wiki/Ultra-low-latency-tuning-summary.
1. Build the app
2. Build the profiling script
3. Analyze the profiling result
4. Refactor your app, upgrade the profiling scripts (if required), and re-profile
5. Go to #3
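The profiling step in the loop above can start very small. Below is a minimal sketch, not the profiling script used in this project: `LatencyRecorder` and its methods are hypothetical names, and `Math.sqrt` only stands in for the real critical-path code. It records nanosecond samples into a preallocated array and reports nearest-rank percentiles.

```java
import java.util.Arrays;

/** Hypothetical helper: records latency samples and reports percentiles. */
final class LatencyRecorder {
    private final long[] samples; // preallocated, so recording does not allocate
    private int count;

    LatencyRecorder(int capacity) {
        samples = new long[capacity];
    }

    /** Records one latency in nanoseconds; samples beyond capacity are dropped. */
    void record(long latencyNanos) {
        if (count < samples.length) {
            samples[count++] = latencyNanos;
        }
    }

    /** Nearest-rank percentile, with p in (0, 100]. */
    long percentile(double p) {
        if (count == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * count);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        LatencyRecorder recorder = new LatencyRecorder(100_000);
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            Math.sqrt(i); // stand-in for the code on the critical path
            recorder.record(System.nanoTime() - start);
        }
        System.out.println("p50=" + recorder.percentile(50) + "ns"
                + " p99=" + recorder.percentile(99) + "ns"
                + " max=" + recorder.percentile(100) + "ns");
    }
}
```

Comparing p50, p99 and max before and after each refactoring gives a quick check of whether a change helped the tail as well as the typical case.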
Monitor the GC behavior and understand it.
Why: know your app, and easily verify the effect of different options.
- Big picture: fewer layers (thread model). How: use fewer layers and less transforming, e.g. keep everything as bytes.
- How many CPUs, and at what clock speed (Hz)? How does the CPU work, including the L1/L2/L3 caches?
- How much memory? How does memory work, and how does it communicate with the CPU (and L1/L2/L3)?
- How is IO related to memory and CPU? Why is an SSD much faster than an HDD?
- Your application's thread model, and the layers on the critical path.
- How much computation and memory your app will use. Convert this to the maximum possible performance, given the CPU and memory limits.
- Avoid any disk IO on the critical path.
- Use an SSD.
- Log at info level.
- Increase the log buffer size. This reduces kernel/user mode switches.
- Avoid string concatenation in debug logging.
- Avoid calling toString() in debug logging.
- Busy spin, or block and wait.
- Know your app's characteristics: how many long-lived objects it has, and how long they live.
- Know the basics: minor GC / major GC.
- (ParNew + CMS) or G1.
- The application stops because of biased locking.
- Disable it if there are many such stops.
- The application stops because of safepoints.
- Mutable objects are preferred on the critical path, to avoid creating more temporary objects.
- Avoid log.debug("{}", obj.toString()); use log.debug("{}", obj) instead. This saves the toString() call when debug is off.
- Pass less data to your executor, so that less of the old generation is used. E.g. in the code below, handleER #1 is better, since it passes only a short string (128) to the executor, while handleER #2 passes the whole fixER message (1k?). If the executor is slow for some reason, the old generation grows much faster with #2.
int bufferSize = 16*1024*1024;
BufferedOutputStream output = TestToolUtil.setupOutputLatencyFile(_latencyDataFile, bufferSize);
#1
void handleER(String fixER, long recvTimeNano){
    if(fixER.indexOf("\u000156="+MatchingEngineApp.LATENCY_ENTITY_PREFIX) > 0){
        // Build the short record on the calling thread; the task captures only this short string.
        String latencyRecord= TestToolUtil.getLantecyRecord(fixER,recvTimeNano);
        _latencyWritingExecutor.submit(()->{
            try {
                output.write(latencyRecord.getBytes());
            }catch (Exception e){
                log.error("fail to write", e);
            }
        });
    }
}
#2
void handleER(String fixER, long recvTimeNano){
    if(fixER.indexOf("\u000156="+MatchingEngineApp.LATENCY_ENTITY_PREFIX) > 0){
        _latencyWritingExecutor.submit(()->{
            try {
                // The record is built inside the task, so the task captures the whole fixER message.
                String latencyRecord= TestToolUtil.getLantecyRecord(fixER,recvTimeNano);
                output.write(latencyRecord.getBytes());
            }catch (Exception e){
                log.error("fail to write", e);
            }
        });
    }
}
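The earlier advice about not calling toString() for debug logging can be checked with a small experiment. This is a sketch under assumptions, not the logging setup of this project: it uses `java.util.logging` (whose `fine` level plays the role of debug) so that it runs without extra libraries, and `LazyLogDemo` and `Order` are hypothetical names. The `Order` class counts its own toString() calls so that the cost of the eager style becomes visible.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LazyLogDemo {
    /** Hypothetical message type that counts how often toString() runs. */
    static final class Order {
        static final AtomicInteger TO_STRING_CALLS = new AtomicInteger();
        final String id;

        Order(String id) {
            this.id = id;
        }

        @Override
        public String toString() {
            TO_STRING_CALLS.incrementAndGet();
            return "Order[" + id + "]";
        }
    }

    static final Logger LOG = Logger.getLogger("LazyLogDemo");

    /** Eager: the message string is built before the logger checks the level. */
    static void logEager(Order order) {
        LOG.fine("received " + order.toString());
    }

    /** Lazy: the supplier runs only if the FINE level is enabled. */
    static void logLazy(Order order) {
        LOG.fine(() -> "received " + order);
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // debug-level (FINE) output is switched off
        Order order = new Order("A1");
        logEager(order);
        System.out.println("toString calls after eager log: " + Order.TO_STRING_CALLS.get());
        logLazy(order);
        System.out.println("toString calls after lazy log:  " + Order.TO_STRING_CALLS.get());
    }
}
```

With the level at INFO, the eager call still pays for one toString() and one string concatenation, while the lazy call pays for neither. Passing the object itself as a `{}` argument, as recommended above, defers the formatting in the same way.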
Things are very different from years ago. Everything is easy (for getting started, at least).
- It is easy to set up and run an application now, because of new libraries, including Spring Boot (for both standalone apps and web apps).
- It is easy to set up a dev project from scratch with IntelliJ + Gradle + a Maven repo, and to integrate popular libraries, e.g. Spring Boot.
- It is easy to write JavaScript, e.g. with the Angular framework, C3 charts, etc.
But some aspects are still difficult, especially some business-oriented libraries, e.g. QFJ.
- QFJ has been upgraded and pushed to the Maven repo. Good.
- But its getting-started guide is still NOT well documented. I am writing a wiki to make it easier.
Find good tools/libraries to free yourself.
- Spring Boot
- Angular for the browser side
- QuickFixJ for the FIX interface (both client and server)
- Gradle: saves much time by avoiding copying libs or setting up projects. The Gradle fat jar saves time on setting up the runtime classpath.
- IntelliJ: a good IDE with Gradle support. It supports Python highlighting (and Python projects) and bash highlighting. Eclipse is not very good with Gradle.
- Disruptor: good at busy CPU spin. With the sleeping strategy it is similar to the JDK BlockingQueue (so introducing the extra complexity is not required).
- ChronicleQ: uses files. Good for bytes. But within the same JVM, marshalling/unmarshalling bytes is also a big burden. Good for sharing across JVMs. It has a util, LongPause, which adapts the wait intervals (based on input), downgrading the sleep/park time from ns, to us, to ms, ...
- Netty: NIO framework.
- Guava EventBus (sync / async mode): not good for ultra-low latency (us level), but it greatly simplifies our code. In memory.
- btrace: recommended instead of injecting lines into the source code. Please use Sampled for fast methods to reduce the burden on the application, see http://btraceio.github.io/btrace/2015/02/sampled-profiling/
- C3 for web graph
- Python matplotlib/seaborn/etc.
- Python pandas
- Linux vmstat
- Windows: Performance Monitor (define your own collector, and write to a csv file)
- Collect the status via Java JMX MBeans; refer to baoying.orderbook.app.SysPerfDataCollectionEngine
- MAT for heap analysis
- jvisualvm for CPU usage, memory allocation of each thread, etc.
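For the JMX MBean item above, here is a minimal sketch of what such a collector could look like. It is not the code of SysPerfDataCollectionEngine: `GcSnapshot` and `snapshotCsv` are hypothetical names, and the CSV layout is an assumption chosen so the output can be loaded into pandas later. It reads heap usage and per-collector GC counts and times from the platform MXBeans.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class GcSnapshot {
    /** Builds one CSV line: timestamp, heap bytes used, then name/count/time(ms) per collector. */
    static String snapshotCsv() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        StringBuilder line = new StringBuilder();
        line.append(System.currentTimeMillis()).append(',').append(heap.getUsed());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            line.append(',').append(gc.getName())
                .append(',').append(gc.getCollectionCount())
                .append(',').append(gc.getCollectionTime());
        }
        return line.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few snapshots, one per second; a real collector would append them to a file.
        for (int i = 0; i < 3; i++) {
            System.out.println(snapshotCsv());
            Thread.sleep(1000);
        }
    }
}
```

Sampling this once per second during a test run shows how often minor and major collections happen and how much time they take, which helps when comparing GC options.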