The inference time of LightGlue-ONNX is compared to that of the original PyTorch implementation with adaptive configuration and FlashAttention.
Following the implementation details of the LightGlue paper, we report the inference time, or latency, of the LightGlue matcher only; the time taken for feature extraction, postprocessing, copying data between the host and device, or finding inliers (e.g., RANSAC/MAGSAC) is not measured. The reported inference time is the median over all samples in the MegaDepth test dataset. We use the data provided by LoFTR, a total of 403 image pairs.
Each image is resized such that its longer side is 1024 pixels before being fed into the feature extractor. The latency of the LightGlue matcher is then measured for different numbers of keypoints: 512, 1024, 2048, and 4096. The SuperPoint extractor is used. See `eval.py` for the measurement code.
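For reference, here is a minimal sketch of the timing protocol described above. It is not the actual `eval.py` code; `matcher` and `pairs` are placeholders for the matcher under test and an iterable of pre-extracted SuperPoint features per image pair.

```python
import time

import numpy as np
import torch


def measure_matcher_latency(matcher, pairs):
    """Median matcher-only latency in milliseconds (hypothetical helper).

    `pairs` is assumed to yield pre-extracted SuperPoint features for the two
    images, so feature extraction and host<->device copies are excluded.
    """
    timings = []
    for feats0, feats1 in pairs:
        torch.cuda.synchronize()  # make sure prior GPU work has finished
        start = time.perf_counter()
        with torch.inference_mode():
            matcher({"image0": feats0, "image1": feats1})
        torch.cuda.synchronize()  # wait for the matcher kernels to complete
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings))  # median over all 403 MegaDepth pairs
```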
All experiments are conducted on an i9-12900HX CPU and an RTX 4080 12GB GPU with `CUDA==11.8.1`, `TensorRT==8.6.1`, `torch==2.1.0`, and `onnxruntime==1.16.0`.
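As a rough guide to how the ORT and TensorRT rows might be reproduced, the sketch below creates an ONNX Runtime session with the TensorRT Execution Provider (FP16 engines) and CUDA/CPU fallbacks. The model path is a placeholder, not necessarily the file name used by this repository.

```python
import onnxruntime as ort

# Hypothetical model path; substitute the fused LightGlue ONNX model you exported.
MODEL_PATH = "weights/superpoint_lightglue_fused.onnx"

session = ort.InferenceSession(
    MODEL_PATH,
    providers=[
        # TensorRT EP with FP16 engines; subgraphs it cannot handle fall back
        # to the CUDA EP, then the CPU EP.
        ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

# Input names vary by export; inspect the session rather than hard-coding them.
print([i.name for i in session.get_inputs()])
```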
The measured latencies, in milliseconds, are reported in the table below.
| Model | 512 keypoints | 1024 keypoints | 2048 keypoints | 4096 keypoints |
|---|---|---|---|---|
| PyTorch (Adaptive) | 12.81 | 13.65 | 16.49 | 24.35 |
| ORT Fused FP32 | 9.52 | 14.90 | 36.21 | 97.37 |
| ORT Fused FP16 | 7.48 | 9.06 | 12.99 | 28.97 |
| TensorRT FP16 | 7.11 | 7.56 | 10.81 | 24.46 |
In general, the fused ORT models can match the speed of the adaptive PyTorch model despite being non-adaptive (passing through all attention layers). The PyTorch model provides more consistent latencies across the board, while the fused ORT models become slower at higher keypoint counts due to a bottleneck in the `NonZero` operator. On the other hand, the TensorRT Execution Provider can reach very low latencies, but its behaviour is also less consistent and harder to predict.
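To illustrate the `NonZero` bottleneck (this snippet is not taken from the exported model): boolean-mask filtering of matches is the kind of pattern that lowers to the ONNX `NonZero` operator, whose output shape depends on the data and which must scan every candidate keypoint pair.

```python
import torch

# Illustrative only: a thresholding step like this exports to ONNX NonZero.
scores = torch.rand(2048, 2048)  # assignment scores for all keypoint pairs
valid = scores > 0.2             # placeholder match-confidence threshold
indices = valid.nonzero()        # -> NonZero: output shape depends on the data
matches0, matches1 = indices[:, 0], indices[:, 1]  # surviving match indices
```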