YOLOv9 with Quantization-Aware Training (QAT) for TensorRT #327
Comments
Updated. |
Performance / Accuracy
TensorRT version: 10.0.0
Model
Accuracy Report
YOLOv9-C Evaluation Results
Evaluation Comparison
Latency/Throughput Report using only TensorRT
Device
Latency/Throughput
Latency/Throughput Comparison
|
@WongKinYiu Do you happen to have a YOLOv9-C or YOLOv9-E model trained with ReLU or ReLU6 activation functions? I need it for performance testing with quantization. If available and you could share it, it would greatly help me. |
Sorry for the late reply. yolov9-relu.pt is here. |
@WongKinYiu Thank you for providing the weights file. The current results have been quite satisfactory, achieving a minimum latency value of
Below are the tables of the results:
YOLOv9 - with ReLU
Performance / Accuracy
TensorRT version: 10.0.0
Device
Accuracy Report
Evaluation Results
Evaluation Comparison
Latency/Throughput Report using TensorRT
Latency/Throughput
Latency/Throughput Comparison
|
Can we run inference with the PyTorch INT8 model? How does PyTorch INT8 compare with TensorRT INT8 in the benchmark report? |
Could you help examine the latency/throughput without NMS? |
@WongKinYiu |
Thanks! |
Excuse me, I would like to borrow your time again. |
Latency/Throughput
LH-YOLOV9-C-FINE
LH-YOLOV9-C-COARSE
YOLOV9-C-FINE
YOLOV9-C-COARSE
Evaluation
YOLOV9-C-COARSE
YOLOV9-C-FINE
LH-YOLOV9-C-FINE
LH-YOLOV9-C-COARSE
|
@WongKinYiu |
Thanks a lot. By the way, the fine branch does not need NMS for post-processing. |
I am currently using the YOLOv9 code found at this link: https://github.com/levipereira/yolov9-qat/blob/master/val_trt.py#L249-L290. If you already have the corresponding code to evaluate without NMS, I would greatly appreciate it. |
Currently I just remove the NMS part of the function:
def no_max_suppression(
        prediction,
        conf_thres=0.25,
        iou_thres=0.45,
        classes=None,
        agnostic=False,
        multi_label=False,
        labels=(),
        max_det=300,
        nm=0,  # number of masks
):
    """Non-Maximum Suppression (NMS) on inference results to reject overlapping detections
    Returns:
         list of detections, on (n,6) tensor per image [xyxy, conf, cls]
    """
    if isinstance(prediction, (list, tuple)):  # YOLO model in validation model, output = (inference_out, loss_out)
        prediction = prediction[0]  # select only inference output
    device = prediction.device
    mps = 'mps' in device.type  # Apple MPS
    if mps:  # MPS not fully supported yet, convert tensors to CPU before NMS
        prediction = prediction.cpu()
    bs = prediction.shape[0]  # batch size
    nc = prediction.shape[1] - nm - 4  # number of classes
    mi = 4 + nc  # mask start index
    xc = prediction[:, 4:mi].amax(1) > conf_thres  # candidates
    # Checks
    assert 0 <= conf_thres <= 1, f'Invalid Confidence threshold {conf_thres}, valid values are between 0.0 and 1.0'
    assert 0 <= iou_thres <= 1, f'Invalid IoU {iou_thres}, valid values are between 0.0 and 1.0'
    # Settings
    # min_wh = 2  # (pixels) minimum box width and height
    max_wh = 7680  # (pixels) maximum box width and height
    max_nms = 300  # maximum number of boxes into torchvision.ops.nms()
    time_limit = 2.5 + 0.05 * bs  # seconds to quit after
    redundant = True  # require redundant detections
    multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)
    merge = False  # use merge-NMS
    t = time.time()
    output = [torch.zeros((0, 6 + nm), device=prediction.device)] * bs
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[:, 2:4] < min_wh) | (x[:, 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x.T[xc[xi]]  # confidence
        # Cat apriori labels if autolabelling
        if labels and len(labels[xi]):
            lb = labels[xi]
            v = torch.zeros((len(lb), nc + nm + 5), device=x.device)
            v[:, :4] = lb[:, 1:5]  # box
            v[range(len(lb)), lb[:, 0].long() + 4] = 1.0  # cls
            x = torch.cat((x, v), 0)
        # If none remain process next image
        if not x.shape[0]:
            continue
        # Detections matrix nx6 (xyxy, conf, cls)
        box, cls, mask = x.split((4, nc, nm), 1)
        box = xywh2xyxy(box)  # (center_x, center_y, width, height) to (x1, y1, x2, y2)
        if multi_label:
            i, j = (cls > conf_thres).nonzero(as_tuple=False).T
            x = torch.cat((box[i], x[i, 4 + j, None], j[:, None].float(), mask[i]), 1)
        else:  # best class only
            conf, j = cls.max(1, keepdim=True)
            x = torch.cat((box, conf, j.float(), mask), 1)[conf.view(-1) > conf_thres]
        # Filter by class
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]
        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_nms:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence
        else:
            x = x[x[:, 4].argsort(descending=True)]  # sort by confidence
        output[xi] = x
        if mps:
            output[xi] = output[xi].to(device)
        if (time.time() - t) > time_limit:
            LOGGER.warning(f'WARNING ⚠️ NMS time limit {time_limit:.3f}s exceeded')
            break  # time limit exceeded
    return output
|
Cleaned up the code:
def no_max_suppression(
        prediction,
        conf_thres=0.25,
        iou_thres=0.45,
        classes=None,
        agnostic=False,
        multi_label=False,
        labels=(),
        max_det=300,
        nm=0,  # number of masks
):
    """No Maximum Suppression on inference results
    Returns:
         list of detections, on (n,6) tensor per image [xyxy, conf, cls]
    """
    if isinstance(prediction, (list, tuple)):  # YOLO model in validation model, output = (inference_out, loss_out)
        prediction = prediction[0]  # select only inference output
    device = prediction.device
    mps = 'mps' in device.type  # Apple MPS
    if mps:  # MPS not fully supported yet, convert tensors to CPU before NMS
        prediction = prediction.cpu()
    bs = prediction.shape[0]  # batch size
    nc = prediction.shape[1] - nm - 4  # number of classes
    mi = 4 + nc  # mask start index
    xc = prediction[:, 4:mi].amax(1) > conf_thres  # candidates
    # Checks
    assert 0 <= conf_thres <= 1, f'Invalid Confidence threshold {conf_thres}, valid values are between 0.0 and 1.0'
    assert 0 <= iou_thres <= 1, f'Invalid IoU {iou_thres}, valid values are between 0.0 and 1.0'
    # Settings
    time_limit = 2.5 + 0.05 * bs  # seconds to quit after
    multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)
    t = time.time()
    output = [torch.zeros((0, 6 + nm), device=prediction.device)] * bs
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        x = x.T[xc[xi]]  # confidence
        # Cat apriori labels if autolabelling
        if labels and len(labels[xi]):
            lb = labels[xi]
            v = torch.zeros((len(lb), nc + nm + 5), device=x.device)
            v[:, :4] = lb[:, 1:5]  # box
            v[range(len(lb)), lb[:, 0].long() + 4] = 1.0  # cls
            x = torch.cat((x, v), 0)
        # If none remain process next image
        if not x.shape[0]:
            continue
        # Detections matrix nx6 (xyxy, conf, cls)
        box, cls, mask = x.split((4, nc, nm), 1)
        box = xywh2xyxy(box)  # (center_x, center_y, width, height) to (x1, y1, x2, y2)
        if multi_label:
            i, j = (cls > conf_thres).nonzero(as_tuple=False).T
            x = torch.cat((box[i], x[i, 4 + j, None], j[:, None].float(), mask[i]), 1)
        else:  # best class only
            conf, j = cls.max(1, keepdim=True)
            x = torch.cat((box, conf, j.float(), mask), 1)[conf.view(-1) > conf_thres]
        # Filter by class
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]
        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_det:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_det]]  # sort by confidence
        else:
            x = x[x[:, 4].argsort(descending=True)]  # sort by confidence
        output[xi] = x
        if mps:
            output[xi] = output[xi].to(device)
        if (time.time() - t) > time_limit:
            LOGGER.warning(f'WARNING ⚠️ NMS time limit {time_limit:.3f}s exceeded')
            break  # time limit exceeded
    return output
|
I have retrained the model completely because of the change that eliminates the need for NMS. We pick the best model based on mAP.
YOLOV9-C-FINE (NO-NMS)
LH-YOLOV9-C-FINE (NO-NMS)
|
lh-yolov9-c-coarse.pt, lh-yolov9-c-fine.pt are updated. |
LH-YOLOV9-C-FINE (mse) NMS-free
LH-YOLOV9-C-FINE (percentile=99.999) NMS-free
LH-YOLOV9-C-COARSE (mse)
LH-YOLOV9-C-COARSE (percentile=99.999)
I initially performed the default MSE calibration, but the results were unsatisfactory. Consequently, I changed the calibration method to percentile=99.999, which yielded better outcomes. I believe these models have more sensitive layers that need to be treated differently. Additionally, I need to explore the new HEAD of the model, since so far I have only performed quantization for YOLOv9-C/E. I am generating a latency report. |
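Why the choice of calibration range matters can be illustrated with a small toy example. This is only a sketch under stated assumptions: it is not the calibrator used in the repository, it assumes symmetric INT8 quantization, and for simplicity it contrasts a plain max-based range with a 99.999th-percentile range rather than reproducing MSE calibration. The helper names (`percentile_amax`, `fake_quantize`, `mean_squared_error`) and the synthetic activations are made up for illustration.

```python
import random


def percentile_amax(values, percentile):
    """Return the magnitude below which `percentile` percent of the samples fall."""
    magnitudes = sorted(abs(v) for v in values)
    # Nearest-rank index, clamped to the valid range.
    rank = int(len(magnitudes) * percentile / 100.0)
    return magnitudes[min(rank, len(magnitudes) - 1)]


def fake_quantize(value, amax, num_bits=8):
    """Symmetric quantize-dequantize: round to the integer grid, saturate at amax."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    scale = amax / qmax
    q = round(value / scale)
    q = max(-qmax, min(qmax, q))  # values beyond amax are clipped
    return q * scale


def mean_squared_error(values, amax):
    """Average squared error introduced by quantizing `values` with range `amax`."""
    return sum((v - fake_quantize(v, amax)) ** 2 for v in values) / len(values)


random.seed(0)
# Synthetic activations (illustrative only): a Gaussian bulk plus two large outliers.
bulk = [random.gauss(0.0, 1.0) for _ in range(200_000)]
activations = bulk + [95.0, -120.0]

amax_max = max(abs(v) for v in activations)      # range dictated by the largest outlier
amax_pct = percentile_amax(activations, 99.999)  # range that ignores the extreme tail

print(f"max-based amax:  {amax_max:.3f}, bulk MSE: {mean_squared_error(bulk, amax_max):.6f}")
print(f"percentile amax: {amax_pct:.3f}, bulk MSE: {mean_squared_error(bulk, amax_pct):.6f}")
```

In this toy setup the percentile range gives the bulk of the values a much finer grid, at the cost of saturating the two outliers. Whether that trade is beneficial for a real layer depends on how much the outliers matter for accuracy.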
LH-YOLOV9-C-FINE (INT8)
LH-YOLOV9-C-COARSE (INT8)
|
Thanks! It seems the old weights (#327 (comment)) have more stable QAT performance. Since the old and new weights were trained with different strategies, it may be worth discussing the relation between pretraining methods and the QAT step. I will provide weights of YOLOV9-C-FINE trained the same way as #327 (comment) in a few days to check whether the sensitive layers are caused by the different training methods. If so, I could try to analyze and design QAT-friendly pretraining methods in the future. Thank you for bringing this possible research direction to me. |
I ran the tests to find the most sensitive layer (PTQ baseline), and here are the results:
LH-YOLOV9-C-FINE (#327 (comment))
Today was quite busy, but I believe I will be able to run the training with layer 22 in FP16 and check the performance and accuracy results. |
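A per-layer sensitivity search of this kind can be sketched roughly as follows. This is a minimal illustration, not the script used in the repository: it assumes a hypothetical callback `evaluate_map` that runs validation with a given set of layers left unquantized and returns mAP. The toy evaluator and its penalty values are invented so the example runs on its own.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def rank_layer_sensitivity(
    layer_names: List[str],
    evaluate_map: Callable[[FrozenSet[str]], float],
) -> List[Tuple[str, float]]:
    """Rank layers by the mAP recovered when only that layer is left unquantized."""
    baseline = evaluate_map(frozenset())  # every layer quantized
    gains = []
    for name in layer_names:
        score = evaluate_map(frozenset({name}))  # skip quantization for one layer
        gains.append((name, score - baseline))
    # Largest recovery first: these layers are the most sensitive to quantization.
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Toy stand-in for a real validation run (hypothetical numbers, not measured results).
HYPOTHETICAL_PENALTY: Dict[str, float] = {
    "model.0": 0.001,
    "model.10": 0.002,
    "model.22": 0.030,
}


def toy_evaluate_map(unquantized: FrozenSet[str]) -> float:
    """Pretend mAP: a made-up FP baseline minus a penalty per quantized layer."""
    fp_map = 0.50
    return fp_map - sum(p for n, p in HYPOTHETICAL_PENALTY.items() if n not in unquantized)


ranking = rank_layer_sensitivity(list(HYPOTHETICAL_PENALTY), toy_evaluate_map)
for name, gain in ranking:
    print(f"{name}: +{gain:.3f} mAP when left in higher precision")
```

With a real model, `evaluate_map` would disable the quantizers of the named layers before running validation. Each call is a full evaluation, so the search costs one validation run per layer plus the baseline.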
Indeed, layer 22 is the most sensitive layer. I disabled quantization in layer 22 and managed to recover the precision, with better performance at batch size 1. However, when increasing the batch size to 8 or 12, there is a slight increase in latency and a decrease in throughput.
LH-YOLOV9-C-FINE (mse) NMS-free
LH-YOLOV9-C-FINE (INT8)
|
I have uploaded the weights and updated the file names.
training method 1:
training method 2:
By the way, could you also help examine the latency/throughput of the tiny/small/medium models? Thanks. |
I have observed that the last layer of the model is often the most sensitive to quantization. This sensitivity arises because this layer tends to generate more outliers. From a quantization perspective, these outliers are normalized, leading to a loss of precision, as these outliers are crucial for the model’s accuracy.
By changing the training method, we have effectively reduced the generation of outliers, which are critical for quantization. The different training approach has been shown to produce fewer outlier values, thus preserving the precision and overall performance of the quantized model.
To address the sensitivity of the final layer to quantization, I took a straightforward approach: instead of retraining the model, I simply disabled quantization for layer 22 and re-evaluated the model to assess the impact on performance.
Quantization disabled at layer 22 is indicated by the suffix
I performed the calibration using MSE, although in some cases percentile=99.999 proved to be more effective.
Model Performance Tables (MSE)
YOLOv9-C-Coarse - Method 1
YOLOv9-C-Coarse - Method 2
LH-YOLOv9-C-Coarse - Method 1
LH-YOLOv9-C-Coarse - Method 2
YOLOv9-C-Fine - Method 1
YOLOv9-C-Fine - Method 2
LH-YOLOv9-C-Fine - Method 1
LH-YOLOv9-C-Fine - Method 2
I still owe the tests for the remaining models as well as the latency tests, which I will send as soon as possible. |
Result of Tiny/Small/Medium
I have encountered several performance issues regarding latency and throughput in the quantized Tiny, Small, and Medium models. They performed worse than the FP16 models, generating many reformat operations that directly impacted the model's latency. I am currently researching and studying the behavior of quantization in these models to resolve the issue.
Latency
yolov9-t-converted (FP16)
yolov9-s-converted (FP16)
yolov9-m-converted (FP16)
Evaluation
yolov9-t-converted
yolov9-s-converted
yolov9-m-converted
|
Thanks. Yes, the throughput seems strange. The main difference between t/s/m and c/e is that t/s/m use AConv and c/e use ADown for downsampling. |
I have noticed that model performance is often measured solely by latency. However, during my research, I discovered that different models can have very similar latencies with a batch size of 1, yet show significant differences in both throughput and latency as the batch size increases. Therefore, testing only with a batch size of 1 and focusing solely on latency can lead to incorrect conclusions about the model's potential.
To accurately measure a model's potential, we should consider both latency and batch size. On the GPU, a model has a certain latency, but increasing the batch size doesn't cause latency to grow proportionally. This is evident in the performance tables. Thus, the best model is the one that achieves the highest throughput with the largest batch size and the lowest latency. I will attempt to illustrate my findings visually.
Testing with Batch Size = 1
GR Active was 100% (the percentage of cycles the compute engine is active).
SM Active was only 24%, which means there is a lot of room to increase the batch size.
Testing with Batch Size = 24
GR Active was still 100%.
SM Active was 85%.
When SM Active reaches 100%, the model's performance drops, resulting in increased latency and decreased throughput. Therefore, when measuring the potential of the model, we should also consider the batch size. The best model is the one that achieves the highest throughput with the largest batch size and the lowest latency. |
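The relationship between batch size, latency, and throughput described above can be made concrete with a short sketch. It assumes the simple model throughput = batch size / latency and ignores any overlap between transfers and compute, so it will not exactly match what a benchmarking tool reports. The function names and all numbers are hypothetical, not measurements from this thread.

```python
from typing import List, Tuple


def throughput_ips(batch_size: int, latency_ms: float) -> float:
    """Images per second when one batch of `batch_size` images takes `latency_ms`."""
    return batch_size * 1000.0 / latency_ms


def best_operating_point(measurements: List[Tuple[int, float]]) -> Tuple[int, float, float]:
    """Return (batch_size, latency_ms, throughput) for the highest-throughput entry."""
    rows = [(bs, lat, throughput_ips(bs, lat)) for bs, lat in measurements]
    return max(rows, key=lambda row: row[2])


# Hypothetical (batch_size, latency_ms) pairs for two models, NOT measured values.
# Both models look identical at batch size 1 but diverge as the batch grows.
model_a = [(1, 1.0), (8, 4.0), (24, 10.0)]
model_b = [(1, 1.0), (8, 6.0), (24, 20.0)]

for name, data in (("model_a", model_a), ("model_b", model_b)):
    bs, lat, ips = best_operating_point(data)
    print(f"{name}: batch={bs}, latency={lat} ms, throughput={ips:.0f} IPS")
```

Comparing the two hypothetical models only at batch size 1 would rate them as equal, while their best operating points differ considerably.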
I will perform profiling to see the differences. |
Batch 1 and large batch are both important. Large-batch inference is important for cloud services.
I have encountered similar issues (small and large models having the same inference speed) on YOLOv4 when using some of the built-in PyTorch versions in the NVIDIA Docker images. |
I will test these models on different servers and TensorRT versions. I often see performance reports comparing YOLO-series models with a batch size of 1, using latency as the primary comparison parameter. However, without testing variable batch sizes, it is possible to miss that some models perform significantly worse than others at larger batch sizes. A classic example was the batch size 1 test of YOLOv9-t, with a latency of 0.7 ms, versus YOLOv9-s, with 0.9 ms, where the throughput difference was only about 280 IPS. |
laugh12321 gets inference speeds similar to your reports. Three possible reasons:
Since the c model has 13 times the FLOPs of the t model, it is really strange. |
To check whether the number of layers is one of the reasons, number of layers: e > t = s > s1 = c > m |
gelan-s2 (FP16)
I don't believe the problem is with the host or the installation. It may be a bug or issue in TensorRT, because only a few models exhibit this strange behavior. I'm having a lot of difficulty identifying why the t/s/m models perform poorly when quantized. I've noticed a lot of reformat operations due to different scales. I implemented AConv similarly to ADown, but the poor results persist. I also observed some DFL operations in the slice of the initial layers, which differs from YOLOv9-C. However, I'm still investigating this carefully.
Model FP16
Model QAT |
Thank you for your effort. Yes, it seems that TensorRT generates many unnecessary reformat layers. |
about #327 (comment) |
Well, I do not know why yolov9-m has so many layers after conversion to TensorRT. |
I have been analyzing the models and noticed that YOLOv9-C vs. YOLOv9-M has several Reformat operations where some nodes were not fused. The same issue occurs with the QAT models, where some nodes, despite being on the same scale, are not being fused, resulting in multiple Reformat operations. I searched on GitHub and found several users experiencing issues with node fusion, where TensorRT did not support certain fusions. Given that these models introduce new modules, it is possible that this has caused issues with TensorRT. We need to open another front to address this issue in the TensorRT repository to understand where the potential problem lies. |
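One way to quantify the unfused Reformat operations is to count layer types in a per-layer dump of the built engine, for instance a JSON file written by trtexec with `--exportLayerInfo` and detailed profiling verbosity. The sketch below is an assumption-laden illustration: the `Layers`, `Name`, and `LayerType` field names and the sample records are assumed for the example, and the actual schema can differ between TensorRT versions.

```python
import json
from collections import Counter
from typing import Dict, List


def count_layer_types(layers: List[Dict[str, str]]) -> Counter:
    """Count layers per type, falling back to the name when no type field exists."""
    counts: Counter = Counter()
    for layer in layers:
        layer_type = layer.get("LayerType")
        if layer_type is None:
            # Assumption: reformat layers can also be recognised by their name.
            is_reformat = "reformat" in layer.get("Name", "").lower()
            layer_type = "Reformat" if is_reformat else "Other"
        counts[layer_type] += 1
    return counts


# Hypothetical excerpt of a per-layer engine dump (field names are assumptions).
example_dump = json.loads("""
{"Layers": [
  {"Name": "/model.0/conv/Conv", "LayerType": "CaskConvolution"},
  {"Name": "Reformatting CopyNode for Input Tensor 0 to /model.1/conv/Conv", "LayerType": "Reformat"},
  {"Name": "/model.1/conv/Conv", "LayerType": "CaskConvolution"},
  {"Name": "Reformatting CopyNode for Output Tensor 0 to /model.2/Concat"}
]}
""")

counts = count_layer_types(example_dump["Layers"])
print(f"Reformat layers: {counts['Reformat']} of {sum(counts.values())}")
```

Running the same count on the FP16 and QAT engines of each model would give a simple number to attach to a TensorRT bug report.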
Could you help take a look at whether YOLOv7 has the same issue? |
These past few days I was away on a business trip. I'm returning now and we will pick up where we left off. I'm sorry for the delay in responding. |
YOLOv9 with Quantization-Aware Training (QAT) for TensorRT
https://github.com/levipereira/yolov9-qat/
This repository hosts an implementation of YOLOv9 integrated with Quantization-Aware Training (QAT), optimized for deployment on TensorRT-supported platforms to achieve hardware-accelerated inference. It aims to deliver an efficient, low-latency version of YOLOv9 for real-time object detection applications. If you're not planning to deploy your model using TensorRT, it's advisable not to proceed with this implementation.
Implementation Details:
Performance Report
@WongKinYiu I've successfully created a comprehensive implementation of Quantization in a separate repository. It works as a patch for the original YOLOv9 version. However, there are still some challenges to address as the implementation is functional but has room for improvement.
I'm closing the issue #253 and will continue the discussion in this thread. If possible, please replace the reference to issue #253 with this new issue #327 in the Useful Links section.
I'll provide the latency reports shortly.