Add a new object detector, Lite-DINO (#2457)

* Initial implementation of Lite DETR
* Update model config for lite dino
* Add norm to intermediate layer of ffn
* Change FFN's norm order and add enc_scale attribute to encoder's layers
* Merge with incremental recipe
* Add model pretrained weight path
* Update model info and add intg tests
* Update docs
* Update CHANGELOG
* Change num iters
openvinotoolkit · Aug 31, 2023 · 8045480
1 parent fc6386c

Showing 13 changed files with 648 additions and 9 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,7 @@ All notable changes to this project will be documented in this file.
 - Add ONNX metadata to detection, instance segmantation, and segmentation models (<https://github.com/openvinotoolkit/training_extensions/pull/2418>)
 - Add a new feature to configure input size(<https://github.com/openvinotoolkit/training_extensions/pull/2420>)
 - Introduce the OTXSampler and AdaptiveRepeatDataHook to achieve faster training at the small data regime (<https://github.com/openvinotoolkit/training_extensions/pull/2428>)
+- Add a new object detector, Lite-DINO (<https://github.com/openvinotoolkit/training_extensions/pull/2457>)
 
 ### Enhancements
 

diff --git a/docs/source/guide/explanation/algorithms/object_detection/object_detection.rst b/docs/source/guide/explanation/algorithms/object_detection/object_detection.rst
@@ -100,6 +100,8 @@ In addition to these models, we supports experimental models for object detectio
 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
 | `Custom_Object_Detection_Gen3_DINO <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algorithms/detection/configs/detection/resnet50_dino/template_experimental.yaml>`_                        |        DINO         | 235                 | 182.0           |
 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
+| `Custom_Object_Detection_Gen3_Lite_DINO <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algorithms/detection/configs/detection/resnet50_litedino/template_experimental.yaml>`_               |      Lite-DINO      | 140                 | 190.0           |
++---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
 | `Custom_Object_Detection_Gen3_ResNeXt101_ATSS <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algorithms/detection/configs/detection/resnext101_atss/template_experimental.yaml>`_           |   ResNeXt101-ATSS   | 434.75              | 344.0           |
 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
 | `Object_Detection_YOLOX_S <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algorithms/detection/configs/detection/cspdarknet_yolox_s/template_experimental.yaml>`_                            |       YOLOX_S       | 33.51               | 46.0            |
@@ -110,6 +112,7 @@ In addition to these models, we supports experimental models for object detectio
 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
 
 `Deformable_DETR <https://arxiv.org/abs/2010.04159>`_ is `DETR <https://arxiv.org/abs/2005.12872>`_ based model, and it solves slow convergence problem of DETR. `DINO <https://arxiv.org/abs/2203.03605>`_ improves Deformable DETR based methods via denoising anchor boxes. Current SOTA models for object detection are based on DINO.
+`Lite-DINO <https://arxiv.org/abs/2303.07335>`_ is an efficient architecture for DINO. It reduces the FLOPs of the transformer encoder, which accounts for the largest share of the computational cost.
 Although transformer based models show notable performance on various object detection benchmark, CNN based model still show good performance with proper latency.
 Therefore, we added a new experimental CNN based method, ResNeXt101-ATSS. ATSS still shows good performance among `RetinaNet <https://arxiv.org/abs/1708.02002>`_ based models. We integrated large ResNeXt101 backbone to our Custom ATSS head, and it shows good transfer learning performance.
 In addition, we added a YOLOX variants to support users' diverse situations.
@@ -154,6 +157,8 @@ We trained each model with a single Nvidia GeForce RTX3090.
 +----------------------------+------------------+-----------+-----------+-----------+-----------+--------------+
 | ResNet50-DINO              | 49.0 (66.4)      | 47.2      | 99.5      | 62.9      | 93.5      | 99.1         |
 +----------------------------+------------------+-----------+-----------+-----------+-----------+--------------+
+| ResNet50-Lite-DINO         | 48.1 (64.4)      | 47.0      | 99.0      | 62.5      | 93.6      | 99.4         |
++----------------------------+------------------+-----------+-----------+-----------+-----------+--------------+
 | YOLOX_S                    | 40.3 (59.1)      | 37.1      | 93.6      | 54.8      | 92.7      | 98.8         |
 +----------------------------+------------------+-----------+-----------+-----------+-----------+--------------+
 | YOLOX_L                    | 49.4 (67.1)      | 44.5      | 94.6      | 55.8      | 91.8      | 99.0         |
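
The documentation hunk above says that Lite-DINO reduces the FLOPs of the transformer encoder. A rough way to see why the encoder is the expensive part is to count the tokens it processes per feature level. The sketch below is only an illustration: the four strides (8, 16, 32, 64), the input size, and the helper name `token_share` are assumptions and do not come from this commit.

```python
def token_share(height, width, strides=(8, 16, 32, 64)):
    """Return the fraction of encoder tokens contributed by each feature level.

    Assumption: one token per spatial location of each multi-scale feature map,
    with the strides listed above (hypothetical values for illustration).
    """
    counts = [(height // s) * (width // s) for s in strides]
    total = sum(counts)
    return {s: c / total for s, c in zip(strides, counts)}


# Hypothetical 1024x1024 input.
shares = token_share(1024, 1024)
for stride, share in shares.items():
    print(f"stride {stride:>2}: {share:.1%} of encoder tokens")
```

Under these assumptions the highest-resolution level alone contributes about three quarters of the tokens, so an encoder that updates that level less often has fewer tokens to process per layer.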

diff --git a/src/otx/algorithms/common/adapters/mmcv/ops/multi_scale_deformable_attn_pytorch.py b/src/otx/algorithms/common/adapters/mmcv/ops/multi_scale_deformable_attn_pytorch.py
@@ -78,6 +78,7 @@ def _custom_grid_sample(im: torch.Tensor, grid: torch.Tensor, align_corners: boo
     Returns:
         torch.Tensor: A tensor with sampled points, shape (N, C, Hg, Wg)
     """
+    device = im.device
     n, c, h, w = im.shape
     gn, gh, gw, _ = grid.shape
     assert n == gn
@@ -113,14 +114,14 @@ def _custom_grid_sample(im: torch.Tensor, grid: torch.Tensor, align_corners: boo
     x0, x1, y0, y1 = x0 + 1, x1 + 1, y0 + 1, y1 + 1
 
     # Clip coordinates to padded image size
-    x0 = torch.where(x0 < 0, torch.tensor(0), x0)
-    x0 = torch.where(x0 > padded_w - 1, torch.tensor(padded_w - 1), x0)
-    x1 = torch.where(x1 < 0, torch.tensor(0), x1)
-    x1 = torch.where(x1 > padded_w - 1, torch.tensor(padded_w - 1), x1)
-    y0 = torch.where(y0 < 0, torch.tensor(0), y0)
-    y0 = torch.where(y0 > padded_h - 1, torch.tensor(padded_h - 1), y0)
-    y1 = torch.where(y1 < 0, torch.tensor(0), y1)
-    y1 = torch.where(y1 > padded_h - 1, torch.tensor(padded_h - 1), y1)
+    x0 = torch.where(x0 < 0, torch.tensor(0, device=device), x0)
+    x0 = torch.where(x0 > padded_w - 1, torch.tensor(padded_w - 1, device=device), x0)
+    x1 = torch.where(x1 < 0, torch.tensor(0, device=device), x1)
+    x1 = torch.where(x1 > padded_w - 1, torch.tensor(padded_w - 1, device=device), x1)
+    y0 = torch.where(y0 < 0, torch.tensor(0, device=device), y0)
+    y0 = torch.where(y0 > padded_h - 1, torch.tensor(padded_h - 1, device=device), y0)
+    y1 = torch.where(y1 < 0, torch.tensor(0, device=device), y1)
+    y1 = torch.where(y1 > padded_h - 1, torch.tensor(padded_h - 1, device=device), y1)
 
     im_padded = im_padded.view(n, c, -1)
 

diff --git a/src/otx/algorithms/detection/adapters/mmdet/models/detectors/__init__.py b/src/otx/algorithms/detection/adapters/mmdet/models/detectors/__init__.py
@@ -6,6 +6,7 @@
 from .custom_atss_detector import CustomATSS
 from .custom_deformable_detr_detector import CustomDeformableDETR
 from .custom_dino_detector import CustomDINO
+from .custom_lite_dino import CustomLiteDINO
 from .custom_maskrcnn_detector import CustomMaskRCNN
 from .custom_maskrcnn_tile_optimized import CustomMaskRCNNTileOptimized
 from .custom_single_stage_detector import CustomSingleStageDetector
@@ -19,6 +20,7 @@
 __all__ = [
     "CustomATSS",
     "CustomDeformableDETR",
+    "CustomLiteDINO",
     "CustomDINO",
     "CustomMaskRCNN",
     "CustomSingleStageDetector",

diff --git a/src/otx/algorithms/detection/adapters/mmdet/models/detectors/custom_lite_dino.py b/src/otx/algorithms/detection/adapters/mmdet/models/detectors/custom_lite_dino.py
@@ -0,0 +1,21 @@
+"""OTX Lite-DINO Class for object detection."""
+
+# Copyright (C) 2023 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+#
+
+from mmdet.models.builder import DETECTORS
+
+from otx.algorithms.common.utils.logger import get_logger
+from otx.algorithms.detection.adapters.mmdet.models.detectors import CustomDINO
+
+logger = get_logger()
+
+
+@DETECTORS.register_module()
+class CustomLiteDINO(CustomDINO):
+    """Custom Lite-DINO <https://arxiv.org/pdf/2303.07335.pdf> for object detection."""
+
+    def load_state_dict_pre_hook(self, model_classes, ckpt_classes, ckpt_dict, *args, **kwargs):
+        """Modify official lite dino version's weights before weight loading."""
+        super(CustomDINO, self).load_state_dict_pre_hook(model_classes, ckpt_classes, ckpt_dict, *args, **kwargs)
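
The hook in `CustomLiteDINO` calls `super(CustomDINO, self)`, which starts the method lookup after `CustomDINO` and so skips that class's own pre-hook. When arguments are forwarded this way, keyword arguments need `**kwargs`; a single `*` would pass the dictionary's keys as extra positional arguments. The sketch below shows both effects with hypothetical stand-in classes (`Base`, `Parent`, `Child`) that are not part of OTX.

```python
class Base:
    def hook(self, *args, **kwargs):
        return ("Base", args, kwargs)


class Parent(Base):
    def hook(self, *args, **kwargs):
        return ("Parent", args, kwargs)


class Child(Parent):
    def hook_star(self, *args, **kwargs):
        # super(Parent, self) skips Parent.hook and resolves to Base.hook.
        # `*kwargs` iterates over the dict, so its KEYS become positional args.
        return super(Parent, self).hook(*args, *kwargs)

    def hook_double_star(self, *args, **kwargs):
        # `**kwargs` forwards keyword arguments as keyword arguments.
        return super(Parent, self).hook(*args, **kwargs)


child = Child()
star = child.hook_star(1, strict=False)
double_star = child.hook_double_star(1, strict=False)
print(star)
print(double_star)
```

In both calls the result comes from `Base.hook`, but only the `**kwargs` variant preserves `strict=False` as a keyword argument.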
diff --git a/src/otx/algorithms/detection/adapters/mmdet/models/layers/__init__.py b/src/otx/algorithms/detection/adapters/mmdet/models/layers/__init__.py
@@ -5,5 +5,13 @@
 
 from .dino import CustomDINOTransformer
 from .dino_layers import CdnQueryGenerator, DINOTransformerDecoder
+from .lite_detr_layers import EfficientTransformerEncoder, EfficientTransformerLayer, SmallExpandFFN
 
-__all__ = ["CustomDINOTransformer", "DINOTransformerDecoder", "CdnQueryGenerator"]
+__all__ = [
+    "CustomDINOTransformer",
+    "DINOTransformerDecoder",
+    "CdnQueryGenerator",
+    "EfficientTransformerEncoder",
+    "EfficientTransformerLayer",
+    "SmallExpandFFN",
+]