Object Detection 👁️ Papers and Experiments

A survey of object detection papers with notes, links, and other details, along with run experiments and results.

Contents: Papers · YOLO-V4 vs YOLO-V5 · Applications · Experiments

📚 Papers

R-CNN

  • Paper: R-CNN Paper
  • Type: Two-Stage
  • mAP: 66.0 on VOC(2007)
  • Speed: 0.02 FPS
  • Backbone: VGG-16
  • Main Idea: Proposal-based detection: ~2000 region proposals from selective search, a CNN feature vector per warped proposal, SVM classification, and bounding-box regression.

Fast R-CNN

  • Paper: Fast R-CNN Paper
  • Type: Two-Stage
  • mAP: 66.9 on VOC(2007)
  • Speed: 0.5 FPS
  • Backbone: VGG-16
  • Proposals: Selective Search (external to the network)
  • Augmentation: Horizontal Flipping, Scale Jittering
  • Main Idea: A faster version of R-CNN: the image passes through the CNN once, RoI pooling extracts each proposal's features from the shared convolutional feature map, and a single network performs classification and box regression with a multi-task loss.

You Only Look Once: Unified, Real-Time Object Detection

  • Paper: YOLO v1
  • Publish date: 2016
  • Type: One Stage
  • mAP: 63.40%
  • Speed: 45 fps
  • Backbone: "designed their own convolutional backbone which was inspired by GoogLeNet
  • Head: 2 fully connected layer with grid cell and bounding boxes
  • Augmentation: For data augmentation introduce random scaling and translations of up to 20% of the original image size. also randomly adjust the exposure and saturation of the image by up to a factor of 1:5 in the HSV color space
  • Notes: Frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes"
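A minimal sketch of the grid responsibility rule described above; `responsible_cell` and its arguments are illustrative names, with `S=7` matching the paper's default grid:

```python
def responsible_cell(box_center, image_size, S=7):
    """Map an object's center (in pixels) to the S x S grid cell
    responsible for predicting it (the YOLOv1 assignment rule)."""
    cx, cy = box_center
    w, h = image_size
    col = min(int(cx / w * S), S - 1)   # grid column
    row = min(int(cy / h * S), S - 1)   # grid row
    return row, col

# Example: a 448x448 image with an object centered at (224, 100)
print(responsible_cell((224, 100), (448, 448)))  # -> (1, 3)
```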

SSD: Single Shot MultiBox Detector

  • Paper: SSD Paper
  • Publish date: 2016
  • Type: One Stage
  • mAP: "300300 input 74.3% mAP on VOC2007 and 500500 input 76.9%"
  • Speed: "input 300*300 on VOC2007 59 FPS on nvidia titan X
  • Backbone: VGG-16
  • Augmentations: Either use the entire original input image, sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9, or randomly sample a patch (see the sampling sketch below).
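A rough sketch of the IoU-constrained patch sampling described above; `sample_patch` is a hypothetical helper and the size constraints are simplified relative to the paper (which also restricts the patch aspect ratio to [1/2, 2]):

```python
import random

def jaccard(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def sample_patch(objects, img_w, img_h, min_iou, max_tries=50):
    """Sample a crop whose IoU with at least one object meets min_iou."""
    for _ in range(max_tries):
        w = random.uniform(0.3, 1.0) * img_w   # patch size in [0.3, 1]
        h = random.uniform(0.3, 1.0) * img_h   # of the original image
        x = random.uniform(0, img_w - w)
        y = random.uniform(0, img_h - h)
        patch = (x, y, x + w, y + h)
        if any(jaccard(patch, obj) >= min_iou for obj in objects):
            return patch
    return (0, 0, img_w, img_h)                # fall back to the full image
```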

YOLO9000: Better, Faster, Stronger

  • Paper: YOLO v2 Paper
  • Publish date: 2017
  • Type: One Stage
  • mAP: 78.6 on voc2007
  • Speed: 40 FPS
  • Backbone: Darknet-19
  • Augmentation: Use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc.
  • Training Details: Train the network for 160 epochs with a starting learning rate of 10⁻³, dividing it by 10 at 60 and 90 epochs, with a weight decay of 0.0005 and momentum of 0.9.
  • Notes: "1) adding batch normalization on all of the convolutional layers in YOLO get more than 2% improvement in mAP.
  1. first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. then fine tune the resulting network on detection. This gives us an increase of almost 4% mAP.
  2. Convolutional With Anchor Boxes: remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes.Even though the mAP decreases, the increase in recall happens.
  3. run k-means clustering on the training set bounding boxes to automatically find good priors.
  4. Instead of predicting offsets follow the approach of YOLO and predict location coordinates relative to the locationof the grid cell. This bounds the ground truth to fall between 0 and 1. We use a logistic.
  5. Passthrough layer that brings features from an earlier layer at 26 × 26 resolution.
  6. Training Instead of fixing the input image size change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size.
  7. Combine datasets using WordTree can train our joint model on classification and detection.
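A small sketch of the direct location prediction in note 5, following the formulas published in the paper (bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th); the function name is illustrative:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cell_x, cell_y, prior_w, prior_h):
    """YOLOv2-style decoding: logistic-bounded center offsets relative
    to the grid cell, exponential scaling of the anchor prior."""
    bx = sigmoid(tx) + cell_x   # center x, in grid-cell units
    by = sigmoid(ty) + cell_y   # center y, in grid-cell units
    bw = prior_w * np.exp(tw)   # width scaled from the anchor prior
    bh = prior_h * np.exp(th)   # height scaled from the anchor prior
    return bx, by, bw, bh
```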

YOLOv3: An Incremental Improvement

  • Paper: YOLO v3 Paper
  • Publish date: 2018
  • Type: One Stage
  • mAP: 51.5% AP50 on COCO (320×320 input)
  • Speed: 78 fps
  • Backbone: Darknet-53
  • Training Details: Trained on full images with no hard negative mining, using multi-scale training, lots of data augmentation, and batch normalization, all the standard stuff. The Darknet neural network framework is used for training and testing.
  • Notes: Better at detecting smaller objects and stronger than previous versions. To detect smaller objects, YOLOv3 uses 9 anchor boxes in total, three per scale. If you train YOLO on your own dataset, use k-means clustering to generate the 9 anchors (see the sketch below).
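A compact sketch of k-means anchor clustering with the 1 − IoU distance the YOLO papers describe; this variant updates each cluster with the mean (w, h), a common simplification:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, treating boxes as sharing a top-left corner.
    boxes: float array (N, 2); anchors: float array (k, 2)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0] * boxes[:, 1]
    union = union[:, None] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100):
    """Cluster the (w, h) of training boxes; max IoU = min (1 - IoU) distance."""
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
```

The sorted output maps naturally onto YOLOv3's scales: the three smallest anchors go to the finest feature map, the three largest to the coarsest.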

YOLOv4: Optimal Speed and Accuracy of Object Detection

  • Paper: YOLO v4 Paper
  • Publish date: 2020
  • Type: One Stage
  • mAP: 43.5% AP (65.7% AP50)
  • Speed: 65 FPS on Tesla V100
  • Backbone: CSPDarknet53
  • Head: YOLO-v3
  • Notes: YOLOv4 is a substantial improvement over YOLOv3: a new backbone architecture and modifications to the neck improved mAP (mean Average Precision) by 10% and FPS (frames per second) by 12%. It has also become easier to train this network on a single GPU.

YOLOX: Exceeding YOLO Series in 2021

  • Paper: YOLO X Paper
  • Publish date: 2021
  • Type: One Stage
  • Backbone: DarkNet53
  • Neck: SPP
  • Head: Anchor free
  • Augmentations: Random horizontal flip and color jitter; the RandomResizedCrop strategy was discarded because it largely overlaps with the planned mosaic augmentation.
  • Training Details: EMA weight updating, cosine learning-rate schedule, IoU loss and an IoU-aware branch; 300 epochs with 5 warm-up epochs on COCO train2017.
  • Notes: (1) Replacing YOLO's head with a decoupled one greatly improves convergence speed. (2) The decoupled head is essential to the end-to-end version of YOLO (see the sketch below).
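A simplified PyTorch sketch of the decoupled-head idea: classification and box regression/objectness are computed by separate branches instead of one shared conv. The real YOLOX head uses more convolutions and separate regression/objectness predictors; names and widths here are illustrative:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Separate conv branches for classification and for box/objectness."""
    def __init__(self, in_ch, num_classes, width=256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, 1)      # 1x1 channel reduction
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, num_classes, 1))       # per-location class scores
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, 4 + 1, 1))             # box (4) + objectness (1)

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)
```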

Densely Connected Convolutional Networks

  • Paper: DenseNet Paper
  • Publish date: 2017
  • Main Idea: The core of DenseNet is the dense block. A block contains multiple layers, and whereas earlier architectures connected layers only sequentially (each layer's output fed to the next), within a dense block each layer receives the feature maps of all preceding layers as input.
  • Advantages: Dense connections alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. A big advantage of DenseNets is the improved flow of information and gradients throughout the network: each layer has direct access to the gradients from the loss function and to the original input signal, leading to an implicit deep supervision. Dense connections also have a regularizing effect.
  • Notes: Features are never combined through summation before being passed into a layer; instead, they are combined by concatenation (see the sketch below). One explanation for the improved accuracy of dense convolutional networks is that individual layers receive additional supervision from the loss function through the shorter connections.
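A minimal PyTorch sketch of a dense block, showing concatenation rather than summation; the growth rate and layer count are illustrative defaults:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all preceding feature maps,
    not just the previous layer's output."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate          # channels grow each layer
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1, bias=False)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concat, not sum
            features.append(out)
        return torch.cat(features, dim=1)
```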

CSPNet: A New Backbone that Can Enhance Learning Capability of CNN

  • Paper: CSPNET Paper
  • Publish date: 2019
  • Main Idea: The main purpose of CSPNet is to achieve a richer gradient combination while reducing the amount of computation. This is achieved by partitioning the feature map of the base layer into two parts and then merging them through a cross-stage hierarchy, so that the gradient flow propagates through different network paths (see the sketch below).
  • Advantages: 1) Strengthening learning ability of a CNN. 2) Removing computational bottlenecks. 3) Reducing memory costs
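A minimal PyTorch sketch of the cross-stage split-and-merge idea; `stage` stands in for any channel-preserving stage (e.g. a dense or residual stage), and the single 1×1 transition is a simplification of the paper's transitions:

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Split the input channels; only one half passes through the heavy
    stage, then both halves are merged at the end of the stage."""
    def __init__(self, channels, stage: nn.Module):
        super().__init__()
        assert channels % 2 == 0
        self.stage = stage                        # must map C/2 -> C/2 channels
        self.transition = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)   # split along channels
        part2 = self.stage(part2)                 # only half is processed
        return self.transition(torch.cat([part1, part2], dim=1))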

Path Aggregation Network for Instance Segmentation

  • Paper: PA Net Paper
  • Publish date: 2018
  • Main Idea: Enhance the entire feature hierarchy with accurate localization signals from lower layers via bottom-up path augmentation.
  • Advantages: Features in multiple levels together are helpful for accurate prediction.
  • Notes: A max operation fuses features from different levels, which lets the network select element-wise useful information (see the sketch below).
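A tiny sketch of the element-wise max fusion mentioned in the notes, assuming the per-level features have already been pooled to the same shape:

```python
import torch

def fuse_levels(features):
    """Element-wise max over features pooled from several pyramid levels,
    keeping the strongest response per element. All tensors share a shape."""
    fused = features[0]
    for f in features[1:]:
        fused = torch.max(fused, f)
    return fused
```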

EfficientDet: Scalable and Efficient Object Detection

  • Paper: EfficientDet Paper
  • Publish date: 2020
  • Main Idea: They propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion, and a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time.
  • Advantages: Aiming to optimize both accuracy and efficiency, the goal is a family of models that can meet a wide spectrum of resource constraints.
  • Notes: When fusing different input features, most previous works simply sum them up without distinction; BiFPN instead introduces learnable weights to learn the importance of each input feature, while repeatedly applying top-down and bottom-up multi-scale feature fusion (see the sketch below).
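A sketch of BiFPN's fast normalized fusion, following the formula in the paper, O = Σᵢ wᵢ·Iᵢ / (ε + Σⱼ wⱼ), with ReLU keeping the weights non-negative; the module name is illustrative:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion: learnable non-negative weights,
    normalized so the result stays on the same scale as the inputs."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)        # keep weights >= 0
        w = w / (self.eps + w.sum())        # normalize
        return sum(wi * x for wi, x in zip(w, inputs))
```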

Squeeze-and-Excitation Networks

  • Paper: SE Net Paper
  • Publish date: 2019
  • Main Idea: They focus on the channel relationship and propose a novel architectural unit, which they term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels (see the sketch below).
  • Advantages: They show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. They further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost.
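A minimal PyTorch sketch of an SE block as the paper describes it: squeeze by global average pooling, excitation by a bottleneck MLP with sigmoid, then channel-wise rescaling. The reduction ratio of 16 is the paper's default:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrate channels by learned weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.pool(x).view(b, c)         # squeeze: (B, C) channel summary
        w = self.fc(s).view(b, c, 1, 1)     # excitation: per-channel weights
        return x * w                        # recalibrate the input channels
```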

🔎 Comparison between YOLOv4 and YOLOv5

YOLO-V4


  • Backbone: CSP-Darknet53
  • Neck: SPP, PAN
  • Head: YOLOv3 head

What they checked :

  • Input: Image, Patches, Image Pyramid
  • Backbones: VGG16 , ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, CSPDarknet53
  • Neck:
    • Additional blocks: SPP, ASPP, RFB, SAM
    • Path-aggregation blocks: FPN, PAN, NAS-FPN, Fully-connected FPN, BiFPN, ASFF, SFAM
  • Heads:
    • Dense Prediction (one-stage): RPN, SSD, YOLO, RetinaNet (anchor-based) CornerNet, CenterNet, MatrixNet, FCOS(anchor free)
    • Sparse Prediction (two-stage): Faster R-CNN, R-FCN, Mask RCNN (anchor-based), RepPoints (anchor free)

YOLO-V4 uses:

  • Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing
  • Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
  • Bag of Freebies (BoF) for detector: CIoU loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminating grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler, Optimal hyperparameters, Random training shapes
  • Bag of Specials (BoS) for detector: Mish activation, SPP block, SAM block, PAN path-aggregation block, DIoU-NMS


YOLO-V5

  • Backbone: New CSP-Darknet53
  • Neck: SPPF, New CSP-PAN
  • Head: YOLOv3 head

https://user-images.githubusercontent.com/31005897/158507974-2c275082-95bd-4c2c-839d-6aec19121f02.png


YOLO-v4 vs YOLO-v5

YOLOv4: Mish || YOLOv5: SiLU

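For reference, both activations in a few lines (standard definitions, not repo code):

```python
import torch
import torch.nn.functional as F

def mish(x):
    """YOLOv4's activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

def silu(x):
    """YOLOv5's activation (also called Swish): x * sigmoid(x)."""
    return x * torch.sigmoid(x)
```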

YOLOv4: SPP || YOLOv5: SPPF

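A sketch contrasting the two modules: SPP pools in parallel with 5/9/13 kernels, while SPPF chains three 5×5 pools, which covers the same receptive fields with less work (two chained 5×5 pools cover 9×9, three cover 13×13):

```python
import torch
import torch.nn.functional as F

def spp(x):
    """SPP (YOLOv4): parallel max-pools concatenated with the input."""
    pools = [F.max_pool2d(x, k, stride=1, padding=k // 2) for k in (5, 9, 13)]
    return torch.cat([x] + pools, dim=1)

def sppf(x):
    """SPPF (YOLOv5): three chained 5x5 max-pools; equivalent receptive
    fields to SPP, but cheaper since each pool reuses the previous one."""
    p1 = F.max_pool2d(x, 5, stride=1, padding=2)
    p2 = F.max_pool2d(p1, 5, stride=1, padding=2)
    p3 = F.max_pool2d(p2, 5, stride=1, padding=2)
    return torch.cat([x, p1, p2, p3], dim=1)
```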


💻 Applications

Object detection has a wide range of applications, including:

  • Retail
  • Self-driving cars
  • Medical diagnosis
  • Robotics
  • Animal detection in Agriculture
  • Optical character recognition
  • Automated CCTV

📈 Run Experiments and Results

All cells are FPS - mAP.

| Paper | GPU (Colab) | CPU (Colab) | Mobile (CPU) | Mobile (GPU) | Laptop (CPU) |
|---|---|---|---|---|---|
| YOLO v4 | 9 fps - 65 | - | - | - | - |
| YOLO v3 tiny | - | - | - | - | 14 fps - 33.1 |
| YOLO v5 medium | 10 fps - 67 | 2 fps - 67 | - | - | 4 fps - 67 |
| YOLO v5 small | 33 fps - 62 | 3 fps - 62 | - | - | 9 fps - 62 |
| YOLO v5 nano | 58 fps - 46 | 11 fps - 46 | - | - | 20 fps - 46 |
| YOLOX tiny | - | - | 10 fps - 32.8 | 7 fps - 32.8 | - |
| YOLOX nano | - | - | 20 fps - 25.8 | 13 fps - 25.8 | - |
| YOLO v5 320-lite-e | - | - | 27 fps - 35.1 | 17 fps - 35.1 | 20 fps - 33.7 |
| YOLO v5 416-lite-e | - | - | 20 fps - 35.1 | 12 fps - 35.1 | - |
| YOLO v5 320-lite-i8e | - | - | 21 fps - 35.1 | 22 fps - 35.1 | - |
| YOLO v5 416-lite-i8e | - | - | 17 fps - 35.1 | 17 fps - 35.1 | - |
| YOLO v5 416-lite-s | - | - | 18 fps - 42 | 11 fps - 42 | - |
| YOLO v5 416-lite-i8s | - | - | 19 fps - 42 | 17 fps - 42 | - |
| YOLO v5 512-lite-c | - | - | 9 fps - 50.9 | 6 fps - 50.9 | 12 fps - 44 |

📝 Contributing

We welcome contributions to this repository! Please open an issue or pull request for any suggestions or changes.

⭐️ Please Star This Repo ⭐️

If you found this project useful or interesting, please consider giving it a star on GitHub! This helps other users discover the project and provides valuable feedback to the maintainers.

Thank you for your support!
