3D bounding box annotations, essential for training object detection models, can be produced by professional data annotation engineers or by auto-labeling methods. However, the annotated 3D boxes are not entirely reliable and may contain various errors, and low-quality annotation degrades the performance of trained models. In this paper, we introduce SMART, a Self-supervised Multi-modal 3D bounding box Annotation eRror deTection framework. SMART generates pseudo-erroneous 3D boxes to create supervisory signals and then trains an error detector on them. The detector integrates multi-modal data and directly regresses error scores. It is trained with a novel loss function that allows some 3D bounding boxes in the initial annotation to receive high error scores (since they may themselves be erroneous), while ensuring that the generated pseudo-erroneous boxes do not receive lower error scores. Notably, SMART requires no prior knowledge and can be applied to 3D bounding boxes of arbitrary classes (i.e., it is open-vocabulary). Extensive experiments demonstrate the effectiveness of SMART in detecting errors within annotated 3D boxes, thereby helping users improve annotation quality.
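The ranking-style constraint described above could be sketched roughly as follows in PyTorch. This is a minimal illustration, not the actual SMART objective: the pairwise hinge formulation, the margin, and the tolerated fraction of high-scoring original boxes are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def ranking_error_loss(scores_original, scores_pseudo, margin=0.5, tolerate_frac=0.05):
    """Illustrative ranking-style objective (not the exact SMART loss).

    Pseudo-erroneous boxes should not score lower than original boxes, but
    the highest-scoring fraction of original boxes is left unpenalized,
    since some original annotations may themselves be erroneous.
    """
    # Drop the top `tolerate_frac` of original scores from the constraint.
    k = int(len(scores_original) * (1.0 - tolerate_frac))
    kept = torch.sort(scores_original).values[:k]
    # Pairwise hinge: every pseudo-erroneous score should exceed every kept
    # original score by at least `margin`.
    diff = kept.unsqueeze(1) - scores_pseudo.unsqueeze(0)
    return F.relu(diff + margin).mean()

# Hypothetical error scores predicted by a detector.
orig = torch.tensor([0.1, 0.8, 0.2, 0.15])   # one original box already looks erroneous
pseudo = torch.tensor([0.9, 0.7])            # injected pseudo-erroneous boxes
print(ranking_error_loss(orig, pseudo).item())
```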
In the absence of an existing 3D bounding box error detection dataset, we inject erroneous boxes into the KITTI and nuScenes (mini) 3D object detection datasets for evaluation. Erroneous boxes are generated either with PointPillars, a 3D object detection method, or by randomly changing the class of some boxes. Generating erroneous boxes with PointPillars involves three steps. First, we run a trained PointPillars model to produce candidate 3D bounding boxes. Next, for each object class, we remove candidates that intersect with the pre-existing 3D bounding boxes in the dataset. Finally, we randomly select boxes from the remaining candidates and inject them as erroneous boxes.
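As a rough sketch of these three steps (not the project's actual implementation), the overlap filtering and random selection could look like the code below; the axis-aligned bird's-eye-view overlap test, the helper names, and the threshold are simplifying assumptions.

```python
import numpy as np

def bev_overlap(box_a, box_b):
    """Rough overlap test on axis-aligned bird's-eye-view rectangles.

    Boxes are (x, y, z, l, w, h, yaw); yaw is ignored here for simplicity,
    so this only approximates a true 3D/BEV IoU.
    """
    ax1, ax2 = box_a[0] - box_a[3] / 2, box_a[0] + box_a[3] / 2
    ay1, ay2 = box_a[1] - box_a[4] / 2, box_a[1] + box_a[4] / 2
    bx1, bx2 = box_b[0] - box_b[3] / 2, box_b[0] + box_b[3] / 2
    by1, by2 = box_b[1] - box_b[4] / 2, box_b[1] + box_b[4] / 2
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = box_a[3] * box_a[4] + box_b[3] * box_b[4] - inter
    return inter / union if union > 0 else 0.0

def inject_erroneous_boxes(detections, annotations, num_inject, iou_thresh=0.1, seed=0):
    """Keep detector outputs that do not overlap any existing annotation,
    then randomly pick `num_inject` of them as injected erroneous boxes."""
    rng = np.random.default_rng(seed)
    candidates = [d for d in detections
                  if all(bev_overlap(d, a) < iou_thresh for a in annotations)]
    chosen = rng.choice(len(candidates), size=min(num_inject, len(candidates)), replace=False)
    return [candidates[i] for i in chosen]
```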
To improve the reliability of our experimental datasets, we manually correct inaccurate annotations within them. We also reassess the injected errors to ensure their accuracy, which has led to reclassifying some injected boxes as correct. The basic statistics of the resulting datasets are shown below, first for KITTI and then for nuScenes (mini).
| | Pedestrian | Cyclist | Car |
|---|---|---|---|
| Total Number of Boxes | 4502 | 1621 | 27198 |
| Number of Erroneous Boxes | 299 | 115 | 1125 |
| Error Ratio (%) | 6.64 | 7.09 | 4.14 |
The 3D bounding boxes of the KITTI dataset are stored in the 'label_add_KITTI.zip' file within this project; each file inside the archive corresponds to a sample_id. The format for each box is as follows:
Car | 0.0 | 0 | -1.56 | 564.62 | 174.59 | 616.43 | 224.74 | 1.61 | 1.66 | 3.6 | -0.69 | 1.65 | 25.21 | -1.59 | 0.0
The values within the box represent the following:
- Car: box_class
- 0.0: Not applicable
- 0: Not applicable
- -1.56: Not applicable
- 564.62: Not applicable
- 174.59: Not applicable
- 616.43: Not applicable
- 224.74: Not applicable
- 1.61: height
- 1.66: width
- 3.6: length
- -0.69: x coordinate
- 1.65: y coordinate
- 25.21: z coordinate
- -1.59: yaw
- 0.0: label_represent_if_erroneous (0.0 indicates correct, 1.0 indicates erroneous)
The x, y, and z coordinates are given in the camera coordinate system.
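For reference, here is a minimal sketch of loading one such label_add file in Python, assuming the fields appear in the order listed above; the delimiter handling and dictionary keys are illustrative choices, not part of the project's code.

```python
def parse_kitti_label_add(path):
    """Read one KITTI-style label_add file (one box per line).

    Only the fields relevant here are kept: class, dimensions (h, w, l),
    center (x, y, z) in the camera frame, yaw, and the erroneous flag.
    """
    boxes = []
    with open(path) as f:
        for line in f:
            # The split tolerates either whitespace- or pipe-separated values.
            v = line.replace("|", " ").split()
            if len(v) < 16:
                continue  # skip empty or malformed lines
            boxes.append({
                "cls": v[0],
                "h": float(v[8]), "w": float(v[9]), "l": float(v[10]),
                "x": float(v[11]), "y": float(v[12]), "z": float(v[13]),
                "yaw": float(v[14]),
                "erroneous": float(v[15]) == 1.0,
            })
    return boxes
```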
The corresponding statistics for the nuScenes (mini) dataset:

| | Pedestrian | Bicycle | Car | Bus | Motorcycle | Trailer | Truck |
|---|---|---|---|---|---|---|---|
| Total Number of Boxes | 2593 | 121 | 4232 | 318 | 268 | 55 | 474 |
| Number of Erroneous Boxes | 134 | 6 | 212 | 16 | 14 | 2 | 24 |
| Error Ratio (%) | 5.17 | 4.96 | 5.01 | 5.03 | 5.22 | 3.64 | 5.06 |
The 3D bounding boxes of the nuScenes dataset are stored in the 'label_add_nuScenes-mini.zip' file within this project; each file inside the archive corresponds to a sample_id. The format for each box is as follows:
car | 0bdebf547fc94ee19c8d28dc36f157b7 | 0.7 | -5.6 | -1.2 | 4.795 | 2.09 | 2.0 | -0.68 | -0.01 | 0.04 | -0.72 | 0.0
The values within the box represent the following:
- car: box_class
- 0bdebf547fc94ee19c8d28dc36f157b7: Not applicable
- 0.7: center_x
- -5.6: center_y
- -1.2: center_z
- 4.795: length
- 2.09: width
- 2.0: height
- (-0.68 -0.01 0.04 -0.72): Bounding box orientation as quaternion: w, x, y, z.
- 0.0: label_represent_if_erroneous (0.0 indicates correct, 1.0 indicates erroneous)
The x, y, and z coordinates are given in the LiDAR coordinate system.
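A matching sketch for the nuScenes-style files, under the same assumptions (field order as listed above; delimiter handling and key names are illustrative):

```python
def parse_nuscenes_label_add(path):
    """Read one nuScenes-style label_add file (one box per line)."""
    boxes = []
    with open(path) as f:
        for line in f:
            v = line.replace("|", " ").split()   # tolerate either delimiter
            if len(v) < 13:
                continue  # skip empty or malformed lines
            boxes.append({
                "cls": v[0],
                "token": v[1],
                "center": tuple(float(c) for c in v[2:5]),      # x, y, z (LiDAR frame)
                "size": tuple(float(s) for s in v[5:8]),        # l, w, h
                "quat_wxyz": tuple(float(q) for q in v[8:12]),  # orientation quaternion
                "erroneous": float(v[12]) == 1.0,
            })
    return boxes
```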
We conduct each experiment five times and report the mean and standard deviation of the results as mean ± std.
Results on the KITTI dataset:

| | | | Pedestrian | Pedestrian | Cyclist | Cyclist | Car | Car | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | mF-Score(%) | mAP(%) |
| VAE | 8.8±0.0 | 4.7±0.0 | 44.6±1.5 | 44.4±1.5 | 27.5±1.2 | 21.1±0.5 | 19.6±0.3 | 15.5±0.2 | 30.6±0.9 | 27.0±0.6 |
| Deep-SVDD | 8.8±0.0 | 4.6±0.0 | 39.3±0.8 | 38.0±0.6 | 47.6±1.4 | 39.9±1.0 | 16.6±0.8 | 9.3±0.4 | 34.5±1.0 | 29.1±0.6 |
| iForest | 8.8±0.0 | 4.6±0.0 | 36.7±1.4 | 25.9±0.5 | 34.9±0.3 | 22.2±0.6 | 15.3±0.4 | 11.3±0.4 | 28.9±0.5 | 19.8±0.4 |
| OCSVM | 8.8±0.0 | 4.7±0.0 | 50.8±2.1 | 52.4±2.2 | 44.0±1.9 | 49.6±0.8 | 18.7±0.5 | 14.4±0.5 | 37.8±1.3 | 38.8±0.9 |
| ECOD | 8.8±0.0 | 4.6±0.0 | 31.4±0.9 | 19.9±0.4 | 34.5±1.0 | 23.6±0.4 | 14.0±0.3 | 8.8±0.5 | 26.6±0.6 | 17.4±0.3 |
| LUNAR | 8.8±0.0 | 4.6±0.0 | 40.8±1.2 | 36.2±1.3 | 46.8±1.1 | 33.6±0.9 | 31.8±0.9 | 22.4±0.6 | 39.8±0.9 | 30.8±0.7 |
| SMART | 77.8±0.9 | 84.7±1.4 | 78.7±2.9 | 86.5±2.5 | 78.8±1.4 | 86.2±2.3 | 79.5±0.6 | 86.0±0.6 | 79.0±1.4 | 86.3±1.1 |
Results on the nuScenes (mini) dataset:

| | | | Pedestrian | Pedestrian | Bicycle | Bicycle | Car | Car | Truck | Truck | Trailer | Trailer | Bus | Bus | Motorcycle | Motorcycle | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | mF-Score(%) | mAP(%) |
| VAE | 9.6±0.0 | 5.1±0.0 | 13.5±0.1 | 7.1±0.1 | 11.1±0.0 | 5.8±0.1 | 11.8±0.1 | 5.9±0.0 | 9.9±0.0 | 4.9±0.0 | 7.1±0.0 | 2.8±0.0 | 9.7±0.0 | 3.9±0.0 | 21.4±0.9 | 8.1±0.4 | 12.1±0.1 | 5.5±0.0 |
| Deep-SVDD | 9.6±0.0 | 5.0±0.0 | 16.4±0.1 | 9.1±0.1 | 25.0±3.2 | 15.2±3.1 | 10.6±0.0 | 5.6±0.0 | 13.5±0.1 | 10.3±0.0 | 59.7±10.8 | 66.7±12.0 | 12.2±0.2 | 6.0±0.1 | 10.5±0.3 | 6.2±0.3 | 21.1±2.1 | 17.0±2.3 |
| iForest | 9.6±0.0 | 5.1±0.0 | 14.4±0.2 | 12.9±0.1 | 29.6±3.8 | 28.5±4.2 | 12.2±0.0 | 6.8±0.0 | 12.0±0.0 | 5.7±0.0 | 76.0±14.2 | 83.3±15.8 | 14.5±0.1 | 7.3±0.1 | 11.9±0.2 | 5.7±0.2 | 24.4±2.5 | 21.5±2.6 |
| OCSVM | 9.6±0.0 | 5.1±0.0 | 12.7±0.1 | 9.3±0.1 | 28.6±3.1 | 27.8±4.1 | 14.0±0.1 | 8.7±0.0 | 12.5±0.0 | 5.9±0.0 | 56.7±11.4 | 70.0±12.9 | 23.3±0.4 | 10.5±0.2 | 13.8±0.3 | 6.3±0.1 | 23.1±1.9 | 19.8±2.1 |
| ECOD | 9.6±0.0 | 5.1±0.0 | 13.3±0.2 | 10.2±0.1 | 27.3±2.4 | 15.9±1.1 | 10.7±0.1 | 6.4±0.0 | 11.1±0.0 | 6.1±0.0 | 42.4±10.0 | 41.7±9.4 | 11.1±0.0 | 5.9±0.0 | 15.2±0.2 | 6.1±0.3 | 18.7±1.6 | 13.2±1.2 |
| LUNAR | 9.6±0.0 | 5.1±0.0 | 15.4±0.1 | 11.7±0.1 | 31.6±2.1 | 28.5±2.1 | 16.7±0.1 | 10.0±0.1 | 17.3±0.1 | 9.1±0.1 | 79.8±15.3 | 81.1±12.8 | 22.2±0.2 | 13.2±0.2 | 12.1±0.2 | 6.1±0.2 | 27.9±3.1 | 22.8±2.5 |
| SMART* | 56.5±0.9 | 55.9±0.3 | 64.2±1.7 | 59.5±1.9 | 85.7±0.0 | 75.0±0.0 | 48.2±0.6 | 47.4±0.8 | 85.6±0.9 | 76.0±1.4 | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 | 94.9±1.9 | 97.3±0.8 | 82.7±0.2 | 79.3±0.2 |
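For context, the F-Score and AP values reported in the two tables above could, for example, be computed from predicted error scores with scikit-learn roughly as follows; the dummy arrays and the 0.5 decision threshold are illustrative assumptions, and the project's own evaluation code may differ.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# y_true: 1 = erroneous box, 0 = correct box; y_score: predicted error score.
y_true = np.array([0, 0, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.9, 0.2, 0.7])

ap = average_precision_score(y_true, y_score)        # threshold-free AP
f = f1_score(y_true, (y_score >= 0.5).astype(int))   # F-Score at a fixed threshold
print(f"AP = {ap:.3f}, F-Score = {f:.3f}")
```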
- Create conda environment.
conda install --yes --file requirements.txt # You may need to downgrade torch via pip to match your CUDA version
- Download the KITTI 3D object dataset.
- Download left color images of object data set (12 GB)
- Download Velodyne point clouds, if you want to use laser information (29 GB)
- Download camera calibration matrices of object data set (16 MB)
- Select a directory named YOUR_DATASET_DIR and extract the training subset of the data downloaded in the previous step into it. Also, unzip the 'label_add_KITTI.zip' file located in this project folder into YOUR_DATASET_DIR.
├── YOUR_DATASET_DIR
│ ├── calib <- data in 'data_object_calib.zip/training/calib'
│ ├── image <- data in 'data_object_image_2.zip/training/image_2'
│ ├── label_add <- data in 'label_add_KITTI.zip'
│ └── velodyne <- data in 'data_object_velodyne.zip/training/velodyne'
- Set dataPath in cfg_kitti.yaml to YOUR_DATASET_DIR (a sanity-check sketch follows these steps).
- Run the data preprocessing code to preprocess the data and generate pseudo-erroneous 3D bounding boxes.
python data_preprocessing_kitti.py
- Train the model and then use it to detect erroneous boxes in the 'label_add'.
python streamline.py --cfg_file cfg_kitti.yaml
- (Optional): To replicate our results, use the checkpoint trained for 30 epochs. Download model_30.pth, save it to the './checkpoints_kitti' directory, and then perform error detection.
python detect.py --cfg_file cfg_kitti.yaml --bg 30
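As an optional sanity check before preprocessing, the directory layout can be verified against the config, assuming dataPath sits at the top level of cfg_kitti.yaml (a hypothetical check, not part of the project's scripts):

```python
import os
import yaml

# Check that the expected KITTI sub-folders exist under dataPath
# (assumes dataPath is a top-level key of cfg_kitti.yaml).
with open("cfg_kitti.yaml") as f:
    cfg = yaml.safe_load(f)

for sub in ("calib", "image", "label_add", "velodyne"):
    path = os.path.join(cfg["dataPath"], sub)
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")
```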
- Create conda environment.
conda install --yes --file requirements.txt # You may need to downgrade torch via pip to match your CUDA version
- Download the nuScenes (mini) dataset.
- Select a directory named YOUR_DATASET_DIR and unzip the data downloaded in the previous step into it. Also, unzip the 'label_add_nuScenes-mini.zip' file located in this project folder into YOUR_DATASET_DIR (an optional load check is sketched after these steps).
├── YOUR_DATASET_DIR
│ ├── label_add <- data in 'label_add_nuScenes-mini.zip'
│ ├── lidarseg
│ ├── maps
│ ├── panoptic
│ ├── samples
│ ├── sweeps
│ └── v1.0-mini
- Set dataPath in cfg_nuscenes.yaml to YOUR_DATASET_DIR.
- Run the data preprocessing code to preprocess the data and generate pseudo-erroneous 3D bounding boxes.
python nuscenes_related/data_preprocessing_nuscenes.py
- Train the model and then use it to detect erroneous boxes in the 'label_add'.
python nuscenes_related/streamline.py --cfg_file cfg_nuscenes.yaml
- (Optional): To replicate our results, use the checkpoint trained for 35 epochs. Download model_35.pth, save it to the './checkpoints_nuscenes' directory, and then perform error detection.
python nuscenes_related/detect.py --cfg_file cfg_nuscenes.yaml --bg 35
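Optionally, the extracted mini split can be sanity-checked with the nuscenes-devkit package (assumed to be installed separately; this check is not part of the project's scripts):

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split from YOUR_DATASET_DIR to confirm it unpacked correctly.
nusc = NuScenes(version="v1.0-mini", dataroot="/path/to/YOUR_DATASET_DIR", verbose=True)
print(f"{len(nusc.sample)} samples loaded")
```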