3D bounding box annotations, essential for training object detection models, can be produced by professional data annotation engineers or by auto-labeling methods. However, the annotated 3D boxes are not entirely reliable and may contain various errors, and low-quality annotation degrades the performance of trained models. In this paper, we introduce SMART, a Self-supervised Multi-modal 3D bounding box Annotation eRror deTection framework. SMART generates pseudo-erroneous 3D boxes to create supervisory signals and then trains an error detector on them. The detector integrates multi-modal data and directly regresses error scores. It is trained with a novel loss function that allows some 3D bounding boxes in the initial annotation to receive high error scores (since they may themselves be erroneous), while ensuring that the generated pseudo-erroneous boxes do not receive lower error scores. Notably, SMART requires no prior knowledge and can be applied to 3D bounding boxes of arbitrary classes (i.e., it is open-vocabulary). Extensive experiments demonstrate the effectiveness of SMART in detecting errors within annotated 3D boxes, thereby helping users improve annotation quality.
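The ranking-style constraint described above could be sketched roughly as follows in PyTorch. This is a minimal illustration, not the actual SMART objective: the pairwise hinge formulation, the margin, and the tolerated fraction of high-scoring original boxes are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def ranking_error_loss(scores_original, scores_pseudo, margin=0.5, tolerate_frac=0.05):
    """Illustrative ranking-style objective (not the exact SMART loss).

    Pseudo-erroneous boxes should not score lower than original boxes, but
    the highest-scoring fraction of original boxes is left unpenalized,
    since some original annotations may themselves be erroneous.
    """
    # Drop the top `tolerate_frac` of original scores from the constraint.
    k = int(len(scores_original) * (1.0 - tolerate_frac))
    kept = torch.sort(scores_original).values[:k]
    # Pairwise hinge: every pseudo-erroneous score should exceed every kept
    # original score by at least `margin`.
    diff = kept.unsqueeze(1) - scores_pseudo.unsqueeze(0)
    return F.relu(diff + margin).mean()

# Hypothetical error scores predicted by a detector.
orig = torch.tensor([0.1, 0.8, 0.2, 0.15])   # one original box already looks erroneous
pseudo = torch.tensor([0.9, 0.7])            # injected pseudo-erroneous boxes
print(ranking_error_loss(orig, pseudo).item())
```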
In the absence of an existing 3D bounding box error detection dataset, we inject erroneous boxes into the KITTI and nuScenes (mini) 3D object detection datasets for evaluation. Erroneous boxes are generated either with PointPillars, a 3D object detection method, or by randomly changing the class of some boxes. Generating erroneous boxes with PointPillars involves three steps. First, we run a trained PointPillars model to produce candidate 3D bounding boxes. Next, for each object class, we remove candidates that intersect with the pre-existing 3D bounding boxes in the dataset. Finally, we randomly select boxes from the remaining candidates and inject them as erroneous boxes.
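As a rough sketch of these three steps (not the project's actual implementation), the overlap filtering and random selection could look like the code below; the axis-aligned bird's-eye-view overlap test, the helper names, and the threshold are simplifying assumptions.

```python
import numpy as np

def bev_overlap(box_a, box_b):
    """Rough overlap test on axis-aligned bird's-eye-view rectangles.

    Boxes are (x, y, z, l, w, h, yaw); yaw is ignored here for simplicity,
    so this only approximates a true 3D/BEV IoU.
    """
    ax1, ax2 = box_a[0] - box_a[3] / 2, box_a[0] + box_a[3] / 2
    ay1, ay2 = box_a[1] - box_a[4] / 2, box_a[1] + box_a[4] / 2
    bx1, bx2 = box_b[0] - box_b[3] / 2, box_b[0] + box_b[3] / 2
    by1, by2 = box_b[1] - box_b[4] / 2, box_b[1] + box_b[4] / 2
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = box_a[3] * box_a[4] + box_b[3] * box_b[4] - inter
    return inter / union if union > 0 else 0.0

def inject_erroneous_boxes(detections, annotations, num_inject, iou_thresh=0.1, seed=0):
    """Keep detector outputs that do not overlap any existing annotation,
    then randomly pick `num_inject` of them as injected erroneous boxes."""
    rng = np.random.default_rng(seed)
    candidates = [d for d in detections
                  if all(bev_overlap(d, a) < iou_thresh for a in annotations)]
    chosen = rng.choice(len(candidates), size=min(num_inject, len(candidates)), replace=False)
    return [candidates[i] for i in chosen]
```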
To improve the reliability of our experimental datasets, we manually correct inaccurate annotations within them. We also reassess the injected errors to ensure their accuracy, which has led to reclassifying some injected boxes as correct. The basic statistics of the resulting datasets are shown below, first for KITTI and then for nuScenes (mini).
| | Pedestrian | Cyclist | Car |
|---|---|---|---|
| Total Number of Boxes | 4502 | 1621 | 27198 |
| Number of Erroneous Boxes | 299 | 115 | 1125 |
| Error Ratio (%) | 6.64 | 7.09 | 4.14 |
The 3D bounding boxes of the KITTI dataset are stored in the 'label_add_KITTI.zip' file within this project; each file inside the archive corresponds to a sample_id. The format for each box is as follows:
Car | 0.0 | 0 | -1.56 | 564.62 | 174.59 | 616.43 | 224.74 | 1.61 | 1.66 | 3.6 | -0.69 | 1.65 | 25.21 | -1.59 | 0.0
The values within the box represent the following:
- Car: box_class
- 0.0: Not applicable
- 0: Not applicable
- -1.56: Not applicable
- 564.62: Not applicable
- 174.59: Not applicable
- 616.43: Not applicable
- 224.74: Not applicable
- 1.61: height
- 1.66: width
- 3.6: length
- -0.69: x coordinate
- 1.65: y coordinate
- 25.21: z coordinate
- -1.59: yaw
- 0.0: label_represent_if_erroneous (0.0 indicates correct, 1.0 indicates erroneous)
The x, y, and z coordinates are given in the camera coordinate system.
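For reference, here is a minimal sketch of loading one such label_add file in Python, assuming the fields appear in the order listed above; the delimiter handling and dictionary keys are illustrative choices, not part of the project's code.

```python
def parse_kitti_label_add(path):
    """Read one KITTI-style label_add file (one box per line).

    Only the fields relevant here are kept: class, dimensions (h, w, l),
    center (x, y, z) in the camera frame, yaw, and the erroneous flag.
    """
    boxes = []
    with open(path) as f:
        for line in f:
            # The split tolerates either whitespace- or pipe-separated values.
            v = line.replace("|", " ").split()
            if len(v) < 16:
                continue  # skip empty or malformed lines
            boxes.append({
                "cls": v[0],
                "h": float(v[8]), "w": float(v[9]), "l": float(v[10]),
                "x": float(v[11]), "y": float(v[12]), "z": float(v[13]),
                "yaw": float(v[14]),
                "erroneous": float(v[15]) == 1.0,
            })
    return boxes
```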
The corresponding statistics for the nuScenes (mini) dataset:

| | Pedestrian | Bicycle | Car | Bus | Motorcycle | Trailer | Truck |
|---|---|---|---|---|---|---|---|
| Total Number of Boxes | 2593 | 121 | 4232 | 318 | 268 | 55 | 474 |
| Number of Erroneous Boxes | 134 | 6 | 212 | 16 | 14 | 2 | 24 |
| Error Ratio (%) | 5.17 | 4.96 | 5.01 | 5.03 | 5.22 | 3.64 | 5.06 |
The 3D bounding boxes of the nuScenes dataset are stored in the 'label_add_nuScenes-mini.zip' file within this project; each file inside the archive corresponds to a sample_id. The format for each box is as follows:
car | 0bdebf547fc94ee19c8d28dc36f157b7 | 0.7 | -5.6 | -1.2 | 4.795 | 2.09 | 2.0 | -0.68 | -0.01 | 0.04 | -0.72 | 0.0
The values within the box represent the following:
- car: box_class
- 0bdebf547fc94ee19c8d28dc36f157b7: Not applicable
- 0.7: center_x
- -5.6: center_y
- -1.2: center_z
- 4.795: length
- 2.09: width
- 2.0: height
- (-0.68 -0.01 0.04 -0.72): Bounding box orientation as quaternion: w, x, y, z.
- 0.0: label_represent_if_erroneous (0.0 indicates correct, 1.0 indicates erroneous)
The x, y, and z coordinates are given in the LiDAR coordinate system.
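A matching sketch for the nuScenes-style files, under the same assumptions (field order as listed above; delimiter handling and key names are illustrative):

```python
def parse_nuscenes_label_add(path):
    """Read one nuScenes-style label_add file (one box per line)."""
    boxes = []
    with open(path) as f:
        for line in f:
            v = line.replace("|", " ").split()   # tolerate either delimiter
            if len(v) < 13:
                continue  # skip empty or malformed lines
            boxes.append({
                "cls": v[0],
                "token": v[1],
                "center": tuple(float(c) for c in v[2:5]),      # x, y, z (LiDAR frame)
                "size": tuple(float(s) for s in v[5:8]),        # l, w, h
                "quat_wxyz": tuple(float(q) for q in v[8:12]),  # orientation quaternion
                "erroneous": float(v[12]) == 1.0,
            })
    return boxes
```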
We conduct each experiment five times and report the mean and standard deviation of the results as mean ± std.
Results on the KITTI dataset:

| | | | Pedestrian | Pedestrian | Cyclist | Cyclist | Car | Car | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | mF-Score(%) | mAP(%) |
| VAE | 8.8±0.0 | 4.7±0.0 | 44.6±1.5 | 44.4±1.5 | 27.5±1.2 | 21.1±0.5 | 19.6±0.3 | 15.5±0.2 | 30.6±0.9 | 27.0±0.6 |
| Deep-SVDD | 8.8±0.0 | 4.6±0.0 | 39.3±0.8 | 38.0±0.6 | 47.6±1.4 | 39.9±1.0 | 16.6±0.8 | 9.3±0.4 | 34.5±1.0 | 29.1±0.6 |
| iForest | 8.8±0.0 | 4.6±0.0 | 36.7±1.4 | 25.9±0.5 | 34.9±0.3 | 22.2±0.6 | 15.3±0.4 | 11.3±0.4 | 28.9±0.5 | 19.8±0.4 |
| OCSVM | 8.8±0.0 | 4.7±0.0 | 50.8±2.1 | 52.4±2.2 | 44.0±1.9 | 49.6±0.8 | 18.7±0.5 | 14.4±0.5 | 37.8±1.3 | 38.8±0.9 |
| ECOD | 8.8±0.0 | 4.6±0.0 | 31.4±0.9 | 19.9±0.4 | 34.5±1.0 | 23.6±0.4 | 14.0±0.3 | 8.8±0.5 | 26.6±0.6 | 17.4±0.3 |
| LUNAR | 8.8±0.0 | 4.6±0.0 | 40.8±1.2 | 36.2±1.3 | 46.8±1.1 | 33.6±0.9 | 31.8±0.9 | 22.4±0.6 | 39.8±0.9 | 30.8±0.7 |
| SMART | 77.8±0.9 | 84.7±1.4 | 78.7±2.9 | 86.5±2.5 | 78.8±1.4 | 86.2±2.3 | 79.5±0.6 | 86.0±0.6 | 79.0±1.4 | 86.3±1.1 |
Results on the nuScenes (mini) dataset:

| | | | Pedestrian | Pedestrian | Bicycle | Bicycle | Car | Car | Truck | Truck | Trailer | Trailer | Bus | Bus | Motorcycle | Motorcycle | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | F-Score(%) | AP(%) | mF-Score(%) | mAP(%) |
| VAE | 9.6±0.0 | 5.1±0.0 | 13.5±0.1 | 7.1±0.1 | 11.1±0.0 | 5.8±0.1 | 11.8±0.1 | 5.9±0.0 | 9.9±0.0 | 4.9±0.0 | 7.1±0.0 | 2.8±0.0 | 9.7±0.0 | 3.9±0.0 | 21.4±0.9 | 8.1±0.4 | 12.1±0.1 | 5.5±0.0 |
| Deep-SVDD | 9.6±0.0 | 5.0±0.0 | 16.4±0.1 | 9.1±0.1 | 25.0±3.2 | 15.2±3.1 | 10.6±0.0 | 5.6±0.0 | 13.5±0.1 | 10.3±0.0 | 59.7±10.8 | 66.7±12.0 | 12.2±0.2 | 6.0±0.1 | 10.5±0.3 | 6.2±0.3 | 21.1±2.1 | 17.0±2.3 |
| iForest | 9.6±0.0 | 5.1±0.0 | 14.4±0.2 | 12.9±0.1 | 29.6±3.8 | 28.5±4.2 | 12.2±0.0 | 6.8±0.0 | 12.0±0.0 | 5.7±0.0 | 76.0±14.2 | 83.3±15.8 | 14.5±0.1 | 7.3±0.1 | 11.9±0.2 | 5.7±0.2 | 24.4±2.5 | 21.5±2.6 |
| OCSVM | 9.6±0.0 | 5.1±0.0 | 12.7±0.1 | 9.3±0.1 | 28.6±3.1 | 27.8±4.1 | 14.0±0.1 | 8.7±0.0 | 12.5±0.0 | 5.9±0.0 | 56.7±11.4 | 70.0±12.9 | 23.3±0.4 | 10.5±0.2 | 13.8±0.3 | 6.3±0.1 | 23.1±1.9 | 19.8±2.1 |
| ECOD | 9.6±0.0 | 5.1±0.0 | 13.3±0.2 | 10.2±0.1 | 27.3±2.4 | 15.9±1.1 | 10.7±0.1 | 6.4±0.0 | 11.1±0.0 | 6.1±0.0 | 42.4±10.0 | 41.7±9.4 | 11.1±0.0 | 5.9±0.0 | 15.2±0.2 | 6.1±0.3 | 18.7±1.6 | 13.2±1.2 |
| LUNAR | 9.6±0.0 | 5.1±0.0 | 15.4±0.1 | 11.7±0.1 | 31.6±2.1 | 28.5±2.1 | 16.7±0.1 | 10.0±0.1 | 17.3±0.1 | 9.1±0.1 | 79.8±15.3 | 81.1±12.8 | 22.2±0.2 | 13.2±0.2 | 12.1±0.2 | 6.1±0.2 | 27.9±3.1 | 22.8±2.5 |
| SMART* | 56.5±0.9 | 55.9±0.3 | 64.2±1.7 | 59.5±1.9 | 85.7±0.0 | 75.0±0.0 | 48.2±0.6 | 47.4±0.8 | 85.6±0.9 | 76.0±1.4 | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 | 94.9±1.9 | 97.3±0.8 | 82.7±0.2 | 79.3±0.2 |
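For context, the F-Score and AP values reported in the two tables above could, for example, be computed from predicted error scores with scikit-learn roughly as follows; the dummy arrays and the 0.5 decision threshold are illustrative assumptions, and the project's own evaluation code may differ.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# y_true: 1 = erroneous box, 0 = correct box; y_score: predicted error score.
y_true = np.array([0, 0, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.9, 0.2, 0.7])

ap = average_precision_score(y_true, y_score)        # threshold-free AP
f = f1_score(y_true, (y_score >= 0.5).astype(int))   # F-Score at a fixed threshold
print(f"AP = {ap:.3f}, F-Score = {f:.3f}")
```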
- Create conda environment.
conda install --yes --file requirements.txt # You may need to downgrade torch via pip to match your CUDA version
- Download the KITTI 3D object dataset.
- Download left color images of object data set (12 GB)
- Download Velodyne point clouds, if you want to use laser information (29 GB)
- Download camera calibration matrices of object data set (16 MB)
- Select a directory named YOUR_DATASET_DIR and extract the training subset of the data downloaded in the previous step into it. Also, unzip the 'label_add_KITTI.zip' file located in this project folder into YOUR_DATASET_DIR.
├── YOUR_DATASET_DIR
│ ├── calib <- data in 'data_object_calib.zip/training/calib'
│ ├── image <- data in 'data_object_image_2.zip/training/image_2'
│ ├── label_add <- data in 'label_add_KITTI.zip'
│ └── velodyne <- data in 'data_object_velodyne.zip/training/velodyne'
- Set dataPath in cfg_kitti.yaml to YOUR_DATASET_DIR (a sanity-check sketch follows these steps).
- Run the data preprocessing code to preprocess the data and generate pseudo-erroneous 3D bounding boxes.
python data_preprocessing_kitti.py
- Train the model and then use it to detect erroneous boxes in the 'label_add'.
python streamline.py --cfg_file cfg_kitti.yaml
- (Optional): To replicate our results, use the checkpoint trained for 30 epochs. Download model_30.pth, save it to the './checkpoints_kitti' directory, and then perform error detection.
python detect.py --cfg_file cfg_kitti.yaml --bg 30
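As an optional sanity check before preprocessing, the directory layout can be verified against the config, assuming dataPath sits at the top level of cfg_kitti.yaml (a hypothetical check, not part of the project's scripts):

```python
import os
import yaml

# Check that the expected KITTI sub-folders exist under dataPath
# (assumes dataPath is a top-level key of cfg_kitti.yaml).
with open("cfg_kitti.yaml") as f:
    cfg = yaml.safe_load(f)

for sub in ("calib", "image", "label_add", "velodyne"):
    path = os.path.join(cfg["dataPath"], sub)
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")
```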
- Create conda environment.
conda install --yes --file requirements.txt # You may need to downgrade torch via pip to match your CUDA version
- Download the nuScenes (mini) dataset.
- Select a directory named YOUR_DATASET_DIR and unzip the data downloaded in the previous step into it. Also, unzip the 'label_add_nuScenes-mini.zip' file located in this project folder into YOUR_DATASET_DIR (an optional load check is sketched after these steps).
├── YOUR_DATASET_DIR
│ ├── label_add <- data in 'label_add_nuScenes-mini.zip'
│ ├── lidarseg
│ ├── maps
│ ├── panoptic
│ ├── samples
│ ├── sweeps
│ └── v1.0-mini
- Set dataPath in cfg_nuscenes.yaml to YOUR_DATASET_DIR.
- Run the data preprocessing code to preprocess the data and generate pseudo-erroneous 3D bounding boxes.
python nuscenes_related/data_preprocessing_nuscenes.py
- Train the model and then use it to detect erroneous boxes in the 'label_add'.
python nuscenes_related/streamline.py --cfg_file cfg_nuscenes.yaml
- (Optional): To replicate our results, use the checkpoint trained for 35 epochs. Download model_35.pth, save it to the './checkpoints_nuscenes' directory, and then perform error detection.
python nuscenes_related/detect.py --cfg_file cfg_nuscenes.yaml --bg 35
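Optionally, the extracted mini split can be sanity-checked with the nuscenes-devkit package (assumed to be installed separately; this check is not part of the project's scripts):

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split from YOUR_DATASET_DIR to confirm it unpacked correctly.
nusc = NuScenes(version="v1.0-mini", dataroot="/path/to/YOUR_DATASET_DIR", verbose=True)
print(f"{len(nusc.sample)} samples loaded")
```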