Add comparison level-up doc #1174

Merged · 8 commits · Oct 23, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
(<https://github.com/openvinotoolkit/datumaro/pull/1153>)
- Change image default dtype from float32 to uint8
(<https://github.com/openvinotoolkit/datumaro/pull/1175>)
- Add comparison level-up doc
(<https://github.com/openvinotoolkit/datumaro/pull/1174>)

### Bug fixes
- Modify the draw function in the visualizer not to raise an error for unsupported annotation types.
124 changes: 101 additions & 23 deletions docs/source/docs/command-reference/context_free/compare.md
@@ -2,24 +2,20 @@

## Compare datasets

This command compares two datasets and saves the results in the specified directory. The current project is considered to be "ground truth".

Datasets can be compared using different methods:
- [`table`](#table) - Generate a compare table mainly based on dataset statistics
- [`equality`](#equality) - Annotations are compared to be equal
- [`distance`](#distance) - A distance metric is used

This command has multiple forms:
```console
1) datum compare <revpath>
2) datum compare <revpath> <revpath>
```


1 - Compares the current project's main target (`project`) in the working tree with the specified dataset.
2 - Compares two specified datasets.

\<revpath\> - [a dataset path or a revision path](../../user-manual/how_to_use_datumaro.md#dataset-path-concepts).
@@ -59,17 +55,69 @@ Parameters:
- `--all` - Include matches in the output. By default, only differences are
printed.

### Supported methods

#### `table`
This method compares datasets based on dataset statistics and presents the results in a tabular format. The report is saved as `table_compare.json` and `table_compare.txt`, each containing a "High-level comparison," a "Mid-level comparison," and a "Low-level comparison."

Firstly, the "High-level comparison" provides information regarding the format, classes, images, and annotations for each dataset. For example:
```bash
+--------------------------+---------+---------------------+
| Field | First | Second |
+==========================+=========+=====================+
| Format | coco | voc |
+--------------------------+---------+---------------------+
| Number of classes | 2 | 4 |
+--------------------------+---------+---------------------+
| Common classes | a, b | a, b |
+--------------------------+---------+---------------------+
| Classes | a, b | a, b, background, c |
+--------------------------+---------+---------------------+
| Images count | 1 | 1 |
+--------------------------+---------+---------------------+
| Unique images count | 1 | 1 |
+--------------------------+---------+---------------------+
| Repeated images count | 0 | 0 |
+--------------------------+---------+---------------------+
| Annotations count | 1 | 2 |
+--------------------------+---------+---------------------+
| Unannotated images count | 0 | 0 |
+--------------------------+---------+---------------------+
```
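
The "Common classes" row, for instance, amounts to a set intersection of the two datasets' label sets. As a minimal illustrative sketch (not Datumaro's internal code), using the class names from the table above:

```python
# Class names as reported in the "Classes" row above.
first_classes = {"a", "b"}
second_classes = {"a", "b", "background", "c"}

# "Common classes" is the intersection, sorted for stable display.
common = sorted(first_classes & second_classes)
print(", ".join(common))  # a, b
```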

Secondly, the "Mid-level comparison" displays image means, standard deviations, and label distributions for each subset in the datasets. For example:
```bash
+--------------------+--------------------------+--------------------------+
| Field | First | Second |
+====================+==========================+==========================+
| train - Image Mean | 1.00, 1.00, 1.00 | 1.00, 1.00, 1.00 |
+--------------------+--------------------------+--------------------------+
| train - Image Std | 0.00, 0.00, 0.00 | 0.00, 0.00, 0.00 |
+--------------------+--------------------------+--------------------------+
| Label - a | imgs: 1, percent: 1.0000 | |
+--------------------+--------------------------+--------------------------+
| Label - b | | imgs: 1, percent: 0.5000 |
+--------------------+--------------------------+--------------------------+
| Label - background | | |
+--------------------+--------------------------+--------------------------+
| Label - c | | imgs: 1, percent: 0.5000 |
+--------------------+--------------------------+--------------------------+
```
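
The per-channel statistics above can be reproduced with a few lines of NumPy. This is an illustrative sketch (not Datumaro's internal code), assuming pixel values normalized to [0, 1]:

```python
import numpy as np

# A constant all-ones image: every channel has mean 1.00 and std 0.00,
# matching the "train - Image Mean" / "train - Image Std" rows above.
img = np.ones((4, 4, 3), dtype=np.float32)  # H x W x C

mean = img.reshape(-1, 3).mean(axis=0)  # per-channel mean
std = img.reshape(-1, 3).std(axis=0)    # per-channel std

print(", ".join(f"{v:.2f}" for v in mean))  # 1.00, 1.00, 1.00
print(", ".join(f"{v:.2f}" for v in std))   # 0.00, 0.00, 0.00
```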

Lastly, the "Low-level comparison" uses the ShiftAnalyzer to report the covariate shift and label shift between the two datasets. For example:
```bash
+-----------------+---------+
| Field | Value |
+=================+=========+
| Covariate shift | 0 |
+-----------------+---------+
| Label shift | nan |
+-----------------+---------+
```

- Compare the current working tree with a dataset in COCO data format to create the tabular report
```console
datum compare <path/to/dataset2/>:coco
```

@@ -82,11 +130,6 @@ Parameters:
- Compare two projects to create the tabular report
```console
datum compare <path/to/dataset1/>:voc <path/to/dataset2/>:coco
```


- Compare a source from a previous revision and a dataset to create the tabular report
```console
datum compare HEAD~2:source-2 <path/to/dataset2/>:yolo
```
@@ -100,3 +143,38 @@ Parameters:
```console
datum transform <...> -o inference
datum compare inference -o compare
```

#### `equality`
This method checks how identical the items and annotations of two datasets are. It reports the number of unmatched items in each project (dataset), the number of conflicting items, and the counts of matching and mismatching annotations. For example:
```bash
Found:
The first project has 10 unmatched items
The second project has 100 unmatched items
1 item conflicts
10 matching annotations
0 mismatching annotations
```
The detailed information is stored in `equality_compare.json`. If you'd like to review the specific details, please refer to this file.
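
As a rough illustration (not Datumaro's implementation), the unmatched-item counts reported above boil down to a set difference over item IDs:

```python
# Hypothetical item IDs; in practice these come from each dataset's items.
first = {"img_001", "img_002", "img_003"}
second = {"img_002", "img_003", "img_004"}

print(f"The first project has {len(first - second)} unmatched items")
print(f"The second project has {len(second - first)} unmatched items")
```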

- Compare two projects for equality, excluding annotation groups and the `is_crowd` attribute from the comparison
```console
datum compare <path/to/other/project/> -m equality -if group -ia is_crowd
```

#### `distance`
This method measures the consistency of annotations between dataset items. It presents the count of matched annotations between two items in a tabular format, comparing the numbers of label, bbox, polygon, and mask annotations. Additionally, it generates a confusion matrix for each annotation type, saved as `<annotation_type>_confusion.png`, and highlights cases where labels mismatch. For example:
```bash
Datasets have mismatching labels:
#0: a != background
#1: b != a
#2: < b
#3: < c
```

- Compare two projects by distance, match boxes if IoU > 0.7,
save results to TensorBoard
```console
datum compare <path/to/other/project/> -m distance -f tensorboard --iou-thresh 0.7 -o compare/
```
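
Box matching in the distance comparison is driven by the `--iou-thresh` value. As a minimal, self-contained illustration of the criterion (not Datumaro's implementation), the IoU of two axis-aligned boxes can be computed as:

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    # Overlap extents, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# Identical boxes match at any threshold; a slightly shifted box may
# fall below a strict threshold such as 0.7.
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))          # 1.0
print(iou((0, 0, 10, 10), (1, 1, 10, 10)) > 0.7)    # False (IoU ~ 0.68)
```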
@@ -2,4 +2,54 @@
Level 6: Data Comparison with Two Heterogeneous Datasets
========================================================

Comparison is a fundamental tool for identifying and understanding the discrepancies and variations that exist between datasets.
It enables a comprehensive assessment of differences in data distribution, format, and annotation standards across sources, paving the way for a streamlined and effective dataset consolidation process.
In essence, it serves as the cornerstone for achieving a cohesive and comprehensive large-scale dataset, a critical requirement for training deep learning models.

In this tutorial, we provide a simple example for comparing two datasets, and the detailed description of the comparison operation is given in the :doc:`Compare <../../command-reference/context_free/compare>` section.

Comparing Datasets
==================

.. tab-set::

.. tab-item:: CLI

Without the project declaration, you can simply compare multiple datasets using the following command:

.. code-block:: bash

datum compare <path/to/dataset1> <path/to/dataset2> -o result

In this case, the ``table`` method is used to generate a comparison table. You will have the comparison report named ``table_compare.json`` and ``table_compare.txt`` inside the output directory.

To compare if annotations are equal, use:

.. code-block:: bash

datum compare <path/to/dataset1> <path/to/dataset2> -m equality -o result

You will have the comparison report named ``equality_compare.json`` inside the output directory.

To compare a dataset from another project with a distance metric, use:

.. code-block:: bash

datum compare <path/to/other/project/> -m distance -o result

You will have the comparison report named ``<annotation_type>_confusion.png`` inside the output directory. If there is a label difference, then a ``label_confusion`` result will be created. This supports ``label``, ``bbox``, ``polygon``, and ``mask`` annotation types.

.. tab-item:: ProjectCLI

With the project-based CLI, you can compare the current project's main target (project) in the working tree with the specified dataset using the following command:

.. code-block:: bash

datum compare <path/to/specified/dataset>

You can also simply compare multiple datasets by using:

.. code-block:: bash

datum compare <path/to/dataset1> <path/to/dataset2>
@@ -22,9 +22,9 @@ We here download two aerial datasets named by Eurosat and UC Merced as a simple

.. code-block:: bash

datum download get -i tfds:eurosat -f imagenet --output-dir <path/to/eurosat> -- --save-media

datum download get -i tfds:uc_merced -f imagenet --output-dir <path/to/uc_merced> -- --save-media

Merge datasets
==============
3 changes: 2 additions & 1 deletion docs/source/docs/level-up/intermediate_skills/index.rst
@@ -37,7 +37,8 @@ Intermediate Skills

Level 06: Dataset Comparison

:bdg-info:`CLI`
:bdg-success:`ProjectCLI`

.. grid-item-card::
