Add comparison level-up doc #1174

Merged
merged 8 commits into from
Oct 23, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
Expand Up @@ -12,6 +12,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
(<https://github.com/openvinotoolkit/datumaro/pull/1153>)
- Change image default dtype from float32 to uint8
(<https://github.com/openvinotoolkit/datumaro/pull/1175>)
- Add comparison level-up doc
(<https://github.com/openvinotoolkit/datumaro/pull/1174>)

### Bug fixes
- Modify the draw function in the visualizer not to raise an error for unsupported annotation types.
Expand Down
124 changes: 101 additions & 23 deletions docs/source/docs/command-reference/context_free/compare.md
Expand Up @@ -2,24 +2,20 @@

## Compare datasets

This command compares two datasets and saves the results in the specified directory. The current project is considered to be "ground truth".

Datasets can be compared using different methods:
- [`table`](#table) - Generates a comparison table based mainly on dataset statistics
- [`equality`](#equality) - Checks annotations for equality
- [`distance`](#distance) - Compares annotations using a distance metric

This command has multiple forms:
```console
1) datum compare <revpath>
2) datum compare <revpath> <revpath>
```

1 - Compares the current project's main target (`project`) in the working tree with the specified dataset.
2 - Compares two specified datasets.

\<revpath\> - [a dataset path or a revision path](../../user-manual/how_to_use_datumaro.md#dataset-path-concepts).
Expand Down Expand Up @@ -59,17 +55,69 @@ Parameters:
- `--all` - Include matches in the output. By default, only differences are
printed.

<!-- markdownlint-disable-line MD028 -->Examples:
- Compare two projects by distance, match boxes if IoU > 0.7,
save results to TensorBoard
```console
datum compare <path/to/other/project/> -m distance -f tensorboard --iou-thresh 0.7 -o compare/
```
### Supported methods
#### `table`
This method compares datasets based on dataset statistics and presents the results in tabular form. The report is saved as `table_compare.json` and `table_compare.txt`, each containing a "High-level comparison", a "Mid-level comparison", and a "Low-level comparison".

Firstly, the "High-level comparison" provides information regarding the format, classes, images, and annotations for each dataset. For example:
```bash
+--------------------------+---------+---------------------+
| Field | First | Second |
+==========================+=========+=====================+
| Format | coco | voc |
+--------------------------+---------+---------------------+
| Number of classes | 2 | 4 |
+--------------------------+---------+---------------------+
| Common classes | a, b | a, b |
+--------------------------+---------+---------------------+
| Classes | a, b | a, b, background, c |
+--------------------------+---------+---------------------+
| Images count | 1 | 1 |
+--------------------------+---------+---------------------+
| Unique images count | 1 | 1 |
+--------------------------+---------+---------------------+
| Repeated images count | 0 | 0 |
+--------------------------+---------+---------------------+
| Annotations count | 1 | 2 |
+--------------------------+---------+---------------------+
| Unannotated images count | 0 | 0 |
+--------------------------+---------+---------------------+
```
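
The "Number of classes" and "Common classes" rows can be read as simple set operations over the two label lists. The sketch below is only an illustration of that idea and does not use Datumaro's API; the function name `summarize_classes` is hypothetical.

```python
def summarize_classes(first, second):
    """Return per-dataset class counts and the classes shared by both datasets."""
    first_set, second_set = set(first), set(second)
    return {
        "num_classes": (len(first_set), len(second_set)),
        "common": sorted(first_set & second_set),  # intersection of label names
    }


# Label lists taken from the example table above.
summary = summarize_classes(["a", "b"], ["a", "b", "background", "c"])
print(summary["num_classes"])  # (2, 4)
print(summary["common"])       # ['a', 'b']
```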

Secondly, the "Mid-level comparison" displays image means, standard deviations, and label distributions for each subset in the datasets. For example:
```bash
+--------------------+--------------------------+--------------------------+
| Field | First | Second |
+====================+==========================+==========================+
| train - Image Mean | 1.00, 1.00, 1.00 | 1.00, 1.00, 1.00 |
+--------------------+--------------------------+--------------------------+
| train - Image Std | 0.00, 0.00, 0.00 | 0.00, 0.00, 0.00 |
+--------------------+--------------------------+--------------------------+
| Label - a | imgs: 1, percent: 1.0000 | |
+--------------------+--------------------------+--------------------------+
| Label - b | | imgs: 1, percent: 0.5000 |
+--------------------+--------------------------+--------------------------+
| Label - background | | |
+--------------------+--------------------------+--------------------------+
| Label - c | | imgs: 1, percent: 0.5000 |
+--------------------+--------------------------+--------------------------+
```
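
As a rough illustration of the label rows, the sketch below computes a per-label count and share from a flat list of label names. It assumes that `percent` is the label's share of all labels in the dataset, which matches the numbers in the example but is not confirmed here; `label_distribution` is a hypothetical helper, not part of Datumaro.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label name to (count, share of all labels)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {name: (n, n / total) for name, n in sorted(counts.items())}


# The second dataset in the example has one "b" and one "c" annotation.
for name, (n, share) in label_distribution(["b", "c"]).items():
    print(f"Label - {name}: imgs: {n}, percent: {share:.4f}")
```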

Lastly, the "Low-level comparison" uses `ShiftAnalyzer` to report the covariate shift and label shift between the two datasets. For example:
```bash
+-----------------+---------+
| Field | Value |
+=================+=========+
| Covariate shift | 0 |
+-----------------+---------+
| Label shift | nan |
+-----------------+---------+
```
The results are stored in `table_compare.json` and `table_compare.txt`.

- Compare the current working tree and a dataset using the `table` method
```console
datum compare <path/to/dataset2/>:coco
```

- Compare two projects using the `table` method
Expand All @@ -82,11 +130,6 @@ Parameters:
datum compare <path/to/dataset1/>:voc <path/to/dataset2/>:coco
```

- Compare a source from a previous revision and a dataset using the `table` method
```console
datum compare HEAD~2:source-2 <path/to/dataset2/>:yolo
Expand All @@ -100,3 +143,38 @@ Parameters:
datum transform <...> -o inference
datum compare inference -o compare
```

#### `equality`
This method reports how closely the items and annotations of two datasets match. It gives the number of unmatched items in each project (dataset), the number of conflicting items, and the counts of matching and mismatching annotations. For example:
```bash
Found:
The first project has 10 unmatched items
The second project has 100 unmatched items
1 item conflicts
10 matching annotations
0 mismatching annotations
```
Detailed results are stored in `equality_compare.json`.

Annotations are compared to be equal
- Compare two projects for equality, exclude annotation groups
and the `is_crowd` attribute from comparison
```console
datum compare <path/to/other/project/> -m equality -if group -ia is_crowd
```
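
The "unmatched items" counts can be pictured as set differences over item identifiers. The sketch below assumes items are identified by an `(id, subset)` pair, which is an assumption made for illustration; `unmatched_items` is a hypothetical helper and not Datumaro's implementation.

```python
def unmatched_items(first_ids, second_ids):
    """Return the identifiers present in only one of the two datasets."""
    first, second = set(first_ids), set(second_ids)
    return sorted(first - second), sorted(second - first)


only_first, only_second = unmatched_items(
    [("img1", "train"), ("img2", "train")],
    [("img2", "train"), ("img3", "val")],
)
print(f"The first project has {len(only_first)} unmatched items")
print(f"The second project has {len(only_second)} unmatched items")
```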

#### `distance`
This method measures how consistent the annotations are between matching dataset items. It reports the number of matched annotations between two items in a table, covering label, bbox, polygon, and mask annotations. It also generates a confusion matrix for each annotation type, saved as `<annotation_type>_confusion.png`, and highlights cases where labels mismatch. For example:
```bash
Datasets have mismatching labels:
#0: a != background
#1: b != a
#2: < b
#3: < c
```

- Compare two projects by distance, match boxes if IoU > 0.7,
save results to TensorBoard
```console
datum compare <path/to/other/project/> -m distance -f tensorboard --iou-thresh 0.7 -o compare/
```
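
To make the `--iou-thresh` criterion concrete, the sketch below computes the intersection over union (IoU) of two axis-aligned boxes and applies the "IoU > 0.7" rule from the example above. It assumes boxes are given as `(x1, y1, x2, y2)` corners; the function names `iou` and `is_match` are hypothetical and this is not Datumaro's matching code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(box_a, box_b, iou_thresh=0.7):
    """Treat two boxes as matching when their IoU exceeds the threshold."""
    return iou(box_a, box_b) > iou_thresh


print(iou((0, 0, 10, 10), (0, 0, 10, 8)))         # 0.8
print(is_match((0, 0, 10, 10), (0, 0, 10, 8)))    # True
print(is_match((0, 0, 10, 10), (5, 5, 15, 15)))   # False
```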
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,54 @@
Level 6: Data Comparison with Two Heterogeneous Datasets
========================================================

Comparison helps users identify and understand the discrepancies between datasets.
It allows a comprehensive assessment of differences in data distribution, format, and annotation standards across sources.
Pinpointing these differences paves the way for a streamlined and effective dataset consolidation process.
This in turn supports building a cohesive, large-scale dataset, which is a critical requirement for training deep learning models.

This tutorial provides a simple example of comparing two datasets. A detailed description of the comparison operation is given in the :doc:`Compare <../../command-reference/context_free/compare>` section.

Comparing Datasets
==================

.. tab-set::

.. tab-item:: CLI

Without declaring a project, you can compare two datasets directly using the following command:

.. code-block:: bash

datum compare <path/to/dataset1> <path/to/dataset2> -o result

In this case, the ``table`` method is used to generate a comparison table. The comparison report is written to ``table_compare.json`` and ``table_compare.txt`` inside the output directory.

To compare if annotations are equal, use:

.. code-block:: bash

datum compare <path/to/dataset1> <path/to/dataset2> -m equality -o result

The comparison report is written to ``equality_compare.json`` inside the output directory.

To compare a dataset from another project with a distance metric, use:

.. code-block:: bash

datum compare <path/to/other/project/> -m distance -o result

The comparison report is written to ``<annotation_type>_confusion.png`` inside the output directory. If the labels differ, a ``label_confusion`` result is also created. The supported annotation types are ``label``, ``bbox``, ``polygon``, and ``mask``.

.. tab-item:: ProjectCLI

With the project-based CLI, you can compare the current project's main target (project) in the working tree with the specified dataset using the following command:

.. code-block:: bash

datum compare <path/to/specified/dataset>

You can also simply compare multiple datasets by using:

.. code-block:: bash

datum compare <path/to/dataset1> <path/to/dataset2>
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,9 @@ We here download two aerial datasets named by Eurosat and UC Merced as a simple

.. code-block:: bash

datum download get -i tfds:eurosat --format imagenet --output-dir <path/to/eurosat> -- --save-media
datum download get -i tfds:eurosat -f imagenet --output-dir <path/to/eurosat> -- --save-media

datum download get -i tfds:uc_merced --format imagenet --output-dir <path/to/uc_merced> -- --save-media
datum download get -i tfds:uc_merced -f imagenet --output-dir <path/to/uc_merced> -- --save-media

Merge datasets
==============
Expand Down
3 changes: 2 additions & 1 deletion docs/source/docs/level-up/intermediate_skills/index.rst
Expand Up @@ -37,7 +37,8 @@ Intermediate Skills

Level 06: Dataset Comparison

:bdg-warning:`Python`
:bdg-info:`CLI`
:bdg-success:`ProjectCLI`

.. grid-item-card::

Expand Down