Update filtering skill up page #1233

Merged
merged 2 commits into from
Jan 2, 2024
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -12,7 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Develop JsonSectionPageMapper in Rust API
(<https://github.com/openvinotoolkit/datumaro/pull/1224>)
- Add Filtering via User-Provided Python Functions
(<https://github.com/openvinotoolkit/datumaro/pull/1230>)
(<https://github.com/openvinotoolkit/datumaro/pull/1230>, <https://github.com/openvinotoolkit/datumaro/pull/1233>)

### Enhancements
- Optimize Python import to make CLI entrypoint faster
@@ -11,9 +11,13 @@ of data continue to grow, data filtering will become an increasingly important a
By filtering the dataset in this way, we can create a subset of data that is tailored to our specific needs, making it easier
to extract meaningful insights or use it effectively for decision-making purposes.

In this tutorial, we provide the simple example of filtering dataset using item and annotation. To set how to filter dataset,
which satisfied some condition, we use XML as query format. Refer this `XPATH <https://devhints.io/xpath>`_ to set your own filter.
The detailed description for filter operation is given by :doc:`Filter <../../command-reference/context_free/filter>`.
In this tutorial, we provide a simple example of filtering dataset items and annotations.
To set the filtering condition, we can use either of the following:

1) [**ProjectCLI**, **CLI**, **Python**] `XPATH <https://devhints.io/xpath>`_ query,
2) [**Python**] User-provided Python function query.

A detailed description of the `XPATH <https://devhints.io/xpath>`_ query is given on :doc:`this page <../../command-reference/context_free/filter>`.
A more advanced Python example is given in :doc:`this notebook <../../jupyter_notebook_examples/notebooks/04_filter>`.

.. tab-set::
@@ -57,9 +61,56 @@ The more advanced Python example is given :doc:`this notebook <../../jupyter_not
from datumaro.components.dataset import Dataset

dataset_path = '/path/to/data'
dataset = Dataset.import_from(dataset_path, 'datumaro')
dataset = Dataset.import_from(dataset_path, format='datumaro')

filtered_result = Dataset.filter(dataset, 'how/to/filter/dataset')

You can replace ``<how/to/filter/dataset>`` with your own filter, such as ``'/item/annotation[occluded="True"]'``.
This example filter selects only the items that have an annotation whose ``occluded`` attribute is ``True``.
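
To see how a predicate such as ``[occluded="True"]`` selects annotations, the following is a minimal sketch that uses only the Python standard library. The XML layout in ``ITEM_XML`` and the helper ``matching_annotations`` are hypothetical and simplified for illustration; they are not Datumaro's actual item encoding or API, and ``xml.etree.ElementTree`` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML view of a single dataset item.
# This layout is an assumption for illustration only.
ITEM_XML = """
<item>
  <id>frame_000</id>
  <annotation><type>bbox</type><occluded>True</occluded></annotation>
  <annotation><type>bbox</type><occluded>False</occluded></annotation>
</item>
"""


def matching_annotations(item_xml: str, predicate: str) -> list:
    """Return the <annotation> children of one item that satisfy `predicate`."""
    item = ET.fromstring(item_xml)
    # The path is relative to the <item> root element.
    return item.findall(f"./annotation[{predicate}]")


matches = matching_annotations(ITEM_XML, "occluded='True'")
print(len(matches))
```

In this sketch an item would pass an item-level filter when the returned list is non-empty.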

In addition, you can filter dataset items with a custom Python function.
For example, the following selects the dataset items whose image height or width is larger than 1024 pixels:

.. code-block:: python

    from datumaro.components.dataset import Dataset
    from datumaro.components.dataset_base import DatasetItem
    from datumaro.components.media import Image

    def filter_func(item: DatasetItem) -> bool:
        # Image.size is (height, width)
        h, w = item.media_as(Image).size
        return h > 1024 or w > 1024

    dataset_path = '/path/to/data'
    dataset = Dataset.import_from(dataset_path, format='datumaro')

    filtered_result = Dataset.filter(dataset, filter_func)

It is also possible to filter dataset annotations with a user-provided Python function.
This is an example of removing bounding boxes whose area is 50% or more of the image area:

.. code-block:: python

    from datumaro.components.dataset import Dataset
    from datumaro.components.dataset_base import DatasetItem
    from datumaro.components.media import Image
    from datumaro.components.annotation import Annotation, Bbox

    def filter_func(item: DatasetItem, ann: Annotation) -> bool:
        # If the annotation is not a Bbox, do not filter
        if not isinstance(ann, Bbox):
            return False

        h, w = item.media_as(Image).size
        image_size = h * w
        bbox_size = ann.h * ann.w

        # Accept Bboxes smaller than 50% of the image size
        return bbox_size < 0.5 * image_size

    dataset_path = '/path/to/data'
    dataset = Dataset.import_from(dataset_path, format='datumaro')

    filtered_result = Dataset.filter(dataset, filter_func, filter_annotations=True)
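
The two size-based predicates above can be checked without a dataset on disk. The following is a rough sketch with stand-in types: ``FakeItem``, ``FakeBbox``, ``keep_item``, ``keep_annotation``, and ``apply_filters`` are hypothetical names introduced here, not Datumaro classes or functions. It assumes that a predicate returning ``True`` means "keep", and it chains the item filter and the annotation filter only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeBbox:
    """Stand-in for a bounding box annotation (hypothetical)."""
    w: float
    h: float


@dataclass
class FakeItem:
    """Stand-in for a dataset item (hypothetical); size is (height, width)."""
    size: Tuple[int, int]
    annotations: List[FakeBbox] = field(default_factory=list)


def keep_item(item: FakeItem) -> bool:
    # Keep items whose image has a side longer than 1024 pixels.
    h, w = item.size
    return h > 1024 or w > 1024


def keep_annotation(item: FakeItem, ann: FakeBbox) -> bool:
    # Keep boxes whose area is below 50% of the image area.
    h, w = item.size
    return ann.h * ann.w < 0.5 * (h * w)


def apply_filters(items: List[FakeItem]) -> List[FakeItem]:
    """Drop rejected items, then drop rejected annotations of the remaining items."""
    result = []
    for item in items:
        if not keep_item(item):
            continue
        kept = [a for a in item.annotations if keep_annotation(item, a)]
        result.append(FakeItem(item.size, kept))
    return result


items = [
    FakeItem((2000, 1000), [FakeBbox(100, 100), FakeBbox(1000, 1500)]),
    FakeItem((500, 500), [FakeBbox(10, 10)]),
]
filtered = apply_filters(items)
print(len(filtered), len(filtered[0].annotations))
```

Here the small image is rejected by the item predicate, and the 1000 x 1500 box is rejected because it covers more than half of the 2000 x 1000 image.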