# DeepHealth Toolkit Dataset Format
The DeepHealth Toolkit Dataset Format is a simple and flexible YAML syntax to describe a dataset for the DeepHealth libraries (EDDL/ECVL).
This is only a draft version and every suggestion/contribution to improve the format is appreciated.
The format includes the definition of:
- A name for the dataset (optional)
- A textual description (optional)
- An array with the names of the classes available in the dataset (optional)
- An array with the names of the features available in the dataset (optional)
- An array with the list of images
- A dictionary that specifies the splits of the dataset and the images assigned to them (optional)
The first entries define the basic information of a dataset, such as its name, description, classes, and features. The `classes` entry lists all the predictable categories, while `features` describes additional information attached to each image.
```yaml
# Descriptive string used just for pretty reporting (optional)
name: dataset_name

# Descriptive string to document the file (optional)
description: >
  This is an example of long
  text which describes the use of this dataset and
  whatever I want to annotate.

  You can also write multiple paragraphs with the only
  care of indenting them correctly.

# Array of class names (optional)
classes: [class_a, class_b, class_c]

# Array of feature names (optional)
features: [feature_1, feature_2, feature_3, feature_4]
```
The `images` array lists all the available images. Each image has:

- `location`: absolute or relative path to the file
- `label` (optional): one or more classes from the `classes` field, in the form of:
  - the class name (e.g. in single-class tasks)
  - the class index (e.g. in single-class tasks)
  - an array of class names (e.g. in multiclass tasks)
  - an array of class indexes (e.g. in multiclass tasks)

  or
  - a path or URL to a ground-truth image (e.g. in case of a segmentation task)
- `values` (optional): the values that the fixed features (listed in `features`) assume for this image. It can be:
  - an array of values (with `null` values when the i-th feature is not present)
  - a dictionary with the name of the feature coupled with its value
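Because `label` can take several forms, a consumer of this format has to normalize them before use. A minimal sketch of such normalization (the `normalize_label` helper and the sample class list are illustrative, not part of the format):

```python
def normalize_label(label, classes):
    """Normalize the possible `label` forms to a list of class indexes.

    Returns None for a missing label, and the raw string unchanged when it
    is not a class name (i.e. a segmentation ground-truth path).
    """
    if label is None:                      # labels are optional
        return None
    if isinstance(label, int):             # single class index
        return [label]
    if isinstance(label, str):
        if label in classes:               # single class name
            return [classes.index(label)]
        return label                       # path to a ground-truth image
    # array of class names and/or indexes (multiclass tasks)
    return [c if isinstance(c, int) else classes.index(c) for c in label]


classes = ["class_a", "class_b", "class_c"]
print(normalize_label("class_b", classes))        # [1]
print(normalize_label([0, "class_c"], classes))   # [0, 2]
```

Note that a segmentation ground-truth path can only be told apart from a class name by checking membership in `classes`, which is one reason the `classes` entry is worth filling in even though it is optional.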
```yaml
# Array of images
# Images are listed as a couple of location (absolute or relative to this
# file location) and an optional label.
# location must be unique in the array
images:
  # label can be a class name (string)...
  # values can be an array with a positional correspondence with the features array...
  - location: image_path_and_name_1
    label: class_b
    values: [value_1, null, value_3, null]

  # ... or the class index (integer) wrt the classes array
  # ... or a dictionary with the name of the feature coupled with its value
  - location: image_path_and_name_2
    label: 2
    values: { feature_1: value_1, feature_3: value_3 }

  # In the case of multiclass problems, label can be an array of class names (array of strings)...
  - location: image_path_and_name_3
    label: [class_a, class_c]

  # ... or an array of class indexes (array of integers)
  - location: image_path_and_name_4
    label: [0, 2]

  # label can be a path (string) to an image in case of a segmentation task
  - location: image_path_and_name_5
    label: path_to_ground_truth_image

  # Remember that labels are optional
  - location: image_path_and_name_6
  - location: image_path_and_name_7

  # When only the location is given, the location key can be omitted
  - image_path_and_name_8
  - image_path_and_name_9
```
In the `split` section you can specify how to divide the data. This dictionary lists the indexes of the images to be used in the different phases.
```yaml
# Split (optional) is a dictionary with a custom number of arrays.
# They list the indexes of the images to be used in different phases.
split:
  training: [0, 1]
  validation: [2]
  test: [3]
```
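The split indexes refer to positions in the `images` array. A sketch of resolving them to the actual image entries, assuming the file has already been parsed into a plain dict (the dict literal below stands in for the parser output):

```python
dataset = {
    "images": [
        {"location": "image_path_and_name_1", "label": "class_b"},
        {"location": "image_path_and_name_2", "label": 2},
        {"location": "image_path_and_name_3", "label": ["class_a", "class_c"]},
        {"location": "image_path_and_name_4", "label": [0, 2]},
    ],
    "split": {"training": [0, 1], "validation": [2], "test": [3]},
}

# Resolve each phase to the image entries it references (arrays are 0-based).
splits = {
    phase: [dataset["images"][i] for i in indexes]
    for phase, indexes in dataset["split"].items()
}

print([img["location"] for img in splits["training"]])
# ['image_path_and_name_1', 'image_path_and_name_2']
```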
Putting it all together, here is a complete example:

```yaml
# Example of DeepHealth toolkit dataset format
# Arrays are always 0-based

# Descriptive string used just for pretty reporting (optional)
name: dataset_name

# Descriptive string to document the file (optional)
description: >
  This is an example of long
  text which describes the use of this dataset and
  whatever I want to annotate.

  You can also write multiple paragraphs with the only
  care of indenting them correctly.

# Array of class names (optional)
classes: [class_a, class_b, class_c]

# Array of feature names (optional)
features: [feature_1, feature_2, feature_3, feature_4]

# Array of images
# Images are listed as a couple of location (absolute or relative to this
# file location) and an optional label.
# location must be unique in the array
images:
  # label can be a class name (string)...
  # values can be an array with a positional correspondence with the features array...
  - location: image_path_and_name_1
    label: class_b
    values: [value_1, null, value_3, null]

  # ... or the class index (integer) wrt the classes array
  # ... or a dictionary with the name of the feature coupled with its value
  - location: image_path_and_name_2
    label: 2
    values: { feature_1: value_1, feature_3: value_3 }

  # In the case of multiclass problems, label can be an array of class names (array of strings)...
  - location: image_path_and_name_3
    label: [class_a, class_c]

  # ... or an array of class indexes (array of integers)
  - location: image_path_and_name_4
    label: [0, 2]

  # label can be a path (string) to an image in case of a segmentation task
  - location: image_path_and_name_5
    label: path_to_ground_truth_image

  # Remember that labels are optional
  - location: image_path_and_name_6
  - location: image_path_and_name_7

  # When only the location is given, the location key can be omitted
  - image_path_and_name_8
  - image_path_and_name_9

# Split (optional) is a dictionary with a custom number of arrays.
# They list the indexes of the images to be used in different phases.
split:
  training: [0, 1]
  validation: [2]
  test: [3]
```
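A consumer might sanity-check such a file after parsing it. Below is a sketch of two checks implied by the comments above (location uniqueness, and split indexes staying in range); the dict literal stands in for the output of a YAML parser such as PyYAML's `yaml.safe_load`, and the `validate` helper is illustrative, not part of any DeepHealth API:

```python
def validate(dataset):
    """Check a few constraints stated in the format description."""
    images = dataset["images"]
    # Entries may be a plain string when only the location is given.
    locations = [img if isinstance(img, str) else img["location"] for img in images]
    if len(locations) != len(set(locations)):
        raise ValueError("location must be unique in the images array")
    # Split indexes are 0-based positions in the images array.
    for phase, indexes in dataset.get("split", {}).items():
        for i in indexes:
            if not 0 <= i < len(images):
                raise ValueError(f"split {phase!r}: index {i} out of range")
    return True


dataset = {
    "images": [
        {"location": "image_path_and_name_1", "label": "class_b"},
        "image_path_and_name_8",
    ],
    "split": {"training": [0], "test": [1]},
}
print(validate(dataset))  # True
```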