# DeepHealth Toolkit Dataset Format
The DeepHealth Toolkit Dataset Format is a simple and flexible YAML syntax to describe a dataset for the DeepHealth libraries (EDDL/ECVL).
This is only a draft version and every suggestion/contribution to improve the format is appreciated.
The format includes the definition of:
- A name for the dataset (optional)
- A textual description (optional)
- An array with the names of the classes available in the dataset (optional)
- An array with the names of the features available in the dataset (optional)
- An array with the list of images
- A dictionary that specifies the splits of the dataset and the images assigned to them (optional)
The first entries define the basic information of a dataset, such as its name, description, classes, and features. The `classes` entry lists all the predictable categories, while `features` describes additional information attached to each image.
```yaml
# Descriptive string used just for pretty reporting (optional)
name: dataset_name

# Descriptive string to document the file (optional)
description: >
  This is an example of long
  text which describes the use of this dataset and
  whatever I want to annotate.

  You can also write multiple paragraphs with the only
  care of indenting them correctly.

# Array of class names (optional)
classes: [class_a, class_b, class_c]

# Array of feature names (optional)
features: [feature_1, feature_2, feature_3, feature_4]
```
The `images` array lists all the available images. Each image has:

- `location`: absolute or relative path to the file
- `label` (optional): one or more classes from the `classes` field, in the form of:
  - the class name (e.g. in single-class tasks)
  - the class index (e.g. in single-class tasks)
  - an array of class names (e.g. in multiclass tasks)
  - an array of class indexes (e.g. in multiclass tasks)

  or
  - a path or URL to a ground-truth image (e.g. in case of a segmentation task)
- `values` (optional): the values that the fixed features (listed in `features`) assume for this image. It can be:
  - an array of values (with `null` values when the i-th feature is not present)
  - a dictionary with the name of the feature coupled with its value
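Because `label` can take several forms, a consumer of this format has to normalize them before use. A minimal sketch of such normalization (the `normalize_label` helper and the sample class list are illustrative, not part of the format):

```python
def normalize_label(label, classes):
    """Normalize the possible `label` forms to a list of class indexes.

    Returns None for a missing label, and the raw string unchanged when it
    is not a class name (i.e. a segmentation ground-truth path).
    """
    if label is None:                      # labels are optional
        return None
    if isinstance(label, int):             # single class index
        return [label]
    if isinstance(label, str):
        if label in classes:               # single class name
            return [classes.index(label)]
        return label                       # path to a ground-truth image
    # array of class names and/or indexes (multiclass tasks)
    return [c if isinstance(c, int) else classes.index(c) for c in label]


classes = ["class_a", "class_b", "class_c"]
print(normalize_label("class_b", classes))        # [1]
print(normalize_label([0, "class_c"], classes))   # [0, 2]
```

Note that a segmentation ground-truth path can only be told apart from a class name by checking membership in `classes`, which is one reason the `classes` entry is worth filling in even though it is optional.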
```yaml
# Array of images
# Images are listed as a couple of location (absolute or relative to this
# file location) and an optional label.
# location must be unique in the array
images:
  # label can be a class name (string)...
  # values can be an array with a positional correspondence with the features array...
  - location: image_path_and_name_1
    label: class_b
    values: [value_1, null, value_3, null]

  # ... or the class index (integer) wrt the classes array
  # ... or a dictionary with the name of the feature coupled with its value
  - location: image_path_and_name_2
    label: 2
    values: { feature_1: value_1, feature_3: value_3 }

  # In the case of multiclass problems, label can be an array of class names (array of strings)...
  - location: image_path_and_name_3
    label: [class_a, class_c]

  # ... or an array of class indexes (array of integers)
  - location: image_path_and_name_4
    label: [0, 2]

  # label can be a path (string) to an image in case of a segmentation task
  - location: image_path_and_name_5
    label: path_to_ground_truth_image

  # Remember that labels are optional
  - location: image_path_and_name_6
  - location: image_path_and_name_7

  # When only the location is given, the location key can be omitted
  - image_path_and_name_8
  - image_path_and_name_9
```
In the `split` section you can specify how to divide the data. This dictionary lists the indexes of the images to be used in the different phases.
```yaml
# Split (optional) is a dictionary with a custom number of arrays.
# They list the indexes of the images to be used in different phases.
split:
  training: [0, 1]
  validation: [2]
  test: [3]
```
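The split indexes refer to positions in the `images` array. A sketch of resolving them to the actual image entries, assuming the file has already been parsed into a plain dict (the dict literal below stands in for the parser output):

```python
dataset = {
    "images": [
        {"location": "image_path_and_name_1", "label": "class_b"},
        {"location": "image_path_and_name_2", "label": 2},
        {"location": "image_path_and_name_3", "label": ["class_a", "class_c"]},
        {"location": "image_path_and_name_4", "label": [0, 2]},
    ],
    "split": {"training": [0, 1], "validation": [2], "test": [3]},
}

# Resolve each phase to the image entries it references (arrays are 0-based).
splits = {
    phase: [dataset["images"][i] for i in indexes]
    for phase, indexes in dataset["split"].items()
}

print([img["location"] for img in splits["training"]])
# ['image_path_and_name_1', 'image_path_and_name_2']
```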
Putting it all together, here is a complete example:

```yaml
# Example of DeepHealth toolkit dataset format
# Arrays are always 0-based

# Descriptive string used just for pretty reporting (optional)
name: dataset_name

# Descriptive string to document the file (optional)
description: >
  This is an example of long
  text which describes the use of this dataset and
  whatever I want to annotate.

  You can also write multiple paragraphs with the only
  care of indenting them correctly.

# Array of class names (optional)
classes: [class_a, class_b, class_c]

# Array of feature names (optional)
features: [feature_1, feature_2, feature_3, feature_4]

# Array of images
# Images are listed as a couple of location (absolute or relative to this
# file location) and an optional label.
# location must be unique in the array
images:
  # label can be a class name (string)...
  # values can be an array with a positional correspondence with the features array...
  - location: image_path_and_name_1
    label: class_b
    values: [value_1, null, value_3, null]

  # ... or the class index (integer) wrt the classes array
  # ... or a dictionary with the name of the feature coupled with its value
  - location: image_path_and_name_2
    label: 2
    values: { feature_1: value_1, feature_3: value_3 }

  # In the case of multiclass problems, label can be an array of class names (array of strings)...
  - location: image_path_and_name_3
    label: [class_a, class_c]

  # ... or an array of class indexes (array of integers)
  - location: image_path_and_name_4
    label: [0, 2]

  # label can be a path (string) to an image in case of a segmentation task
  - location: image_path_and_name_5
    label: path_to_ground_truth_image

  # Remember that labels are optional
  - location: image_path_and_name_6
  - location: image_path_and_name_7

  # When only the location is given, the location key can be omitted
  - image_path_and_name_8
  - image_path_and_name_9

# Split (optional) is a dictionary with a custom number of arrays.
# They list the indexes of the images to be used in different phases.
split:
  training: [0, 1]
  validation: [2]
  test: [3]
```
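A consumer might sanity-check such a file after parsing it. Below is a sketch of two checks implied by the comments above (location uniqueness, and split indexes staying in range); the dict literal stands in for the output of a YAML parser such as PyYAML's `yaml.safe_load`, and the `validate` helper is illustrative, not part of any DeepHealth API:

```python
def validate(dataset):
    """Check a few constraints stated in the format description."""
    images = dataset["images"]
    # Entries may be a plain string when only the location is given.
    locations = [img if isinstance(img, str) else img["location"] for img in images]
    if len(locations) != len(set(locations)):
        raise ValueError("location must be unique in the images array")
    # Split indexes are 0-based positions in the images array.
    for phase, indexes in dataset.get("split", {}).items():
        for i in indexes:
            if not 0 <= i < len(images):
                raise ValueError(f"split {phase!r}: index {i} out of range")
    return True


dataset = {
    "images": [
        {"location": "image_path_and_name_1", "label": "class_b"},
        "image_path_and_name_8",
    ],
    "split": {"training": [0], "test": [1]},
}
print(validate(dataset))  # True
```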