Merge pull request #265 from bids-standard/common-derivatives
[ENH] BEP 003: Common Derivatives
sappelhoff authored Jun 10, 2020
2 parents 60ab047 + affa960 commit 3a11391
Showing 18 changed files with 960 additions and 68 deletions.
4 changes: 4 additions & 0 deletions CODEOWNERS
@@ -13,4 +13,8 @@
/src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md @chrisgorgo
/src/04-modality-specific-files/03-electroencephalography.md @sappelhoff @ezemikulan
/src/04-modality-specific-files/04-intracranial-electroencephalography.md @ezemikulan
/src/05-derivatives/03-imaging.md @effigies
/src/05-derivatives/04-structural-derivatives.md @edickie @ahoopes
/src/05-derivatives/05-functional-derivatives.md @effigies
/src/05-derivatives/06-diffusion-derivatives.md @francopestilli @oesteban @Lestropie
/src/99-appendices/06-meg-file-formats.md @monkeyman192
1 change: 1 addition & 0 deletions Pipfile
@@ -6,6 +6,7 @@ name = "pypi"
[packages]
mkdocs = "==1.0.4"
mkdocs-material = "==4.1.2"
pymdown-extensions = "==6.0.0"
mkdocs-branchcustomization-plugin = "~=0.1.3"

[dev-packages]
16 changes: 5 additions & 11 deletions Pipfile.lock

Some generated files are not rendered by default.

9 changes: 7 additions & 2 deletions mkdocs.yml
@@ -9,6 +9,7 @@ extra_javascript:
markdown_extensions:
- toc:
anchorlink: true
- pymdownx.superfences
plugins:
- search
- branchcustomization:
@@ -32,8 +33,12 @@ nav:
- Physiological and other continuous recordings: 04-modality-specific-files/06-physiological-and-other-continuous-recordings.md
- Behavioral experiments (with no MRI): 04-modality-specific-files/07-behavioral-experiments.md
- Genetic Descriptor: 04-modality-specific-files/08-genetic-descriptor.md
- Longitudinal and multi-site studies: 05-longitudinal-and-multi-site-studies.md
- BIDS Extension Proposals: 06-extensions.md
- Derivatives:
- BIDS Derivatives: 05-derivatives/01-introduction.md
- Common data types and metadata: 05-derivatives/02-common-data-types.md
- Imaging data types: 05-derivatives/03-imaging.md
- Longitudinal and multi-site studies: 06-longitudinal-and-multi-site-studies.md
- BIDS Extension Proposals: 07-extensions.md
- Appendix:
- Contributors: 99-appendices/01-contributors.md
- Licenses: 99-appendices/02-licenses.md
2 changes: 1 addition & 1 deletion src/01-introduction.md
@@ -46,7 +46,7 @@ different backgrounds.
The BIDS specification can be extended in a backwards compatible way and will
evolve over time. This is accomplished through community-driven BIDS Extension
Proposals (BEPs). For more information about the BEP process, see
[Extending the BIDS specification](06-extensions.md).
[Extending the BIDS specification](07-extensions.md).

## Citing BIDS

173 changes: 140 additions & 33 deletions src/02-common-principles.md
@@ -120,39 +120,34 @@ in the appendix.

## Source vs. raw vs. derived data

BIDS in its current form is designed to harmonize and describe raw (unprocessed
or minimally processed due to file format conversion) data. During analysis such
data will be transformed and partial as well as final results will be saved.
BIDS was originally designed to describe and apply consistent naming conventions
to raw (unprocessed or minimally processed due to file format conversion) data.
During analysis such data will be transformed and partial as well as final results
will be saved.
Derivatives of the raw data (other than products of DICOM to NIfTI conversion)
MUST be kept separate from the raw data. This way one can protect the raw data
from accidental changes by file permissions. In addition it is easy to
distinguish partial results from the raw data and share the latter. Similar
rules apply to source data which is defined as data before harmonization and/or
file format conversion (for example E-Prime event logs or DICOM files).

This specification currently does not go into details of recommending a
particular naming scheme for including different types of source data (raw event
logs, parameter files, etc. before conversion to BIDS) and data derivatives
(correlation maps, brain masks, contrasts maps, etc.). However, in the case that
these data are to be included:

1. These data MUST be kept in separate `sourcedata` and `derivatives` folders
each with a similar folder structure as presented below for the BIDS-managed
data. For example:
`derivatives/fmriprep/sub-01/ses-pre/sub-01_ses-pre_mask.nii.gz` or
distinguish partial results from the raw data and share the latter.
See [Storage of derived datasets](#storage-of-derived-datasets) for more on
organizing derivatives.

Similar rules apply to source data, which is defined as data before
harmonization, reconstruction, and/or file format conversion (for example, E-Prime event logs or
DICOM files). This specification currently does not go into details of
recommending a particular naming scheme for including different types of
source data (raw event logs, parameter files, etc. before conversion to BIDS).
However, in the case that these data are to be included:

1. These data MUST be kept in a separate `sourcedata` folder with a similar
folder structure as presented below for the BIDS-managed data. For example:
`sourcedata/sub-01/ses-pre/func/sub-01_ses-pre_task-rest_bold.dicom.tgz` or
`sourcedata/sub-01/ses-pre/func/MyEvent.sce`.

1. A README file SHOULD be found at the root of the `sourcedata` or the
`derivatives` folder (or both). This file should describe the nature of the
raw data or the derived data. In the case of the existence of a
`derivatives` folder, we RECOMMEND including details about the software
stack and settings used to generate the results. Inclusion of non-imaging
objects that improve reproducibility are encouraged (scripts, settings
files, etc.).

1. We RECOMMEND including the PDF print-out with the actual sequence parameters
generated by the scanner in the `sourcedata` folder.
1. A README file SHOULD be found at the root of the `sourcedata` folder or the
`derivatives` folder, or both.
This file should describe the nature of the raw data or the derived data.
We RECOMMEND including the PDF print-out with the actual sequence
parameters generated by the scanner in the `sourcedata` folder.

Alternatively one can organize their data in the following way

@@ -167,15 +162,120 @@ my_dataset/
sub-02/
...
derivatives/
pipeline_1/
pipeline_2/
...
```

In this example **only the `rawdata` subfolder needs to be a BIDS compliant
dataset**. This specification does not prescribe anything about the contents of
`sourcedata` and `derivatives` folders in the above example - nor does it
prescribe the `sourcedata`, `derivatives`, or `rawdata` folder names. The above
example is just a convention that can be useful for organizing raw, source, and
derived data while maintaining BIDS compliancy of the raw data folder.
In this example, where `sourcedata` and `derivatives` are not nested inside
`rawdata`, **only the `rawdata` subfolder** needs to be a BIDS-compliant
dataset.
The subfolders of `derivatives` MAY be BIDS-compliant derivatives datasets
(see [Non-compliant derivatives](#non-compliant-derivatives) for further discussion).
This specification does not prescribe anything about the contents of `sourcedata`
folders in the above example - nor does it prescribe the `sourcedata`,
`derivatives`, or `rawdata` folder names.
The above example is just a convention that can be useful for organizing raw,
source, and derived data while maintaining BIDS compliancy of the raw data
folder. When using this convention it is RECOMMENDED to set the `SourceDatasets`
field in `dataset_description.json` of each subfolder of `derivatives` to:

```JSON
{
"SourceDatasets": [ {"URL": "file://../../rawdata/"} ]
}
```
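
The recommendation above can be applied to an existing tree with a short script.
The following is a minimal sketch, not part of the specification: the function
name `set_source_datasets` is hypothetical, and it assumes the
`rawdata`/`derivatives` layout shown above, with one subfolder per pipeline
that already contains a `dataset_description.json`.

```python
import json
from pathlib import Path


def set_source_datasets(derivatives_dir, url="file://../../rawdata/"):
    """Set SourceDatasets in each pipeline's dataset_description.json.

    Hypothetical helper: returns the list of description files it updated.
    """
    updated = []
    for pipeline_dir in sorted(Path(derivatives_dir).iterdir()):
        if not pipeline_dir.is_dir():
            continue  # ignore stray files next to the pipeline folders
        desc_path = pipeline_dir / "dataset_description.json"
        if not desc_path.is_file():
            continue  # non-compliant derivatives have nothing to update
        description = json.loads(desc_path.read_text())
        # Assumed relative URL, copied from the JSON example above.
        description["SourceDatasets"] = [{"URL": url}]
        desc_path.write_text(json.dumps(description, indent=2) + "\n")
        updated.append(desc_path)
    return updated
```

Subfolders without a `dataset_description.json` are skipped rather than
created, since derivatives subfolders are allowed to be non-compliant.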

### Storage of derived datasets

Derivatives can be stored/distributed in two ways:

1. Under a `derivatives/` subfolder in the root of the source BIDS dataset
folder to make a clear distinction between raw data and results of data
processing. A data processing pipeline will typically have a dedicated directory
under which it stores all of its outputs. Different components of a pipeline can,
however, also be stored under different subfolders. There are few restrictions on
the directory names; it is RECOMMENDED to use the format `<pipeline>-<variant>` in
cases where it is anticipated that the same pipeline will output more than one variant (e.g.,
`AFNI-blurring`, `AFNI-noblurring`, etc.). For the sake of consistency, the
subfolder name SHOULD be the `GeneratedBy.Name` field in
`dataset_description.json`, optionally followed by a hyphen and a suffix (see
[Derived dataset and pipeline description][derived-dataset-description]).

Example of derivatives with one directory per pipeline:

```Plain
<dataset>/derivatives/fmriprep-v1.4.1/sub-0001
<dataset>/derivatives/spm/sub-0001
<dataset>/derivatives/vbm/sub-0001
```
Example of a pipeline with split derivative directories:
```Plain
<dataset>/derivatives/spm-preproc/sub-0001
<dataset>/derivatives/spm-stats/sub-0001
```
Example of a pipeline with nested derivative directories:
```Plain
<dataset>/derivatives/spm-preproc/sub-0001
<dataset>/derivatives/spm-preproc/derivatives/spm-stats/sub-0001
```
1. As a standalone dataset independent of the source (raw or derived) BIDS
dataset.
This way of specifying derivatives is particularly useful when the source
dataset is provided with read-only access, for publishing derivatives as
independent bodies of work, or for describing derivatives that were created
from more than one source dataset.
The `sourcedata/` subdirectory MAY be used to include the source dataset(s)
that were used to generate the derivatives.
Likewise, any code used to generate the derivatives from the source data
MAY be included in the `code/` subdirectory.
Example of a derivative dataset including the raw dataset as source:
```Plain
my_processed_data/
code/
processing_pipeline-1.0.0.img
hpc_submitter.sh
...
sourcedata/
dataset_description.json
participants.tsv
sub-01/
sub-02/
...
dataset_description.json
sub-01/
sub-02/
...
```
Throughout this specification, if a section applies particularly to derivatives,
then Case 1 will be assumed for clarity in templates and examples, but removing
`/derivatives/<pipeline>` from the template name will provide the equivalent for
Case 2.
In both cases, every derivatives dataset is considered a BIDS dataset and must
include a `dataset_description.json` file at the root level (see
[Dataset description][dataset-description]).
Consequently, files should be organized to comply with BIDS to the full extent
possible (that is, unless explicitly contradicted for derivatives).
Any subject-specific derivatives should be housed within each subject’s directory;
if session-specific derivatives are generated, they should be deposited under a
session subdirectory within the corresponding subject directory; and so on.
### Non-compliant derivatives
Nothing in this specification should be interpreted to disallow the
storage/distribution of non-compliant derivatives of BIDS datasets.
In particular, if a BIDS dataset contains a `derivatives/` sub-directory,
the contents of that directory may be a heterogeneous mix of BIDS Derivatives
datasets and non-compliant derivatives.
## The Inheritance Principle
@@ -509,3 +609,10 @@ meaning of file names and setting requirements on their contents or metadata.
Validation and parsing tools MAY treat the presence of non-standard files and
directories as an error, so consult the details of these tools for mechanisms
to suppress warnings or provide interpretations of your file names.

[]: <> (################)
[]: <> (Link definitions)
[]: <> (################)

[dataset-description]: 03-modality-agnostic-files.md#dataset-description
[derived-dataset-description]: 03-modality-agnostic-files.md#derived-dataset-and-pipeline-description
66 changes: 65 additions & 1 deletion src/03-modality-agnostic-files.md
@@ -18,6 +18,7 @@ Every dataset MUST include this file with the following fields:
| ------------------------------------------------------------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name | REQUIRED. Name of the dataset. |
| BIDSVersion | REQUIRED. The version of the BIDS standard that was used. |
| DatasetType | RECOMMENDED. The interpretation of the dataset. MUST be one of `"raw"` or `"derivative"`. For backwards compatibility, the default value is `"raw"`. |
| License | RECOMMENDED. The license for the dataset. The use of license name abbreviations is RECOMMENDED for specifying a license (see [Appendix II](./99-appendices/02-licenses.md)). The corresponding full license text MAY be specified in an additional `LICENSE` file. |
| Authors | OPTIONAL. List of individuals who contributed to the creation/curation of the dataset. |
| Acknowledgements | OPTIONAL. Text acknowledging contributions of individuals or institutions beyond those listed in Authors or Funding. |
@@ -32,7 +33,8 @@ Example:
```JSON
{
"Name": "The mother of all experiments",
"BIDSVersion": "1.0.1",
"BIDSVersion": "1.4.0",
"DatasetType": "raw",
"License": "CC0",
"Authors": [
"Paul Broca",
@@ -55,6 +57,68 @@ Example:
}
```

#### Derived dataset and pipeline description

As for any BIDS dataset, a `dataset_description.json` file MUST be found at the
top level of a derived dataset:
`<dataset>/derivatives/<pipeline_name>/dataset_description.json`

In addition to the keys for raw BIDS datasets,
derived BIDS datasets include the following REQUIRED and RECOMMENDED
`dataset_description.json` keys:

| **Key name** | **Description** |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| GeneratedBy | REQUIRED. List of [objects][object] with at least one element. |
| SourceDatasets | RECOMMENDED. A list of [objects][object] specifying the locations and relevant attributes of all source datasets. Valid fields in each object include `URL`, `DOI`, and `Version`. |

Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
and OPTIONAL keys:

| **Key name** | **Description** |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Name | REQUIRED. Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline. |
| Version | RECOMMENDED. Version of the pipeline. |
| Description | OPTIONAL. Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`. |
| CodeURL | OPTIONAL. URL where the code used to generate the derivatives may be found. |
| Container | OPTIONAL. [Object][object] specifying the location and relevant attributes of the software container image used to produce the derivative. Valid fields in this object include `Type`, `Tag` and `URI`. |

Example:

```JSON
{
"Name": "FMRIPREP Outputs",
"BIDSVersion": "1.4.0",
"DatasetType": "derivative",
"GeneratedBy": [
{
"Name": "fmriprep",
"Version": "1.4.1",
"Container": {
"Type": "docker",
"Tag": "poldracklab/fmriprep:1.4.1"
}
},
{
"Name": "Manual",
"Description": "Re-added RepetitionTime metadata to bold.json files"
}
],
"SourceDatasets": [
{
"DOI": "10.18112/openneuro.ds000114.v1.0.1",
"URL": "https://openneuro.org/datasets/ds000114/versions/1.0.1",
"Version": "1.0.1"
}
]
}
```

If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
`GeneratedBy` object should have a `Name` of `<pipeline>`.
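
The `GeneratedBy` requirements and the folder-name rule above can be checked
mechanically. The sketch below is illustrative only: `check_generated_by` is a
hypothetical helper, not part of any BIDS tool, and it covers just the REQUIRED
`GeneratedBy` list, the REQUIRED `Name` key, and the substring rule.

```python
def check_generated_by(description, folder_name=None):
    """Return a list of problems found in a derived dataset description.

    `description` is the parsed dataset_description.json (a dict).
    `folder_name` is the name of the folder under derivatives/, if any.
    """
    generated_by = description.get("GeneratedBy")
    if not isinstance(generated_by, list) or not generated_by:
        return ["GeneratedBy must be a list with at least one object"]

    problems = []
    for index, entry in enumerate(generated_by):
        if not isinstance(entry, dict) or not entry.get("Name"):
            problems.append(f"GeneratedBy[{index}] lacks the REQUIRED Name key")

    # Only the first GeneratedBy object is tied to the folder name.
    first = generated_by[0]
    first_name = first.get("Name") if isinstance(first, dict) else None
    if (
        folder_name is not None
        and isinstance(first_name, str)
        and first_name
        and first_name not in folder_name
    ):
        problems.append(
            f"first GeneratedBy Name {first_name!r} is not a substring "
            f"of folder name {folder_name!r}"
        )
    return problems


description = {
    "GeneratedBy": [
        {"Name": "fmriprep", "Version": "1.4.1"},
        {"Name": "Manual", "Description": "Re-added RepetitionTime metadata"},
    ]
}
print(check_generated_by(description, "fmriprep-v1.4.1"))  # prints []
print(check_generated_by(description, "spm-preproc"))  # reports one problem
```

Passing `folder_name=None` skips the substring check, which matches the case of
a standalone derived dataset that is not stored under `derivatives/`.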

### `README`

In addition a free form text file (`README`) describing the dataset in more
@@ -1,6 +1,6 @@
# Magnetoencephalography

Support for Magnetoencephalography (MEG) was developed as a [BIDS Extension Proposal](../06-extensions.md#bids-extension-proposals).
Support for Magnetoencephalography (MEG) was developed as a [BIDS Extension Proposal](../07-extensions.md#bids-extension-proposals).
Please cite the following paper when referring to this part of the standard in
context of the academic literature:
