From cb6a5aa74c4d06a405fce603991c4e4f867c183e Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 28 Feb 2023 09:26:42 +0100 Subject: [PATCH 1/4] DOC: move history-comparisons and backward-compat info to docs --- README.md | 28 ------------- docs/source/design/history.rst | 77 ++++++++++++++++++++++++++++++++++ docs/source/design/index.rst | 3 +- 3 files changed, 79 insertions(+), 29 deletions(-) create mode 100644 docs/source/design/history.rst diff --git a/README.md b/README.md index 4aa731a2..293e2da4 100644 --- a/README.md +++ b/README.md @@ -9,34 +9,6 @@ This software is a [DataLad](http://datalad.org) extension that equips DataLad with an alternative command suite for metadata handling (extraction, aggregation, filtering, and reporting). -Please note that the metadata storage format introduced in release 0.3.0 is incompatible -with the metadata storage formate in previous versions, i.e. `0.2.x`, and in DataLad -proper. They both happily coexist on storage, but this version of metalad will not -be able to read metadata that was stored by the previous version and vice versa. -Eventually there will be an importer that will pull old-version metadata into -the new metadata storage. It is planned for release 0.3.1 - -Here is an overview of the changes in 0.3.0 (the new system is quite -different from the previous release in a few ways): - -1. Leaner commands with unix-style behavior, i.e. one command for one operation, and commands are chainable (use results from one command as input for another command, e.g. meta-extract|meta-add). - -2. MetadataRecord modifications does not alter the state of the datalad dataset. In previous releases, changes to metadata have altered the version (commit-hash) of the repository although the primary data did not change. This is not the case in the new system. The new system does provide information about the primary data version, i.e. commit-hash, from which the individual metadata elements were created. - -3. 
The ability to support a wide range of metadata storage backends in the future (this is facilitated by the [datalad-metadata-model](https://github.com/datalad/metadata-model)) which is developed alongside metalad), which separates the logical metadata model used in metalad from the storage backends, by abstracting the storage backend), Currently git-repository storage is supported. - -4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a *metadata-store* which is usually the git-repository of the datalad dataset that is described by the metadata. But this is not a mandatory configuration, metadata can be stored in almost any git-repository. - -5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. In fact only the minimal necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, for example, also to filtering. - -6. A new simplified extractor model that distinguishes between two extractor-types: dataset-level extractors and file-extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported. - -7. A built-in pipeline mechanism that allows parallel execution of metadata operations like metadata extraction, and metadata filtering. (Still in early stage) - -8. A new set of commands that allow operations that map metadata to metadata. Those operations are called filtering and are implemented by MetadataFilter-classes. Filter are dynamically loaded and custom filter are supports, much like extractors. (Still in early stage) - -9. Backward compatibility supporting an import from previous metadata storage (planned for 0.3.1). 
- Command(s) currently provided by this extension diff --git a/docs/source/design/history.rst b/docs/source/design/history.rst new file mode 100644 index 00000000..59c6666b --- /dev/null +++ b/docs/source/design/history.rst @@ -0,0 +1,77 @@ +.. _history: + +****************************************************** +MetaLad development history and backward compatibility +****************************************************** + +Functionality related to metadata has been a part of the DataLad ecosystem from the very start. +However, it has gone through several iterations, and this extension is its most recent state. +If you have been an early adopter of the metadata functionalities of DataLad or MetaLad, this section provides an overview of past systems and notable changes so that you can assess upgrades and backward compatibility with legacy metadata. + +First-generation metadata +------------------------- + +The first generation of metadata commands was implemented in the main ``datalad`` Python package, but barely saw the light of day. +Very early users of DataLad might have caught a glimpse of it. + +In the 1st-gen metadata implementation, metadata of a dataset had two levels. +The first one contained the metadata about the actual content of a dataset (generated by DataLad or other processes), the second one was metadata about the dataset itself (generated by DataLad). +The metadata was represented in `RDF `_. + +Second-generation metadata +-------------------------- + +The second generation of metadata commands came to life when the main ``datalad`` package was a few years old already. +It brought the concept of dedicated *extractors*, including the legacy extractors that are supported to this day. +It also provided a range of dedicated metadata subcommands of a ``datalad metadata`` command such as ``aggregate`` and ``extract``, as well as a dedicated ``datalad search`` command. 
+Extracted metadata was stored in a dataset in (compressed) files using a JSON +stream format, separately for metadata describing a dataset as a whole, and +metadata describing individual files in a dataset. + +The 2nd-gen metadata implementation was moved into the `datalad-deprecated `_ extension in 2022. + + +Third-generation metadata +------------------------- + +The third generation of metadata commands was developed as the DataLad extension MetaLad. +Initially, until version ``0.2.1``, it continued the development of the 2nd-gen metadata functionality. +Afterwards, beginning with the ``0.3.x`` series, the metadata model and command set were revised once more into the current 3rd-gen metadata implementation. +This implementation came with an entirely new metadata model. + +Gen 2 versus gen 3 metadata +--------------------------- + +This section is important if you have used ``datalad-metalad`` prior to the ``0.3.0`` release. + +Overview of changes +^^^^^^^^^^^^^^^^^^^ + +The new system in ``0.3.0`` differs from the previous release in several ways: + +1. Leaner commands with Unix-style behavior, i.e. one command for one operation, and commands are chainable (use results from one command as input for another command, e.g. ``meta-extract | meta-add``). + +2. MetadataRecord modifications do not alter the state of the DataLad dataset. In previous releases, changes to metadata altered the version (commit-hash) of the repository although the primary data did not change. This is not the case in the new system. The new system does provide information about the primary data version, i.e. commit-hash, from which the individual metadata elements were created. + +3. 
The ability to support a wide range of metadata storage backends in the future. This is facilitated by the `datalad-metadata-model <https://github.com/datalad/metadata-model>`_, which is developed alongside MetaLad and separates the logical metadata model used in MetaLad from the storage backends by abstracting the storage backend. Currently, git-repository storage is supported. + +4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a *metadata-store*, which is usually the git-repository of the DataLad dataset that is described by the metadata. This is not mandatory, however: metadata can be stored in almost any git-repository. + +5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. In fact, only the minimal necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, including, for example, filtering. + +6. A new simplified extractor model that distinguishes between two extractor-types: dataset-level extractors and file-extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported. + +7. A built-in pipeline mechanism that allows parallel execution of metadata operations like metadata extraction and metadata filtering (still at an early stage). + +8. A new set of commands that allow operations that map metadata to metadata. Those operations are called filtering and are implemented by MetadataFilter-classes. Filters are dynamically loaded, and custom filters are supported, much like extractors (still at an early stage). + +Backward-compatibility +^^^^^^^^^^^^^^^^^^^^^^ + +Certain versions of MetaLad metadata are temporarily incompatible. + +.. 
note:: Incompatibility of 0.3.0 and 0.2.x + + Please note that the metadata storage format introduced in release ``0.3.0`` is incompatible with the metadata storage format in previous versions, i.e. ``0.2.x``, and those in ``datalad-deprecated``. + Both storage formats can coexist in storage, but version ``0.3.0`` of MetaLad will not be able to read metadata that was stored by the previous version and vice versa. + Eventually, there will be an importer that will pull old-version metadata into the new metadata storage. diff --git a/docs/source/design/index.rst b/docs/source/design/index.rst index fd89681d..85c0b0cf 100644 --- a/docs/source/design/index.rst +++ b/docs/source/design/index.rst @@ -13,4 +13,5 @@ The chapter describes the design of particular subsystems in DataLad. :maxdepth: 2 conduct - datatypes \ No newline at end of file + datatypes + history \ No newline at end of file From 5ba27a7ce243c1304663156daad9af942a12b618 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 28 Feb 2023 09:28:21 +0100 Subject: [PATCH 2/4] Introduce subheadings in the README --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 293e2da4..b79a4cc6 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ with an alternative command suite for metadata handling (extraction, aggregation filtering, and reporting). -Command(s) currently provided by this extension +#### Command(s) currently provided by this extension - `meta-extract` -- run an extractor on a file or dataset and emit the resulting metadata (stdout). @@ -35,7 +35,7 @@ such as metadata-extraction and metadata-adding.Processors are usually executed in parallel. A few pipeline definitions are provided with the release. -Commands currently under development: +#### Commands currently under development: - `meta-export` -- write a flat representation of metadata to a file-system. 
For now you can export your metadata to a JSON-lines file named `metadata-dump.jsonl`: @@ -55,7 +55,7 @@ Commands currently under development: *A word of caution: documentation is still lacking and will be addressed with release 0.3.1.* -Additional metadata extractor implementations +#### Additional metadata extractor implementations - Compatible with the previous families of extractors provided by datalad and by metalad, i.e. `metalad_core`, `metalad_annex`, `metalad_custom`, `metalad_runprov` @@ -74,7 +74,7 @@ data in the input file -Indexers +#### Indexers - Provides indexers for the new datalad indexer-plugin interface. These indexers convert metadata in proprietary formats into a set of key-value pairs that can From 07ce0bad36eb8f532a4be4bb93a1be83bebf8ee3 Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 28 Feb 2023 09:28:53 +0100 Subject: [PATCH 3/4] remove cautionary note about documentation --- README.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/README.md b/README.md index b79a4cc6..f306a953 100644 --- a/README.md +++ b/README.md @@ -52,9 +52,6 @@ with the release. - `meta-ingest-previous` -- ingest metadata from `metalady<=0.2.1`. -*A word of caution: documentation is still lacking and will be addressed with release 0.3.1.* - - #### Additional metadata extractor implementations - Compatible with the previous families of extractors provided by datalad From 90f3cc6ddbb36892835a20000cc610894cf758db Mon Sep 17 00:00:00 2001 From: Adina Wagner Date: Tue, 28 Feb 2023 09:32:43 +0100 Subject: [PATCH 4/4] Typo: metalady -> metalad --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f306a953..81c033d1 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,7 @@ with the release. datalad meta-add -d --json-lines -i metadata-dump.jsonl ``` -- `meta-ingest-previous` -- ingest metadata from `metalady<=0.2.1`. +- `meta-ingest-previous` -- ingest metadata from `metalad<=0.2.1`. 
#### Additional metadata extractor implementations