DOC: Make a history page from the README #339

Merged
merged 4 commits into from
Feb 28, 2023
41 changes: 5 additions & 36 deletions README.md
Expand Up @@ -9,36 +9,8 @@ This software is a [DataLad](http://datalad.org) extension that equips DataLad
with an alternative command suite for metadata handling (extraction, aggregation,
filtering, and reporting).

Please note that the metadata storage format introduced in release 0.3.0 is incompatible
with the metadata storage formate in previous versions, i.e. `0.2.x`, and in DataLad
proper. They both happily coexist on storage, but this version of metalad will not
be able to read metadata that was stored by the previous version and vice versa.
Eventually there will be an importer that will pull old-version metadata into
the new metadata storage. It is planned for release 0.3.1

Here is an overview of the changes in 0.3.0 (the new system is quite
different from the previous release in a few ways):

1. Leaner commands with unix-style behavior, i.e. one command for one operation, and commands are chainable (use results from one command as input for another command, e.g. meta-extract|meta-add).

2. MetadataRecord modifications does not alter the state of the datalad dataset. In previous releases, changes to metadata have altered the version (commit-hash) of the repository although the primary data did not change. This is not the case in the new system. The new system does provide information about the primary data version, i.e. commit-hash, from which the individual metadata elements were created.

3. The ability to support a wide range of metadata storage backends in the future (this is facilitated by the [datalad-metadata-model](https://github.com/datalad/metadata-model)) which is developed alongside metalad), which separates the logical metadata model used in metalad from the storage backends, by abstracting the storage backend), Currently git-repository storage is supported.

4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a *metadata-store* which is usually the git-repository of the datalad dataset that is described by the metadata. But this is not a mandatory configuration, metadata can be stored in almost any git-repository.

5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. In fact only the minimal necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, for example, also to filtering.

6. A new simplified extractor model that distinguishes between two extractor-types: dataset-level extractors and file-extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported.

7. A built-in pipeline mechanism that allows parallel execution of metadata operations like metadata extraction, and metadata filtering. (Still in early stage)

8. A new set of commands that allow operations that map metadata to metadata. Those operations are called filtering and are implemented by MetadataFilter-classes. Filter are dynamically loaded and custom filter are supports, much like extractors. (Still in early stage)

9. Backward compatibility supporting an import from previous metadata storage (planned for 0.3.1).


Command(s) currently provided by this extension
#### Command(s) currently provided by this extension

- `meta-extract` -- run an extractor on a file or dataset and emit the
resulting metadata (stdout).
Expand All @@ -63,7 +35,7 @@ such as metadata-extraction and metadata-adding.Processors
are usually executed in parallel. A few pipeline definitions are provided
with the release.

Commands currently under development:
#### Commands currently under development:

- `meta-export` -- write a flat representation of metadata to a file-system. For now you
can export your metadata to a JSON-lines file named `metadata-dump.jsonl`:
Expand All @@ -77,13 +49,10 @@ Commands currently under development:
datalad meta-add -d <dataset-path> --json-lines -i metadata-dump.jsonl
```

- `meta-ingest-previous` -- ingest metadata from `metalady<=0.2.1`.


*A word of caution: documentation is still lacking and will be addressed with release 0.3.1.*
- `meta-ingest-previous` -- ingest metadata from `metalad<=0.2.1`.
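A JSON-lines dump such as `metadata-dump.jsonl` holds one JSON document per line, so it can be inspected with a few lines of standard-library Python before it is re-imported. The following is a minimal sketch, not part of metalad: the helper name `summarize_jsonl` is hypothetical, and it makes no assumption about which fields metalad writes, it only counts records and the top-level keys it happens to find.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(lines):
    """Return (record_count, Counter of top-level keys) for JSON-lines input.

    Hypothetical helper for illustration; it is not provided by metalad.
    """
    key_counts = Counter()
    record_count = 0
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)  # each non-empty line is one JSON document
        record_count += 1
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return record_count, key_counts


if __name__ == "__main__":
    # File name taken from the export example; nothing happens if it is absent.
    dump = Path("metadata-dump.jsonl")
    if dump.exists():
        with dump.open(encoding="utf-8") as stream:
            count, keys = summarize_jsonl(stream)
        print(f"{count} records")
        for key, seen in keys.most_common():
            print(f"{key}: {seen}")
```

Because the function accepts any iterable of lines, it works on an open file as well as on a list of strings, which keeps it easy to try out without a real dataset.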


Additional metadata extractor implementations
#### Additional metadata extractor implementations

- Compatible with the previous families of extractors provided by datalad
and by metalad, i.e. `metalad_core`, `metalad_annex`, `metalad_custom`, `metalad_runprov`
Expand All @@ -102,7 +71,7 @@ data in the input file



Indexers
#### Indexers

- Provides indexers for the new datalad indexer-plugin interface. These indexers
convert metadata in proprietary formats into a set of key-value pairs that can
77 changes: 77 additions & 0 deletions docs/source/design/history.rst
@@ -0,0 +1,77 @@
.. _history:

******************************************************
MetaLad development history and backward compatibility
******************************************************

Functionality related to metadata has been a part of the DataLad ecosystem from the very start.
However, it underwent several evolutions, and this extension is the most recent state of it.
If you have been an early adopter of the metadata functionalities of DataLad or MetaLad, this section provides an overview of past systems and notable changes so that you can assess upgrades and backward compatibility with legacy metadata.

First-generation metadata
-------------------------

The first generation of metadata commands was implemented in the main ``datalad`` Python package, but barely saw the light of day.
Very early users of DataLad might have caught a glimpse of it.

In the 1st-gen metadata implementation, metadata of a dataset had two levels.
The first contained metadata about the actual content of a dataset (generated by DataLad or other processes); the second contained metadata about the dataset itself (generated by DataLad).
The metadata was represented in `RDF <https://en.wikipedia.org/wiki/Resource_Description_Framework>`_.

Second-generation metadata
--------------------------

The second generation of metadata commands came to life when the main ``datalad`` package was a few years old already.
It brought the concept of dedicated *extractors*, including the legacy extractors that are supported to this day.
It also provided a range of dedicated metadata subcommands of a ``datalad metadata`` command such as ``aggregate`` and ``extract``, as well as a dedicated ``datalad search`` command.
Extracted metadata was stored in a dataset in (compressed) files using a JSON
stream format, separately for metadata describing a dataset as a whole, and
metadata describing individual files in a dataset.

The 2nd-gen metadata implementation was moved into the `datalad-deprecated <http://docs.datalad.org/projects/deprecated>`_ extension in 2022.


Third-generation metadata
-------------------------

The third generation of metadata commands was developed as the datalad-extension MetaLad.
Initially, until version ``0.2.1``, it was the continuation of developing 2nd generation metadata functionality.
Afterwards, beginning with the ``0.3.x`` series, the metadata model and command set were once more revised, resulting in the current 3rd-gen metadata implementation.
This implementation came with an entirely new metadata model.

Gen 2 versus gen 3 metadata
---------------------------

This section is important if you have used ``datalad-metalad`` prior to the ``0.3.0`` release.

Overview of changes
^^^^^^^^^^^^^^^^^^^

The new system in ``0.3.0`` is quite different from the previous release in a few ways:

1. Leaner commands with Unix-style behavior, i.e. one command for one operation, and commands are chainable (the results of one command can be used as input for another, e.g. ``meta-extract | meta-add``).

2. MetadataRecord modifications do not alter the state of the datalad dataset. In previous releases, changes to metadata altered the version (commit-hash) of the repository although the primary data did not change. This is not the case in the new system. The new system does provide information about the primary data version, i.e. commit-hash, from which the individual metadata elements were created.

3. The ability to support a wide range of metadata storage backends in the future. This is facilitated by the `datalad-metadata-model <https://github.com/datalad/metadata-model>`_, which is developed alongside metalad and separates the logical metadata model used in metalad from the storage backends by abstracting the storage backend. Currently, git-repository storage is supported.

4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a *metadata-store*, which is usually the git-repository of the datalad dataset that is described by the metadata. This is not a mandatory configuration, however; metadata can be stored in almost any git-repository.

5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. In fact only the minimal necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, for example, also to filtering.

6. A new simplified extractor model that distinguishes between two extractor-types: dataset-level extractors and file-extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported.

7. A built-in pipeline mechanism that allows parallel execution of metadata operations like metadata extraction, and metadata filtering. (Still in early stage)

8. A new set of commands that allow operations that map metadata to metadata. Those operations are called filtering and are implemented by MetadataFilter-classes. Filters are dynamically loaded and custom filters are supported, much like extractors. (Still in early stage)

Backward-compatibility
^^^^^^^^^^^^^^^^^^^^^^

Certain versions of MetaLad metadata are temporarily incompatible.

.. note:: Incompatibility of 0.3.0 and 0.2.x

   Please note that the metadata storage format introduced in release ``0.3.0`` is incompatible with the metadata storage format in previous versions, i.e. ``0.2.x``, and those in ``datalad-deprecated``.
   Both storage formats can coexist in storage, but version ``0.3.0`` of MetaLad will not be able to read metadata that was stored by the previous version and vice versa.
   Eventually there will be an importer that will pull old-version metadata into the new metadata storage.
3 changes: 2 additions & 1 deletion docs/source/design/index.rst
Expand Up @@ -13,4 +13,5 @@ The chapter describes the design of particular subsystems in DataLad.
:maxdepth: 2

conduct
datatypes
datatypes
history