Discovery spec in first draft
pietercolpaert committed Oct 1, 2024
1 parent 534c750 commit 830aff5
Showing 1 changed file with 73 additions and 48 deletions.
121 changes: 73 additions & 48 deletions 03-discovery-specification.bs
@@ -11,89 +11,114 @@ Mailing List: public-treecg@w3.org
Mailing List Archives: https://lists.w3.org/Archives/Public/public-treecg/
Editor: Pieter Colpaert, https://pietercolpaert.be
Abstract:
This specification defines how a client can find specific search trees of interest, as well as list the context information.
This specification defines how a client selects a specific dataset and search tree, as well as extracts relevant context information.
</pre>

# The overview # {#overview}
# Definitions # {#overview}

A <code>tree:Collection</code> is a subclass of <code>dcat:Dataset</code> ([[!vocab-dcat-3]]).
A `tree:Collection` is a subclass of `dcat:Dataset` ([[!vocab-dcat-3]]).
The specialization is that this particular dataset is a collection of _members_.

A <code>tree:SearchTree</code> is a subClassOf <code>dcat:Distribution</code>.
A `tree:SearchTree` is a subClassOf `dcat:Distribution`.
The specialization is that it uses the main TREE specification to publish a search tree.

A node from which all other nodes can be found is a `tree:RootNode`, which MAY be explicitly typed as such.
A node from which all other nodes can be found is a `tree:RootNode`.

Note: The `tree:SearchTree` and the `tree:RootNode` MAY be identified by the same IRI when no disambiguation is needed.

A TREE client MUST be provided with a URL to start from, which we call the _entrypoint_.

# Initializing a client with a URL # {#starting-from}

The goal of the client is to understand what `tree:Collection` it is using, and to find a `tree:RootNode` or search form to start the traversal phase from.
The goal of the client is to understand what `tree:Collection` it is using, and to find a `tree:RootNode` to start the traversal phase from.
This discovery specification extends the initialization step in the TREE specification for the cases in which multiple options are possible.

```
IN: E: a URL of the entrypoint
OUT: N: tree:RootNode IRI and/or S: search form
```
The client MUST dereference the URL, which will result in a set of quads. The client now MUST first perform the init step from the main specification.
If that did not return any result, then the client MUST check whether the URL before redirects (`E`) has been used in one of the following discovery patterns described in the subsections:
1. `E` is a `tree:Collection`: then the client needs to [select the right search tree](#tree-search-trees)
2. `E` is a `dcat:Dataset`: then the client needs to [select the right distribution or dataservice from a catalog](#dcat-dataset)
3. `E` is an `ldes:EventStream`: then the client MAY take into account [LDES specific properties](#ldes)
4. `E` is a `dcat:Distribution`: then the client needs to [process it accordingly](#dcat-distribution)
5. `E` is a `dcat:DataService`: then the client needs to [process it accordingly](#dcat-dataservice)
6. `E` is a catalog or is not explicitly mentioned: then it needs to select a dataset based on [shape information](#tree-collection-shapes) and [DCAT Catalog information](#dcat-catalog)
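
Non-normative sketch: the dispatch over the patterns above could look as follows in Python. Triples are modelled as plain `(subject, predicate, object)` string tuples. The function name `classify_entrypoint`, the step labels and the first-match-wins ordering are assumptions made for illustration and are not defined by this specification. Subclass inference is ignored.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TREE = "https://w3id.org/tree#"
LDES = "https://w3id.org/ldes#"
DCAT = "http://www.w3.org/ns/dcat#"

# Checked in the order of the numbered list; the first match wins (an assumption).
# The step labels are hypothetical names, one per discovery pattern.
PATTERNS = [
    (TREE + "Collection", "select-search-tree"),
    (DCAT + "Dataset", "select-distribution-or-dataservice"),
    (LDES + "EventStream", "ldes-properties"),
    (DCAT + "Distribution", "process-distribution"),
    (DCAT + "DataService", "process-dataservice"),
]


def classify_entrypoint(entrypoint, triples):
    """Return the label of the discovery pattern that applies to entrypoint E."""
    # Collect every explicit rdf:type of E found in the dereferenced data.
    types = {o for s, p, o in triples if s == entrypoint and p == RDF_TYPE}
    for rdf_class, step in PATTERNS:
        if rdf_class in types:
            return step
    # E is a catalog or is not explicitly mentioned: select a dataset first.
    return "select-dataset-from-catalog"
```

A client would call this only after the init step from the main specification returned no result.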

The client MUST dereference the URL, which will result in a set of quads.
When the URL given to the TREE client, after all redirects, is used in a triple <code>ex:C tree:view <> .</code>, a client MUST assume the URL after redirects (`E'`) is an identifier of the intended `tree:RootNode` of the collection `ex:C`.
The client MUST check for this `tree:view` property and return the result of the discovery algorithm with `<> → N`.
## Selecting a collection via shapes ## {#tree-collection-shapes}

If there is no such triple, then the client MUST check whether the URL before redirects (`E`) has been used in one of the following patterns:
* `E tree:view ?N.` where there’s exactly one `?N`, then the algorithm MUST return `?N → N`.
* `E tree:rootNode ?N ; tree:search ?S .` then the algorithm MUST return `?N → N` and `?S → S`.
* `?DS dcat:servesDataset E ; dcat:endpointURL ?U` or `E dcat:endpointURL ?U`, then the algorithm MUST repeat the algorithm with `?U` as the entrypoint.
When multiple collections are found by a client, it can choose to prune the collections based on the `tree:shape` property.
The `tree:shape` property will refer to a first `sh:NodeShape`.
The collection MAY be pruned in case there is no overlap in properties the client needs.
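
Non-normative sketch: one possible pruning step in Python, assuming the client has already extracted the set of `sh:path` values from each collection's node shape. The function name `prune_collections`, the input layout, and the choice to keep collections that publish no shape are all assumptions, as this specification does not yet fix the algorithm.

```python
def prune_collections(shape_paths, needed):
    """Keep the collections whose shape overlaps with the properties the client needs.

    shape_paths maps a collection IRI to the set of property IRIs found in its
    node shape, or to None when the collection has no tree:shape.
    needed is the set of property IRIs the client is interested in.
    """
    kept = []
    for collection, paths in shape_paths.items():
        # Without a shape nothing can be ruled out, so the collection is kept.
        if paths is None or paths & needed:
            kept.append(collection)
    return kept
```

For example, with `needed = {"ex:name"}`, a collection whose shape only lists `ex:geo` would be pruned.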

Note: When data about the dataset, data service or search tree is found, it is a good idea to also pass this on to the client.
Issue: Will we document the precise algorithm to use? Should we extend shapes with cardinality approximations as well?

## tree:Collection ## {#collection}
## Selecting a collection via a catalog ## {#dcat-catalog}

In order to prioritize a specific view link, the relations and search forms in the entry nodes can be studied for their relation types, path or remaining items.
The class <code>tree:ViewDescription</code> indicates a specific TREE structure on a <code>tree:Collection</code>.
Through the property <code>tree:viewDescription</code> a <code>tree:Node</code> can link to an entity that describes the view, and can be reused in data portals as the <code>dcat:DataService</code>.
A DCAT Catalog is an overview of datasets, data services and distributions.
As TREE clients first need to select a dataset, and then a search tree to use, it aligns well with how DCAT-AP works.
DCAT discovery builds on the previous section, in which a collection or dataset can be selected based on the `tree:shape` property.

<div class="example">
```turtle
## What can be found in a tree:Node
ex:N1 a tree:Node ;
tree:viewDescription ex:View1 .

ex:C1 a tree:Collection ;
tree:view ex:N1 .

## What can be found on a data portal
ex:C1 a dcat:Dataset .
ex:View1 a tree:ViewDescription, dcat:DataService ;
dcat:endpointURL ex:N1 ; # The entry point that can be advertised in a data portal
dcat:servesDataset ex:C1 .
```
</div>
For now, we will assume the DCAT information is available in subject pages.

Issue: Do we need more text on how to handle different types of DCAT interfaces?

The dataset descriptions can be used for filtering the datasets available in a catalog to a list of datasets that can be useful for the client.
Such properties may include the spatial extent, the time extent, or whether it is part of another `dcat:Dataset`.

Issue: How precise do we need to be in this specification?

When the `dcat:Dataset` is a `tree:Collection`, the DCAT catalog will contain a `dct:type` property with `https://w3id.org/tree#Collection` or `https://w3id.org/ldes#EventStream` as the object.
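
Non-normative sketch: a client could use this `dct:type` hint to list the datasets in a catalog that are TREE collections. Triples are modelled as `(subject, predicate, object)` string tuples, and the function name `tree_datasets` is hypothetical.

```python
DCT_TYPE = "http://purl.org/dc/terms/type"
TREE_TYPES = {
    "https://w3id.org/tree#Collection",
    "https://w3id.org/ldes#EventStream",
}


def tree_datasets(triples):
    """Return the subjects whose dct:type marks them as a TREE collection."""
    found = []
    for s, p, o in triples:
        # Keep first-seen order and avoid listing a dataset twice.
        if p == DCT_TYPE and o in TREE_TYPES and s not in found:
            found.append(s)
    return found
```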

## Choosing from multiple SearchTrees with TREE ## {#tree-search-trees}

Issue: This is yet to be done

## Selecting a search tree via a DCAT dataset ## {#dcat-dataset}

There are two ways in which you can find a search tree from a dataset: via the distributions and via the data services. Both need to be tested.
Selecting a distribution or data service when multiple are available needs to be done based on [the search tree description](#tree-search-trees).
If nothing is available, all need to be tested by processing them as exemplified in the next subsections.

When there is no <code>tree:viewDescription</code> property in this page, a client either already discovered the description of this view in an earlier <code>tree:Node</code>, or the current <code>tree:Node</code> is implicitly the ViewDescription. Therefore, when the property path <code>tree:view → tree:viewDescription</code> does not yield a result, the view properties MUST be extracted from the object of the <code>tree:view</code> triple.
A <code>tree:Node</code> can also be double typed as the <code>tree:ViewDescription</code>. A client must thus check for ViewDescriptions both on the current node without the <code>tree:viewDescription</code> qualification and on the current node with the <code>tree:viewDescription</code> link.
### Selecting a search tree via DCAT Distribution ### {#dcat-distribution}

## dcat:Catalog ## {#collection}
When the pattern `E dcat:distribution ?D . ?D dcat:downloadURL ?N .` matches, `?N` is a root node of `E`.
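
Non-normative sketch: the distribution pattern can be matched over `(subject, predicate, object)` string tuples as follows. The function name `root_nodes_via_distributions` is hypothetical, and returning every match instead of choosing one is an assumption.

```python
DCAT = "http://www.w3.org/ns/dcat#"


def root_nodes_via_distributions(entrypoint, triples):
    """Return every ?N matching E dcat:distribution ?D . ?D dcat:downloadURL ?N ."""
    # First hop: the distributions ?D of the entrypoint E.
    distributions = [
        o for s, p, o in triples
        if s == entrypoint and p == DCAT + "distribution"
    ]
    # Second hop: the download URL ?N of each of those distributions.
    return [
        o for s, p, o in triples
        if s in distributions and p == DCAT + "downloadURL"
    ]
```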

When multiple collections are found by a client, it can choose to prune the collections based on the <code>tree:shape</code> property.
Therefore a data publisher SHOULD annotate a <code>tree:Collection</code> instance with a SHACL shape.
The <code>tree:shape</code> points to a SHACL description of the shape (<code>sh:NodeShape</code>).
Issue: This is yet to be done

Note: the shape can be a blank node, or a named node on which you should follow your nose when it is defined at a different HTTP URL.
### Selecting a search tree from a DCAT data service ### {#dcat-dataservice}

# Context data # {#context}
* `?DS dcat:servesDataset E ; dcat:endpointURL ?U` or `E dcat:endpointURL ?U`, then the algorithm MUST repeat the algorithm with `?U` as the entrypoint.
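
Non-normative sketch: collecting the candidate `?U` values for both variants of the data service pattern. Triples are `(subject, predicate, object)` string tuples and the function name `endpoint_urls` is hypothetical. Re-running discovery with each returned URL as the new entrypoint is left to the caller.

```python
DCAT = "http://www.w3.org/ns/dcat#"


def endpoint_urls(entrypoint, triples):
    """Return every ?U reachable from E through dcat:endpointURL."""
    urls = []
    for s, p, o in triples:
        if p != DCAT + "endpointURL":
            continue
        # Variant 1: E dcat:endpointURL ?U
        # Variant 2: ?DS dcat:servesDataset E ; dcat:endpointURL ?U
        serves_entrypoint = (s, DCAT + "servesDataset", entrypoint) in triples
        if (s == entrypoint or serves_entrypoint) and o not in urls:
            urls.append(o)
    return urls
```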

Issue: This is yet to be done

## Linked Data Event Streams ## {#ldes}

When the client is not built for query answering, but only for setting up a replication and synchronization system, a special type indicates that the search tree is made for this purpose: the `ldes:EventSource`.
Clients that want to prioritize taking a _full_ copy MAY give full priority to this server hint.

<div class="example">
```turtle
<E> a ldes:EventSource ;
tree:rootNode </node1> . # alternatively: dcat:downloadURL </node1>
```
</div>

# Extracting context information # {#context}

Context information is important to understand who the creator of a certain dataset is, when it was last changed, what other datasets it was derived from, etc.
Issue: This is yet to be done

TODO
Context information enables a client to understand who the creator of a certain dataset is, when it was last changed, what other datasets it was derived from, etc.

## DCAT and dcterms ## {#context-dcat}

Issue: This is yet to be done

## Provenance ## {#context-prov}

Issue: This is yet to be done

## Linked Data Event Streams ## {#context-ldes}

Issue: This is yet to be done

LDES (https://w3id.org/ldes/specification) is a way to evolve search trees in a consistent way. It defines every member as immutable, and a collection as append-only.
Therefore, one can make sure to only process each member once.
Extra terms are added, such as the concept of an EventStream, retention policies and a timestampPath.
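
Non-normative sketch: because members are immutable and the collection is append-only, a client can remember the member identifiers it has handled and skip them on later pages or polls. The function name `unseen_members` and the in-memory `set` are assumptions; a real client would likely persist this state.

```python
def unseen_members(page_member_ids, seen):
    """Return the member IRIs not processed before and record them in `seen`."""
    fresh = []
    for member_id in page_member_ids:
        # An immutable member never changes, so one identifier check suffices.
        if member_id not in seen:
            seen.add(member_id)
            fresh.append(member_id)
    return fresh
```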
