<!-- begin of list of resources BP -->
<div class="practice">
<p><span id="list_resources" class="practicelab">Maintain a list of
resources described in a dataset</span></p>
<p class="practicedesc">A dataset <em class="rfc2119">should</em> be
preserved together with a list of all the resources it describes.</p>
<section class="axioms">
<p class="subhead">Why</p>
<!-- <p>Web data is about the description of resources identified with a
URI. It is to be expected that queries made by data consumers will
resolve around finding a preserved description about a particular
resource. </p>-->
<p>Data on the Web is essentially about the description of entities
identified by a unique, Web-based identifier (a URI). Once the
data is dumped and sent to an institute specialised in digital
preservation, the link with the Web, that is de-referencing, is
broken, but the URI keeps its role as a unique identifier.
Maintaining a list of these identifiers increases the usability
of preserved dataset dumps.</p>
</section>
<section class="outcome">
<p class="subhead">Intended Outcome</p>
<p>It <em class="rfc2119">should</em> be possible to look for a
preserved dataset based on the resources contained in it.</p>
</section>
<section class="how">
<p class="subhead">Possible Approach to Implementation</p>
<p>The list of resources can be created by the data depositor or the
digital repository at ingestion time. A dataset dump is scanned
for all the subjects described and this list is stored separately.
For RDF and JSON-LD, this corresponds to all the resources in the
"subject" position of statements.</p>
</section>
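<p>As an illustration of the scan described above, the following minimal sketch collects the URIs found in subject position of an N-Triples dump. It assumes one statement per line and uses a simplified pattern rather than a full parser; the function name <code>list_subjects</code> and the sample data are hypothetical.</p>

```python
import re

# Simplified pattern for the subject of an N-Triples statement:
# an IRI in angle brackets or a blank node label at the start of the line.
SUBJECT_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s')


def list_subjects(ntriples_text):
    """Return the sorted, de-duplicated URIs found in subject position."""
    subjects = set()
    for line in ntriples_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and comments
        match = SUBJECT_RE.match(line)
        # Blank nodes are ignored: they are not Web identifiers.
        if match and match.group(1).startswith('<'):
            subjects.add(match.group(1)[1:-1])  # strip the angle brackets
    return sorted(subjects)


# Hypothetical sample dump used only for illustration.
SAMPLE_DUMP = """\
<http://example.org/a> <http://example.org/name> "Alice" .
<http://example.org/b> <http://example.org/knows> <http://example.org/a> .
# a comment line
_:b0 <http://example.org/name> "Anonymous" .
<http://example.org/a> <http://example.org/age> "30" .
"""

resource_list = list_subjects(SAMPLE_DUMP)
```

<p>The resulting list can then be stored next to the dump. A production workflow would more likely rely on a complete RDF parser.</p>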
<section class="test">
<p class="subhead">How to Test</p>
<p>Every resource listed in the resource list must
have some kind of description in the dataset.</p>
</section>
<section class="ucr">
<p class="subhead">Evidence</p>
<p><span>Relevant requirements</span>:<a href="http://www.w3.org/TR/dwbp-ucr/#R-UniqueIdentifier">R-UniqueIdentifier</a></p>
</section>
</div>
<!-- end of list of resources BP -->
<!-- begin of assess dataset BP -->
<div class="practice">
<p><span id="evaluate" class="practicelab">Assess dataset coverage</span></p>
<p class="practicedesc">The coverage of a dataset <em class="rfc2119">should</em>
be assessed prior to its preservation</p>
<section class="axioms">
<p class="subhead">Why</p>
<p>A chunk of Web data is by definition dependent on the rest of the
global graph. This global context influences the meaning of the
description of the resources found in the dataset. Ideally, the
preservation of a particular dataset would involve preserving all
its context, that is, the entire Web of Data.</p>
<p>At ingestion time, the linkage of a Web data dump to already
preserved resources is evaluated. All the vocabularies and target
resources in use are looked up in a set of digital archives that
preserve Web data. Datasets for which very few of the vocabularies
used and/or resources pointed to are already preserved somewhere
should be flagged as being at risk.</p>
</section>
<section class="outcome">
<p class="subhead">Intended Outcome</p>
<p>An evaluation of the preservation coverage of a given dataset
<em class="rfc2119">should</em> be carried out.</p>
</section>
<section class="how">
<p class="subhead">Possible Approach to Implementation</p>
<p>The assessment could be performed by the digital preservation
institute or the dataset depositor. It essentially consists of
checking whether all the resources used are either already
preserved somewhere or provided along with the new dataset
considered for preservation.</p>
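<p>One possible way to turn this check into a number is sketched below. It assumes the referenced, already preserved, and provided resources are available as collections of URIs; the function names and the 0.5 threshold are hypothetical choices, not part of this best practice.</p>

```python
def coverage_score(referenced, preserved, provided=()):
    """Fraction of referenced URIs that are preserved elsewhere
    or shipped along with the dataset being deposited."""
    referenced = set(referenced)
    if not referenced:
        return 1.0  # nothing is referenced, so nothing is missing
    covered = referenced & (set(preserved) | set(provided))
    return len(covered) / len(referenced)


def at_risk(score, threshold=0.5):
    """Flag a dataset whose coverage falls below an assumed threshold."""
    return score < threshold


# Hypothetical example: four referenced resources, two of them covered.
example_score = coverage_score(
    referenced=[
        "http://example.org/vocab#name",
        "http://example.org/vocab#knows",
        "http://example.org/people/a",
        "http://example.org/people/b",
    ],
    preserved=["http://example.org/vocab#name"],
    provided=["http://example.org/people/a"],
)
```

<p>A dataset that mostly points to unpreserved resources gets a score close to 0 and can be flagged with <code>at_risk</code>.</p>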
</section>
<section class="test">
<p class="subhead">How to Test</p>
<p>Datasets making references to portions of the Web of Data which
are not preserved <em class="rfc2119">should</em> receive a lower
score than those using common resources.</p>
</section>
<section class="ucr">
<p class="subhead">Evidence</p>
<p><span>Relevant requirements</span>:<a href="http://www.w3.org/TR/dwbp-ucr/#R-VocabReference">R-VocabReference</a></p>
</section>
</div>
<!-- end of assess dataset BP -->
<!-- begin of serialisation BP -->
<div class="practice">
<p><span id="serialisation" class="practicelab">Use a trusted
serialisation format for preserved data dumps</span></p>
<p class="practicedesc">Data depositors willing to send a data dump for
long-term preservation <em class="rfc2119">must</em> use a
well-established serialisation.</p>
<section class="axioms">
<p class="subhead">Why</p>
<p>Web data is an abstract data model that can be expressed in
different ways (RDF, JSON-LD, ...). Using a well-established
serialisation of this data increases its chances of re-use. </p>
<p>Institutes doing digital preservation are tasked with monitoring
file format obsolescence. Datasets which were acquired in
some format years ago may have to be converted into another
format in order to remain usable with more modern software (see
[[ROSENTHAL]]). This task can be made more challenging, or even
impossible, if non-standard serialisation formats are used by data
depositors.</p>
</section>
<section class="outcome">
<p class="subhead">Intended Outcome</p>
<p>It <em class="rfc2119">should</em> be possible to read and load the dataset into a database even if the software originally used with it is no longer supported.</p>
</section>
<section class="how">
<p class="subhead">Possible Approach to Implementation</p>
<p>Give preference to Web data serialisation formats available as
open standards, for instance those provided by the W3C
[[FORMATS]]. </p>
</section>
<section class="test">
<p class="subhead">How to Test</p>
<!--<p>Try to open the data dump with different
software.</p> -->
<p>Try to dereference the URI of the data dump with an Accept
header naming the format you expect to get, using for
example [[cURL]], and check that the Content-Type header of the
response matches.</p>
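<p>The same kind of check can be scripted. The sketch below, which uses only the Python standard library, requests a given media type through the Accept header and compares it with the Content-Type of the response. The function names are hypothetical, and <code>check_dump</code> needs network access to a real dump URI.</p>

```python
from urllib.request import Request, urlopen


def build_dump_request(uri, media_type):
    """Build a GET request asking for one specific serialisation."""
    return Request(uri, headers={"Accept": media_type})


def content_type_matches(content_type_header, expected_media_type):
    """Compare media types, ignoring parameters such as charset."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == expected_media_type.lower()


def check_dump(uri, media_type):
    """Dereference the dump URI and report whether the expected
    serialisation was returned (performs a network request)."""
    with urlopen(build_dump_request(uri, media_type)) as response:
        returned = response.headers.get("Content-Type", "")
        return content_type_matches(returned, media_type)
```

<p>For example, <code>check_dump(dump_uri, "text/turtle")</code> would return <code>True</code> when a server answers with <code>text/turtle; charset=utf-8</code>, where <code>dump_uri</code> stands for the URI of the dump under test.</p>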
</section>
<section class="ucr">
<p class="subhead">Evidence</p>
<p><span>Relevant requirements</span>:<a href="http://www.w3.org/TR/dwbp-ucr/#R-Archive">R-Archive</a></p>
</section>
</div>
<!-- end of serialisation BP -->
<!-- begin of resource status BP -->
<div class="practice">
<p><span id="resourcestatus" class="practicelab">Update the status of a URI</span></p>
<p class="practicedesc">Preserved datasets <em class="rfc2119">should</em>
be linked with their "live" counterparts</p>
<section class="axioms">
<p class="subhead">Why</p>
<p>URI dereferencing is a primary interface to data on the Web. Linking
preserved datasets with the original URI informs the data consumer of
the status of these resources. </p>
<p>During its life cycle a dataset may undergo several
modifications. Although URIs assigned to things are not expected
to change, the description of these resources will evolve over
time. <!--There are also some new IRIs that will be put into use, some
other that will become deprecated, and some that will get deleted. -->
During this evolution, several snapshots could be made available
for preservation. An example of this is <a href="http://dbpedia.org">DBpedia</a>, which has undergone
several releases since its first publication and always uses the
same URI for its resources. Every resource is de-referenced to the
most up-to-date description available for it, along with a link to
preserved descriptions using the Memento protocol [[RFC7089]]
(see <a href="http://mementoweb.org/depot/native/dbpedia/">Memento
gateway for DBpedia</a>).</p>
</section>
<section class="outcome">
<p class="subhead">Intended Outcome</p>
<p>A link is maintained between the URI of a resource, the most
up-to-date description available for it, and preserved
descriptions. If the resource does not exist any more the
description <em class="rfc2119">should</em> say so and refer to
the last preserved description that was available.</p>
</section>
<section class="how">
<p class="subhead">Possible Approach to Implementation</p>
<p>There is a variety of HTTP status codes that could be put into use
to relate the URI with its preserved description. In particular,
200, 410 and 303 can be used for different scenarios:</p>
<ul>
<li>200 => there is a new description which contains pointers
to archived description</li>
<!-- <li>404 => the resource just died, the URI consumer has to go
find an archive to look for it</li> -->
<li>410 => the resource is no longer available and has been removed under a controlled process (compare with 404, which simply states that something is not available).</li>
<li>303 => the resource identified by this URI is no longer served here
but there is a preserved description at a different location.</li>
<!-- <li>209 => this resource does not exist any more but we have
some information about it. The description could include a list
of locations having different preserved descriptions over
different times.</li> -->
</ul>
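<p>To illustrate how a client might read the status codes listed above, here is a minimal sketch. The wording of each interpretation paraphrases the list, and the function name <code>describe_status</code> is hypothetical.</p>

```python
from http import HTTPStatus

# Interpretations paraphrase the list above; the wording is illustrative.
STATUS_MEANING = {
    HTTPStatus.OK: "live description, expected to point to archived descriptions",
    HTTPStatus.SEE_OTHER: "no longer served here, preserved description elsewhere",
    HTTPStatus.GONE: "removed under a controlled process",
    HTTPStatus.NOT_FOUND: "not available, with no statement about why",
}


def describe_status(status_code, location=None):
    """Return a human-readable reading of a response status code."""
    code = int(status_code)
    meaning = STATUS_MEANING.get(code)
    if meaning is None:
        return f"{code}: not covered by this sketch"
    text = f"{code} {HTTPStatus(code).phrase}: {meaning}"
    # For a 303, the Location header tells where the preserved copy lives.
    if code == HTTPStatus.SEE_OTHER and location:
        text += f" ({location})"
    return text
```

<p>A client receiving a 303 would pass the value of the <code>Location</code> response header as <code>location</code> to learn where the preserved description is.</p>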
<p>In addition to the status codes, HTTP Link headers can also be used to relate
resources to preserved descriptions.</p>
</section>
<section class="test">
<p class="subhead">How to Test</p>
<p>Check that de-referencing the URI of a preserved dataset returns information
about its current status and availability.</p>
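<p>Part of this check can be automated by inspecting the <code>Link</code> header of the response. The sketch below assumes the server uses the Memento relation types (<code>original</code>, <code>timegate</code>, <code>timemap</code>, <code>memento</code>) and relies on a simplified header parser; the function names and sample header are hypothetical.</p>

```python
import re

# Simplified grammar: <target> followed by ;name=value parameters,
# where a value is either quoted or a bare token.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Return a list of (target URI, parameters) pairs."""
    links = []
    for target, raw_params in LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


def targets_with_rel(value, relation):
    """Return the targets whose rel parameter includes the given relation."""
    return [
        target
        for target, params in parse_link_header(value)
        if relation in params.get("rel", "").lower().split()
    ]


# Hypothetical header value used only for illustration.
SAMPLE_LINK = (
    '<http://example.org/resource>; rel="original", '
    '<http://archive.example.org/20100101/resource>; rel="memento"; '
    'datetime="Fri, 01 Jan 2010 00:00:00 GMT"'
)
```

<p>A response whose <code>Link</code> header yields at least one <code>memento</code> or <code>timegate</code> target gives the consumer a way to reach preserved descriptions.</p>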
</section>
<section class="ucr">
<p class="subhead">Evidence</p>
<p><span>Relevant requirements</span>:<a href="http://www.w3.org/TR/dwbp-ucr/#R-DataUnavailabilityReference">R-DataUnavailabilityReference</a>,
<a href="http://www.w3.org/TR/dwbp-ucr/#R-PersistentIdentification">
R-PersistentIdentification</a></p>
</section>
</div>
<!-- end of resource status BP -->