Named graphs that know where they came from
See also Naming sparql service description's sd:NamedGraph and Querying datasets created by csv2rdf4lod.
The design principles of Linked Data and the semantic web allow easy access to distributed data. Using these principles, disparate organizations can publish their data and link it to others' data in an uncoordinated fashion.
Although the distributed nature of data on the semantic web is powerful, it is not the most convenient form when a single application needs to perform a particular analysis or present specific information to its user. Aggregating these distributed data into a single location and organizing them for efficient querying is a very common need, and one that triple stores and SPARQL endpoints fulfill. Unfortunately, aggregating data for convenience and speed introduces challenges for consumers concerned with the provenance of their query results.
Named graphs, an official part of SPARQL, can be used to reflect the source of external data loaded into a triple store by naming the graph using the same URL from which the RDF was obtained -- but this implicit convention is insufficient. A named graph in a triple store is a conceptual grouping of data; it allows the loader to group different subsets of data according to their intended use and allows the consumer to query those subsets according to their needs.
For transparency, we load from the same web dump files that everyone else can obtain. By pulling from the web instead of local disk, we're raising the bar for ourselves to the point that everyone else can reproduce what we've done.
$ source /opt/csv2rdf4lod-automation/source-me.sh # Make sure you have CSV2RDF4LOD_HOME set
$ pvload.sh http://sparql.tw.rpi.edu/source/usgs-gov/file/nwis-sites-vi/version/2011-Mar-20/conversion/usgs-gov-nwis-sites-vi-2011-Mar-20.void.ttl \
-ng http://tw.rpi.edu/instances/JinZheng2011Sep16TEST
Go to http://sparql.tw.rpi.edu/virtuoso/sparql and execute:
select distinct ?Concept
where {
graph <http://tw.rpi.edu/instances/JinZheng2011Sep16TEST> {
[] a ?Concept
}
}
(pvload.sh can also take a local file in the same way.)
Based on discussions at http://inference-web.org/wiki/IW_Meeting_2011-04-15, the named graph itself shouldn't be the thing justified; instead, the pmlp:Information contained by the named graph should be justified. Subsequent loads to the same named graph should justify the new pmlp:Information with a pmlj:InferenceStep that references the pmlj:NodeSet from the previous pvload as part of its antecedentList (which also includes a reference to the latest rdf-file-from-the-web as an additional antecedent). We still want to associate each pmlp:Information with the named graph that contains it (because that's how we think about it), so for now we're using skos:broader. This should probably become some FRBR relation.
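To make the shape of that description concrete, here is a minimal sketch that writes out the Turtle for one load. It follows the patterns the queries on this page match against, and it is not the actual output of pvload.sh. The function name describe_load, its parameters, and the choice to type the file antecedent as a pmlj:NodeSet are assumptions made for illustration.

```python
# Hypothetical sketch: emit Turtle describing one load into a named graph.
# The shape mirrors the query patterns on this page; it is NOT pvload.sh's real output.

PREFIXES = """\
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix sd:   <http://www.w3.org/ns/sparql-service-description#> .
"""


def describe_load(nodeset, graph, source_url, previous_nodeset=None):
    """Return Turtle justifying the pmlp:Information loaded into `graph`."""
    antecedents = []
    if previous_nodeset is not None:
        # A reload cites the NodeSet that justified the graph's previous contents.
        antecedents.append(f"<{previous_nodeset}>")
    # Every load cites the file it fetched (typing it pmlj:NodeSet is an assumption).
    antecedents.append(f"[ a pmlj:NodeSet ; pmlj:hasConclusion <{source_url}> ]")
    antecedent_list = " ".join(antecedents)
    return (
        PREFIXES
        + f"\n<{nodeset}>\n"
        + "    a pmlj:NodeSet ;\n"
        + "    pmlj:hasConclusion [\n"
        + "        a pmlp:Information ;\n"
        + f"        skos:broader [ sd:name <{graph}> ]\n"
        + "    ] ;\n"
        + "    pmlj:isConsequentOf [\n"
        + "        a pmlj:InferenceStep ;\n"
        + f"        pmlj:hasAntecedentList ( {antecedent_list} )\n"
        + "    ] .\n"
    )


if __name__ == "__main__":
    # All three URIs below are placeholders.
    print(describe_load(
        nodeset="http://example.org/nodeset/2",
        graph="http://example.org/graph",
        source_url="http://example.org/dump.ttl",
        previous_nodeset="http://example.org/nodeset/1",
    ))
```

The pmlp:Information is the thing justified, and skos:broader ties it back to the named graph, so a reload adds a new NodeSet without touching the graph's name.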
Named graphs that know where they came from ... and who put it there, and when they put it there (results on LOGB and LOBD):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
GRAPH ?graph {
[] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
pmlj:isConsequentOf ?infstep .
OPTIONAL { ?infstep hartigprov:involvedActor ?user }
OPTIONAL { ?infstep dcterms:date ?when }
OPTIONAL { ?user sioc:account_of ?person }
}
} ORDER BY DESC(?when)
Now that the pmlp:Information/named graph distinction is made, we need to obtain the URI for the pmlj:NodeSet that justifies the latest pvload (results):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
SELECT ?graph ?modified
WHERE {
GRAPH ?graph {
[] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
pmlj:isConsequentOf [];
dcterms:created ?modified .
}
} ORDER BY DESC(?modified) LIMIT 1
When we know what graph we're loading into, we can ask for any justifications directly (results):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {
GRAPH <http://purl.org/twc/id/person/TimLebo> {
?justification
pmlj:hasConclusion [ skos:broader [ sd:name <http://purl.org/twc/id/person/TimLebo> ] ];
pmlj:isConsequentOf [];
dcterms:created ?modified .
}
} ORDER BY DESC(?modified) LIMIT 1
For a particular named graph, we can also ask when it was loaded, by whom, and from which file:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlp: <http://inference-web.org/2.0/pml-provenance.owl#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?when ?person ?file_loaded
WHERE {
GRAPH ?graph {
[] pmlj:hasConclusion [ skos:broader [ sd:name
<http://sparql.tw.rpi.edu/source/usgs-gov/dataset/national-water-information-system-nwis-measurements/version/test-version> ] ];
pmlj:isConsequentOf ?infstep .
OPTIONAL { ?infstep hartigprov:involvedActor ?user }
OPTIONAL { ?infstep dcterms:date ?when }
OPTIONAL { ?user sioc:account_of ?person }
?infstep pmlj:hasAntecedentList ( [ a ?type; pmlj:hasConclusion ?file_loaded ] ) .
}
} ORDER BY DESC(?when)
Although mirroring data from one endpoint into another has advantages, a significant disadvantage is the proliferation of data without an increase in information. Knowing that the content in a local named graph is identical to the content in another allows one to focus on sources that provide added conceptual value instead of treading in a sea of mindless duplication. A naive approach to consolidating identical content would be to diff the triples of two datasets, but this is a costly operation. Instead, we can use provenance to describe these common content associations and inform consumers with a simple, direct query.
Desired usage:
$ mirror-endpoint http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#
We can ask DBpedia which named graphs it has loaded:
select distinct ?g where { graph ?g {[] a ?Concept}}
We can grab the RDF content of one of them:
construct { ?s ?p ?o } where { graph <http://www.w3.org/2002/07/owl#> {?s ?p ?o} }
pvload.sh can be reused, though it doesn't model the provenance the way cache-queries.sh does:
pvload.sh 'http://dbpedia.org/sparql?query=construct+{+%3Fs+%3Fp+%3Fo+}+where+{+graph+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+{%3Fs+%3Fp+%3Fo}+}&format=application%2Frdf%2Bxml' -ng http://www.w3.org/2002/07/owl#
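Hand-encoding that query string is error-prone, so here is a small sketch that builds the same kind of URL for any endpoint and named graph. The function name construct_url is hypothetical, and the format parameter is simply copied from the URL above; other endpoints may expect a different way of choosing the result syntax.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def construct_url(endpoint, graph, result_format="application/rdf+xml"):
    """Build a GET URL asking `endpoint` for every triple in `graph`."""
    query = f"construct {{ ?s ?p ?o }} where {{ graph <{graph}> {{?s ?p ?o}} }}"
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return endpoint + "?" + urlencode({"query": query, "format": result_format})


if __name__ == "__main__":
    url = construct_url("http://dbpedia.org/sparql", "http://www.w3.org/2002/07/owl#")
    print(url)
    # Round trip: decoding the URL recovers the original query text.
    print(parse_qs(urlsplit(url).query)["query"][0])
```

This version also percent-encodes the curly braces, which the hand-written URL leaves as-is; both decode to the same query. The result can be passed to pvload.sh in single quotes, followed by -ng and the graph name.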
cache-queries.sh models the query and the endpoint explicitly, so we can shove the query into a file and run:
cache-queries.sh http://dbpedia.org/sparql -p format -o xml -q all.rq
This gives us our output, which we can view with vi results/all.rq.xml:
<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:about="http://www.w3.org/2002/07/owl#imports"><rdfs:range rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/></rdf:Description>
<rdf:Description rdf:about="http://www.w3.org/2002/07/owl#unionOf"><rdfs:label>unionOf</rdfs:label></rdf:Description>
Which we can load into the endpoint using:
vload `guess-syntax.sh results/all.rq.xml vload` results/all.rq.xml http://www.w3.org/2002/07/owl#
We can toss the provenance in, too:
vload `guess-syntax.sh results/all.rq.xml.pml.ttl vload` results/all.rq.xml.pml.ttl http://www.w3.org/2002/07/owl#
We can wrap all of this up with some nice usage in mirror-endpoint.sh:
$ mirror-endpoint.sh
usage: mirror-endpoint.sh <endpoint> <named_graph> [named_graph] ...
$ mirror-endpoint.sh http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#
To debug it, set
export CSV2RDF4LOD_CONVERT_DEBUG_LEVEL=finest
and run it. It will leave some temp files in your current working directory.
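The steps that mirror-endpoint.sh wraps up can be sketched as a dry run that only lists the commands shown above, one set per named graph. This is not the script itself, which may well do things differently. The function names, and the graph-N.rq query file naming, are hypothetical.

```python
import shlex


def construct_query(graph):
    """CONSTRUCT query that copies every triple out of one named graph."""
    return f"construct {{ ?s ?p ?o }} where {{ graph <{graph}> {{?s ?p ?o}} }}"


def mirror_plan(endpoint, named_graphs):
    """Yield (query_file, query_text, commands) for each graph to mirror."""
    for index, graph in enumerate(named_graphs, start=1):
        query_file = f"graph-{index}.rq"  # hypothetical file name
        result = f"results/{query_file}.xml"
        provenance = f"{result}.pml.ttl"
        target = shlex.quote(graph)  # graph names often contain '#'
        commands = [
            # 1. Run the query and record its provenance.
            f"cache-queries.sh {shlex.quote(endpoint)} -p format -o xml -q {query_file}",
            # 2. Load the retrieved triples into the same-named local graph.
            f"vload $(guess-syntax.sh {result} vload) {result} {target}",
            # 3. Load the provenance alongside them.
            f"vload $(guess-syntax.sh {provenance} vload) {provenance} {target}",
        ]
        yield query_file, construct_query(graph), commands


if __name__ == "__main__":
    graphs = ["http://www.w3.org/2002/07/owl#"]
    for query_file, query_text, commands in mirror_plan("http://dbpedia.org/sparql", graphs):
        print(f"# {query_file}: {query_text}")
        for command in commands:
            print(command)
```

Nothing is executed here; the printed lines are meant to be read, or pasted into a shell once the query text has been saved under the listed file name.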
Named graphs that know where they came from ... and who put it there, and when they put it there (OLD results):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
GRAPH ?graph {
[] pmlj:hasConclusion [ sd:name ?graph ];
pmlj:isConsequentOf ?infstep .
OPTIONAL { ?infstep hartigprov:involvedActor ?user }
OPTIONAL { ?infstep dcterms:date ?when }
OPTIONAL { ?user sioc:account_of ?person }
}
} ORDER BY DESC(?when)
- Design Objective: Capturing and Exposing Provenance
- http://webr3.org/blog/semantic-web/rdf-named-graphs-vs-graph-literals/
- "How do I refer to the quads that state that a triple was published at a web address yesterday?" 1