Named graphs that know where they came from
See also Naming sparql service description's sd:NamedGraph and Querying datasets created by csv2rdf4lod.
The design principles of Linked Data and the semantic web allow easy access to distributed data. Using these principles, disparate organizations can publish their data and link it to others' data without coordinating with one another.
Although the distributed nature of data on the semantic web is powerful, it is not the most convenient form when a single application needs to perform an analysis or present information to its user. Aggregating distributed data into a single location and organizing it for efficient querying is a very common need, fulfilled by triple stores and SPARQL endpoints. Unfortunately, aggregating data for convenience and speed introduces challenges for consumers concerned with the provenance of their query results.
Named graphs, an official part of SPARQL, can be used to reflect the source of external data loaded into a triple store -- but this implicit convention is insufficient. A named graph in a triple store is a conceptual grouping of data; it allows the loader to group different subsets of data according to their intended use and allows the consumer to query those subsets according to their needs.
Based on discussions at http://inference-web.org/wiki/IW_Meeting_2011-04-15, the named graph itself should not be the thing justified; the pmlp:Information contained by the named graph should be. Each subsequent load into the same named graph should justify the new pmlp:Information with a pmlj:InferenceStep whose antecedentList references the pmlj:NodeSet from the previous pvload, along with the latest RDF file from the web as an additional antecedent. We still want to associate each pmlp:Information with the named graph that contained it (because that's how we think about it), so for now we're using skos:broader. This should probably become some FRBR relation.
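The chain of justifications described above can be sketched as a small data model. This is only an illustration of the shape, not the csv2rdf4lod implementation: the class and function names are hypothetical, the classes merely stand in for the PML terms named in the comments, and the fetched file is simplified to a plain string rather than a justified node of its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Information:
    """Stands in for pmlp:Information: the content held by a named graph."""
    graph_name: str  # the named graph it is linked to via skos:broader / sd:name


@dataclass
class InferenceStep:
    """Stands in for pmlj:InferenceStep: one load into the graph."""
    antecedents: List[object] = field(default_factory=list)


@dataclass
class NodeSet:
    """Stands in for pmlj:NodeSet: the justification of one Information."""
    conclusion: Information       # pmlj:hasConclusion
    consequent_of: InferenceStep  # pmlj:isConsequentOf


def record_load(graph_name: str, source_file: str,
                previous: Optional[NodeSet] = None) -> NodeSet:
    """Justify the Information resulting from one load (hypothetical helper).

    The new step's antecedents are the NodeSet of the previous load, if any,
    followed by the file that was just loaded.
    """
    antecedents: List[object] = []
    if previous is not None:
        antecedents.append(previous)
    antecedents.append(source_file)
    return NodeSet(Information(graph_name), InferenceStep(antecedents))


# Two successive loads into the same (made-up) named graph.
graph = "http://example.org/graph/a"
first = record_load(graph, "http://example.org/data/v1.rdf")
second = record_load(graph, "http://example.org/data/v2.rdf", previous=first)
```

Here `second` is justified by a step that points back at `first`, so the history of a named graph can be walked from the most recent load to the earliest.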
Named graphs that know where they came from ... and who put them there, and when (results on LOGD and LOBD):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user }
    OPTIONAL { ?infstep dcterms:date ?when }
    OPTIONAL { ?user sioc:account_of ?person }
  }
} ORDER BY DESC(?when)
Now that the pmlp:Information/named graph distinction is made, we need to obtain the URI for the pmlj:NodeSet that justifies the latest pvload (results):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
SELECT ?graph ?modified
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1
When we know what graph we're loading into, we can ask for any justifications directly (results):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {
  GRAPH <http://purl.org/twc/id/person/TimLebo> {
    ?justification
      pmlj:hasConclusion [ skos:broader [ sd:name <http://purl.org/twc/id/person/TimLebo> ] ];
      pmlj:isConsequentOf [];
      dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1
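A loader script would need this query for whichever graph it is about to load into. A minimal sketch of filling in the graph name follows; `latest_justification_query` is a hypothetical helper, not something csv2rdf4lod provides. It rejects names containing characters that SPARQL does not allow inside `<...>`, so the graph name cannot break out of the IRI.

```python
import re

# The query for a known graph, with the graph name left as a placeholder.
QUERY_TEMPLATE = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {{
  GRAPH <{graph}> {{
    ?justification
      pmlj:hasConclusion [ skos:broader [ sd:name <{graph}> ] ];
      pmlj:isConsequentOf [];
      dcterms:created ?modified .
  }}
}} ORDER BY DESC(?modified) LIMIT 1
"""

# Characters the SPARQL grammar forbids inside an IRI reference (<...>).
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def latest_justification_query(graph: str) -> str:
    """Return the latest-justification query for one named graph (hypothetical helper)."""
    if not graph or _FORBIDDEN.search(graph):
        raise ValueError("not a usable graph IRI: %r" % graph)
    return QUERY_TEMPLATE.format(graph=graph)


query = latest_justification_query("http://purl.org/twc/id/person/TimLebo")
```

The resulting string can be sent to whichever endpoint holds the graph; how it is submitted is left out here because it depends on the store.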
Although mirroring data from one endpoint into another has advantages, a significant disadvantage is the proliferation of data without an increase in information. Knowing that the content in a local named graph is identical to the content in another allows one to focus on sources that provide added conceptual value instead of treading in a sea of mindless duplication. A naive approach to consolidating identical content would be to diff the triples of two datasets, but this is a costly operation. Instead, we can use provenance to describe these common content associations and inform consumers with a simple, direct query.
Desired usage:
$ mirror-endpoint http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#
We can ask DBpedia which named graphs it has loaded:
select distinct ?g where { graph ?g {[] a ?Concept}}
We can grab the RDF content of one of them:
construct { ?s ?p ?o } where { graph <http://www.w3.org/2002/07/owl#> {?s ?p ?o} }
pvload.sh can be reused, although it does not model the provenance the way cache-queries.sh does:
pvload.sh 'http://dbpedia.org/sparql?default-graph-uri=&query=construct+{+%3Fs+%3Fp+%3Fo+}+where+{+graph+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+{%3Fs+%3Fp+%3Fo}+}&format=application%2Frdf%2Bxml' -ng http://www.w3.org/2002/07/owl#
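The long URL passed to pvload.sh is just the CONSTRUCT query above, percent-encoded into the endpoint's query string. A sketch of building it for any endpoint and graph is shown below; `construct_graph_url` is a hypothetical helper, and the parameter names (`default-graph-uri`, `query`, `format`) are simply the ones that appear in the URL above. Unlike the hand-written URL, `urlencode` also escapes the curly braces, which decodes to the same query.

```python
from urllib.parse import urlencode


def construct_graph_url(endpoint: str, graph: str,
                        fmt: str = "application/rdf+xml") -> str:
    """Build a URL that asks `endpoint` for every triple in `graph` (hypothetical helper)."""
    query = "construct { ?s ?p ?o } where { graph <%s> {?s ?p ?o} }" % graph
    params = [
        ("default-graph-uri", ""),  # left empty, as in the URL above
        ("query", query),
        ("format", fmt),
    ]
    # urlencode percent-encodes each value and turns spaces into '+'.
    return endpoint + "?" + urlencode(params)


url = construct_graph_url("http://dbpedia.org/sparql",
                          "http://www.w3.org/2002/07/owl#")
```

The value of `url` can then be handed to pvload.sh together with `-ng http://www.w3.org/2002/07/owl#`, as in the command above.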
We'd like to avoid LOGD because SparqlProxy is inadequate and the direct Virtuoso endpoint is behind the firewall.
Named graphs that know where they came from ... and who put them there, and when (OLD results, from the earlier form of the query without the skos:broader link):
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sd: <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX pmlj: <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ sd:name ?graph ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user }
    OPTIONAL { ?infstep dcterms:date ?when }
    OPTIONAL { ?user sioc:account_of ?person }
  }
} ORDER BY DESC(?when)