Named graphs that know where they came from

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

See also Naming sparql service description's sd:NamedGraph and Querying datasets created by csv2rdf4lod.

Introduction

The design principles of Linked Data and the semantic web allow easy access to distributed data. Using these principles, disparate organizations can publish their data and link it to data from others without coordinating with one another.

Although the distributed nature of data on the semantic web is powerful, it is not the most convenient form when a single application needs to perform a particular analysis or present specific information to its user. Aggregating these distributed data into a single location and organizing them for efficient querying is a very common need, fulfilled by triple stores and SPARQL endpoints. Unfortunately, aggregating data for convenience and speed introduces challenges for consumers concerned with the provenance of their query results.

Named graphs, an official part of SPARQL, can be used to reflect the source of external data loaded into a triple store by naming the graph using the same URL from which the RDF was obtained -- but this implicit convention is insufficient. A named graph in a triple store is a conceptual grouping of data; it allows the loader to group different subsets of data according to their intended use and allows the consumer to query those subsets according to their needs.

For transparency, we load from the same web dump files that everyone else can obtain. By pulling from the web instead of local disk, we're raising the bar for ourselves to the point that everyone else can reproduce what we've done.

Using pvload.sh to record provenance while loading named graphs

$ source /opt/csv2rdf4lod-automation/source-me.sh # Make sure you have CSV2RDF4LOD_HOME set
$ pvload.sh http://sparql.tw.rpi.edu/source/usgs-gov/file/nwis-sites-vi/version/2011-Mar-20/conversion/usgs-gov-nwis-sites-vi-2011-Mar-20.void.ttl \
 -ng http://tw.rpi.edu/instances/JinZheng2011Sep16TEST

Go to http://sparql.tw.rpi.edu/virtuoso/sparql and execute:

select distinct ?Concept 
where { 
  graph <http://tw.rpi.edu/instances/JinZheng2011Sep16TEST> {
    [] a ?Concept
  }
}

(pvload.sh can also take a local file in the same way.)

pvload chaining

Based on discussions at http://inference-web.org/wiki/IW_Meeting_2011-04-15, the named graph shouldn't be the thing justified; instead, the pmlp:Information contained by the named graph should be justified. Subsequent loads to the same named graph should justify the new pmlp:Information with a pmlj:InferenceStep that references the pmlj:NodeSet from the previous pvload as part of its antecedentList (which includes a reference to the latest rdf-file-from-the-web as an additional antecedent). But we still want to associate each pmlp:Information with the named graph that contained it (because that's how we think about it), so for now we're using skos:broader. This should probably become some FRBR relation.

Named graphs that know where they came from ... and who put it there, and when they put it there (results on LOGB and LOBD):

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX sd:         <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc:       <http://rdfs.org/sioc/ns#>
PREFIX skos:       <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:       <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user   }
    OPTIONAL { ?infstep dcterms:date             ?when   }
    OPTIONAL { ?user    sioc:account_of          ?person }
  }
} ORDER BY DESC(?when)

Now that the pmlp:Information/named graph distinction is made, we need to obtain the URI for the pmlj:NodeSet that justifies the latest pvload (results):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:    <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd:      <http://www.w3.org/ns/sparql-service-description#>
SELECT ?graph ?modified
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1

When we know what graph we're loading into, we can ask for any justifications directly (results):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:    <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd:      <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {
  GRAPH <http://purl.org/twc/id/person/TimLebo> {
    ?justification
       pmlj:hasConclusion [ skos:broader [ sd:name <http://purl.org/twc/id/person/TimLebo> ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1
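
The query above hard-codes a single graph name. As a minimal sketch, the same query can be generated for any named graph from the shell. The function name `latest_justification_query` is hypothetical and is not part of csv2rdf4lod-automation; the query text is copied from above with only the graph URI substituted.

```shell
# Hypothetical helper (not part of csv2rdf4lod-automation): print the
# "latest justification" query shown above for an arbitrary named graph URI.
# The body runs in a subshell, so the variable does not leak to the caller.
latest_justification_query() (
  graph=$1
  cat <<EOF
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:    <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd:      <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {
  GRAPH <${graph}> {
    ?justification
       pmlj:hasConclusion [ skos:broader [ sd:name <${graph}> ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1
EOF
)
```

For example, `latest_justification_query http://purl.org/twc/id/person/TimLebo > latest.rq` writes the query to a file (the file name is arbitrary) that can then be submitted to the endpoint.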

Finding the dump file that went into the named graph

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX sd:         <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc:       <http://rdfs.org/sioc/ns#>
PREFIX skos:       <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlp:       <http://inference-web.org/2.0/pml-provenance.owl#>
PREFIX pmlj:       <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?when ?person ?file_loaded
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name 
         <http://sparql.tw.rpi.edu/source/usgs-gov/dataset/national-water-information-system-nwis-measurements/version/test-version> ] ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user   }
    OPTIONAL { ?infstep dcterms:date             ?when   }
    OPTIONAL { ?user    sioc:account_of          ?person }

    ?infstep pmlj:hasAntecedentList ( [ a ?type; pmlj:hasConclusion ?file_loaded ] ) .
  }
} ORDER BY DESC(?when)

Mirroring another endpoint's named graph

Although mirroring data from one endpoint into another has advantages, a significant disadvantage is the proliferation of data without an increase in information. Knowing that the content in a local named graph is identical to the content in another allows one to focus on sources that provide added conceptual value instead of treading in a sea of mindless duplication. A naive approach to consolidating identical content would be to diff the triples of two datasets, but this is a costly operation. Instead, we can use provenance to describe these common content associations and inform consumers with a simple, direct query.

Desired usage:

$ mirror-endpoint http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#

If we asked dbpedia what named graphs it has loaded:

select distinct ?g where { graph ?g {[] a ?Concept}}

We can grab the RDF content of one of them:

construct { ?s ?p ?o } where { graph <http://www.w3.org/2002/07/owl#> {?s ?p ?o} }

pvload.sh can be reused (but it doesn't model the provenance like cache-queries.sh does):

pvload.sh 'http://dbpedia.org/sparql?query=construct+{+%3Fs+%3Fp+%3Fo+}+where+{+graph+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+{%3Fs+%3Fp+%3Fo}+}&format=application%2Frdf%2Bxml' -ng http://www.w3.org/2002/07/owl#
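
The request URL above is the CONSTRUCT query encoded by hand. As a sketch, the encoding can be scripted in plain POSIX shell. `urlencode` and `construct_url` are hypothetical names, and the `query` and `format` parameters are simply the ones that appear in the URL above. Unlike that URL, this sketch also percent-encodes the braces, which is an equivalent encoding of the same query.

```shell
# Hypothetical helpers, not part of csv2rdf4lod-automation.

# Percent-encode a string for use in a URL query; spaces become '+',
# as in the hand-written URL above. Intended for ASCII input.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      ' ')             out=$out+ ;;
      *)               out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

# Build the URL that asks <endpoint> for all triples in <graph> as RDF/XML.
construct_url() (
  endpoint=$1
  graph=$2
  query="construct { ?s ?p ?o } where { graph <$graph> { ?s ?p ?o } }"
  printf '%s?query=%s&format=%s\n' \
    "$endpoint" "$(urlencode "$query")" "$(urlencode 'application/rdf+xml')"
)
```

With these in place, the load above could be written as `pvload.sh "$(construct_url http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#)" -ng http://www.w3.org/2002/07/owl#`.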

cache-queries.sh models the query and the endpoint explicitly, so we can shove the query into a file and run:

cache-queries.sh http://dbpedia.org/sparql -p format -o xml -q all.rq

This gives us our output in results/all.rq.xml, which begins:

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:about="http://www.w3.org/2002/07/owl#imports"><rdfs:range rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/></rdf:Description>
<rdf:Description rdf:about="http://www.w3.org/2002/07/owl#unionOf"><rdfs:label>unionOf</rdfs:label></rdf:Description>

Which we can load into the endpoint using:

vload `guess-syntax.sh results/all.rq.xml vload` results/all.rq.xml http://www.w3.org/2002/07/owl#

We can toss the provenance in, too:

vload `guess-syntax.sh results/all.rq.xml.pml.ttl vload` results/all.rq.xml.pml.ttl http://www.w3.org/2002/07/owl#

We can wrap all of this up with some nice usage in mirror-endpoint.sh:

$ mirror-endpoint.sh 
usage: mirror-endpoint.sh <endpoint> <named_graph> [named_graph] ...

$ mirror-endpoint.sh http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#
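
To make the steps above concrete, here is a dry-run sketch that prints, for each named graph, the commands walked through in this section. It is not the actual mirror-endpoint.sh. The function name `mirror_plan` and the query file names (`mirror-N.rq`) are hypothetical, and it assumes cache-queries.sh writes `results/<query>.xml` and `results/<query>.xml.pml.ttl`, generalizing from the `results/all.rq.xml` example above.

```shell
# Hypothetical dry-run sketch; NOT the actual mirror-endpoint.sh.
# It only prints commands and never contacts an endpoint.
mirror_plan() (
  if [ "$#" -lt 2 ]; then
    echo "usage: mirror_plan <endpoint> <named_graph> [named_graph] ..." >&2
    exit 1   # leaves the subshell, so the function returns 1
  fi
  endpoint=$1
  shift
  n=0
  for graph in "$@"; do
    n=$((n + 1))
    rq="mirror-$n.rq"        # assumed query file name
    xml="results/$rq.xml"    # assumed cache-queries.sh output path
    # 1. Put the CONSTRUCT query for this graph into a file.
    printf '%s\n' "echo 'construct { ?s ?p ?o } where { graph <$graph> { ?s ?p ?o } }' > $rq"
    # 2. Run the query, recording the query and the endpoint as provenance.
    printf '%s\n' "cache-queries.sh $endpoint -p format -o xml -q $rq"
    # 3. Load the results, then the provenance, into the same graph name.
    printf '%s\n' "vload \`guess-syntax.sh $xml vload\` $xml $graph"
    printf '%s\n' "vload \`guess-syntax.sh $xml.pml.ttl vload\` $xml.pml.ttl $graph"
  done
)
```

Running `mirror_plan http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#` prints four commands for review rather than executing them.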

Keeping the pvload provenance around

To keep the provenance that pvload.sh generates, export CSV2RDF4LOD_CONVERT_DEBUG_LEVEL=finest and then run pvload.sh. It will leave some temp files in your current working directory.

Old design: justifying the Named Graph (and not the now-correct pmlp:Information)

Named graphs that know where they came from ... and who put it there, and when they put it there (OLD results):

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX sd:         <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc:       <http://rdfs.org/sioc/ns#>
PREFIX pmlj:       <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ sd:name ?graph ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user   }
    OPTIONAL { ?infstep dcterms:date             ?when   }
    OPTIONAL { ?user    sioc:account_of          ?person }
  }
} ORDER BY DESC(?when)
