Named graphs that know where they came from

timrdf edited this page Jan 6, 2013 · 74 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

See also Naming sparql service description's sd:NamedGraph and Querying datasets created by csv2rdf4lod.

Introduction

The design principles of Linked Data and the semantic web allow easy access to distributed data. Using these principles, disparate organizations can publish their data and link it to others' data without coordinating with one another.

Although the distributed nature of data on the semantic web is powerful, it is not the most convenient form when a single application needs to perform a particular analysis or present specific information to its user. Aggregating this distributed data into a single location and organizing it for efficient querying is a very common need, fulfilled by triple stores and SPARQL endpoints. Unfortunately, aggregating data for convenience and speed introduces challenges for consumers concerned with the provenance of their query results.

Named graphs, an official part of SPARQL, can be used to reflect the source of external data loaded into a triple store by naming the graph using the same URL from which the RDF was obtained -- but this implicit convention is insufficient. A named graph in a triple store is a conceptual grouping of data; it allows the loader to group different subsets of data according to their intended use and allows the consumer to query those subsets according to their needs.

For transparency, we load from the same web dump files that everyone else can obtain. By pulling from the web instead of local disk, we're raising the bar for ourselves to the point that everyone else can reproduce what we've done.

Using pvload.sh to record provenance while loading named graphs

$ source /opt/csv2rdf4lod-automation/source-me.sh # Make sure you have CSV2RDF4LOD_HOME set
$ pvload.sh http://sparql.tw.rpi.edu/source/usgs-gov/file/nwis-sites-vi/version/2011-Mar-20/conversion/usgs-gov-nwis-sites-vi-2011-Mar-20.void.ttl \
 -ng http://tw.rpi.edu/instances/JinZheng2011Sep16TEST

Then go to http://sparql.tw.rpi.edu/virtuoso/sparql and execute:

select distinct ?Concept 
where { 
  graph <http://tw.rpi.edu/instances/JinZheng2011Sep16TEST> {
    [] a ?Concept
  }
}

(pvload.sh can also take a local file in the same way.)

pvload chaining

Based on discussions at http://inference-web.org/wiki/IW_Meeting_2011-04-15, the named graph itself should not be the thing that is justified; instead, the pmlp:Information contained in the named graph should be justified. Each subsequent load into the same named graph should justify the new pmlp:Information with a pmlj:InferenceStep whose antecedent list references the pmlj:NodeSet from the previous pvload, plus the latest RDF file from the web as an additional antecedent. We still want to associate each pmlp:Information with the named graph that contained it (because that's how we think about it), so for now we're using skos:broader. This should probably become some FRBR relation.
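
The chaining described above can be pictured with a small Turtle generator. This is only a sketch under stated assumptions: the function name chained_load_turtle, its argument order, the use of blank nodes, the pmlj:NodeSet type on the file antecedent, and the order of the two antecedents are illustrative guesses. Only the property and class names are taken from this page, and the Turtle that pvload.sh actually writes may differ.

```shell
# Hypothetical sketch (NOT pvload.sh's real output): print Turtle for one
# chained load of a file into a graph, given the previous load's NodeSet.
chained_load_turtle() {
  nodeset="$1"       # URI of the new pmlj:NodeSet (assumed to be minted by the caller)
  prev_nodeset="$2"  # URI of the pmlj:NodeSet from the previous pvload
  file_url="$3"      # RDF file from the web that is being added
  graph="$4"         # named graph that receives the data
  cat <<EOF
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix sd:   <http://www.w3.org/ns/sparql-service-description#> .
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .

<$nodeset>
   a pmlj:NodeSet;
   # The Information is justified, not the graph; skos:broader ties it to the graph.
   pmlj:hasConclusion [ a pmlp:Information; skos:broader [ sd:name <$graph> ] ];
   pmlj:isConsequentOf [
      a pmlj:InferenceStep;
      # Antecedents: the previous load, plus the newly retrieved file.
      pmlj:hasAntecedentList (
         <$prev_nodeset>
         [ a pmlj:NodeSet; pmlj:hasConclusion <$file_url> ]
      )
   ] .
EOF
}
```

Note that a two-element antecedent list like this one would not match a query pattern that expects a single-element list, so queries over chained loads may need to walk the list.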

Named graphs that know where they came from ... and who put it there, and when they put it there (results on LOGB and LOBD):

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX sd:         <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc:       <http://rdfs.org/sioc/ns#>
PREFIX skos:       <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:       <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user   }
    OPTIONAL { ?infstep dcterms:date             ?when   }
    OPTIONAL { ?user    sioc:account_of          ?person }
  }
} ORDER BY DESC(?when)

Now that the pmlp:Information/named graph distinction is made, we need to obtain the URI for the pmlj:NodeSet that justifies the latest pvload (results):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:    <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd:      <http://www.w3.org/ns/sparql-service-description#>
SELECT ?graph ?modified
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name ?graph ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1

When we know what graph we're loading into, we can ask for any justifications directly (results):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:    <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd:      <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {
  GRAPH <http://purl.org/twc/id/person/TimLebo> {
    ?justification
       pmlj:hasConclusion [ skos:broader [ sd:name <http://purl.org/twc/id/person/TimLebo> ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1
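
The graph name in the query above is hard-coded. As a minimal sketch, the same query can be generated for any graph; the function name latest_justification_query is hypothetical and not part of csv2rdf4lod-automation.

```shell
# Hypothetical helper: print the "latest justification" query for one named graph.
latest_justification_query() {
  graph="$1"   # URI of the named graph we are about to load into
  cat <<EOF
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlj:    <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX sd:      <http://www.w3.org/ns/sparql-service-description#>
SELECT ?justification ?modified
WHERE {
  GRAPH <$graph> {
    ?justification
       pmlj:hasConclusion [ skos:broader [ sd:name <$graph> ] ];
       pmlj:isConsequentOf [];
       dcterms:created ?modified .
  }
} ORDER BY DESC(?modified) LIMIT 1
EOF
}
```

Assuming the endpoint accepts standard SPARQL protocol GET requests, the result could be sent with something like: curl -G --data-urlencode "query=$(latest_justification_query http://purl.org/twc/id/person/TimLebo)" http://sparql.tw.rpi.edu/virtuoso/sparql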

Finding the dump file that went into the named graph

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX sd:         <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc:       <http://rdfs.org/sioc/ns#>
PREFIX skos:       <http://www.w3.org/2004/02/skos/core#>
PREFIX pmlp:       <http://inference-web.org/2.0/pml-provenance.owl#>
PREFIX pmlj:       <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?when ?person ?file_loaded
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ skos:broader [ sd:name 
         <http://sparql.tw.rpi.edu/source/usgs-gov/dataset/national-water-information-system-nwis-measurements/version/test-version> ] ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user   }
    OPTIONAL { ?infstep dcterms:date             ?when   }
    OPTIONAL { ?user    sioc:account_of          ?person }

    ?infstep pmlj:hasAntecedentList ( [ a ?type; pmlj:hasConclusion ?file_loaded ] ) .
  }
} ORDER BY DESC(?when)

Mirroring another endpoint's named graph

Although mirroring data from one endpoint into another has advantages, a significant disadvantage is the proliferation of data without an increase in information. Knowing that the content in a local named graph is identical to the content in another allows one to focus on sources that provide added conceptual value instead of treading in a sea of mindless duplication. A naive approach to consolidating identical content would be to diff the triples of two datasets, but this is a costly operation. Instead, we can use provenance to describe these common content associations and inform consumers with a simple, direct query.

Desired usage:

$ mirror-endpoint http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#

We can ask DBpedia which named graphs it has loaded:

select distinct ?g where { graph ?g {[] a ?Concept}}

We can grab the RDF content of one of them:

construct { ?s ?p ?o } where { graph <http://www.w3.org/2002/07/owl#> {?s ?p ?o} }

pvload.sh can be reused (but it doesn't model the provenance like cache-queries.sh does):

pvload.sh 'http://dbpedia.org/sparql?query=construct+{+%3Fs+%3Fp+%3Fo+}+where+{+graph+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E+{%3Fs+%3Fp+%3Fo}+}&format=application%2Frdf%2Bxml' -ng http://www.w3.org/2002/07/owl#
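
The URL above was encoded by hand. As a sketch of how such a URL could be assembled, the helper below percent-encodes a single-line CONSTRUCT query and appends it to the endpoint. The function names urlencode and construct_url are hypothetical; the format parameter is copied from the example above and may be specific to that endpoint. Spaces come out as %20 rather than +, which servers decode the same way in a query string.

```shell
# Hypothetical helpers, not part of csv2rdf4lod-automation.

# Percent-encode an ASCII, single-line string, keeping only the
# RFC 3986 unreserved characters (letters, digits, ".", "_", "~", "-").
urlencode() {
  printf '%s' "$1" | LC_ALL=C awk '
    BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c ~ /^[A-Za-z0-9._~-]$/) out = out c
        else out = out sprintf("%%%02X", ord[c])
      }
      printf "%s", out
    }'
}

# Print a URL that asks <endpoint> for every triple in <graph>.
construct_url() {
  endpoint="$1"
  graph="$2"
  query="construct { ?s ?p ?o } where { graph <$graph> { ?s ?p ?o } }"
  printf '%s?query=%s&format=%s\n' \
    "$endpoint" "$(urlencode "$query")" "$(urlencode 'application/rdf+xml')"
}
```

With these in place, the load could be written as: pvload.sh "$(construct_url http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#)" -ng http://www.w3.org/2002/07/owl#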

cache-queries.sh models the query and the endpoint explicitly, so we can shove the query into a file and run:

cache-queries.sh http://dbpedia.org/sparql -p format -o xml -q all.rq

This gives us our output at results/all.rq.xml:

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:about="http://www.w3.org/2002/07/owl#imports"><rdfs:range rdf:resource="http://www.w3.org/2002/07/owl#Ontology"/></rdf:Description>
<rdf:Description rdf:about="http://www.w3.org/2002/07/owl#unionOf"><rdfs:label>unionOf</rdfs:label></rdf:Description>

Which we can load into the endpoint using:

vload `guess-syntax.sh results/all.rq.xml vload` results/all.rq.xml http://www.w3.org/2002/07/owl#

We can toss the provenance in, too:

vload `guess-syntax.sh results/all.rq.xml.pml.ttl vload` results/all.rq.xml.pml.ttl http://www.w3.org/2002/07/owl#

We can wrap all of this up with some nice usage in mirror-endpoint.sh:

$ mirror-endpoint.sh 
usage: mirror-endpoint.sh <endpoint> <named_graph> [named_graph] ...

$ mirror-endpoint.sh http://dbpedia.org/sparql http://www.w3.org/2002/07/owl#

That gives us two ways to load another endpoint's named graph into our local endpoint:

  • pvload a url-encoded SPARQL CONSTRUCT URL into a named graph
  • cache-queries.sh the url-encoded SPARQL CONSTRUCT URL to local file, then load both into a named graph.

Mirroring another csv2rdf4lod-automation endpoint's named graph

The two previous mirroring options work for any moderately sized graph in any SPARQL 1.0 endpoint, but we can leverage the provenance that csv2rdf4lod-automation includes in all of its own endpoint loads to mirror it by dump file. This allows us to avoid inefficient CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } queries. So, we'll make a third mirror script, df-mirror-endpoint.sh, that caches the dump files of other csv2rdf4lod-automation endpoints. We'll borrow the "directory-as-url" convention established in DataFAQs to store all dump files that a remote csv2rdf4lod-automation endpoint has loaded. This allows us to get a broad collection of disparate RDF datasets. For example, the following endpoints have data loaded by csv2rdf4lod-automation:

What if we wanted to get all of the data that they've produced, just by downloading their RDF dump files?

> cd ~/projects/csv2rdf4lod/data/csv2rd4lod-nodes

> df-mirror-endpoint.sh http://healthdata.tw.rpi.edu/sparql http://aquarius.tw.rpi.edu/projects/provenanceweb/sparql
...

> du -sh *
8.1M	aquarius.tw.rpi.edu
504M	healthdata.tw.rpi.edu

We now have a directory for each endpoint that contains all of the dump files with which it was loaded. df-mirror-endpoint.sh uses the "sd_name" convention to indicate the graph name into which the corresponding file should be loaded in our triple store. For example, healthdata.tw.rpi.edu/sparql/__PIVOT__/xmlns.com/foaf/0.1/input_112b6d7443852b15aa3153fa41d7ebf3.rdf should be loaded into http://xmlns.com/foaf/0.1.

> find . -name "*.sd_name"
...
./healthdata.tw.rpi.edu/sparql/__PIVOT__/xmlns.com/foaf/0.1/input_112b6d7443852b15aa3153fa41d7ebf3.rdf.sd_name

> cat ./healthdata.tw.rpi.edu/sparql/__PIVOT__/xmlns.com/foaf/0.1/input_112b6d7443852b15aa3153fa41d7ebf3.rdf.sd_name 
http://xmlns.com/foaf/0.1
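
The directory layout in this listing suggests a simple mapping from an endpoint and a graph name to a local directory. The helper below is a guess inferred from this single example, not taken from the source of df-mirror-endpoint.sh; the function name pivot_dir is hypothetical, and graph names containing characters that are awkward in paths may be handled differently by the real script.

```shell
# Hypothetical helper: guess the local directory that holds the dump files
# for <graph> as mirrored from <endpoint>, following the layout seen above.
pivot_dir() {
  endpoint="${1#*://}"   # drop the scheme, e.g. "http://"
  graph="${2#*://}"
  printf '%s/__PIVOT__/%s\n' "$endpoint" "$graph"
}
```

For example, pivot_dir http://healthdata.tw.rpi.edu/sparql http://xmlns.com/foaf/0.1 prints the directory that contains the input_*.rdf file and its .sd_name companion shown above.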

df-load-triple-store.sh follows the ".sd_name" convention, and will walk a local directory structure to load the RDF files into the corresponding graph names. df-load-triple-store.sh will load into any (or, all) of the TDB, Virtuoso, and Sesame triple stores, according to the DATAFAQS environment variables DATAFAQS_PUBLISH_TDB, DATAFAQS_PUBLISH_VIRTUOSO, and DATAFAQS_PUBLISH_SESAME equalling true, respectively.

The following will find all ".sd_name" files, and load the corresponding RDF file into the graph named within the .sd_name file.

> cd healthdata.tw.rpi.edu/sparql/__PIVOT__

> df-vars.sh
DATAFAQS_PUBLISH_SESAME                               true
DATAFAQS_PUBLISH_SESAME_HOME                          /Users/me/utilities/sesame/openrdf-sesame-2.6.10
DATAFAQS_PUBLISH_SESAME_SERVER                        http://localhost:8080/openrdf-sesame
DATAFAQS_PUBLISH_SESAME_REPOSITORY_ID                 spo-balance

> df-load-triple-store.sh --target
[INFO] Will load named graphs into repository spo-balance on http://localhost:8080/openrdf-sesame via /Users/me/utilities/sesame/openrdf-sesame-2.6.10/bin/console.sh

> df-load-triple-store.sh --recursive-by-sd-name
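
What --recursive-by-sd-name does can be approximated by the loop below. It is a sketch, not the script's actual implementation: the function name list_sd_name_loads is hypothetical, and it only prints each RDF file with its target graph instead of loading anything.

```shell
# Hypothetical sketch: for every "*.sd_name" file under a directory, print
# "<rdf file><TAB><graph name>", pairing each dump file with its target graph.
list_sd_name_loads() {
  root="${1:-.}"
  find "$root" -type f -name '*.sd_name' | sort | while IFS= read -r sd_file; do
    rdf_file="${sd_file%.sd_name}"      # the RDF file sits next to its .sd_name file
    graph="$(head -n 1 "$sd_file")"     # the first line holds the graph name
    # Skip entries whose RDF file is missing or whose graph name is empty.
    if [ -f "$rdf_file" ] && [ -n "$graph" ]; then
      printf '%s\t%s\n' "$rdf_file" "$graph"
    fi
  done
}
```

Since pvload.sh accepts a local file and a -ng graph name, each printed pair could then be handed to it, one load per line, if provenance-recording loads are wanted instead of df-load-triple-store.sh.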

Keeping the pvload provenance around

Run export CSV2RDF4LOD_CONVERT_DEBUG_LEVEL=finest and then run pvload.sh. It will leave its temporary files, including the provenance, in your current working directory.

Old design: justifying the Named Graph (and not the now-correct pmlp:Information)

Named graphs that know where they came from ... and who put it there, and when they put it there (OLD results):

PREFIX dcterms:    <http://purl.org/dc/terms/>
PREFIX sd:         <http://www.w3.org/ns/sparql-service-description#>
PREFIX sioc:       <http://rdfs.org/sioc/ns#>
PREFIX pmlj:       <http://inference-web.org/2.0/pml-justification.owl#>
PREFIX hartigprov: <http://purl.org/net/provenance/ns#>
PREFIX conversion: <http://purl.org/twc/vocab/conversion/>
SELECT ?graph ?person ?when
WHERE {
  GRAPH ?graph {
    [] pmlj:hasConclusion [ sd:name ?graph ];
       pmlj:isConsequentOf ?infstep .
    OPTIONAL { ?infstep hartigprov:involvedActor ?user   }
    OPTIONAL { ?infstep dcterms:date             ?when   }
    OPTIONAL { ?user    sioc:account_of          ?person }
  }
} ORDER BY DESC(?when)
