Executive summary
csv2rdf4lod is a conversion tool that encodes tabular, spreadsheet-like data into well-structured and highly connected Resource Description Framework (RDF). RDF can integrate across datasets using explicit connections, and the basis for these connections is the Uniform Resource Identifier (URI): if the same URI is mentioned in the resulting conversions, then the datasets are connected. csv2rdf4lod uses an RDFS-like conversion vocabulary, which is itself RDF. Its enhancement parameters describe the columns of a table and how the resulting RDF predicates should be formed during conversion. An important aspect of integrating data with RDF is reusing existing vocabularies and ontologies, and csv2rdf4lod allows this reuse within its enhancement parameters. When common terms are not provided, the default vocabulary is contextualized by a scope established by the source organization (usually its DNS name), an identifier for the dataset it provides, and the version of the dataset. All three of these aspects are given identifiers and are used to construct the default URIs for the vocabulary and instances created during conversion.
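As a minimal sketch of how such default URIs might be assembled from the three identifiers, consider the following. The base URI, path pattern, and example identifiers are illustrative assumptions, not csv2rdf4lod's exact output:

```python
# Sketch of the default URI design: the "essential three" identifiers
# (source, dataset, version) scope every URI minted during conversion.
# The base URI and path pattern here are assumptions for illustration.

BASE = "http://example.org"  # hypothetical base; a deployment would use its own


def dataset_uri(source_id: str, dataset_id: str, version_id: str) -> str:
    """URI naming a specific version of a dataset from a specific source."""
    return f"{BASE}/source/{source_id}/dataset/{dataset_id}/version/{version_id}"


def vocab_uri(source_id: str, dataset_id: str, term: str) -> str:
    """Default vocabulary URI for a class or predicate, scoped to the dataset."""
    return f"{BASE}/source/{source_id}/dataset/{dataset_id}/vocab/{term}"


# Two conversions that mention the same URI are thereby connected.
print(dataset_uri("epa-gov", "fuel-economy", "2011-Mar-03"))
print(vocab_uri("epa-gov", "fuel-economy", "city_mpg"))
```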
csv2rdf4lod was designed for third parties who consume data that they may not fully understand when they first work with it. The URI design therefore incorporates a "predicate layering" scheme that allows multiple interpretations of the same retrieved data to augment prior interpretations without breaking backward compatibility (in cases where consumers have already begun to build applications around the initial conversion).
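A minimal sketch of the predicate layering idea follows. The verbatim (raw) conversion mints one predicate per column, and each later interpretation adds predicates in its own layer instead of replacing the raw ones, so applications built on the raw layer keep working. The path segments and names below are illustrative assumptions rather than csv2rdf4lod's exact output:

```python
# Hypothetical vocabulary namespace for one dataset (see the URI sketch above).
VOCAB = "http://example.org/source/epa-gov/dataset/fuel-economy/vocab"


def raw_predicate(column_header: str) -> str:
    """Predicate minted by the verbatim (raw) interpretation of a column."""
    return f"{VOCAB}/raw/{column_header}"


def enhanced_predicate(term: str, layer: int = 1) -> str:
    """Predicate minted by a later, curated interpretation layer."""
    return f"{VOCAB}/enhancement/{layer}/{term}"


print(raw_predicate("city_mpg"))      # available from the first conversion
print(enhanced_predicate("cityMPG"))  # added later without removing the above
```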
Another consideration for a third-party integrator of first-party data is the degree to which consumers of the integrated data will trust the third party. Although the data are integrated and likely to be easier to discover, access, understand, query, and use, they are no longer in the original, authoritative form provided by the first parties. This introduces a risk that the third-party integrator has made a mistake or intentionally changed the original content. To minimize these concerns, csv2rdf4lod incorporates provenance descriptions into all of the essential steps of its conversion process: retrieval, manual manipulation, conversion, publishing dump files, and even loading into triple stores. Many important aspects of these steps are captured using the Proof Markup Language.
Despite the variety of sophisticated methods available to model, store, query, and display information, a tabular structure (e.g. Comma-Separated Values) is regularly used as a "lowest common denominator" to either 1) get a project started in the first place, or 2) transfer data among systems already in place. When starting a small project, a spreadsheet often satisfies many if not all information modeling needs, while more complex information systems provide tabular exports devoid of an explicit representation of what the data values meant within the system.
In both cases, a third party data consumer is left with an interpretation problem -- what do these values mean?
Our methodology for converting tabular literals to the Resource Description Framework (RDF) enables answers to novel questions by establishing explicit connections among previously disconnected datasets. We provide a simple, minimal-effort entry path to "just getting it done", so that applications can be developed rapidly without concern for a wealth of information design considerations that our system handles without user effort. In addition to providing a "quick and easy" way to get RDF, the information design supports backward-compatible, iterative improvements of the data as time and needs permit. The same design also handles subsequent versions of datasets as their source organizations continue to augment, correct, and re-release their offerings.
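As a toy illustration of these explicit connections, the sketch below merges two hypothetical conversions that happen to mention the same entity URI, using the rdflib library. All URIs and values are made up for illustration and are not output of an actual csv2rdf4lod run:

```python
from rdflib import Graph, Literal, URIRef

# A shared entity URI mentioned by both conversions (hypothetical).
county = URIRef("http://example.org/id/county/essex")

# Conversion of dataset A (e.g. a population table).
a = Graph()
a.add((county,
       URIRef("http://example.org/source/agency-a/dataset/population/vocab/population"),
       Literal(785000)))

# Conversion of dataset B (e.g. an air-quality table).
b = Graph()
b.add((county,
       URIRef("http://example.org/source/agency-b/dataset/air-quality/vocab/pm25"),
       Literal(9.4)))

# Merging the graphs connects the datasets through the shared URI,
# so a single lookup now spans both sources.
merged = a + b
for predicate, value in merged.predicate_objects(county):
    print(predicate, value)
```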
The benefit of our initial lightweight organizational structure becomes invaluable for managing the number and heterogeneity of the datasets that one may accumulate. To incorporate an additional dataset, the only information required from a human curator is three local identifiers: the source from which the data was obtained, the dataset by which the source refers to its data, and the version of the concrete artifacts physically obtained from the source.
(todo :-)
- Provenance-inspired naming of datasets and the entities they describe (using "the essential three": source, dataset, and version).
- Minimal effort to obtain initial RDF from tabular formats. Get what you need and quickly move on to the rest of your application.
- Declarative interpretation parameters control resulting RDF structure.
- Parallels RDFS and OWL axioms, but applies to tabular literals instead of existing RDF.
- Provides backward-compatible enhancements to the initial verbatim RDF interpretation (using a layered predicate design).
- Leverages previous enhancement parameters via an include mechanism.
- Leverages RDF output of previous conversions as enhancement parameters for subsequent conversions.
- Abbreviated description of resulting structure (no need to dig into custom code).
- Uniform treatment and results when applied across datasets.
- No immediate need to worry about how to name resources (cf. Krextor).
- No immediate concern for where to mint vocabulary classes and predicates (sensible defaults) (cf. Krextor).
- Clean CURIE handling (slightly easier-to-read RDF) (cf. Krextor).
- Correctly oriented paradigm: look forward and tweak the end result instead of looking back and picking pieces out; everything gets through by default (cf. Krextor).
- Number of triples in the verbatim interpretation parameters vs. number of triples in the enhanced interpretation parameters.
- Number of triples in the verbatim interpretation vs. number of triples in the enhanced interpretation.
- Percentage increase in parameter triples from verbatim to enhanced, compared to the percentage increase in output triples from verbatim to enhanced.
- Vocabulary reuse distribution in the verbatim interpretation vs. in the enhanced interpretation.
- Vocabulary "depth": dataset-scoped vocabulary is low; FOAF is high.
- Connectivity to other datasets via shared entities, owl:sameAs, and common predicates/classes.
- Histogram at conversion:num_invocation_logs.