
Reusing enhancement parameters for multiple versions or datasets

Tim L edited this page Aug 12, 2013 · 52 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What's first?

Finding an input file's enhancement parameters

(Paths are relative to the "conversion cockpit"; see Directory Conventions)

When pulling the conversion trigger, csv2rdf4lod-automation looks for enhancement parameters using the following convention:

  • Start with the local name of the file to convert (e.g., my.csv).
  • Append e1.params.ttl to the local name of the CSV (e.g., my.csv.e1.params.ttl).
  • Look for the enhancement parameters relative to the manual/ directory (e.g., manual/my.csv.e1.params.ttl).

So, for example, if the CSV is source/my.csv, the enhancement parameters will be at manual/my.csv.e1.params.ttl. Although the enhancement parameters [are created automatically](Generating enhancement parameters), they are placed in manual/ because they will be edited by a human. (See Conversion process phase: retrieve for the distinction among source/, manual/, automatic/, and publish/.) Additional layers can then be created by [generating](Generating enhancement parameters) and tweaking further enhancement parameters, e.g., manual/my.csv.e2.params.ttl, manual/my.csv.e3.params.ttl, and so on.
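The naming convention above can be sketched as a small shell function. This is only an illustration of the convention, not the automation's own code; the function name params_path_for and its optional layer argument are assumptions made for this example.

```bash
#!/usr/bin/env bash
# Sketch of the naming convention only; params_path_for is a hypothetical helper.
params_path_for() {
  local input="$1"        # file to convert, e.g. source/my.csv
  local layer="${2:-1}"   # enhancement layer number: 1, 2, 3, ...
  local name
  name="$(basename "$input")"                 # local name of the file: my.csv
  echo "manual/${name}.e${layer}.params.ttl"  # looked up relative to manual/
}

params_path_for source/my.csv     # prints manual/my.csv.e1.params.ttl
params_path_for source/my.csv 2   # prints manual/my.csv.e2.params.ttl
```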

Reusing enhancement parameters

Although a "one enhancement file for each CSV" convention clearly establishes which parameters will be (and were) used, it becomes tedious to maintain many enhancement files when many CSVs share a similar structure. Thus, there are three ways to apply enhancement parameters other than the one manual/*.e1.params.ttl file that is used by default.

  • Create file-dependent global enhancement parameters ../my.csv.e1.params.ttl
  • Create file-independent global enhancement parameters ../e1.params.ttl
  • Pull the common enhancement parameters into a separate file and use conversion:includes.

Creating file-dependent global enhancement parameters ../my.csv.e1.params.ttl

If yesterday's version of a dataset has two files:

../yesterday/source/A.csv
../yesterday/source/B.csv

and today's version has similarly-structured files:

source/A.csv
source/B.csv

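we don't want to spend time enhancing a structure that we have already enhanced. Instead, we link yesterday's enhancement parameters into the directory above the conversion cockpit:

ln ../yesterday/manual/A.csv.e1.params.ttl ../A.csv.e1.params.ttl
ln ../yesterday/manual/B.csv.e1.params.ttl ../B.csv.e1.params.ttl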

With these file-dependent global enhancement parameters in place, the automation will use them instead of manual/A.csv.e1.params.ttl and manual/B.csv.e1.params.ttl, making cached copies at manual/A.csv.e1.global.params.ttl and manual/B.csv.e1.global.params.ttl.

When constructing the manual/*.global.params.ttl, the value of conversion:version_identifier in the file-dependent global enhancement parameters is replaced with the appropriate value for the current version.
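The substitution described above could look roughly like the following. This is a hypothetical sketch, not the automation's implementation: the function name set_version_identifier is invented here, and the sketch assumes the value is a double-quoted literal on the same line as conversion:version_identifier and that the new version string contains no "/" or "&" characters.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print a params file with conversion:version_identifier rewritten.
# Assumes a double-quoted value on the same line as the property.
set_version_identifier() {
  local global_params="$1"   # e.g. ../A.csv.e1.params.ttl
  local version="$2"         # e.g. 2013-Aug-10
  sed -E "s/(conversion:version_identifier[[:space:]]+)\"[^\"]*\"/\1\"${version}\"/" \
    "$global_params"
}

# Example use from the conversion cockpit (paths follow the convention above):
# set_version_identifier ../A.csv.e1.params.ttl 2013-Aug-10 > manual/A.csv.e1.global.params.ttl
```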

Creating discriminator-based global enhancement parameters ../user-profile.e1.params.ttl

Many times, the filenames provided for subsequent dataset versions vary. When they do, they often include a date or version directly in the file name. For example, when we created version 2013-Aug-02 for the received file Sample_PoC_Schema_dataSet_08_02.xlsx, the file that we got for the next version 2013-Aug-10 was named Sample_PoC_Schema_DataSet_0810.xlsx. In these cases, the filenames will not align when looking for the corresponding enhancement parameters. This can be fixed by using discriminator-based global enhancement parameters, which we describe with an example in this section.

version/2013-Aug-10$ cr-create-conversion-trigger.sh -w manual/*.csv

The resulting conversion trigger, convert-wcl.sh, focuses on converting the files:

bash-3.2$ grep datafile= convert-wcl.sh 
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset_Endorsements.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset_Types.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Certifications.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Course_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Courses.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Group_Member.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Group_Owner.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Group_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Groups.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Sheet3.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Skill_Endorsements.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Skill_Types.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Technology.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Follower.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Following.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Learning.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Profile.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Skills.csv"

However, the enhancement parameters from the previous version will not match, because the file names differ in the strings "0810" and "08_02" (and in the capitalization of "DataSet"). So, we can't use the file-dependent global enhancement parameters (described in the previous section), and we can't use the file-independent global enhancement parameters (described in the next section). Instead, we'll name the global enhancement parameters according to their discriminators. The previous version's enhancement parameters are:

bash-3.2$ ls ../2013-Aug-02/manual/*.e1.params.ttl | less
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Asset.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Asset_Skills.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Asset_Types.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Certifications.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Course_Skills.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Courses.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Group_Member.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Group_Owner.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Groups.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Skill_Types.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Skills.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Follower.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Following.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Learning.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Profile.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Skills.csv.e1.params.ttl
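To illustrate why a discriminator can line up files whose names differ between versions, here is one hypothetical way to reduce both versions' file names to the same string. The rule used (drop everything through ".xls_", drop ".csv", lowercase, turn underscores into hyphens) is an assumption chosen to match the user-profile example in this section's heading; the automation's actual rule may differ, and discriminator_for is an invented name.

```bash
#!/usr/bin/env bash
# Hypothetical normalization; the automation's real discriminator rule may differ.
discriminator_for() {
  local name
  name="$(basename "$1" .csv)"      # drop any directory and the .csv suffix
  name="${name##*.xls_}"            # drop everything through ".xls_", leaving the sheet name
  echo "$name" | tr 'A-Z_' 'a-z-'   # lowercase; underscores become hyphens
}

# Both versions' file names reduce to the same discriminator:
discriminator_for Sample_PoC_Schema_DataSet_0810.xls_User_Profile.csv    # prints user-profile
discriminator_for Sample_PoC_Schema_dataSet_08_02.xls_User_Profile.csv   # prints user-profile
```

Under that assumption, the shared parameters for this sheet would live at ../user-profile.e1.params.ttl.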

Creating file-independent global enhancement parameters ../e1.params.ttl

If all files in all versions share the same structure (and thus the same enhancements), file-independent global enhancement parameters can be applied to all of them:

ln ../yesterday/manual/A.csv.e1.params.ttl ../e1.params.ttl

With these file-independent global enhancement parameters in place, the automation will use them instead of manual/A.csv.e1.params.ttl and manual/B.csv.e1.params.ttl, making cached copies at manual/A.csv.e1.global.params.ttl and manual/B.csv.e1.global.params.ttl.

When constructing the manual/*.global.params.ttl, the values of conversion:version_identifier and conversion:subject_discriminator in the file-independent global enhancement parameters are replaced with the appropriate values for the current version and file.

Pull the common enhancement parameters into a separate file and use conversion:includes.

conversion:includes

Precedence order

When attempting to convert source/foo.csv, csv2rdf4lod-automation looks for:

manual/foo.csv.e1.params.ttl

If it is not there, it looks for:

../foo.csv.e1.params.ttl

If that is not there either, it looks for:

../e1.params.ttl

(for the actual implementation, see $CSV2RDF4LOD_HOME/bin/convert.sh)
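The lookup order listed above can be sketched as follows. This is a simplified illustration and not a copy of convert.sh; the function name find_params is an assumption, and the sketch covers only the first enhancement layer (e1). It is meant to be run from the conversion cockpit.

```bash
#!/usr/bin/env bash
# Sketch of the lookup order above; find_params is a hypothetical name.
# Prints the first candidate that exists, or returns non-zero if none does.
find_params() {
  local name candidate
  name="$(basename "$1")"   # e.g. source/foo.csv -> foo.csv
  for candidate in \
      "manual/${name}.e1.params.ttl" \
      "../${name}.e1.params.ttl" \
      "../e1.params.ttl"; do
    if [ -e "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1   # no enhancement parameters found
}

# Example: find_params source/foo.csv
```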
