Reusing enhancement parameters for multiple versions or datasets
(Paths are relative to the "conversion cockpit"; see Directory Conventions)
When pulling the conversion trigger, `csv2rdf4lod-automation` looks for enhancement parameters using the following convention:
- Start with the local name of the file to convert (e.g., `my.csv`).
- Append `e1.params.ttl` to the local name of the CSV (e.g., `my.csv.e1.params.ttl`).
- Look for the enhancement parameters relative to the `manual/` directory (e.g., `manual/my.csv.e1.params.ttl`).
So, for example, if the CSV is `source/my.csv`, the enhancement parameters will be at `manual/my.csv.e1.params.ttl`. Although the enhancement parameters [are created automatically](Generating enhancement parameters), they are placed in `manual/` because they will be edited by a human (see Conversion process phase: retrieve for the distinction among `source/`, `manual/`, `automatic/`, and `publish/`). Subsequently, additional layers can be created by [generating](Generating enhancement parameters) and tweaking additional enhancement parameters, e.g., `manual/my.csv.e2.params.ttl`, `manual/my.csv.e3.params.ttl`, and so on.
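The naming convention can be sketched as a small shell function. This is only an illustration: `params_path` is a hypothetical helper, not part of csv2rdf4lod-automation, and it assumes the layer number is passed as an optional second argument.

```shell
# Hypothetical helper (not part of csv2rdf4lod-automation): derive the
# path of the enhancement parameters for a CSV and an enhancement layer.
params_path() {
  local csv="$1"          # e.g. source/my.csv
  local layer="${2:-1}"   # enhancement layer number; defaults to 1 (e1)
  local name
  name="$(basename "$csv")"   # local name of the file, e.g. my.csv
  echo "manual/${name}.e${layer}.params.ttl"
}

params_path source/my.csv      # prints manual/my.csv.e1.params.ttl
params_path source/my.csv 2    # prints manual/my.csv.e2.params.ttl
```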
Although having one enhancement file for each CSV clearly establishes which parameters will be (and were) used, it becomes tedious to maintain many enhancement files when many CSVs share a similar structure. Thus, there are three ways to apply enhancement parameters other than the one `manual/*.e1.params.ttl` file that is used by default:
- Create file-dependent global enhancement parameters: `../my.csv.e1.params.ttl`
- Create file-independent global enhancement parameters: `../e1.params.ttl`
- Pull the common enhancement parameters into a separate file and use conversion:includes.
If yesterday's version of a dataset has two files:
../yesterday/source/A.csv
../yesterday/source/B.csv
and today's version has similarly-structured files:
source/A.csv
source/B.csv
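we don't want to spend time enhancing a structure that we have already enhanced. Instead, we can link yesterday's enhancement parameters into the directory above the conversion cockpit, where they act as file-dependent global enhancement parameters:
ln ../yesterday/manual/A.csv.e1.params.ttl ../A.csv.e1.params.ttl
ln ../yesterday/manual/B.csv.e1.params.ttl ../B.csv.e1.params.ttl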
With these file-dependent global enhancement parameters in place, the automation will use them instead of those in `manual/` (`manual/A.csv.e1.params.ttl` and `manual/B.csv.e1.params.ttl`), making a cached copy at `manual/A.csv.e1.global.params.ttl` and `manual/B.csv.e1.global.params.ttl`.
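When a version has many files, linking them one at a time gets repetitive. The following is a minimal sketch of a loop that does it in bulk; `link_global_params` is a hypothetical helper (not a csv2rdf4lod-automation command), and it assumes the previous version keeps its parameters in its own `manual/` directory under the usual `*.e1.params.ttl` names.

```shell
# Hypothetical helper: link every e1 parameter file from a previous
# version's manual/ directory into a target directory, where the files
# act as file-dependent global enhancement parameters.
link_global_params() {
  local prev_manual="$1"   # e.g. ../yesterday/manual
  local target="$2"        # e.g. .. (the directory above the cockpit)
  local p dest
  for p in "$prev_manual"/*.e1.params.ttl; do
    [ -e "$p" ] || continue              # the glob matched nothing
    dest="$target/$(basename "$p")"
    [ -e "$dest" ] || ln "$p" "$dest"    # leave existing global files alone
  done
}

# Usage from within today's conversion cockpit:
#   link_global_params ../yesterday/manual ..
```

The glob `*.e1.params.ttl` does not match the cached `*.e1.global.params.ttl` copies, so only the hand-edited parameters are linked.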
When constructing the `manual/*.global.params.ttl`, the value of `conversion:version_identifier` in the file-dependent global enhancement parameters is replaced with the appropriate value for the current version.
Often, the filenames provided for subsequent dataset versions vary. When they do, they often include a date or version directly in the file name. For example, when we created version `2013-Aug-02` for the received file `Sample_PoC_Schema_dataSet_08_02.xlsx`, the file that we got for the next version `2013-Aug-10` was named `Sample_PoC_Schema_DataSet_0810.xlsx`. In these cases, the filenames will not align when looking for the corresponding enhancement parameters. This can be fixed by using discriminator-based global enhancement parameters, which we describe with an example in this section.
version/2013-Aug-10$ cr-create-conversion-trigger.sh -w manual/*.csv
creates a conversion trigger that focuses on converting the files:
bash-3.2$ grep datafile= convert-wcl.sh
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset_Endorsements.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Asset_Types.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Certifications.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Course_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Courses.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Group_Member.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Group_Owner.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Group_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Groups.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Sheet3.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Skill_Endorsements.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Skill_Types.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Skills.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_Technology.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Follower.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Following.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Learning.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Profile.csv"
datafile="Sample_PoC_Schema_DataSet_0810.xls_User_Skills.csv"
But the enhancement parameters from the previous version will not match, because the strings "0810" and "08_02" in the filenames differ. So, we can't use the file-dependent global enhancement parameters (described in the previous section), and we can't use the file-independent global enhancement parameters (described in the next section). Instead, we'll name the global enhancement parameters according to their discriminators.
bash-3.2$ ls ../2013-Aug-02/manual/*.e1.params.ttl | less
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Asset.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Asset_Skills.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Asset_Types.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Certifications.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Course_Skills.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Courses.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Group_Member.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Group_Owner.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Groups.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Skill_Types.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_Skills.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Follower.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Following.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Learning.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Profile.csv.e1.params.ttl
../2013-Aug-02/manual/Sample_PoC_Schema_dataSet_08_02.xls_User_Skills.csv.e1.params.ttl
If all files in all versions share the same structure (and thus the same enhancements), a single set of file-independent global enhancement parameters can be applied to all of them:
ln ../yesterday/manual/A.csv.e1.params.ttl ../e1.params.ttl
With these file-independent global enhancement parameters in place, the automation will use them instead of those in `manual/` (`manual/A.csv.e1.params.ttl` and `manual/B.csv.e1.params.ttl`), making a cached copy at `manual/A.csv.e1.global.params.ttl` and `manual/B.csv.e1.global.params.ttl`.
When constructing the `manual/*.global.params.ttl`, the values of `conversion:version_identifier` and `conversion:subject_discriminator` in the file-independent global enhancement parameters are replaced with the appropriate values for the current version and file.
Pull the common enhancement parameters into a separate file and use conversion:includes.
When attempting to convert `source/foo.csv`, `csv2rdf4lod-automation` looks for:
manual/foo.csv.e1.params.ttl
if not there, it looks for:
../foo.csv.e1.params.ttl
if not there, it looks for:
../e1.params.ttl
(for the actual implementation, see $CSV2RDF4LOD_HOME/bin/convert.sh)
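The lookup order above can be sketched as a shell function that returns the first candidate that exists. This is an illustration only, not the code in `convert.sh`; `find_params` is a hypothetical name, and the sketch assumes it is run from within the conversion cockpit.

```shell
# Hypothetical sketch of the lookup order; the real logic lives in
# $CSV2RDF4LOD_HOME/bin/convert.sh and may differ in detail.
find_params() {
  local name candidate
  name="$(basename "$1")"   # e.g. source/foo.csv -> foo.csv
  for candidate in \
      "manual/${name}.e1.params.ttl" \
      "../${name}.e1.params.ttl" \
      "../e1.params.ttl"; do
    if [ -e "$candidate" ]; then
      echo "$candidate"     # first match wins
      return 0
    fi
  done
  return 1                  # no enhancement parameters found
}

# Usage from within the conversion cockpit:
#   find_params source/foo.csv
```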