Krextor

Tim L edited this page May 1, 2015 · 65 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

(Sidenote: Krextor is a shining example of using XSLTdoc to document XSLT code. I was able to get up and running with XSLTdoc on my own code in about 20 minutes, and now all I want to do is go document my code! But I'll stick to documenting how to use Christoph's Krextor instead.)

I've been writing XSL for almost a decade and have written plenty of it to transform XML to RDF. But every time I do, it feels like a repetitive, from-scratch endeavor that is virtually devoid of excitement and reward. Compared to the way that I quickly and concisely transform tabular data to RDF (with conversion:Enhancement), there is clearly something missing in the XML case. In the tabular case, I find and express the patterns I see and let the converter handle the drudgery. In (my) XML case, I have to act upon any patterns I see by writing the code myself. I'm hoping that Christoph has found the XML magic that I found for tabular (but couldn't for XML!).

My first input

EEEEEEEEEEAK!

Look at this bad idea that escaped into the world:

<EMSDataSet xsi:schemaLocation="http://www.nemsis.org http://www.nemsis.org/media/XSD/EMSDataSet.xsd" 
            xmlns="http://www.nemsis.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <Header>

...

         <E05>
            <E05_02 xsi:nil="true"/>
            <E05_03>2008-10-30T00:05:51.0Z</E05_03>
            <E05_04>2008-10-30T00:06:04.0Z</E05_04>
            <E05_05>2008-10-30T00:06:57.0Z</E05_05>
            <E05_06>2008-10-30T00:15:43.0Z</E05_06>
            <E05_07>2008-10-30T00:20:05.0Z</E05_07>
            <E05_09>2008-10-30T00:40:56.0Z</E05_09>
            <E05_10>2008-10-30T00:49:11.0Z</E05_10>
            <E05_11>2008-10-30T01:05:24.0Z</E05_11>
            <E05_13 xsi:nil="true"/>
         </E05>
         <E06>
            <E06_01_0>
               <E06_01>WASKI</E06_01>
               <E06_02>LAURA</E06_02>
            </E06_01_0>
            <E06_04_0>
               <E06_04>9473 ROSA L PARKS AVENUE</E06_04>
               <E06_05>51000</E06_05>
               <E06_07>01</E06_07>
               <E06_08>36105</E06_08>
            </E06_04_0>
            <E06_06>01101</E06_06>
            <E06_10>424333300</E06_10>
            <E06_11>655</E06_11>
            <E06_12>670</E06_12>
            <E06_13>695</E06_13>
            <E06_14_0>
               <E06_14>69</E06_14>
               <E06_15>715</E06_15>
            </E06_14_0>
            <E06_16>1936-10-27</E06_16>
            <E06_17>3342539663</E06_17>
         </E06>

...

      </Record>
   </Header>
</EMSDataSet>

Not only can RDF help this situation, but so could another flavor of XML! (if you know what I MEEEEE06_17AN) Regardless, we need the linkable URI instance naming and OWL ontology that RDF gives us. So let's get started!

Getting started

http://trac.kwarc.info/krextor/ seems to provide the best overview for what Krextor provides.

  • Grab the svn
mkdir -p ~/utilities/krextor/svn
cd ~/utilities/krextor/svn
svn co https://svn.kwarc.info/repos/swim/projects/krextor/trunk krextor
  • How would I run it?

ShellScript, JavaWrapper, RunViaJAXP

bash-3.2$ krextor
Syntax: krextor IN..OUT FILE
Extracts RDF from the XML document FILE.  IN specifies the format of FILE; OUT
specifies the desired RDF serialization.

  -h, --help	Show this help

Looks like OUT can be rxr, ntriples, turtle, rdf-xml, rdfa, or YOUR OWN. Turtle is cool with me...

Looks like IN can be omdoc, ocd, xhtml-rdfa, or hcalendar -- none of which I care about transforming to RDF. Looks like I can establish my own IN identifier...

There's an example that uses Milhouse and Bart Simpson! It looks like Krextor's interface works in two parts: we provide a bucket of our own templates (called an extraction module) in the krextor:main mode, and Krextor provides a bucket of XSL templates that we can call (krextor:create-resource, krextor:add-uri-property, krextor:add-literal-property) to assert triples.

http://trac.kwarc.info/krextor/browser/trunk/src/xslt/extract shows all of the Extraction Modules that come with Krextor.

Where do I associate the IN id with my extraction module? Based on http://trac.kwarc.info/krextor/browser/trunk/src/xslt/extract, I'm starting to think IN corresponds directly to the file name. Do I have to put the extraction module into krextor's utilities/krextor/svn/krextor/src/xslt/extract? (Yes.) That's not how I'd like to organize my extraction modules, because mine will be highly contextualized, but I'll go with it for now. Perhaps I can bypass krextor.sh and invoke saxon.jar myself.
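To make that guess concrete, here is a small shell sketch of how an IN..OUT argument might map to a stylesheet. It assumes only what the extract directory suggests (IN is the file name under src/xslt/extract, minus .xsl). split_spec and module_path are hypothetical names, not part of krextor.sh, and the KREXTOR_HOME default is just the checkout path used on this page.

```shell
# Assumption: the checkout location used on this page; override as needed.
KREXTOR_HOME="${KREXTOR_HOME:-$HOME/utilities/krextor/svn/krextor}"

# Hypothetical helper: split "IN..OUT" into "IN OUT".
split_spec() {
   local spec="$1"
   case "$spec" in
      *..*) ;;
      *) echo "expected IN..OUT, got: $spec" >&2; return 1 ;;
   esac
   local in="${spec%%..*}"   # everything before the '..' separator
   local out="${spec#*..}"   # everything after the '..' separator
   echo "$in $out"
}

# Hypothetical helper: guess which stylesheet an IN..OUT spec selects.
module_path() {
   local parts
   parts="$(split_spec "$1")" || return 1
   echo "$KREXTOR_HOME/src/xslt/extract/${parts%% *}.xsl"
}
```

With this, module_path nemsis-v2.2.1..turtle prints the path where a module named nemsis-v2.2.1.xsl would have to live under the extract directory; the dots inside the version number are left alone because only the double dot is treated as the separator.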

Going with the Simpsons example

mkdir -p ~/utilities/krextor/simpsons-eg
cd ~/utilities/krextor/simpsons-eg

Look at the input:

bash-3.2$ cat milhouse.xml 
<person friends="http://van-houten.name/milhouse">
  <name>Bart Simpson</name>
</person>

Look at the extraction module:

bash-3.2$ cat krextor-extraction-module-for-social-network-xml.xsl 
<!DOCTYPE rdf:RDF [
<!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#">
<!ENTITY dc "http://purl.org/dc/elements/1.1/">
<!ENTITY foaf "http://xmlns.com/foaf/0.1/">
]>

<xsl:transform version="2.0" 
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:krextor="http://kwarc.info/projects/krextor"
                      exclude-result-prefixes="">

<xsl:template match="person" mode="krextor:main">
  <xsl:call-template name="krextor:create-resource">
    <xsl:with-param name="type" select="'&foaf;Person'"/>
  </xsl:call-template>
</xsl:template>

<xsl:template match="person/@friends" mode="krextor:main">
  <xsl:call-template name="krextor:add-uri-property">
    <xsl:with-param name="property" select="'&foaf;knows'"/>
  </xsl:call-template>
</xsl:template>

<xsl:template match="person/name" mode="krextor:main">
  <xsl:call-template name="krextor:add-literal-property">
    <xsl:with-param name="property" select="'&foaf;name'"/>
  </xsl:call-template>
</xsl:template>

</xsl:transform>

Get "my" extraction module to where krextor.sh can see it:

bash-3.2$ cp krextor-extraction-module-for-social-network-xml.xsl ~/utilities/krextor/svn/krextor/src/xslt/extract
bash-3.2$ l ~/utilities/krextor/svn/krextor/src/xslt/extract
total 216
-rw-r--r--   1 lebot  staff   1247 Mar  3 09:29 krextor-extraction-module-for-social-network-xml.xsl
drwxr-xr-x  10 lebot  staff    340 Feb  2 14:07 util
-rw-r--r--   1 lebot  staff   3113 Feb  2 14:07 hcalendar.xsl
-rw-r--r--   1 lebot  staff  15770 Feb  2 14:07 ocd.xsl
-rw-r--r--   1 lebot  staff  24833 Feb  2 14:07 omdoc-owl.xsl
-rw-r--r--   1 lebot  staff  30614 Feb  2 14:07 omdoc.xsl
-rw-r--r--   1 lebot  staff   2070 Feb  2 14:07 test.xsl
-rw-r--r--   1 lebot  staff   7664 Feb  2 14:07 xhtml-rdfa.xsl
-rw-r--r--   1 lebot  staff   4065 Feb  2 14:07 xmath.xsl
-rw-r--r--   1 lebot  staff   4214 Feb  2 14:07 xml.xsl

Good to note krextor's namespace:

xmlns:krextor="http://kwarc.info/projects/krextor"

Triples!

bash-3.2$ krextor krextor-extraction-module-for-social-network-xml..turtle milhouse.xml 
<file:/Users/me/utilities/krextor/simpsons-eg/milhouse.xml>
	a	<http://xmlns.com/foaf/0.1/Person> ;
	<http://xmlns.com/foaf/0.1/knows>	<http://van-houten.name/milhouse> ;
	<http://xmlns.com/foaf/0.1/name>	"Bart Simpson" .

Given all of what just happened, I'm still not ready for http://trac.kwarc.info/krextor/wiki/YourOwnExtraction#GettingStarted...

  • Storing extraction templates outside of svn/krextor/src/xslt/extract

A temporary fix until I can poke into krextor.sh; I need to keep my extraction modules organized elsewhere.

rm ~/utilities/krextor/svn/krextor/src/xslt/extract/krextor-extraction-module-for-social-network-xml.xsl
cd ~/utilities/krextor/simpsons-eg
mv krextor-extraction-module-for-social-network-xml.xsl krextor-extraction-module-for-social-network-xml.krx
ln -s `pwd`/krextor-extraction-module-for-social-network-xml.krx \
     ~/utilities/krextor/svn/krextor/src/xslt/extract/krextor-extraction-module-for-social-network-xml.xsl

NOTE: I'm going to use a .krx extension (in front of .xsl, or in place of it) so I know the file is a Krextor extraction module. Once a stylesheet lives outside of krextor/src/xslt/extract/, nothing else makes it easy to recognize as one.
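The mv/ln steps above can be wrapped up as a small function. This is only a sketch of the workaround described here; link_krx is a hypothetical name, and the default extract directory is the checkout path used on this page.

```shell
# Hypothetical helper: expose a .krx module kept elsewhere to krextor by
# symlinking it into the extract directory under a .xsl name.
link_krx() {
   local krx="$1"
   local extract_dir="${2:-$HOME/utilities/krextor/svn/krextor/src/xslt/extract}"
   [ -f "$krx" ] || { echo "no such module: $krx" >&2; return 1; }
   local dir name
   dir="$(cd "$(dirname "$krx")" && pwd)"   # absolute directory, so the link works from anywhere
   name="$(basename "$krx" .krx)"           # nemsis-v2.2.1.krx -> nemsis-v2.2.1
   ln -sf "$dir/$name.krx" "$extract_dir/$name.xsl"
   echo "$name"                             # the IN identifier to hand to krextor
}
```

For example, link_krx manual/nemsis-v2.2.1.krx would create the nemsis-v2.2.1.xsl link and print nemsis-v2.2.1, which is the IN half of the IN..OUT argument.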

Back to NEMSIS

Back to the (csv2rdf4lod-automation) conversion cockpit with source/Sample_Output_NEMSIS_XML.xml waiting to become RDF.

vi manual/nemsis-v2.2.1.krx

and pasted in contents from http://trac.kwarc.info/krextor/browser/trunk/src/xslt/extract/xml.xsl

ln -s `pwd`/manual/nemsis-v2.2.1.krx ~/utilities/krextor/svn/krextor/src/xslt/extract/nemsis-v2.2.1.xsl

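Had to tweak the krextor script to get my classpath included, since my module calls Java extension functions. I commented out the java -jar call and swapped in saxon.sh:

#java -jar ${SAXON_JAR:-$KREXTOR_HOME/lib/saxon/saxon9.jar} -s:$infile -xsl:$transformer
saxon.sh $transformer foo bar $infile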
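Why the tweak is needed: java -jar ignores both -cp and the CLASSPATH variable, so classes behind java: extension namespaces cannot be found that way. saxon.sh is not shown on this page, so the following is only a sketch of one way to get the same effect by naming Saxon's main class directly. saxon_with_classpath and EXTRA_CP are hypothetical names, and whether java: extension functions are available at all depends on the Saxon version and edition.

```shell
# Sketch only: run Saxon with an explicit classpath instead of 'java -jar'.
# EXTRA_CP is a hypothetical variable naming the jars or directories that
# hold the Java classes used as extension functions.
saxon_with_classpath() {
   local infile="$1" transformer="$2"
   local saxon_jar="${SAXON_JAR:-$KREXTOR_HOME/lib/saxon/saxon9.jar}"
   # 'java -jar' would discard the classpath, so pass -cp and the main class.
   java -cp "$saxon_jar:${EXTRA_CP:-.}" net.sf.saxon.Transform "-s:$infile" "-xsl:$transformer"
}
```

The -s: and -xsl: options are the same ones used in the commented-out line; only the way the JVM is launched changes.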
Run it:

krextor nemsis-v2.2.1..turtle source/Sample_Output_NEMSIS_XML.xml

The extraction module, manual/nemsis-v2.2.1.krx, so far:

<xsl:transform version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
   xmlns:xs="http://www.w3.org/2001/XMLSchema"

   xmlns:xd="http://www.pnp-software.com/XSLTdoc"

   xmlns:krextor="http://kwarc.info/projects/krextor"
   xmlns:krextor-genuri="http://kwarc.info/projects/krextor/genuri"

   xmlns:ems="http://www.nemsis.org"
   xmlns:rat="java:edu.rpi.tw.data.rdf.utils.pipes.starts.Cat"
   xmlns:eparams="java:edu.rpi.tw.data.csv.impl.DefaultEnrichmentParameters"

   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:dcterms="http://purl.org/dc/terms/"

   exclude-result-prefixes="#all">

<xsl:include href="model_integration/rutil/foaf-ns.xsl"/>
<xsl:include href="model_integration/rutil/dc-ns.xsl"/>

<xd:doc type="stylesheet">
   <xd:short>Extraction module for NEMSIS v2.2.1</xd:short>
   <xd:author>Timothy Lebo</xd:author>
   <xd:copyright></xd:copyright>
   <xd:svnId></xd:svnId>
</xd:doc>

<xd:doc>Path to RDF encoding of enhancement parameters.</xd:doc>
<xsl:param name="eparams-ttl" select="'ems-nemsis/version/2011-Mar-01/manual/NEMSIS_Data_Elements_Definitions_v2.2.1.xls.csv.e1.params.ttl'"/>

<xd:doc>Java object representing the RDF encoding of enhancement parameters.</xd:doc>
<xsl:variable name="eParamsRep" select="rat:load($eparams-ttl)"/>

<xd:doc>Java object that calculates namespaces.</xd:doc>
<xsl:variable name="eParams" select="eparams:new($eParamsRep)"/>

<!-- Note that this is not the global default; actually the 
   concrete way of URI generation is decided on element level -->
<!--
<param name="autogenerate-fragment-uris" select="'pseudo-xpath', 'generate-id'"/>
-->
<xsl:param name="autogenerate-fragment-uris" select="'generate-id'"/>

<xsl:strip-space elements="*"/>

<xsl:template match="ems:E04" mode="krextor:main">
   <xsl:call-template name="krextor:create-resource">
      <xsl:with-param name="subject" select="concat(eparams:getURIOfVersionedDataset($eParams),
                                                    '/typed/crew-member/',ems:E04_01)"/>
      <xsl:with-param name="type"    select="($foaf:Person, $foaf:Agent)"/>
      <xsl:with-param name="properties">
         <krextor:property uri="{$dcterms:isReferencedBy}" object="{eparams:getURIOfVersionedDataset($eParams)}"/>
      </xsl:with-param>
   </xsl:call-template>
</xsl:template>

<xsl:template match="ems:E04_02" mode="krextor:main">
   <xsl:call-template name="krextor:add-uri-property">
      <xsl:with-param name="property" select="$foaf:firstName"/>
   </xsl:call-template>
</xsl:template>

<xsl:template match="ems:E06" mode="krextor:main">
   <xsl:variable name="count">
      <xsl:number level="any" count="ems:E06"/>
   </xsl:variable>
   <xsl:call-template name="krextor:create-resource">
      <xsl:with-param name="subject" select="concat(eparams:getURIOfVersionedDataset($eParams),
                                                    '/typed/person/',$count)"/>
      <xsl:with-param name="type"    select="($foaf:Person, $foaf:Agent)"/>
      <xsl:with-param name="properties">
         <krextor:property uri="{$dcterms:isReferencedBy}" object="{eparams:getURIOfVersionedDataset($eParams)}"/>
      </xsl:with-param>
   </xsl:call-template>
</xsl:template>

<xsl:template match="ems:E06_02" mode="krextor:main">
   <xsl:call-template name="krextor:add-literal-property">
      <xsl:with-param name="property" select="$foaf:firstName"/>
   </xsl:call-template>
</xsl:template>

<xsl:template match="ems:E06_01" mode="krextor:main">
   <xsl:call-template name="krextor:add-literal-property">
      <xsl:with-param name="property" select="$foaf:family_name"/>
   </xsl:call-template>
</xsl:template>

</xsl:transform>

Notes

https://trac.kwarc.info/krextor/wiki/Publications

xml.xsl extractor during I-Semantics 2010
