
A standard transformation from XML to RDF via XSLT

vandech1 edited this page Jul 26, 2013 · 9 revisions

This page explains how to transform an XML file into RDF. We do this because the data in the XML file needs to be searchable with SPARQL queries, and SPARQL works best against an RDF file.

Getting your data set & tools

Let's talk about a few things you need before we get started.

  • Get a copy of the XML file from here.

  • Some of the tools: jEdit (to edit your XSL transformation), a web browser (preferably Chrome), and Saxon 9 (an XSLT processor)

Starting on your XSL Stylesheet

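Begin your stylesheet with the following declarations. They identify the file as an XSLT 2.0 stylesheet and bind the db prefix to the DrugBank namespace, so that XPath expressions can match the elements in the XML file.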

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema"
			xmlns:db="http://drugbank.ca" exclude-result-prefixes="db">
<xsl:output method="text" encoding="UTF-8"/>

In the XML file, add the following processing instruction so that an XSLT-aware tool, such as a web browser, knows which stylesheet to apply. In the stylesheet itself, make sure the output method is text, as in the xsl:output line above.

<?xml-stylesheet  type="text/xsl" version="2.0" href="drugbank.xsl"?>

Here are the basic stylesheet elements you should know:

<xsl:template match="/"> - Defines a template. The match attribute holds the pattern that selects the nodes the template applies to; here "/" matches the document root.

<xsl:for-each select="/"> - Loops over every node in the node-set chosen by the select expression.

<xsl:value-of select=""/> - Outputs the text value of the selected node.
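As a rough illustration of how these three elements fit together, here is a minimal sketch of a complete stylesheet. It assumes the db:drugs/db:drug structure and the db:drugbank-id and db:name elements used elsewhere on this page; the tab-separated output is only for demonstration and is not the RDF we are building.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:db="http://drugbank.ca"
    exclude-result-prefixes="db">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Runs once, at the document root -->
  <xsl:template match="/">
    <!-- Visit each drug record in turn -->
    <xsl:for-each select="db:drugs/db:drug">
      <!-- Write the ID, a tab, the name, then a newline -->
      <xsl:value-of select="db:drugbank-id"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="db:name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Inside the for-each, paths such as db:name are evaluated relative to the current db:drug element, which is why they do not need to repeat the full path from the root.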

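Next, the stylesheet declares the namespace behind each prefix used in the output. The angle brackets are written as &lt; and &gt; because the declarations sit inside the XML stylesheet: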

   @prefix dcterms: &lt;http://purl.org/dc/terms/&gt; .
   @prefix nanopub: &lt;http://www.nanopub.org/nschema#&gt; .
   @prefix opm: &lt;http://purl.org/net/opmv/ns#&gt; .
   @prefix pav: &lt;http://swan.mindinformatics.org/ontologies/1.2/pav/&gt; .
   @prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
   @prefix sio: &lt;http://semanticscience.org/resource/&gt; .
   @prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
   @prefix : &lt;http://purl.org/twc/healthdata/source/drugbank/nanopub&gt; .

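Each prefix is shorthand for a full URI, so a prefixed name in the output, such as rdfs:label, expands to a complete URI that a SPARQL query can match. This matters because the terms we use come from published vocabularies, and each term must resolve to the URI that its vocabulary defines. Every prefix that appears in the output needs a declaration here.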
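For example, with the rdfs declaration above, rdfs:label stands for the full URI http://www.w3.org/2000/01/rdf-schema#label. One way to write the declarations into the output is to emit them once from the root template, before any drug records. The sketch below shows only two of the prefixes, and placing them in the root template is an assumption about layout rather than something the original stylesheet is known to do.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- Written once; the escaped entities become literal angle brackets in the text output -->
    <xsl:text>@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .&#10;</xsl:text>
    <xsl:text>@prefix sio: &lt;http://semanticscience.org/resource/&gt; .&#10;</xsl:text>
    <xsl:text>&#10;</xsl:text>
    <!-- The per-drug xsl:for-each would follow here -->
  </xsl:template>
</xsl:stylesheet>
```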

Inside <xsl:for-each select="db:drugs/db:drug"> we now state, for each drug, what kind of data each resource represents in DrugBank and how the resources relate to one another.

:<xsl:value-of select="db:drugbank-id"/> a nanopub:Nanopublication ;
            nanopub:hasAssertion :NanoPub_1_Assertion ;
            nanopub:hasProvenance :NanoPub_1_Provenance .

:<xsl:value-of select="db:drugbank-id"/>_Provenance nanopub:hasAttribution :<xsl:value-of select="db:drugbank-id"/>_Attribution ;
       		nanopub:hasSupporting :<xsl:value-of select="db:drugbank-id"/>_Supporting .

:<xsl:value-of select="db:drugbank-id"/>_Assertion a nanopub:Assertion .
       :<xsl:value-of select="db:drugbank-id"/>_Provenance a nanopub:Provenance .
       :<xsl:value-of select="db:drugbank-id"/>_Attribution a nanopub:Attribution .
       :<xsl:value-of select="db:drugbank-id"/>_Supporting a nanopub:Supporting .

The block above says that each DrugBank ID identifies a nanopublication with an assertion and a provenance, and that the provenance has an attribution and supporting information. Keep in mind that if you are transforming a file other than this example, your XSL stylesheet will look quite different. Next we have the XSLT look for specific names, descriptions, and so on. We do this by adding a :<xsl:value-of select="db:drugbank-id"/>_Assertion graph to our stylesheet.

drugbank:<xsl:value-of select="db:drugbank-id"/> a sio:drug;
        rdfs:label &quot;<xsl:value-of select="db:name"/>&quot;;
        	dcterms:description &quot;<xsl:value-of select="normalize-space(db:description/text())"/>&quot;;

Looking at rdfs:label, we create an RDF label from the name of the drug, as seen in the tag <xsl:value-of select="db:name"/>. The stylesheet then gathers the other information we need in the same way: wherever you see dcterms:[nameOfTag], the element with that name is read and its text is selected.
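One thing to watch for is that text copied into a quoted literal can itself contain double quotes or backslashes, which would break the literal. The sketch below is one possible way to guard against that with the XPath 2.0 replace() function; it is not part of the original stylesheet, and it assumes each drug has at most one db:description element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:db="http://drugbank.ca"
    exclude-result-prefixes="db">
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <xsl:for-each select="db:drugs/db:drug">
      <!-- Collapse whitespace, then escape backslashes first and double quotes second -->
      <xsl:variable name="desc"
          select="replace(replace(normalize-space(db:description), '\\', '\\\\'), '&quot;', '\\&quot;')"/>
      <xsl:text>        dcterms:description "</xsl:text>
      <xsl:value-of select="$desc"/>
      <xsl:text>" ;&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Backslashes are escaped before quotes so that the backslash added in front of each quote is not doubled by the other replacement.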

Under :<xsl:value-of select="db:drugbank-id"/>_Attribution and :<xsl:value-of select="db:drugbank-id"/>_Supporting you would use the same type of construction, writing out whatever information you need for the transformation.
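As a hedged sketch of what the attribution part might look like, the stylesheet below writes one _Attribution graph per drug. The prov:wasAttributedTo and dcterms:created terms are taken from the sample output further down this page; reading the date from a created attribute on the db:drug element is an assumption about the XML, so adjust the path to match your file. The date is left as an untyped literal here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:db="http://drugbank.ca"
    exclude-result-prefixes="db">
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <xsl:for-each select="db:drugs/db:drug">
      <!-- Keep the ID in a variable so it is looked up only once per drug -->
      <xsl:variable name="id" select="db:drugbank-id"/>
      <xsl:text>:</xsl:text>
      <xsl:value-of select="$id"/>
      <xsl:text>_Attribution {&#10;    :</xsl:text>
      <xsl:value-of select="$id"/>
      <xsl:text>_Assertion prov:wasAttributedTo &lt;http://drugbank.ca&gt; ;&#10;</xsl:text>
      <!-- @created is a hypothetical attribute name used for illustration -->
      <xsl:text>        dcterms:created "</xsl:text>
      <xsl:value-of select="@created"/>
      <xsl:text>" .&#10;}&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```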

Easy peasy... now we can move on to running the actual transformation.

Using Saxon 9

Now that you have a working XSL stylesheet, it's time to transform the XML file into our RDF file. Saxon 9 makes this easy. First make sure that your Saxon 9 folder contains the jar files. Next put your XML and XSL files in that same location.

Steps

  1. Open cmd.exe in the folder that contains your Saxon files.
  2. Enter this to start your transformation (-Xmx1024m lets Java use up to 1024 MB of heap, which helps with a large XML file): java -Xmx1024m -jar saxon9he.jar -xsl:drugbank.xsl -s:drugbank.xml -o:drugbank.rdf
  3. Look in your folder for the converted file, drugbank.rdf.
  4. Done

This is a sample of what your RDF file should look like after the conversion:

{
       :DB00001 a nanopub:Nanopublication ;
            nanopub:hasAssertion :NanoPub_1_Assertion ;
            nanopub:hasProvenance :NanoPub_1_Provenance .
 
       :DB00001_Provenance nanopub:hasAttribution :DB00001_Attribution ;
       		nanopub:hasSupporting :DB00001_Supporting .
 
       :DB00001_Assertion a nanopub:Assertion .
       :DB00001_Provenance a nanopub:Provenance .
       :DB00001_Attribution a nanopub:Attribution .
       :DB00001_Supporting a nanopub:Supporting .
}
:DB00001_Assertion {
    drugbank:DB00001 a sio:drug;
        rdfs:label "Lepirudin";
        	dcterms:description "Lepirudin is identical to natural hirudin...";
        	dcterms:substrate "";
        	dcterms:enzymes "";
        	dcterms:mechanism-of-action "Lepirudin forms a stable non-covalent...";
                dcterms:targets "inhibitor # Turpie AG: Anticoagulants in acute...";
}
:DB00001_Attribution {
       	:DB00001_Assertion prov:wasAttributedTo <http://drugbank.ca>;
          	dcterms:created "2005-06-13 07:24:05 -0600"^^xsd:dateTime .
          	dcterms:updated "2013-05-12 21:37:25 -0600"^^xsd:dateTime .
}
:DB00001_Supporting {
		:DB00001_Assertion prov:wasDerivedFrom <http://drugbank.ca>;
			dcterms:general-references "# Smythe MA, Stephens JL, Koerber JM...";
}

Resources

Here are the files that were created from the work we did: Drugbank RDF file. Drugbank XSL file. Drugbank XML file.
