XSLT-based conversion from MARCXML to BIBFRAME 2.0
- Introduction
- Using the converter
- Converter design
- Testing
- Active record conversion
- Metaproxy integration
- Known issues
- Repository contents
- Dependencies
This repository contains an XSLT 1.0 application for converting MARCXML records to RDF/XML, using the BIBFRAME 2.0 and MADSRDF ontologies. The expected input is a MARCXML record or collection, and the output is an XML document expressing the data as a set of RDF triples in the striped RDF/XML syntax. In addition, there is a sample configuration for the Metaproxy search gateway server from Index Data, showing the integration of the application with Metaproxy to provide both a "static" conversion of MARC records and an "active" conversion that attempts to resolve identifiers for configured entities.
The specification for the conversion has been published by the Library of Congress at http://www.loc.gov/bibframe/mtbf/.
In the simplest case, you can invoke an XSLT processor with the main stylesheet (xsl/marc2bibframe2.xsl) as the first argument, and an XML file containing MARCXML as the second:
xsltproc xsl/marc2bibframe2.xsl test/data/marc.xml
The converter supports four optional parameters:
- baseuri - the URI stem for generated entities. Default is http://example.org/, which will result in minting URIs like http://example.org/<record ID>#Work
- idfield - the field of the MARC record that contains the record ID, used in minting URIs as above. Default is 001. If the idfield refers to a MARC data field rather than a MARC control field, the subfield can also be indicated - e.g. 035a (the default subfield is a). Note - there is no built-in facility in the stylesheets for URI-encoding.
- idsource - a URI used to identify the source of the Local identifier derived from the idfield - e.g., http://id.loc.gov/vocabulary/organizations/dlc. This will be empty by default, resulting in no source property being defined.
- serialization - the RDF serialization to be used for output. Currently only rdfxml is supported (the default).
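The idfield rule above (a three-digit tag, optionally followed by a subfield code that defaults to a for data fields) can be illustrated with a small Python sketch. This is not code from the stylesheets; the function name parse_idfield is hypothetical, and the sketch assumes the standard MARC convention that tags 001-009 are control fields without subfields.

```python
from typing import Optional, Tuple


def parse_idfield(idfield: str) -> Tuple[str, Optional[str]]:
    """Split an idfield value such as "001" or "035a" into (tag, subfield).

    Hypothetical helper for illustration only; the converter does this
    work inside its XSLT templates.
    """
    tag, subfield = idfield[:3], idfield[3:4]
    if len(tag) != 3 or not tag.isdigit():
        raise ValueError(f"expected a three-digit MARC tag, got {idfield!r}")
    if tag < "010":
        # MARC control fields (001-009) carry no subfields.
        return tag, None
    # Data fields fall back to subfield "a" when none is given.
    return tag, subfield or "a"


print(parse_idfield("001"))   # control field, no subfield
print(parse_idfield("035a"))  # data field with explicit subfield
print(parse_idfield("035"))   # data field, default subfield "a"
```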
Different XSLT processors have different syntaxes for passing parameters. For xsltproc, the syntax is:
xsltproc --stringparam baseuri http://mylibrary.org/ --stringparam idsource http://id.loc.gov/vocabulary/organizations/dlc xsl/marc2bibframe2.xsl test/data/marc.xml
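If you drive xsltproc from a script, the same command line can be assembled programmatically. The following Python sketch only builds and prints the argument list; the helper name xsltproc_command is an assumption made for this example, and actually running the command would additionally require xsltproc to be installed.

```python
import shlex


def xsltproc_command(stylesheet, source, **params):
    """Build an xsltproc argument list, one --stringparam per parameter."""
    argv = ["xsltproc"]
    for name, value in params.items():
        argv += ["--stringparam", name, value]
    argv += [stylesheet, source]
    return argv


cmd = xsltproc_command(
    "xsl/marc2bibframe2.xsl",
    "test/data/marc.xml",
    baseuri="http://mylibrary.org/",
    idsource="http://id.loc.gov/vocabulary/organizations/dlc",
)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd, check=True) would then execute the conversion.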
For Metaproxy integration, the converter parameters can be passed to the stylesheets using the <param> element in the YAZ configuration:
<xslt stylesheet="xsl/marc2bibframe2.xsl">
<param name="baseuri" value="http://mylibrary.org/"/>
</xslt>
Some elements of the conversion can be configured using XML files in the conf directory. Currently, this only includes language mappings for elements generated by 880 tags, and subject thesaurus mappings for MADSRDF elements generated by 6XX tags.
The main stylesheet of the XSLT converter application, xsl/marc2bibframe2.xsl, uses push processing to process the fields of each MARC record and build the two main elements it generates, a bf:Work and a bf:Instance. In addition, the fields are pushed through to generate a bflc:adminMetaData property of the bf:Work and to generate bf:hasItem properties of the bf:Instance.
Elements in the resulting RDF/XML document that are not blank nodes or nodes with statically determined URIs are given newly minted URIs constructed from the stem of the baseuri parameter (default http://example.org/), the record ID of the MARC record (by default the value of the 001 field), and a hash URI for the new element. For elements that are not the main bf:Work or bf:Instance element generated by the record, the hash URI is constructed from the element class, the field number, and the position of the field in the MARC record, e.g.:
http://example.org/13600108#Agent100-12
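The URI pattern described above can be restated as a short Python sketch. It only mirrors the pattern given in this README; the function mint_uri and its parameter names are hypothetical, and the real construction happens inside the XSLT templates.

```python
def mint_uri(record_id, element_class, tag=None, position=None,
             baseuri="http://example.org/"):
    """Mint a URI from the baseuri stem, the record ID, and a fragment.

    The main Work and Instance use the bare class name as the fragment;
    other elements append the field number and the field's position.
    """
    if tag is None:
        fragment = element_class
    else:
        fragment = f"{element_class}{tag}-{position}"
    return f"{baseuri}{record_id}#{fragment}"


print(mint_uri("13600108", "Work"))
print(mint_uri("13600108", "Agent", tag="100", position=12))
```

Like the stylesheets, this sketch does no URI-encoding of the record ID.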
The templates that match the MARC fields are contained in stylesheets included from the main stylesheet, along with some utility templates in the utils.xsl stylesheet and templates for matching control subfields in the ConvSpec-ControlSubfields.xsl stylesheet. Configuration information is read into variables using the document() function.
As much as possible, templates representing each specification document in the specifications are contained in a stylesheet with the same name, for easier maintenance.
Each of the specification documents in the specifications is represented in a corresponding test suite in the test directory, with test data in the test/data directory.
The tests are written for the XSpec testing framework, a behavior-driven development testing framework for XSLT and XQuery. To run the tests, you must install the Saxon XSLT and XQuery processor as well as XSpec. Installation instructions are available on the XSpec wiki.
Once you have XSpec installed, you can run the entire test suite with the command (for Mac OS or Linux):
xspec.sh test/marc2bibframe2.xspec
Test reports will be output in the test/xspec directory.
Active conversion of records (resolving URIs for elements of the RDF/XML output from authoritative sources, such as the Library of Congress Name Authority File) is achieved through a retrieval tool conversion in the YAZ toolkit.
The retrieval tool in YAZ is driven by an XML configuration, documented in the YAZ User's Guide and Reference. The YAZ conversion for RDF/XML is called rdf-lookup, and the configuration looks like this:
<backend syntax="xml" name="rdf-lookup">
<xslt stylesheet="xsl/marc2bibframe2.xsl"/>
<rdf-lookup debug="1">
<namespace prefix="bf" href="http://id.loc.gov/ontologies/bibframe/" />
<namespace prefix="bflc" href="http://id.loc.gov/ontologies/bflc/"/>
<lookup xpath="//bf:contribution/bf:Contribution/bf:agent/bf:Agent">
<key field="bflc:name00MatchKey"/>
<key field="bflc:name01MatchKey"/>
<key field="bflc:name11MatchKey"/>
<server url="http://id.loc.gov/authorities/names/label/%s" method="HEAD"/>
</lookup>
</rdf-lookup>
</backend>
From the YAZ User's Guide:
The debug="1" attribute tells the filter to add XML comments to the key nodes that indicate what lookup it tried to do, how it went, and how long it took. The namespace prefix bf: is defined in the namespace tags. These namespaces are used in the xpath expressions in the lookup sections. The lookup tag specifies one tag to be looked up. The xpath attribute defines which node to modify. It may make use of the namespace definitions above. The server tag gives the URL to be used for the lookup. A %s in the string will get replaced by the key value. If there is no server tag, the one from the preceding lookup section is used, and if there is no previous section, the id.loc.gov address is used as a default. The default is to make a GET request, this example uses HEAD.
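To make the %s substitution and the HEAD method concrete, here is a Python sketch that builds (but does not send) the request for one key value. It is not the YAZ implementation: the function name lookup_request and the sample key are hypothetical, and percent-encoding the key before substitution is an assumption of this sketch rather than documented rdf-lookup behavior.

```python
from urllib.parse import quote
from urllib.request import Request

# Server URL template taken from the configuration above.
SERVER = "http://id.loc.gov/authorities/names/label/%s"


def lookup_request(key, server=SERVER, method="HEAD"):
    """Replace %s in the server URL with the key and build a request.

    Assumption: the key is percent-encoded before substitution.
    """
    url = server.replace("%s", quote(key, safe=""), 1)
    return Request(url, method=method)


# Hypothetical match key value, for illustration only.
req = lookup_request("Twain, Mark, 1835-1910")
print(req.get_method(), req.full_url)
```

Sending the request with urllib.request.urlopen(req) would perform the actual lookup against the server.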
With this configuration saved as record-conv.xml, you could perform an active conversion of a MARCXML file using the yaz-record-conv utility like so:
yaz-record-conv record-conv.xml test/data/marc.xml
The rdf-lookup conversion support was first introduced in YAZ v5.19.0. YAZ 5.20.0 provided a significant performance improvement for HEAD requests, so using that version or higher is highly recommended.
Both the static and active conversions can be easily integrated into Index Data's Metaproxy metasearch gateway software as a record output format. A sample filter configuration is in the metaproxy directory. With this filter configuration, an SRU request to the server like
http://metaproxy.mylibrary.org/?version=1.1&operation=searchRetrieve&query=rec.id%3D13600108&recordSchema=bibframe2&startRecord=1&maximumRecords=1
would retrieve and display the requested record converted into BIBFRAME triples in RDF/XML format. The install-filters.sh script in that directory would deploy the filters into a running Metaproxy configuration.
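The SRU request shown above can be assembled from its parts, which makes the role of each query parameter easier to see. This Python sketch uses only the standard library; the function name sru_url is hypothetical, and the host name is the placeholder used in this README.

```python
from urllib.parse import urlencode


def sru_url(base, record_id, schema="bibframe2"):
    """Build an SRU searchRetrieve URL for a single record by rec.id."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": f"rec.id={record_id}",  # urlencode escapes "=" as %3D
        "recordSchema": schema,
        "startRecord": 1,
        "maximumRecords": 1,
    }
    return f"{base}?{urlencode(params)}"


print(sru_url("http://metaproxy.mylibrary.org/", "13600108"))
```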
In addition, we have provided a Vagrantfile and Ansible playbook to build a local Metaproxy VM using VirtualBox for testing, available in the deploy directory.
- Dealing with punctuation embedded in cataloged bibliographic records is an inexact science. The specifications address this issue very minimally. Some attempt has been made to do a reasonable amount of punctuation handling. In general, for rdfs:label elements, punctuation is left in place as it is found in the source record. For other data elements, an attempt is made to strip end punctuation where appropriate.
- The handling of alternative scripts through the MARC subfield 6 and data field 880 is done by processing 880 tags as if they were the source tag - so a data field like:
<datafield ind1="0" ind2=" " tag="880"> <subfield code="6">130-01/(3</subfield> <subfield code="a">ملحمة دانيال</subfield> </datafield>
is processed as though it were a MARC data field 130. No attempt is made to link the subfield 6 of the source tag with the appropriate 880.
- bf:hasItem properties and bf:Item elements can be created by several MARC data fields. It is not always clear whether the data fields in the record refer to the same item, or to different items held by different institutions. This will likely result in the creation of separate bf:Item elements that need to be collapsed.
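As an illustration of the 880 handling described in the list above, the following Python sketch parses a subfield 6 value and picks the tag a field should be processed as. It follows the MARC 21 layout of subfield 6 (linking tag, occurrence number, then optional script and orientation codes); the names parse_linkage and effective_tag are hypothetical and do not correspond to templates in the stylesheets.

```python
import re

# MARC 21 subfield 6 layout: tag-occurrence[/script[/orientation]]
LINKAGE = re.compile(
    r"^(?P<tag>\d{3})-(?P<occurrence>\d{2})"
    r"(?:/(?P<script>[^/]+))?(?:/(?P<orientation>.+))?$"
)


def parse_linkage(value):
    """Split a subfield 6 value such as "130-01/(3" into its parts."""
    match = LINKAGE.match(value)
    if match is None:
        raise ValueError(f"not a valid subfield 6 value: {value!r}")
    return match.groupdict()


def effective_tag(field_tag, subfield6):
    """Return the tag an 880 field stands in for; other tags are unchanged."""
    if field_tag == "880":
        return parse_linkage(subfield6)["tag"]
    return field_tag


print(parse_linkage("130-01/(3"))
print(effective_tag("880", "130-01/(3"))
```

In the example record, the script code (3 identifies Arabic, and the field is treated as a 130.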