marc2bibframe2

XSLT-based conversion from MARCXML to BIBFRAME 2.0

Introduction

This repository contains an XSLT 1.0 application for converting MARCXML records to RDF/XML, using the BIBFRAME 2.0 and MADSRDF ontologies. The expected input is a MARCXML record or collection, and the output is an XML document expressing the data as a set of RDF triples in the striped RDF/XML syntax. In addition, there is a sample configuration for the Metaproxy search gateway server from Index Data, showing the integration of the application with Metaproxy to provide both a "static" conversion of MARC records and an "active" conversion that attempts to resolve identifiers for configured entities.

The specification for the conversion has been published by the Library of Congress at http://www.loc.gov/bibframe/mtbf/.

Using the converter

In the simplest case, you can invoke an XSLT processor with the main stylesheet (xsl/marc2bibframe2.xsl) as the first argument, and an XML file containing MARCXML as the second:

xsltproc xsl/marc2bibframe2.xsl test/data/marc.xml

Converter parameters

The converter supports four optional parameters:

  • baseuri - the URI stem for generated entities. Default is http://example.org/, which will result in minting URIs like http://example.org/<record ID>#Work

  • idfield - the field of the MARC record that contains the record ID, used in minting URIs as above. Default is 001. If the idfield refers to a MARC data field rather than a MARC control field, the subfield can also be indicated, e.g. 035a (the default subfield is a). Note that the stylesheets have no built-in facility for URI-encoding, so the record ID is used in minted URIs exactly as it appears in the record.

  • idsource - a URI used to identify the source of the Local identifier derived from the idfield - e.g., http://id.loc.gov/vocabulary/organizations/dlc. This will be empty by default, resulting in no source property being defined.

  • serialization - the RDF serialization to be used for output. Currently only rdfxml is supported (the default).
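As a rough illustration of how an idfield value such as 035a might be read, and of percent-encoding a record ID yourself since the stylesheets do not, here is a minimal Python sketch. The function names `split_idfield` and `encode_record_id` are hypothetical and not part of the converter, and the assumption that the tag is the first three characters is ours; the real parsing happens in XSLT and may differ in detail.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def split_idfield(idfield: str = "001") -> Tuple[str, Optional[str]]:
    """Split an idfield value into (tag, subfield).

    Assumption: the first three characters are the MARC tag and an
    optional fourth character is the subfield code.
    """
    tag, subfield = idfield[:3], idfield[3:4]
    if tag.startswith("00"):
        # Control fields (001-009) carry no subfields.
        return tag, None
    # Data fields fall back to subfield "a", the documented default.
    return tag, subfield or "a"


def encode_record_id(record_id: str) -> str:
    """Percent-encode a record ID before it is used in a minted URI."""
    return quote(record_id.strip(), safe="")


print(split_idfield("035a"))
print(encode_record_id("(OCoLC) 1234"))
```

Encoding would have to happen before the record reaches the stylesheets, for example in a preprocessing step over the MARCXML.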

Different XSLT processors have different syntaxes for passing parameters. For xsltproc, the syntax is:

xsltproc --stringparam baseuri http://mylibrary.org/ --stringparam idsource http://id.loc.gov/vocabulary/organizations/dlc xsl/marc2bibframe2.xsl test/data/marc.xml
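If you drive the conversion from a script, the same command line can be assembled programmatically. The sketch below only builds the argument list; the helper name `xsltproc_command` is hypothetical, and actually running the command assumes xsltproc is installed and on the PATH.

```python
def xsltproc_command(stylesheet, source, params=None):
    """Build an xsltproc argument list with one --stringparam pair per parameter."""
    cmd = ["xsltproc"]
    for name, value in (params or {}).items():
        cmd += ["--stringparam", name, value]
    return cmd + [stylesheet, source]


cmd = xsltproc_command(
    "xsl/marc2bibframe2.xsl",
    "test/data/marc.xml",
    {"baseuri": "http://mylibrary.org/", "idfield": "035a"},
)
print(" ".join(cmd))

# To run the conversion (requires xsltproc on the PATH):
#   import subprocess
#   rdf_xml = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```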

For Metaproxy integration, the converter parameters can be passed to the stylesheets using the <param> element in the YAZ configuration:

<xslt stylesheet="xsl/marc2bibframe2.xsl">
  <param name="baseuri" value="http://mylibrary.org/"/>
</xslt>

Converter configuration

Some elements of the conversion can be configured using XML files in the xsl/conf directory. Currently, these cover only language mappings for elements generated from 880 fields and subject thesaurus mappings for MADSRDF elements generated from 6XX fields.

Converter design

The main stylesheet of the XSLT converter application, xsl/marc2bibframe2.xsl, uses push processing to process the fields of each MARC record and build the two main elements it generates, a bf:Work and a bf:Instance. In addition, the fields are pushed through to generate a bf:adminMetadata property of the bf:Work and to generate bf:hasItem properties of the bf:Instance.

Elements in the resulting RDF/XML document that are not blank nodes or nodes with statically determined URIs are given newly minted URIs constructed from the stem of the baseuri parameter (default http://example.org/), the record ID of the MARC record (by default the value of the 001 field), and a hash URI for the new element. For elements that are not the main bf:Work or bf:Instance element generated by the record, the hash URI is constructed from the element class, the field number, and the position of the field in the MARC record, e.g.:

http://example.org/13600108#Agent100-12
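The URI pattern described above can be sketched in a few lines of Python. This is an illustration of the naming scheme only, not the converter's code (which is XSLT); the function name `mint_uri` and its parameters are hypothetical.

```python
def mint_uri(record_id, element_class, tag=None, position=None,
             baseuri="http://example.org/"):
    """Compose a minted URI from the base URI, record ID, and a hash fragment.

    For the main elements the fragment is just the class name; for other
    elements it is the class, the MARC field number, and the field's
    position in the record.
    """
    fragment = element_class
    if tag is not None and position is not None:
        fragment += f"{tag}-{position}"
    return f"{baseuri}{record_id}#{fragment}"


print(mint_uri("13600108", "Work"))
# http://example.org/13600108#Work
print(mint_uri("13600108", "Agent", tag="100", position=12))
# http://example.org/13600108#Agent100-12
```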

The templates that match the MARC fields are contained in stylesheets included from the main stylesheet, along with some utility templates in the utils.xsl stylesheet and templates for matching control subfields in the ConvSpec-ControlSubfields.xsl stylesheet. Configuration information is read into variables using the document() function.

As much as possible, templates representing each specification document in the specifications are contained in a stylesheet with the same name, for easier maintenance.

Testing

Each of the specification documents in the specifications is represented in a corresponding test suite in the test directory, with test data in the test/data directory.

The tests are written for the XSpec testing framework, a behavior driven development testing framework for XSLT and XQuery. To run the tests, you must install the Saxon XSLT and XQuery processor as well as XSpec. Installation instructions are available on the XSpec wiki.

Once you have XSpec installed, you can run the entire test suite with the command (for Mac OS or Linux):

xspec.sh test/marc2bibframe2.xspec

Test reports will be output in the test/xspec directory.

Active record conversion

Active conversion of records, that is, resolving URIs for elements of the RDF/XML output from authoritative sources like the Library of Congress Name Authority File, is achieved through a retrieval tool conversion in the YAZ toolkit.

The retrieval tool in YAZ is driven by an XML configuration, documented in the YAZ User's Guide and Reference. The YAZ conversion for RDF/XML is called rdf-lookup, and the configuration looks like this:

<backend syntax="xml" name="rdf-lookup">
  <xslt stylesheet="xsl/marc2bibframe2.xsl"/>
  <rdf-lookup debug="1">
    <namespace prefix="bf" href="http://id.loc.gov/ontologies/bibframe/" />
    <namespace prefix="bflc" href="http://id.loc.gov/ontologies/bflc/"/>
    <lookup xpath="//bf:contribution/bf:Contribution/bf:agent/bf:Agent">
      <key field="bflc:name00MatchKey"/>
      <key field="bflc:name01MatchKey"/>
      <key field="bflc:name11MatchKey"/>
      <server url="http://id.loc.gov/authorities/names/label/%s" method="HEAD"/>
    </lookup>
  </rdf-lookup>
</backend>

From the YAZ User's Guide:

The debug="1" attribute tells the filter to add XML comments to the key nodes that indicate what lookup it tried to do, how it went, and how long it took. The namespace prefix bf: is defined in the namespace tags. These namespaces are used in the xpath expressions in the lookup sections. The lookup tag specifies one tag to be looked up. The xpath attribute defines which node to modify. It may make use of the namespace definitions above. The server tag gives the URL to be used for the lookup. A %s in the string will get replaced by the key value. If there is no server tag, the one from the preceding lookup section is used, and if there is no previous section, the id.loc.gov address is used as a default. The default is to make a GET request, this example uses HEAD.

With this configuration saved as record-conv.xml, you could perform an active conversion of a MARCXML file using the yaz-record-conv utility like so:

yaz-record-conv record-conv.xml test/data/marc.xml

The rdf-lookup conversion support was first introduced in YAZ v5.19.0. YAZ 5.20.0 provided a significant performance improvement for HEAD requests, so using that version or higher is highly recommended.
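To make the %s substitution from the YAZ guide excerpt concrete, here is a minimal Python sketch that builds (but does not send) a lookup request. The helper `build_lookup_request` and the sample key value are hypothetical, and percent-encoding the key is our assumption; how YAZ itself escapes key values is not covered here.

```python
from urllib.parse import quote
from urllib.request import Request


def build_lookup_request(server_url, key, method="GET"):
    """Substitute the key into the server URL template and prepare a request.

    Assumption: the key is percent-encoded before substitution.
    """
    url = server_url.replace("%s", quote(key, safe=""), 1)
    return Request(url, method=method)


req = build_lookup_request(
    "http://id.loc.gov/authorities/names/label/%s",
    "Twain, Mark, 1835-1910",  # illustrative match key value
    method="HEAD",
)
print(req.get_method(), req.full_url)
```

The request is only constructed here; passing it to `urllib.request.urlopen` would perform the actual network lookup.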

Metaproxy integration

Both the static and active conversions can be easily integrated into Index Data's Metaproxy metasearch gateway software as a record output format. A sample filter configuration is in the metaproxy directory. With this filter configuration, an SRU request to the server like http://metaproxy.mylibrary.org/?version=1.1&operation=searchRetrieve&query=rec.id%3D13600108&recordSchema=bibframe2&startRecord=1&maximumRecords=1 would retrieve and display the requested record converted into BIBFRAME triples in RDF/XML format. The install-filters.sh script in that directory would deploy the filters into a running Metaproxy configuration.

In addition, we have provided a Vagrantfile and Ansible playbook to build a local Metaproxy VM using VirtualBox for testing, available in the deploy directory.
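The SRU request shown above can also be assembled in code. The sketch below reproduces that URL with Python's standard library; the function name `sru_search_url` is hypothetical, and the host is the same placeholder used in the example.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, record_id, schema="bibframe2"):
    """Build an SRU searchRetrieve URL that requests one record by ID."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": f"rec.id={record_id}",
        "recordSchema": schema,
        "startRecord": 1,
        "maximumRecords": 1,
    }
    # urlencode percent-encodes the "=" inside the query value as %3D.
    return f"{base_url}?{urlencode(params)}"


url = sru_search_url("http://metaproxy.mylibrary.org/", "13600108")
print(url)
```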

Known issues

  • Dealing with punctuation embedded in cataloged bibliographic records is an inexact science. The specifications address this issue very minimally. Some attempt has been made to do a reasonable amount of punctuation handling. In general, for rdfs:label elements, punctuation is left in place as it is found in the source record. For other data elements, an attempt is made to strip end punctuation where appropriate.

  • The handling of alternative scripts through the MARC subfield 6 and data field 880 is done by processing 880 tags as if they were the source tag - so a data field like:

    <datafield ind1="0" ind2=" " tag="880">
      <subfield code="6">130-01/(3</subfield>
      <subfield code="a">ملحمة دانيال</subfield>
    </datafield>

    is processed as though it were a MARC data field 130. No attempt is made to link the subfield 6 of the source tag with the appropriate 880.

  • bf:hasItem properties and bf:Item elements can be created by several MARC data fields. It is not always clear whether the data fields in the record refer to the same item, or to different items held by different institutions. This will likely result in the creation of separate bf:Item elements that need to be collapsed.
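For the 880 handling described in the list above, the source tag comes from subfield 6, which in MARC 21 holds a linking tag, a two-digit occurrence number, and an optional script identification code. The following Python sketch shows one way to read that value; the names `Linkage` and `parse_linkage` are hypothetical, and the converter itself does this in XSLT.

```python
import re
from typing import NamedTuple, Optional


class Linkage(NamedTuple):
    tag: str               # linking tag, i.e. the field the 880 stands in for
    occurrence: str        # two-digit occurrence number
    script: Optional[str]  # script identification code, if present


# tag "-" occurrence, optionally followed by "/" and a script code
_LINKAGE = re.compile(r"^(\d{3})-(\d{2})(?:/([^/]+))?")


def parse_linkage(subfield6: str) -> Linkage:
    """Parse a subfield 6 value such as '130-01/(3'."""
    match = _LINKAGE.match(subfield6)
    if match is None:
        raise ValueError(f"unrecognized subfield 6 value: {subfield6!r}")
    return Linkage(*match.groups())


print(parse_linkage("130-01/(3"))
# Linkage(tag='130', occurrence='01', script='(3')
```

Pairing each 880 with its source field would additionally require matching the occurrence number against the subfield 6 of the source tag, which, as noted, the converter does not attempt.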

Repository contents

  • dataset - sample records for exercising conversion
  • metaproxy - sample Metaproxy configuration for static and active conversion
  • test - Unit tests for the XSpec testing framework, and test data
  • xsl - XSLT 1.0 stylesheets for transformation, configuration in xsl/conf

Dependencies