Created in October-December 2022 for the National Library of Scotland's Data Foundry by Gustavo Candela, National Librarian’s Research Fellowship in Digital Scholarship 2022-23
To explore the collections listed below in live Jupyter Notebooks, open them here in Binder (please note this may take several minutes to load).
To copy and work with the Jupyter Notebooks on your own machine, see Setup.
- Datasets
- Setup
- Suggested Citations
- Moving Image Archive
- National Bibliography of Scotland and BOSLIT
- Data quality assessment
- References
Moving Image Archive
- Owner: National Library of Scotland
- Creator: National Library of Scotland
- Website: Visit the NLS Data Foundry
- Licence: Public Domain
National Bibliography of Scotland
- Owner: National Library of Scotland
- Creator: National Library of Scotland
- Website: Visit the NLS Data Foundry
- DOI: https://doi.org/10.34812/7cda-ep21
- Date created: 2022
- Licence: Creative Commons Attribution 4.0 International (CC-BY 4.0)
BOSLIT
- Owner: National Library of Scotland
- Creator: National Library of Scotland
- Website: Visit the NLS Data Foundry
- Licence: Creative Commons Attribution 4.0 International (CC-BY 4.0)
To clone the Git repository and run the Jupyter Notebooks on your own machine, follow the steps below according to your operating system.
For Mac and Linux
# Install (if not yet installed) the virtualenv package
python3 -m pip install --user virtualenv
# Create a virtual environment (you can change 'env' to a name of your choice)
python3 -m venv env
# Activate the environment
source env/bin/activate
# Clone the repository
git clone https://github.com/NLS-Digital-Scholarship/nls-fellowship-2022-23.git
# Enter the repository
cd nls-fellowship-2022-23
# Install the required packages
python3 -m pip install -r requirements.txt
# When you're finished, deactivate the environment (activate it once more
# and enter the repository when you wish to resume your work)
deactivate
For Windows
# Install (if not yet installed) the virtualenv package
py -m pip install --user virtualenv
# Create a virtual environment (you can change 'env' to a name of your choice)
py -m venv env
# Activate the environment
.\env\Scripts\activate
# Clone the repository
git clone https://github.com/NLS-Digital-Scholarship/nls-fellowship-2022-23.git
# Enter the repository
cd nls-fellowship-2022-23
# Install the required packages
py -m pip install -r requirements.txt
# When you're finished, deactivate the environment (activate it once more
# and enter the repository when you wish to resume your work)
deactivate
Gustavo Candela. Towards a semantic approach in GLAM Labs: the case of the Data Foundry at the National Library of Scotland. CoRR abs/2301.11182 (2023). https://arxiv.org/abs/2301.11182
This dataset contains the descriptive metadata from the catalogue of the Moving Image Archive, Scotland's national collection of moving images.
- Data format: metadata available as MARCXML and Dublin Core
- Data source: https://data.nls.uk/data/metadata-collections/moving-image-archive/
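As a quick illustration, the MARCXML can be inspected with the pymarc library before any transformation (a minimal sketch; the filename below is a placeholder for the file downloaded from the Data Foundry):

# A minimal sketch: inspect the Moving Image Archive MARCXML with pymarc.
# The filename is a placeholder for the file downloaded from the Data Foundry.
from pymarc import parse_xml_to_array

records = parse_xml_to_array("moving-image-archive.xml")
print(len(records), "records found")
print(records[0])  # print the first record in mnemonic MARC format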
The Jupyter Notebooks include a set of examples to reproduce the transformation to RDF and enrichment with external repositories:
- Data extraction
- Exploring the CSV text file
- Transformation to LOD
- Enrichment
- Exploring with SPARQL
- Exploring geographic locations
- Data Quality assessment
The transformation is based on the schema.org vocabulary, using the class schema:VideoObject.
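As an illustration of the target model, the sketch below builds one schema:VideoObject resource with rdflib; the IRI and literal values are hypothetical examples, not records from the dataset:

# A minimal sketch of the target model using rdflib; the IRI and the
# literal values are hypothetical, not taken from the actual dataset.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("https://schema.org/")

g = Graph()
g.bind("schema", SCHEMA)

video = URIRef("https://example.org/moving-image-archive/0001")  # hypothetical IRI
g.add((video, RDF.type, SCHEMA.VideoObject))
g.add((video, SCHEMA.name, Literal("Example film title")))
g.add((video, SCHEMA.abstract, Literal("A short description of the film.")))
g.add((video, SCHEMA.duration, Literal("PT10M")))  # ISO 8601 duration

print(g.serialize(format="turtle"))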
Several approaches have been followed to create a map visualisation showing the locations named in the dataset's metadata. First, the Python library folium has been used to create a map.
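A minimal folium sketch of this first approach (the place names and coordinates below are illustrative examples, not values extracted from the dataset):

# A minimal sketch of the folium approach; the place names and coordinates
# are illustrative examples, not values extracted from the dataset.
import folium

# centre the map on Scotland
m = folium.Map(location=[56.49, -4.20], zoom_start=6)

# one marker per geographic location found in the metadata
folium.Marker([55.9533, -3.1883], popup="Edinburgh").add_to(m)
folium.Marker([55.8642, -4.2518], popup="Glasgow").add_to(m)

m.save("moving-image-archive-map.html")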
A second approach is based on Wikidata: the links provided in the RDF data are used to build a map as the result of a SPARQL query. Click the following link to see the visualisation: Wikidata. The SPARQL query uses the VALUES clause to pass the linked Wikidata entities:
#defaultView:Map
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?r ?rLabel (SAMPLE(?image) as ?img) (SAMPLE(?location) as ?l)
WHERE {
  VALUES ?r { wd:Q793283 wd:Q207257 wd:Q211091 wd:Q980084 wd:Q17582129 wd:Q1247435 wd:Q652539 wd:Q2421 wd:Q23436 wd:Q1061313 wd:Q189912 wd:Q530296 wd:Q81052 wd:Q202177 wd:Q54809 wd:Q786649 wd:Q664892 wd:Q1247396 wd:Q1147435 wd:Q9177476 wd:Q47134 wd:Q3643362 wd:Q4093 wd:Q206934 wd:Q550606 wd:Q864668 wd:Q100166 wd:Q123709 wd:Q203000 wd:Q80967 wd:Q978599 wd:Q204940 wd:Q182923 wd:Q207268 wd:Q1229763 wd:Q376914 wd:Q106652 wd:Q36405 wd:Q201149 wd:Q1247384 }
  ?r wdt:P625 ?location .  # coordinates
  OPTIONAL { ?r wdt:P18 ?image }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} GROUP BY ?r ?rLabel
The transformation process is based on the tool marc2bibframe2, which uses BIBFRAME as the main vocabulary to describe the resources.
The original metadata, described in MARCXML, is automatically transformed into BIBFRAME using the XSLT template provided by marc2bibframe2. Each record in the original dataset is extracted using a Python script. The final RDF dataset can be generated and queried using Apache Jena's TDB2 component as the RDF storage system, provided in this Java project. An Apache Maven installation is required to run the Java project.
To generate the RDF, we first need to download the datasets from the Data Foundry, as well as the marc2bibframe2 tool.
Then we need to set up the paths in the Python script Marc2bibframe:
import lxml.etree as lxml
import xml.etree.ElementTree as ET

ET.register_namespace('marc', "http://www.loc.gov/MARC21/slim")

# add the path to the dataset
filename = "../data/nls-nbs-v2/NBS_v2_validated_marcxml.xml"
# add the path to the XSLT file in the marc2bibframe2 project
xsl_filename = "../tools/marc2bibframe2/xsl/marc2bibframe2.xsl"

# parse the XSLT template once, outside the loop
xslt = lxml.parse(xsl_filename)
transform = lxml.XSLT(xslt)

count = 0
# stream the MARCXML file record by record to keep memory usage low
for event, elem in ET.iterparse(filename, events=("start", "end")):
    if event == 'end':
        # process each MARC record element
        if elem.tag == '{http://www.loc.gov/MARC21/slim}record':
            xml_str = ET.tostring(elem).decode()
            marc_record = lxml.XML(xml_str)
            # transform the record to BIBFRAME RDF/XML
            result = transform(marc_record)
            # write the gzip-compressed output for this record
            result.write_output("../rdf/nbs/nbs_output_" + str(count) + ".rdf.gz", compression=9)
            count += 1
            print(count)
        # discard the processed element to free memory
        elem.clear()
The RDF files generated by the process will be stored in the folder /rdf/nbs/.
The data modelling is based on BIBFRAME as the main vocabulary. The following figure shows an overview of the main classes used to describe the bibliographic information.
Uniform titles (e.g., MARC fields 130 and 240) are processed in order to create a bf:Hub resource, including the title (240 $a), language (240 $l) and main author (100 $a).
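As a rough illustration of this modelling, the sketch below builds one bf:Hub with rdflib; the exact structure produced by marc2bibframe2 may differ, and the IRIs and values here are hypothetical:

# A minimal sketch of a bf:Hub derived from a uniform title; the IRIs and
# literal values are hypothetical, and the real marc2bibframe2 output may
# be structured differently.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

BF = Namespace("http://id.loc.gov/ontologies/bibframe/")

g = Graph()
g.bind("bf", BF)

hub = URIRef("https://example.org/hub/1")        # hypothetical IRI
title = URIRef("https://example.org/hub/1#title")

g.add((hub, RDF.type, BF.Hub))
g.add((hub, BF.title, title))                    # from 240 $a
g.add((title, RDF.type, BF.Title))
g.add((title, BF.mainTitle, Literal("Treasure Island")))
g.add((hub, BF.language, URIRef("http://id.loc.gov/vocabulary/languages/eng")))  # from 240 $l

print(g.serialize(format="turtle"))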
In order to store the RDF and query the information, the Apache Jena TDB2 storage system was used. A Java project was created to identify the classes and properties based on BIBFRAME.
The following SPARQL query shows how to query the RDF dataset to identify works by authors whose label contains Stevenson, Robert Louis:
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?a
WHERE {
  ?s bf:contribution ?c .
  ?c bf:agent ?a .
  ?a rdfs:label ?label .
  FILTER regex(str(?a), "http://")
  FILTER regex(str(?label), "Stevenson, Robert Louis")
} LIMIT 10
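The same query can also be run programmatically, for instance with the SPARQLWrapper library (a sketch; the endpoint URL is assumed to be the local Fuseki/TDB2 endpoint also used in the sheXer example below, so adjust it to your setup):

# A sketch of running the query above with SPARQLWrapper; the endpoint URL
# is an assumption about your local setup, matching the sheXer example below.
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label ?a WHERE {
  ?s bf:contribution ?c .
  ?c bf:agent ?a .
  ?a rdfs:label ?label .
  FILTER regex(str(?a), "http://")
  FILTER regex(str(?label), "Stevenson, Robert Louis")
} LIMIT 10
"""

sparql = SPARQLWrapper("http://localhost:3330/rdf/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["a"]["value"], "-", row["label"]["value"])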
The RDF datasets have been assessed by means of SPARQL in several ways, using data quality criteria (e.g., timeliness, completeness, accuracy, consistency and interlinking). In addition, an innovative method to assess RDF repositories has been used, based on Shape Expressions (ShEx), a language for describing RDF graph structures. A ShEx schema describes the constraints that RDF data graphs must meet in order to be considered conformant; its node constraints are used to assess the triples found in an RDF dataset.
A collection of ShEx schemas has been created to describe and assess the resources stored in the final RDF datasets:
- Moving Image Archive: a ShEx schema has been generated for the main class schema:VideoObject.
shex:VideoObject
{
    rdf:type [schema:VideoObject] ;  # 100.0 %
    dc:identifier IRI ;  # 100.0 %
    schema:sourceOrganization IRI ;  # 100.0 %
    schema:identifier IRI ;  # 100.0 %
    schema:duration xsd:string ?;
        # 99.99514751552795 % obj: xsd:string. Cardinality: {1}
    schema:abstract xsd:string ?;
        # 99.85927795031056 % obj: xsd:string. Cardinality: {1}
    schema:name xsd:string ?;
        # 99.34006211180125 % obj: xsd:string. Cardinality: {1}
    dc:title xsd:string ?;
        # 99.34006211180125 % obj: xsd:string. Cardinality: {1}
    schema:videoQuality xsd:string ?
        # 98.3113354037267 % obj: xsd:string. Cardinality: {1}
}
- National Bibliography of Scotland & BOSLIT: ShEx schemas have been generated for the classes bibframe:Agent and bibframe:Work.
shex:Work
{
    rdf:type [bibframe:Work] ;  # 100.0 %
    bibframe:title xsd:string +;  # 100.0 %
        # 97.56637168141593 % obj: xsd:string. Cardinality: {1}
    bibframe:hasInstance IRI ?;
        # 87.61061946902655 % obj: IRI. Cardinality: {1}
    bibframe:contribution xsd:string *;
        # 86.72566371681415 % obj: xsd:string. Cardinality: +
    rdf:type [bibframe:Text] ?;
        # 81.85840707964603 % obj: bibframe:Text. Cardinality: {1}
    bibframe:adminMetadata xsd:string ?;
        # 81.85840707964603 % obj: xsd:string. Cardinality: {1}
    bibframe:content IRI ?;
        # 81.85840707964603 % obj: IRI. Cardinality: {1}
    bibframe:language IRI ?;
        # 81.85840707964603 % obj: IRI. Cardinality: {1}
    rdf:type [bibframe:Monograph] ?;
        # 81.63716814159292 % obj: bibframe:Monograph. Cardinality: {1}
    bibframe:subject IRI *;
        # 81.19469026548673 % obj: IRI. Cardinality: +
    bibframe:language xsd:string *;
        # 80.97345132743364 % obj: xsd:string. Cardinality: +
    bibframe:note xsd:string *
        # 80.75221238938053 % obj: xsd:string. Cardinality: +
}
The ShEx schemas have been automatically generated using the tool sheXer. The following example shows how to create the ShEx schemas for the classes bibframe:Agent and bibframe:Work, using 500 instances as input and an acceptance threshold of 0.8 (the minimum fraction of nodes that must conform to a constraint for it to be kept):
from shexer.shaper import Shaper

# Target classes whose instances will be used to extract the shapes
target_classes = [
    "http://id.loc.gov/ontologies/bibframe/Agent",
    "http://id.loc.gov/ontologies/bibframe/Work"
]

# Prefixes used in the resulting ShEx schema
namespaces_dict = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://example.org/": "ex",
    "http://weso.es/shapes/": "",
    "http://www.w3.org/2001/XMLSchema#": "xsd",
    "http://id.loc.gov/ontologies/bibframe/": "bibframe",
    "http://id.loc.gov/ontologies/bflc/": "bflc"
}

shaper = Shaper(target_classes=target_classes,
                url_endpoint="http://localhost:3330/rdf/sparql",  # local SPARQL endpoint
                limit_remote_instances=500,  # number of instances used as input
                namespaces_dict=namespaces_dict,  # default: no prefixes
                instantiation_property="http://www.w3.org/1999/02/22-rdf-syntax-ns#type")  # default: rdf:type

output_file = "shaper_bibframe.shex"
# keep only constraints met by at least 80% of the instances
shaper.shex_graph(output_file=output_file,
                  acceptance_threshold=0.8)
print("Done!")