Merge pull request #1 from digicademy/lido-to-cgif
Add LIDO-to-triple conversion
jonatansteller authored Oct 22, 2023
2 parents 609c344 + 746ee70 commit c374798
Showing 12 changed files with 380 additions and 63 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -14,3 +14,8 @@
- Add option to harvest from file dump
- Bring back option to compile CSV table from scraped data
- Implement URL composition feature for Beacon files

## 0.8.4

- Provide infrastructure for CGIF filters
- Add ability to read triples from LIDO files
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -35,5 +35,5 @@ keywords:
- Culture Graph Interchange Format
- LIDO
license: MIT
-version: 0.8.3
-date-released: '2023-10-08'
+version: 0.8.4
+date-released: '2023-10-22'
33 changes: 19 additions & 14 deletions README.md
@@ -29,7 +29,7 @@ This code is covered by the [MIT](https://opensource.org/license/MIT/) licence.
## Installation

To use this script, make sure your system has a working `python` as well as
-the packages `validators` and `rdflib`. Then clone this repository (e.g. `git
+the packages `validators`, `rdflib`, and `lxml`. Then clone this repository (e.g. `git
clone https://github.com/digicademy/hydra-scraper.git` or the SSH equivalent).
Open a terminal in the resulting folder to run the script as described below.

@@ -42,9 +42,11 @@ run the script without interaction.
- `-download '<string list>'`: comma-separated list of what you need, possible values:
- `lists`: all Hydra-paginated lists (requires `-source_url`)
- `list_triples`: all RDF triples in a Hydra API (requires `-source_url`)
- `list_cgif`: CGIF triples in a Hydra API (requires `-source_url`)
- `beacon`: Beacon file of all resources listed in an API (requires `-source_url`)
- `resources`: all resources of an API or Beacon (requires `-source_url`/`_file`)
- `resource_triples`: all RDF triples of resources (requires `-source_url`/`_file`/`_folder`)
- `resource_cgif`: CGIF triples of resources (requires `-source_url`/`_file`/`_folder`)
- `resource_table`: CSV table of data in resources (requires `-source_url`/`_file`/`_folder`)
- `-source_url '<url>'`: use this entry-point URL to scrape content (default: none)
- `-source_file '<path to file>'`: use the URLs in this Beacon file to scrape content (default: none)
@@ -57,6 +59,9 @@ run the script without interaction.
- `-resource_url_add '<string>'`: add this to the end of each resource URL (default: none)
- `-clean_resource_names '<string list>'`: build file names from resource URLs (default: enumeration)
- `-table_data '<string list>'`: comma-separated property URIs to compile in a table (default: all)
- `-supplement_data_feed '<url>'`: URI of a data feed to bind LIDO files to (default: none)
- `-supplement_data_catalog '<url>'`: URI of a data catalog the data feed belongs to (default: none)
- `-supplement_data_catalog_publisher '<url>'`: URI of the publisher of the catalog (default: none)

## Examples

@@ -76,19 +81,25 @@ python go.py -download 'resource_triples' -source_url 'https://nfdi4culture.de/r
Get **CGIF data** from an API entry point:

```
-python go.py -download 'list_triples' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'sample-cgif'
+python go.py -download 'list_cgif' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'sample-cgif'
```

Get **CGIF data from a Beacon** file:

```
-python go.py -download 'resource_triples' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif'
+python go.py -download 'resource_cgif' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif'
```

Get **CGIF data from a Beacon** file that lists LIDO files:

```
python go.py -download 'resource_cgif' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif' -supplement_data_feed 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -supplement_data_catalog 'https://corpusvitrearum.de' -supplement_data_catalog_publisher 'https://nfdi4culture.de/id/E1834'
```

Get **CGIF data from a file dump**:

```
-python go.py -download 'resource_triples' -source_folder 'downloads/sample-cgif' -content_type 'application/ld+json' -target_folder 'sample-cgif'
+python go.py -download 'resource_cgif' -source_folder 'downloads/sample-cgif' -content_type 'application/ld+json' -target_folder 'sample-cgif'
```

### Corpus Vitrearum Germany
@@ -114,7 +125,7 @@ python go.py -download 'lists,list_triples,beacon,resources,resource_triples' -s
All available **CGIF (JSON-LD)** data:

```
-python go.py -download 'lists,list_triples,beacon,resources,resource_triples' -source_url 'https://corpusvitrearum.de/id/about.cgif' -target_folder 'cvma-cgif' -resource_url_filter 'https://corpusvitrearum.de/id/F' -resource_url_add '/about.cgif' -clean_resource_names 'https://corpusvitrearum.de/id/,/about.cgif'
+python go.py -download 'lists,list_triples,list_cgif,beacon,resources,resource_triples,resource_cgif' -source_url 'https://corpusvitrearum.de/id/about.cgif' -target_folder 'cvma-cgif' -resource_url_filter 'https://corpusvitrearum.de/id/F' -resource_url_add '/about.cgif' -clean_resource_names 'https://corpusvitrearum.de/id/,/about.cgif'
```

All available **LIDO** data:
@@ -126,13 +137,7 @@ python go.py -download 'beacon,resources' -source_url 'https://corpusvitrearum.d
All available **embedded metadata**:

```
-python go.py -download 'lists,list_triples,beacon,resources,resource_triples' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'cvma-embedded' -clean_resource_names 'https://corpusvitrearum.de/id/'
+python go.py -download 'lists,list_triples,list_cgif,beacon,resources,resource_triples,resource_cgif' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'cvma-embedded' -clean_resource_names 'https://corpusvitrearum.de/id/'
```

**Table** of specific metadata:
@@ -170,8 +175,8 @@ Use GitHub to make the release. Use semantic versioning once the scraper has rea

- Enable checking `schema:dateModified` when collating paged results
- Implement a JSON return (including dateModified, number of resources, errors)
-- Add conversion routines, i.e. for LIDO to CGIF or for the RADAR version of DataCite/DataVerse to CGIF
-- Allow filtering triples for CGIF, align triples produced by lists and by resources, add any quality assurance that is needed
+- Add conversion routines, e.g. for the RADAR version of DataCite/DataVerse to CGIF
+- Add a filter for CGIF triples which aligns those produced by lists and by resources and could host further quality assurance
- Allow usage of OAI-PMH APIs to produce Beacon lists
- Re-add the interactive mode
- Properly package the script and use the system's download folder, and possibly enable pushing to a Git repo?
Binary file modified assets/workflows.png
12 changes: 6 additions & 6 deletions assets/workflows.svg
68 changes: 49 additions & 19 deletions classes/beacon.py
@@ -7,11 +7,12 @@


# Import libraries
-from rdflib import Graph
+from rdflib import Graph, Namespace
from time import sleep

# Import script modules
from helpers.config import *
from helpers.convert import convert_lido_to_cgif
from helpers.convert import convert_triples_to_table
from helpers.download import download_file
from helpers.download import retrieve_local_file
@@ -22,6 +23,8 @@
from helpers.fileio import save_table
from helpers.status import echo_progress

# Define namespaces
SCHEMA = Namespace('http://schema.org/')

# Base class for a beacon list to process
class Beacon:
@@ -31,15 +34,16 @@ class Beacon:
status = []
populated = None
triples = Graph()
triples.bind('schema', SCHEMA)
resources = []
resources_from_folder = False
content_type = ''
target_folder = ''
number_of_resources = 0
missing_resources = 0
missing_resources_list = []
-non_rdf_resources = 0
-non_rdf_resources_list = []
+incompatible_resources = 0
+incompatible_resources_list = []


def __init__(self, target_folder:str, content_type:str = '', resources:list = []):
@@ -72,7 +76,7 @@ def __str__(self):
return 'Processed list of individual resources'


-def populate(self, save_original_files:bool = True, clean_resource_urls:list = [], beacon_file:str = '', local_folder:str = ''):
+def populate(self, save_original_files:bool = True, clean_resource_urls:list = [], beacon_file:str = '', local_folder:str = '', supplement_data_feed:str = '', supplement_data_catalog:str = '', supplement_data_catalog_publisher:str = ''):
'''
Retrieves all individual resources from the list, populates the object, and optionally stores the original files in the process
@@ -81,6 +85,9 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
clean_resource_urls (list, optional): List of substrings to remove in the resource URLs to produce a resource's file name, defaults to empty list that enumerates resources
beacon_file (str, optional): Path to the beacon file to process, defaults to an empty string
local_folder (str, optional): Path to a local folder with an existing file dump to process, defaults to an empty string
supplement_data_feed (str, optional): URI of a data feed to bind LIDO files to (defaults to none)
supplement_data_catalog (str, optional): URI of a data catalog that the data feed belongs to (defaults to none)
supplement_data_catalog_publisher (str, optional): URI of the publisher of the data catalog (defaults to none)
'''

# Notify object that it is being populated
@@ -145,15 +152,24 @@
self.missing_resources_list.append(resource_url)
continue

-# Add triples to object storage
+# Add triples to object storage from RDF sources
if resource['file_type'] not in config['non_rdf_formats']:
try:
self.triples.parse(data=resource['content'], format=resource['file_type'])
except:
-self.non_rdf_resources += 1
-self.non_rdf_resources_list.append(resource_url)
+self.incompatible_resources += 1
+self.incompatible_resources_list.append(resource_url)
continue

# Add triples to object storage from LIDO sources
elif resource['file_type'] == 'lido':
lido_cgif = convert_lido_to_cgif(resource['content'], supplement_data_feed, supplement_data_catalog, supplement_data_catalog_publisher)
if lido_cgif is not None:
self.triples += lido_cgif
else:
self.incompatible_resources += 1
self.incompatible_resources_list.append(resource_url)

# Delay next retrieval to avoid a server block
echo_progress('Retrieving individual resources', number, self.number_of_resources)
if self.resources_from_folder == False:
@@ -163,16 +179,16 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
if self.missing_resources >= self.number_of_resources:
status_report['success'] = False
status_report['reason'] = 'All resources were missing.'
-elif self.missing_resources > 0 and self.non_rdf_resources > 0:
-status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
+elif self.missing_resources > 0 and self.incompatible_resources > 0:
+status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.incompatible_resources) + ' were not compatible.'
status_report['missing'] = self.missing_resources_list
-status_report['non_rdf'] = self.non_rdf_resources_list
+status_report['incompatible'] = self.incompatible_resources_list
elif self.missing_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing.'
status_report['missing'] = self.missing_resources_list
-elif self.non_rdf_resources > 0:
-status_report['reason'] = 'Resources retrieved, but ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
-status_report['non_rdf'] = self.non_rdf_resources_list
+elif self.incompatible_resources > 0:
+status_report['reason'] = 'Resources retrieved, but ' + str(self.incompatible_resources) + ' were not compatible.'
+status_report['incompatible'] = self.incompatible_resources_list

# Notify object that it is populated
self.populated = True
@@ -181,11 +197,12 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
self.status.append(status_report)

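The `convert_lido_to_cgif` helper that `populate` calls lives in `helpers/convert.py` and is not part of this diff. Purely as an illustration of the idea, and not the actual implementation, a minimal LIDO-to-triple conversion might look like the following sketch; it uses only the standard library rather than the `rdflib`/`lxml` stack the script imports, and the element paths, function name, and triple shapes are assumptions:

```python
# Hypothetical sketch of a LIDO-to-CGIF-style conversion; the real helper in
# helpers/convert.py is not shown in this commit and will differ in detail.
import xml.etree.ElementTree as ET

LIDO_NS = '{http://www.lido-schema.org}'

def lido_to_triples(lido_xml: str, data_feed: str) -> list[tuple[str, str, str]]:
    """Pull a record ID and title out of a LIDO record and emit
    schema.org-style triples binding the item to a data feed."""
    root = ET.fromstring(lido_xml)
    rec_id = root.findtext(f'.//{LIDO_NS}lidoRecID')
    title = root.findtext(f'.//{LIDO_NS}appellationValue')
    if rec_id is None:
        # No usable identifier: treat the record as incompatible,
        # mirroring how the scraper counts incompatible resources
        return []
    triples = [
        (rec_id, 'rdf:type', 'schema:CreativeWork'),
        (rec_id, 'schema:isPartOf', data_feed),
    ]
    if title is not None:
        triples.append((rec_id, 'schema:name', title))
    return triples
```

In the real helper the result would be an `rdflib.Graph` (so it can be merged into `self.triples` with `+=`), and the supplement catalog and publisher URIs would add further `schema:DataCatalog` statements.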

-def save_triples(self, file_name:str = 'resources'):
+def save_triples(self, triple_filter:str = 'none', file_name:str = 'resources'):
'''
Saves all downloaded triples into a single Turtle file
Parameters:
triple_filter (str, optional): Name of a filter (e.g. 'cgif') to apply to triples before saving them, defaults to 'none'
file_name (str, optional): Name of the triple file without a file extension, defaults to 'resources'
'''

@@ -200,24 +217,37 @@ def save_triples(self, file_name:str = 'resources'):
status_report['reason'] = 'A list of triples can only be written when the resources were read.'
else:

# Generate filter description to use in status updates
filter_description = ''
if triple_filter == 'cgif':
filter_description = 'CGIF-filtered '

# Optionally filter CGIF triples
if triple_filter == 'cgif':
# TODO Add CGIF filters here
filtered_triples = self.triples

# Initial progress
-echo_progress('Saving list of resource triples', 0, 100)
+echo_progress('Saving list of ' + filter_description + 'resource triples', 0, 100)

# Compile file if there are triples
if len(self.triples):
file_path = self.target_folder + '/' + file_name + '.ttl'
-self.triples.serialize(destination=file_path, format='turtle')
+if triple_filter == 'cgif':
+filtered_triples.serialize(destination=file_path, format='turtle')
+else:
+self.triples.serialize(destination=file_path, format='turtle')

# Compile success status
status_report['success'] = True
-status_report['reason'] = 'All resource triples listed in a Turtle file.'
+status_report['reason'] = 'All ' + filter_description + 'resource triples listed in a Turtle file.'

# Report if there are no resources
else:
-status_report['reason'] = 'No resource triples to list in a Turtle file.'
+status_report['reason'] = 'No ' + filter_description + 'resource triples to list in a Turtle file.'

# Final progress
-echo_progress('Saving list of resource triples', 100, 100)
+echo_progress('Saving list of ' + filter_description + 'resource triples', 100, 100)

# Provide final status
self.status.append(status_report)
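The `cgif` branch in `save_triples` is still a placeholder (`filtered_triples = self.triples`, marked with a TODO). One plausible shape for such a filter, offered purely as an illustrative sketch and not as the project's actual design, is to keep only statements whose predicate falls in the schema.org namespace that CGIF builds on; plain tuples stand in here for `rdflib` triples:

```python
# Illustrative sketch of a CGIF triple filter; the filter the TODO refers to
# is not implemented in this commit, so names and criteria are assumptions.
SCHEMA = 'http://schema.org/'

def filter_cgif(triples):
    """Keep only triples whose predicate is in the schema.org namespace."""
    return [(s, p, o) for (s, p, o) in triples if p.startswith(SCHEMA)]
```

With `rdflib`, the same idea would iterate over the graph and copy matching triples into a fresh `Graph`, which is where the alignment and quality-assurance steps mentioned in the roadmap could also live.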
