Merge pull request #1 from digicademy/lido-to-cgif
Add LIDO-to-triple conversion
jonatansteller authored Oct 22, 2023
2 parents 609c344 + 746ee70 commit c374798
Showing 12 changed files with 380 additions and 63 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -14,3 +14,8 @@
- Add option to harvest from file dump
- Bring back option to compile CSV table from scraped data
- Implement URL composition feature for Beacon files

## 0.8.4

- Provide infrastructure for CGIF filters
- Add ability to read triples from LIDO files
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -35,5 +35,5 @@ keywords:
- Culture Graph Interchange Format
- LIDO
license: MIT
-version: 0.8.3
-date-released: '2023-10-08'
+version: 0.8.4
+date-released: '2023-10-22'
33 changes: 19 additions & 14 deletions README.md
@@ -29,7 +29,7 @@ This code is covered by the [MIT](https://opensource.org/license/MIT/) licence.
## Installation

To use this script, make sure your system has a working `python` as well as
-the packages `validators` and `rdflib`. Then clone this repository (e.g. `git
+the packages `validators`, `rdflib`, and `lxml`. Then clone this repository (e.g. `git
clone https://github.com/digicademy/hydra-scraper.git` or the SSH equivalent).
Open a terminal in the resulting folder to run the script as described below.

@@ -42,9 +42,11 @@ run the script without interaction.
- `-download '<string list>'`: comma-separated list of what you need, possible values:
- `lists`: all Hydra-paginated lists (requires `-source_url`)
- `list_triples`: all RDF triples in a Hydra API (requires `-source_url`)
- `list_cgif`: CGIF triples in a Hydra API (requires `-source_url`)
- `beacon`: Beacon file of all resources listed in an API (requires `-source_url`)
- `resources`: all resources of an API or Beacon (requires `-source_url`/`_file`)
- `resource_triples`: all RDF triples of resources (requires `-source_url`/`_file`/`_folder`)
- `resource_cgif`: CGIF triples of resources (requires `-source_url`/`_file`/`_folder`)
- `resource_table`: CSV table of data in resources (requires `-source_url`/`_file`/`_folder`)
- `-source_url '<url>'`: use this entry-point URL to scrape content (default: none)
- `-source_file '<path to file>'`: use the URLs in this Beacon file to scrape content (default: none)
@@ -57,6 +59,9 @@ run the script without interaction.
- `-resource_url_add '<string>'`: add this to the end of each resource URL (default: none)
- `-clean_resource_names '<string list>'`: build file names from resource URLs (default: enumeration)
- `-table_data '<string list>'`: comma-separated property URIs to compile in a table (default: all)
- `-supplement_data_feed '<url>'`: URI of a data feed to bind LIDO files to (default: none)
- `-supplement_data_catalog '<url>'`: URI of a data catalog the data feed belongs to (default: none)
- `-supplement_data_catalog_publisher '<url>'`: URI of the publisher of the catalog (default: none)

## Examples

@@ -76,19 +81,25 @@ python go.py -download 'resource_triples' -source_url 'https://nfdi4culture.de/r
Get **CGIF data** from an API entry point:

```
-python go.py -download 'list_triples' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'sample-cgif'
+python go.py -download 'list_cgif' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'sample-cgif'
```

Get **CGIF data from a Beacon** file:

```
-python go.py -download 'resource_triples' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif'
+python go.py -download 'resource_cgif' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif'
```

Get **CGIF data from a Beacon** file that lists LIDO files:

```
python go.py -download 'resource_cgif' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif' -supplement_data_feed 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -supplement_data_catalog 'https://corpusvitrearum.de' -supplement_data_catalog_publisher 'https://nfdi4culture.de/id/E1834'
```

Get **CGIF data from a file dump**:

```
-python go.py -download 'resource_triples' -source_folder 'downloads/sample-cgif' -content_type 'application/ld+json' -target_folder 'sample-cgif'
+python go.py -download 'resource_cgif' -source_folder 'downloads/sample-cgif' -content_type 'application/ld+json' -target_folder 'sample-cgif'
```

### Corpus Vitrearum Germany
@@ -114,7 +125,7 @@ python go.py -download 'lists,list_triples,beacon,resources,resource_triples' -s
All available **CGIF (JSON-LD)** data:

```
-python go.py -download 'lists,list_triples,beacon,resources,resource_triples' -source_url 'https://corpusvitrearum.de/id/about.cgif' -target_folder 'cvma-cgif' -resource_url_filter 'https://corpusvitrearum.de/id/F' -resource_url_add '/about.cgif' -clean_resource_names 'https://corpusvitrearum.de/id/,/about.cgif'
+python go.py -download 'lists,list_triples,list_cgif,beacon,resources,resource_triples,resource_cgif' -source_url 'https://corpusvitrearum.de/id/about.cgif' -target_folder 'cvma-cgif' -resource_url_filter 'https://corpusvitrearum.de/id/F' -resource_url_add '/about.cgif' -clean_resource_names 'https://corpusvitrearum.de/id/,/about.cgif'
```

All available **LIDO** data:
@@ -126,13 +137,7 @@ python go.py -download 'beacon,resources' -source_url 'https://corpusvitrearum.d
All available **embedded metadata**:

```
-python go.py -download 'lists,list_triples,beacon,resources,resource_triples' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'cvma-embedded' -clean_resource_names 'https://corpusvitrearum.de/id/'
+python go.py -download 'lists,list_triples,list_cgif,beacon,resources,resource_triples,resource_cgif' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'cvma-embedded' -clean_resource_names 'https://corpusvitrearum.de/id/'
```

**Table** of specific metadata:
@@ -170,8 +175,8 @@ Use GitHub to make the release. Use semantic versioning once the scraper has rea

- Enable checking `schema:dateModified` when collating paged results
- Implement a JSON return (including dateModified, number of resources, errors)
-- Add conversion routines, i.e. for LIDO to CGIF or for the RADAR version of DataCite/DataVerse to CGIF
-- Allow filtering triples for CGIF, align triples produced by lists and by resources, add any quality assurance that is needed
+- Add conversion routines, e.g. for the RADAR version of DataCite/DataVerse to CGIF
+- Add a filter for CGIF triples which aligns those produced by lists and by resources and could host further quality assurance
- Allow usage of OAI-PMH APIs to produce Beacon lists
- Re-add the interactive mode
- Properly package the script and use the system's download folder, and possibly enable pushing to a Git repo?
Binary file modified assets/workflows.png
12 changes: 6 additions & 6 deletions assets/workflows.svg
68 changes: 49 additions & 19 deletions classes/beacon.py
@@ -7,11 +7,12 @@


# Import libraries
-from rdflib import Graph
+from rdflib import Graph, Namespace
from time import sleep

# Import script modules
from helpers.config import *
from helpers.convert import convert_lido_to_cgif
from helpers.convert import convert_triples_to_table
from helpers.download import download_file
from helpers.download import retrieve_local_file
@@ -22,6 +23,8 @@
from helpers.fileio import save_table
from helpers.status import echo_progress

# Define namespaces
SCHEMA = Namespace('http://schema.org/')

# Base class for a beacon list to process
class Beacon:
@@ -31,15 +34,16 @@ class Beacon:
status = []
populated = None
triples = Graph()
triples.bind('schema', SCHEMA)
resources = []
resources_from_folder = False
content_type = ''
target_folder = ''
number_of_resources = 0
missing_resources = 0
missing_resources_list = []
-non_rdf_resources = 0
-non_rdf_resources_list = []
+incompatible_resources = 0
+incompatible_resources_list = []


def __init__(self, target_folder:str, content_type:str = '', resources:list = []):
@@ -72,7 +76,7 @@ def __str__(self):
return 'Processed list of individual resources'


-def populate(self, save_original_files:bool = True, clean_resource_urls:list = [], beacon_file:str = '', local_folder:str = ''):
+def populate(self, save_original_files:bool = True, clean_resource_urls:list = [], beacon_file:str = '', local_folder:str = '', supplement_data_feed:str = '', supplement_data_catalog:str = '', supplement_data_catalog_publisher:str = ''):
'''
Retrieves all individual resources from the list, populates the object, and optionally stores the original files in the process
@@ -81,6 +85,9 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
clean_resource_urls (list, optional): List of substrings to remove in the resource URLs to produce a resource's file name, defaults to empty list that enumerates resources
beacon_file (str, optional): Path to the beacon file to process, defaults to an empty string
local_folder (str, optional): Path to a local folder with an existing file dump to process, defaults to an empty string
supplement_data_feed (str, optional): URI of a data feed to bind LIDO files to (defaults to none)
supplement_data_catalog (str, optional): URI of a data catalog that the data feed belongs to (defaults to none)
supplement_data_catalog_publisher (str, optional): URI of the publisher of the data catalog (defaults to none)
'''

# Notify object that it is being populated
@@ -145,15 +152,24 @@
self.missing_resources_list.append(resource_url)
continue

-# Add triples to object storage
+# Add triples to object storage from RDF sources
if resource['file_type'] not in config['non_rdf_formats']:
try:
self.triples.parse(data=resource['content'], format=resource['file_type'])
except:
-self.non_rdf_resources += 1
-self.non_rdf_resources_list.append(resource_url)
+self.incompatible_resources += 1
+self.incompatible_resources_list.append(resource_url)
continue

# Add triples to object storage from LIDO sources
elif resource['file_type'] == 'lido':
lido_cgif = convert_lido_to_cgif(resource['content'], supplement_data_feed, supplement_data_catalog, supplement_data_catalog_publisher)
if lido_cgif is not None:
self.triples += lido_cgif
else:
self.incompatible_resources += 1
self.incompatible_resources_list.append(resource_url)

# Delay next retrieval to avoid a server block
echo_progress('Retrieving individual resources', number, self.number_of_resources)
if self.resources_from_folder == False:
@@ -163,16 +179,16 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
if self.missing_resources >= self.number_of_resources:
status_report['success'] = False
status_report['reason'] = 'All resources were missing.'
-elif self.missing_resources > 0 and self.non_rdf_resources > 0:
-status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
+elif self.missing_resources > 0 and self.incompatible_resources > 0:
+status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.incompatible_resources) + ' were not compatible.'
status_report['missing'] = self.missing_resources_list
-status_report['non_rdf'] = self.non_rdf_resources_list
+status_report['incompatible'] = self.incompatible_resources_list
elif self.missing_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing.'
status_report['missing'] = self.missing_resources_list
-elif self.non_rdf_resources > 0:
-status_report['reason'] = 'Resources retrieved, but ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
-status_report['non_rdf'] = self.non_rdf_resources_list
+elif self.incompatible_resources > 0:
+status_report['reason'] = 'Resources retrieved, but ' + str(self.incompatible_resources) + ' were not compatible.'
+status_report['incompatible'] = self.incompatible_resources_list

# Notify object that it is populated
self.populated = True
@@ -181,11 +197,12 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
self.status.append(status_report)

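The `convert_lido_to_cgif` helper that `populate` calls lives in `helpers/convert.py` and is not part of this diff. Purely as an illustration of the idea, and not the actual implementation, a minimal LIDO-to-triple conversion might look like the following sketch; it uses only the standard library rather than the `rdflib`/`lxml` stack the script imports, and the element paths, function name, and triple shapes are assumptions:

```python
# Hypothetical sketch of a LIDO-to-CGIF-style conversion; the real helper in
# helpers/convert.py is not shown in this commit and will differ in detail.
import xml.etree.ElementTree as ET

LIDO_NS = '{http://www.lido-schema.org}'

def lido_to_triples(lido_xml: str, data_feed: str) -> list[tuple[str, str, str]]:
    """Pull a record ID and title out of a LIDO record and emit
    schema.org-style triples binding the item to a data feed."""
    root = ET.fromstring(lido_xml)
    rec_id = root.findtext(f'.//{LIDO_NS}lidoRecID')
    title = root.findtext(f'.//{LIDO_NS}appellationValue')
    if rec_id is None:
        # No usable identifier: treat the record as incompatible,
        # mirroring how the scraper counts incompatible resources
        return []
    triples = [
        (rec_id, 'rdf:type', 'schema:CreativeWork'),
        (rec_id, 'schema:isPartOf', data_feed),
    ]
    if title is not None:
        triples.append((rec_id, 'schema:name', title))
    return triples
```

In the real helper the result would be an `rdflib.Graph` (so it can be merged into `self.triples` with `+=`), and the supplement catalog and publisher URIs would add further `schema:DataCatalog` statements.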

-def save_triples(self, file_name:str = 'resources'):
+def save_triples(self, triple_filter:str = 'none', file_name:str = 'resources'):
'''
Saves all downloaded triples into a single Turtle file
Parameters:
triple_filter (str, optional): Name of a filter (e.g. 'cgif') to apply to triples before saving them, defaults to 'none'
file_name (str, optional): Name of the triple file without a file extension, defaults to 'resources'
'''

@@ -200,24 +217,37 @@ def save_triples(self, file_name:str = 'resources'):
status_report['reason'] = 'A list of triples can only be written when the resources were read.'
else:

# Generate filter description to use in status updates
filter_description = ''
if triple_filter == 'cgif':
filter_description = 'CGIF-filtered '

# Optionally filter CGIF triples
if triple_filter == 'cgif':
# TODO Add CGIF filters here
filtered_triples = self.triples

# Initial progress
-echo_progress('Saving list of resource triples', 0, 100)
+echo_progress('Saving list of ' + filter_description + 'resource triples', 0, 100)

# Compile file if there are triples
if len(self.triples):
file_path = self.target_folder + '/' + file_name + '.ttl'
-self.triples.serialize(destination=file_path, format='turtle')
+if triple_filter == 'cgif':
+filtered_triples.serialize(destination=file_path, format='turtle')
+else:
+self.triples.serialize(destination=file_path, format='turtle')

# Compile success status
status_report['success'] = True
-status_report['reason'] = 'All resource triples listed in a Turtle file.'
+status_report['reason'] = 'All ' + filter_description + 'resource triples listed in a Turtle file.'

# Report if there are no resources
else:
-status_report['reason'] = 'No resource triples to list in a Turtle file.'
+status_report['reason'] = 'No ' + filter_description + 'resource triples to list in a Turtle file.'

# Final progress
-echo_progress('Saving list of resource triples', 100, 100)
+echo_progress('Saving list of ' + filter_description + 'resource triples', 100, 100)

# Provide final status
self.status.append(status_report)
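The `cgif` branch in `save_triples` is still a placeholder (`filtered_triples = self.triples`, marked with a TODO). One plausible shape for such a filter, offered purely as an illustrative sketch and not as the project's actual design, is to keep only statements whose predicate falls in the schema.org namespace that CGIF builds on; plain tuples stand in here for `rdflib` triples:

```python
# Illustrative sketch of a CGIF triple filter; the filter the TODO refers to
# is not implemented in this commit, so names and criteria are assumptions.
SCHEMA = 'http://schema.org/'

def filter_cgif(triples):
    """Keep only triples whose predicate is in the schema.org namespace."""
    return [(s, p, o) for (s, p, o) in triples if p.startswith(SCHEMA)]
```

With `rdflib`, the same idea would iterate over the graph and copy matching triples into a fresh `Graph`, which is where the alignment and quality-assurance steps mentioned in the roadmap could also live.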
