Skip to content

Transform a Google Sheet URL into a fully executed Islandora Workbench task

Notifications You must be signed in to change notification settings

lehigh-university-libraries/fabricator

Repository files navigation

fabricator

Transform a Google Sheet URL into a fully executed Islandora Workbench task.

Overview

Content creators can work in Google Sheets to prepare a spreadsheet for bulk ingest. The Google Sheet can be structured in a manor that makes populating the content much more user friendly than the format a workbench CSV may require. e.g. some of Lehigh's fields need populated by workbench using JSON. Instead of asking a person to hand craft JSON, we can just create multiple columns that each contain a value that populates a given JSON field, and we can automate aggregating those columns and representing them in the correct JSON format

A Google Appscript is embeded in the sheet to allow easily checking the data in the spreadsheet is valid

When the spreadsheet is ready to have the content added to the repository, the metadata in the spreadsheet can be ingested into Islandora/Drupal via Islandora Workbench by supplying the sheet URL in the GitHub Action. Write access on this repo is required to execute the workflow

The GitHub Action is executed on a self-hosted runner within Lehigh's infrastructure. This allows uploading files directly from the same file server Lehigh staff use. This also allows ensuring the files referenced in the Google Sheet exist before executing the workbench job.

sequenceDiagram
    actor Alice
    participant Google Sheets
    participant Fabricator
    participant GitHub
    participant Slack
    Alice->>Google Sheets: Edit 1
    Alice->>Google Sheets: Edit 2
    Alice->>Google Sheets: Edit ...
    Alice->>Google Sheets: Edit N
    Alice->>Google Sheets: Click check my work
    Google Sheets->>Fabricator: Check this CSV
    Fabricator->>Alice: Looks good 🚀
    Alice->>GitHub: Run workbench workflow
    GitHub->>Self-hosted Runner: Run workbench workflow
    Self-hosted Runner->>Slack: Workbench job started
    Slack->>Alice: Message notification
    Self-hosted Runner->>Islandora Workbench: python3 workbench
    Islandora Workbench->>Drupal: entity CUD
    Islandora Workbench->>GitHub: logs streamed to GitHub Action UI
    Alice->>GitHub: Clicks Slack link to view GitHub Action logs while job runs
    Self-hosted Runner->>Slack: ✅ Workbench job succeeded!
Loading

Technical details

This is an http service with two routes:

  • /workbench/check
    • check if a google sheet content is well formed
  • /workbench/transform
    • transform a google sheet CSV export into a workbench CSV

Start the server

git clone https://github.com/lehigh-university-libraries/fabricator
docker build -t fabricator:main
docker run --rm -d -p 8080:8080 fabricator:main

Ensure a google sheet CSV has no bad data

The /workbench/check route returns a JSON map keyed by the Google Sheet column/row of a cell and the error that cell contains. If the map is empty, there are no errors.

The route requires the CSV to be uploaded as JSON. This was done since the Google Sheets Appscript does not have a convenient SDK to convert a Google Sheet into a CSV. Instead, the sheet is parsed cell by cell and stored as a JSON map. You can see in the tests how the JSON is structured.

There's also an example script on how to download a Google Sheet into the JSON format and also CSV format.

Example: no errors

$ curl -s \
  -H "X-Secret: $SHARED_SECRET" \
  -XPOST \
  --upload-file csv.json \
  http://localhost:8080/workbench/check
{}

Example: Row 12, Column A has a required field that is blank

$ curl -s \
  -H "X-Secret: $SHARED_SECRET" \
  -XPOST \
  --upload-file csv.json \
  http://localhost:8080/workbench/check
{"A12": "Missing value"}

Get a workbench CSV from a google sheet CSV

The /workbench/transform route transforms a Google Sheet CSV into a Workbench CSV. The route returns a ZIP of CSVs. There are three possible flavors of CSVs that can be returned:

  • target.csv - used to run a workbench create task
    • this is the most common pattern used at Lehigh. This creates metadata and media/files for new content being added to the repository
  • target.agent.csv - used to run a workbench create_terms task
    • this is returned when linked agents have additional metadata beyond just their name (e.g. email/institution/ORCiD)
    • this allows creating those linked agents with the proper metadata and associating that term with the given record
  • target.update.csv - used to run a workbench updaye task
    • this is returned when the Google Sheet contains node IDs in the sheet, signifying the job should be updating metadata for existing nodes
$ curl -s \
  -H "X-Secret: $SHARED_SECRET" \
  -XPOST \
  -o target.zip \
  --upload-file source.csv \
  http://localhost:8080/workbench/transform
$ unzip target.zip

Adding new columns to the ingest template

If the ingest template needs a new column added, these are the code changes that are needed

  • Add the column to the ingest template, making row one the human-friendly label
  • Make the necessary changes to go-islandora
    • Add the column label and machine name to the sheets slice in go-islandora
    • generate the openapi schema and structs go build && ./go-islandora generate sheets-structs --output=workbench.yaml
  • Make the necessary changes in this repo
    • Update go.mod go get -u github.com/lehigh-university-libraries/go-islandora@main
    • Add any necessary checks and tests
  • Deploy the new image to the staging server
isle-stage
cd /opt/islandora/d10_lehigh_agile
sudo docker compose --profile prod pull
sudo systemctl restart islandora

TODO: This should eventually be able to be automatted, and the ingest template is simply generated by this repo (Issue #23).