Harvesting 2.0

Data.gov is embarking on the creation of a new harvesting pipeline for government data. This system will take into account where we are today: it will support the harvesting types and metadata schemas already in use, while also planning for future updates, changes, and better systems. The features provided in this system will be driven by requests and compliance requirements, rather than the current "whatever is available in the CKAN community" approach.

Language

We define the following terms here for common understanding:

  • record: a metadata file/set that describes a "dataset"
  • catalog: a collection of records that are published together from a department/bureau/agency/program, in order to be available to the public
  • federal catalog: the data.gov collection of metadata records across the federal government, made up of all of the record/dataset catalogs described above.
  • ETL: Extract, Transform, Load. While ETL is the framework we are using, the pipeline needs a few additional steps (such as compare and validate). For simplicity we often say ETL when we really mean ECVTVL, or some combination depending on the process.
  • source state: Since harvest sources are provided "as is", with no change log, we need to download the complete list "as is" before we can compare with previous version(s). The "source state" is the data.gov copy of the harvest source, whether that's WAF, JSON, or something else.
  • sourceId: UUID generated by the controller on creation of a new harvest source
  • jobId: UUID generated by the controller when a new job is initiated
  • recordId: UUID generated by the extract service to track the status of a record within the pipeline
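
As a minimal sketch of how these identifiers relate (class and field names are illustrative assumptions, not the production schema):

```python
# Illustrative sketch only: shows how sourceId, jobId, and recordId
# relate; not the production data model.
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HarvestSource:
    url: str
    source_type: str  # e.g. "dcatus" or "waf"
    sourceId: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class HarvestJob:
    sourceId: str
    jobId: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class HarvestRecord:
    jobId: str
    identifier: Optional[str]  # metadata identifier, if the record has one
    recordId: str = field(default_factory=lambda: str(uuid.uuid4()))
```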

Technologies

  • Airflow - orchestration framework
  • Postgres - object-relational database
  • Redis - in-memory keystore

Diagrams

Resources

Roadmap

The following are the features of the Harvesting 2.0 epic. Practically, each feature has its own issue label, prefixed with H2.0/ to mark it as part of the epic.

The controller will keep the state and status of the harvester, and provide an interface into the database. This includes starting harvest jobs, tracking changes, evaluating whether jobs are fully completed, making sure double jobs aren't run, etc.
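
A sketch of the "no double jobs" guard, assuming a harvest_job table with job_id, source_id, and status columns (all names here are illustrative):

```python
# Sketch of the controller refusing to start a second job for a source
# that already has one in flight. Table and column names are
# illustrative; `conn` is a DB-API connection (e.g. psycopg2) to the
# harvest database.
import uuid


def start_job(conn, source_id: str) -> str:
    """Create a new harvest job unless one is already pending/running."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT job_id FROM harvest_job "
            "WHERE source_id = %s AND status IN ('pending', 'running')",
            (source_id,),
        )
        if cur.fetchone():
            raise RuntimeError(f"source {source_id} already has an active job")

        job_id = str(uuid.uuid4())
        cur.execute(
            "INSERT INTO harvest_job (job_id, source_id, status) "
            "VALUES (%s, %s, 'pending')",
            (job_id, source_id),
        )
    return job_id
```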

The extract process is essentially trying to save the "state" of a harvest source at a point in time. It gathers all records (whether from a DCAT-US catalog, WAF, API, etc.), breaks them into pieces, and stores them in S3. It will have to handle GUID creation and tracking (as without validation, we can't rely on metadata fields yet).

As this process may take over 15 minutes, this isn't a simple "call and response" request from the controller to the action. It must be a "please start" request, and the extract process will need to eventually "call back" with the S3 bucket once fully complete. We will also consider utilizing recursion in the WAF extraction type to speed up the process, but source systems may vary in their ability to handle a large number of requests.
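
A sketch of that flow for a DCAT-US source, assuming the controller passes in a bucket, jobId, and callback URL (all names here are illustrative, not the final interface):

```python
# Sketch of the extract step for a DCAT-US catalog source: fetch the
# catalog, split it into individual records, write each to S3, then
# "call back" to the controller with the bucket/prefix.
import json
import uuid

import boto3
import requests


def extract_dcatus(catalog_url: str, bucket: str, job_id: str, callback_url: str):
    s3 = boto3.client("s3")
    catalog = requests.get(catalog_url, timeout=60).json()

    for dataset in catalog.get("dataset", []):
        record_id = str(uuid.uuid4())  # recordId tracked through the pipeline
        s3.put_object(
            Bucket=bucket,
            Key=f"{job_id}/{record_id}.json",
            Body=json.dumps(dataset).encode("utf-8"),
        )

    # Only once the source state is fully captured does the controller hear back.
    requests.post(
        callback_url,
        json={"jobId": job_id, "bucket": bucket, "prefix": f"{job_id}/"},
        timeout=30,
    )
```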

There may be optimizations we can make around curl and timestamps to note/skip items that have not changed since the last harvest, treating them as a basic no-op.
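
One such optimization is a conditional request keyed on the source's Last-Modified timestamp, the requests equivalent of curl with If-Modified-Since; the stored timestamp below is assumed to come from our job history:

```python
# Sketch of a timestamp-based no-op: ask the source whether a file has
# changed since the last harvest before downloading it again.
import requests


def fetch_if_changed(url: str, last_harvest_http_date: str):
    resp = requests.get(
        url,
        headers={"If-Modified-Since": last_harvest_http_date},
        timeout=60,
    )
    if resp.status_code == 304:  # not modified since the last harvest
        return None  # basic no-op
    resp.raise_for_status()
    return resp.content
```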

The comparison determines, given the source state from the extraction, what needs to change in our production system. It will need to find out which items need to be removed from our catalog (delete) because they are no longer in the source, which items need to be added to our catalog (create), and which items need to be updated. For creation/deletion, we will use the metadata unique identifier where possible (DCAT-US and ISO 19115). Where that's not possible, we will build a hash/key from certain metadata fields to create a unique value that can be compared against the items in the data.gov catalog.
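
A sketch of that comparison, assuming we can produce two mappings of identifier (or fallback hash/key) to content hash, one for the source state and one for the current catalog:

```python
# Sketch of the create/update/delete decision as plain set operations
# over identifiers; both inputs map identifier -> content hash.
def compare(source_state: dict[str, str], catalog_state: dict[str, str]):
    source_ids = set(source_state)
    catalog_ids = set(catalog_state)

    to_create = source_ids - catalog_ids          # in the source, not in our catalog
    to_delete = catalog_ids - source_ids          # in our catalog, gone from the source
    to_update = {
        i for i in source_ids & catalog_ids
        if source_state[i] != catalog_state[i]    # present in both, content changed
    }
    return to_create, to_update, to_delete
```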

Noting which items are to be updated will be the most difficult. While the industry standard is to create a hash of the metadata and compare against previous hashes, we've found that some source catalogs/systems generate their metadata in an inconsistent order, causing unnecessary updates. In order to be efficient, we'll want to re-arrange/order the metadata so that we can rely on a hash comparison.
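
A sketch of an order-insensitive hash: serialize the record with sorted keys before hashing, so equivalent records emitted in a different key order produce the same hash:

```python
# Sketch of an order-insensitive hash for JSON metadata: canonicalize
# (sorted keys, fixed separators) before hashing so re-ordered but
# equivalent records do not register as updates.
import hashlib
import json


def record_hash(record: dict) -> str:
    # Note: this only normalizes object key order; if a source also
    # re-orders arrays, further normalization would be needed.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```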

The validation will check whether a given metadata record is valid under a given schema. This is a stateless job. Depending on the transformation and base metadata type to be utilized, we may implement "looser" or stricter checks. We will utilize official tools and 3rd party tools where possible, as we don't own most of the metadata schemas.
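
A minimal sketch of such a check for JSON-based schemas (e.g. DCAT-US) using the jsonschema library; the schema is assumed to already be loaded as a dict:

```python
# Sketch of stateless validation of a record against a JSON Schema.
from jsonschema import validators


def validate_record(record: dict, schema: dict) -> list[str]:
    validator_cls = validators.validator_for(schema)  # picks the draft from $schema
    # An empty list means the record is valid under the given schema.
    return [err.message for err in validator_cls(schema).iter_errors(record)]
```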

It's possible we don't need validation of other metadata standards at all, and can instead let the transformation throw errors.

The transformation will involve converting a record from one metadata schema into another. This is a stateless job. We are considering offloading this to mdTranslator, especially with respect to geospatial metadata.
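
As a purely illustrative sketch (not mdTranslator's API), a stateless transform is just a pure function from one schema's fields to another's; the mapping below is a toy DCAT-US-to-CKAN example, not a complete crosswalk:

```python
# Toy example of a stateless schema transform; the field mapping is
# illustrative only and far from a complete DCAT-US -> CKAN crosswalk.
def dcatus_to_ckan(record: dict) -> dict:
    return {
        "title": record.get("title"),
        "notes": record.get("description"),
        "tags": [{"name": kw} for kw in record.get("keyword", [])],
        "extras": [{"key": "identifier", "value": record.get("identifier")}],
    }
```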

The load will involve loading the record into the remote catalog. In the future this may be an as-yet undetermined system; the immediate need is to push this to the CKAN catalog via the API.
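
A minimal sketch of that push using CKAN's package_create action (the catalog URL and token are placeholders, and the package dict is assumed to already be in CKAN's shape):

```python
# Sketch of the load step: push a CKAN-shaped package dict to the
# catalog via CKAN's package_create action.
import requests

CKAN_URL = "https://catalog.example.gov"  # placeholder
API_TOKEN = "..."  # supplied via environment/secret in practice


def load_record(package: dict) -> dict:
    resp = requests.post(
        f"{CKAN_URL}/api/3/action/package_create",
        json=package,
        headers={"Authorization": API_TOKEN},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["result"]
```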

This piece covers the cloud setup for the features described above. While we lean towards cloud.gov at the time of this writing, the actual location is subject to change. This may involve DB setup, Redis, app deployment, and more.

  • Infrastructure Diagram
  • Define the connection points and infrastructure as code locations
  • Implement infrastructure in dev
  • Evaluate scaling capabilities

This feature is the prep, QA, and cutover to the Harvesting 2.0 system. There will be pieces that need to be tested and validated, as well as manual spot checks to confirm that we are ready to move over.

  • Python script to pull catalog harvest sources (CKAN) and add/create/update them in Harvest 2.0 via the API (see the sketch after this list)
  • Run every harvest to completion, and examine all failures for bugs
  • Announce to data providers the intention of cutover, impact, and review expectations
  • Turn off current harvester
  • Cutover
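
A sketch of the migration script from the first bullet above; the Harvest 2.0 route and payload shape are assumptions, and the field names read from CKAN may differ by harvest source type:

```python
# Sketch of the cutover helper: read harvest sources from the current
# CKAN catalog and create them in Harvesting 2.0 via its API. URLs,
# the H2.0 route, and the payload shape are illustrative assumptions.
import requests

CKAN_URL = "https://catalog.example.gov"       # placeholder
H20_API = "https://harvester.example.gov/api"  # placeholder


def migrate_harvest_sources():
    resp = requests.get(
        f"{CKAN_URL}/api/3/action/package_search",
        params={"fq": "dataset_type:harvest", "rows": 1000},
        timeout=60,
    )
    resp.raise_for_status()
    for src in resp.json()["result"]["results"]:
        requests.post(
            f"{H20_API}/harvest_sources",
            json={
                "title": src.get("title"),
                "url": src.get("url"),
                "source_type": src.get("source_type"),
            },
            timeout=30,
        ).raise_for_status()
```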

With the choice to use Airflow as our orchestrator, we are able to leverage its robust REST API for querying the state of the ETL pipeline.
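
For example, the run states of a harvest DAG can be read from Airflow's stable REST API (the host, credentials, and DAG id below are placeholders):

```python
# Sketch of querying pipeline state through Airflow's stable REST API.
import requests

AIRFLOW_URL = "https://airflow.example.gov"  # placeholder


def dag_run_states(dag_id: str):
    resp = requests.get(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        auth=("user", "password"),  # or a token, depending on the deployment
        timeout=30,
    )
    resp.raise_for_status()
    return [(run["dag_run_id"], run["state"]) for run in resp.json()["dag_runs"]]
```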

The source config for harvest sources (Harvest Source DB) will be managed elsewhere. Initially, this will just be a Postgres DB that exists in cloud.gov and is connected to via a GUI or the command line. Eventually, this will develop into a full-fledged Harvester UI. Tickets for that will be managed in the Harvest UI section.

The API is the public interface to the harvesting system. It will have read routes for the harvest sources, jobs (with warnings/errors), and individual records. It will have the ability to "request" a harvest job start. With authentication, it will allow the following: creating a harvest source, updating a harvest source, deleting a harvest source, stopping a job, starting a job, and maybe a full pipeline detail view (how many jobs are in the queue, etc.). A minimal route sketch follows the list below.

  • Develop Read for harvest job
  • Develop Start for harvest job
  • Develop Stop (delete) for harvest job (auth)
  • Develop Create Request for harvest job
    • publicly available, cannot run 2 in one day
  • Develop read pipeline queue state
  • Develop Read for harvest source
  • Develop Create for harvest source
  • Develop Update for harvest source
  • Develop Delete for harvest source
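
A minimal sketch of a few of these routes using Flask; route names, payloads, and auth handling are illustrative assumptions, and in-memory dicts stand in for the Postgres tables:

```python
# Illustrative sketch of a few API routes; not the final interface.
import uuid

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# In-memory stand-ins for the Postgres-backed tables; illustrative only.
SOURCES: dict[str, dict] = {}
JOBS: dict[str, dict] = {}


@app.get("/harvest_sources/<source_id>")
def read_harvest_source(source_id):
    source = SOURCES.get(source_id)
    if source is None:
        abort(404)
    return jsonify(source)


@app.post("/harvest_jobs")
def request_harvest_job():
    # Publicly available "request a harvest" route; the one-per-day
    # limit would be enforced by the controller before queueing.
    payload = request.get_json()
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"sourceId": payload["sourceId"], "status": "requested"}
    return jsonify({"jobId": job_id}), 201


@app.delete("/harvest_jobs/<job_id>")
def stop_harvest_job(job_id):
    # Auth check omitted in this sketch; stopping a job requires it.
    if job_id not in JOBS:
        abort(404)
    JOBS[job_id]["status"] = "stopped"
    return "", 204
```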