Harvesting 2.0

Data.gov is embarking on the creation of a new harvesting pipeline for government data. This system will take into account where we are today: it will support the harvesting types and metadata schemas already in use, while also planning for future updates, changes, and better systems. The features provided in this system will be driven by requests and compliance requirements, rather than the current "whatever is available in the CKAN community" approach.

Language

We define the following terms here for common understanding:

  • record: a metadata file/set that describes a "dataset"
  • catalog: a collection of records that are published together from a department/bureau/agency/program, in order to be available to the public
  • federal catalog: the data.gov collection of metadata records across the federal government, made up of all of the record/dataset catalogs described above.
  • ETL: Extract, Transform, Load. While ETL is the framework we are using, the pipeline needs a few additional steps (such as compare and validate). For simplicity we often say ETL when we really mean ECVTVL, or some combination depending on the process.
  • source state: Since harvest sources are provided "as is", with no change log, we need to download the complete list "as is" before we can compare with previous version(s). The "source state" is the data.gov copy of the harvest source, whether that's WAF, JSON, or something else.
  • sourceId: UUID generated by the controller on creation of a new harvest source
  • jobId: UUID generated by the controller when a new job is initiated
  • recordId: UUID generated by the extract service to track the status of a record within the pipeline
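
As a minimal sketch of how these identifiers relate (class and field names are illustrative assumptions, not the production schema):

```python
# Illustrative sketch only: shows how sourceId, jobId, and recordId
# relate; not the production data model.
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HarvestSource:
    url: str
    source_type: str  # e.g. "dcatus" or "waf"
    sourceId: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class HarvestJob:
    sourceId: str
    jobId: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class HarvestRecord:
    jobId: str
    identifier: Optional[str]  # metadata identifier, if the record has one
    recordId: str = field(default_factory=lambda: str(uuid.uuid4()))
```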

Technologies

  • Airflow - orchestration framework
  • Postgres - object-relational database
  • Redis - in-memory keystore

Diagrams

Resources

Roadmap

The following are the features of the Harvesting 2.0 epic. Practically, each feature has its own issue label, prefixed with H2.0/ to mark it as part of the epic.

The controller will keep the state and status of the harvester, and provide an interface into the database. This includes starting harvest jobs, tracking changes, evaluating whether jobs are fully completed, making sure double jobs aren't run, etc.
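
A sketch of the "no double jobs" guard, assuming a harvest_job table with job_id, source_id, and status columns (all names here are illustrative):

```python
# Sketch of the controller refusing to start a second job for a source
# that already has one in flight. Table and column names are
# illustrative; `conn` is a DB-API connection (e.g. psycopg2) to the
# harvest database.
import uuid


def start_job(conn, source_id: str) -> str:
    """Create a new harvest job unless one is already pending/running."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT job_id FROM harvest_job "
            "WHERE source_id = %s AND status IN ('pending', 'running')",
            (source_id,),
        )
        if cur.fetchone():
            raise RuntimeError(f"source {source_id} already has an active job")

        job_id = str(uuid.uuid4())
        cur.execute(
            "INSERT INTO harvest_job (job_id, source_id, status) "
            "VALUES (%s, %s, 'pending')",
            (job_id, source_id),
        )
    return job_id
```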

The extract process is essentially trying to save the "state" of a harvest source at a point in time. It gathers all records (whether from a DCAT-US catalog, WAF, API, etc.), breaks them into pieces, and stores them in S3. It will have to handle GUID creation and tracking (as without validation, we can't rely on metadata fields yet).

As this process may take over 15 minutes, this isn't a simple "call and response" request from the controller to the action. It must be a "please start" request, and the extract process will need to eventually "call back" with the S3 bucket once fully complete. We will also consider utilizing recursion in the WAF extraction type to speed up the process, but source systems may vary in their ability to handle a large number of requests.
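
A sketch of that flow for a DCAT-US source, assuming the controller passes in a bucket, jobId, and callback URL (all names here are illustrative, not the final interface):

```python
# Sketch of the extract step for a DCAT-US catalog source: fetch the
# catalog, split it into individual records, write each to S3, then
# "call back" to the controller with the bucket/prefix.
import json
import uuid

import boto3
import requests


def extract_dcatus(catalog_url: str, bucket: str, job_id: str, callback_url: str):
    s3 = boto3.client("s3")
    catalog = requests.get(catalog_url, timeout=60).json()

    for dataset in catalog.get("dataset", []):
        record_id = str(uuid.uuid4())  # recordId tracked through the pipeline
        s3.put_object(
            Bucket=bucket,
            Key=f"{job_id}/{record_id}.json",
            Body=json.dumps(dataset).encode("utf-8"),
        )

    # Only once the source state is fully captured does the controller hear back.
    requests.post(
        callback_url,
        json={"jobId": job_id, "bucket": bucket, "prefix": f"{job_id}/"},
        timeout=30,
    )
```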

There may be optimizations we can make around curl and timestamps to note/skip items that have not changed since the last harvest, treating them as a basic no-op.
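
One such optimization is a conditional request keyed on the source's Last-Modified timestamp, the requests equivalent of curl with If-Modified-Since; the stored timestamp below is assumed to come from our job history:

```python
# Sketch of a timestamp-based no-op: ask the source whether a file has
# changed since the last harvest before downloading it again.
import requests


def fetch_if_changed(url: str, last_harvest_http_date: str):
    resp = requests.get(
        url,
        headers={"If-Modified-Since": last_harvest_http_date},
        timeout=60,
    )
    if resp.status_code == 304:  # not modified since the last harvest
        return None  # basic no-op
    resp.raise_for_status()
    return resp.content
```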

The comparison determines, given the source state from the extraction, what needs to change in our production system. It will need to find out which items need to be removed from our catalog (delete) because they are no longer in the source, which items need to be added to our catalog (create), and which items need to be updated. For creation/deletion, we will use the metadata unique identifier where possible (DCAT-US and ISO 19115). Where that's not possible, we will build a hash/key from certain metadata fields to create a unique value that can be compared against the items in the data.gov catalog.
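
A sketch of that comparison, assuming we can produce two mappings of identifier (or fallback hash/key) to content hash, one for the source state and one for the current catalog:

```python
# Sketch of the create/update/delete decision as plain set operations
# over identifiers; both inputs map identifier -> content hash.
def compare(source_state: dict[str, str], catalog_state: dict[str, str]):
    source_ids = set(source_state)
    catalog_ids = set(catalog_state)

    to_create = source_ids - catalog_ids          # in the source, not in our catalog
    to_delete = catalog_ids - source_ids          # in our catalog, gone from the source
    to_update = {
        i for i in source_ids & catalog_ids
        if source_state[i] != catalog_state[i]    # present in both, content changed
    }
    return to_create, to_update, to_delete
```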

Noting which items are to be updated will be the most difficult. While the industry standard is to create a hash of the metadata and compare against previous hashes, we've found that some source catalogs/systems generate their metadata in an inconsistent order, causing unnecessary updates. In order to be efficient, we'll want to re-arrange/order the metadata so that we can rely on a hash comparison.
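
A sketch of an order-insensitive hash: serialize the record with sorted keys before hashing, so equivalent records emitted in a different key order produce the same hash:

```python
# Sketch of an order-insensitive hash for JSON metadata: canonicalize
# (sorted keys, fixed separators) before hashing so re-ordered but
# equivalent records do not register as updates.
import hashlib
import json


def record_hash(record: dict) -> str:
    # Note: this only normalizes object key order; if a source also
    # re-orders arrays, further normalization would be needed.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```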

The validation will check whether a given metadata record is valid under a given schema. This is a stateless job. Depending on the transformation and base metadata type to be utilized, we may implement "looser" or stricter checks. We will utilize official tools and 3rd party tools where possible, as we don't own most of the metadata schemas.
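
A minimal sketch of such a check for JSON-based schemas (e.g. DCAT-US) using the jsonschema library; the schema is assumed to already be loaded as a dict:

```python
# Sketch of stateless validation of a record against a JSON Schema.
from jsonschema import validators


def validate_record(record: dict, schema: dict) -> list[str]:
    validator_cls = validators.validator_for(schema)  # picks the draft from $schema
    # An empty list means the record is valid under the given schema.
    return [err.message for err in validator_cls(schema).iter_errors(record)]
```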

It's possible we don't need validation of other metadata standards at all, and can instead let the transformation throw errors.

The transformation will involve converting a record from one metadata schema into another. This is a stateless job. We are considering offloading this to mdTranslator, especially with respect to geospatial metadata.
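
As a purely illustrative sketch (not mdTranslator's API), a stateless transform is just a pure function from one schema's fields to another's; the mapping below is a toy DCAT-US-to-CKAN example, not a complete crosswalk:

```python
# Toy example of a stateless schema transform; the field mapping is
# illustrative only and far from a complete DCAT-US -> CKAN crosswalk.
def dcatus_to_ckan(record: dict) -> dict:
    return {
        "title": record.get("title"),
        "notes": record.get("description"),
        "tags": [{"name": kw} for kw in record.get("keyword", [])],
        "extras": [{"key": "identifier", "value": record.get("identifier")}],
    }
```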

The load will involve loading the record into the remote catalog. In the future this may be an as-yet undetermined system; the immediate need is to push this to the CKAN catalog via the API.
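
A minimal sketch of that push using CKAN's package_create action (the catalog URL and token are placeholders, and the package dict is assumed to already be in CKAN's shape):

```python
# Sketch of the load step: push a CKAN-shaped package dict to the
# catalog via CKAN's package_create action.
import requests

CKAN_URL = "https://catalog.example.gov"  # placeholder
API_TOKEN = "..."  # supplied via environment/secret in practice


def load_record(package: dict) -> dict:
    resp = requests.post(
        f"{CKAN_URL}/api/3/action/package_create",
        json=package,
        headers={"Authorization": API_TOKEN},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["result"]
```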

This piece covers the cloud setup for the features described above. While we lean towards cloud.gov at the time of this writing, the actual location is subject to change. This may involve DB setup, Redis, app deployment, and more.

  • Infrastructure Diagram
  • Define the connection points and infrastructure as code locations
  • Implement infrastructure in dev
  • Evaluate scaling capabilities

This feature is the prep, QA, and cutover to the Harvesting 2.0 system. There will be pieces that need to be tested and validated, as well as manual spot checks to confirm that we are ready to move over.

  • Python script to pull catalog harvest sources (CKAN) and add/create/update them in Harvest 2.0 via the API (see the sketch after this list)
  • Run every harvest to completion, and examine all failures for bugs
  • Announce to data providers the intention of cutover, impact, and review expectations
  • Turn off current harvester
  • Cutover
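
A sketch of the migration script from the first bullet above; the Harvest 2.0 route and payload shape are assumptions, and the field names read from CKAN may differ by harvest source type:

```python
# Sketch of the cutover helper: read harvest sources from the current
# CKAN catalog and create them in Harvesting 2.0 via its API. URLs,
# the H2.0 route, and the payload shape are illustrative assumptions.
import requests

CKAN_URL = "https://catalog.example.gov"       # placeholder
H20_API = "https://harvester.example.gov/api"  # placeholder


def migrate_harvest_sources():
    resp = requests.get(
        f"{CKAN_URL}/api/3/action/package_search",
        params={"fq": "dataset_type:harvest", "rows": 1000},
        timeout=60,
    )
    resp.raise_for_status()
    for src in resp.json()["result"]["results"]:
        requests.post(
            f"{H20_API}/harvest_sources",
            json={
                "title": src.get("title"),
                "url": src.get("url"),
                "source_type": src.get("source_type"),
            },
            timeout=30,
        ).raise_for_status()
```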

With the choice to use Airflow as our orchestrator, we are able to leverage its robust REST API for querying the state of the ETL pipeline.
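
For example, the run states of a harvest DAG can be read from Airflow's stable REST API (the host, credentials, and DAG id below are placeholders):

```python
# Sketch of querying pipeline state through Airflow's stable REST API.
import requests

AIRFLOW_URL = "https://airflow.example.gov"  # placeholder


def dag_run_states(dag_id: str):
    resp = requests.get(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        auth=("user", "password"),  # or a token, depending on the deployment
        timeout=30,
    )
    resp.raise_for_status()
    return [(run["dag_run_id"], run["state"]) for run in resp.json()["dag_runs"]]
```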

The source config for harvest sources (Harvest Source DB) will be managed elsewhere. Initially, this will just be a Postgres DB that exists in cloud.gov and is connected to via a GUI or the command line. Eventually, this will develop into a full-fledged Harvester UI. Tickets for that will be managed in the Harvest UI section.

The API is the public interface to the harvesting system. It will have read routes for the harvest sources, jobs (with warnings/errors), and individual records. It will have the ability to "request" a harvest job start. With authentication, it will allow the following: creating a harvest source, updating a harvest source, deleting a harvest source, stopping a job, starting a job, and maybe a full pipeline detail view (how many jobs are in the queue, etc.). A minimal route sketch follows the list below.

  • Develop Read for harvest job
  • Develop Start for harvest job
  • Develop Stop (delete) for harvest job (auth)
  • Develop Create Request for harvest job
    • publicly available, cannot run 2 in one day
  • Develop read pipeline queue state
  • Develop Read for harvest source
  • Develop Create for harvest source
  • Develop Update for harvest source
  • Develop Delete for harvest source
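
A minimal sketch of a few of these routes using Flask; route names, payloads, and auth handling are illustrative assumptions, and in-memory dicts stand in for the Postgres tables:

```python
# Illustrative sketch of a few API routes; not the final interface.
import uuid

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# In-memory stand-ins for the Postgres-backed tables; illustrative only.
SOURCES: dict[str, dict] = {}
JOBS: dict[str, dict] = {}


@app.get("/harvest_sources/<source_id>")
def read_harvest_source(source_id):
    source = SOURCES.get(source_id)
    if source is None:
        abort(404)
    return jsonify(source)


@app.post("/harvest_jobs")
def request_harvest_job():
    # Publicly available "request a harvest" route; the one-per-day
    # limit would be enforced by the controller before queueing.
    payload = request.get_json()
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"sourceId": payload["sourceId"], "status": "requested"}
    return jsonify({"jobId": job_id}), 201


@app.delete("/harvest_jobs/<job_id>")
def stop_harvest_job(job_id):
    # Auth check omitted in this sketch; stopping a job requires it.
    if job_id not in JOBS:
        abort(404)
    JOBS[job_id]["status"] = "stopped"
    return "", 204
```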