WIP: Define Redis Queue System #4314

Open
7 tasks
btylerburton opened this issue May 11, 2023 · 0 comments
Labels
H2.0/Harvest-General General Harvesting 2.0 Issues


User Story

In order to manage large numbers of harvest jobs, data.gov wants to define a series of queue systems using Redis.

Acceptance Criteria

First we will define the queues themselves:

| queue | purpose |
| --- | --- |
| job | jobs waiting to be picked up by the harvester pipeline |
| extract | harvest source waiting to have catalog parsed |
| compare | an incoming record with unique UUID waiting to be compared with the current record of same identifier |
| validate | a record in need of validation against expected schema |
| transform | a record in need of transformation from one schema to another |
| load | a record ready to be uploaded into current catalog UI (currently created, updated, or deleted in CKAN) |
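
As a minimal sketch, the six queues above could be named in one place so that every pipeline stage refers to the same identifiers. The `HarvestQueue` enum, the `redis_key` helper, and the `harvest:` key prefix are illustrative assumptions, not part of this issue.

```python
from enum import Enum


class HarvestQueue(str, Enum):
    """Queue names taken from the table above."""

    JOB = "job"
    EXTRACT = "extract"
    COMPARE = "compare"
    VALIDATE = "validate"
    TRANSFORM = "transform"
    LOAD = "load"


def redis_key(queue: HarvestQueue, prefix: str = "harvest") -> str:
    """Build a namespaced key for a queue. The prefix is a hypothetical choice."""
    return f"{prefix}:{queue.value}"


if __name__ == "__main__":
    for q in HarvestQueue:
        print(redis_key(q))
```

Keeping the names in an enum means a typo in a queue name fails at import time rather than silently creating a new, empty Redis key.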

Then we will define their lifecycles:

| queue | lifecycle state | definition |
| --- | --- | --- |
| job | create | a job is waiting to be picked up by the harvester |
| job | extract | a harvest source is being extracted into a catalog of records |
| job | compare | a catalog of records is waiting to be compared with its companion in CKAN |
| job | processing | record-level processing of add, update, delete |
| job | completion | harvest job has finished successfully or in error |
| extract | create | a harvest source in queue |
| extract | processing | a harvest source is being extracted |
| extract | completion | all records extracted from harvest source and saved to S3 under appropriate prefix |
| compare | create | a catalog of records in queue |
| compare | processing | a catalog of records is being compared with harvest source found in UI |
| compare | completion | individual records have been sent to next step, determined by whether they need to be added, updated or deleted |
| validate | create | an individual record |
| validate | processing | validation against given schema |
| validate | completion | record passes or fails validation against the schema |
| transform | create | an individual record |
| transform | processing | transforming a record in one schema to another schema |
| transform | completion | record sent to validation queue for final validation of successful transformation |
| load | create | an individual record |
| load | processing | RESTful operation against CKAN catalog based on whether the record should be created, updated, or deleted |
| load | completion | success or failure of that process |
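
The lifecycles above can be sketched as a small lookup table. This assumes each queue moves through its states strictly in the order listed, which the issue implies but does not state; `LIFECYCLES` and `next_state` are hypothetical names used only for illustration.

```python
from typing import Optional

# Lifecycle states per queue, in the order listed in the table above.
# Assumption: a queue item advances through these states one at a time, in order.
LIFECYCLES = {
    "job": ["create", "extract", "compare", "processing", "completion"],
    "extract": ["create", "processing", "completion"],
    "compare": ["create", "processing", "completion"],
    "validate": ["create", "processing", "completion"],
    "transform": ["create", "processing", "completion"],
    "load": ["create", "processing", "completion"],
}


def next_state(queue: str, state: str) -> Optional[str]:
    """Return the state that follows `state` for `queue`, or None at the end."""
    states = LIFECYCLES[queue]
    index = states.index(state)  # raises ValueError for an unknown state
    if index + 1 < len(states):
        return states[index + 1]
    return None


if __name__ == "__main__":
    state = "create"
    while state is not None:
        print("job:", state)
        state = next_state("job", state)
```

Rejecting unknown queues and states with an exception keeps an item from drifting into a state the table does not define.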

Background


Running multiple harvest jobs concurrently would consume excessive system resources. Regardless of pipeline speed, we want a strict FIFO (first in, first out) system that guarantees harvest sources are processed one after another, in the order they were queued.
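
One common way to get FIFO behavior from Redis is to push onto the head of a list and pop from its tail. The sketch below assumes that approach; the issue does not specify it. `enqueue` and `dequeue` are hypothetical helpers that expect a client exposing `lpush` and `rpop` (both exist on a redis-py client), and `InMemoryListClient` is a stand-in so the example runs without a Redis server.

```python
import json
from collections import deque


class InMemoryListClient:
    """Local stand-in that mimics the lpush/rpop list behavior of a Redis client."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, *values):
        items = self._lists.setdefault(name, deque())
        for value in values:
            items.appendleft(value)  # newest item goes to the head
        return len(items)

    def rpop(self, name):
        items = self._lists.get(name)
        return items.pop() if items else None  # oldest item leaves from the tail


def enqueue(client, queue, payload):
    """Add a JSON-serializable payload to the back of the FIFO queue."""
    client.lpush(queue, json.dumps(payload))


def dequeue(client, queue):
    """Take the oldest payload from the queue, or None if it is empty."""
    raw = client.rpop(queue)
    return None if raw is None else json.loads(raw)


if __name__ == "__main__":
    client = InMemoryListClient()
    for source in ("source-a", "source-b", "source-c"):
        enqueue(client, "job", {"harvest_source": source})
    while (job := dequeue(client, "job")) is not None:
        print(job["harvest_source"])
```

Against a real server, a worker would more likely block on `brpop` than poll with `rpop`, and ordering is only strict if a single consumer drains each queue; both points are design choices left open here.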

Security Considerations (required)


None

Sketch

  • Define queues
    • job
    • extract
    • compare
    • transform
    • validate
    • load
@btylerburton btylerburton added H2.0/Harvest-General General Harvesting 2.0 Issues H2.0/controller labels May 11, 2023
@robert-bryson robert-bryson self-assigned this May 24, 2023
@robert-bryson robert-bryson removed their assignment May 26, 2023
@btylerburton btylerburton changed the title Define Redis Queue System WIP: Define Redis Queue System May 26, 2023
@btylerburton btylerburton removed the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 13, 2023
@btylerburton btylerburton added H2.0/Harvest-General General Harvesting 2.0 Issues and removed H2.0/Airflow labels Feb 16, 2024
Projects
Status: 🧊 Icebox