Update the Met Museum reingestion partitions #5358
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
good first issue
New-contributor friendly
help wanted
Open to participation from the community
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🔧 tech: airflow
Involves Apache Airflow
🐍 tech: python
Involves Python
Problem
The Met Museum reingestion DAG (met_museum_reingestion_workflow) experiences task timeouts in later intervals (e.g., date_shift_126). These tasks process large date ranges (180 days) and often fail due to excessive data volume or slow API responses for older data.
Description
To resolve this, we will redistribute the workload by:
Reducing six_month_list_length from 30 to 20, decreasing the number of 180-day tasks.
Increasing three_month_list_length from 18 to 38, adding more 90-day tasks.
Why
Alternatives
Increase dagrun_timeout. Drawback: Risks allowing indefinitely long runs and hides inefficiencies rather than addressing the root cause.
Parallelize task processing. Drawback: The Met Museum API (or other providers) may enforce rate limits. Increasing parallelism could violate these limits, leading to failed requests or temporary bans.
The proposed solution balances simplicity, reliability, and adherence to provider constraints.
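As a rough sanity check on the numbers in the description, the sketch below compares the current and proposed settings. It assumes each list entry covers exactly its nominal interval (180 or 90 days); the helper name summarize and the two day constants are hypothetical and are not taken from provider_reingestion_workflows.py. Only the two list-length parameters and their values come from this issue.

```python
# Nominal interval sizes, in days, assumed for the two partition tiers.
SIX_MONTH_DAYS = 180
THREE_MONTH_DAYS = 90


def summarize(six_month_list_length: int, three_month_list_length: int) -> tuple[int, int]:
    """Return (task count, total days spanned) for the two tiers.

    Hypothetical helper for illustration only; it is not part of the catalog.
    """
    tasks = six_month_list_length + three_month_list_length
    days = (
        six_month_list_length * SIX_MONTH_DAYS
        + three_month_list_length * THREE_MONTH_DAYS
    )
    return tasks, days


# Values from the issue: current settings vs. the proposed settings.
current = summarize(six_month_list_length=30, three_month_list_length=18)
proposed = summarize(six_month_list_length=20, three_month_list_length=38)

print("current (tasks, days): ", current)
print("proposed (tasks, days):", proposed)
```

Under these assumptions both configurations span the same total number of days across the two tiers, because removing ten 180-day entries and adding twenty 90-day entries cancel out. The trade-off is ten additional tasks in exchange for shorter intervals in the part of the schedule where timeouts occur.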
Additional context
Code location
openverse/catalog/dags/providers/provider_reingestion_workflows.py
Lines 69 to 79 in 820f246
Proposed change