Update the Met Museum reingestion partitions #5358

Closed
obulat opened this issue Jan 30, 2025 · 0 comments · Fixed by #5370
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

Comments


obulat commented Jan 30, 2025

Problem

The Met Museum reingestion DAG (met_museum_reingestion_workflow) experiences task timeouts in later intervals (e.g., date_shift_126). These tasks process large date ranges (180 days) and often fail due to excessive data volume or slow API responses for older data.

Description

To resolve this, we will redistribute the workload by:

  1. Reducing six_month_list_length from 30 to 20, decreasing the number of 180-day tasks.
  2. Increasing three_month_list_length from 18 to 38, adding more 90-day tasks.

Why

  • Each 180-day task is split into two 90-day tasks, reducing per-task data volume.
  • Total reingestion days remain unchanged:
    Original: 30×180 + 18×90 = 5,400 + 1,620 = 7,020 days
    Updated:  20×180 + 38×90 = 3,600 + 3,420 = 7,020 days
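The arithmetic above can be verified with a few lines of plain Python (a standalone sanity check, not tied to the catalog code; `total_days` is a throwaway helper invented for this illustration):

```python
def total_days(three_month_count: int, six_month_count: int) -> int:
    """Days covered by the 90-day and 180-day tiers combined."""
    return three_month_count * 90 + six_month_count * 180

# Original configuration: 18 three-month tasks, 30 six-month tasks.
original = total_days(three_month_count=18, six_month_count=30)
# Proposed configuration: 38 three-month tasks, 20 six-month tasks.
updated = total_days(three_month_count=38, six_month_count=20)

assert original == updated == 7020  # total coverage is unchanged
```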

Alternatives

  1. Increase dagrun_timeout. Drawback: Risks allowing indefinitely long runs and hides inefficiencies rather than addressing the root cause.

  2. Parallelize task processing. Drawback: The Met Museum API (or other providers) may enforce rate limits. Increasing parallelism could violate these limits, leading to failed requests or temporary bans.

The proposed solution balances simplicity, reliability, and adherence to provider constraints.

Additional context

Code location

ProviderReingestionWorkflow(
    # 64 total reingestion days
    ingester_class=MetMuseumDataIngester,
    max_active_tasks=2,
    pull_timeout=timedelta(hours=16),
    dagrun_timeout=timedelta(days=7),
    daily_list_length=6,
    one_month_list_length=9,
    three_month_list_length=18,
    six_month_list_length=30,
),

Proposed change

ProviderReingestionWorkflow(
    # ... other parameters ...
    three_month_list_length=38,  # previously 18
    six_month_list_length=20,  # previously 30
),
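To illustrate how the four list lengths shape the per-task workload, here is a hypothetical sketch (the `date_shifts` helper is invented for this illustration and is not the actual Openverse implementation): assuming each tier contributes `count` intervals of a fixed length and each task's date shift is the cumulative number of days, the change trades 10 large 180-day tasks for 20 smaller 90-day tasks while keeping the furthest shift identical.

```python
from itertools import accumulate

def date_shifts(daily=6, one_month=9, three_month=18, six_month=30):
    """Cumulative day-shifts for each reingestion task, one per interval."""
    intervals = (
        [1] * daily + [30] * one_month + [90] * three_month + [180] * six_month
    )
    return list(accumulate(intervals))

old = date_shifts()  # current Met Museum configuration
new = date_shifts(three_month=38, six_month=20)  # proposed configuration

assert old[-1] == new[-1]  # the oldest data reached is the same
assert len(new) == len(old) + 10  # 10 extra, smaller tasks
```

Under these assumptions, the proposal adds ten tasks but caps each at 90 days, which is what relieves the `date_shift_126`-style timeouts.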
@obulat obulat added good first issue New-contributor friendly help wanted Open to participation from the community ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Jan 30, 2025
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jan 30, 2025
@openverse-bot openverse-bot moved this from 📋 Backlog to 🏗 In Progress in Openverse Backlog Feb 3, 2025
@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog Feb 3, 2025