Dead link data incorporation #3585

Open

stacimc opened this issue Dec 22, 2023 · 0 comments
Labels

  • 💻 aspect: code (Concerns the software code in the repository)
  • ✨ goal: improvement (Improvement to an existing user-facing feature)
  • 🧭 project: thread (An issue used to track a project and its progress)
  • 🧱 stack: catalog (Related to the catalog and Airflow DAGs)


Description

Summary

Develop and document a process for handling dead links in the Catalog, in order to make dead link validation in the API faster.

Details

Dead link validation is a critical part of the API, but it sometimes slows down API responses. We do not currently have a well-described process for removing dead links from the catalog. We need:

  • Well-documented, clear criteria for when a link is considered dead:
    • Is one 404 response enough to consider a link dead, or should we define a threshold of recorded 404 responses that must be met?
    • Should we validate only the direct URL, or the foreign landing URL, or both?
  • A process/pipeline for getting dead links identified by the API into the Catalog.
    • One option would be a daily DAG that saves the links marked as dead in the Redis cache to a file in S3 (see the first sketch after this list).
  • A well-documented process for how the Catalog should handle these records once identified:
    • Should the records have removed_from_source set, perhaps using the batched_update DAG? (See the second sketch after this list.)
    • Should the records be removed from the Catalog database entirely, or moved to a separate "dead_links" database?
    • Should the records be moved to the existing DeletedMedia tables?
    • Should we keep parquet files of the removed records instead of moving them to a different database or table?
    • How should the data refresh handle these records?
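
To make the daily-DAG option concrete, here is a minimal sketch of such an export using Airflow's TaskFlow API. The Redis key pattern, connection details, and S3 bucket name are illustrative assumptions, not the catalog's actual configuration:

```python
import json
from datetime import datetime

import boto3
import redis
from airflow.decorators import dag, task

# Assumed names: the real key pattern, Redis connection, and bucket
# would come from the catalog's configuration.
DEAD_LINK_KEY_PATTERN = "valid:false:*"
S3_BUCKET = "dead-links-example-bucket"


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def export_dead_links():
    @task
    def dump_dead_links_to_s3():
        # Collect every identifier the API has flagged as dead in Redis.
        client = redis.Redis(host="localhost", port=6379, decode_responses=True)
        dead = [key.split(":")[-1] for key in client.scan_iter(DEAD_LINK_KEY_PATTERN)]

        # Write one dated file per run so the Catalog can process discrete batches.
        boto3.client("s3").put_object(
            Bucket=S3_BUCKET,
            Key=f"dead_links/{datetime.utcnow():%Y-%m-%d}.json",
            Body=json.dumps(dead),
        )

    dump_dead_links_to_s3()


export_dead_links()
```

Keeping one dated file per run would also leave an audit trail in S3 if the dead-link criteria change later.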
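
For the removed_from_source question, here is a sketch of how a batched_update run might be configured once the exported identifiers have been loaded into the Catalog database. The conf keys follow the general shape of the DAG's parameters but should be checked against its documentation, and dead_link_staging is a hypothetical table name:

```python
import json
import subprocess

# Hypothetical configuration: the exact conf schema of the
# batched_update DAG should be confirmed in the catalog docs.
conf = {
    "query_id": "mark_dead_links_removed",
    "table_name": "image",
    # Assumes the S3 export above was loaded into a staging table
    # (dead_link_staging is a hypothetical name).
    "select_query": "WHERE identifier IN (SELECT identifier FROM dead_link_staging)",
    "update_query": "SET removed_from_source = true, updated_on = now()",
    "dry_run": True,  # inspect the matched row count before a real run
}

# One way to kick off the run is through the Airflow CLI.
subprocess.run(
    ["airflow", "dags", "trigger", "batched_update", "--conf", json.dumps(conf)],
    check=True,
)
```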

Documents

  • Project Proposal
  • Implementation Plan(s)
    • Documentation for the criteria for identifying dead links
    • Process for handling dead links in the Catalog, using information from the Redis cache

Milestones/Issues

Prior Art

This project combines the project ideas "Establish Guidelines and Practices for Dead Links" and "Set up Dead Links Removal Pipeline Using Redis Cache".
