Convert the /backfill endpoint to a kubernetes Job run periodically #740

Closed
severo opened this issue Jan 31, 2023 · 7 comments

Comments

@severo
Collaborator

severo commented Jan 31, 2023

See #708 (comment) and following comments.

Maybe we should not have this endpoint as such. See the related issue: #736. Maybe #736 should be fixed first; then we can see how to implement a backfill trigger.

@severo severo self-assigned this Feb 2, 2023
@severo severo removed their assignment Mar 2, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo severo changed the title from "Improve the /backfill endpoint" to "Convert the /backfill endpoint to a kubernetes Job run periodically" Mar 27, 2023
@severo
Collaborator Author

severo commented Mar 27, 2023

Changed the title: what we need is to convert this endpoint to a "cron" job

@AndreaFrancis
Contributor

I would suggest splitting this issue into smaller ones:

  1. Make /backfill a one-time job (like cache_refresh and mongo_migration, see https://github.com/huggingface/datasets-server/tree/main/jobs) that runs on each deploy (initially keeping the /backfill logic as it is).
  2. Make the backfill k8s job only append Jobs in order to "complete" missing cache entries (according to the current processing graph). This could be done by iterating over each supported dataset and checking its missing cache entries per processing step (could be heavy), or we could store the processing graph steps somewhere and compare them with the current ones; in case of new steps or a newer job runner version, backfill by step (this could probably also replace the cache-refresh job).
  3. Make the backfill a cron job that runs every x time.
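
The second approach in item 2 (storing the processing graph steps and comparing them with the current ones) could look roughly like the sketch below. This is only an illustration, not code from the repository: the `StepVersions` mapping, the `steps_to_backfill` function, and the step names are all hypothetical, and it assumes each step carries an integer job runner version.

```python
from typing import Dict, List

# Hypothetical representation: processing step name -> job runner version.
StepVersions = Dict[str, int]


def steps_to_backfill(stored: StepVersions, current: StepVersions) -> List[str]:
    """Return the steps that are new, or whose job runner version increased."""
    return sorted(
        step
        for step, version in current.items()
        if step not in stored or stored[step] < version
    )


# Illustrative data only: "step-b" got a newer version, "step-c" is new.
stored = {"step-a": 1, "step-b": 2}
current = {"step-a": 1, "step-b": 3, "step-c": 1}
print(steps_to_backfill(stored, current))  # ['step-b', 'step-c']
```

Only the returned steps would then need a backfill, which avoids scanning every dataset when the graph has not changed.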

@AndreaFrancis
Contributor

AndreaFrancis commented Apr 6, 2023

Tasks:

  • Make /backfill a one-time job (like cache_refresh and mongo_migration, see https://github.com/huggingface/datasets-server/tree/main/jobs) that runs on each deploy (initially keeping the /backfill logic as it is).
  • Make the backfill k8s job only append Jobs in order to "complete" missing cache entries (according to the current processing graph). This could be done by iterating over each supported dataset and checking its missing cache entries per processing step (could be heavy), or we could store the processing graph steps somewhere and compare them with the current ones; in case of new steps or a newer job runner version, backfill by step (this could probably also replace the cache-refresh job).
  • Make the backfill a cron job that runs every x time.
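
For the first approach in the second task (iterating over each supported dataset and checking its missing cache entries per processing step), a minimal sketch could be the following. The `missing_jobs` function and its inputs are hypothetical; it assumes the cache can be summarized as a set of (dataset, step) pairs and ignores config-level and split-level entries for simplicity.

```python
from typing import Iterable, List, Set, Tuple


def missing_jobs(
    datasets: Iterable[str],
    steps: Iterable[str],
    cached: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """List the (dataset, step) pairs that have no cache entry yet."""
    step_list = list(steps)  # materialize once, since it is reused per dataset
    return [
        (dataset, step)
        for dataset in datasets
        for step in step_list
        if (dataset, step) not in cached
    ]


# Illustrative data only: "dataset-1" is missing its "step-b" entry.
cached = {("dataset-1", "step-a"), ("dataset-2", "step-a"), ("dataset-2", "step-b")}
print(missing_jobs(["dataset-1", "dataset-2"], ["step-a", "step-b"], cached))
```

The cost is proportional to the number of datasets times the number of steps, which is why this option is flagged as potentially heavy.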

@severo
Collaborator Author

severo commented Apr 14, 2023

Related issue: #844.
Ideally, the backfill should also take care of removing the dangling assets, cache entries, jobs, and parquet files.
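
As a rough sketch of what "dangling" could mean here, assuming (this is an assumption, not something stated in the thread) that a resource is dangling when it belongs to a dataset that is no longer supported. The `dangling_datasets` function is hypothetical:

```python
from typing import Iterable, Set


def dangling_datasets(supported: Iterable[str], referenced: Iterable[str]) -> Set[str]:
    """Datasets that still have stored resources but are no longer supported."""
    return set(referenced) - set(supported)


# Illustrative data only: "old-dataset" still has resources but is unsupported.
print(dangling_datasets(["dataset-1"], ["dataset-1", "old-dataset"]))  # {'old-dataset'}
```

The same check could be run separately for assets, cache entries, jobs, and parquet files, each providing its own list of referenced datasets.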

@github-actions

github-actions bot commented May 8, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator Author

severo commented May 9, 2023

Done

@severo severo closed this as completed May 9, 2023