Detect the "zombie" jobs, and kill them 🧟 🔫 #741
Sometimes the pods crash:

The job that was running then stays forever in the "started" status: what we can call a zombie.

The issue is that, given the rules implemented in the queue logic, a zombie can prevent other jobs for the same dataset, or for the same user, from being processed. It even prevents re-running the same job.

Ideally, we should detect that the job has failed, change its status to "error", and put an error response in the cache database.

To implement this, an option proposed by @XciD is:

Comments
Why not handle that validation before the worker loop starts? I mean, before https://github.com/huggingface/datasets-server/blob/main/workers/datasets_based/src/datasets_based/main.py#L15, look for all "started" jobs, cancel them, and create new ones in the queue (like in force-refresh).
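For illustration, a minimal sketch of that pre-loop cleanup, assuming a pymongo-backed `jobs` collection with a `status` field (collection and field names are placeholders, not the actual datasets-server schema). As the next comments point out, this is unsafe once several pods share the same queue:

```python
# Hypothetical sketch of the proposal above: before entering the worker loop,
# requeue every job still marked "started". Collection and field names are
# assumptions, not the actual datasets-server schema.
from pymongo import MongoClient

jobs = MongoClient("mongodb://localhost:27017")["queue"]["jobs"]

def requeue_abandoned_jobs() -> int:
    """Move every "started" job back to "waiting" so it gets picked up again."""
    result = jobs.update_many(
        {"status": "started"},
        {"$set": {"status": "waiting", "started_at": None}},
    )
    return result.modified_count

if __name__ == "__main__":
    print(f"requeued {requeue_abandoned_jobs()} abandoned jobs")
    # ... then start the worker loop
```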
Yes, you're right, but it's only part of the problem. While the service is alive, a lot of pods crash. We can see it in the list of pods:
Also, when a worker starts, others may already be running. We currently have no way to know.
Sorry, I totally forgot that there are multiple pods for the same service.
I think we can start with the "garbage collector" step first, without implementing the heartbeat yet:

Garbage collector:

Cons of initially implementing only the collector step:

After the heartbeat step is implemented:
It seems like a good plan to me. WDYT @huggingface/datasets-server?
Won't we just end up with a queue made of unfeasible jobs? We'd probably need to allow a max number of retries (2 or 3).
Some operations like

Also, we'd need a way to communicate to the worker that it has to cancel the job. Right now we don't know which worker is doing what. Instead, maybe the worker that is executing the job could cancel itself if it takes too long? That would mean the worker loop process should be controlled by another process, responsible for checking whether the job needs to be canceled.
Yes, you're right: we should just insert an error in the cache, with an error code "pod crashed" or something like that. No need to retry, I think.
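A sketch of what that could look like, assuming a Mongo-backed cache of responses; the collection layout and the "PodCrashed" error code are placeholders, not the project's actual schema:

```python
# Hypothetical sketch: store an error response in the cache for a crashed job,
# so the job is not retried. The collection layout and the "PodCrashed" error
# code are placeholders, not the project's actual schema.
from datetime import datetime, timezone

from pymongo import MongoClient

cache = MongoClient("mongodb://localhost:27017")["cache"]["responses"]

def upsert_crash_error(kind: str, dataset: str) -> None:
    cache.update_one(
        {"kind": kind, "dataset": dataset},
        {
            "$set": {
                "http_status": 500,
                "error_code": "PodCrashed",  # placeholder error code
                "content": {"error": "The job crashed unexpectedly."},
                "updated_at": datetime.now(timezone.utc),
            }
        },
        upsert=True,
    )
```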
You're right, but it can be hard: we're not sure whether the files were updated by the worker or are left over from a previous job.
Hmmm: the heartbeat thread running alongside the worker could be in charge of stopping the worker when the duration exceeds the limit (timeout).
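A minimal sketch of that idea: a daemon thread sends the heartbeat and kills the worker process once the job runs past an assumed time limit. The `send_heartbeat` stub, the interval, and the limit are placeholders:

```python
# Minimal sketch: a daemon thread sends one heartbeat per minute and kills the
# worker process once the job exceeds an assumed time limit.
import os
import threading
import time

HEARTBEAT_INTERVAL = 60      # seconds: one heartbeat per minute
MAX_JOB_DURATION = 20 * 60   # assumed per-job timeout

def send_heartbeat(job_id: str) -> None:
    # In the real service this would update the job's last_heartbeat field
    # in the queue database.
    print(f"heartbeat for job {job_id}")

def heartbeat_loop(job_id: str, started_at: float) -> None:
    while True:
        send_heartbeat(job_id)
        if time.time() - started_at > MAX_JOB_DURATION:
            # Kill the whole process: Kubernetes restarts the pod, and the
            # garbage collector turns the abandoned job into an error.
            os._exit(1)
        time.sleep(HEARTBEAT_INTERVAL)

def run_with_heartbeat(job_id: str, work) -> None:
    thread = threading.Thread(
        target=heartbeat_loop, args=(job_id, time.time()), daemon=True
    )
    thread.start()
    work()
```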
We just need to revert commits until we reach a valid one? Valid commits can be stored in the DB when a job succeeds.
What if a job failed because of a sudden pod crash but it is a small one? I agree with supporting a limited number of retries; that would also let us identify those conflicting datasets.
We need to implement "rollback" logic for the parquet-specific case first, since we currently don't have that logic in the cancel-jobs endpoint: https://github.com/huggingface/datasets-server/blob/main/services/admin/src/admin/routes/cancel_jobs.py#L33
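For the parquet case, the rollback could amount to deleting the pushed files in a single commit. A sketch using the huggingface_hub API; the `refs/convert/parquet` branch name and the token handling are assumptions:

```python
# Hypothetical rollback sketch for the parquet case: delete the parquet files
# pushed by the canceled job in a single commit.
from huggingface_hub import CommitOperationDelete, HfApi

api = HfApi(token="hf_...")  # assumed token

def rollback_parquet(dataset: str, branch: str = "refs/convert/parquet") -> None:
    files = api.list_repo_files(repo_id=dataset, repo_type="dataset", revision=branch)
    operations = [
        CommitOperationDelete(path_in_repo=f) for f in files if f.endswith(".parquet")
    ]
    if operations:
        api.create_commit(
            repo_id=dataset,
            repo_type="dataset",
            revision=branch,
            operations=operations,
            commit_message="Rollback: delete parquet files from a canceled job",
        )
```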
Opened #804 regarding the rollback.
Btw, I recall Celery (a task queue) already implements time limits with cleanups, and automatic retries with delays. Is anyone familiar with it? It looks interesting.
The commit is stored in the cache, along with the worker version.
Yes, but I think it's better to manage it in another PR, to focus on one feature at a time.
We use only one commit to send all the parquet files. I don't know if it's possible for this commit to be "partial"; I don't think so (we use https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.create_commit).
I looked at it, along with Dagster and Airflow (even if the aim is not the same). Maybe we could give it a try, but I'm not sure how we would manage the priorities in the queue (various priority levels: NORMAL/LOW, limiting the number of concurrent jobs for the same user, prioritizing jobs from users that have no started jobs yet, etc.).
There's a priority level you can set for each task in Celery: celery/celery#2635 (comment). We could set it based on the user's history, for example. And I've seen examples of Kubernetes and docker-compose deployments on GitHub: you just need to define the Celery workers, the RabbitMQ queue, etc.
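A sketch of what Celery task priorities look like, per the linked comment; the broker URL and the priority policy are assumptions, not the project's configuration:

```python
# Sketch of Celery task priorities. The broker URL and the priority policy
# are assumptions, not the project's configuration.
from celery import Celery

app = Celery("worker", broker="amqp://guest@localhost//")
app.conf.task_queue_max_priority = 10  # enable priority queues on RabbitMQ
app.conf.task_default_priority = 5

@app.task
def process_job(dataset: str) -> None:
    print(f"processing {dataset}")  # placeholder for the real job logic

def enqueue(dataset: str, user_has_started_jobs: bool) -> None:
    # e.g. prioritize users that have no started jobs yet
    # (a higher value runs first with RabbitMQ).
    priority = 2 if user_has_started_jobs else 8
    process_job.apply_async(args=[dataset], priority=priority)
```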
OK, it might be a good replacement for part of the code. Should we explore this as part of this issue, or fix the issue first and try to migrate afterward?
I think we're still learning what we need, so for now I'd continue with the current implementation to keep full flexibility. Let's think about it again once we have random access to rows?
Zombie jobs are now killed every 10 minutes if the last heartbeat is more than 5 minutes old (there's one heartbeat every minute). Next is to put an error response in the cache database when that happens.
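A minimal sketch of that zombie killer, assuming a pymongo `jobs` collection with a `last_heartbeat` field (collection and field names are placeholders, not the actual schema):

```python
# Minimal sketch of the zombie killer described above: every 10 minutes, jobs
# whose last heartbeat is more than 5 minutes old are marked as errors.
import time
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

jobs = MongoClient("mongodb://localhost:27017")["queue"]["jobs"]

KILL_INTERVAL = 10 * 60                    # run every 10 minutes
HEARTBEAT_MAX_AGE = timedelta(minutes=5)   # heartbeats arrive every minute

def kill_zombies() -> int:
    cutoff = datetime.now(timezone.utc) - HEARTBEAT_MAX_AGE
    result = jobs.update_many(
        {"status": "started", "last_heartbeat": {"$lt": cutoff}},
        {"$set": {"status": "error", "finished_at": datetime.now(timezone.utc)}},
    )
    return result.modified_count

while True:
    print(f"killed {kill_zombies()} zombie jobs")
    time.sleep(KILL_INTERVAL)
```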