
Add heartbeat #824

Merged: 12 commits into main from left4dead, Feb 16, 2023

Conversation

@lhoestq (Member) commented Feb 15, 2023

Add heartbeat to workers.

It adds a last_heartbeat field to documents in the queue.

The field is not mandatory - it only appears for jobs that are or were running when a heartbeat happens (once per minute by default).
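For illustration, here is a minimal sketch of what the once-per-minute heartbeat update could look like. Only the last_heartbeat field comes from this PR; the pymongo client, the database/collection names, and the send_heartbeat helper are assumptions, not the PR's actual code.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Assumed connection string and database/collection names, for illustration only.
jobs = MongoClient("mongodb://localhost:27017")["queue"]["jobs"]

def send_heartbeat(job_id) -> None:
    # Set the optional last_heartbeat field on the job that is currently running.
    # Jobs that never ran simply never get the field, so it stays non-mandatory.
    jobs.update_one(
        {"_id": job_id},
        {"$set": {"last_heartbeat": datetime.now(timezone.utc)}},
    )
```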

Implementation details

I added a WorkerExecutor that runs the worker loop in a subprocess. This way the executor can have its own loop with a heartbeat.

The executor knows about the worker state by reading a JSON file where I store the state of the loop. This is useful for knowing the ID of the current job so its last_heartbeat field can be updated. I used filelock to make sure there are no race conditions when reading/writing this file.
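A minimal sketch of the state-file handshake described above, assuming a JSON file guarded by filelock (with a .lock file derived from the state path). The current_job_id key and the helper names are illustrative, not the PR's actual code.

```python
import json
from pathlib import Path
from typing import Optional

from filelock import FileLock

def write_worker_state(state_path: Path, current_job_id: Optional[str]) -> None:
    # Called by the worker loop: record which job is currently being processed.
    with FileLock(str(state_path) + ".lock"):
        state_path.write_text(json.dumps({"current_job_id": current_job_id}))

def read_current_job_id(state_path: Path) -> Optional[str]:
    # Called by the executor's heartbeat loop: find the job whose
    # last_heartbeat field should be updated.
    with FileLock(str(state_path) + ".lock"):
        if not state_path.exists():
            return None
        return json.loads(state_path.read_text()).get("current_job_id")
```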

TODO

  • fix merge conflicts
  • tests

related to #741

@AndreaFrancis (Contributor) left a comment
Awesome!
But after having last_heartbeat, will we still need another process to CANCEL/RETRY those jobs whose last_heartbeat is older than 60 seconds (it could be more)?
Another dummy question: I assume the JSON file would persist even if the pod crashes, right? If so, it makes me think again about the logic: on service start, would we read the file and see if there is a current_job_info with "STARTED" state and CANCEL/RETRY it? Since the file lives in pod-isolated storage, I think we can assume that if there is a job in the file, it was being processed by the pod that crashed.

@@ -47,6 +50,10 @@ def from_env(cls) -> "WorkerConfig":
sleep_seconds=env.int(name="SLEEP_SECONDS", default=WORKER_SLEEP_SECONDS),
only_job_types=env.list(name="ONLY_JOB_TYPES", default=get_empty_str_list()),
storage_paths=env.list(name="STORAGE_PATHS", default=get_empty_str_list()),
state_path=env.str(name="STATE_PATH", default=None),
Contributor:

I think we could move the default value WORKER_STATE_PATH = "worker_state.json" here; almost all configs have their own default value.

Member Author:

WORKER_STATE_PATH has no default - it changes for every session, using a temporary directory.

Collaborator:

Yes, that's already what we do with the assets directory: https://github.com/huggingface/datasets-server/blob/main/libs/libcommon/src/libcommon/config.py#L14

Note: I prefer to always indirect via a constant, i.e. WORKER_STATE_PATH: Optional[str] = None. Not sure if it's an exaggeration, though.
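A small sketch of the pattern being suggested here, mirroring the env-helper style visible in the diff above; apart from the WORKER_STATE_PATH constant name taken from the comment, the details are assumptions.

```python
from typing import Optional

# Route the default through a module-level constant instead of an inline literal.
WORKER_STATE_PATH: Optional[str] = None  # or "worker_state.json" if a fixed default is wanted

# Inside WorkerConfig.from_env(), mirroring the diff above:
# state_path=env.str(name="STATE_PATH", default=WORKER_STATE_PATH),
```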

Collaborator:

Another comment: I didn't get that STATE_PATH is the state file basename (we create a file by appending .lock at the end, see https://github.com/huggingface/datasets-server/pull/824/files#diff-643260fba42f231dbf4e91102b52af6acb3c216a0eb6f538ac42bc667ebe5381R145). Maybe the name could be more descriptive, or maybe we could replace it with STATE_FILENAME and just use it without appending .lock?

libs/libcommon/src/libcommon/queue.py (review thread resolved)
@lhoestq (Member Author) commented Feb 16, 2023

But after having last_heartbeat, will we still need another process to CANCEL/RETRY those jobs whose last_heartbeat is older than 60 seconds (it could be more)?

Yes, I can work on this in a subsequent PR.

Another dummy question: I assume the JSON file would persist even if the pod crashes, right?

It uses a temporary directory that changes every time the container restarts.
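For context on the follow-up explicitly deferred to a later PR, here is a rough sketch of what a stale-heartbeat cleanup could look like; the collection name, statuses, and threshold are all assumptions.

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

# Assumed connection string and database/collection names, for illustration only.
jobs = MongoClient("mongodb://localhost:27017")["queue"]["jobs"]

def requeue_zombie_jobs(max_missed_seconds: int = 5 * 60) -> int:
    # Find started jobs whose last heartbeat is older than the threshold
    # and put them back in the queue so another worker can pick them up.
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=max_missed_seconds)
    result = jobs.update_many(
        {"status": "started", "last_heartbeat": {"$lt": cutoff}},
        {"$set": {"status": "waiting"}, "$unset": {"last_heartbeat": ""}},
    )
    return result.modified_count
```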

@HuggingFaceDocBuilder (Collaborator) commented Feb 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@lhoestq marked this pull request as ready for review February 16, 2023 18:34
@lhoestq (Member Author) commented Feb 16, 2023

this is ready for review :)

@AndreaFrancis (Contributor) left a comment

Cool! Thanks for making it as simple as possible :)

@lhoestq merged commit 3a27146 into main Feb 16, 2023
@lhoestq deleted the left4dead branch February 16, 2023 22:54
@severo (Collaborator) commented Feb 23, 2023

Super nice implementation, thanks! I left some comments, but it's so good to finally have the heartbeat!

@severo mentioned this pull request Feb 23, 2023
@lhoestq mentioned this pull request Feb 24, 2023