
Add heartbeat #824

Merged: 12 commits into main from left4dead, Feb 16, 2023

Conversation

@lhoestq (Member) commented Feb 15, 2023

Add heartbeat to workers.

It adds a last_heartbeat field to documents in the queue.

The field is not mandatory - it only appears for jobs that are or were running when a heartbeat happens (once per minute by default).
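For illustration, here is a minimal sketch of what the once-per-minute heartbeat update could look like. Only the last_heartbeat field comes from this PR; the pymongo client, the database/collection names, and the send_heartbeat helper are assumptions, not the PR's actual code.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Assumed connection string and database/collection names, for illustration only.
jobs = MongoClient("mongodb://localhost:27017")["queue"]["jobs"]

def send_heartbeat(job_id) -> None:
    # Set the optional last_heartbeat field on the job that is currently running.
    # Jobs that never ran simply never get the field, so it stays non-mandatory.
    jobs.update_one(
        {"_id": job_id},
        {"$set": {"last_heartbeat": datetime.now(timezone.utc)}},
    )
```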

Implementation details

I added a WorkerExecutor that runs the worker loop in a subprocess. This way the executor can have its own loop with a heartbeat.

The executor knows about the worker state by reading a JSON file where I store the state of the loop. This is useful for knowing the ID of the current job so its last_heartbeat field can be updated. I used filelock to make sure there are no race conditions when reading/writing this file.
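A minimal sketch of the state-file handshake described above, assuming a JSON file guarded by filelock (with a .lock file derived from the state path). The current_job_id key and the helper names are illustrative, not the PR's actual code.

```python
import json
from pathlib import Path
from typing import Optional

from filelock import FileLock

def write_worker_state(state_path: Path, current_job_id: Optional[str]) -> None:
    # Called by the worker loop: record which job is currently being processed.
    with FileLock(str(state_path) + ".lock"):
        state_path.write_text(json.dumps({"current_job_id": current_job_id}))

def read_current_job_id(state_path: Path) -> Optional[str]:
    # Called by the executor's heartbeat loop: find the job whose
    # last_heartbeat field should be updated.
    with FileLock(str(state_path) + ".lock"):
        if not state_path.exists():
            return None
        return json.loads(state_path.read_text()).get("current_job_id")
```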

TODO

  • fix merge conflicts
  • tests

related to #741

@AndreaFrancis (Contributor) left a comment
Awesome!
But after having last_heartbeat, will we still need another process to CANCEL/RETRY those jobs whose last_heartbeat is older than 60 seconds (it could be more)?
Another dummy question: I assume the JSON file would persist even if the pod crashes, right? If so, it makes me think again about the logic: on service start, would we read the file and see if there is a current_job_info with "STARTED" state and CANCEL/RETRY it? Since the file lives in pod-isolated storage, I think we can assume that if there is a job in the file, it was being processed by the pod that crashed.

@@ -47,6 +50,10 @@ def from_env(cls) -> "WorkerConfig":
sleep_seconds=env.int(name="SLEEP_SECONDS", default=WORKER_SLEEP_SECONDS),
only_job_types=env.list(name="ONLY_JOB_TYPES", default=get_empty_str_list()),
storage_paths=env.list(name="STORAGE_PATHS", default=get_empty_str_list()),
state_path=env.str(name="STATE_PATH", default=None),
Contributor:

I think we could move the default value WORKER_STATE_PATH = "worker_state.json" here; almost all configs have their own default value.

Member Author:

WORKER_STATE_PATH has no default - it changes for every session, using a temporary directory.

Collaborator:

Yes, that's already what we do with the assets directory: https://github.com/huggingface/datasets-server/blob/main/libs/libcommon/src/libcommon/config.py#L14

Note: I prefer to always indirect via a constant, i.e. WORKER_STATE_PATH: Optional[str] = None. Not sure if it's an exaggeration, though.
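A small sketch of the pattern being suggested here, mirroring the env-helper style visible in the diff above; apart from the WORKER_STATE_PATH constant name taken from the comment, the details are assumptions.

```python
from typing import Optional

# Route the default through a module-level constant instead of an inline literal.
WORKER_STATE_PATH: Optional[str] = None  # or "worker_state.json" if a fixed default is wanted

# Inside WorkerConfig.from_env(), mirroring the diff above:
# state_path=env.str(name="STATE_PATH", default=WORKER_STATE_PATH),
```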

Collaborator:

Another comment: I didn't get that STATE_PATH is the state file basename (we create a file by appending .lock at the end, see https://github.com/huggingface/datasets-server/pull/824/files#diff-643260fba42f231dbf4e91102b52af6acb3c216a0eb6f538ac42bc667ebe5381R145). Maybe the name could be more descriptive, or maybe we could replace it with STATE_FILENAME and just use it without appending .lock?

libs/libcommon/src/libcommon/queue.py (review thread resolved)
@lhoestq (Member Author) commented Feb 16, 2023

But after having last_heartbeat, will we still need another process to CANCEL/RETRY those jobs whose last_heartbeat is older than 60 seconds (it could be more)?

Yes, I can work on this in a subsequent PR.

Another dummy question: I assume the JSON file would persist even if the pod crashes, right?

It uses a temporary directory that changes every time the container restarts.
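For context on the follow-up explicitly deferred to a later PR, here is a rough sketch of what a stale-heartbeat cleanup could look like; the collection name, statuses, and threshold are all assumptions.

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

# Assumed connection string and database/collection names, for illustration only.
jobs = MongoClient("mongodb://localhost:27017")["queue"]["jobs"]

def requeue_zombie_jobs(max_missed_seconds: int = 5 * 60) -> int:
    # Find started jobs whose last heartbeat is older than the threshold
    # and put them back in the queue so another worker can pick them up.
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=max_missed_seconds)
    result = jobs.update_many(
        {"status": "started", "last_heartbeat": {"$lt": cutoff}},
        {"$set": {"status": "waiting"}, "$unset": {"last_heartbeat": ""}},
    )
    return result.modified_count
```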

@HuggingFaceDocBuilder (Collaborator) commented Feb 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@lhoestq marked this pull request as ready for review February 16, 2023 18:34
@lhoestq (Member Author) commented Feb 16, 2023

this is ready for review :)

@AndreaFrancis (Contributor) left a comment

Cool! Thanks for making it as simple as possible :)

@lhoestq merged commit 3a27146 into main Feb 16, 2023
@lhoestq deleted the left4dead branch February 16, 2023 22:54
@severo (Collaborator) commented Feb 23, 2023

Super nice implementation, thanks! I left some comments, but it's so good to finally have the heartbeat!

@severo mentioned this pull request Feb 23, 2023
@lhoestq mentioned this pull request Feb 24, 2023