Problems
At present, elastic-job-cloud-scheduler treats TASK_LOST like TASK_FAILED/TASK_ERROR and schedules a new task. This behavior can cause problems:
The master may lose contact with a running task (e.g., due to a network partition) while the task is still running. Two instances of the same task would then run concurrently.
That might be acceptable for a long-running service (e.g., a web service), but it is incorrect for a batch job task.
Solution
Since Mesos 1.1.0, frameworks can opt in to the PARTITION_AWARE capability. With it, the old TASK_LOST state is replaced by five more specific states:
TASK_DROPPED means “task failed to launch”.
The task is definitely not running.
TASK_UNREACHABLE means that the task was running on an agent that has failed health checks -- i.e., the master hasn’t heard from the agent running the task for a configurable period of time.
The task may still be running.
TASK_GONE and TASK_GONE_BY_OPERATOR mean “task was running on an agent that has been terminated.”
The task is definitely not running.
TASK_UNKNOWN means that the master has no knowledge of the task.
This might be because either (a) the task was never known to the master, or (b) the agent has been GC’d from the list of unreachable or confirmed-dead agents.
The task may still be running.
The scheduler can use these states to determine when a task has truly terminated (TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR).
The scheduler should also support a configurable strategy for each state. The strategy might be to launch a new task or just to send an alert.
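A minimal sketch of what such a per-state strategy could look like. Everything here is hypothetical and only illustrates the proposal: `LostTaskPolicy`, `LostState`, `Strategy`, and `override` are names invented for this example, and `LostState` is a stand-in for the Mesos task states listed above, not the actual Mesos API. By default it reschedules only when the task is definitely not running and alerts otherwise.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical policy object; not part of elastic-job-cloud-scheduler.
final class LostTaskPolicy {

    // Stand-in for the partition-aware states; the flag follows the list above.
    enum LostState {
        TASK_DROPPED(true),
        TASK_UNREACHABLE(false),
        TASK_GONE(true),
        TASK_GONE_BY_OPERATOR(true),
        TASK_UNKNOWN(false);

        private final boolean definitelyTerminated;

        LostState(boolean definitelyTerminated) {
            this.definitelyTerminated = definitelyTerminated;
        }

        boolean isDefinitelyTerminated() {
            return definitelyTerminated;
        }
    }

    // The two reactions named in the proposal.
    enum Strategy { RESCHEDULE, ALERT_ONLY }

    // Per-state overrides supplied by configuration (assumed mechanism).
    private final Map<LostState, Strategy> overrides = new EnumMap<>(LostState.class);

    LostTaskPolicy override(LostState state, Strategy strategy) {
        overrides.put(state, strategy);
        return this;
    }

    // Use the configured strategy if present; otherwise reschedule only
    // when the task is known to have terminated.
    Strategy strategyFor(LostState state) {
        Strategy configured = overrides.get(state);
        if (configured != null) {
            return configured;
        }
        return state.isDefinitelyTerminated() ? Strategy.RESCHEDULE : Strategy.ALERT_ONLY;
    }
}
```

A long-running service that tolerates duplicates could, for example, call `new LostTaskPolicy().override(LostTaskPolicy.LostState.TASK_UNREACHABLE, LostTaskPolicy.Strategy.RESCHEDULE)`, while a batch job would keep the defaults.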
Which version of Elastic-Job are you using?
2.1.4