New Feature: Partition Aware #373

hanahmily · 2017-07-05T07:35:54Z

Problems

At present, elastic-job-cloud-scheduler treats TASK_LOST like TASK_FAILED/TASK_ERROR and schedules a new task. This behavior can cause problems:

The master may lose contact with a running task (e.g., due to a network partition) while the task is still running. Rescheduling would then leave two instances of the same task running at the same time.

That behavior might be all right for a long-running service (e.g., a web service), but it is incorrect for a batch job task.

Solution

Since version 1.1.0, Mesos has supported the PARTITION_AWARE feature. The old TASK_LOST state is split into 5 more specific states:

TASK_DROPPED means “task failed to launch”.
The task is definitely not running.
TASK_UNREACHABLE means that the task was running on an agent that has failed health checks -- i.e., the master hasn’t heard from the agent running the task for a configurable period of time.
The task may still be running.
TASK_GONE(_BY_OPERATOR) means “task was running on an agent that has been terminated.”
The task is definitely not running.
TASK_UNKNOWN means that the master has no knowledge of the task.
This might be because either (a) the task was never known to the master, or (b) the agent has been GC’d from the list of unreachable or confirmed-dead agents.
The task may still be running.
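
The list above can be summarized as a small lookup. This is a minimal sketch, assuming a hypothetical enum `PartitionAwareState` that only mirrors the state names from this issue; it is not the Mesos protobuf `TaskState` type, and the `mayStillBeRunning` flag simply restates what the list says for each state.

```java
/**
 * Hypothetical mirror of the states listed in this issue.
 * It is NOT the Mesos protobuf enum; the names are copied for illustration only.
 */
enum PartitionAwareState {
    TASK_DROPPED(false),          // failed to launch: definitely not running
    TASK_UNREACHABLE(true),       // agent failed health checks: may still be running
    TASK_GONE(false),             // agent terminated: definitely not running
    TASK_GONE_BY_OPERATOR(false), // agent terminated: definitely not running
    TASK_UNKNOWN(true);           // master has no knowledge: may still be running

    private final boolean mayStillBeRunning;

    PartitionAwareState(boolean mayStillBeRunning) {
        this.mayStillBeRunning = mayStillBeRunning;
    }

    /** True when the issue text says "The task may still be running." */
    boolean mayStillBeRunning() {
        return mayStillBeRunning;
    }
}

class PartitionAwareStateDemo {
    public static void main(String[] args) {
        // Print one line per state with its classification.
        for (PartitionAwareState state : PartitionAwareState.values()) {
            System.out.println(state + " mayStillBeRunning=" + state.mayStillBeRunning());
        }
    }
}
```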

The scheduler can use these states to determine when a task has truly terminated (TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR).

The scheduler should also support a configurable strategy for each state. The strategy might be to launch a new task or just to send an alert.
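
A per-state strategy of this kind might be sketched as follows. All type names here (`LostTaskState`, `LostTaskAction`, `LostTaskStrategy`) are hypothetical and do not exist in Elastic-Job or Mesos. The defaults are an assumption: reschedule only for the states listed as truly terminated, and alert for the rest.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical types for illustration; not part of Elastic-Job or Mesos.
enum LostTaskState { TASK_DROPPED, TASK_UNREACHABLE, TASK_GONE, TASK_GONE_BY_OPERATOR, TASK_UNKNOWN }

enum LostTaskAction { RESCHEDULE, ALERT_ONLY }

final class LostTaskStrategy {
    private final Map<LostTaskState, LostTaskAction> actions = new EnumMap<>(LostTaskState.class);

    LostTaskStrategy() {
        // Assumed defaults: launch a new task only when the old one is definitely not running.
        actions.put(LostTaskState.TASK_DROPPED, LostTaskAction.RESCHEDULE);
        actions.put(LostTaskState.TASK_GONE, LostTaskAction.RESCHEDULE);
        actions.put(LostTaskState.TASK_GONE_BY_OPERATOR, LostTaskAction.RESCHEDULE);
        // The task may still be running, so only alert by default.
        actions.put(LostTaskState.TASK_UNREACHABLE, LostTaskAction.ALERT_ONLY);
        actions.put(LostTaskState.TASK_UNKNOWN, LostTaskAction.ALERT_ONLY);
    }

    /** Overrides the action for one state and returns this strategy for chaining. */
    LostTaskStrategy with(LostTaskState state, LostTaskAction action) {
        actions.put(state, action);
        return this;
    }

    /** Returns the configured action for the given state. */
    LostTaskAction actionFor(LostTaskState state) {
        return actions.get(state);
    }
}

class LostTaskStrategyDemo {
    public static void main(String[] args) {
        // Example override: a service that tolerates duplicates could reschedule on TASK_UNREACHABLE.
        LostTaskStrategy strategy = new LostTaskStrategy()
                .with(LostTaskState.TASK_UNREACHABLE, LostTaskAction.RESCHEDULE);
        for (LostTaskState state : LostTaskState.values()) {
            System.out.println(state + " -> " + strategy.actionFor(state));
        }
    }
}
```

Keeping the mapping in data rather than in a fixed `switch` lets a batch job keep the conservative defaults while a long-running service opts into rescheduling.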

Which version of Elastic-Job are you using?

2.1.4

hanahmily added the new feature label Jul 5, 2017

hanahmily self-assigned this Jul 5, 2017

hanahmily mentioned this issue Jul 7, 2017

Fix #373 Partition Aware #378

Merged

terrymanu closed this as completed in 3bb9fa4 Jul 10, 2017
