[ML] Add lazy assignment job config option #47726
Conversation
This change adds:
- A new option, allow_lazy_open, to anomaly detection jobs
- A new option, allow_lazy_start, to data frame analytics jobs

Both work in the same way: they allow a job to be opened/started even if no ML node exists that can accommodate the job immediately. In this situation the job waits in the opening/starting state until ML node capacity is available.

Additionally, the ML nightly maintenance tasks now create audit warnings for ML jobs that are unassigned. This means that jobs that cannot be assigned to an ML node for a very long time will show a yellow warning triangle in the UI.
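For illustration only (this is not code from the PR), a minimal, self-contained sketch of the behaviour described above; every name here is a stand-in rather than a real Elasticsearch class, and only the allow-lazy flag and the waiting outcome are taken from the description:

class LazyAssignmentSketch {

    enum Outcome { ASSIGNED, AWAITING_LAZY_ASSIGNMENT, REJECTED }

    static Outcome assign(boolean nodeWithCapacityExists, boolean allowLazy) {
        if (nodeWithCapacityExists) {
            return Outcome.ASSIGNED;      // normal case: open/start immediately
        }
        // With allow_lazy_open (anomaly detection) or allow_lazy_start (data
        // frame analytics) the request succeeds and the job waits, unassigned,
        // until an ML node with capacity appears.
        return allowLazy ? Outcome.AWAITING_LAZY_ASSIGNMENT : Outcome.REJECTED;
    }

    public static void main(String[] args) {
        System.out.println(assign(false, true));  // AWAITING_LAZY_ASSIGNMENT
        System.out.println(assign(false, false)); // REJECTED
    }
}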
Pinging @elastic/ml-core (:ml)
This change is similar to the one that was made to allow stopping unassigned datafeeds in elastic#39034: if unassigned jobs are now going to be the norm rather than the exception, and to remain unassigned for long periods, then it is very frustrating not to be able to close them easily.
The change to allow unassigned jobs to be closed without force is modelled on the similar change that was done for stopping unassigned datafeeds without force in #39034.
Leaving a few comments but it looks great!
@@ -56,6 +56,7 @@ public static Builder builder() {
     private static final ParseField MODEL_MEMORY_LIMIT = new ParseField("model_memory_limit");
     private static final ParseField CREATE_TIME = new ParseField("create_time");
     private static final ParseField VERSION = new ParseField("version");
+    private static final ParseField ALLOW_LAZY_START = new ParseField("allow_lazy_start");
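For context, a hypothetical sketch of how a ParseField like this is usually registered with the class's parser; the PARSER constant and the Builder setter are assumptions about the surrounding class, not part of the visible hunk:

// Assumes an ObjectParser<Builder, Void> named PARSER and a
// Builder#setAllowLazyStart setter exist in this class.
static {
    PARSER.declareBoolean(Builder::setAllowLazyStart, ALLOW_LAZY_START);
}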
What is the advantage of having this in the config instead of a parameter to the start API?
For the equivalent setting in anomaly detection jobs, it can be added to the modules the SIEM team is using to configure their ML jobs. Then every time those jobs are started they'll pick up the setting. Basically, with the workflows that are envisaged for this new setting, we think it will be something that's always desired or never desired for a particular job, not something that's desired for some invocations but not others.

For the data frame analytics setting specifically, I made it symmetrical with the way things are done for anomaly detection jobs.
private void auditUnassignedMlTasks(ClusterState state) {
    PersistentTasksCustomMetaData tasks = state.getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
    if (tasks != null) {
        mlAssignmentNotifier.auditMlTasks(state.nodes(), tasks, tasks, true);
It's a bit odd that we're passing the same tasks as both the previousTasks and the currentTasks here. Assuming this works properly, should we encapsulate that in MlAssignmentNotifier by making the current auditMlTasks method private and having another public one that only accepts a single tasks arg?
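A sketch of what that encapsulation might look like; the parameter names and the meaning of the boolean flag are assumptions based on the visible call site, not the real class:

// Hypothetical shape of the suggested MlAssignmentNotifier API: one public
// single-tasks entry point, with the two-tasks version made private.
public void auditUnassignedMlTasks(DiscoveryNodes nodes, PersistentTasksCustomMetaData tasks) {
    // Passing the same metadata as both previous and current tasks makes the
    // notifier re-audit the present assignment of every task.
    auditMlTasks(nodes, tasks, tasks, true);
}

private void auditMlTasks(DiscoveryNodes nodes,
                          PersistentTasksCustomMetaData previousTasks,
                          PersistentTasksCustomMetaData currentTasks,
                          boolean alwaysAuditUnassigned) {
    // existing implementation, unchanged
}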
// The listener here can be a no-op, as waitForJobClosed() already waits for
// these persistent tasks to disappear.
persistentTasksService.sendRemoveRequest(jobTask.getId(),
    ActionListener.wrap(r -> {}, e -> {}));
Should we not handle an error in case the removal of the persistent task fails?
I'll add logging of the error at this point in the code. To do anything more would mean completely refactoring the way the class waits for the per-job requests.
The overall request will time out instead of reporting success if removing the persistent task for an unassigned job fails. This is similar to what happens if removing the persistent task for an assigned job fails after we've done all the other steps. Effectively the change is making treatment of assigned and unassigned jobs more similar.
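A minimal sketch of the logging being proposed, assuming a logger field is available in the class:

// Sketch only: the error is logged but the flow is otherwise unchanged, since
// the overall request still times out if the persistent task never disappears.
persistentTasksService.sendRemoveRequest(jobTask.getId(),
    ActionListener.wrap(
        r -> {}, // success needs no action; waitForJobClosed() observes the removal
        e -> logger.error("[" + jobTask.getId() + "] failed to remove persistent task", e)));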
@@ -113,8 +113,9 @@
   - match: { analysis_limits.model_memory_limit: "2048mb" }

 ---
-"Test put job with model_memory_limit as string":
+"Test put job with model_memory_limit as string and lazy open":
It would be great to add a similar test for df-analytics
Trying to add an integration test for lazy assignment showed how unfriendly it is to report starting but unassigned analytics tasks as STOPPED. We will cause enormous customer confusion without a STARTING state.
@@ -433,6 +433,7 @@ public boolean test(PersistentTasksCustomMetaData.PersistentTask<?> persistentTa
             case STOPPING:
                 exception = ExceptionsHelper.conflictStatusException("the task has been stopped while waiting to be started");
                 return true;
+            case STARTING:
I think this means that when a user calls start with lazy mode they would not get a response until the task is actually assigned, right? We probably don't want that.
In the case where the job is too big to be assigned, this check higher up the method will return early:

// This means we are awaiting a new node to be spun up, ok to return back to the user to await node creation
if (assignment != null && assignment.equals(JobNodeSelector.AWAITING_LAZY_ASSIGNMENT)) {
    return true;
}
The only time the code will get to the STARTING case in the switch is during the tiny fraction of a second when a job has been successfully assigned but the call to update its task state has not yet completed. Previously in this situation the state was reported as STOPPED, so I thought it best to put STARTING alongside STOPPED in the switch to maintain previous behaviour.
(I think the integration tests confirm this works as I described, but just to double check I tested this manually with a job with a 3tb model_memory_limit and allow_lazy_start: true on my laptop, and the start endpoint returned success instantly.)
It's equivalent to this bit of TransportOpenJobAction (lines 561 to 563 in f52f2ae):

case OPENING:
case CLOSED:
    return false;
But since it's clearly confusing I'll add some comments.
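Putting the snippets above together, a self-contained illustration of the fall-through behaviour being described; the enum and method are stand-ins, not the real predicate:

class StartingStateSketch {

    enum State { STARTING, STARTED, STOPPING, STOPPED }

    // Returns true when the wait is over, false to keep waiting.
    static boolean waitIsOver(State state) {
        switch (state) {
            case STARTED:
                return true;  // the state the caller was waiting for
            case STOPPING:
                return true;  // the real code raises a conflict exception here
            case STARTING:    // newly assigned, task state update still in flight
            case STOPPED:     // pre-existing case: keep waiting
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(waitIsOver(State.STARTING)); // false: keep waiting
        System.out.println(waitIsOver(State.STARTED));  // true
    }
}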
Right, got it.
LGTM
The BWC version for allow_lazy_open/allow_lazy_start should be 7.5.0 now that #47726 is backported.
This change augments the SIEM jobs and datafeeds that were added in #47848 with the allow_lazy_open and max_empty_searches options that were added in elastic/elasticsearch#47726 and elastic/elasticsearch#47922 respectively.
This change adds:
- A new option, allow_lazy_open, to anomaly detection jobs
- A new option, allow_lazy_start, to data frame analytics jobs
Both work in the same way: they allow a job to be
opened/started even if no ML node exists that can
accommodate the job immediately. In this situation
the job waits in the opening/starting state until ML
node capacity is available. (The starting state for data
frame analytics jobs is new in this change.)
Additionally, the ML nightly maintenance tasks now
create audit warnings for ML jobs that are unassigned.
This means that jobs that cannot be assigned to an ML
node for a very long time will show a yellow warning
triangle in the UI.
A final change is that it is now possible to close a job
that is not assigned to a node without using force.
This is because previously jobs that were open but
not assigned to a node were an aberration, whereas
after this change they'll be relatively common.