
[ML] Add lazy assignment job config option #47726

Merged
merged 14 commits into elastic:master on Oct 14, 2019

Conversation

droberts195
Contributor

@droberts195 droberts195 commented Oct 8, 2019

This change adds:

  • A new option, allow_lazy_open, to anomaly detection jobs
  • A new option, allow_lazy_start, to data frame analytics jobs

Both work in the same way: they allow a job to be
opened/started even if no ML node exists that can
accommodate the job immediately. In this situation
the job waits in the opening/starting state until ML
node capacity is available. (The starting state for data
frame analytics jobs is new in this change.)

Additionally, the ML nightly maintenance task now
creates audit warnings for ML jobs that are unassigned.
This means that jobs that cannot be assigned to an ML
node for a very long time will show a yellow warning
triangle in the UI.

A final change is that it is now possible to close a job
that is not assigned to a node without using force.
This is because previously jobs that were open but
not assigned to a node were an aberration, whereas
after this change they'll be relatively common.
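
To make the behaviour concrete, here is a minimal, self-contained sketch of the decision the two new flags control. This is not the actual Elasticsearch implementation; the class, method, and enum names are invented for illustration, and `allowLazy` stands in for allow_lazy_open / allow_lazy_start.

    // Hypothetical sketch of the lazy-assignment decision described above.
    final class LazyAssignmentSketch {

        enum Outcome { ASSIGNED_NOW, AWAITING_LAZY_ASSIGNMENT, REJECTED }

        // nodeHasCapacity stands in for the real ML node selection logic
        static Outcome decide(boolean nodeHasCapacity, boolean allowLazy) {
            if (nodeHasCapacity) {
                return Outcome.ASSIGNED_NOW;             // normal path: open/start immediately
            }
            if (allowLazy) {
                // the job is accepted and waits in the opening/starting state
                // until ML node capacity becomes available
                return Outcome.AWAITING_LAZY_ASSIGNMENT;
            }
            return Outcome.REJECTED;                     // previous behaviour: the request fails
        }

        public static void main(String[] args) {
            System.out.println(decide(false, true));     // prints AWAITING_LAZY_ASSIGNMENT
        }
    }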

@droberts195 droberts195 added :ml Machine learning and removed WIP labels Oct 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

This change is similar to the one that was made
to allow stopping unassigned datafeeds in elastic#39034:
if unassigned jobs are now going to be the norm
rather than the exception, and may remain
unassigned for long periods, then it is very
frustrating not to be able to close them easily.
@droberts195 droberts195 marked this pull request as ready for review October 9, 2019 15:24
@droberts195
Contributor Author

droberts195 commented Oct 9, 2019

The change to allow unassigned jobs to be closed without force is modelled on the similar change that was done for stopping unassigned datafeeds without force in #39034.

@dimitris-athanasiou dimitris-athanasiou self-requested a review October 9, 2019 15:46
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Leaving a few comments but it looks great!

@@ -56,6 +56,7 @@ public static Builder builder() {
private static final ParseField MODEL_MEMORY_LIMIT = new ParseField("model_memory_limit");
private static final ParseField CREATE_TIME = new ParseField("create_time");
private static final ParseField VERSION = new ParseField("version");
private static final ParseField ALLOW_LAZY_START = new ParseField("allow_lazy_start");
Contributor

What is the advantage of having this in the config instead of a parameter to the start API?

Contributor Author

For the equivalent setting in anomaly detection jobs, it can be added to the modules the SIEM team is using to configure their ML jobs. Then every time those jobs are started they'll pick up the setting. Basically, with the workflows that are envisaged for this new setting, we think it will be something that is always desired (or not desired) for a particular job, rather than something that is desired for some invocations but not others.

Then for the data frame analytics setting specifically I made it symmetrical with the way things are done for anomaly detection jobs.

private void auditUnassignedMlTasks(ClusterState state) {
    PersistentTasksCustomMetaData tasks = state.getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
    if (tasks != null) {
        mlAssignmentNotifier.auditMlTasks(state.nodes(), tasks, tasks, true);
Contributor

It's a bit odd that we're passing the same tasks as both the previousTasks and the currentTasks here. Assuming this works properly, should we encapsulate that in MlAssignmentNotifier by making the current auditMlTasks method private and having another public one that only accepts a single tasks arg?
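
For example, the suggested encapsulation could look roughly like this. It is only a sketch of a fragment inside MlAssignmentNotifier, using the parameter types implied by the call above; the single-argument method name is hypothetical and the real implementation body is elided.

    // Sketch: expose a single-snapshot entry point and keep the two-snapshot variant private.
    public void auditUnassignedMlTasks(DiscoveryNodes nodes, PersistentTasksCustomMetaData tasks) {
        // pass the same snapshot as both previous and current, as the caller above does
        auditMlTasks(nodes, tasks, tasks, true);
    }

    private void auditMlTasks(DiscoveryNodes nodes, PersistentTasksCustomMetaData previousTasks,
                              PersistentTasksCustomMetaData currentTasks, boolean alwaysAuditUnassigned) {
        // ... existing implementation ...
    }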

// The listener here can be a no-op, as waitForJobClosed() already waits for
// these persistent tasks to disappear.
persistentTasksService.sendRemoveRequest(jobTask.getId(),
ActionListener.wrap(r -> {}, e -> {}));
Contributor

Should we not handle an error in case the removal of the persistent task fails?

Contributor Author

I'll add logging of the error at this point in the code. To do anything more would mean completely refactoring the way the class waits for the per-job requests.

The overall request will time out instead of reporting success if removing the persistent task for an unassigned job fails. This is similar to what happens if removing the persistent task for an assigned job fails after we've done all the other steps. Effectively the change is making treatment of assigned and unassigned jobs more similar.
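
For illustration, the logging mentioned above might look something like this. It is a sketch, not the exact code that was committed; the log message and the jobId variable are assumed to exist in the surrounding method.

    persistentTasksService.sendRemoveRequest(jobTask.getId(), ActionListener.wrap(
        r -> {
            // nothing to do on success: waitForJobClosed() already waits for the task to disappear
        },
        e -> logger.error("[" + jobId + "] failed to remove persistent task for unassigned job", e)));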

@@ -113,8 +113,9 @@
   - match: { analysis_limits.model_memory_limit: "2048mb" }

 ---
-"Test put job with model_memory_limit as string":
+"Test put job with model_memory_limit as string and lazy open":
Contributor

It would be great to add a similar test for df-analytics

Trying to add an integration test for lazy assignment
showed how unfriendly it is to report starting but
unassigned analytics tasks as STOPPED.  We will cause
enormous customer confusion without a STARTING state.
@@ -433,6 +433,7 @@ public boolean test(PersistentTasksCustomMetaData.PersistentTask<?> persistentTa
case STOPPING:
exception = ExceptionsHelper.conflictStatusException("the task has been stopped while waiting to be started");
return true;
case STARTING:
Contributor

I think this means that when a user calls start with lazy mode they would not get a response until the task is actually assigned, right? We probably don't want that.

Contributor Author

In the case where the job is too big to be assigned, this check higher up the method returns early:

            // This means we are awaiting a new node to be spun up, ok to return back to the user to await node creation
            if (assignment != null && assignment.equals(JobNodeSelector.AWAITING_LAZY_ASSIGNMENT)) {
                return true;
            }

The only time the code will get to the STARTING case in the switch is during the tiny fraction of a second when a job has been successfully assigned but the call to update its task state has not yet completed. Previously in this situation the state was reported as STOPPED, so I thought it best to put STARTING alongside STOPPED in the switch to maintain previous behaviour.

(I think the integration tests confirm this works as described, but just to double-check I tested it manually on my laptop with a job that had a 3tb model_memory_limit and allow_lazy_start: true, and the start endpoint returned success instantly.)

It's equivalent to the corresponding check in TransportOpenJobAction.

But since it's clearly confusing I'll add some comments.
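
For example, a comment along these lines in the STARTING case would capture the explanation above. This is a sketch of intent only, not the exact comment or code that was committed.

    case STARTING:
        // STARTING is only reachable here in the brief window between a successful
        // immediate assignment and the task state update that follows it. Lazily
        // assigned tasks never get this far because of the AWAITING_LAZY_ASSIGNMENT
        // early return above, so treat STARTING like STOPPED to keep the previous
        // behaviour.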

Contributor

Right, got it.

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM

@droberts195 droberts195 merged commit fd83c18 into elastic:master Oct 14, 2019
@droberts195 droberts195 deleted the allow_lazy_job_assignment branch October 14, 2019 11:13
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Oct 14, 2019
The BWC version for allow_lazy_open/allow_lazy_start
should be 7.5.0 now that elastic#47726 is backported.
droberts195 added a commit that referenced this pull request Oct 14, 2019
The BWC version for allow_lazy_open/allow_lazy_start
should be 7.5.0 now that #47726 is backported.
droberts195 added a commit to elastic/kibana that referenced this pull request Oct 15, 2019
This change augments the SIEM jobs and datafeeds that were
added in #47848 with the allow_lazy_open and max_empty_searches
options that were added in elastic/elasticsearch#47726 and
elastic/elasticsearch#47922 respectively.
droberts195 added a commit to elastic/kibana that referenced this pull request Oct 16, 2019
…#48372)

This change augments the SIEM jobs and datafeeds that were
added in #47848 with the allow_lazy_open and max_empty_searches
options that were added in elastic/elasticsearch#47726 and
elastic/elasticsearch#47922 respectively.