
[ML] Add lazy assignment job config option #47726

Merged
merged 14 commits into elastic:master on Oct 14, 2019

Conversation

droberts195
Contributor

@droberts195 droberts195 commented Oct 8, 2019

This change adds:

  • A new option, allow_lazy_open, to anomaly detection jobs
  • A new option, allow_lazy_start, to data frame analytics jobs

Both work in the same way: they allow a job to be
opened/started even if no ML node exists that can
accommodate the job immediately. In this situation
the job waits in the opening/starting state until ML
node capacity is available. (The starting state for data
frame analytics jobs is new in this change.)

Additionally, the ML nightly maintenance task now
creates audit warnings for ML jobs that are unassigned.
This means that jobs that cannot be assigned to an ML
node for a very long time will show a yellow warning
triangle in the UI.

A final change is that it is now possible to close a job
that is not assigned to a node without using force.
This is because previously jobs that were open but
not assigned to a node were an aberration, whereas
after this change they'll be relatively common.
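
To make the behaviour concrete, here is a minimal, self-contained sketch of the decision the two new flags control. This is not the actual Elasticsearch implementation; the class, method, and enum names are invented for illustration, and `allowLazy` stands in for allow_lazy_open / allow_lazy_start.

    // Hypothetical sketch of the lazy-assignment decision described above.
    final class LazyAssignmentSketch {

        enum Outcome { ASSIGNED_NOW, AWAITING_LAZY_ASSIGNMENT, REJECTED }

        // nodeHasCapacity stands in for the real ML node selection logic
        static Outcome decide(boolean nodeHasCapacity, boolean allowLazy) {
            if (nodeHasCapacity) {
                return Outcome.ASSIGNED_NOW;             // normal path: open/start immediately
            }
            if (allowLazy) {
                // the job is accepted and waits in the opening/starting state
                // until ML node capacity becomes available
                return Outcome.AWAITING_LAZY_ASSIGNMENT;
            }
            return Outcome.REJECTED;                     // previous behaviour: the request fails
        }

        public static void main(String[] args) {
            System.out.println(decide(false, true));     // prints AWAITING_LAZY_ASSIGNMENT
        }
    }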

@droberts195 droberts195 added :ml Machine learning and removed WIP labels Oct 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

This change is similar to the one that was made
to allow stopping unassigned datafeeds in elastic#39034:
if unassigned jobs are now going to be the norm
rather than the exception, and may remain
unassigned for long periods, then it is very
frustrating not to be able to close them easily.
@droberts195 droberts195 marked this pull request as ready for review October 9, 2019 15:24
@droberts195
Contributor Author

droberts195 commented Oct 9, 2019

The change to allow unassigned jobs to be closed without force is modelled on the similar change that was done for stopping unassigned datafeeds without force in #39034.

@dimitris-athanasiou dimitris-athanasiou self-requested a review October 9, 2019 15:46
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Leaving a few comments but it looks great!

@@ -56,6 +56,7 @@ public static Builder builder() {
private static final ParseField MODEL_MEMORY_LIMIT = new ParseField("model_memory_limit");
private static final ParseField CREATE_TIME = new ParseField("create_time");
private static final ParseField VERSION = new ParseField("version");
private static final ParseField ALLOW_LAZY_START = new ParseField("allow_lazy_start");
Contributor

What is the advantage of having this in the config instead of a parameter to the start API?

Contributor Author

For the equivalent setting in anomaly detection jobs, it can be added to the modules the SIEM team is using to configure their ML jobs. Then every time those jobs are started they'll pick up the setting. Basically, with the workflows that are envisaged for this new setting, we think it will be something that is always desired (or not desired) for a particular job, rather than something that is desired for some invocations but not others.

Then for the data frame analytics setting specifically I made it symmetrical with the way things are done for anomaly detection jobs.

private void auditUnassignedMlTasks(ClusterState state) {
    PersistentTasksCustomMetaData tasks = state.getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
    if (tasks != null) {
        mlAssignmentNotifier.auditMlTasks(state.nodes(), tasks, tasks, true);
Contributor

It's a bit odd that we're passing the same tasks as both the previousTasks and the currentTasks here. Assuming this works properly, should we encapsulate that in MlAssignmentNotifier by making the current auditMlTasks method private and having another public one that only accepts a single tasks arg?
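
For example, the suggested encapsulation could look roughly like this. It is only a sketch of a fragment inside MlAssignmentNotifier, using the parameter types implied by the call above; the single-argument method name is hypothetical and the real implementation body is elided.

    // Sketch: expose a single-snapshot entry point and keep the two-snapshot variant private.
    public void auditUnassignedMlTasks(DiscoveryNodes nodes, PersistentTasksCustomMetaData tasks) {
        // pass the same snapshot as both previous and current, as the caller above does
        auditMlTasks(nodes, tasks, tasks, true);
    }

    private void auditMlTasks(DiscoveryNodes nodes, PersistentTasksCustomMetaData previousTasks,
                              PersistentTasksCustomMetaData currentTasks, boolean alwaysAuditUnassigned) {
        // ... existing implementation ...
    }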

// The listener here can be a no-op, as waitForJobClosed() already waits for
// these persistent tasks to disappear.
persistentTasksService.sendRemoveRequest(jobTask.getId(),
ActionListener.wrap(r -> {}, e -> {}));
Contributor

Should we not handle an error in case the removal of the persistent task fails?

Contributor Author

I'll add logging of the error at this point in the code. To do anything more would mean completely refactoring the way the class waits for the per-job requests.

The overall request will time out instead of reporting success if removing the persistent task for an unassigned job fails. This is similar to what happens if removing the persistent task for an assigned job fails after we've done all the other steps. Effectively the change is making treatment of assigned and unassigned jobs more similar.
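
For illustration, the logging mentioned above might look something like this. It is a sketch, not the exact code that was committed; the log message and the jobId variable are assumed to exist in the surrounding method.

    persistentTasksService.sendRemoveRequest(jobTask.getId(), ActionListener.wrap(
        r -> {
            // nothing to do on success: waitForJobClosed() already waits for the task to disappear
        },
        e -> logger.error("[" + jobId + "] failed to remove persistent task for unassigned job", e)));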

@@ -113,8 +113,9 @@
   - match: { analysis_limits.model_memory_limit: "2048mb" }

 ---
-"Test put job with model_memory_limit as string":
+"Test put job with model_memory_limit as string and lazy open":
Contributor

It would be great to add a similar test for df-analytics

Trying to add an integration test for lazy assignment
showed how unfriendly it is to report starting but
unassigned analytics tasks as STOPPED.  We will cause
enormous customer confusion without a STARTING state.
@@ -433,6 +433,7 @@ public boolean test(PersistentTasksCustomMetaData.PersistentTask<?> persistentTa
case STOPPING:
exception = ExceptionsHelper.conflictStatusException("the task has been stopped while waiting to be started");
return true;
case STARTING:
Contributor

I think this means that when a user calls start with lazy mode they would not get a response until the task is actually assigned, right? We probably don't want that.

Contributor Author

In the case where the job is too big to be assigned, this check higher up the method returns early:

            // This means we are awaiting a new node to be spun up, ok to return back to the user to await node creation
            if (assignment != null && assignment.equals(JobNodeSelector.AWAITING_LAZY_ASSIGNMENT)) {
                return true;
            }

The only time the code will get to the STARTING case in the switch is during the tiny fraction of a second when a job has been successfully assigned but the call to update its task state has not yet completed. Previously in this situation the state was reported as STOPPED, so I thought it best to put STARTING alongside STOPPED in the switch to maintain previous behaviour.

(I think the integration tests confirm this works as described, but just to double-check I tested it manually on my laptop with a job that had a 3tb model_memory_limit and allow_lazy_start: true, and the start endpoint returned success instantly.)

It's equivalent to the corresponding check in TransportOpenJobAction.

But since it's clearly confusing I'll add some comments.
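
For example, a comment along these lines in the STARTING case would capture the explanation above. This is a sketch of intent only, not the exact comment or code that was committed.

    case STARTING:
        // STARTING is only reachable here in the brief window between a successful
        // immediate assignment and the task state update that follows it. Lazily
        // assigned tasks never get this far because of the AWAITING_LAZY_ASSIGNMENT
        // early return above, so treat STARTING like STOPPED to keep the previous
        // behaviour.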

Contributor

Right, got it.

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM

@droberts195 droberts195 merged commit fd83c18 into elastic:master Oct 14, 2019
@droberts195 droberts195 deleted the allow_lazy_job_assignment branch October 14, 2019 11:13
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Oct 14, 2019
The BWC version for allow_lazy_open/allow_lazy_start
should be 7.5.0 now that elastic#47726 is backported.
droberts195 added a commit that referenced this pull request Oct 14, 2019
The BWC version for allow_lazy_open/allow_lazy_start
should be 7.5.0 now that #47726 is backported.
droberts195 added a commit to elastic/kibana that referenced this pull request Oct 15, 2019
This change augments the SIEM jobs and datafeeds that were
added in #47848 with the allow_lazy_open and max_empty_searches
options that were added in elastic/elasticsearch#47726 and
elastic/elasticsearch#47922 respectively.
droberts195 added a commit to elastic/kibana that referenced this pull request Oct 16, 2019
…#48372)

This change augments the SIEM jobs and datafeeds that were
added in #47848 with the allow_lazy_open and max_empty_searches
options that were added in elastic/elasticsearch#47726 and
elastic/elasticsearch#47922 respectively.