[ML] Should flush endpoint also wait for in-progress forecasts to complete? #32094

droberts195 · 2018-07-16T14:12:31Z

#31973 highlights a problem with polling indices being the only mechanism to check whether a forecast has completed: the primary and replica can have different documents searchable, so you can get inconsistent results from two consecutive searches of the same forecast results.

We get around this for non-forecast results by recommending people flush a job before requesting results if they require perfect consistency. However, flushing a job does not currently also flush forecasts.

If you close a job then the request does not return until all results are committed to the results index, including forecast results. So there is a discrepancy here between close and flush. Prior to the introduction of forecasts, flush provided the same consistency guarantee as close, but now it doesn't for the case of forecast results.

To restore this parity between the consistency guarantees of flush and close I propose that a request to flush a job should not return until all forecasts are complete as well as other processing. (If there are some situations where we want to flush everything except forecasts then we could have a ?forecasts=false option on the flush endpoint, but I think the default should be that flush and close cause the same documents to be searchable in indices.)

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-07-16T14:12:34Z

Pinging @elastic/ml-core

hendrikmuhs · 2018-07-16T18:35:21Z

This sounds like a good idea to me. The implementation of this requires changes on the cpp side as well. We also would need to think about the jobs in the forecast queue.

Looking at #31973 there is an alternative i like to mention: The forecasting request stats document is written last after all results have been written, but only from the engine side. On the persistence side however the near real time behavior as well as replication can not guarantee this, that's why we run into consistency problems. The alternative to the triggered flush/refresh is doing an automated refresh after receiving the stats document with status==finished. The downside of this: it always forces a refresh, so it is not on_demand and it does not solve the mentioned flush inconsistency.

droberts195 · 2018-07-17T07:57:38Z

doing an automated refresh after receiving the stats document with status==finished

I don't think this will create any visibility guarantees for searches from different processes. The other process (UI, integration test harness or client application) could still search at a time when the document is visible in the primary but not the replica. The thread writing results will wait until the refresh is completed, but that doesn't make the sequence of events required to refresh the index atomic across a distributed cluster. The flush endpoint solves this by syncing with the thread that's writing results, and not returning until the thread writing results is happy that everything is in sync.

hendrikmuhs · 2018-07-17T08:32:54Z

Client applications watch for the forecast request document having status set to finished, if we ensure that the refresh happens before we write this last update I do not see a reason why this should not work.

after receiving the stats document with status==finished

With that I mean the internal communication part between the engine and the result processor, not the external client/es part.

droberts195 · 2018-07-17T08:39:14Z

if we ensure that the refresh happens before we write this last update

Yes, true, that would solve it for the case of a client reacting to seeing status==finished by reading the other forecast results.

It would still be possible for two consecutive gets of the forecast stats document to return strange results, like the first get saying the forecast was finished and the second saying it was still running. But it's probably only tests that care about this. A real client will not get the forecast stats document twice in quick succession.

davidkyle · 2018-07-20T15:50:03Z

#31973 showed that the forecast results can be inconsistent on replica nodes. Is the issue that the UI is polling for a forecast stats document with status==finished before querying results and could be susceptible to the same bug? If so we should address that with the suggested refresh.

I can see the benefit of flush request ensuring any running forecasts have finished and that the results are visible. Forecasting sends a flush command immediately after the forecast request in that case flushing should not block on the forecast finishing (forecasts=false).

droberts195 added >enhancement :ml Machine learning team-discuss labels Jul 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Should flush endpoint also wait for in-progress forecasts to complete? #32094

[ML] Should flush endpoint also wait for in-progress forecasts to complete? #32094

droberts195 commented Jul 16, 2018

elasticmachine commented Jul 16, 2018

hendrikmuhs commented Jul 16, 2018 •

edited

Loading

droberts195 commented Jul 17, 2018

hendrikmuhs commented Jul 17, 2018

droberts195 commented Jul 17, 2018

davidkyle commented Jul 20, 2018

[ML] Should flush endpoint also wait for in-progress forecasts to complete? #32094

[ML] Should flush endpoint also wait for in-progress forecasts to complete? #32094

Comments

droberts195 commented Jul 16, 2018

elasticmachine commented Jul 16, 2018

hendrikmuhs commented Jul 16, 2018 • edited Loading

droberts195 commented Jul 17, 2018

hendrikmuhs commented Jul 17, 2018

droberts195 commented Jul 17, 2018

davidkyle commented Jul 20, 2018

hendrikmuhs commented Jul 16, 2018 •

edited

Loading