Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Should flush endpoint also wait for in-progress forecasts to complete? #32094

Open
droberts195 opened this issue Jul 16, 2018 · 6 comments
Labels

Comments

@droberts195
Copy link
Contributor

#31973 highlights a problem with polling indices being the only mechanism to check whether a forecast has completed: the primary and replica can have different documents searchable, so you can get inconsistent results from two consecutive searches of the same forecast results.

We get around this for non-forecast results by recommending people flush a job before requesting results if they require perfect consistency. However, flushing a job does not currently also flush forecasts.

If you close a job then the request does not return until all results are committed to the results index, including forecast results. So there is a discrepancy here between close and flush. Prior to the introduction of forecasts, flush provided the same consistency guarantee as close, but now it doesn't for the case of forecast results.

To restore this parity between the consistency guarantees of flush and close I propose that a request to flush a job should not return until all forecasts are complete as well as other processing. (If there are some situations where we want to flush everything except forecasts then we could have a ?forecasts=false option on the flush endpoint, but I think the default should be that flush and close cause the same documents to be searchable in indices.)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@hendrikmuhs
Copy link

hendrikmuhs commented Jul 16, 2018

This sounds like a good idea to me. The implementation of this requires changes on the cpp side as well. We also would need to think about the jobs in the forecast queue.

Looking at #31973 there is an alternative i like to mention: The forecasting request stats document is written last after all results have been written, but only from the engine side. On the persistence side however the near real time behavior as well as replication can not guarantee this, that's why we run into consistency problems. The alternative to the triggered flush/refresh is doing an automated refresh after receiving the stats document with status==finished. The downside of this: it always forces a refresh, so it is not on_demand and it does not solve the mentioned flush inconsistency.

@droberts195
Copy link
Contributor Author

doing an automated refresh after receiving the stats document with status==finished

I don't think this will create any visibility guarantees for searches from different processes. The other process (UI, integration test harness or client application) could still search at a time when the document is visible in the primary but not the replica. The thread writing results will wait until the refresh is completed, but that doesn't make the sequence of events required to refresh the index atomic across a distributed cluster. The flush endpoint solves this by syncing with the thread that's writing results, and not returning until the thread writing results is happy that everything is in sync.

@hendrikmuhs
Copy link

Client applications watch for the forecast request document having status set to finished, if we ensure that the refresh happens before we write this last update I do not see a reason why this should not work.

after receiving the stats document with status==finished

With that I mean the internal communication part between the engine and the result processor, not the external client/es part.

@droberts195
Copy link
Contributor Author

if we ensure that the refresh happens before we write this last update

Yes, true, that would solve it for the case of a client reacting to seeing status==finished by reading the other forecast results.

It would still be possible for two consecutive gets of the forecast stats document to return strange results, like the first get saying the forecast was finished and the second saying it was still running. But it's probably only tests that care about this. A real client will not get the forecast stats document twice in quick succession.

@davidkyle
Copy link
Member

#31973 showed that the forecast results can be inconsistent on replica nodes. Is the issue that the UI is polling for a forecast stats document with status==finished before querying results and could be susceptible to the same bug? If so we should address that with the suggested refresh.

I can see the benefit of flush request ensuring any running forecasts have finished and that the results are visible. Forecasting sends a flush command immediately after the forecast request in that case flushing should not block on the forecast finishing (forecasts=false).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants