update by query and max_docs and conflicts=proceed #63671

Closed · nik9000 opened this issue Oct 14, 2020 · 9 comments · Fixed by #71430

Labels: >bug, :Distributed/Reindex, Team:Distributed

nik9000 (Member) commented Oct 14, 2020

elastic/kibana#80371 seems to be having a problem with update_by_query and max_docs. I'm not 100% sure exactly what is going on, but they expect the combination of max_docs=10 and conflicts=proceed to keep running the update by query until it manages to update 10 documents. That seems pretty reasonable, and it looks like something the code is trying to do. I might have written that code, but I honestly don't remember at this point. Anyway, it looks to me like we at least have an issue where we only attempt the first max_docs documents of each bulk response. That would be fine without conflicts=proceed, but with it we should probably check whether we can move on to the next few docs in the bulk response.
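For concreteness, here is a minimal sketch of the kind of request being discussed, sent as plain HTTP from Python. The index name, query, and claiming script are invented for illustration, and the node URL assumes a local cluster recent enough to support max_docs:

```python
import requests

ES = "http://localhost:9200"  # assumed local node

# With conflicts=proceed and max_docs=10, the expectation described above is that
# the request keeps skipping version conflicts until 10 documents are actually updated.
resp = requests.post(
    f"{ES}/task-manager-index/_update_by_query",   # hypothetical index
    params={"conflicts": "proceed", "max_docs": 10},
    json={
        "query": {"term": {"status": "idle"}},     # hypothetical query
        "script": {
            "lang": "painless",
            "source": "ctx._source.status = 'claiming'",  # hypothetical claim script
        },
    },
)
body = resp.json()
# 'updated' should reach 10 when enough matching docs exist; 'version_conflicts'
# records the conflicting docs that were skipped along the way.
print(body["updated"], body["version_conflicts"])
```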

nik9000 added the >bug, :Distributed/Reindex, and needs:triage labels on Oct 14, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (:Distributed/Reindex)

elasticmachine added the Team:Distributed label on Oct 14, 2020
henningandersen (Contributor) commented

I agree, there does appear to be an issue when we truncate the hits due to max_docs, given the semantics you describe.

I am a bit torn on what the right semantics should be though. I think the choice is between:

  1. max_docs relates primarily to the search/source, and thus we should stop after having received that many hits, regardless of version conflicts.
  2. max_docs relates primarily to the actual updates/real work, and thus we should stop only after having done that many updates.

My intuition was towards the 1st option before reading your description here. I am curious whether you have more input on this?
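To make the difference concrete, here is a small illustration (the numbers are invented) of how a response to a request with max_docs=10 and conflicts=proceed might look under each interpretation, assuming 4 of the first 10 hits run into version conflicts:

```python
# Option 1: max_docs caps the hits taken from the search, so version conflicts
# eat into the budget and fewer than 10 documents may end up updated.
option_1_response = {"updated": 6, "version_conflicts": 4}

# Option 2: max_docs caps successful updates, so with conflicts=proceed the request
# keeps consuming hits until 10 updates succeed (or the matching docs run out).
option_2_response = {"updated": 10, "version_conflicts": 4}
```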

I believe the Kibana issue should resolve itself: I assume they retry and run that update periodically, and the version conflicts should then no longer occur, since the tasks that conflicted on the previous run would no longer appear in the search results on a second run?

nik9000 (Member, Author) commented Oct 22, 2020

I think the second choice is what we meant to do. I think I made a mistake when I implemented it and sort of bumbled into doing the first thing instead.

nik9000 (Member, Author) commented Oct 22, 2020

> I believe the Kibana issue should resolve itself: I assume they retry and run that update periodically, and the version conflicts should then no longer occur, since the tasks that conflicted on the previous run would no longer appear in the search results on a second run?

I think it will indeed resolve itself without us. They have some other issue around sorting by score that means they are a bit stuck, but I think that is not related to this.

gwbrown removed the needs:triage label on Oct 22, 2020
gmmorris commented

Thanks for looking into this y'all!

> I believe the Kibana issue should resolve itself: I assume they retry and run that update periodically, and the version conflicts should then no longer occur, since the tasks that conflicted on the previous run would no longer appear in the search results on a second run?

It's an interesting issue, as it has different effects in different situations.
For the most part it does resolve itself, but it becomes problematic for us when we fall behind on the work needed.
We use updateByQuery as part of our Task Management mechanism, which all Kibana Alerts rely on.
Our goal is for this mechanism to scale horizontally, but we're currently hitting a glass ceiling when we have too many Kibana instances running in parallel and their updateByQuery calls result in a growing number of version_conflicts.

Long term we're hoping to solve this by adding coordination between Kibana nodes, so that they can split the work in a more reliable and sustainable fashion. Sadly, we're still quite far from that (Kibana instances aren't aware of each other at the moment) and have to rely on ES to coordinate.
When version_conflicts count against updated, we end up with wasted cycles where multiple Kibana instances skip an entire task-claiming cycle because they consistently see 100% version_conflicts and 0% updated. We're addressing this by reshuffling Kibana instances so that they try to poll without clashing, but it's more like a clumsy Hokey Pokey than a well-coordinated ballet ;)

> 2. max_docs relates primarily to the actual updates/real work, and thus we should stop only after having done that many updates.

This would definitely be our preferred approach, but we only represent one point of view, and I can totally see the first option also being valid.
I'd hate to add even more to the API, but perhaps this could be configurable?

bmcconaghy commented

I think it would violate the principle of least astonishment if there are cases where a given update by query call could have updated max_docs documents but does not because of conflicts. So if this can happen as the code currently stands, I consider that to be a bug.

gmmorris added a commit to elastic/kibana that referenced this issue on Jan 28, 2021: …laiming process (#89415)
gmmorris added a commit to gmmorris/kibana that referenced this issue on Jan 28, 2021: …laiming process (elastic#89415)
gmmorris added a commit to elastic/kibana that referenced this issue on Jan 28, 2021: …laiming process (#89415) (#89540)

The commit message, shared by all three, reads:

> This is a first step in attempting to address the over zealous shifting we've identified in TM.
>
> It [turns out](elastic/elasticsearch#63671) `version_conflicts` don't always count against `max_docs`, so in this PR we correct the `version_conflicts` returned by updateByQuery in TaskManager to only count the conflicts that _may_ have counted against `max_docs`.
> This correction isn't necessarily accurate, but it will ensure we don't shift if we are in fact managing to claim tasks.
gmmorris commented

Hey team,
I just wanted to share some context about the impact of this bug, in the hope that it helps you evaluate the value of addressing this issue in the future. :)

We have now merged a mechanism into Task Manager which essentially tries to shift the time at which a Kibana node polls for tasks if it's experiencing a high rate of version_conflicts.
We found this reduced the impact of this bug as the Task Manager instances end up spacing themselves out so that they clash less.

There are two things to keep in mind about this solution:

  1. It doesn't actually address the problem, as we still end up with wasted cycles due to conflicts until instances realign. This repeats every time we shift, which can happen multiple times when there are many instances.
  2. We now have a glass ceiling beyond which we can't scale horizontally, as this coordination has its limits. In effect this means that features like Alerting, Reporting, etc. can't currently scale beyond a certain point (though we have been able to run 64 Kibana instances in parallel, so it isn't that bad ;) ).

Longer term we hope to introduce some smarter coordination (such as long term ownership of tasks reducing conflicts, or even a leader node that coordinates the work), but these are still far off.

Addressing this bug should reduce the conflicts we experience, enabling us to push that glass ceiling higher, which is why addressing it would be very valuable to us.

I hope this helps :)

fcofdez (Contributor) commented Mar 29, 2021

I've been taking a look into this and I have a couple of tests that reproduce the issue.
As @henningandersen pointed out, one of the problems is that we only take into account the first max_docs documents of each scroll response; for example, with scroll_size=1000 and max_docs=10 we leave out 990 docs, meaning that a big portion of the matching documents is never considered.
A solution for this might be to keep the matching documents around, but that could be problematic if those documents are big; maybe we could keep just the doc ids around and fetch those?

A second scenario that leads to this problem is when all the matching documents are updated concurrently. This is trickier to solve: we could retry the search request, but AFAIK we don't have any guarantee that the process would make progress in scenarios with high contention.

I'm not sure if we want to tackle the second scenario, wdyt @henningandersen?
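To illustrate the first scenario and the rough shape of a fix, here is a small, self-contained sketch of the intended control flow. It is not Elasticsearch code; `next_scroll_page` and `bulk_update` are invented stand-ins for the real scroll and bulk machinery:

```python
def run_update_by_query(next_scroll_page, bulk_update, max_docs):
    """Keep updating until max_docs documents have actually been updated.

    next_scroll_page() -> list of hits (empty when the scroll is exhausted)
    bulk_update(hits)  -> number of successful updates; version conflicts are
                          skipped, i.e. conflicts=proceed semantics
    """
    updated = 0
    while updated < max_docs:
        hits = next_scroll_page()
        if not hits:
            break  # single pass over the scroll; the search is never re-run
        # Buggy behaviour: truncate to the first max_docs hits of this page and
        # then fetch a new page. Intended behaviour: keep consuming the *same*
        # page, slice by slice, until max_docs updates succeed or it runs out.
        while hits and updated < max_docs:
            budget = max_docs - updated
            batch, hits = hits[:budget], hits[budget:]
            updated += bulk_update(batch)
    return updated
```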

henningandersen (Contributor) commented

@fcofdez about the first scenario: I believe we already fetch the docs from source, so keeping them around for a while does not seem problematic (in that case, the caller should lower the scroll size).

About the second scenario, I think we should not repeat the search. We should do one scroll search only, and if we get to the end we are done. The "guarantee" we give is that we try to process the docs that were present when we received the operation; any docs that appear concurrently with the update by query are not guaranteed to be included. We can therefore safely return once we get to the end. The client cannot tell whether extra docs appeared before or after ES sent back the response anyway.

Repeating the search could lead to seeing the same docs multiple times, and if the update script is not idempotent (like incrementing a counter or adding to an array), the effects could accumulate indefinitely.
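As a made-up illustration of that concern: the first script body below is not idempotent, so re-processing a document returned again by a repeated search would keep changing it, while the second is harmless to re-apply:

```python
# Non-idempotent: applying it to the same document twice gives a different result
# than applying it once (the counter keeps growing).
non_idempotent_script = {
    "script": {"lang": "painless", "source": "ctx._source.counter += 1"}
}

# Idempotent: re-applying it to the same document changes nothing further.
idempotent_script = {
    "script": {"lang": "painless", "source": "ctx._source.status = 'claimed'"}
}
```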

fcofdez added a commit to fcofdez/elasticsearch that referenced this issue on Apr 8, 2021: …equests
fcofdez added a commit that referenced this issue on Apr 20, 2021: …equests (#71430)
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue on Apr 20, 2021: …equests (backport of elastic#71430)
fcofdez added a commit that referenced this issue on Apr 20, 2021: …equests (#71931, backport of #71430)

The commit message, shared by the original commit and its backports, reads:

> In update by query requests where max_docs < size and conflicts=proceed, we weren't using the remaining documents from the scroll response in cases where there were conflicts and the successful updates in the first bulk request were < max_docs. This commit addresses that problem and uses the remaining documents from the scroll response instead of requesting a new page.
>
> Closes #63671