
[Saved Objects Migrations] The default migrations.batchSize might be too large in some scenarios #145753

Closed
afharo opened this issue Nov 18, 2022 · 6 comments
Labels: Feature:Migrations, Feature:Saved Objects, Team:Core


afharo commented Nov 18, 2022

Some Saved Objects might be very large. An example is canvas-workpad: it may have embedded images and a large definition of its own.

This means that, in some edge cases, the migration might hit errors like:

The content length (583660234) is bigger than the maximum allowed

when processing batches with the default migrations.batchSize of 1000 documents.

This content length limit is imposed by buffer.constants.MAX_STRING_LENGTH.
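
For reference, a minimal Node.js snippet showing the ceiling involved (a sketch; the exact value varies across Node/V8 versions):

```ts
import { constants } from 'buffer';

// Node.js cannot materialize a single string longer than
// MAX_STRING_LENGTH (~536 million UTF-16 code units on typical 64-bit
// builds), so a response body of 583660234 characters can never be
// decoded into one string, no matter how much heap is available.
console.log(constants.MAX_STRING_LENGTH); // e.g. 536870888
```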

We could improve the error message to let users know that reducing migrations.batchSize might fix this for them or, even better, estimate the document sizes ourselves and reduce the batchSize automatically.

afharo added the Feature:Migrations, Feature:Saved Objects, and Team:Core labels on Nov 18, 2022
@elasticmachine
Pinging @elastic/kibana-core (Team:Core)

TinaHeiligers commented Nov 18, 2022

Related: #92933

I knew we had a discussion about dynamically adapting migrations.batchSize at some point.

@TinaHeiligers
The related issue led me to a meta issue for improving SO migrations performance: #104083
There's already an item in that list for:

Dynamically adjust batch size to prevent exceeding ES HTTP payload size https://github.com/elastic/kibana/issues/107641

That issue added the default value but doesn't seem to address the dynamic behavior we need.

rudolf commented Nov 18, 2022

Thanks for finding those references @TinaHeiligers! I think you might be remembering #109540.

We are dynamically adjusting the write batch size, but we assume it's always safe to read 1000 docs at a time. In this case, parsing the response from ES failed, so the only resolution would be to ask for fewer results from ES.

gsoldevila commented Jun 12, 2023

Dynamic size when reading

This kind of error shouldn't happen anymore with #157494, as we adjust the read batch size dynamically, both for:

  • The reindex to temp loop
  • The outdated document search loop

So I think we are covered for the read part.
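
As a rough illustration of that dynamic adjustment (a simplified sketch with hypothetical names, not the actual code from #157494): if fetching or decoding a batch fails, halve the batch size and retry.

```ts
// Hypothetical sketch: `fetchBatch` stands in for the search request
// the migrator issues. On failure we halve the batch size and retry,
// so oversized payloads eventually fit.
async function readWithAdaptiveBatchSize<T>(
  fetchBatch: (batchSize: number) => Promise<T[]>,
  initialBatchSize: number
): Promise<T[]> {
  let batchSize = initialBatchSize;
  for (;;) {
    try {
      return await fetchBatch(batchSize);
    } catch (err) {
      if (batchSize <= 1) throw err; // cannot shrink any further
      batchSize = Math.floor(batchSize / 2); // halve and retry
    }
  }
}
```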

Dynamic size when writing

Now, regarding the write batch size: the whole set of documents read and transformed in the steps above is broken down into batches, where each batch will have at most maxBatchSizeBytes bytes. These batches are then stored in ES through bulk updates. So for these writes, our upper bound is defined by total payload size in bytes rather than document count.
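
To make the byte-based bound concrete, here is a simplified sketch of that batching (not the exact Kibana helper):

```ts
// Simplified sketch: split documents into batches whose serialized
// size stays at or under maxBatchSizeBytes, so the bound is bytes,
// not document count. A single oversized document still gets its own
// batch in this sketch.
function createBatches(docs: object[], maxBatchSizeBytes: number): object[][] {
  const batches: object[][] = [];
  let current: object[] = [];
  let currentBytes = 0;
  for (const doc of docs) {
    const docBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    if (current.length > 0 && currentBytes + docBytes > maxBatchSizeBytes) {
      batches.push(current); // current batch is full, start a new one
      current = [];
      currentBytes = 0;
    }
    current.push(doc);
    currentBytes += docBytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```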

So this covers the write part for the 2 loops described above (I have to check the error message for that case).

Fixed size when picking up updated mappings

However, there is a third place where we bulk update documents: The pickupUpdatedMappings.
This is not a read + transform + update loop, but rather an asynchronous updateByQuery operation, identified by a taskId that we poll regularly to check for completion.

When calling updateByQuery, we specify a scroll_size (hardcoded to 1000 up until recently, when we updated it to use migrations.batchSize). This scroll_size tells ES how many SO documents to update "per batch".

Thus, if a customer has large SOs that must be "picked up" (e.g. after a compatible reindex upgrade), that might cause problems on the ES side. In particular, this can trigger OOM circuit breaker exceptions, which used to be considered critical and would take Kibana down, causing the migration to fail. Fortunately, these are also handled now, and the operation is retried.
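
For context, the flow described above looks roughly like this (a simplified sketch using the Elasticsearch JS client; the polling interval and error handling are illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

// Simplified sketch of the pickupUpdatedMappings pattern: start an
// asynchronous update_by_query and poll the Tasks API until it
// reports completion.
async function pickupUpdatedMappings(client: Client, index: string, batchSize: number) {
  const resp = await client.updateByQuery({
    index,
    wait_for_completion: false, // run in the background, return a task id
    scroll_size: batchSize, // how many documents ES processes per batch
  });

  const taskId = String(resp.task);
  for (;;) {
    const status = await client.tasks.get({ task_id: taskId });
    if (status.completed) return status;
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
  }
}
```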

Pitfalls

The current pickupUpdatedMappings has some weak points:

  • The scroll_size is fixed by configuration (i.e. not dynamic), and if there are large SOs (and perhaps, multiple Kibana instances), this can really stress ES, and ultimately cause an actual OOM.
  • Circuit breaker exceptions are handled very inefficiently: When retrying the operation, the whole set of documents is re-processed, losing current progress.

Possible improvements

  • A possible improvement would be to dynamically adjust scroll_size (just as we do for batchSize), so that when we hit a 429 circuit breaker, we try again with a smaller scroll_size, to put less stress on ES (see the sketch after this list).
  • It would be nice if we were able to flag updated documents somehow, so that successive retries only have to handle the remaining ones.
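
A hypothetical sketch of that first idea (the helper name and retry policy are illustrative, not an actual implementation): retry the operation with a halved scroll_size whenever ES answers with a 429 circuit-breaker error.

```ts
import { errors } from '@elastic/elasticsearch';

// Hypothetical sketch: `runUpdateByQuery` stands in for the real
// pickupUpdatedMappings call; on a 429 (circuit breaker) response we
// halve scroll_size and try again, putting less pressure on ES.
async function updateWithAdaptiveScrollSize(
  runUpdateByQuery: (scrollSize: number) => Promise<void>,
  initialScrollSize: number
): Promise<void> {
  let scrollSize = initialScrollSize;
  for (;;) {
    try {
      return await runUpdateByQuery(scrollSize);
    } catch (err) {
      const isCircuitBreaker =
        err instanceof errors.ResponseError && err.statusCode === 429;
      if (!isCircuitBreaker || scrollSize <= 1) throw err;
      scrollSize = Math.floor(scrollSize / 2);
    }
  }
}
```

Note that this still re-processes the whole document set on each retry, which is exactly the second pitfall above; flagging already-updated documents would avoid that.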

rudolf commented Oct 31, 2023

Closed by #157494

rudolf closed this as completed on Oct 31, 2023