
[Saved Objects Migrations] The default migrations.batchSize might be too large in some scenarios #145753

Closed
afharo opened this issue Nov 18, 2022 · 6 comments
Labels: Feature:Migrations, Feature:Saved Objects, Team:Core


afharo commented Nov 18, 2022

Some Saved Objects might be very large. An example is canvas-workpad: it may have embedded images and a large definition of its own.

This means that, in some edge cases, the migration might hit errors like:

The content length (583660234) is bigger than the maximum allowed

when processing batches with the default migrations.batchSize of 1000 documents.

This content length limit is imposed by buffer.constants.MAX_STRING_LENGTH.
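
For reference, a minimal Node.js snippet showing the ceiling involved (a sketch; the exact value varies across Node/V8 versions):

```ts
import { constants } from 'buffer';

// Node.js cannot materialize a single string longer than
// MAX_STRING_LENGTH (~536 million UTF-16 code units on typical 64-bit
// builds), so a response body of 583660234 characters can never be
// decoded into one string, no matter how much heap is available.
console.log(constants.MAX_STRING_LENGTH); // e.g. 536870888
```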

We could improve the error message to let users know that reducing migrations.batchSize might fix this for them or, even better, estimate the document sizes ourselves and reduce the batchSize automatically.

afharo added the Feature:Migrations, Feature:Saved Objects, and Team:Core labels on Nov 18, 2022
@elasticmachine
Pinging @elastic/kibana-core (Team:Core)

TinaHeiligers commented Nov 18, 2022

Related: #92933

I knew we had a discussion about dynamically adapting migrations.batchSize at some point.

@TinaHeiligers
The related issue led me to a meta issue for improving SO migrations performance: #104083
There's already an item in that list for:

Dynamically adjust batch size to prevent exceeding ES HTTP payload size https://github.com/elastic/kibana/issues/107641

That issue added the default value but doesn't seem to address the dynamic behavior we need.

rudolf commented Nov 18, 2022

Thanks for finding those references @TinaHeiligers! I think you might be remembering #109540.

We are dynamically adjusting the write batch size, but we assume it's always safe to read 1000 docs at a time. In this case, parsing the response from ES failed, so the only resolution would be to ask for fewer results from ES.

gsoldevila commented Jun 12, 2023

Dynamic size when reading

This kind of error shouldn't happen anymore with #157494, as we adjust the read batch size dynamically, both for:

  • The reindex to temp loop
  • The outdated document search loop

So I think we are covered for the read part.
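
As a rough illustration of that dynamic adjustment (a simplified sketch with hypothetical names, not the actual code from #157494): if fetching or decoding a batch fails, halve the batch size and retry.

```ts
// Hypothetical sketch: `fetchBatch` stands in for the search request
// the migrator issues. On failure we halve the batch size and retry,
// so oversized payloads eventually fit.
async function readWithAdaptiveBatchSize<T>(
  fetchBatch: (batchSize: number) => Promise<T[]>,
  initialBatchSize: number
): Promise<T[]> {
  let batchSize = initialBatchSize;
  for (;;) {
    try {
      return await fetchBatch(batchSize);
    } catch (err) {
      if (batchSize <= 1) throw err; // cannot shrink any further
      batchSize = Math.floor(batchSize / 2); // halve and retry
    }
  }
}
```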

Dynamic size when writing

Now, regarding the write batch size: the whole set of documents read and transformed in the steps above is broken down into batches, where each batch will have at most maxBatchSizeBytes bytes. These batches are then stored in ES through bulk updates. So for these writes, our upper bound is defined by total payload size in bytes rather than document count.
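
To make the byte-based bound concrete, here is a simplified sketch of that batching (not the exact Kibana helper):

```ts
// Simplified sketch: split documents into batches whose serialized
// size stays at or under maxBatchSizeBytes, so the bound is bytes,
// not document count. A single oversized document still gets its own
// batch in this sketch.
function createBatches(docs: object[], maxBatchSizeBytes: number): object[][] {
  const batches: object[][] = [];
  let current: object[] = [];
  let currentBytes = 0;
  for (const doc of docs) {
    const docBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    if (current.length > 0 && currentBytes + docBytes > maxBatchSizeBytes) {
      batches.push(current); // current batch is full, start a new one
      current = [];
      currentBytes = 0;
    }
    current.push(doc);
    currentBytes += docBytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```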

So this covers the write part for the 2 loops described above (I have to check the error message for that case).

Fixed size when picking up updated mappings

However, there is a third place where we bulk update documents: The pickupUpdatedMappings.
This is not a read + transform + update loop, but rather an asynchronous updateByQuery operation, identified by a taskId that we poll regularly to check for completion.

When calling updateByQuery, we specify a scroll_size (hardcoded to 1000 up until recently, when we updated it to use migrations.batchSize). This scroll_size tells ES how many SO documents to update "per batch".

Thus, if a customer has large SOs that must be "picked up" (e.g. after a compatible reindex upgrade), that might cause problems on the ES side. In particular, this can trigger OOM circuit breaker exceptions, which used to be considered critical and would take Kibana down, causing the migration to fail. Fortunately, these are also handled now, and the operation is retried.
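
For context, the flow described above looks roughly like this (a simplified sketch using the Elasticsearch JS client; the polling interval and error handling are illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

// Simplified sketch of the pickupUpdatedMappings pattern: start an
// asynchronous update_by_query and poll the Tasks API until it
// reports completion.
async function pickupUpdatedMappings(client: Client, index: string, batchSize: number) {
  const resp = await client.updateByQuery({
    index,
    wait_for_completion: false, // run in the background, return a task id
    scroll_size: batchSize, // how many documents ES processes per batch
  });

  const taskId = String(resp.task);
  for (;;) {
    const status = await client.tasks.get({ task_id: taskId });
    if (status.completed) return status;
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
  }
}
```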

Pitfalls

The current pickupUpdatedMappings has some weak points:

  • The scroll_size is fixed by configuration (i.e. not dynamic), and if there are large SOs (and perhaps, multiple Kibana instances), this can really stress ES, and ultimately cause an actual OOM.
  • Circuit breaker exceptions are handled very inefficiently: When retrying the operation, the whole set of documents is re-processed, losing current progress.

Possible improvements

  • A possible improvement would be to dynamically adjust scroll_size (just as we do for batchSize), so that when we hit a 429 circuit breaker, we try again with a smaller scroll_size, to put less stress on ES (see the sketch after this list).
  • It would be nice if we were able to flag updated documents somehow, so that successive retries only have to handle the remaining ones.
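
A hypothetical sketch of that first idea (the helper name and retry policy are illustrative, not an actual implementation): retry the operation with a halved scroll_size whenever ES answers with a 429 circuit-breaker error.

```ts
import { errors } from '@elastic/elasticsearch';

// Hypothetical sketch: `runUpdateByQuery` stands in for the real
// pickupUpdatedMappings call; on a 429 (circuit breaker) response we
// halve scroll_size and try again, putting less pressure on ES.
async function updateWithAdaptiveScrollSize(
  runUpdateByQuery: (scrollSize: number) => Promise<void>,
  initialScrollSize: number
): Promise<void> {
  let scrollSize = initialScrollSize;
  for (;;) {
    try {
      return await runUpdateByQuery(scrollSize);
    } catch (err) {
      const isCircuitBreaker =
        err instanceof errors.ResponseError && err.statusCode === 429;
      if (!isCircuitBreaker || scrollSize <= 1) throw err;
      scrollSize = Math.floor(scrollSize / 2);
    }
  }
}
```

Note that this still re-processes the whole document set on each retry, which is exactly the second pitfall above; flagging already-updated documents would avoid that.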

rudolf commented Oct 31, 2023

Closed by #157494

rudolf closed this as completed on Oct 31, 2023