Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search Sessions Stabilization Stage I #134983

Merged
merged 41 commits into from
Oct 5, 2022
Merged

Conversation

Dosant
Copy link
Contributor

@Dosant Dosant commented Jun 23, 2022

Summary

Close #139349

This pr changes search service/search session infrastructure to improve performance, stability, and resiliency by ensuring that search sessions don’t add additional load on a cluster when the feature is not used

Details on motivation and implementation details in the RFC

Comparison of old vs. new implementation. More details in the RFC

  Old implementation New implementation
Number of task manager tasks 3 0
Initial search keep_alive 1 week 1 minute, extended while polling
Search cancelation Task manager task Automatically deleted due to short keep_alive. In some case expedited faster from browser
Search-session saved object creation For every search search even if user doesn’t save it After user saves a session
Search-session status calculation Task manager task with 10 second interval. Status is calculated and saved in saved object Status is calculated on-the-fly when user retrieves search-sessions objects
Search-session clean-up Sessions that are not saved by user cleaned-up by task manager None. Because we create saved object only for sessions that saved by user
Saving session after search requests complete 5 minutes limit because task manager might have already deleted searches and session object 5 minutes artificial client-side limit to prevent keeping alive searches from "forgotten" tabs. The client continues polling on non-frequent intervals to keep the search alive while the user stays on the screen

Advantages of the new implementation:

  • No task manager usage
    • Less Kibana server load
  • Short default keep_alive
    • Less risk to end up with unintentional long-running search
  • Saved objects created only when user saves a session
    • Less cluster load. Less storage needs.
  • No need for “touched” property is search-session saved object
    • No more object updates on every search poll.
  • On-the-fly calculated search-session status
    • No need for task manager to keep state in sync
    • Less synchronization delays / potential bugs

Performace benchmark
The trend will be more visible in nightly performance tests after we merge, but I run kibana-load-test dashboard scenario a couple of times and I believe the difference is more then just a margin of error:

  Total OK KO % KO Cnt/s Min 50th pct 75th pct 95th pct 99th pct Max Mean Std Dev
Main 52088 52088 0 0% 108.066 9 1775 2384 6201 8895 11273 1968 1738
Branch 56899 56899 0 0% 123.693 9 1192 1972 4089 7057 8886 1414 1337

In this scenario, the branch performs better because there are no search-session monitoring tasks and no redundant search-session saved object creations and updates.

Minor user-facing changes:

  • Management screen:
  • Sessions list auto-refresh is disabled by default because the list is more expensive to fetch as we calculate session status on the fly
  • For the same reason, we fetch only the last 100 sessions of the current user
  • Saving the session is disabled when we continue the session from lens (or another editor). This should be revised with Separate architecture for client side cache sharing #121543
  • Errored session will be in "expired" status if current time > then expired time. We do this to shortcut search status checks if the session is expired.

Nice to have follows-ups (Out of scope):

  • ES should add completion_time to search status so we can add back completion time when restoring a session Add completion_time time field to async_search get or status response elasticsearch#88640
  • IsRestore / IsStored / isStore search options are confusing. Make them clearer consider splitting what is session related and what is search related.
  • Can simplify updateOrCreate logic server-side because now we create and update separately. This will improve performance and number of update retries. TODO create an improvement issue
  • New searches added to a session when restoring (other bucket) would use default expiration, so they wouldn't match the session's expiration.
Progress tracking

Testing Scenarios

Some advanced scenarios I've went through and to give you an idea what to look for:

  • Extend from mgmt works and extends sessions and searches
  • Deleting a saved session also deletes/cancels related async searches
  • When restoring a session, searches aren't extended further
  • When restoring a session and new searches are added - they are added to a session and warning is shown
  • Searches start with 1 minutes keep_alive, but after the session is saved are extended to 1 week (by default)
  • Everything works in a different space
  • Everything works when user doesn't have access to save a session
  • Searches work when the session feature is disabled (data.search.session.enabled:false). The keep_alive is 1 minute, and no client-side polling to extend complete searches
  • Migrations: old non-persisted sessions are dropped; persisted are migrated

Testing tips

Get search sessions in dev tools
GET .kibana*/_search
{
  "query": {
    "match": {
      "type": "search-session"
    }
  }
}
Check async search status

Look at idMapping object in search session attributes. Values would have id which is async_search_id from elasticsearch.

Use status API to check search's expiration_time and completion status:

GET /_async_search/status/FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk Probability Severity Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space. Low High Worth testing in multiple spaces
Search session migrations Medium High Old, not persisted sessions should be dropped. New should be migrated

For maintainers

@Dosant Dosant changed the title Search Sessions Stabilization Stage 1 – Initial PR (#132823) [WIP] Search Sessions Stabilization Stage I Jun 23, 2022
@Dosant
Copy link
Contributor Author

Dosant commented Jun 23, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jun 24, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jun 27, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jul 6, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jul 11, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jul 19, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jul 25, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jul 26, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Jul 27, 2022

@elasticmachine merge upstream

@Dosant Dosant requested review from a team as code owners September 22, 2022 13:32
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-app-services (Team:AppServicesSv)

@Dosant
Copy link
Contributor Author

Dosant commented Sep 26, 2022

@elasticmachine merge upstream

@Dosant Dosant added the release_note:skip Skip the PR/issue when compiling release notes label Sep 26, 2022
Copy link
Contributor

@watson watson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kibana Platform Security changes LGTM 👍

@Dosant
Copy link
Contributor Author

Dosant commented Sep 27, 2022

@elasticmachine merge upstream

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response Ops changes LGTM. Saw the tasks get marked as "unrecognized".

Copy link
Contributor

@ThomThomson ThomThomson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional test changes LGTM! Code review only.

@Dosant
Copy link
Contributor Author

Dosant commented Sep 28, 2022

@elasticmachine merge upstream

@flash1293
Copy link
Contributor

I checked the timelion integration and just making sure I'm getting this correctly - if the session is saved, then a new request is kicked off so the timelion server can store it, but as long as the user stays on the dashboard, it's still "waiting" for the original request which isn't cancelled, right? If that's the case, then LGTM

Copy link
Member

@ppisljar ppisljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Dosant
Copy link
Contributor Author

Dosant commented Oct 5, 2022

@elasticmachine merge upstream

@Dosant
Copy link
Contributor Author

Dosant commented Oct 5, 2022

I checked the timelion integration and just making sure I'm getting this correctly - if the session is saved, then a new request is kicked off so the timelion server can store it, but as long as the user stays on the dashboard, it's still "waiting" for the original request which isn't cancelled, right?

Right. This is expected.
I think this is OK because of a smaller bug surface as the implementation is simpler. Also potentially users may get the on-screen results earlier (without waiting for the second request to finish)

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
data 521 520 -1

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
data 2509 2513 +4

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
data 52.7KB 52.4KB -291.0B
inspector 15.9KB 16.0KB +55.0B
visTypeTimelion 109.1KB 109.2KB +92.0B
total -144.0B

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
data 23 24 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
data 433.2KB 437.0KB +3.9KB
visTypeTimeseries 19.4KB 19.5KB +97.0B
total +4.0KB

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/saved-objects-service.html#_mappings

id before after diff
search-session 15 12 -3
Unknown metric groups

API count

id before after diff
data 3213 3221 +8

ESLint disabled line counts

id before after diff
data 46 52 +6

Total ESLint disabled count

id before after diff
data 48 54 +6

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@Dosant Dosant merged commit 5f3d439 into main Oct 5, 2022
@Dosant Dosant deleted the search-sessions-stabilization-stage-1 branch October 5, 2022 09:52
@kibanamachine kibanamachine added v8.6.0 backport:skip This commit does not require backporting labels Oct 5, 2022
WafaaNasr pushed a commit to WafaaNasr/kibana that referenced this pull request Oct 11, 2022
Changes search service/search session infrastructure to improve performance, stability, and resiliency by ensuring that search sessions don’t add additional load on a cluster when the feature is not used
WafaaNasr pushed a commit to WafaaNasr/kibana that referenced this pull request Oct 14, 2022
Changes search service/search session infrastructure to improve performance, stability, and resiliency by ensuring that search sessions don’t add additional load on a cluster when the feature is not used
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Search Sessions Feature:Search Querying infrastructure in Kibana release_note:skip Skip the PR/issue when compiling release notes v8.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Search/Search Sessions] Stabilization Stage I