Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: destroy on main thread #14954

Merged
merged 43 commits into from
Feb 23, 2021
Merged

Conversation

lambdai
Copy link
Contributor

@lambdai lambdai commented Feb 5, 2021

Commit Message:
Always destroy cluster info object on the master thread.
Fixed SDS churn and a rare crash case.
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fixes:
#13209
istio/istio#30199
istio/istio#28315
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai
Copy link
Contributor Author

lambdai commented Feb 5, 2021

ClusterInfo is not really destroyed because mainthread dispatcher stops dispatching posted tasks. SSLContextManager and symbol tables are not happy: these two managers expect all resources destroyed.

Before this commit the main thread dispatcher will drain the post queue but though, but the exit flag is fragile.

Any advice on how to gracefully shut down dispatcher? @antoniovicente

@mattklein123 mattklein123 self-assigned this Feb 5, 2021
@lambdai
Copy link
Contributor Author

lambdai commented Feb 5, 2021

The tests are failing because symbol tables is not empty when the IsolatedStoreImpl is destroyed.

I am investigating. But it's also good to hear if this is a big issue.

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterInfo is not really destroyed because mainthread dispatcher stops dispatching posted tasks. SSLContextManager and symbol tables are not happy: these two managers expect all resources destroyed.

Before this commit the main thread dispatcher will drain the post queue but though, but the exit flag is fragile.

Any advice on how to gracefully shut down dispatcher? @antoniovicente

Clean shutdown of dispatchers is a fairly involved topic. The envoy dispatcher class hasn't really tried to shoot for clean shutdown. There are some things that could be done to move in that direction like:

  • introducing a bool tryPost(cb) that rejects additional callbacks after shutdown.
  • executing scheduled callbacks as part of the shutdown process
  • force close downstream connections
  • not sure what's relevant to timers

source/common/upstream/upstream_impl.cc Outdated Show resolved Hide resolved
source/common/upstream/upstream_impl.cc Show resolved Hide resolved
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@antoniovicente antoniovicente self-assigned this Feb 6, 2021
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai
Copy link
Contributor Author

lambdai commented Feb 8, 2021

Is tsan error a red herring?
subprocess.CalledProcessError: Command '['git', 'describe', '--all']' returned non-zero exit status 128.

https://dev.azure.com/cncf/envoy/_build/results?buildId=66038&view=logs&j=d1f76054-8f79-554b-6f4a-11d6a63b8e00&t=266e17e3-d213-54b5-deef-0dcee01da137&l=25588

mattklein123
mattklein123 previously approved these changes Feb 23, 2021
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@lambdai
Copy link
Contributor Author

lambdai commented Feb 23, 2021

fixing format

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@mattklein123 mattklein123 changed the title cluster: destroy on master thread cluster: destroy on main thread Feb 23, 2021
@mattklein123 mattklein123 merged commit 114d5ae into envoyproxy:main Feb 23, 2021
@lambdai
Copy link
Contributor Author

lambdai commented Feb 23, 2021

Thank you!

@lambdai lambdai added the backport/review Request to backport to stable releases label Feb 23, 2021
@lambdai
Copy link
Contributor Author

lambdai commented Feb 23, 2021

@cpakulski The buggy behavior caused lots of churns in istio. See the PR description. Can we backport it?

@cpakulski
Copy link
Contributor

@lambdai Sure. To which releases?

@lambdai
Copy link
Contributor Author

lambdai commented Feb 23, 2021

@lambdai Sure. To which releases?

Thanks! The newly added SdsCdsIntegrationTest doesn't use any recent features. I am afraid all the stable releases are impacted.

@cpakulski
Copy link
Contributor

OK - will port to all 4 latest releases.

@lambdai
Copy link
Contributor Author

lambdai commented Feb 23, 2021

Awesome. Thanks!

@Shikugawa Shikugawa added backport/approved Approved backports to stable releases and removed backport/review Request to backport to stable releases labels Feb 24, 2021
cpakulski pushed a commit to cpakulski/envoy that referenced this pull request Feb 25, 2021
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
cpakulski added a commit that referenced this pull request Mar 3, 2021
* Dispatcher: keeps a stack of tracked objects. (#14573)

Dispatcher will now keep a stack of tracked objects; on crash it'll "unwind" and have those objects dump their state. Moreover, it'll invoke fatal actions with the tracked objects. This allows us to dump more information during crash.

See related PR: #14509

Will follow up with another PR dumping information at the codec/parser level.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* cluster: destroy on main thread (#14954)

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* Updated release notes.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

Co-authored-by: Kevin Baichoo <kbaichoo@google.com>
Co-authored-by: Yuchen Dai <silentdai@gmail.com>
@ewoksly
Copy link

ewoksly commented Oct 5, 2023

OK - will port to all 4 latest releases.

@cpakulski could you please confirm starting with version of envoy is this fixed?

@cpakulski
Copy link
Contributor

I do not remember the starting version. It was long long time ago. Please check the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/approved Approved backports to stable releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants