Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix envoy crash during early shutdown #4937

Merged
merged 9 commits into from
Nov 2, 2018

Conversation

eziskind
Copy link
Contributor

@eziskind eziskind commented Nov 1, 2018

Description: This fixes a crash during early shutdown that occurs in the InstanceImpl destructor when it frees the ListenerManagerImpl object. PR #4244 tried to fix this but we're still seeing this periodically. In the crash scenario, the RdsRouteConfigSubscription destructor calls the initialize_callback_ function which ends up calling the InitManager initialize lambda created in the RunHelper constructor. That lambda has a call to workers_start_cb which is protected by a check on the RunHelpers shutdown_ field. However at the point this lambda is executed the RunHelper object has already been destructed (in InstanceImpl::run) so the shutdown_ field is dangling and can contain arbitrary content. In this case we ended up falling through and calling the workers_start_cb function which called InstanceImpl::startWorkers which caused the crash because InstanceImpl::listener_manager_ was null. This fix avoids the crash by making the lambda passed to the InitManager::initialize method not bind to the RunHelper object since it may not exist at the time the lambda is executed. Instead keep the shutdown state in the InstanceImpl object itself.
Risk Level: low
Testing: unit testing

Signed-off-by: Elisha Ziskind <eziskind@google.com>
Signed-off-by: Elisha Ziskind <eziskind@google.com>
/**
* @return whether the shutdown method has been called.
*/
virtual bool isShutdown() PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if the "don't call virtual functions in destructor" (e.g. https://www.artima.com/cppsource/nevercall.html) rule applies here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. I think the way I'm using it here is safe though - here is the relevant bit "Once a derived class destructor has run, the object's derived class data members assume undefined values, so C++ treats them as if they no longer exist. Upon entry to the base class destructor, the object becomes a base class object".

In this case, we're calling the virtual method (isShutdown) from the derived class destructor (InstanceImpl) not that of the base class (Instance) so the object should still be treated as an InstanceImpl and call the correct method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch Actually clang-analyzer-optin.cplusplus.VirtualCall check will warn about this (though, not an error now so it will pass) in clang-tidy CI. It doesn't seems the case?

In this case, we're calling the virtual method (isShutdown) from the derived class destructor (InstanceImpl) not that of the base class (Instance) so the object should still be treated as an InstanceImpl and call the correct method.

@eziskind if you make the virtual method called by the derived class destructor final, then clang-analyzer won't complain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@htuch htuch self-assigned this Nov 1, 2018
Signed-off-by: Elisha Ziskind <eziskind@google.com>
Signed-off-by: Elisha Ziskind <eziskind@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Thanks for fixing!

source/server/server.h Outdated Show resolved Hide resolved
Signed-off-by: Elisha Ziskind <eziskind@google.com>
mattklein123
mattklein123 previously approved these changes Nov 1, 2018
Signed-off-by: Elisha Ziskind <eziskind@google.com>
htuch
htuch previously approved these changes Nov 1, 2018
mattklein123
mattklein123 previously approved these changes Nov 1, 2018
@eziskind eziskind dismissed stale reviews from mattklein123 and htuch via 741c77b November 2, 2018 00:37
Signed-off-by: Elisha Ziskind <eziskind@google.com>
Signed-off-by: Elisha Ziskind <eziskind@google.com>
Signed-off-by: Elisha Ziskind <eziskind@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@htuch htuch merged commit ac2af97 into envoyproxy:master Nov 2, 2018
@eziskind eziskind deleted the shutdowncrash branch November 2, 2018 15:00
htuch added a commit to htuch/envoy that referenced this pull request Feb 21, 2019
Another heap-user-after-free, this time we were missing a fix that had been applied to server.h but
not config_validation/server.h (envoyproxy#4940). While working on this, attempted to make fully consistent and as
simple/clear as possible the constraints on member ordering.

This PR is in the tradition (!) of envoyproxy#5847, envoyproxy#4940, envoyproxy#4937. I think long-term we might want to think of
more dynamic and explicit declaration of ordering constraints, it's evidently pretty fragile. Also,
the lack of single-source-of-truth and duplication across prod and config server code bites again.

Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13228.

Risk level: Low
Testing: Corpus entry added.

Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added a commit that referenced this pull request Feb 21, 2019
Another heap-user-after-free, this time we were missing a fix that had been applied to server.h but
not config_validation/server.h (#4940). While working on this, attempted to make fully consistent and as
simple/clear as possible the constraints on member ordering.

This PR is in the tradition (!) of #5847, #4940, #4937. I think long-term we might want to think of
more dynamic and explicit declaration of ordering constraints, it's evidently pretty fragile. Also,
the lack of single-source-of-truth and duplication across prod and config server code bites again.

Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13228.

Risk level: Low
Testing: Corpus entry added.

Signed-off-by: Harvey Tuch <htuch@google.com>
fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
…#6023)

Another heap-user-after-free, this time we were missing a fix that had been applied to server.h but
not config_validation/server.h (envoyproxy#4940). While working on this, attempted to make fully consistent and as
simple/clear as possible the constraints on member ordering.

This PR is in the tradition (!) of envoyproxy#5847, envoyproxy#4940, envoyproxy#4937. I think long-term we might want to think of
more dynamic and explicit declaration of ordering constraints, it's evidently pretty fragile. Also,
the lack of single-source-of-truth and duplication across prod and config server code bites again.

Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13228.

Risk level: Low
Testing: Corpus entry added.

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants