Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #9784: fix concurrency issue with a shared_ptr #9828

Closed
wants to merge 2 commits into from
Closed

fixes #9784: fix concurrency issue with a shared_ptr #9828

wants to merge 2 commits into from

Conversation

dmitri-d
Copy link
Contributor

fixes #9784

@jmarantz: could you give this a try when you get a chance?

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d dmitri-d changed the title fix concurrency issue with a shared_ptr fixes #9784: fix concurrency issue with a shared_ptr Jan 24, 2020
@dmitri-d
Copy link
Contributor Author

Hmm, test failures look weird -- malloc failures in clang and ld processes.

lambdai
lambdai previously approved these changes Jan 25, 2020
Copy link
Contributor

@lambdai lambdai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the original code has any problem. It could be the false positive of TSAN
If this PR doesn't work, we can disable this test in TSAN

@dmitri-d
Copy link
Contributor Author

dmitri-d commented Jan 25, 2020

@lambdai: Thanks for taking a look! I was moving the original shared_ptr all the way to weak_ptr allocated in a different thread. This could be the issue behind the failure reported in #9784.

@jmarantz
Copy link
Contributor

jmarantz commented Jan 25, 2020

I patched in the change and it does not seem to help. I am reproing with:

bazel test --config=clang-tsan --define=tcmalloc=disabled --config=libc++ \
   //test/integration:vhds_integration_test --runs_per_test=100

WIthout this change it failed 98/100 times. With this change, it failed 97/100 times. I suspect that when you are not running concurrently on a single machine it works pretty often (passed for me 3x in a row, taking ~18 seconds each), but something about the timing in a multi-run case causes problems.

Why do we think this might be a tsan false positive? I've been using tsan quite a while and can't think of a case where it warned about something that was not a real problem.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Copy link
Contributor Author

dmitri-d commented Jan 27, 2020

I'm copying shared_ptr explicitly here and then use that to create a weak_ptr, also explicitly. I do not see any concurrency issues when building under gcc, @jmarantz could you give this another try?

@yanavlasov yanavlasov self-assigned this Jan 27, 2020
@jmarantz
Copy link
Contributor

You are not able to repro with the instructions I gave? Or are you not using Linux/clang?

@dmitri-d
Copy link
Contributor Author

@jmarantz: I'm on linux/gcc and not seeing the failures you described.

@jmarantz
Copy link
Contributor

Dmitri, to clarify, when you say "I am not seeing the failures you describe", do you mean you ran the exact command I gave to repro and it passed 100/100 times?

Are did you have to change the command to run a command that works in gcc to test it?

is it possible for you to install clang on your system to test the exact repro I gave?

@stale
Copy link

stale bot commented Feb 7, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 7, 2020
@dmitri-d
Copy link
Contributor Author

dmitri-d commented Feb 11, 2020

@jmarantz, apologies, just saw your comment.

do you mean you ran the exact command I gave to repro and it passed 100/100 times?

Yes, the test suite passes 100/100 when built/executed under gcc using the exact command you gave.

is it possible for you to install clang on your system to test the exact repro I gave?

Not at the moment, dealing with various Maistra release issues, which is built under gcc. Changing the build system would throw a wrench into that.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Feb 11, 2020
@jmarantz
Copy link
Contributor

Can you run this under clang though? I don't think gcc could process --config=clang-tsan and in fact I'm surprised that worked at all.

Also can you merge master?

@jmarantz
Copy link
Contributor

/wait

@stale
Copy link

stale bot commented Feb 21, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 21, 2020
@stale
Copy link

stale bot commented Feb 28, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Feb 28, 2020
@jmarantz jmarantz reopened this Feb 28, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Feb 28, 2020
@jmarantz jmarantz added the no stalebot Disables stalebot from closing an issue label Feb 28, 2020
@jmarantz
Copy link
Contributor

Hi @dmitri-d -- are you going to be able to get the chance to pick this up and repro with clang?
Per above, I don't think it makes sense to try to repro with gcc as it's under clang's tsan checks that we notice the failure.

@dmitri-d
Copy link
Contributor Author

dmitri-d commented Feb 28, 2020

Hey @jmarantz, nods, I will try to reproduce this with clang, I might get a chance next week. Was pretty busy with ServiceMesh release...

@dmitri-d
Copy link
Contributor Author

dmitri-d commented Mar 3, 2020

Hmm. I'm not sure what's going on here. A race is reported here: https://github.com/envoyproxy/envoy/blob/master/source/common/router/rds_impl.cc#L303 when current_cb weak_ptr goes out of scope and its destructor is called:

 Write of size 8 at 0x7b140002c510 by thread T7:
    #0 free /local/mnt/workspace/bcain_0721/llvm/utils/release/final/llvm.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:706:3 (vhds_integration_test+0x21e9578)
    #1 std::__1::_DeallocateCaller::__do_call(void*) /home/wb/opt/clang+llvm/bin/../include/c++/v1/new:334:12 (vhds_integration_test+0x22a1398)
    #2 std::__1::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned long) /home/wb/opt/clang+llvm/bin/../include/c++/v1/new:292:12 (vhds_integration_test+0x22a1364)
    #3 std::__1::_DeallocateCaller::__do_deallocate_handle_size_align(void*, unsigned long, unsigned long) /home/wb/opt/clang+llvm/bin/../include/c++/v1/new:262:12 (vhds_integration_test+0x22a1304)
    #4 std::__1::__libcpp_deallocate(void*, unsigned long, unsigned long) /home/wb/opt/clang+llvm/bin/../include/c++/v1/new:340:3 (vhds_integration_test+0x22a1298)
    #5 std::__1::allocator<std::__1::__shared_ptr_emplace<std::__1::function<void (bool)>, std::__1::allocator<std::__1::function<void (bool)> > > >::deallocate(std::__1::__shared_ptr_emplace<std::__1::function<void (bool)>, std::__1::allocator<std::__1::function<void (bool)> > >*, unsigned long) /home/wb/opt/clang+llvm/bin/../include/c++/v1/memory:1816:10 (vhds_integration_test+0x31717a1)
    #6 std::__1::__shared_ptr_emplace<std::__1::function<void (bool)>, std::__1::allocator<std::__1::function<void (bool)> > >::__on_zero_shared_weak() /home/wb/opt/clang+llvm/bin/../include/c++/v1/memory:3593:9 (vhds_integration_test+0x3171095)
    #7 std::__1::weak_ptr<std::__1::function<void (bool)> >::~weak_ptr() /home/wb/opt/clang+llvm/bin/../include/c++/v1/memory:5037:19 (vhds_integration_test+0x32d9e37)
    #8 Envoy::Router::RdsRouteConfigProviderImpl::onConfigUpdate() /proc/self/cwd/source/common/router/rds_impl.cc:303:5 (vhds_integration_test+0x32bf78a)

The previous read from another thread was here:

  #0 std::__1::__shared_count::__release_shared() /home/wb/opt/clang+llvm/bin/../include/c++/v1/memory:3415:9 (vhds_integration_test+0x22aea7f)
    #1 std::__1::__shared_weak_count::__release_shared() /home/wb/opt/clang+llvm/bin/../include/c++/v1/memory:3457:27 (vhds_integration_test+0x22aea02)
    #2 std::__1::shared_ptr<std::__1::function<void (bool)> >::~shared_ptr() /home/wb/opt/clang+llvm/bin/../include/c++/v1/memory:4393:19 (vhds_integration_test+0x314c877)
    #3 Envoy::Extensions::HttpFilters::OnDemand::OnDemandRouteUpdate::~OnDemandRouteUpdate() /proc/self/cwd/bazel-out/k8-dbg/bin/source/extensions/filters/http/on_demand/_virtual_includes/on_demand_update_lib/extensions/filters/http/on_demand/on_demand_update.h:10:7 (vhds_integration_test+0x314c7f5)

when the OnDemandRouteUpdate filter's destructor that holds the original callback shared_ptr was called. I checked that I don't pass references to shared_ptr/weak_ptr from one thread to another anywhere in the code, only copies. With my understanding of shared_ptr/weak_ptr concurrency, what I'm doing is correct and shouldn't cause any issues. Perhaps someone else can take a look? @lambdai would you have time?

@dmitri-d
Copy link
Contributor Author

ping

@mattklein123 mattklein123 removed the no stalebot Disables stalebot from closing an issue label Apr 22, 2020
@stale
Copy link

stale bot commented Apr 29, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 29, 2020
@stale
Copy link

stale bot commented May 7, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this May 7, 2020
@dmitri-d dmitri-d deleted the fix-9784 branch May 20, 2020 19:45
@dmitri-d dmitri-d restored the fix-9784 branch May 20, 2020 19:45
@alyssawilk alyssawilk reopened this May 27, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label May 27, 2020
@alyssawilk alyssawilk closed this May 27, 2020
@dmitri-d dmitri-d deleted the fix-9784 branch July 7, 2020 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vhds: tsan failure in test/integration/vhds_integration_test
6 participants