std: Avoid locks during TLS destruction on Windows #41512

alexcrichton · 2017-04-24T18:50:03Z

Gecko recently had a bug reported 1 with a deadlock in the Rust TLS
implementation for Windows. TLS destructors are implemented in a sort of ad-hoc
fashion on Windows as it doesn't natively support destructors for TLS keys. To
work around this the runtime manages a list of TLS destructors and registers a
hook to get run whenever a thread exits. When a thread exits it takes a look at
the list and runs all destructors.

Unfortunately it turns out that there's a lock which is held when our "at thread
exit" callback is run. The callback then attempts to acquire a lock protecting
the list of TLS destructors. Elsewhere in the codebase while we hold a lock over
the TLS destructors we try to acquire the same lock held first before our
special callback is run. And as a result, deadlock!

This commit sidesteps the issue with a few small refactorings:

Removed support for destroying a TLS key on Windows. We don't actually ever
exercise this as a public-facing API, and it's only used during lazy_init
during racy situations. To handle that we just synchronize lazy_init
globally on Windows so we never have to call destroy.
With no need to support removal the global synchronized Vec was tranformed
to a lock-free linked list. With the removal of locks this means that
iteration no long requires a lock and as such we won't run into the deadlock
problem mentioned above.

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup. Unfortunately
I don't know of a great way around that, but this at least fixes the immediate
problem that Gecko was seeing which is that with "well behaved" destructors the
system would still deadlock!

rust-highfive · 2017-04-24T18:50:13Z

r? @BurntSushi

(rust_highfive has picked a reviewer for you, use r? to override)

alexcrichton · 2017-04-24T18:53:26Z

cc @brson

alexcrichton · 2017-04-24T18:53:30Z

cc @vadimcn

mzji · 2017-04-24T19:37:25Z

@retep998 Is there a common solution for this problem on Windows?

vadimcn · 2017-04-25T19:36:50Z

Well, the standard Windows solution is "Just don't call any Windows APIs (with a few exceptions) while the loader lock is taken". Which means that if one would like to run arbitrary code on library load/unload, they must provide some kind of Initialize/Shutdown calls in said library.

vadimcn · 2017-04-25T19:49:14Z

In this case, though, this isn't about running destructors on shutdown, but rather an unfortunate interaction of Widnows XP fallback code in the implementation of Mutex, which may invoke GetProcAddress if this is the first usage of Mutex in the process. So another way to fix this particular problem would be to use a spin-lock in TLS code.

alexcrichton · 2017-04-26T04:15:16Z

@vadimcn note that this PR fixes the TLS issue by just removing all locks. I tweaked the interface a bit so the locks weren't needed, and the public facing API hasn't changed at all.

My main question would be that this is a pretty unfortunate drawback that affects all TLS destructors in Rust, so standard destructors that, say, grab a lock or otherwise go into the Windows runtime may be affected? That seems... bad :(

vadimcn · 2017-04-26T07:30:41Z

Yeah, I don't know of a 100% safe way of running arbitrary code on DLL unload ☹️ .

alexcrichton · 2017-05-04T18:32:20Z

cc @brson (you may be able to review as well)

BurntSushi · 2017-05-04T20:31:50Z

src/libstd/sys/windows/thread_local.rs

-                c::TlsSetValue(key, ptr::null_mut());
-                dtor(ptr as *mut _);
+                c::TlsSetValue((*cur).key, ptr::null_mut());
+                ((*cur).dtor)(ptr as *mut _);


What prevents this section of code from racing? That is, can more than one thread be here trying to run the dtor of the same key? (This is likely for my own edification, since I can see the old code did this as well. One of my assumptions is that run_dtors is running every time a thread exits, right?)

Correct yeah this can run concurrently, but the dtor is just a raw function pointer so it's up to the dtor to have internal synchronization if neessary.

The list of global dtors is only prepended to so taking any location and iterating over it should always be valid in terms of avoiding use-after-free.

BurntSushi · 2017-05-04T20:35:31Z

This LGTM and the approach seems sensible. I had one question about the code and one more:

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup.

Could you unpack this a bit? In particular, I don't know enough to understand how the second sentence follows from the first.

alexcrichton · 2017-05-04T21:06:48Z

Sure yeah, so apparently what's happening is that when a TLS destructor is run the Windows runtime holds at least one lock, the dynamic library lookup lock. I'm not sure if other locks are held. This means that if a TLS dtor attempts to grab lock A, whereas elsewhere in the code you grab lock A and then do something that grabs the dynamic library lookup lock you'll deadlock (as seen here).

I'm basically just not sure there's a way to run TLS dtors on Windows in a situation where "almost arbitrary code is allowed" in the same sense that arbitrary code is allowed in the main function.

BurntSushi · 2017-05-04T21:33:48Z

Seems sensible, thanks for explaining! @bors r+

bors · 2017-05-04T21:33:49Z

📌 Commit fffe254 has been approved by BurntSushi

bors · 2017-05-04T22:43:05Z

⌛ Testing commit fffe254 with merge 91ddbd3...

bors · 2017-05-04T22:57:44Z

💔 Test failed - status-appveyor

Mark-Simulacrum · 2017-05-04T23:22:47Z

Tidy failed? This is odd, since I don't see any error message...

I'm going to retry since it feels spurious but not opening an issue yet since, well, I'm not sure:

@bors retry

command did not execute successfully: "C:\\projects\\rust\\build\\x86_64-pc-windows-msvc\\stage0-tools\\x86_64-pc-windows-msvc\\release\\tidy.exe" "C:\\projects\\rust\\src" "--no-vendor"
expected success, got: exit code: 3221225477

…ck, r=BurntSushi std: Avoid locks during TLS destruction on Windows Gecko recently had a bug reported [1] with a deadlock in the Rust TLS implementation for Windows. TLS destructors are implemented in a sort of ad-hoc fashion on Windows as it doesn't natively support destructors for TLS keys. To work around this the runtime manages a list of TLS destructors and registers a hook to get run whenever a thread exits. When a thread exits it takes a look at the list and runs all destructors. Unfortunately it turns out that there's a lock which is held when our "at thread exit" callback is run. The callback then attempts to acquire a lock protecting the list of TLS destructors. Elsewhere in the codebase while we hold a lock over the TLS destructors we try to acquire the same lock held first before our special callback is run. And as a result, deadlock! This commit sidesteps the issue with a few small refactorings: * Removed support for destroying a TLS key on Windows. We don't actually ever exercise this as a public-facing API, and it's only used during `lazy_init` during racy situations. To handle that we just synchronize `lazy_init` globally on Windows so we never have to call `destroy`. * With no need to support removal the global synchronized `Vec` was tranformed to a lock-free linked list. With the removal of locks this means that iteration no long requires a lock and as such we won't run into the deadlock problem mentioned above. Note that it's still a general problem that you have to be extra super careful in TLS destructors. For example no code which runs a TLS destructor on Windows can call back into the Windows API to do a dynamic library lookup. Unfortunately I don't know of a great way around that, but this at least fixes the immediate problem that Gecko was seeing which is that with "well behaved" destructors the system would still deadlock! [1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1358151

bors · 2017-05-05T00:48:15Z

⌛ Testing commit fffe254 with merge 090ec35...

bors · 2017-05-05T01:00:12Z

💔 Test failed - status-appveyor

Mark-Simulacrum · 2017-05-05T01:13:26Z

Hmm, no, not a spurious failure, most likely. Maybe it's something to do with thread local storage on Windows? No idea if tidy uses thread locals, though.

alexcrichton · 2017-05-05T03:49:03Z

Yeah definitely looks like a segfault related to this, I'll take a look

alexcrichton · 2017-05-05T04:14:50Z

@bors: r=BurntSushi

bors · 2017-05-05T04:14:50Z

📌 Commit bdf6194 has been approved by BurntSushi

bors · 2017-05-05T11:05:15Z

⌛ Testing commit bdf6194 with merge 51175c9...

bors · 2017-05-05T12:50:35Z

💔 Test failed - status-appveyor

frewsxcv · 2017-05-05T12:57:15Z

Failure seems legit:

fatal runtime error: can't destroy tls keys on windows

Gecko recently had a bug reported [1] with a deadlock in the Rust TLS implementation for Windows. TLS destructors are implemented in a sort of ad-hoc fashion on Windows as it doesn't natively support destructors for TLS keys. To work around this the runtime manages a list of TLS destructors and registers a hook to get run whenever a thread exits. When a thread exits it takes a look at the list and runs all destructors. Unfortunately it turns out that there's a lock which is held when our "at thread exit" callback is run. The callback then attempts to acquire a lock protecting the list of TLS destructors. Elsewhere in the codebase while we hold a lock over the TLS destructors we try to acquire the same lock held first before our special callback is run. And as a result, deadlock! This commit sidesteps the issue with a few small refactorings: * Removed support for destroying a TLS key on Windows. We don't actually ever exercise this as a public-facing API, and it's only used during `lazy_init` during racy situations. To handle that we just synchronize `lazy_init` globally on Windows so we never have to call `destroy`. * With no need to support removal the global synchronized `Vec` was tranformed to a lock-free linked list. With the removal of locks this means that iteration no long requires a lock and as such we won't run into the deadlock problem mentioned above. Note that it's still a general problem that you have to be extra super careful in TLS destructors. For example no code which runs a TLS destructor on Windows can call back into the Windows API to do a dynamic library lookup. Unfortunately I don't know of a great way around that, but this at least fixes the immediate problem that Gecko was seeing which is that with "well behaved" destructors the system would still deadlock! [1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1358151

alexcrichton · 2017-05-05T14:00:03Z

@bors: r=BurntSushi

bors · 2017-05-05T14:00:04Z

📌 Commit 495c998 has been approved by BurntSushi

…ck, r=BurntSushi std: Avoid locks during TLS destruction on Windows Gecko recently had a bug reported [1] with a deadlock in the Rust TLS implementation for Windows. TLS destructors are implemented in a sort of ad-hoc fashion on Windows as it doesn't natively support destructors for TLS keys. To work around this the runtime manages a list of TLS destructors and registers a hook to get run whenever a thread exits. When a thread exits it takes a look at the list and runs all destructors. Unfortunately it turns out that there's a lock which is held when our "at thread exit" callback is run. The callback then attempts to acquire a lock protecting the list of TLS destructors. Elsewhere in the codebase while we hold a lock over the TLS destructors we try to acquire the same lock held first before our special callback is run. And as a result, deadlock! This commit sidesteps the issue with a few small refactorings: * Removed support for destroying a TLS key on Windows. We don't actually ever exercise this as a public-facing API, and it's only used during `lazy_init` during racy situations. To handle that we just synchronize `lazy_init` globally on Windows so we never have to call `destroy`. * With no need to support removal the global synchronized `Vec` was tranformed to a lock-free linked list. With the removal of locks this means that iteration no long requires a lock and as such we won't run into the deadlock problem mentioned above. Note that it's still a general problem that you have to be extra super careful in TLS destructors. For example no code which runs a TLS destructor on Windows can call back into the Windows API to do a dynamic library lookup. Unfortunately I don't know of a great way around that, but this at least fixes the immediate problem that Gecko was seeing which is that with "well behaved" destructors the system would still deadlock! [1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1358151

Rollup of 9 pull requests - Successful merges: #41064, #41307, #41512, #41582, #41678, #41722, #41734, #41761, #41763 - Failed merges:

rust-highfive assigned BurntSushi Apr 24, 2017

alexcrichton force-pushed the fix-windows-tls-deadlock branch from c5537b7 to fffe254 Compare April 25, 2017 17:13

aidanhs added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 26, 2017

BurntSushi reviewed May 4, 2017

View reviewed changes

frewsxcv mentioned this pull request May 4, 2017

Rollup of 8 pull requests #41759

Closed

alexcrichton force-pushed the fix-windows-tls-deadlock branch from fffe254 to bdf6194 Compare May 5, 2017 04:13

alexcrichton force-pushed the fix-windows-tls-deadlock branch from bdf6194 to 495c998 Compare May 5, 2017 13:59

frewsxcv mentioned this pull request May 5, 2017

Rollup of 9 pull requests #41773

Merged

bors added a commit that referenced this pull request May 5, 2017

Auto merge of #41773 - frewsxcv:rollup, r=frewsxcv

42a4f37

Rollup of 9 pull requests - Successful merges: #41064, #41307, #41512, #41582, #41678, #41722, #41734, #41761, #41763 - Failed merges:

bors merged commit 495c998 into rust-lang:master May 6, 2017

alexcrichton deleted the fix-windows-tls-deadlock branch June 14, 2017 18:40

std: Avoid locks during TLS destruction on Windows #41512

std: Avoid locks during TLS destruction on Windows #41512

Uh oh!

Conversation

alexcrichton commented Apr 24, 2017

Uh oh!

rust-highfive commented Apr 24, 2017

Uh oh!

alexcrichton commented Apr 24, 2017

Uh oh!

alexcrichton commented Apr 24, 2017

Uh oh!

mzji commented Apr 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vadimcn commented Apr 25, 2017

Uh oh!

vadimcn commented Apr 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Apr 26, 2017

Uh oh!

vadimcn commented Apr 26, 2017

Uh oh!

alexcrichton commented May 4, 2017

Uh oh!

BurntSushi May 4, 2017

Choose a reason for hiding this comment

Uh oh!

alexcrichton May 4, 2017

Choose a reason for hiding this comment

Uh oh!

BurntSushi commented May 4, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented May 4, 2017

Uh oh!

BurntSushi commented May 4, 2017

Uh oh!

bors commented May 4, 2017

Uh oh!

bors commented May 4, 2017

Uh oh!

bors commented May 4, 2017

Uh oh!

Mark-Simulacrum commented May 4, 2017

Uh oh!

bors commented May 5, 2017

Uh oh!

bors commented May 5, 2017

Uh oh!

Mark-Simulacrum commented May 5, 2017

Uh oh!

alexcrichton commented May 5, 2017

Uh oh!

alexcrichton commented May 5, 2017

Uh oh!

bors commented May 5, 2017

Uh oh!

bors commented May 5, 2017

Uh oh!

bors commented May 5, 2017

Uh oh!

frewsxcv commented May 5, 2017

Uh oh!

alexcrichton commented May 5, 2017

Uh oh!

bors commented May 5, 2017

Uh oh!

Uh oh!

mzji commented Apr 24, 2017 •

edited

Loading

vadimcn commented Apr 25, 2017 •

edited

Loading

BurntSushi commented May 4, 2017 •

edited

Loading