RFC: Add a thread local storage module, std::tls #461

Merged
merged 3 commits into from
Nov 21, 2014
331 changes: 331 additions & 0 deletions text/0000-tls-overhaul.md
- Start Date: (fill me in with today's date, YYYY-MM-DD)
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Introduce a new thread local storage module to the standard library, `std::tls`,
providing:

* Scoped TLS, a non-owning variant of TLS for any value.
* Owning TLS, an owning, dynamically initialized, dynamically destructed
variant, similar to `std::local_data` today.

# Motivation

In the past, the standard library's answer to thread local storage was the
`std::local_data` module. This module was designed based on the Rust task model
where a task could be either a 1:1 or M:N task. This design constraint has
[since been lifted][runtime-rfc], allowing for easier solutions to some of the
current drawbacks of the module. While redesigning `std::local_data`, it can
also be scrutinized to see how it holds up to modern-day Rust style, guidelines,
and conventions.

[runtime-rfc]: https://github.com/rust-lang/rfcs/blob/master/text/0230-remove-runtime.md

In general the amount of work being scheduled for 1.0 is being trimmed down as
much as possible, especially new work in the standard library that isn't focused
on cutting back what we're shipping. Thread local storage, however, is such a
critical part of many applications and opens many doors to interesting sets of
functionality that this RFC sees fit to try and wedge it into the schedule. The
current `std::local_data` module simply doesn't meet the requirements of what
one may expect out of a TLS implementation for a language like Rust.

## Current Drawbacks

Today's implementation of thread local storage, `std::local_data`, suffers from
a few drawbacks:

* The implementation is not super speedy, and it is unclear how to enhance the
existing implementation to be on par with OS-based TLS or `#[thread_local]`
support. As an example, today a lookup takes `O(log N)` time where N is the
number of set TLS keys for a task.

This drawback is also not to be taken lightly. TLS is a fundamental building
block for rich applications and libraries, and an inefficient implementation
will only deter usage of an otherwise quite useful construct.

* The types which can be stored into TLS are not maximally flexible. Currently
only types which ascribe to `'static` can be stored into TLS. It's often the
case that a type with references needs to be placed into TLS for a short
period of time, however.

* The interactions between TLS destructors and TLS itself are not currently very
well specified, and they can easily lead to difficult-to-debug runtime panics or
undocumented leaks.

* The implementation currently assumes a local `Task` is available. Once the
runtime removal is complete, this will no longer be a valid assumption.

## Current Strengths

There are, however, a few pros to the usage of the module today which should be
required for any replacement:

* All platforms are supported.
* `std::local_data` allows consuming ownership of data, allowing it to live past
the current stack frame.

## Building blocks available

There are currently two primary building blocks available to Rust when building
a thread local storage abstraction, `#[thread_local]` and OS-based TLS. Neither
of these are currently used for `std::local_data`, but are generally seen as
"adequately efficient" implementations of TLS. For example, an TLS access of a
`#[thread_local]` global is simply a pointer offset, which when compared to a
`O(log N)` lookup is quite speedy!

With these available, this RFC is motivated in redesigning TLS to make use of
these primitives.
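
For illustration, a minimal sketch of the `#[thread_local]` building block (assuming the unstable `thread_local` feature gate and a hypothetical `COUNTER`/`bump` pair); it provides no lazy initialization or destructors on its own, which is what the modules proposed below add:

```rust
// Sketch only: raw `#[thread_local]` statics behind the unstable feature gate.
#![feature(thread_local)]

#[thread_local]
static mut COUNTER: u32 = 0;

fn bump() -> u32 {
    // Raw `#[thread_local]` statics must be touched in `unsafe` code; the
    // access itself compiles down to a load at a fixed offset from the
    // thread's TLS base rather than a table lookup.
    unsafe {
        COUNTER += 1;
        COUNTER
    }
}
```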

# Detailed design

Three new modules will be added to the standard library:

* The `std::sys::tls` module provides platform-agnostic bindings to the OS-based
TLS support. This support is intended to only be used in otherwise unsafe code
as it supports getting and setting a `*mut u8` parameter only.

* The `std::tls` module provides a dynamically initialized and dynamically
destructed variant of TLS. This is very similar to the current
`std::local_data` module, except that the implicit `Option<T>` is not
mandated as an initialization expression is required.

* The `std::tls::scoped` module provides a flavor of TLS which can store a
reference to any type `T` for a scoped period of time. This is a variant of TLS
not provided today. The backing idea is that if a reference only lives in TLS
for a bounded period of time, then there's no need for TLS to consume ownership of
the value itself.

This pattern of TLS is quite common throughout the compiler's own usage of
`std::local_data` and often more expressive as no dances are required to move
a value into and out of TLS.

The design described below can be found as an existing cargo package:
https://github.com/alexcrichton/tls-rs.

## The OS layer

While LLVM has support for `#[thread_local]` statics, this feature is not
supported on all platforms that LLVM can target. Almost all platforms, however,
provide some form of OS-based TLS. For example Unix normally comes with
`pthread_key_create` while Windows comes with `TlsAlloc`.

This RFC proposes introducing a `std::sys::tls` module which contains bindings
to the OS-based TLS mechanism. This corresponds to the `os` module in the
example implementation. While not currently public, the contents of `sys` are
slated to become public over time, and the API of the `std::sys::tls` module
will undergo API stabilization at that time.

This module will support "statically allocated" keys as well as dynamically
allocated keys. A statically allocated key will actually allocate the underlying
OS key on first use.
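
To make the shape of that low-level layer concrete, here is a sketch in the same declaration style the RFC uses below; the names and signatures (`Key::new`, `get`, `set`, the destructor type) are assumptions about what such a module could look like, not text taken from the RFC:

```rust
// Illustrative sketch only: one possible surface for `std::sys::tls`.
pub struct Key { /* lazily created pthread_key_t / TlsAlloc index */ }

impl Key {
    /// Allocate an OS TLS key, optionally registering a destructor which is
    /// run on a thread's value (if non-null) when that thread exits.
    pub fn new(dtor: Option<unsafe extern "C" fn(*mut u8)>) -> Key { /* ... */ }

    /// Store a raw pointer into the calling thread's slot. No type or
    /// lifetime information is attached, hence the `unsafe`.
    pub unsafe fn set(&self, value: *mut u8) { /* ... */ }

    /// Read the calling thread's slot, returning a null pointer if it has
    /// never been set on this thread.
    pub unsafe fn get(&self) -> *mut u8 { /* ... */ }
}
```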

### Destructor support

The major difference between Unix and Windows TLS support is that Unix supports


It's also possible to use only `#[thread_local]` with no secondary `pthread_key_create` / synchronization by using C++11 destructor support. On Linux, glibc defines a `fn __cxa_thread_atexit_impl(dtor: unsafe extern "C" fn(ptr: *mut c_void), ptr: *mut c_void, dso_symbol: *mut i8)` where `dso_symbol` can just be retrieved by defining `static mut __dso_handle: i8`. On OS X, there's a `fn _tlv_atexit(dtor: unsafe extern "C" fn(ptr: *mut c_void), ptr: *mut c_void);` function. Rust could use the weak symbol trick to call these when available and fall back to a crappier implementation on top of dynamic TLS.
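
A rough sketch of binding those hooks follows; the declarations are adapted from the comment above and glibc's signature, and the weak-symbol resolution and fallback are only described in comments rather than implemented. None of this is taken from the RFC text:

```rust
// Sketch only: declarations for the C++11-style TLS destructor hooks
// described above. Real code would resolve these as weak symbols and fall
// back to OS-based TLS destructors when they are missing.
extern crate libc;

use libc::{c_int, c_void};

extern "C" {
    // glibc: register `dtor(obj)` to run when the current thread exits.
    fn __cxa_thread_atexit_impl(dtor: unsafe extern "C" fn(*mut c_void),
                                obj: *mut c_void,
                                dso_symbol: *mut c_void) -> c_int;

    // OS X (dyld): the equivalent registration function.
    fn _tlv_atexit(dtor: unsafe extern "C" fn(*mut c_void), obj: *mut c_void);
}
```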

Member Author

Indeed! If you take a look at the sample implementation you'll see that it does precisely that!

a destructor function for each TLS slot while Windows does not. When each Unix
TLS key is created, an optional destructor is specified. If any key has a
non-NULL value when a thread exits, the destructor is then run on that value.

One possibility for this `std::sys::tls` module would be to not provide
destructor support at all (least common denominator), but this RFC proposes
implementing destructor support for Windows to ensure that functionality is not
lost when writing Unix-only code.

Destructor support for Windows will be provided through a custom implementation
of tracking known destructors for TLS keys.
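
One plausible shape for that tracking, sketched here purely as an assumption (the RFC does not spell out the mechanism): keep a process-global registry of slot/destructor pairs and drain it from a per-thread exit callback.

```rust
// Sketch only: tracking destructors for Windows TLS keys by hand.
use std::ptr;
use std::sync::Mutex;

type Dtor = unsafe fn(*mut u8);

// Process-global registry mapping TLS slot indices to their destructors.
static DTORS: Mutex<Vec<(u32, Dtor)>> = Mutex::new(Vec::new());

fn register_dtor(index: u32, dtor: Dtor) {
    DTORS.lock().unwrap().push((index, dtor));
}

// Called from a per-thread exit callback (for example via the CRT callback
// section trick used by Chromium, linked in the review comment below).
// `get` and `set` stand in for raw TlsGetValue/TlsSetValue-style accessors.
unsafe fn run_dtors(get: impl Fn(u32) -> *mut u8, set: impl Fn(u32, *mut u8)) {
    // Snapshot the registry so a destructor that allocates a new key does not
    // deadlock on the registry lock.
    let dtors = DTORS.lock().unwrap().clone();
    for (index, dtor) in dtors {
        let value = get(index);
        if !value.is_null() {
            set(index, ptr::null_mut());
            dtor(value);
        }
    }
}
```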


On Windows, you can set a callback for when a thread exits, and you can then iterate over all TLS keys and destroy them if they are set.

See https://github.com/ChromiumWebApps/chromium/blob/master/base/threading/thread_local_storage_win.cc#L42 for how to do it without using DllMain.

Iterating over all possible TLS keys is quadratic instead of linear if they are sparsely used, but it should be possible to write code that directly accesses the Windows TEB to avoid this if desired.

Member Author

Indeed! I ran across that same trick when whipping up the sample implementation. I definitely agree that the implementation can be improved as well!


## Scoped TLS

As discussed before, one of the motivations for this RFC is to provide a method
of inserting any value into TLS, not just those that ascribe to `'static`. This
provides maximal flexibility in storing values into TLS to ensure any "thread
local" pattern can be encompassed.

Values which do not adhere to `'static` contain references with a constrained
lifetime, and can therefore not be moved into TLS. They can, however, be
*borrowed* by TLS. This scoped TLS API provides the ability to insert a
reference for a particular period of time, and then a non-escaping reference can
be extracted at any time later on.

In order to implement this form of TLS, a new module, `std::tls::scoped`, will
be added. It will be coupled with a `scoped_tls!` macro in the prelude. The API
looks like:

```rust
/// Declares a new scoped TLS key. The keyword `static` is required in front to
/// emphasize that a `static` item is being created. There is no initializer
/// expression because this key initially contains no value.
///
/// A `pub` variant is also provided to generate a public `static` item.
macro_rules! scoped_tls(
    (static $name:ident: $t:ty) => (/* ... */);
    (pub static $name:ident: $t:ty) => (/* ... */);
)

/// A structure representing a scoped TLS key.
///
/// This structure cannot be created dynamically, and it is accessed via its
/// methods.
pub struct Key<T> { /* ... */ }

impl<T> Key<T> {
    /// Insert a value into this scoped TLS slot for the duration of a closure.
    ///
    /// While `cb` is running, the value `t` will be yielded by `with` unless
    /// this function is called recursively inside of `cb`.
    ///
    /// Upon return, this function will restore the previous TLS value, if any
    /// was available.
    pub fn set<R>(&'static self, t: &T, cb: || -> R) -> R { /* ... */ }

    /// Get a value out of this scoped TLS variable.
    ///
    /// This function takes a closure which receives the value of this TLS
    /// variable, if any is available. If this variable has not yet been set,
    /// then `None` is yielded.
    pub fn with<R>(&'static self, cb: |Option<&T>| -> R) -> R { /* ... */ }
}
```

The purpose of this module is to enable the ability to insert a value into TLS
for a scoped period of time. While able to cover many TLS patterns, this flavor
of TLS is not comprehensive, motivating the owning variant of TLS.
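
As a usage sketch (hypothetical `Schema` type and functions, written against the proposed API rather than anything that compiles today), the pattern looks like this:

```rust
// Usage sketch of the proposed scoped TLS API (illustration only).
struct Schema { version: uint }

scoped_tls!(static CURRENT_SCHEMA: Schema)

fn with_schema() {
    let schema = Schema { version: 2 };

    // Borrow `schema` into TLS only for the duration of the closure; the
    // value never needs to be `'static` and ownership is not transferred.
    CURRENT_SCHEMA.set(&schema, || {
        validate();
    });
    // After `set` returns, the slot is restored to its previous (empty) state.
}

fn validate() {
    // Deep in the call stack, read the borrowed value back out.
    CURRENT_SCHEMA.with(|schema| {
        assert_eq!(schema.expect("schema not set").version, 2);
    });
}
```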

### Variations

Specifically the `with` API can be somewhat unwieldy to use. The `with` function
takes a closure to run, yielding a value to the closure. It is believed that
this is required for the implementation to be sound, but it also goes against
the "use RAII everywhere" principle found elsewhere in the stdlib.

Additionally, the `with` function is more commonly called `get` for accessing a
contained value in the stdlib. The name `with` is recommended because it may be
possible in the future to express a `get` function returning a reference with a
lifetime bound to the stack frame of the caller, but it is not currently
possible to do so.

The `with` function yields an `Option<&T>` instead of `&T`. This is to cover
the use case where the key has not been `set` before it is used via `with`. This
is somewhat unergonomic, however, as it will almost always be followed by
`unwrap()`. An alternative design would be to provide an `is_set` function and
have `with` `panic!` instead.

## Owning TLS

Although scoped TLS can store any value, it is also limited in the fact that it
cannot own a value. This means that TLS values cannot escape the stack frame
from which they originated. This is itself another common usage pattern of TLS,
and to solve this problem the `std::tls` module will provide support for
placing owned values into TLS.

These values must not contain references as that could trigger a use-after-free,
but otherwise there are no restrictions on placing statics into owned TLS. The
module will support dynamic initialization (run on first use of the variable) as
well as dynamic destruction (implementors of `Drop`).

The interface provided will be similar to what `std::local_data` provides today,
except that the `replace` function has no analog (it would be written with a
`RefCell<Option<T>>`).

```rust
/// Similar to the `scoped_tls!` macro, except allows for an initializer
/// expression as well.
macro_rules! tls(
    (static $name:ident: $t:ty = $init:expr) => (/* ... */);
    (pub static $name:ident: $t:ty = $init:expr) => (/* ... */);
)

pub struct Key<T: 'static> { /* ... */ }

impl<T: 'static> Key<T> {
    /// Access this TLS variable, lazily initializing it if necessary.
    ///
    /// The first time this function is called on each thread, the TLS key will
    /// be initialized by having the specified init expression evaluated on the
    /// current thread.
    ///
    /// This function can return `None` for the same reasons that static TLS
    /// returns `None` (destructors are running or may have run).
    pub fn with<R>(&'static self, f: |Option<&T>| -> R) -> R { /* ... */ }
}
```
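
A usage sketch against this proposed API (hypothetical `LOG_BUFFER`/`PENDING` keys, not compilable today), including how the old `local_data::replace` behavior could be recovered with a `RefCell<Option<T>>`:

```rust
// Usage sketch of the proposed owning TLS API (illustration only).
use std::cell::RefCell;

// Dynamically initialized on first access from each thread; the `Vec` is
// dropped when the thread exits.
tls!(static LOG_BUFFER: RefCell<Vec<String>> = RefCell::new(Vec::new()))

fn log(msg: &str) {
    LOG_BUFFER.with(|buf| {
        buf.expect("TLS destructors are running")
           .borrow_mut()
           .push(msg.to_string());
    });
}

// The old `replace` pattern: swap the entire contents out, leaving `None`
// behind, by storing a `RefCell<Option<T>>` instead.
tls!(static PENDING: RefCell<Option<Vec<String>>> = RefCell::new(None))

fn take_pending() -> Option<Vec<String>> {
    PENDING.with(|slot| {
        slot.expect("TLS destructors are running").borrow_mut().take()
    })
}
```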

### Destructors

One of the major points about this implementation is that it allows for values
with destructors, meaning that destructors must be run when a thread exits. This
is similar to placing a value with a destructor into `std::local_data`. This RFC
attempts to refine the story around destructors:

* A TLS key cannot be accessed while its destructor is running. This is
currently manifested with the `Option` return value.
* A TLS key *may* not be accessible after its destructor has run.
* Re-initializing TLS keys during destruction may cause memory leaks (e.g.
setting the key FOO during the destructor of BAR, and initializing BAR in the
destructor of FOO). An implementation will strive to destruct initialized
keys whenever possible, but it may also result in a memory leak.
* A `panic!` in a TLS destructor will result in a process abort. This is similar
to a double-failure.

These semantics are still a little unclear, and the final behavior may still
need some more hammering out. The sample implementation suffers from a few extra
drawbacks, but it is believed that some more implementation work can overcome
some of the minor downsides.
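
To make the FOO/BAR scenario from the list above concrete, a sketch written against the proposed API (hypothetical types; illustration only):

```rust
// Sketch of the re-initialization cycle described above.
struct Foo;
struct Bar;

tls!(static FOO: Foo = Foo)
tls!(static BAR: Bar = Bar)

impl Drop for Foo {
    fn drop(&mut self) {
        // Touching BAR here may re-initialize it while the thread is exiting...
        BAR.with(|_bar| { /* ... */ });
    }
}

impl Drop for Bar {
    fn drop(&mut self) {
        // ...and vice versa, so one of the two values may end up leaked.
        FOO.with(|_foo| { /* ... */ });
    }
}
```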

### Variations

Like the scoped TLS variation, this key has a `with` function instead of the
normally expected `get` function (returning a reference). One possible
alternative would be to yield `&T` instead of `Option<&T>` and `panic!` if the
variable has been destroyed. Another possible alternative is to have a `get`
function returning a `Ref<T>`. Currently this is unsafe, however, as there is no
way to ensure that `Ref<T>` does not satisfy `'static`. If the returned
reference satisfies `'static`, then it's possible for TLS values to reference
each other after one has been destroyed, causing a use-after-free.

# Drawbacks

* There is no variant of TLS for statically initialized data. Currently the


A branch + pointer offset is significantly worse than just a pointer offset, even if it's marked as likely to succeed for LLVM.

Member Author

I definitely agree that there's a performance loss, but after some benchmarking, "significantly worse" may be overstating it a bit, I found the hit to be ~10%:

test dynamic      ... bench:      1725 ns/iter (+/- 71)
test local_data   ... bench:      6739 ns/iter (+/- 33)
test os           ... bench:      9787 ns/iter (+/- 215)
test scoped       ... bench:      1529 ns/iter (+/- 10)
test statik       ... bench:      1544 ns/iter (+/- 16)
test thread_local ... bench:      1545 ns/iter (+/- 20)

benchmarks


It's far more than 10%. You're not really measuring anything by doing naive micro-benchmarks of branches.

Member Author

Do you have some representative examples I could measure? I'd love to get a handle on what sort of impact this has.


The call to black_box is far more expensive than a TLS access, and turning off optimizations won't yield useful numbers without a lot of effort put into crafting a real benchmark. The cost of accessing TLS in the executable is comparable to the cost of accessing a global variable or data on the stack.


Static TLS access is just a single offset instruction with a constant offset so IMO the only way you're going to get a sane benchmark is generating the assembly for a fetch, increment and set of an integer in TLS (inside a library pub fn where it can't optimize out) and then copy-pasting it 100000 times. Using a loop will be measuring the cost of looping at least as much as the cost of the pointer offset vs. pointer offset + branch inside the loop.

Member Author

Ah I'm sorry, I should have clarified. The benchmarks were run with cargo bench --features thread-local which enabled the #[thread_local] usage for the various macros, as well as building everything with optimizations. Upon removing the calls to black_box the statik and thread_local benchmarks are completely optimized away (verified by inspecting the disassembly):

running 6 tests
test dynamic      ... bench:       529 ns/iter (+/- 251)
test local_data   ... bench:      5957 ns/iter (+/- 4)
test os           ... bench:      9789 ns/iter (+/- 12)
test scoped       ... bench:      1528 ns/iter (+/- 6)
test statik       ... bench:         1 ns/iter (+/- 0)
test thread_local ... bench:         1 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

I didn't find this very useful, so I added black_box to prevent this from happening. I didn't intend for it to hinder optimizations. Do you know of a way that I could measure without hindering optimizations?


You could measure the time taken by a no-op loop calling black_box on a local variable, and subtract that from the time taken by the various benchmarks. AFAICT, with optimizations enabled LLVM is just going to optimize out the repeated TLS offsets to a single reused offset (not that pointer arithmetic is expensive) so it's just going to be measuring the cost of incrementing memory at that location. I wouldn't expect it to be more expensive than a global variable even without the TLS optimizations though - both are dynamic offsets in position independent code.

Member Author

I've added a few more benchmarks, and these are the results:

test dynamic         ... bench:      1573 ns/iter (+/- 64)
test global_variable ... bench:      1797 ns/iter (+/- 16)
test local_data      ... bench:      5246 ns/iter (+/- 14)
test local_variable  ... bench:      1531 ns/iter (+/- 2)
test noop            ... bench:       271 ns/iter (+/- 3)
test os              ... bench:      9787 ns/iter (+/- 55)
test scoped          ... bench:      1529 ns/iter (+/- 7)
test statik          ... bench:      1543 ns/iter (+/- 5)
test thread_local    ... bench:      1543 ns/iter (+/- 1)

While I don't dispute that a dynamically initialized variable has more instructions on the fast path than a statically initialized one, it seems that the impact is quite minor. These numbers make it look like a global variable is a tad bit slower! If the cost of measuring the benchmarking loop is significant in terms of measurements, then I would expect the conclusion to be that the unit being benchmarked is quite fast.

I'd also like to reiterate that I would like to support statically initialized TLS in terms of an API, but the ergonomics of doing so make it infeasible today in my personal opinion. Do note that it is entirely implemented in the sample implementation. API-wise, however providing two variants (dynamic/static) also seems somewhat overkill versus providing only one to worry about. I suspect with an extension to the macro syntax in the future (and ergonomic static initialization), we could tweak the macro to something like: tls!(static FOO: Cell<uint> := Cell::new(3)) where here the := means "statically initialized" and Cell::new(3) is evaluated at compile time.

I would also expect the number of candidates for a statically initialized TLS variable to be fairly small today. It's pretty rare to work with a data structure that can be statically initialized, so in practice if we provided 2 possibilities I would expect the dynamic variant's usage to far outweigh the static variant's usage. If, however, we see usage going in the other direction, we could certainly tweak the semantics!

`std::tls` module requires dynamic initialization, which means a slight
penalty is paid on each access (a check to see if it's already initialized).
* The specification of destructors on owned TLS values is still somewhat shaky


Using the C++11 destructor support would be much more robust. It doesn't have weird limitations like PTHREAD_DESTRUCTOR_ITERATIONS, and a fallback implementation when the platform doesn't properly support C++11 (nearly all do) just needs to be safe rather than perfect.

Member Author

I'm actually not super familiar with the semantics of C++11 destructors with respect to thread_local, do you know if there's some documentation that you can point me at? The sample implementation actually does this already where it favors the oddly-named destructor registration functions over an OS-based implementation, but the OS-based implementation is provided as a fallback.


If there are 100 TLS variables and each one has a destructor accessing the next for the first time, I think it will lead to leaks with the old POSIX TLS because it only cycles N times (4 on Linux IIRC). AFAIK, that problem was solved for C++11 TLS by just giving it guaranteed sensible semantics (run until completion).

Member Author

Do you know of documentation for the C++11 destructor semantics in thread_local? I'll reiterate that we do use the destructor registration functions that it uses when available, I'd like to copy the semantics to the fallback implementation (OS TLS) as much as possible, however.

at best. It's possible to leak resources in unsafe code, and it's also
possible to have different behavior across platforms.
* Due to the usage of macros for initialization, all fields of `Key` in all
scenarios must be public. Note that `os` is excepted because its initializers
are `const`s.
* This implementation, while declared safe, is not safe for systems that do any
form of multiplexing of many threads onto one thread (aka green tasks or
greenlets). This RFC considers it the multiplexing systems' responsibility to
maintain native TLS if necessary, or otherwise strongly recommend not using
native TLS.

# Alternatives

Alternatives on the API can be found in the "Variations" sections above.

Some other alternatives might include:

* A 0-cost abstraction over `#[thread_local]` and OS-based TLS which does not


See the point about _tlv_atexit, etc. above. It's also only truly zero cost if there's the ability to define variables without the branch for dynamic initialization / destruction which is why the Cell version existed.

Member Author

Note that the sample implementation does indeed use the destructor registration functions, and it does actually have a statically initialized variant, I just felt that it was confusing to expose so many variants of TLS. I'll benchmark this though so we can get a good handle on how big of a hit this is.

Member Author

Ah one other thing which led me to start out with dynamic-only is the general lack of statically initialized values in the standard library. As you've found out, you can't statically initialize a `Cell` or a `RefCell` for example, and leaking an `UnsafeCell` as an API is pretty unfortunate. This means today for statically initialized TLS with internal mutability that we've got one of two options:

  1. Provide `Cell`/`RefCell` variants baked in (reimplementing `Cell`/`RefCell`) as you did with macros.
  2. Force usage of `UnsafeCell`, which allows for static initialization.

I'm not a super fan of either of these options, and would prefer to hold out for something like `const fn` or generic constants so we have a better story for statically initialized data. I found the usage around something statically initialized to be pretty uncomfortable unless we provided a reimplementation of `Cell`/`RefCell` (which I'd rather not do for composability reasons), so I opted to have only-dynamic for now.

I do think we may be able to add a static variant later without much pain, but I'd just prefer to see some more well-supported statically initialized values before that time.

have support for destructors but requires static initialization. Note that
this variant still needs destructor support *somehow* because OS-based TLS
values must be pointer-sized, implying that the Rust value must itself be
boxed (whereas `#[thread_local]` can support any type of any size).

* A variant of the `tls!` macro could be used where dynamic initialization is


Ah, that's right here.

opted out of because it is not necessary for a particular use case.

* A [previous PR][prev-pr] from @thestinger leveraged macros more heavily than
this RFC and provided statically constructible Cell and RefCell equivalents
via the usage of `transmute`. The implementation provided did not, however,
include the scoped form of this RFC.

[prev-pr]: https://github.com/rust-lang/rust/pull/17583

# Unresolved questions

* Are the questions around destructors vague enough to warrant the `get` method


For the pure `#[thread_local]` / C++11 destructor support case, it just needs to have a thread-local boolean tracking the initialization state. It can set it back to uninitialized before calling the destructor, and then further accesses will just reinitialize it. I think it's perfectly sane for it to infinite loop if that's what the programmer has essentially asked it to do by having dependency cycles between the destructors.

On platforms still lacking this support, it's still possible to make it memory safe, but it does mean that accessing uninitialized TLS in destructors should be considered a bug even if it's intended / would be sane with C++11 semantics, because it may trigger memory leaks (PTHREAD_DESTRUCTOR_ITERATIONS, etc.).

Member Author

For now the sample implementation does have a boolean for this and it doesn't ever flip it back to false after it's been set. This means that once deinitialized the get function will always return None, preventing cycles. The problem arises for OS-based TLS because this was difficult to do for pthread-TLS, for example, leading to some of the more uncomfortable questions.


I think it's wrong for the get function to return Option. This isn't a runtime error for the program to handle, it's a bug. The only sane thing to do with it is to unwrap it because a correct program will not have these cycles.


The number of branches in the fast path isn't unimportant.

Member Author

Would you be in favor of get calling panic! instead?


Yes, and with a single integer / enum tracking the state so there can be one branch for the fast path. Special handling of destruction would essentially be free in that case because it would only be an extra cost for initialization (which by definition only happens once).

being `unsafe` on owning TLS?
* Should the APIs favor `panic!`-ing internally, or exposing an `Option`?
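
The single-state-flag idea raised in the discussion above could look roughly like the following sketch; the names and layout are assumptions for illustration, not a proposed API:

```rust
// Sketch: track a key's lifecycle in one enum so the fast path is one branch.
enum State {
    Uninitialized,
    Valid,
    Destroyed,
}

// Conceptually, `state` and `slot` live per thread next to the stored value.
fn access<T, R>(state: &mut State,
                slot: &mut Option<T>,
                init: impl FnOnce() -> T,
                f: impl FnOnce(&T) -> R) -> R {
    match *state {
        // Fast path: a single branch when the value is already live.
        State::Valid => f(slot.as_ref().unwrap()),
        // Slow path, taken at most once per thread: run the initializer.
        State::Uninitialized => {
            *slot = Some(init());
            *state = State::Valid;
            f(slot.as_ref().unwrap())
        }
        // Touching the key during or after its destructor is treated as a bug.
        State::Destroyed => panic!("TLS value accessed during or after destruction"),
    }
}
```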