fill_via_chunks: use safe code via chunks_exact_mut on BE #1180
Conversation
This (specifically, using chunks_exact_mut) actually improves performance substantially.
rand_core/src/impls.rs
Outdated
unsafe {
    core::ptr::copy_nonoverlapping(
        $src.as_ptr() as *const u8,
        $dst.as_mut_ptr(),
        chunk_size_u8);
}
You could also try the following safe approach. It adds some up-front bounds checks (which the compiler might be able to optimize out), but I would guess that it won't be a noticeable performance hit:
let src_u8_slice = zerocopy::AsBytes::as_bytes($src);
(&mut $dst[..chunk_size_u8]).copy_from_slice(&src_u8_slice[..chunk_size_u8]);
Isn't this just trading one unsafe for another? We try to minimise rand's dependency count (people have complained multiple times in the past). Another point of compromise :-(
Technically, yes, although zerocopy's entire reason for existing is to encapsulate these uses of unsafe so that other crates don't have to think about it. The more crates that use zerocopy instead of using unsafe, the fewer instances of unsafe the ecosystem has to get right. Plus, when the Lang Item for Transmutability is implemented and stabilized, AsBytes will be modified to just be a thin wrapper around those utilities, and will no longer require any unsafe itself.
I'm still not keen on so many extra dependencies just for this one use of unsafe:
rand_core v0.6.3 (/home/dhardy/projects/rand/rand/rand_core)
├── getrandom v0.2.3
│ ├── cfg-if v1.0.0
│ └── libc v0.2.101
├── serde v1.0.130
│ └── serde_derive v1.0.130 (proc-macro)
│ ├── proc-macro2 v1.0.29
│ │ └── unicode-xid v0.2.2
│ ├── quote v1.0.9
│ │ └── proc-macro2 v1.0.29 (*)
│ └── syn v1.0.76
│ ├── proc-macro2 v1.0.29 (*)
│ ├── quote v1.0.9 (*)
│ └── unicode-xid v0.2.2
└── zerocopy v0.6.0
├── byteorder v1.4.3
└── zerocopy-derive v0.3.0 (proc-macro)
├── proc-macro2 v1.0.29 (*)
├── syn v1.0.76 (*)
└── synstructure v0.12.5
├── proc-macro2 v1.0.29 (*)
├── quote v1.0.9 (*)
├── syn v1.0.76 (*)
└── unicode-xid v0.2.2
That's four extra deps when using serde (optional), nine when not, in what is essentially just an API crate. I know there's a valid argument that we shouldn't care, but (1) we've had complaints: #713, #850, and (2) if any of those happen to want to use rand (and I note that zerocopy does as a dev-dependency), then you have circular dependencies.
zerocopy-derive looks like it would increase compile time significantly.
> just for this one use of unsafe

I've identified a few other places that could use zerocopy:
- https://github.com/rust-random/rand/blob/master/src/rng.rs#L352-L357
- https://github.com/rust-random/rand/blob/master/src/rng.rs#L370-L375 (would require zerocopy to impl FromBytes and AsBytes for core::num::Wrapping)
- https://github.com/rust-random/rand/blob/master/src/distributions/integer.rs#L131-L135 (would require packed_simd::Simd: FromBytes + AsBytes, which isn't currently the case)
> That's four extra deps when using serde (optional)

Which four? I count one (synstructure v0.12.5) that is used transitively by zerocopy-derive and not transitively by serde-derive.
I mention these only to provide support for using zerocopy (and most importantly for getting rid of unsafe; if another crate like bytemuck is a better tradeoff, you should go with that instead), not to suggest that the tradeoff is definitely worth it.
bytemuck looks like a better option at the moment, because the derive support is optional, so we could disable it and avoid the dependency / compile-time overhead. I think depending on bytemuck might be acceptable to reduce the unsafe code in Rand.
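For illustration only (a hypothetical sketch, not code from this PR; the helper name copy_words is made up): with bytemuck, the raw-pointer copy could become a safe reinterpretation via bytemuck::cast_slice, which views a &[u32] as &[u8] with no unsafe at the call site. On a big-endian target the bytes would still need an endianness pass, as discussed below.

fn copy_words(src: &[u32], dst: &mut [u8]) {
    // Safe byte view of the source words (native byte order).
    let src_bytes: &[u8] = bytemuck::cast_slice(src);
    let n = dst.len().min(src_bytes.len());
    dst[..n].copy_from_slice(&src_bytes[..n]);
}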
Those are some awesome perf numbers! cc @Ralith
One approach I've seen before for endianness handling is to just do the copy and then go back and do a second pass afterwards to fix the endianness in-place. This can be more cache friendly on large inputs since, during the endianness swap pass, you don't need both the source and destination memory ranges in cache.
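A rough sketch of that fix-up pass (illustration only, not code from this PR), assuming 4-byte words; note that it leaves any trailing partial word untouched, which is exactly the complication raised in the reply below:

// After the bulk copy, byte-swap each full 4-byte word of the destination in
// place on big-endian hosts so the output matches the little-endian layout.
fn fix_endianness_in_place(dst: &mut [u8]) {
    if cfg!(target_endian = "big") {
        for chunk in dst.chunks_exact_mut(4) {
            chunk.reverse();
        }
        // A trailing partial word (dst.len() % 4 != 0) is not handled here.
    }
}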
Nice, that worked out better than I expected! It's interesting to see what LLVM is and isn't capable of optimizing out here.
Doing this is problematic since we know nothing about the alignment of the output byte-array, and may have a partial word at the end. But doing the reverse works fine, and there doesn't appear to be a reason we can't mutate the source (though it is a breaking change). These are the results I get (byte-flipping to BE on my machine, since I don't have a BE test machine):
[benchmark results omitted]
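A compact sketch of that reversed direction (again an illustration under assumptions, not the PR's code): byte-swap the source words to little-endian in place, then do one bulk copy of the bytes (shown here via bytemuck::cast_slice; any safe byte view would do). Mutating the source is what makes this direction a breaking change, as noted above.

fn swap_then_copy(src: &mut [u32], dst: &mut [u8]) {
    // Pass 1: convert each source word to little-endian in place
    // (a no-op on LE hosts, a byte swap on BE hosts).
    for word in src.iter_mut() {
        *word = word.to_le();
    }
    // Pass 2: one bulk copy of the now little-endian bytes; a partial trailing
    // word in dst simply takes a prefix of the last word's bytes.
    let src_bytes: &[u8] = bytemuck::cast_slice(src);
    let n = dst.len().min(src_bytes.len());
    dst[..n].copy_from_slice(&src_bytes[..n]);
}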
Regarding Ben Sanders' comment:

My take is that (1) the macro is private to the module and (2) that the only way to make it a generic function would be to define our own trait (or use …).
I believe you're guaranteed not to have a partial word at the end because the source's length is a multiple of the chunk size. As for the alignment, I suppose I could see that being a performance issue, but it's probably worth benchmarking so you don't have to break the API. You obviously have a better sense than I do of how big of a deal that kind of breaking change is, though.
Ben here. I think (c) is a large improvement because it makes it clear at the macro callsite that statically unchecked soundness requirements are in play. The current form makes it easy to miss that for a reader who's not already familiar with the code. Nothing wrong with unsafe private helpers, but it's a lot clearer when that unsafety isn't hidden. I'm a fan of replacing macros with traits whenever possible since macros are a much bigger hammer, but I understand if you'd prefer to avoid the boilerplate.
No it's not: it often will be in practice, but the destination is just whatever byte-slice was passed in.

I think the API breakage is fine, though obviously it delays releasing this fix. We could split this PR and do that part later, I guess (it's only a perf boost). @vks @newpavlov?

Thanks for the arguments Ben. Traits and generics are also a pretty big hammer, but still it doesn't have a noticeable compile-time impact. Your argument:

.. still partly applies with a generic function: now we know …
Shouldn't …?
But is there anything …? Anyway, I think I should move the breaking parts to another PR so that we can do another patch release.
I thought the problem was that for some types …
It would also be good to check whether the safe implementation is still too slow:

let mut iter_src = $src.iter();
let mut chunks = $dst.chunks_exact_mut(SIZE);
for (chunk, n) in (&mut chunks).zip(&mut iter_src) {
    chunk.copy_from_slice(&n.to_le_bytes());
}
let rem = chunks.into_remainder();
if let Some(n) = iter_src.next() {
    rem.copy_from_slice(&n.to_le_bytes()[..rem.len()]);
}
I removed the breaking changes. The results are actually a little better than before, though I don't understand why:
[benchmark results omitted]
What's an invalid byte pattern to …?

Your code is marginally slower than mine.
Ack.
Fair enough, I can only think of the undefined behavior being introduced by an unsound implementation of …

Thanks for checking! How much slower? Is it fast enough so we can use the safe code?
Looks like value stability on big endian platforms was broken:
[output omitted]
I think we need to restore value stability across endianness.
In the current form, I believe either …
Sorry, I meant it's a couple of percent slower than my "safe" code (which performs as below on LE):

[benchmark results omitted]

Good point @Ralith.
rand_core/src/impls.rs
Outdated
@@ -52,17 +52,19 @@ pub fn fill_bytes_via_next<R: RngCore + ?Sized>(rng: &mut R, dest: &mut [u8]) {
    }
}

trait ToLe: Copy {
    /// Contract: implementing type must be memory-safe to observe as a byte array
    /// (implies no uninitialised padding).
I'm not sure this means the trait has to be unsafe. What would a safe implementation violating this contract look like?
An implementation of ToLe for any type that could have padding (e.g. any repr(Rust) struct), whose padding may be uninitialized, would be unsound because fill_via_chunks is safe and could expose the uninitialized data on a little-endian target.
@Ralith an unsound implementation of ToLe cannot be written without unsafe, can it? A trait is unsafe to indicate that, due to the way the trait is used, soundness requires that the impl maintain certain invariants (on top of the standard invariants of safe Rust), which I don't see to be the case here. If usage of ToLe is sound as long as the implementation is sound, it's a regular trait.
An unsound impl is trivial:

struct ProbablyPadded(u64, u8);

impl ToLe for ProbablyPadded {
    type Bytes = [u8; 0];
    fn to_le_bytes(self) -> Self::Bytes { [] }
}

The key is that fill_via_chunks doesn't use to_le_bytes on little-endian targets; it just relies on the unsafe guarantee made by the trait.
@Ralith Code dependent on to_le_bytes does not use any unsafe, so it does not make any assumptions about ToLe which may cause UB. Your impl will simply panic on chunk.copy_from_slice(src[i].to_le_bytes().as_ref()), since the chunk size will not be equal to the length of the returned byte array. You simply can not expose the padding bytes without using unsafe.
To reiterate: on little-endian targets, fill_via_chunks never invokes to_le_bytes; instead it uses unsafe to access the ToLe implementer's representation directly.
Thanks @Ralith. There's a subtlety here in that the trait is serving dual roles as an (unsafe) marker trait and a regular trait. @dhardy maybe we could use a comment that specifically mentions that memcpy is sometimes used in lieu of the given implementation, as that does open a can of worms. Worms that are certainly worth that much performance gain, but worms nonetheless.
I hadn't put quite so much effort into this trait since it is supposed to change again in the next breaking release, but sure. Perhaps it should be renamed Observable with extended documentation?
I added a new method to the trait, moving the usage of unsafe into the trait. Benchmarks are unaffected.
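A minimal sketch of that general shape (assumptions only, using the Observable name floated above; not the PR's exact code): the byte-view method carries the unsafe inside the trait impl, so the chunking code that calls it contains no unsafe of its own.

trait Observable: Copy {
    /// Contract: `Self` must have no uninitialised (padding) bytes.
    fn as_byte_slice(src: &[Self]) -> &[u8];
}

impl Observable for u32 {
    fn as_byte_slice(src: &[Self]) -> &[u8] {
        // Sound for u32: no padding, and every byte value is a valid u8.
        let len = src.len() * core::mem::size_of::<u32>();
        unsafe { core::slice::from_raw_parts(src.as_ptr() as *const u8, len) }
    }
}

// Hypothetical safe caller: copies the native byte view; on BE targets an
// endianness conversion is still needed, as discussed earlier in the thread.
fn bulk_copy(src: &[u32], dst: &mut [u8]) {
    let bytes = u32::as_byte_slice(src);
    let n = dst.len().min(bytes.len());
    dst[..n].copy_from_slice(&bytes[..n]);
}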
@dhardy Thanks, the new trait encapsulates the unsafe code in a more understandable way.
Address #1170 (@joshlf):

- fill_via_chunks: remove/replace first use of unsafe
- fill_via_chunks: replace second use of unsafe
- keep the fill_via_chunks macro or use a generic function?
- remove (chacha: safer outputting #1181) unsafe in BlockRngCore::generate for ChaChaXCore

This removes some unsafe code and is actually a substantial performance boost (if the LE-specific path is commented out). Specifically, using chunks_exact_mut is a massive boost to the gen_bytes_chacha* benches (the only ones affected).

We could remove the LE-specific branch, but as mentioned it is substantially faster than the chunking code (much more than the 8% previously mentioned). I don't think there's a better safe alternative than chunking.

Some quick benchmarks:

[benchmark results omitted]