Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Web optimizations #559

Merged
merged 12 commits into from
Dec 9, 2024
Merged

Web optimizations #559

merged 12 commits into from
Dec 9, 2024

Conversation

daxpedda
Copy link
Contributor

@daxpedda daxpedda commented Dec 5, 2024

As discussed in #541. This PR adds the following improvements:

  • Caching of the global Crypto object.
  • Detecting if our Wasm memory is based on a SharedArrayBuffer. If not, we can copy bytes directly into our memory instead of having to go through JS. This saves allocating the buffer in JS and copying the bytes into Wasm memory. This is also the most common path. SharedArrayBuffer requires target_feature = "atomics", which is unstable and requires Rust nightly. See Web optimizations #559 (comment) for full context.
  • The atomic path only creates a sub-array when necessary, potentially saving another FFI call.
  • The atomic path will now allocate an Uint8Array with the minimum amount of bytes necessary instead of a fixed size.
  • The maximum chunk size for the non-atomic path and the maximum Uint8Array size for the atomic paths have been increased to 65536 bytes: the maximum allowed buffer size for Crypto.getRandomValues().

All in all this should give a performance improvement of ~5% to ~500% depending on the amount of requested bytes and which path is taken. See #559 (comment) for some benchmark results.

This spawned a bunch of improvements and fixes in wasm-bindgen that are being used here:

@daxpedda daxpedda force-pushed the web-opt branch 4 times, most recently from 1853f3b to d5f59dd Compare December 5, 2024 01:16
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 5, 2024

There is a strange bug in Firefox where using instanceof SharedArrayBuffer doesn't seem to work correctly in shared workers. This is probably related to this bug, but I will open a dedicated bug report tomorrow.

In the meantime I implemented a workaround where I check the name of the constructor instead.

@daxpedda daxpedda force-pushed the web-opt branch 3 times, most recently from b711996 to 0e3e163 Compare December 5, 2024 07:46
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 5, 2024

Alright, I just remembered that I encountered this issue before. So the problem is that SharedArrayBuffer is not available without COOP & COEP. But it is still possible to create a shared WebAssembly.Memory backed by a SharedArrayBuffer without COOP & COEP. So the way to check for that is still via the constructor name, which is still in place.

We can detect COOP & COEP via crossOriginIsolated, in which case we can use the simpler instanceof SharedArrayBuffer check.

See w3c/ServiceWorker#1545 and Bugzilla #1 and #2.

@daxpedda daxpedda force-pushed the web-opt branch 2 times, most recently from 5aab925 to 04a539e Compare December 5, 2024 08:01
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 5, 2024

Alright, after playing around with the implementation and accounting for various scenarios, I decided its probably best if we just always check the constructor name for simplicities sake. Let me know if you think otherwise.

@newpavlov
Copy link
Member

newpavlov commented Dec 5, 2024

This is a fair amount of additional complexity. It would be nice to provide performance measurements to demonstrate that this optimization is worth the trouble. It also may be worth to split this PR into two parts: Crypto caching and Uint8Array removal.

And wouldn't the unconditional dependency on std break the wasm32v1-none support?

@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 5, 2024

This is a fair amount of additional complexity. It would be nice to provide performance measurements to demonstrate that this optimization is worth the trouble.

The first change, caching the Crypto global object, is a fairly minimal change. All it does is reduce the amount of time getting the object from 35-40ns to 0. While I don't think this is a big improvement, the change is basically two lines of code and the code is actually shorter then before. (apart from the workaround fixed by rustwasm/wasm-bindgen#4315)

The second change is not using an Uint8Array unless necessary. It is only necessary when the underlying Wasm memory is a SharedArrayBuffer. In any other case, the most common case, the random bytes can be copied right into the Wasm memory. This not only reduces the amount of FFI calls from 4 to 1, but it also avoids allocating the necessary memory, which we don't cache anymore, and the byte copying into the Wasm memory.

I did a quick local benchmark:

  • Firefox: 1174ns to 695ns, ~41% improvement
  • Chrome: 1326ns to 781ns, ~41% improvement
  • Node.js v23: 2362ns to 752ns, ~68% improvement

Here is the code (no black_box() shenanigans needed because its all FFI calls):

const RUNS: u32 = 10_000_000;
const CRYPTO_BUFFER_SIZE: u32 = 256;

fn main() {
    let global: Global = js_sys::global().unchecked_into();
    let crypto = global.crypto();

    let mut buffer = [0; CRYPTO_BUFFER_SIZE as usize];
    let start = Instant::now();

    for _ in 0..RUNS {
        crypto.get_random_values_ref(&mut buffer).unwrap();
    }

    console::log_1(&format!("{:?}", start.elapsed() / RUNS).into());

    let mut buffer = [0; CRYPTO_BUFFER_SIZE as usize];
    let start = Instant::now();

    for _ in 0..RUNS {
        let js_buffer = Uint8Array::new_with_length(CRYPTO_BUFFER_SIZE);
        let js_buffer = js_buffer.subarray(0, CRYPTO_BUFFER_SIZE);
        crypto.get_random_values(&js_buffer).unwrap();
        unsafe { js_buffer.raw_copy_to_ptr(buffer.as_mut_ptr()) };
    }

    console::log_1(&format!("{:?}", start.elapsed() / RUNS).into());
}

I will let you decide on how representative this is for actual real world usage.

I assume the most offensive part of the code additions is the SharedArrayBuffer detection, which basically manually implements a LazyLock because we don't have access to one in no_std. The MSRV is not necessary a problem because we could use once_cell, which is a dependency of wasm-bindgen anyway.

I briefly tested just not caching it and doing the detection on every single call. It would still be an improvement, but we loose around 10% of the performance gain.

Let me know what you think.

And wouldn't the unconditional dependency on std break the wasm32v1-none support?

This is a workaround for a macro-hygiene bug in wasm-bindgen fixed in rustwasm/wasm-bindgen#4315, we can simply gate it behind a cfg(feature = "std"). Which I will do now in the rebase.

@daxpedda daxpedda force-pushed the web-opt branch 4 times, most recently from aa62dd1 to 2f9ec56 Compare December 6, 2024 08:33
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 6, 2024

I removed the caching of the SharedArrayBuffer detection, we can discuss adding that in a follow-up PR.
Also added another optimization: not creating a sub-array unless necessary, saving another FFI call.

Let me know if you want me to split out the SharedArrayBuffer detection entirely.

@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 6, 2024

Alright, I made rustwasm/wasm-bindgen#4325 to solve this issue. Its a dance I've done a couple of times already and its pretty bad. The Wasm module already contains all that information so we will just let wasm-bindgen extract it for us.

So this will wait for a new release of wasm-bindgen.
Let me know what you think of the changes now.

@daxpedda daxpedda force-pushed the web-opt branch 2 times, most recently from ae2616d to 15a1de5 Compare December 6, 2024 16:06
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 6, 2024

I decided to remove the SharedArrayBuffer detection entirely. It was discussed in the past to support something like this in wasm-bindgen, but it was too niche. If you need SharedArrayBuffer you have to compile with target_feature = "atomics". This would also be according to spec.

@daxpedda daxpedda force-pushed the web-opt branch 2 times, most recently from 30fb78f to 20f7a91 Compare December 6, 2024 16:15
@daxpedda daxpedda marked this pull request as ready for review December 7, 2024 00:11
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 7, 2024

This PR is ready again. I updated the OP so all the relevant information can be found there.

Copy link
Member

@josephlr josephlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @daxpedda for trying to help us clean this code up. I also really like the idea of using target_feature = "atomics" to check for if we need to care about shared memory or not.

I have some suggestions below on how to furthur simplify.

// Crypto.getRandomValues()
#[cfg(not(target_feature = "atomics"))]
#[wasm_bindgen(method, js_name = getRandomValues, catch)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using methods, statics, and thread_local_v2, can we use js_namespace like:

// Web Crypto API bindings (https://www.w3.org/TR/WebCryptoAPI/)
#[wasm_bindgen(js_namespace = crypto)]
extern "C" {
    #[cfg(not(target_feature = "atomics"))]
    #[wasm_bindgen(js_name = getRandomValues, catch)]
    fn get_random_values(buf: &mut [MaybeUninit<u8>]) -> Result<(), JsValue>;
    
    #[cfg(target_feature = "atomics")]
    #[wasm_bindgen(js_name = getRandomValues, catch)]
    fn get_random_values(buf: &Uint8Array) -> Result<(), JsValue>;
}

(I'm not sure if js_namespace should be crypto or ["globalThis", "crypto"])).

Then the code can just be:

// If atomics are not enabled, we assume that Rust's memory is not backed by a
// SharedArrayBuffer, so we can directly call getRandomValues() on the buffer.
#[cfg(not(target_feature = "atomics"))]
pub fn fill_inner(dest: &mut [MaybeUninit<u8>]) -> Result<(), Error> {
    // The byteLength of the input must not exceed 65536 bytes, see:
    // https://developer.mozilla.org/en-US/docs/Web/API/Crypto/getRandomValues
    const MAX_LENGTH: usize = 65536;
    for chunk in dest.chunks_mut(MAX_LENGTH) {
        if crypto.get_random_values(chunk).is_err() {
            return Err(Error::WEB_GET_RANDOM_VALUES);
        }
    }
    Ok(())
}

// If atomics are enabled, Rust's memory can be backed by a SharedArrayBuffer,
// so we use a temporary Uint8Array buffer when calling getRandomValues().
#[cfg(target_feature = "atomics")]
pub fn fill_inner(dest: &mut [MaybeUninit<u8>]) -> Result<(), Error> {
    // We are rarely called with long slices, so we allocate a small buffer.
    const BUF_LEN: u16 = 256;  // Less than MAX_LENGTH
    let tmp = js_sys::Uint8Array::new_with_length(BUF_LEN.into());
    for chunk in dest.chunks_mut(BUF_LEN.into()) {
        // The chunk can be smaller than tmp's length, so we call to
        // JS to create a smaller view of buf without allocation.
        let buf = if chunk.len() == BUF_LEN.into() {
            tmp
        } else {
            let len: u32 = chunk.len().try_into().expect("chunk.len() <= BUF_LEN <= u32::MAX");
            tmp.subarray(0, len)
        }
        if crypto.get_random_values(&buf).is_err() {
            return Err(Error::WEB_GET_RANDOM_VALUES);
        }
        // SAFETY: buf's length is equal the chunk's length.
        unsafe { buf.raw_copy_to_ptr(chunk.as_mut_ptr().cast::<u8>()) };
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are forfeiting the check if Crypto is available though.
But if your intention was just to get rid of the nesting, how about prefacing the function like this:

if CRYPTO.with(Option::is_none) {
    return Err(Error::WEB_CRYPTO)
}

// use `get_random_values()` without nesting:
get_random_values(&buf);

Let me know how you want to proceed.

My recommendation is to keep the check in place, there's all sort of strange JS environments out there, getrandom is very foundational and should support these imo. And CRYPTO is cached after all so there is no performance overhead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if the global crypto object exists doesn't really change what platforms/environments we support. It just changes what error message we get when the call to crypto.getRandomValues fails. Before, we were checking for existence of the object as part of our platform detection (and to cache the value in RngSource). Now that we don't have two different implementations for Web vs Node, we don't need to do any of that.

So if I'm understanding things correctly, having the check for the existence of the object just lets us return Error::WEB_CRYPTO to the user, informing them that "Web Crypto API is unavailable" instead of my implementation above which would make us return Error::WEB_GET_RANDOM_VALUES in all places.

I think returning good error messages is valuable, so I think keeping some sort of check is OK, provided that we aren't making an additional FFI call every time (which I don't think we do because the result of is_none is cached).

I personally find the un-nested version easier to read, so we might try a version that avoids too much nesting (potentially by using a helper function).

Copy link
Contributor Author

@daxpedda daxpedda Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if the global crypto object exists doesn't really change what platforms/environments we support. It just changes what error message we get when the call to crypto.getRandomValues fails.

Ah, you are relying on the fact that WBG just catches any exception when tagging it with catch. While this is currently the case, its not intended to be used like this. Ideally, this would only catch exceptions thrown by the function call. However, I guess this behavior has to stay in place because otherwise it would be a breaking change.

I personally find the un-nested version easier to read, so we might try a version that avoids too much nesting (potentially by using a helper function).

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, let me know if you like this better!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see!

From my understanding the issue was getting rid of the nesting, but that has been achieved, right? Is there some other motivation to get rid of thread_local_v2?

On that note: your proposal will store the check globally. But different threads may have different outcomes. This isn't hypothetical either, some worklets don't have access to Crypto. So this needs to be in a thread-local to work correctly.

Copy link
Member

@newpavlov newpavlov Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with the wasm-bindgen inner workings, but wouldn't js_namespace-based code result in a lighter-weight WASM file? Plus, the resulting source code will be slightly smaller, assuming that we do not care about the WEB_CRYPTO error. If we want to keep it, then I think the current code is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with the wasm-bindgen inner workings, but wouldn't js_namespace-based code result in a lighter-weight WASM file?

Yes, we have one FFI function less present in the Wasm file. To put it into perspective: the function that actually ends up in the Wasm file after post-processing is done is just a single FFI signature, one byte cast and an if condition. Not counting the storing and loading ofc.

Plus, the resulting source code will be slightly smaller (assuming that we do not care about the WEB_CRYPTO error).

If you don't care about the WEB_CRYPTO error yes, then we can simply go with the original proposal by @josephlr. I will do some digging into the specification and actually find out if Crypto.getRandomValues() can realistically fail or not actually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that fast: according to the specification, Crypto.getRandomValues() can only fail if we don't supply the right buffer type or we exceed the maximum byte length of 65536. So instead of getting rid of WEB_CRYPTO I will go ahead and get rid of WEB_GET_RANDOM_VALUES.

Let me know what you think!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8cf6832.

.github/workflows/tests.yml Outdated Show resolved Hide resolved
.github/workflows/workspace.yml Show resolved Hide resolved
@josephlr josephlr mentioned this pull request Dec 7, 2024
@daxpedda daxpedda force-pushed the web-opt branch 2 times, most recently from 0095636 to 04f830f Compare December 7, 2024 07:55
Co-Authored-By: Joe Richey <dev@richey.us>
src/backends/wasm_js.rs Outdated Show resolved Hide resolved
Copy link
Member

@newpavlov newpavlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gating on target_feature = "atomics" looks a bit surprising to me, but I am not familiar with the wasm-bindgen and the wider WASM ecosystem, so I can not make an informed decision here.

src/backends/wasm_js.rs Outdated Show resolved Hide resolved
src/backends/wasm_js.rs Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 7, 2024

Gating on target_feature = "atomics" looks a bit surprising to me, but I am not familiar with the wasm-bindgen and the wider WASM ecosystem, so I can not make an informed decision here.

I'm happy to answer any questions, but here is some context:

The base problem is that Crypto.getRandomValues() doesn't support accepting a SharedArrayBuffer. You can see this by looking at the Crypto.getRandomValues() interface specification and not finding a [AllowShared] extended attribute on the array parameter.

The underlying Wasm memory in browsers is represented by a WebAssembly.Memory and its underlying buffer is a ArrayBuffer. To use a SharedArrayBuffer instead, the shared property can be used.

This is connected to the Wasm threads proposal (toolchains use the target feature atomics) because it introduces the shared memory type. Now, quoting the threads proposal (this will eventually land in a separate W3C JS specification, but the proposal isn't there yet):

It is a validation error to attempt to import shared linear memory if the module's memory import doesn't specify that it allows shared memory. Similarly, it is a validation error to import non-shared memory if the module's memory import specifies it as being shared.

This is to say:

  • Without marking the Wasm memory with shared, which requires the atomics target feature, the underlying memory MUST be an ArrayBuffer.
  • Marking the Wasm memory with shared, which requires the atomics target feature, the underlying memory MUST be a SharedArrayBuffer.

Lastly, Rust by default marks the memory with shared when the atomics target feature is active. In conclusion:

  • Without enabling the atomics target feature, the memory can't be marked with shared and therefor MUST be backed by an ArrayBuffer in JS.
  • When the atomics target feature is enabled, the memory is marked with shared and therefor MUST be backed by a SharedArrayBuffer in JS.

Within the spec, there is one thing we might want to consider: it is allowed to enable the atomics target feature without using shared memory and therefor using a regular ArrayBuffer. To account for that, we might want to detect that and run the optimized code path. However, this use case is not supported by Rust and users would have to fiddle around with LLVM args, define their own target or do some post-processing to pull off something like that. If Rust ever gains support or it becomes more common, wasm-bindgen will offer a detection mechanism: see rustwasm/wasm-bindgen#4325.

Outside the spec, the other case was using a SharedArrayBuffer without the atomics target feature. This isn't possible because browsers require the memory to be marked with shared. So this requires post-processing and is not supported by wasm-bindgen and I don't see a need for getrandom to support something hacky like that either.

@daxpedda
Copy link
Contributor Author

daxpedda commented Dec 8, 2024

CI failure is unrelated.

src/backends/wasm_js.rs Outdated Show resolved Hide resolved
#[wasm_bindgen(method, getter)]
fn crypto(this: &Global) -> Crypto;
// Holds the global `Crypto` object.
#[wasm_bindgen(thread_local_v2, js_name = crypto)]
Copy link
Contributor Author

@daxpedda daxpedda Dec 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially avoided using globalThis here because I have no idea how it works exactly on Node.js. In the meantime I have looked up the documentation and asked around: it should be perfectly fine to use globalThis in Node.js.

For context: the current setup just accesses the object directly from the global scope. According to the spec this is perfectly fine. However, this is usually discouraged because global variables can simply be overwritten. While I'm not sure if there is more history behind this, fact is that read-only properties like crypto can't be overwritten. So both approaches, accessing Crypto through the global crypto variable or globalThis.crypto, will work the same.

But to stick with "how it should be done" I will add globalThis now.

EDIT: Support should also not be a concern after #554, but you can make your own evaluations via "Can I Use".

@newpavlov
Copy link
Member

Thank you!

@newpavlov newpavlov merged commit 30a7f1f into rust-random:master Dec 9, 2024
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants