Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add copy-on-write based instance reuse mechanism #3691

Closed
wants to merge 10 commits into from

Conversation

koute
Copy link
Contributor

@koute koute commented Jan 14, 2022

This PR adds a new copy-on-write based instance reuse mechanism on Linux.

Usage

The general idea is - you instantiate your instance once, and then you can reset its state back to how it was when it was initially instantiated.

    // Initialize a new instance. This is done only once.
    let mut store = Store::new(&engine, ());
    let instance_pre = linker.instantiate_pre(&mut store, &module)?;
    let instance = instance_pre.instantiate_reusable(&mut store)?;

    loop {
        // Call methods, modify memory, modify the table, etc.
        use_instance(instance, &mut store)?;
    
        // Reset the instance to its initial state.
        instance.reset(&mut store)?;
    }

How does it work?

After the instance is instantiated a snapshot of its state is taken. Tables and globals are simply cloned into a spare Vec, while the memory is copied into an memfd and remapped inplace in a copy-on-write fashion. Then when reset is called the tables and the globals are restored by simply copying them back over, and the memory is reset using madvise(MADV_DONTNEED) which either clears the memory (for those pages which are not mapped to the memfd) or restores the original contents (for those pages which are mapped to the memfd).

Benchmarks

In our benchmarks this is currently the fastest way to call into a WASM module assuming you rarely need to instantiate it from scratch.

call_empty_function

dirty_1mb_of_memory

Legend:

  • instance_pooling_with_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy with uffd turned on
  • instance_pooling_without_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy without uffd turned on
  • recreate_instance: create a fresh instance with InstanceAllocationStrategy::OnDemand strategy
  • native_instance_reuse: this PR
  • interpreted: just for our own reference; an instance created through the wasmi crate
  • legacy_instance_reuse: just for our own reference; this is what we're currently using - it is an instance spawned with InstanceAllocationStrategy::OnDemand strategy and then reused after manually clearing its memory and restoring its data segments and globals

The two of the benchmarks shown here are:

  • call_empty_function: an empty function is called in a loop, resetting (or recreating) the instance after each call
  • dirty_1mb_of_memory: a function which dirties 1MB of memory and then returns is called in a loop, resetting (or recreating) the instance after each call

The measurements are only for the main thread; thread count on the bottom signifies how many other threads were running in the background doing exactly the same thing as the main thread, e.g. for 4 threads there was 1 thread (the main thread) being benchmarked while other 3 threads were running in the background.

For your reference the benchmarks used to generate these graphs can be found here:

https://github.com/koute/substrate/tree/master_wasmtime_benchmarks

Which can be run like this after cloning the repository:

$ cd substrate/client/executor/benches

# `instance_pooling_with_uffd`
$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/uffd --bench bench -- with_recreate_instance

# `instance_pooling_without_uffd`
$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime --bench bench -- with_recreate_instance

# `recreate_instance`
FORCE_WASMTIME_INSTANCE_POOLING=0 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime --bench bench -- with_recreate_instance

# `native_instance_reuse`
rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime --bench bench -- native_instance_reuse

(The rustc version is just for reference as to what I used. Also please forgive the hacky way the benchmarks have to be launched for instance pooling; we don't intend to keep this codepath, so I quckly hacked it in only for the benchmarks.)

The benchmarks were run on the following machine:

AMD Threadripper 3970x (32-core)
64GB of RAM
Linux 5.14.16

Possible future work (missing features)

  • Add support for other OSes (this is Linux-only for now)
  • Add support for imported memories (this only supports modules which do not import memory)
  • Add support for taking snapshots at an arbitrary points in time (it always takes a snapshot right after instantiation)
  • Add the ability to customize what is restored on reset (it always resets everything)
  • Add address-space pooling to make the initial initialization faster (for cases where new modules are frequently instantiated from scratch)

Comment on lines +900 to +901
let vmmemory = memory.vmmemory();
instance.set_memory(index, vmmemory);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this is necessary for correctness, but when I comment it out nothing fails. Any idea how I could write a test which would verify that this is here?

let mut page_first_index = None;
unsafe {
let mut fp = std::fs::File::open("/proc/self/pagemap")
.context("failed to open /proc/self/pagemap")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work if /proc isn't mounted. In addition it doesn't check /proc is actually a mounted procfs. Rustix has code to check if the /proc is sane. It doesn't seem like it is exported though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I've made a PR to rustix here to export it: bytecodealliance/rustix#174

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Jan 14, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@koute
Copy link
Contributor Author

koute commented Jan 18, 2022

It looks like qemu's user mode emulator is not emulating madvise properly and that's why the tests are failing on aarch64; I've just checked and on a bare metal aarch64 system they all pass.

I guess ideally we should disable them when running under qemu-user?

@cfallin
Copy link
Member

cfallin commented Jan 18, 2022

Hi @koute -- thanks so much for this PR and for bringing up the ideas behind it (in particular, the memfd mechanism)!

Guilty admission on my part: after you mentioned memfd recently on Zulip, and madvise to reset a private mapping (throw away a CoW overlay), I threw together my own implementation as well and did a lot of internal experimentation. (I've been hacking in the pooling allocator and on performance-related things recently as well and your idea was a huge epiphany for me.) I need to clean it up a bit still but will put it up soon (with due credit to you for memfd/madvise/CoW ideas!). Perhaps we can get the best ideas out of both of these PRs :-)

One additional realization I had was that, for performance, we don't want to do any mprotect() at all on heap growth (it holds the whole-process mmap lock in write mode, so can quickly become a bottleneck). But we can play tricks with ftruncate and a second anonymous memfd, mapped above the initial image, so that in a steady state with instance-slot reuse on reinstantiation, our only syscalls are madvise and ftruncate. (Key bits: mmap can map beyond the actual size of a file; size of a file can be changed after mapping is made, and this doesn't touch any whole-process locks; accesses beyond end of file cause SIGBUS.)

Anyway, a few thoughts on this PR:

  • The first-class notion of snapshotting, saving globals, etc., is interesting, and I can see how it could be useful in certain scenarios. However I think it actually mixes a few ideas which we probably want to separate and design independently: the notion of instance state, snapshotting, rewinding, etc., and the implementation-level mechanism of CoW-mapping a heap image.

    In particular, at least in scenarios I'm more familiar with, we do snapshotting via a separate tool that writes out a Wasm module post-snapshot, Wizer. IMHO, including a runtime snapshot-and-reuse-of-state-in-memory mechanism muddies the waters a bit -- which should be used when? -- and whether or not Wasmtime should have such an API deserves further discussion, I think.

  • Given that, I think something that we could more easily land would be a completely API-transparent extension of the pooling allocator. Basically a new mode that (i) creates a backing image for a module's memories when it's first seen, (ii) CoW-maps these images into memory slots, with the ftruncate-to-extend trick above, and (iii) tries to reuse slots with existing mappings of the desired images when possible. This is basically what I've put together, so I'll try to get it up ASAP.

  • I'm a bit concerned with the need to read /proc/self/pagemap to determine whether pages are present. This has some interesting permissions implications (see Documentation/vm/pagemap.txt in the Linux kernel) -- some bits need root or a process capability to see, and it seems that under some kernel versions the whole file is inaccessible except by root. In any case, this creates a deep coupling with a Linux-specific interface for snapshotting; ideally, such API functionality should be designed, if it exists, so it is not forever limited to be a Linux-specific feature. (In other words: performance-only features can be OS-specific, because that's the nature of optimizing for a platform; but functionality should not be OS-specific.)

    I think though that my concern here is basically subsumed by the higher-level question of where snapshotting should occur: if it is the domain of a separate tool, then that separate tool can do special things that may not be as efficient (e.g., diff a whole memory, or mmap inaccessible and mprotect() in pages one at a time), since snapshotting is probably rare compared to instantiation.

I think these are some things we should talk through after I've put up my PR and we can do comparisons. I'm really grateful for the ton of effort you put into this and look forward to comparing the approaches in more detail!

@fitzgen
Copy link
Member

fitzgen commented Jan 18, 2022

Agreed with @cfallin that we should disentangle snapshots and instantiation here, and focus on a relatively transparent extension of the pooling instance allocator for now.

That said, a snapshotting feature could be very useful for doing neat things like rr-style record and replay debugging. But this is a pretty big design space, and I'd want to hash out our motivation/use cases, technical architecture, and API design with an RFC before we dive head first into implementation.

@alexcrichton
Copy link
Member

@koute would using the pooling instance allocator work for your embedding's use case? @cfallin's work right now I believe is entirely focused on that which means that by-default Wasmtime wouldn't have copy-on-write re-instantiation because Wasmtime by default (as you've seen and modified here) uses the on-demand instance allocator. If your embedding doesn't work well with the pooling instance allocator then I think we'll need to brainstorm a solution which "merges" your work here with @cfallin's on the pooling allocator, taking into account the feedback around snapshots (which I personally agree is best to separate and ideally make the copy-on-write business a simple config option of "go faster")

@koute
Copy link
Contributor Author

koute commented Jan 19, 2022

I'm a bit concerned with the need to read /proc/self/pagemap to determine whether pages are present.

This it technically optional and done entirely for performance, so it could be made allowed to fail. Even without using /proc/self/pagemap my instance reuse mechanism is faster than anything that's currently in wasmtime (in our benchmarks), however it's not always faster than what we're currently using without the /proc/self/pagemap part. (And ideally we don't want to switch to anything that's slower; we want to improve performance.)

This has some interesting permissions implications (see Documentation/vm/pagemap.txt in the Linux kernel) -- some bits need root or a process capability to see, and it seems that under some kernel versions the whole file is inaccessible except by root.

That is, AFAIK, only applicable to the lower bits, which we don't need in this case. The higher bits (which we need) should be always readable when reading our own process' pagemap.

The first-class notion of snapshotting, saving globals, etc., is interesting, and I can see how it could be useful in certain scenarios. However I think it actually mixes a few ideas which we probably want to separate and design independently: the notion of instance state, snapshotting, rewinding, etc., and the implementation-level mechanism of CoW-mapping a heap image.

If your embedding doesn't work well with the pooling instance allocator then I think we'll need to brainstorm a solution which "merges" your work here with @cfallin's on the pooling allocator, taking into account the feedback around snapshots (which I personally agree is best to separate and ideally make the copy-on-write business a simple config option of "go faster")

In my initial prototype implementation I actually tried to do this in the same vein as the current pooling allocator, but in the end decided to go with the current approach. Let me explain.

Basically we have three main requirements:

  1. Be as fast as possible. (At least as fast as what we currently use, which is the legacy_instance_reuse you see on the graphs.)
  2. Be robust. (The instantiation can not fail.)
  3. Be simple to use and maintainable. (Although we can live with this not being the case if absolutely necessary.)

One of the problems with the current pooling allocator (ignoring how it performs) is that it fails at (2) and somewhat at (3), and isn't a simple "go faster" option that you can just blindly toggle. You have to manually specify module and instance limits (and if the WASM blob changes too significantly you need to modify them), and you need to maintain a separate codepath (basically keep another separate Engines and the same Module twice) for the case when instantiation fails.

So personally I think it just makes more sense (especially for something so fundamental and low level as wasmtime) to just let the user explicitly do the pooling themselves, which is very easy to do with the approach in this PR - just preinstantiate as many instances as you want into a Vec, pop one on use, push them back when you're done, and if you run out of pooled instances you can just easily instantiate one from scratch with a single extra if.

I also considered integrating this into InstancePre directly (or into a separate type) - that is, there would be no reset, and the instances would automatically reset themselves when dropped and returned into a pool inside of the InstancePre and then automatically and transparently reused when InstancePre::instantiate would be called (so that would actually be a transparent "go faster" option), but decided that it'd be better to just let the user control the pooling themselves since it's so simple to do anyway. (And also the way how everything is stored inside of the Store and not the Instance would make that somewhat awkward, since you want to reuse both the Instance and the Store.)

Basically, in our usecase we don't really care about snapshotting at all (it's just an implementation detail to make things go fast), and all we just need is to be able to instantiate clean instances as fast as possible.

Would this be more acceptable to you if API-wise we'd make it less like it's snapshotting and more like an explicit way to pool instances?

I think these are some things we should talk through after I've put up my PR and we can do comparisons. I'm really grateful for the ton of effort you put into this and look forward to comparing the approaches in more detail!

Sounds good to me! I'll hook up your PR into our benchmarks so that we can compare the performance.

@alexcrichton
Copy link
Member

Ok @koute so to follow up on comments and work from before, #3733 is the final major optimization for instantiation that we know of to implement. There's probably some very minor wins still remaining, but that's the lion's share of improvements that we're going to get into wasmtime (that plus memfd).

Could you try re-running your benchmark numbers with the old strategy y'all are currently using, your proposal in this PR, and then #3733 as a PR? Note that when using #3733 using the on-demand allocator, while maybe a little bit slower than the pooling allocator, should still be fine. The intention with #3733 is that it's fast enough that you won't need to maintain a pool of instances, and each "reuse" of an instance can perform the full re-instantiation process. Note that for re-instantiation it's recommended to start from an InstancePre<T> which is the fastest way today to instantiate something.

My prediction is that the time-to-instantiate #3733 is likely quite close to the strategy outlined in this PR. It will probably look a little different one way or another, but that's what I'm curious to see if #3733 works for your use case in terms of robustness and performance.

If you're up for it then it might be interesting to test the pooling allocator as well. I realize that the pooling allocator as-is isn't a great fit for your use case due to it being too constrained, but as a one-off measurement of numbers it might help give an idea of the performance tradeoff between the pooling allocator and the on-demand allocator. Also unless you're specifically interested in concurrent instantiation performance it's fine to only get single-threaded instantiation numbers. It's expected that #3733 does not scale well with cores (like this PR) due to the IPIs necessary at the kernel level with all the calls to madvise. In that sense the more interesting comparison is probably single-threaded numbers (again unless you're specifically interested in the multi-threaded numbers as well)

@alexcrichton
Copy link
Member

Oh and for now memfd is disabled-by-default, so with #3733 you'll need to execute config.memfd(true) as part of Wasmtime's configuration to be sure that it's enabled. If you're seeing instantiation take more than double-digit microseconds then Wasmtime may not be configured correctly and I can help dig in. It's expected, though, that instantiation is probably in the single-digit microseconds.

@koute
Copy link
Contributor Author

koute commented Feb 11, 2022

@alexcrichton Got it! I'll rerun all of the benchmarks next week and I'll get back to you.

@koute
Copy link
Contributor Author

koute commented Feb 18, 2022

@alexcrichton Sorry for the delay! I updated to the newest main of wasmtime, ported over this PR and reran the benchmarks. Here are the results:

call_empty_function
dirty_1mb_of_memory

In table form:

call_empty_function
| threads | legacy_instance_reuse | native_instance_reuse | recreate_instance_memfd_only | recreate_instance_pooling_memfd | recreate_instance_pooling_only | recreate_instance_pooling_uffd | recreate_instance_vanilla |
|---------|-----------------------|-----------------------|------------------------------|---------------------------------|--------------------------------|--------------------------------|---------------------------|
| 1       | 48                    | 4                     | 17                           | 8                               | 58                             | 25                             | 65                        |
| 2       | 67                    | 11                    | 43                           | 21                              | 88                             | 36                             | 109                       |
| 4       | 91                    | 17                    | 98                           | 31                              | 156                            | 60                             | 195                       |
| 8       | 160                   | 25                    | 305                          | 45                              | 335                            | 135                            | 434                       |
| 16      | 228                   | 35                    | 634                          | 64                              | 847                            | 271                            | 1140                      |

dirty_1mb_of_memory
| threads | legacy_instance_reuse | native_instance_reuse | recreate_instance_memfd_only | recreate_instance_pooling_memfd | recreate_instance_pooling_only | recreate_instance_pooling_uffd | recreate_instance_vanilla |
|---------|-----------------------|-----------------------|------------------------------|---------------------------------|--------------------------------|--------------------------------|---------------------------|
| 1       | 185                   | 145                   | 159                          | 150                             | 196                            | 293                            | 204                       |
| 2       | 273                   | 212                   | 274                          | 219                             | 289                            | 835                            | 333                       |
| 4       | 374                   | 310                   | 494                          | 311                             | 499                            | 1305                           | 553                       |
| 8       | 738                   | 520                   | 967                          | 531                             | 919                            | 1913                           | 1074                      |
| 16      | 945                   | 734                   | 2233                         | 767                             | 1808                           | 3122                           | 1985                      |

When a lot of memory is dirtied it is indeed competitive now, and when not a lot of memory is touched it also performs quite well now! Of course this is assuming both memfd and pooling is enabled.

So I'd like to ask here - are there any plans to make the pooling less painful to use? Something like this would be ideal:

config.allocation_strategy(InstanceAllocationStrategy::Pooling {
    strategy: Default::default(),
    instance_limit: 8
});

Basically completely get rid of the module_limits and have it work with modules of any size, and change the instance_limit to be a soft cap instead of a hard cap, that is - instead of returning an error when the maximum concurrent instance limit is reached it would just transparently create a new OnDemand instance. That would make the pooling strategy truly a drop-in replacement without having to hack around it.

@alexcrichton
Copy link
Member

Hm so actually the "only memfd" line, which I'm assuming is using the default on-demand allocator, is performing much worse than expected. The cost of the on-demand allocator is an extra mmap-or-two and it's not quite as efficient on reuse (a few mmap calls instead of one madvise). In that sense I wouldn't expect it to be an almost order of magnitude slower in the measurements, instead what I've been seeing locally is that it's off by some constant factor ish from the pooling allocator. Do you have some code I could read to double-check the on-demand allocator was configured correctly?

Otherwise though I definitely think we can improve the story with the usability of the pooling allocator. The reason that the module limits and such exist are so that we can create appropriate space for the internal *mut VMContext allocation for each instance which is sized to each instance. I think, though, instead of specifically configuring each field and its limits it would be easier to say "each instance gets N bytes of memory" and then during instantiation if that's exceeded instantiation fails with a descriptive error message. That would mean that a generous amount, such as a megabyte or two, could be allocated for instances and you wouldn't have to worry about tweaking specific limits.

Again though I'm surprised that the on-demand allocator is performing as bad as it did in your measurements, as my suspicion was that it would be sufficient for your use case. I think configuring the pooling allocator by allocation size is probably good to do no matter what, though.

@alexcrichton
Copy link
Member

Actually thinking about this some more, my measurements and impression about the relative cost of these is primarily in the single-threaded case, I haven't looked too too closely at the multithreaded bits. Does your use case stress the multi-threaded aspect heavily? If so we can try to dig in some more, but I'm not sure if you're measuring the multi-threaded performance at the behest of an old request of ours or your own project's motivations as well.

As a point of comparison for a 16-threaded scenario I get:

on-demand pooling
instantiate 126us 660us
empty 327us 690us
dirty 3025us 1691us

(hence some of my surprise, but again I haven't dug into these numbers much myself, especially the discrepancy between pooling/on-demand and how the speedup changes depending on the allocation strategy)

@koute
Copy link
Contributor Author

koute commented Feb 23, 2022

Do you have some code I could read to double-check the on-demand allocator was configured correctly?

Sure.

This is the crate where we encapsulated our use of wasmtime: (I'm linking to a fork which I used for benchmarking since it has some extra changes which haven't yet landed in production; please excuse the somewhat messy code in places.)

https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_2/client/executor/wasmtime

Let me give you a quick step-by-step walkthrough of the code:

  1. The engine's configured.
  2. The pooling's configured. (optional)
  3. The Engine is created.
  4. Any potential memory imports are converted into memory exports. (Since we don't really need to support imported memories, and last time I tried there was big performance difference where the pooling allocator got a lot slower when using imported memories; maybe that's why you're seeing the difference here with on-demand vs pooling?)
  5. The Module is created.
  6. The Linker is created.
  7. Any function imports are registered within the Linker.
  8. We create an InstancePre. From now we'll exclusively use this InstancePre to instantiate new instances.
  9. Then on each instantiation:
    a. We create a new Store. (Since Stores can't be reused.)
    b. We instantiate a new instance through InstancePre.
    c. We call a single function inside of the WASM blob.
    d. Both the Instance and the Store are then dropped.

And my benchmarks basically just loop (9) over and over again.

Does your use case stress the multi-threaded aspect heavily?

In certain cases we do call into WASM from multiple threads at the same time, so we do care about the multi-threaded aspect, but an instantiated instance never leaves the thread on which the instantiation happened. (Basically the whole number 9 is always executed in one go without the concrete instance being sent to any other thread nor being called twice. [Unless our legacy instance reuse mechanism is used, but we want to get rid of that.])

@koute
Copy link
Contributor Author

koute commented Feb 23, 2022

Okay, so since the current memfd + pooling allocation strategy is fast enough (our primary goal was to remove our legacy instantiation scheme without compromising on performance, and ideally to also get a speedup if possible) I'm going to close this PR now.

As I've said previously I'm not exactly thrilled by the API to be able to use the pooling allocator without introducing arbitrary hard limits, but ultimately that's not a dealbreaker and we can hack around it. (:

(I'm of course still happy to answer questions and help out if necessary, so please feel free to ping me if needed.)

@koute koute closed this Feb 23, 2022
@alexcrichton
Copy link
Member

Ok cool thanks for the links, it looks like nothing is amiss there. I also forget that the machine I'm working on is an 80-core arm64 machine where IPIs are likely more expensive than smaller-core-count machines, so that would likely help explain the discrepancy.

If it helps I put up #3837 which removes ModuleLimits and folds a few things into the InstanceLimits structure. You'll probably want tables and memories set to 1 (since you're only supporting MVP wasm anyway), memory_pages set to maximum 65536 (as it's just maximally allowed memory, not committed memory). You'll need to manually configure size and table_elements though for your desired maximums. Increasing them will increase the committed memory of the pooling allocator (well, mmapped-as-zero I guess and it's not actually committed until it's accessed).

That hopefully makes things a bit more usable!

Oh also for imported memories, it's true that right now for copy-on-write-based initialization we do not apply the optimization to imported memories, only to locally defined memories. All of the "interesting" modules we've been optimizing and care about the runtime for all define their own memory and export it, but it's not impossible to support imported memories and if you've got a use case we can look to implement support for imported memories.

@koute
Copy link
Contributor Author

koute commented Feb 24, 2022

Ok cool thanks for the links, it looks like nothing is amiss there. I also forget that the machine I'm working on is an 80-core arm64 machine where IPIs are likely more expensive than smaller-core-count machines, so that would likely help explain the discrepancy.

That could potentially affect things, yes; I was testing this on a mere 32-core machine after all. (:

If it helps I put up #3837 which removes ModuleLimits and folds a few things into the InstanceLimits structure. You'll probably want tables and memories set to 1 (since you're only supporting MVP wasm anyway), memory_pages set to maximum 65536 (as it's just maximally allowed memory, not committed memory). You'll need to manually configure size and table_elements though for your desired maximums. Increasing them will increase the committed memory of the pooling allocator (well, mmapped-as-zero I guess and it's not actually committed until it's accessed).

That hopefully makes things a bit more usable!

It is indeed a significant improvement! Thanks!

Oh also for imported memories, it's true that right now for copy-on-write-based initialization we do not apply the optimization to imported memories, only to locally defined memories. All of the "interesting" modules we've been optimizing and care about the runtime for all define their own memory and export it, but it's not impossible to support imported memories and if you've got a use case we can look to implement support for imported memories.

In our use case at this point we're fine with the way it is currently. We need to support WASM blobs of either type, but we can just easily patch one into the other if necessary so that the memory's always defined internally and exported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants