
FIP: introducing the Filecoin Virtual Machine. #288

Merged: 9 commits, Feb 17, 2022

Conversation


@raulk raulk commented Feb 2, 2022

The first of several FIP submissions concerning the introduction of the Filecoin Virtual Machine as the new execution layer of the Filecoin blockchain.

This is an early draft that needs some refinement.

See discussion at #287.


Thanks to @Stebalien and @anorth for early reviews.

@raulk raulk force-pushed the raulk/introducing-fvm branch 3 times, most recently from f46de6c to 0897e69 Compare February 2, 2022 21:49
@raulk raulk force-pushed the raulk/introducing-fvm branch from 0897e69 to b27ab8c Compare February 2, 2022 21:51
@anorth anorth left a comment

The IPLD model raises some interesting questions. My impression is that it's landed somewhere in an awkward middle ground, where the encoding is done in actors, but the VM must be able to interpret that encoded data and fundamentally models reachability.

I think either extreme of the VM being oblivious, or the VM actually implementing the IPLD model could be better. Oblivious means actors APIs and state are just bytes, the VM is simpler, but actors would have to do some work to ensure they don't store unreachable blocks, etc. The VM actually implementing IPLD is very opinionated (which... I think we want?) but efficient and a great programming model for actors, at the cost of more complex VM. The middle ground looks like it's going to need most of the complexity anyway, but with added complexity/cost of needing to validate encoding done in the actors.

Wasm with 32-bit memories supports 64-bit integer types and arithmetic, it just
can't address memory beyond 4GiB. This size is deemed sufficient for blockchain
state transition use cases, but may be limiting for future and more generalised
applications of the FVM.
Member

For discussion: this might limit possible universality of FVM as an execution environment outside the blockchain state transition function. For example, an FVM rollup could have far greater demands on hardware, including memory, to enable much more complex and resource-intensive programs to be written, while retaining Filecoin's blockchain security. Similarly, for compute-over-data projects there's a potential world in which something very like the FVM is the target for code to run inside compute providers, and support very large memory execution environments. It could be very powerful to have seamless, binary interop between rollups and off-chain compute. Taken to the extreme, it might be that we develop a zk-FVM which could do both.

I get the current requirements are rightly focussed on the blockchain state use case, but is there a big gain there from taking the 32-bit route? Is there a possible path to 64-bit in the future (like a compiled contract indicating the wordsize it's built for)?

Member Author

64-bit memory addressing is a proposed extension to the WASM spec; it doesn't seem to be final yet. This is the latest I could find on a quick search: https://twitter.com/wvo/status/1362458368107941888.

@raulk raulk Feb 8, 2022

I agree that other use cases that need virtualized runtimes would benefit from 64-bit addressing. However, I'm more skeptical on whether the FVM is the exact virtualized runtime they'd adopt. The FVM is built for state tree manipulation within a blockchain, uses blockchain messages as units of execution, and exposes blockchain-related operations through syscalls.

However, I do feel that the FVM is an instantiation of a "thin waist" interplanetary VM substrate that is WASM-based and IPLD-native. In the future, I would be agreeable to formally defining such a VM substrate (by extracting the relevant parts from this spec), and rebasing the FVM spec on top of it. That would probably get more groups involved.

Specialized VMs for compute-over-data (off-chain), Lurk, etc. would also be instantiations of that VM, albeit with support for different WASM features (memory64, SIMD, interface types, threads, etc.) and different syscalls available to hosted code.

The signature of an actor entrypoint is:

```rust
pub fn invoke(params_block_id: u32) -> u32 { }
```
Member

To implement internal dispatch in an extensible way, compatible with standard APIs etc, I guess that actors will all need to define something like:

```
struct Params {
  MethodName string
  Params []byte // CBOR-encoded, to be decoded by receiver
}
```
  • An SDK can wrap this, but for interoperability we should probably define a standard for how actors should interpret parameters.
  • This would be incompatible with the current built-in actors' parameter shapes.

This is good in a sense of giving actors full control over their API (they don't even have to use IPLD/CBOR), but a bit awkward. For ease of use and better interop between the built-in and new actors, perhaps we should consider the VM taking a slightly more active role in dispatch? E.g. the VM could interpret that params struct above, along with the method number, and then invoke the target actor with something like invoke(method_name, method_params)? Perhaps actors could declare which calling convention they use, leaving room for others (including the "raw" method currently specced)?

Member

We definitely need sane defaults, but I'd really like actors to be able to specify their own ABI and API via some form of schema. For example:

```
type Bar struct {
...
}

type MyActor interface {
    func Transfer(foo int, ...) Bar (id:1)
} representation "inline"
```

The interface would likely compile to some mapping between methods and message "templates". E.g.:

  • Transfer(foo) -> {"method": 1, "foo": $foo}
  • Transfer(foo) -> (1, $foo)

This gives us some flexibility as actors can use method numbers, names, hashed names, polymorphic whatever, etc.


That being said... as long as we have method numbers, there's no reason not to pass them in invoke. This is more of a "if we find a nice way to get rid of them".


E.g. the VM could interpret that params struct above, along with the method number, and then invoke the target actor with something like invoke(method_name, method_params)?

Note: passing data into actors via invoke is non-trivial because the actor can't access the host's memory. The current approach lets the actor "read" the data from the system, giving it a chance to allocate the necessary space.

Member Author

Encapsulating all call data into a single object enables call options like proxying/delegation, pattern matching, etc. Furthermore, having to choose how to identify methods in an actor whose code can be updated (assuming we do commit to implement code updates) makes the user think upfront about how they'd identify/deal with each invocable behaviour across updates.

Long term, the idea is to have an IPLD-based IDL that actors can advertise as metadata to enable discoverability by other developers (e.g. for composability) and by tooling (e.g. message inspectors). The IDL will express which behaviours an actor offers, its interface (inputs, outputs, errors), and how those behaviours can be called (call patterns). Because the method number is so ingrained in the protocol, it's likely that it will remain there forever, and it'll be a glorified call pattern available for users.


```rust
/// Creates a new block, returning the block's ID. The block's children must be in the reachable
/// set. The new block isn't added to the reachable set until the CID is computed.
pub fn create(codec: u64, data: *const u8, len: u32) -> Result<u32>;
```
Member

Could you confirm my inference here that encoding of the block is happening in contract space, prior to this call? Thus the codec parameter is expected to match the data (how is it checked?), but isn't instructing the VM how to do the encoding internally. So any (?) codec, including future ones is allowed? This will make tracking the links difficult.

It appears that we must specify a supported list of codecs that the VM has code to read. And if we're doing that, it's a small step to also do the encoding in the VM, rather than in WASM.

Member

IPLD blocks are encoded actor-side and checked VM-side. For now, CBOR will be the only expected codec but we expect to add support for other codecs later.

This will make tracking the links difficult.

At a minimum, the VM will need a single function (either native or some WASM module) per-codec to extract links from blocks.

And if we're doing that, it's a small step to also do the encoding in the VM, rather than in WASM.

Unfortunately not. Encoding VM-side is non-trivial as we need to get the data across the FFI boundary somehow. We'd need to specify something like serde's builder pattern (and will lose a fair amount of performance by repeatedly crossing the FFI boundary and not being able to inline anything).

I responded more fully below in #288 (comment).

Member Author

Sidenote: the codec passed here is later used when calculating the CID for the block.

@raulk raulk Feb 8, 2022

A visitor pattern over FFI sounds slow. The VM doing the encoding doesn't seem plausible at all, unless it has knowledge of the memory layout and data types of the programming language the actor was developed with, which is undesirable. Alternatively, we could marshal the data across the boundary in some canonical form... but wait, isn't that stepping on the toes of IPLD? 😉

Member

A canonical form doesn't sound that bad to me. E.g. something like the Python struct module.

I don't think IPLD is trying to be an efficient in-memory format. It's a storage/distribution format.

```rust
pub fn receiver() -> Result<u64>;

/// Returns the method number from the message.
pub fn method_number() -> Result<u64>;
```
Member

How about passing this as parameter to invoke instead? Do we want to be opinionated about its use?

Member

We didn't do that because we'd like to find a way for this to go away (using only the parameters). But if that's too difficult (from a network upgrade standpoint), I agree that passing this to invoke makes sense.

Member

For clarity, I meant passing it to invoke and removing this method, thus making it harder for anyone to misuse and setting a path for it to go away.

Member

For clarity, I meant passing it to invoke and removing this method, thus making it harder for anyone to misuse and setting a path for it to go away.

Well, we'd like to make the method number itself go away, not just the syscall.

@raulk raulk Feb 8, 2022

@anorth That would further bless the "method number" call pattern into the FVM, something we're trying to avoid entirely.

Member

I'm on board with trying to get rid of it entirely. I don't see how having a syscall that's accessible from any actor code all the time is less of a blessing than passing it only to a single entry point method and forcing the callee to do more complex programming if they want to propagate it. But ok, this is your design call.

FIPS/fip-introducing_fvm.md (outdated; resolved)
IPLD blocks are dynamically-sized blobs of data. When loading a block by CID,
the size is unknown, so the actor cannot allocate memory upfront. Moreover,
host-side memory is inaccessible to Wasm code, so it's not possible for the
Kernel to return a pointer to host-side memory for Wasm code to read.
Member

If the VM decodes, the actor can allocate the structure to decode into up-front.

Member

Not without a lot of back-and-forth. I.e.:

  1. Actor: Load CID1.
  2. VM: Found a list of length 5.
  3. Actor: Ok.
  4. VM: The first item is a number.
  5. Actor: Write the number to *some_ptr.
  6. VM: The second item is a string of length 10.
  7. Actor: write it to *some_other_ptr.
  8. etc, etc.

@anorth anorth Feb 7, 2022

Ok I see why that API would be troublesome, but can't we do better? How about:

  1. Actor: Load CID1, expecting schema A
  2. VM: Ok, here's a pointer to newly allocated memory, which you can cast

Obviously I glossed over details, but would it be unreasonable for the VM to, say, allocate memory dynamically to decode into? Maybe with a gas limit for the call?

@Stebalien Stebalien Feb 8, 2022

Not without making a lot of assumptions about the actor's programming language (or defining some form of flat-buffer abstraction).

  1. Currently, the VM never allocates memory because the VM knows nothing about the actor's memory layout (i.e., the actor "owns" its entire memory). To allocate, it would have to recursively call back into the actor (which I'd like to avoid if possible).
  2. Different programming languages have different concepts of strings, vectors, data layout, etc. We can pass trivial structs (structs of ints, arrays of ints, etc.) back and forth, but nothing more complex without encoding through something like CBOR.

Member

I ack that designing a language-independent in-memory form for this is not a simple problem. Fine to leave it as something to maybe investigate later.

## Backwards Compatibility

Existing built-in actor and system functionality will not be affected by the
introduction of the Filecoin Virtual Machine.
Member

I think there's more to say here. Perhaps spell out that this FIP actually doesn't represent a protocol change, only a suggested re-architecture. Point to the future FIPs that will break compatibility.

## Test Cases

FVM implementations can subject themselves to the FVM-specific test vectors
hosted at: https://github.com/filecoin-project/fvm-test-vectors.
Member

This link doesn't resolve.

Because this proposal doesn't actually change the protocol, should we also point to (and execute) all the existing test vectors? Also all the existing actor unit and integration tests?

Worth noting here too that the roll-out plan is to demonstrate compatibility with the existing chain, resyncing since the last actors version. Actually, I just realised this FIP doesn't address the intention to trigger a network upgrade that switches the canonical spec of the actors to the Rust/WASM version? Am I right to infer that belongs here (because the next FIP will be the gas model change)?

Member

This link doesn't resolve.

Try again? It does for me.

Because this proposal doesn't actually change the protocol, should we also point to (and execute) all the existing test vectors? Also all the existing actor unit and integration tests?

  1. Most of the existing test vectors tested prior network versions. The FVM only supports actors v6 (and soon, v7).
  2. Some of the test vectors were extracted from specs-actors.
  3. In terms of integration tests, lotus integration is underway but the tests won't pass until we have v7 support.

Actually, I just realised this FIP doesn't address the intention to trigger a network upgrade that switches the canonical spec of the actors to the Rust/WASM version? Am I right to infer that belongs here (because the next FIP will be the gas model change)?

Raul is working on a separate FIP proposing a network upgrade. This FIP is setting design direction: switching Filecoin over to a WASM/IPLD VM.

Member

Link works now.

Does the Rust actors codebase faithfully replicate all the unit tests from the Go one? Or did it rely on the comprehensiveness of the Go code for logical correctness checking, and focus on conformance?

Member

The rust actor codebase has very few unit tests. Pretty much all of our test coverage currently comes from extracted test vectors (both from the specs-actors tests and from the network).

Member Author

This is a problem we should solve on the road to M1. Given that Rust actors are likely becoming the canonical actors, they should be as well-tested as Go actors, if not better.

@Stebalien

Stebalien commented Feb 7, 2022

@anorth

The IPLD model raises some interesting questions. My impression is that it's landed somewhere in an awkward middle ground, where the encoding is done in actors, but the VM must be able to interpret that encoded data and fundamentally models reachability.

Specifically, it must be able to:

  1. Validate data (so that FVM implementations deterministically accept/reject objects).
  2. Extract links.

I think either extreme of the VM being oblivious, or the VM actually implementing the IPLD model could be better. Oblivious means actors APIs and state are just bytes, the VM is simpler, but actors would have to do some work to ensure they don't store unreachable blocks, etc. The VM actually implementing IPLD is very opinionated (which... I think we want?) but efficient and a great programming model for actors, at the cost of more complex VM. The middle ground looks like it's going to need most of the complexity anyway, but with added complexity/cost of needing to validate encoding done in the actors.

If we keep everything inside actors, managing IPLD data becomes tricky:

  • Garbage collection (all block management, really) now needs to ask the actor what blocks it thinks should be reachable. Worse, garbage collection would likely need to be eager to ensure that data is deterministically (un)reachable.
  • If IPLD data is passed from actor A to actor B and these actors have different "opinions" on reachability, we'll run into some really strange issues.

Doing all IPLD operations outside is more workable, but will likely lead to worse performance:

  • We'll lose lots of optimization/inlining opportunities and we'll have to repeatedly cross the invocation container boundary. Internally, CBOR decoding can be optimized by the compiler because it knows the datastructure it's decoding into. On the other hand, a "good enough" decoding API (likely modeled after serde) might be efficient.
  • Gas accounting becomes tricky as we're doing more work VM-side.
  • The API surface gets pretty large (at least if we want any amount of performance).

The third option is to implement the codecs as separate WASM modules. That way:

  1. Actors can use whatever codec implementation they want internally (for performance). Or "import" the system's module for smaller code.
  2. The VM no longer needs to understand specific codecs.
  3. The single "system" codec determines reachability so there are no funny surprises.
  4. We can eventually look into supporting custom user codecs.
  5. We can share these codec implementations with IPFS, etc.


- Explicit messages are sent by secp256k1 or BLS account actors, appear on-chain
when executed, and carry a signature to prove their authenticity.
- Implicit messages are generated by the system, do not appear on chain, and
@frrist frrist Feb 9, 2022

Projects such as Lily (and consumers of its data) must be able to observe implicit and internal messages (e.g. sends from multisig actors that do not appear on chain). Currently this is handled via the ExecMonitor interface. Can we ensure the switch to the FVM doesn't break this observability requirement?

Member Author

I investigated this and concluded that the FVM doesn't need to know about the ExecMonitor callback; it is entirely handled at the node level.

The FVM does need to generate execution traces, but I'd be inclined to say that this is an optional feature which Lotus supports. I'm not sure we need to specify it in the spec, but I'll give it some extra thought.

Member Author

Tracking here now: filecoin-project/ref-fvm#318.

@raulk raulk force-pushed the raulk/introducing-fvm branch from a6e8820 to 5d7d0cb Compare February 11, 2022 14:37
@anorth anorth left a comment

With my FIP editor hat, approving this as a well-formed and -motivated proposal.

As a note to myself and other collaborators, almost all the comments made here would have been better made in the associated discussion thread #287. That way, the discussion would remain discoverable and in-flight while the FIP progresses. We should attempt to limit comments on FIP documents to those about the appropriateness, form and completeness of a proposal, rather than its substantive content.

@jennijuju jennijuju merged commit 27920d2 into master Feb 17, 2022
@jennijuju jennijuju deleted the raulk/introducing-fvm branch February 17, 2022 01:28