
Proposal For An Actor System Based On Mojo #1445

Closed
wants to merge 3 commits

Conversation


@reid-spencer reid-spencer commented Dec 8, 2023

This is currently a work in progress. There are no code changes, just a proposal written in the proposals section. This was pre-approved by Chris Lattner in a conversation in June 2023.

I will keep working on this as I have time, but it is far enough along that I'm looking for feedback and assistance from interested parties.

I will take it out of draft mode when it's a little further along.

Signed-off-by: Reid Spencer reid@ossuminc.com

@reid-spencer reid-spencer marked this pull request as draft December 8, 2023 22:11
reidspencer and others added 2 commits December 8, 2023 20:55
Signed-off-by: Reid Spencer <reid-spencer@users.noreply.github.com>
@lattner
Collaborator

lattner commented Dec 10, 2023

This is cool, Reid, thank you for putting this together. We're quite a bit too early to invest in this area IMO (we need to get traits much further along and complete lifetimes), but I think this is a very likely long-term direction. If you're interested: Actors got built into Swift with a more complex model than was in the manifesto; you can read about it here, or in several swift-evolution proposals:
https://docs.swift.org/swift-book/documentation/the-swift-programming-language/concurrency#Actors

I do hope we can eschew the complexity; a lot of it is due to legacy interop with Apple frameworks. OTOH, we may need such things to work with legacy Python and other libs.

@reid-spencer
Author

@lattner - I understand the language's earliness, but I think there's value in starting the Actors project early. So, I have started already: https://github.com/ossuminc/moxy (extremely nascent). In the proposal, I've tried to minimize the requirements on Mojo. The recent introduction of traits allowed me to get started; all else that is needed is default implementations in the traits. Later on, when Mojo has matured, it would be interesting to integrate an ASIC or GPU to help with extremely fast message dispatch. All in good time. I'm happy to start this work without the involvement of Modular's time/resources, at least for now.

Thanks for the reference to the Actor implementation in Swift. I am examining several actor systems to try to glean the winning strategies from their patterns. Akka is my strongest entry knowledge, but I'm open to merging the best ideas from other ecosystems.

I plan to leave interoperability to the end of actor development and not sacrifice simplicity or performance. In other words, interoperability will have its own complexity and costs, as an add-on.

@Brian-M-J
Contributor

Brian-M-J commented Jan 1, 2024

Senders and Receivers

I have a suggestion for the mojo-features-needed.md concurrency section: senders and receivers, a.k.a. std::execution (read p2300 for the full technical description).[1]

Basically, it's a universal abstraction to express all concurrency and parallelism without the need for locks. One of p2300's authors, Eric Niebler, calls them "lazy futures".[2] For a more technical explanation of the work it builds on (i.e. delimited continuations[3] and monads), you can read p2300 itself and/or read this article. This presentation is also helpful.
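To make the "lazy future" idea concrete, here is a toy sketch of the sender/receiver shape in Python. The names (`Just`, `Then`, `SyncWait`) loosely mirror p2300 concepts but are my own invention; this illustrates only the laziness, not the actual std::execution API:

```python
# Toy model of the sender/receiver idea (illustrative only; this is not
# the std::execution API, just the shape of it in Python).

class Operation:
    """An 'operation state': connected work that runs only when started."""
    def __init__(self, start_fn):
        self._start = start_fn
    def start(self):
        self._start()

class Just:
    """A sender that, when started, delivers a fixed value."""
    def __init__(self, value):
        self.value = value
    def connect(self, receiver):
        # Connecting produces an operation state; nothing runs yet.
        return Operation(lambda: receiver.set_value(self.value))

class Then:
    """A sender adaptor: applies fn to the upstream sender's value."""
    def __init__(self, upstream, fn):
        self.upstream, self.fn = upstream, fn
    def connect(self, receiver):
        outer = self
        class _Recv:
            def set_value(_, v):
                receiver.set_value(outer.fn(v))
        return self.upstream.connect(_Recv())

class SyncWait:
    """A trivial receiver that records the delivered value."""
    def __init__(self):
        self.result = None
    def set_value(self, v):
        self.result = v

# Compose lazily: no work happens until start() is called.
pipeline = Then(Then(Just(20), lambda x: x + 1), lambda x: x * 2)
recv = SyncWait()
op = pipeline.connect(recv)   # still lazy
op.start()                    # now the whole chain runs
print(recv.result)            # -> 42
```

The key property the quotes above keep returning to is visible even in this toy: composing `Then` over `Just` allocates nothing and launches nothing; the whole pipeline is an inert description until `start()`.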

Of particular note are the theoretical results in the paper p2504 - Computations as a global solution to concurrency (The paper refers to senders as 'computations'). The findings are summarized in this article:

Lucian Radu Teodorescu summarizing p2504

The senders/receivers model allows us to describe any computation (i.e., any concurrent chunk of work) as one sender. A proof for this statement would be too lengthy to show here, and the reader can find it in [P2504R0]. This paper also shows the following:

  • all programs can be described in terms of senders, without the need of synchronisation primitives
  • any part of a program, that has one entry point and one exit point, can be described as a sender
  • the entire program can be described as one sender
  • any sufficiently large concurrent chunks of work can be decomposed into smaller chunks of work, which can be described with senders
  • programs can be implemented using senders using maximum efficiency (under certain assumptions)

In summary, all concurrent single-entry single-exit chunks of work, i.e., computations, can be modelled with senders.

It is important to note that computations fully encapsulate concurrency concerns. Computations are to concurrent programming what functions are to Structured Programming. Computations are the concurrent version of functions.

This article states many advantages of senders and receivers:

Lucian Radu Teodorescu on the advantages of senders and receivers
  • ability to represent all computations (from simple ones to complex ones)
  • ability to represent computations that have inner parts that can be executed on different execution contexts (e.g., different threads, different computation units, different machines)
  • can use computations to solve all concurrency problems (without using synchronisation primitives in the user code)
  • composability
  • proper error handling and cancellation support
  • no memory allocations needed to compose basic computations
  • no blocking waits needed to implement most of the algorithms
  • allows flexibility in specializing algorithms, thus allowing implementors to create highly efficient implementations
  • ability to interoperate with coroutines

Senders and receivers enable structured concurrency. Using them, you can make higher level abstractions for lockless concurrency and parallelism:

It’s also worth mentioning that the term computation applies to multiple paradigms. It can be easily used to describe imperative work, it can be well assimilated by functional programmers, it can apply to reactive programming and to all stream-based paradigms; although we haven’t talked about it, we can think of computations also in the context of the actor model.

So we can use senders and receivers to implement the actor system, while allowing users to build any concurrency abstraction they need without the need for locks. I think this aligns very well with Mojo's goals of providing sensible defaults and high level abstractions while exposing as much as possible as libraries so users can easily build their own abstractions.
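As a rough illustration of why actors fit naturally on top of such a model, here is a minimal, hypothetical actor sketch in Python: a mailbox drained by a single worker, so the actor's state is never touched by two tasks at once and user code needs no locks. A real S&R-based actor library would build this from senders rather than raw threads; this only shows the invariant being preserved:

```python
# Minimal actor sketch: a mailbox drained by one worker, so the actor's
# state is only ever touched by one task at a time (no locks in user code).
# Illustrative only; not a real S&R-based actor library.
import queue
import threading

class Actor:
    def __init__(self, behavior, initial_state):
        self._mailbox = queue.Queue()
        self._state = initial_state
        self._behavior = behavior
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, msg):
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg is None:          # poison pill: stop the actor
                break
            # Messages are processed strictly one at a time.
            self._state = self._behavior(self._state, msg)

    def stop(self):
        self._mailbox.put(None)
        self._worker.join()

# A counter actor: state is an int, messages are increments.
counter = Actor(lambda state, msg: state + msg, 0)
for _ in range(100):
    counter.send(1)
counter.stop()
print(counter._state)  # -> 100 (reading the private field only for the demo)
```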

Lucian Radu Teodorescu praising S&R

Computations are to concurrency what functions are to imperative programming.

This may be a bold statement, but after these results, we might dare to say that computations solve concurrency.

Source

Niall Douglas praising S&R

Sender-Receiver is genius in my opinion. It’s so damn simple people can’t see just how game changing it is: it makes possible fully deterministic, ultra high performance, extensible, composable, asynchronous standard i/o. That’s huge. No other contemporary systems programming language would have that: not Rust, not Go, not even Erlang.

Source

Eric Niebler defending S&R

No appetite for the ability to launch whole graphs of async operations with zero allocations guaranteed? No appetite for an async model that obviates the need for reference counting? No appetite for async code where scopes and lifetimes nest, making RAII work as expected and reasoning about code possible? How about an async model that facilitates generic async algorithms that don't have to guess about whether some value of some type represents an error or not, or that gives chains of work a chance to clean up on cancellation? Or an async model that doesn't require writing custom allocators to get good performance? Then I'm sure they are also not interested in an async model built on a solid theoretical foundation stretching back to the 1950's.

Source

It's not all sunshine and rainbows

As explained in this video, senders and receivers have a lot of advantages, but the C++ implementation has some flaws too:

Safety

Performance

  • There is a little dynamic memory allocation needed in some cases, and the time spent in scheduling needs to be accounted for and minimized.
  • Some more concerns from here:
Niall Douglas on the disadvantages of universal generic async APIs

[...] at some things [the executor API] is very efficient, but that universal async composition comes at a price: for some combinations of async there is a lot of internal thunking, which has a performance hit. And some compositions have lousy performance e.g. single byte async socket i/o is going to suck, but single byte keyboard i/o is highly desirable. But in the end, that's the price you pay for the universal composition without regard to implementation, and what you do get which is positive is portability and genericity.

There is a desire that async should be universally composable and generic; then you can define generic async algorithms which universally apply to all possible kinds of async. Some of the opposition to Executors e.g. P2024 Bloomberg Analysis of Unified Executors stems from the concern that performance will suck for various mixes of incompatible stuff. This is why I proposed P2052 as an "escape hatch" for those people who:

  1. Want fully deterministic or minimum possible latency async i/o. Like, tens of nanoseconds @ 99.99% worst case.
  2. Want to schedule i/o against a known, platform-specific, i/o multiplexer i.e. they have tuned their implementation specifically around Linux io_uring, and they want their i/o code to only ever work with Linux io_uring and not say Linux epoll().
  3. Want concrete types with minimum possible build time impact, because all the templates unavoidable in Executors cost build time.
  4. To solve the chicken-and-egg problem inherent when implementing high level abstractions i.e. to be foundations for high level constructs.

In any case, the hope is that we can expose a good balance between what the platform supports e.g. mixing file and socket i/o, with genericity and portability.

...and more recently here:

Niall Douglas on how p2300 senders and receivers aren't good enough for extremely low latency applications

I am much less convinced that P2300 will be much use where you want every last iota of performance.
[...] This custom S&R implementation I have here at work implements from page 8 of [p2586]:

  1. Async cleanup.[4]
  2. Hard guarantees of never possible dynamic memory allocation during stacking and composure of async layers.
  3. Guaranteed bounded time and complexity i/o multiplexer.
  4. No locking nor synchronisation outside of io_uring.

We also have registered i/o buffers from P2052, so even page faults are eliminated. Basically the entire implementation is wait free, unless you explicitly ask to sleep the thread until something happens.
I will say my implementation is a touch annoying to write code with, but it forces you to write really fast i/o code, and that has been borne out by the performance we're seeing.
[...] Anyway one day my custom S&R will become open source, and if anybody wants a deliberately incompatible to P2300 S&R implementation which can peg io_uring to its maximum performance, it can be borrowed/cloned/reimplemented.

Tsung-Wei Huang's opinion of senders and receivers

IMHO, the current executor/sender/receiver design/proposal in the C++ committee is still primitive. When things go heterogeneous, the most important thing that affects (1) programming efficiency and (2) performance is "control flow". I am sure you agree that control-flow decisions frequently happen at the boundary between CPU and GPU computing. If you look at the current C++ executor design/proposal, it is very static and does not support control flow. For example, you can always submit CPU/GPU tasks to an executor or launch them asynchronously. However, when you reach control flow, you must synchronize them. And, what if you have multiple control flow blocks that may run in parallel - :) ? The parallelism you describe may not be end-to-end. Taskflow solved these problems using different tasking models to describe a workload in a parallel computation graph.

This is referring to features like conditional tasking.

After a bit of digging, this may not be true:

Eric Niebler on how to conditionally change the execution pipeline at runtime

stock 2300 is highly performant but produces expression templates so you couldn’t conditionally include pipeline stages at runtime. (Right?) Type erasure would (at a cost of course) enable that, right?

Type-erasure is one way to skin the cat, and it's certainly on my todo list.
Another way is, if you know statically the type of the pipeline you might want to conditionally include, and where, you could write a "conditional" sender that routes control flow through one sender or another depending on some runtime condition. Then you can use the runtime condition to "turn on" or "turn off" parts of the expression template.
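Niebler's "conditional sender" suggestion can be sketched in a few lines of Python (the names `just`, `then`, and `conditional` are mine, purely illustrative): both stages are known statically, and a runtime predicate decides which one the value flows through when the pipeline runs:

```python
# Toy "conditional sender": picks one of two statically-known stages at
# runtime, so the pipeline shape can vary without type erasure.
# Names are illustrative, not from p2300.

def just(value):
    # A sender here is modelled as a zero-argument callable producing a value.
    return lambda: value

def then(sender, fn):
    # Lazily apply fn to the upstream sender's result.
    return lambda: fn(sender())

def conditional(sender, predicate, if_true, if_false):
    """Route the upstream value through one of two stages,
    chosen only when the pipeline actually runs."""
    def run():
        v = sender()
        stage = if_true if predicate(v) else if_false
        return then(just(v), stage)()
    return run

pipeline = conditional(
    just(10),
    predicate=lambda v: v % 2 == 0,
    if_true=lambda v: v * 2,    # taken here, since 10 is even
    if_false=lambda v: v + 1,
)
print(pipeline())  # -> 20
```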

  • The cost of type erasure affects the performance of senders and receivers:
Discussions on how type erasure affects performance of senders and receivers

with Senders, as long as your pipeline depends on some external input, e.g. HTTP1/HTTP2, optional TLS, optional compression, metrics, auth, other middleware, caches, user-provided plug-ins, etc, you would end up with type-erased senders which use extra allocations and indirections.
If you look at a large C++ application, such as Chromium, or a purely network IO like Envoy proxy, you would find that its architecture consists of tens of layers implemented as classes with virtual functions, in separate translation units. This means a lot of type-erasure, and probably lots of overhead.

...P2300 cannot be optimal for portable i/o. You need stuff like the ability to early initiate the receiver with an arbitrary unknown period until state can get cleaned up after the OS is done.

What would be superb is if the language supported type erasure much better. The last time type erasure was improved was in C++11, and only with a bit of tinkering around the edges at that. Nothing since.

Source

Ease of Use

  • It feels like a whole new programming paradigm, so there can be some work done to make it more beginner friendly.
  • It requires function colouring, which spreads throughout the program.
  • The rules that programmers need to keep in their heads to avoid edge cases and weird bugs will also need to be taught.
  • Without building high-level abstractions like algorithms on top of them, the API is too low level to be broadly useful.

If Mojo can implement senders and receivers in a better way by avoiding these pitfalls, I think it will have the best concurrency model out of all programming languages.

Additional Reading

The 'Resources' section of Nvidia's reference implementation.

Futures

Also for the futures and promises section, you can look at the STLab concurrency library for inspiration:

STLab futures description

The future implementation [...] provides continuations and joins [...]. But more importantly, these futures propagate values through the graph, not futures. This allows an easy way of creating splits: a single future can have multiple continuations into different directions.
Another important difference is that the futures support cancellation. So if one is no longer interested in the result of a future, one can destroy the future without needing to wait until the future is fulfilled [...]. An already-started future will run until its end, but will not trigger any continuation. So in all these cases, all chained continuations will never be triggered. Additionally, the future interface is designed so that one can use built-in or custom executors.
[...]
Since futures can only create single-use graphs, this library also provides channels. With these channels one can build graphs that can be used for multiple invocations.

Channels are this library's equivalent of the Communicating Sequential Processes (CSP) model.
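The "splits" behaviour described above, one future propagating its value to several continuations, can be sketched as follows. `SplitFuture` is a hypothetical name; this is not stlab's API, just the value-propagation idea:

```python
# Sketch of the STLab idea that a future propagates its *value* to any
# number of continuations ("splits"). Illustrative only, not stlab's API.

class SplitFuture:
    def __init__(self):
        self._continuations = []
        self._value = None
        self._resolved = False

    def then(self, fn):
        """Attach a continuation; returns a downstream future,
        so chains and splits compose freely."""
        downstream = SplitFuture()
        def cont(v):
            downstream._resolve(fn(v))
        if self._resolved:
            cont(self._value)           # late attachment still fires
        else:
            self._continuations.append(cont)
        return downstream

    def _resolve(self, value):
        self._value, self._resolved = value, True
        for cont in self._continuations:
            cont(value)
        self._continuations.clear()

# One future, two continuations going "different directions".
f = SplitFuture()
doubled = f.then(lambda x: x * 2)
strung = f.then(lambda x: f"got {x}")
f._resolve(21)
print(doubled._value, strung._value)  # -> 42 got 21
```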

P.S.: It seems the STLab concurrency library may get actor and sender/receiver[5] implementations of its own.

P.P.S.: Sean Parent on sender/receiver:

Sean Parent on senders/receivers

The standard proposals lack support for a system executor, and I'm convinced the cancellation model in sender/receivers is wrong (or at the very least I don't see how to make it work). What sender/receivers get correct is a lightweight composition model by deferring parallelism. stlab::future continuations were not intended to be lightweight, but I frequently see them used as if they were. What I currently have on my list for a future revision is:

1. Provide a deferred sender/receiver model for futures supporting cancellation on destruction.
   a. As part of this, I'll likely make future destruction blocking (same as senders) to better support structured concurrency without function coloring.

And here from 10:40 to about 14:30 -
TL;DL: It should adopt the stlab cancellation model.

Reactors

I recently learned of an alternative to actors that gets rid of non-determinism. They're called reactors. More papers and stuff here.

Footnotes

  1. Hylo (formerly Val) also has S&R (mentioned in this talk) and the docs for it are here.

  2. Source is here

  3. The reader can look at the WebAssembly typed continuations proposal for more info on delimited continuations. There are also many YouTube videos on the topic to look up.

  4. I assume this refers to something like async RAII. [1] [2] [3] [4]

  5. More info available here from 20:20 onwards and here

@ematejska ematejska added the mojo-repo Tag all issues with this label label May 6, 2024
@JoeLoser
Collaborator

Going to close this PR for now since there hasn't been much activity on this in several months. Feel free to reopen when we're ready to take on this kind of work.

@JoeLoser JoeLoser closed this May 15, 2024
@nmsmith
Contributor

nmsmith commented Jun 14, 2024

@Brian-M-J I've looked into the senders and receivers proposal. I see people promising big things ("a global solution to concurrency"), but notably, they don't seem to be able to back up their claims with evidence. I would expect to see an example of a non-trivial concurrent program that is dramatically easier to write with senders as opposed to being written with async/await and tasks etc. All I see are toy examples.

I am really skeptical that something big has been discovered. If it had, more people would have noticed by now. Senders and receivers have been around for 4+ years.

@Brian-M-J
Contributor

Brian-M-J commented Jun 14, 2024

I would expect to see an example of a non-trivial concurrent program that is dramatically easier to write with senders as opposed to being written with async/await and tasks etc.

I'd be able to show it to you if it was open source 🙂. In terms of applications, I'd say the most prolific users of S&R are at Meta. In terms of libraries, there's plenty of open source stuff out there (see users of HPX).

I guess talks like this would be good demonstrations at least.

I guess another thing to note is that S&R is meant for high performance, so the C++ implementation might not be the most beautiful library. Starting from a blank slate (Hylo, Mojo (somewhat)) means that some syntactical niceties can be added to make it easier to use.

Edit: I just found this Reddit post that may be of help to future readers: Any real life examples of P2300 senders/receivers?

I am really skeptical that something big has been discovered. If it had, more people would have noticed by now.

As you noted, S&R is relatively new, so it isn't widely used yet. Though I wouldn't say that only a few people have noticed it. Clearly Nvidia has noticed, because they host the reference implementation. Meta has noticed it, because they've been using it in production for years. Bloomberg has noticed, because a few of their employees are authors of the proposals. Adobe has noticed, because there are plans to remodel the stlab library in terms of S&R, and Hylo is supported by the company. The users of libraries like HPX, folly etc. have noticed. The C++ committee has noticed, because they've voted favorably on it multiple times.

@nmsmith
Contributor

nmsmith commented Jun 14, 2024

I guess talks like this would be good demonstrations at least.

That's another toy example. All he's done is create an event loop that spawns asynchronous tasks one-at-a-time. This is trivial to do in any language with async/await and a Task type.
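For comparison, that demonstration is roughly this much code in plain asyncio: an event loop spawning tasks one at a time (a generic sketch, not a transcription of the talk):

```python
# Spawning asynchronous tasks one at a time from an event loop takes a few
# lines in any language with async/await and a Task type (here: asyncio).
import asyncio

async def work(i, log):
    await asyncio.sleep(0)   # yield to the event loop
    log.append(i)

async def main():
    log = []
    for i in range(5):
        # Spawn a task and wait for it before spawning the next.
        await asyncio.create_task(work(i, log))
    return log

print(asyncio.run(main()))  # -> [0, 1, 2, 3, 4]
```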

I would love if S&R has solved some major problems with modelling concurrent systems, but I don't see it.

@nmsmith
Contributor

nmsmith commented Jun 14, 2024

At the risk of stating the obvious, there are a lot of projects out there developed by well-meaning, passionate people, who promise that they have created something important. But most of the time, that doesn't turn out to be the case. I've been burned a lot in the past by believing that a project is as important as the contributors say it is, and then I've begun to experiment with it, only to eventually discover that I've wasted my time.

@Brian-M-J
Contributor

I would love if S&R has solved some major problems with modelling concurrent systems, but I don't see it.

Well, the only thing I can show you when I don't have access to the code is this: Hylo has gotten rid of function colouring (see here).

BTW when they say "no actor abstractions" it just means that actors would be a library thing. You can build an actor system using S&R.

I myself am working on a Mojo implementation of S&R based on this talk in a private repo. That way when Mojo's coroutines drop with some benchmarks, I'll have something concrete to compare to. The biggest problem is that I don't know what I'm doing 😅. I'm not a low level / library expert or anything.

@nmsmith
Contributor

nmsmith commented Jun 14, 2024

Getting rid of function coloring is a worthy goal, no doubt about that. But that is orthogonal to S&R. The reason most PLs have colored functions is because the thread-based concurrency model that they had already implemented prior to implementing async/await is incompatible with implicit suspension and implicit migration of tasks between threads. In the case of Python, the main reason you can't implicitly switch tasks is because a Python program is a big soup of shared mutable state with no synchronization/critical sections, so two tasks can easily race each other.

The solution to this is to come up with a concurrency model that ensures tasks can't race each other. You can get most of the way there with a Rust-like borrowing system. On top of that, you'd want a way to perform transactions on shared state. This is a big design space worth exploring. S&R doesn't really have a solution here. (I'd like to see an example of multiple tasks concurrently printing to stdout using S&R.)

@Brian-M-J
Contributor

On top of that, you'd want a way to perform transactions on shared state. This is a big design space worth exploring. S&R doesn't really have a solution here.

One of the authors of p2300 wrote an article on how to use tasks to replace locks. The same ideas are repeated in p2504.

@nmsmith
Contributor

nmsmith commented Jun 14, 2024

I've seen that. The idea is that if you can statically identify all of the places in your codebase where a variable is being accessed by multiple tasks—and if at least one of those tasks mutates the variable—then you (or your compiler) can conceivably restructure the program (cut it up into subtasks) such that any time a task needs to access the shared variable, you defer the access to a scheduler/executor. (And the executor contains the synchronization primitives required to avoid data races.)

This is a good observation, and I strongly agree that a task-based concurrency model should aim to do this. (Concurrency without explicit locking would be amazing!) But this is—again—completely orthogonal to S&R. S&R doesn't give me a simple way to write that restructured program, from what I can tell.

@nmsmith
Contributor

nmsmith commented Jun 14, 2024

Actually, I'd only read the second article you linked, but the first article is more interesting IMO, because it actually discusses the "program restructuring" problem I'm referring to:

Let’s assume that one has an application that can be broken down in tasks relatively easily. But, at some point, deep down in the execution of a task one would need a lock. Ideally one would break the task in 3 parts: everything that goes before the lock, the protected zone, and everything that goes after. Still, easier said than done; this can be hard if one is 20 layers of function calls deep – it’s not easy to break all these 20 layers of functions into 3 tasks and keep the data dependencies correct. If breaking the task into multiple parts is not easily doable, then one can also use the fork-join model to easily get out of the mess.

This proposed solution—forking a new task to mutate the shared variable—makes a lot of sense. It's worth exploring further.
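The "defer the access to a scheduler/executor" idea can be sketched with a plain single-worker executor (a sketch only, not tied to S&R): every mutation of the shared variable is forked as a small task onto one worker, which serializes them, so user code contains no locks:

```python
# Instead of a lock, mutations of shared state are forked as small tasks
# onto a single-threaded executor, which serializes them. Sketch only.
from concurrent.futures import ThreadPoolExecutor

shared = {"count": 0}

# One worker thread: all submitted tasks run strictly one after another,
# so no two mutations of `shared` can race.
state_executor = ThreadPoolExecutor(max_workers=1)

def increment():
    shared["count"] += 1

# Many concurrent call sites defer the mutation to the executor
# rather than touching `shared` directly under a lock.
futures = [state_executor.submit(increment) for _ in range(1000)]
for f in futures:
    f.result()   # join: wait for every forked sub-task
print(shared["count"])  # -> 1000
```

The synchronization lives entirely inside the executor's work queue, which is exactly where the quoted articles argue it belongs.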

@npuichigo

@nmsmith maybe take a look at how NVIDIA leverages P2300 with CUDA to do async computation on GPUs? https://www.youtube.com/watch?v=nwrgLH5yAlM
