LLVM Discussion Forums

[RFC] Async/Await dialect targeting LLVM coroutines

It’d be good to see the comparison, but I’m not sure I would start by trying to abstract all such scheduling models together (at least in one step): increasing levels of unification on that axis tend to bake in more and more assumptions that are unlikely to hold universally.

I don’t know what to call it, but having a library of high level cpu async abstractions seems like a good thing generally, and the topic is pretty well studied. The fact that a default lowering to existing intrinsics exists is a nice indication with respect to this proposal.

It is common on such setups to eventually want to loan the main thread / not force a thread switch. Does your design enforce this constraint by construction?

No, async.call is free to execute "task" in the caller thread, and async.await could in theory do work stealing. This is completely “async runtime” implementation-defined. I checked that it is safe to call @llvm.coro.resume before the corresponding @llvm.coro.suspend call, so it is safe to “inline” async calls; the following coroutine suspension point becomes a no-op. This is required for the async.parallel_for implementation, so that it is not forced to launch async tasks for small sizes.
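To illustrate that freedom, a runtime could decide between inline and threaded execution based on workload size. This is a hypothetical C++ sketch; the names `Handle`, `async_call`, and `await` are made up for the example and are not the proposed API:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical runtime: run the task inline for small workloads (so no
// async task is launched), otherwise on a separate thread. If the task
// ran inline, the later await is a no-op.
struct Handle {
  std::thread worker;
  bool inlined = false;
};

Handle async_call(std::size_t work_size, std::function<void()> task) {
  Handle h;
  if (work_size < 1024) { // small size: execute in the caller thread
    task();
    h.inlined = true;
  } else {
    h.worker = std::thread(std::move(task));
  }
  return h;
}

void await(Handle &h) {
  if (!h.inlined)
    h.worker.join(); // the suspension point is a no-op for inlined calls
}
```

Either way, the caller only observes that the handle's work is done after `await` returns.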

This looks pretty interesting Eugene! Some random questions for you:

  1. It seems like you’re proposing a specific set of runtime entrypoints. Is that intended for testing, or is that the expected use-case? I would imagine that real languages targeting this would want to control a lot of the details here, which is why the LLVM lowering logic provides a lot of flexibility to the language using the compiler support.

  2. Where in the MLIR tree does the AsyncRuntimeAPI live? Do we have precedent for how runtime components are handled?

  3. Do you have an intended use-case for this? Is this intended to support something real (that we can get experience with) or is it mostly to explore the ideas?


I’ve uploaded a proof-of-concept to https://reviews.llvm.org/D82704. It implements a simple Async->LLVM conversion, and has a few end-to-end tests on top of mlir-cpu-runner.

The end goal is to codegen async/non-blocking/parallel/concurrent host-side code, starting from a high-level dialect, lowering to the LLVM dialect, and JITting at runtime (a non-blocking TFRT kernel, to be more specific).

Disclaimer: This is very TFRT-codegen-centric goal :slight_smile: other people have different opinions on what this dialect should be about. Given this very narrow goal maybe it makes sense to keep it outside of MLIR tree.

The goal is to make @kernel non-blocking, and a “classic” parallel for with a barrier doesn’t work here.

func @kernel(%in : memref<?xf32>) {
  %done = async.parallel_for { ... }
  async.await %done : !async.handle // <--- this can't block
  // ... some other operations ...
}

The proposed async dialect is intentionally much more constrained than a generic coroutine. It is only about concurrency, while coroutines can also be generators, and support different “types” of concurrency: lazy vs. eager.
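For a concrete picture of the lazy-vs-eager distinction, standard C++ already exposes both through `std::async` launch policies; this is only an analogy, not part of the proposal:

```cpp
#include <cassert>
#include <future>
#include <utility>

// Eager vs. lazy concurrency in standard C++: with std::launch::async the
// task may start immediately on another thread; with std::launch::deferred
// the body runs only when get()/wait() is called, in the calling thread.
std::pair<int, int> lazy_vs_eager() {
  auto eager = std::async(std::launch::async, [] { return 1; });
  auto lazy = std::async(std::launch::deferred, [] { return 2; });
  return {eager.get(), lazy.get()};
}
```

The proposed dialect pins down only the eager, concurrency-oriented case.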

The expected use case is that a runtime using this dialect passes its own implementation of the runtime API to the codegen (an implementation-defined async handle, async call, async await, …).
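A minimal C++ sketch of what such an implementation-defined runtime API surface could look like; the names are illustrative only (not the actual AsyncRuntimeAPI entry points), with `emplace`/`sync_await` loosely mirroring the emplace-handle and sync-await operations discussed later in the thread:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Illustrative-only sketch of an implementation-defined runtime API.
// A handle becomes "ready" when the async computation completes.
class AsyncHandle {
public:
  void emplace() { // mark the async value as available
    {
      std::lock_guard<std::mutex> lock(mu);
      ready = true;
    }
    cv.notify_all();
  }
  void sync_await() { // block the caller until emplace() happened
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [this] { return ready; });
  }

private:
  std::mutex mu;
  std::condition_variable cv;
  bool ready = false;
};

// The runtime chooses where the task runs; this version uses a new thread,
// but running it immediately in the caller would be equally valid.
void async_call(AsyncHandle &handle, std::function<void()> task) {
  std::thread([&handle, task = std::move(task)] {
    task();
    handle.emplace();
  }).detach();
}
```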

I’ve put a “runtime reference implementation” under mlir-cpu-runner.

Cool, I’m a fan of TFRT :-), my primary concern was to make sure that it was grounded on something real, so if you have a use-case for it, then building it in the main MLIR repo seems reasonable to me. Out of curiosity, what do you plan to have generate this dialect? Will there be something end-to-end that is wired up here, including a “standard library” of I/O and other blocking APIs that can be called into? Or do you plan to write (e.g.) BEF kernels by hand using this?

Got it, makes sense. It would probably make sense to eventually parameterize the lowering code, so other clients could change the entry points being used. However, that can be done the day there is a second client :slight_smile:, I don’t see any reason for premature generalization here.

This is really cool Eugene!


Fwiw, I’ll want some representation like this when I get npcomp further along: I could see some user level API/intrinsic that compiles async functions and would generate these forms directly. I hadn’t thought through the details yet. Some such things use API calls that trigger special compilation flows, various annotations, or special syntax (python “async” functions in the limit).

I hadn’t thought yet about whether I would want something more constrained than coroutines as a starting point. Also, there are some transformation interplays between things like scf.parallel that it might be good to be explicit about (and get @herhut’s opinion). It seems like, if targeting a CPU, a reasonable lowering exists from scf.parallel to async.call/etc? For what it represents, I believe scf.parallel is the higher level form. From this perspective, this async dialect would exist at the same level as the gpu dialect and be in play for parts of the program targeting cpu (or could be generated directly by higher levels that know they are targeting a CPU).

One design nit: the region based scf ops are a nicer representation compared to the function based way you have this. In practice, when analyzing/transforming these things, one needs to work across the caller/callee boundary, and that is cleaner without needing to do interprocedural analysis. I would rename async.call to async.invoke and make it take a region instead of a callee.

One way to get a feel for the ergonomics and layering would be to write a lowering from scf.parallel to what you have (in the same vein of how SCFToStandard generates sequential code). This gets to @joker-eph’s question about whether this is a leaf or a part of the transformation dialects.

I currently think of Linalg -> SCF (scf.parallel) -> SCF+Async (async.parallel/async.await) -> LLVM lowering path. However @joker-eph and @herhut have concerns about introducing concurrency that late in the pipeline, and maybe concurrency/asynchronicity should be explicit much earlier.

One of the goals is to be able to add a BEF kernel that will JIT-compile an async region:

%arg0 = ... !t.tensor
%arg1 = ... !t.tensor
%done = hex.jit_execute(%arg0, %arg1) -> !hex.chain {
   // MLIR in Linalg/SCF dialects.
   scf.parallel { ... }
   scf.parallel { ... }
}
The attached region must be executed without ever blocking the caller thread; in the presence of parallel loops it must be converted to an async state machine first.

Yes, that’s the plan; it was just much easier to start with async.call for a proof of concept :slight_smile: And this will be useful anyway as an intermediate lowering before going to LLVM.

Lowering scf.parallel with reductions will be a bit tricky, but in general it’s rather straightforward: scf.parallel -> async.parallel followed by async.await. This is the first transformation I have in mind for this dialect.
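At runtime, the shape of that transformation is roughly the following (an assumed C++ sketch, not the actual lowering):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Assumed sketch: partition [lb, ub) into fixed-size blocks, dispatch each
// block as an async task, then wait for all of them. A real lowering would
// suspend the coroutine instead of blocking in join().
void parallel_for(int lb, int ub, int block,
                  const std::function<void(int)> &body) {
  std::vector<std::thread> tasks;
  for (int begin = lb; begin < ub; begin += block) {
    int end = std::min(begin + block, ub);
    tasks.emplace_back([begin, end, &body] {
      for (int i = begin; i < end; ++i)
        body(i);
    });
  }
  for (auto &t : tasks) // stands in for the final await on the group
    t.join();
}
```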

I generally agree. It is better to not lose such semantics if they exist at a higher level. That does not need to be mutually exclusive with having defined default lowerings that recover some parallelism for simple/explicit low level cases though. How useful that turns out to be would depend on how applicable the higher level forms are to all problems.

“Analysis-first” dialect from @herhut has explicit async handle/token arguments to async regions:

%0 = async.region { ... } -> !async.token
%1 = async.region [%0] { ... } -> !async.token

I think we can easily have both: async.await and "explicit tokens".

For high-level analysis -> TFRT program compilation we can use token args, because it’s close to what TFRT already expects; for host program codegen we can convert token args to async.await:

%0 = async.region { ... } -> !async.token
async.await %0
%1 = async.region { ... } -> !async.token

I wanted to send the RFC for my variant but have not found the time to fully write it out. I will try to get to it asap. But the above is the gist of it.

I am not sure I follow this example. Are you proposing to use async.await as an alternative lowering instead of TFRT?

I see your proposal generally at the level of the GPU dialect wrt. the stage of lowering. For the gpu dialect, the operations themselves are also asynchronous on top of the surrounding context that is lowered to TFRT (for example). We might get the same in the CPU context. Assume the example

%0, %t = async.region {
  %1 = scf.parallel_for ...
  async.yield %1
}

Here, we would have an async region that itself contains some parallelism in the form of an scf.parallel_for. The outer async.region could turn into a BEF level async kernel. Can your async dialect be used to compile the body? If so, what would that look like?

Yes. Proposed lowering is for the “TFRT kernel layer” that is below TFRT/BEF runtime.

Yes, that’s pretty much the goal: the “transformation-first” async dialect is used to split the program into TFRT/BEF kernels, and the “codegen-first” dialect is used to JIT-compile the kernel “body”.

Your example would ultimately be lowered to BEF-compatible MLIR like this:

// "Outer" mlir translated into BEF file.
%value, %chain = hex.run {
   // "Inner" region compiled into x86 binary at runtime.
   %1, %2 = async.parallel_for { ... } : memref<?>, !async.handle
   async.await %2
   kernel.return %1
} : !t.tensor, !hex.chain

Before splitting the program into TFRT/BEF kernels, attached regions must be in Linalg/SCF dialects for simplicity; after extracting kernel-level parallelism, attached regions are lowered to a codegen-first async dialect.

Nice, this fits very well with my RFC (which I finally posted).

The semantics of this “async_compute” isn’t totally clear to me: I don’t quite get why we create a handle “out of thin air” and return it. What does it carry, and how will it become “ready”?

Seems like you edited some of the SSA values, but not all, can you edit the post and fix the typos? (I can’t be sure what you’re returning either)

I looked at the revision actually, it is more clear there.

Simplifying to the minimum it is:

// ===--------------------------------------------------------------------=== //
// Convert `async_function` to LLVM coroutine and call it asynchronously.
// ===--------------------------------------------------------------------=== //

func @async_function(%arg0: !async.runtime) -> !async.handle {
  call @print_i32(%c789): (i32) -> ()
  async.emplace_handle %ret_handle : !async.handle
  return %ret_handle : !async.handle
}

func @main2() {
  %runtime = async.default_runtime : !async.runtime
  %handle = call @async_function(%runtime): (!async.runtime) -> !async.handle
  async.sync_await %handle : !async.handle
}

There is something a bit strange here in that the IR is in a strange state before transformation. It seems to me that this is embedding some “breadcrumbs” that are expected to be recovered by a future transformation (the call does not make sense without the transformation for example).

What about expressing this in the call itself instead of this fake/empty handle? Something like:

// ===--------------------------------------------------------------------=== //
// Convert `async_function` to LLVM coroutine and call it asynchronously.
// ===--------------------------------------------------------------------=== //

func @async_function(%arg0: !async.runtime) {
  call @print_i32(%c789): (i32) -> ()
}

func @main2() {
  %runtime = async.default_runtime : !async.runtime
  %handle = async.call @async_function(%runtime): (!async.runtime) -> !async.handle
  async.sync_await %handle : !async.handle
}

Handles are created by the runtime; the semantics is that a handle becomes “ready” when the async computation is completed. The details are implementation-defined: the runtime can launch the computation on a separate thread, or can run it immediately.

My thought was that an “async function” is a “function that returns an async handle”, and that fake handle was a hacky way to signal that intent, but now I’d go for an explicit AsyncFuncOp to make it clear that it is not a regular function and must be called with an async call (similar to Python coroutines).

Returning “something” that “owns” a coroutine handle is the C++ way of writing coroutines, e.g. cppcoro::task is just a thin wrapper around a void* coro_handle:

Example from: https://en.cppreference.com/w/cpp/language/coroutines

task<> tcp_echo_server() {
  for (;;) {
    size_t n = co_await socket.async_read_some(buffer(data));
    co_await async_write(socket, buffer(data, n));
  }
}

Because in MLIR functions must have an explicit terminator, an async function has to “create a handle out of thin air”.

I think something like this is easier to work with and better aligned with RFC from Stephan:

func @main() {
  %handle = async.region %runtime {
     call @print_hello_world(): () -> ()
  }
}

Not sure if the runtime arg should be explicit with async regions, or introduced later when we are doing host-side codegen.

Async regions are converted to functions (captured values packed into a struct?):

async.func @async_fn(%arg0: !async.runtime) {
  call @print_hello_world(): () -> ()
}

func @main() {
   %handle = async.call @async_fn(%runtime): (!async.runtime) -> !async.handle
}

Right, the rest of the handling around coroutine is fairly clear to me: I was mostly trying to define away this “fake handle” from the IR.

Creating a separate op may be a good way to get started; otherwise we could maybe try to shove the logic into the terminator itself, without a different FuncOp:

func @async_func() -> !async.handle {
  async.return // no need to create a fake handle
}

(not even sure if we need to have a return value for the func right now)

In Zig (and I believe Rust had the same experience), the LLVM implementation of coroutines was tried, and we learned it was broken because it does not separate the allocation and deallocation of memory (the coroutine frame) from the execution of said coroutine.

Do you have any pointers to the discussion? I don’t quite get what it means in practice.

  • Because Zig uses LLVM’s coroutine support, it has to follow the paradigm they set, which is that the coroutine allocates its own memory and destroys itself. In Zig this means that calling a coroutine can fail, even though that doesn’t have to be possible. It also causes issues such as #1194.
  • It also means that every coroutine has to be generic and accept a hidden allocator parameter. This makes it difficult to take advantage of coroutines to solve safe recursion (#1006).
  • One of the slowest things LLVM does in debug builds, discovered with -ftime-report , is the coroutine splitting pass. Putting the pass in zig frontend code will speed it up. I examined the implementation of LLVM’s coroutine support, and it does not appear to provide any advantages over doing the splitting in the frontend.
  • Optimizations are completely disabled due to LLVM bugs with coroutines (#802)

That last bug is particularly nasty https://bugs.llvm.org/show_bug.cgi?id=36578