
[RFC] Add critical section operation

I’d like to add a critical section operation. The intended use case is synchronizing access to shared mutable state when the async dialect is lowered to LLVM coroutines plus concurrent execution.

Example: compute “something” in parallel and aggregate the results into a shared memref:


%0 = alloc() : memref<f32>
%r = "async.create_resource"() : () -> !async.resource

%token = async.execute {
  %1 = "do_some_heavy_compute_1"() : () -> f32
  async.critical_section [%r] {
    %2 = load %0[] : memref<f32>
    %3 = addf %1, %2 : f32
    store %3, %0[] : memref<f32>
  }
}

%token_0 = async.execute {
  %4 = "do_some_heavy_compute_2"() : () -> f32

  async.critical_section [%r] {
    %5 = load %0[] : memref<f32>
    %6 = addf %4, %5 : f32
    store %6, %0[] : memref<f32>
  }
}

async.await %token, %token_0

Should it be resource + critical_section, or something else? I like resource/critical_section because it makes it possible to model things like synchronized access to a “device”.
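For instance, two tasks could serialize their access to a device like this (a sketch using the proposed ops; the kernel launches are placeholders):

%gpu = "async.create_resource"() : () -> !async.resource

%t0 = async.execute {
  async.critical_section [%gpu] {
    "launch_kernel_a"() : () -> ()  // runs with exclusive access to the device
  }
}

%t1 = async.execute {
  async.critical_section [%gpu] {
    "launch_kernel_b"() : () -> ()
  }
}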

For the Async->LLVM lowering this can also be lowered to coroutines, so there would be no “blocking” per se (no mutex held to guard the critical section), but rather a “wait list” of suspended coroutines (critical sections) that are resumed one at a time, in some nondeterministic order, at runtime.
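Very roughly, the lowered coroutine could look like this (a hand-wavy sketch; the async.runtime.* ops are invented here purely for illustration):

// Inside the coroutine produced for async.execute: instead of locking a
// mutex, the coroutine parks itself on the resource's wait list and
// suspends; the runtime resumes waiters one at a time.
"async.runtime.add_to_wait_list"(%r) : (!async.resource) -> ()
// ... suspension point: resumed when no other section holds %r ...
// critical section body runs here, with exclusive access
"async.runtime.release"(%r) : (!async.resource) -> ()  // resume next waiter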

The standard dialect has the generic_atomic_rmw operation (see https://mlir.llvm.org/docs/Dialects/Standard/#stdgeneric_atomic_rmw-genericatomicrmwop). Admittedly this is lower level.
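For reference, its shape (adapted from the linked documentation): the region receives the current value and yields the value to store, and the read-modify-write is performed atomically:

%x = generic_atomic_rmw %I[%i] : memref<10xf32> {
^bb0(%current_value : f32):
  %c1 = constant 1.0 : f32
  %inc = addf %c1, %current_value : f32
  atomic_yield %inc : f32
}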

Supporting critical sections is not specifically tied to the async dialect; beyond the fact that you plan to lower both to a shared runtime, the two concepts are unrelated. You could even model this as a library call (assuming you had async function calls):

%token_0 = async.execute {
  %4 = "do_some_heavy_compute_2"() : () -> f32

  async.call @acquire_lock(%r)
  %5 = load %0[] : memref<f32>
  %6 = addf %4, %5 : f32
  store %6, %0[] : memref<f32>
  async.call @release_lock(%r)
}

but that would give you nested asynchronicity. Maybe something like

%token_0 = async.execute {
  %4 = "do_some_heavy_compute_2"(): () -> f32
  async.yield %4 : f32
}

%token_1 = async.call @acquire_lock[%token_0](%r)

%value = async.execute [%token_1] {
  %4 = "do_some_heavy_compute_2"(): () -> f32
  async.yield %4 : f32
}

%token_2 = async.execute [%value] {
^bb0(%1 : ...):
  %5 = load %0[] : memref<f32>
  %6 = addf %1, %5 : f32
  store %6, %0[] : memref<f32>
}

%token_r = async.call @release_lock[%token_2](%r)

could also work, but then you would need support for a value (the %r) that changes its readiness state back and forth, so that only one waiting coroutine (or async dependency) fires at a time.
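That is, if two pending sections wait on the same %r (sketch below; %ta and %tb are placeholder compute tokens), a release must make %r ready, wake exactly one acquire, and flip %r back to not-ready, which ordinary one-shot tokens cannot express:

// Both calls wait on the same resource; only one may fire per release.
%l_a = async.call @acquire_lock[%ta](%r)
%l_b = async.call @acquire_lock[%tb](%r)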

For some reason I thought that the atomic operations in std were just counterparts of C++ fetch_add, fetch_sub, etc. It looks like generic_atomic_rmw is exactly what I need.
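For example, the aggregation from the original example could be written without any new ops along these lines (a sketch; I am assuming the 0-d memref form and capturing %1 from the enclosing region are allowed):

%token = async.execute {
  %1 = "do_some_heavy_compute_1"() : () -> f32
  %sum = generic_atomic_rmw %0[] : memref<f32> {
  ^bb0(%current : f32):
    %new = addf %1, %current : f32
    atomic_yield %new : f32
  }
}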

Wouldn’t you need std.call here?
(and I don’t understand the second example at all, maybe it is also supposed to be a synchronous std.call?)

Yes, you are right, the first example only works with std.call, as we do not model nested asynchronicity. I wanted to encode the fact that the lock operations do not block the thread but instead deschedule the task until the lock token is available. My idea was to have acquire_lock return a token that becomes ready once the lock has been acquired. That way, computations that depend on the lock would not be scheduled until then. Likewise with release_lock.

In the second example, I have made this explicit by actually returning these tokens without using nesting. I just noticed I missed some dependencies though.

It makes sense now with the added dependencies!