Parallel recursion using async dialect

I have the following simple recursive loop. basically it iterated over a single dimensional memref and increment each element by 1 recursively. arg0 is the input array, arg1 is current index and arg2 is the number of elements in the array.

func @foo(%arg0: memref<10xi32>, %arg1: i32, %arg2: i32) {
    %true = constant true
    %c1_i32 = constant 1 : i32
    %0 = scf.if %true -> (i32) {
      %6 = addi %arg1, %c1_i32 : i32
      scf.yield %6 : i32
    } else {
      %6 = subi %arg1, %c1_i32 : i32
      scf.yield %6 : i32
    %1 = cmpi sge, %arg1, %arg2 : i32
    cond_br %1, ^bb1, ^bb2
  ^bb1:  // pred: ^bb0
  ^bb2:  // pred: ^bb0
    %token = async.execute {
      call @foo(%arg0, %0, %arg2) : (memref<10xi32>, i32, i32) -> ()
    %c1_i32_0 = constant 1 : i32
    %2 = index_cast %arg1 : i32 to index
    %3 = memref.load %arg0[%2] : memref<10xi32>
    %4 = addi %3, %c1_i32_0 : i32 %4, %arg0[%2] : memref<10xi32>
    %5 = memref.cast %arg0 : memref<10xi32> to memref<*xi32>
    call @print_memref_i32(%5) : (memref<*xi32>) -> ()

however I can not run this mlir-cpu-runnner. It crashes. It runs ok if I add a async.await after the asynchronous recursive call. but I want all recursive calls to be parallel (and independent). Is this achievable using async dialect, if so how?

Here is the stacktrace of the crash

#0 0x0000000000fc0e23 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (../../../../build_polyrec/bin/mlir-cpu-runner+0xfc0e23)
 #1 0x0000000000fbec6e llvm::sys::RunSignalHandlers() (../../../../build_polyrec/bin/mlir-cpu-runner+0xfbec6e)
 #2 0x0000000000fc140f SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f6fc7967980 __restore_rt (/lib/x86_64-linux-gnu/
 #4 0x00007f6fc7c7f617 mlirAsyncRuntimeCreateToken AsyncRuntime.cpp:0:0
 #5 0x00007f6fc7d4a228 
 #6 0x00007f6fc7d4a61f 
 #7 0x00007f6fc7d4a309 
 #8 0x00007f6fc7cdc65f std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_state<std::function<void ()>, std::allocator<int>, void ()>::_M_run()::'lambda'(), void> >::_M_invoke(std::_Any_data const&) ThreadPool.cpp:0:0
 #9 0x00007f6fc7cdc5b7 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ThreadPool.cpp:0:0
#10 0x00007f6fc7964907 __pthread_once_slow /build/glibc-S9d2JN/glibc-2.27/nptl/pthread_once.c:118:0
#11 0x00007f6fc7cdc371 std::__future_base::_Task_state<std::function<void ()>, std::allocator<int>, void ()>::_M_run() ThreadPool.cpp:0:0
#12 0x00007f6fc7cdbcb3 void* llvm::thread::ThreadProxy<std::tuple<llvm::ThreadPool::ThreadPool(llvm::ThreadPoolStrategy)::$_0> >(void*) ThreadPool.cpp:0:0
#13 0x00007f6fc795c6db start_thread /build/glibc-S9d2JN/glibc-2.27/nptl/pthread_create.c:463:0
#14 0x00007f6fc64f371f __clone /build/glibc-S9d2JN/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:97:0
Segmentation fault (core dumped)

Right now the problem is that when you exit your first @foo call program terminates, but you might still have some async work going on, you’d need to await on “something” so that the top level function waits for the completion of all async tasks.

It can be ! for example.

func @foo(%group: !, ...) {
  %token = ...
  async.add_to_group %group, %token

func @bar() {
  %group = async.create_group
  call @foo(%group)
  async.await_all %group

To create group you’d need to know upfront the number of tokens you’ll add to it.

1 Like

Another option is to structure it like this:

func @foo(...) -> !async.token {
  %token = async.execute {
    %rec_token = call @foo()
    async.await %rec_token
  return %token

func @bar() {
  %token = call @foo()
  async.await %token

This will launch a “tree of async tasks” and they all will run concurrently, and when the last one will be completed program will finish.

async.await inside the async.execute conceptually automaticallu converted to a state machine + a callback, so the “true blocking” will only happen inside the @bar

1 Like

I implemented the first solution and it works correctly now. Thanks!

Also I don’t know the size of the array/list beforehand. Is there a way to create a dynamic sized async group?
If not second solution can fix that problem anyway. I will try that one out as well.

The problem with dynamically-sized groups is that adding tokens to the group can race with await operation, and await can’t figure out is it ok to continue if the number of pending tokens dropped to zero.

I think that second solution should be the same in terms of achieved concurrent execution.