[RFC] OpenACC dialect

We would like to propose adding an OpenACC[1] dialect to MLIR, in a similar
way to how the OpenMP dialect[2] was added.
The overall goal is to have a dialect that is front-end agnostic, that
represents the capabilities of the OpenACC 3.0 standard, and that can follow
potential new versions.

OpenACC is very important for scientific applications running on large supercomputers.
About 25% of the applications running on the Summit supercomputer (#1 in the TOP500) use
OpenACC.

The dialect would ultimately support a variety of underlying dialects and
loop representations, such as loop.for, loop.parallel, and affine.for, in
the dialect region.

The operations of the dialect would represent the different constructs found in
the standard. A parallel construct would be represented by an acc.parallel
operation and a loop construct by an acc.loop operation. Some
constructs could be lowered to direct calls to a runtime library (data movement, allocation …).

The acc.parallel operation implies that its region must be offloaded to an
accelerator. The mapping of the region to the accelerator follows the standard:
one worker per gang executes the region. The number of gangs and workers
can be specified or determined by the lowering process. num_gangs(X)
and num_workers(X) can be added to the operation to specify those numbers.
This will, for example, drive the number of workgroups/workers if the operation
is lowered to the GPU dialect.

The acc.loop operation specifies the mapping of the loop(s) within its region to
the available workers. Several mappings can be specified on the operation.
Common mappings are gang, gang vector, vector … Nested loops can be
collapsed by adding collapse(X) to the operation.
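
To make the effect of collapse(X) concrete, here is a minimal Python sketch (not part of the proposed dialect) of collapsing a two-deep loop nest into one linear iteration space, which is the form a lowering could then distribute across gangs and vectors. The helper name collapse2 and the row-major ordering of the collapsed index are assumptions made for illustration only.

```python
def collapse2(n_outer, n_inner):
    """Yield (i, j) pairs of a 2-deep loop nest from a single linear loop.

    Hypothetical helper: models what collapse(2) does to the iteration
    space, assuming row-major ordering of the collapsed index.
    """
    for k in range(n_outer * n_inner):  # one loop over the collapsed space
        i, j = divmod(k, n_inner)       # recover the original loop indices
        yield i, j


# The collapsed loop visits the same iterations, in the same order, as the nest.
nested = [(i, j) for i in range(3) for j in range(4)]
collapsed = list(collapse2(3, 4))
```

The single loop over k is what gives the mapping more iterations to spread over the available parallelism than the outer loop alone would.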

Unlike the OpenMP dialect, we are not targeting the OpenMP builder. We are
targeting a lowering to the GPU dialect as an initial step.
A simple lowering from the OpenACC dialect to the GPU dialect could be done as shown below.

func @compute(%x: memref<1024xf32>, %y: memref<1024xf32>,
  %n: index, %a: f32) -> memref<1024xf32> {
  %c0 = constant 0 : index
  %c1 = constant 1 : index

  acc.parallel num_gangs(8) num_workers(128) {
    acc.loop gang vector {
      loop.for %arg0 = %c0 to %n step %c1 {
        %xi = load %x[%arg0] : memref<1024xf32>
        %yi = load %y[%arg0] : memref<1024xf32>
        %ax = mulf %a, %xi : f32
        %yy = addf %ax, %yi : f32
        store %yy, %y[%arg0] : memref<1024xf32>
      }
    }
  }
  return %y : memref<1024xf32>
}

Once lowered, it could look like this.

func @compute(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>, %arg2: index, %arg3: f32) -> memref<1024xf32> {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c1_0 = constant 1 : index
  %c8 = constant 8 : index
  %c128 = constant 128 : index
  gpu.launch blocks(%arg4, %arg5, %arg6) in (%arg10 = %c8, %arg11 = %c1_0, %arg12 = %c1_0) threads(%arg7, %arg8, %arg9) in (%arg13 = %c128, %arg14 = %c1_0, %arg15 = %c1_0) {
    %0 = muli %arg4, %arg13 : index
    %1 = addi %0, %arg7 : index
    %2 = muli %arg10, %arg13 : index
    loop.for %arg16 = %1 to %arg2 step %2 {
      %3 = load %arg0[%arg16] : memref<1024xf32>
      %4 = load %arg1[%arg16] : memref<1024xf32>
      %5 = mulf %arg3, %3 : f32
      %6 = addf %5, %4 : f32
      store %6, %arg1[%arg16] : memref<1024xf32>
    }
    gpu.terminator
  }
  return %arg1 : memref<1024xf32>
}

The gang and vector information defines how the loop inside the region is
mapped to the available workgroups/workers.

For example, if the loop operation is mapped with gang only (acc.loop gang { ... }),
it will distribute the iterations of the associated loops among the workgroups only.
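
The index arithmetic of the lowered loop, and the gang-only case, can be simulated with a small Python sketch. The function names are hypothetical. The gang vector version mirrors the lowered gpu.launch body shown earlier (start = block id * block size + thread id, step = grid size * block size). The gang-only version assumes a cyclic distribution of iterations over gangs, which is only one possible choice and not something the proposal fixes.

```python
def gang_vector_iterations(gang, lane, num_gangs, threads_per_gang, n):
    """Iterations run by one (gang, lane) pair, mirroring the lowered loop:
    start = gang * threads_per_gang + lane, step = num_gangs * threads_per_gang."""
    start = gang * threads_per_gang + lane
    step = num_gangs * threads_per_gang
    return list(range(start, n, step))


def gang_only_iterations(gang, num_gangs, n):
    """Iterations run by one gang when only `gang` is specified.

    Assumption: cyclic distribution; a real lowering may chunk differently.
    """
    return list(range(gang, n, num_gangs))


def covered(parts):
    """Flatten per-worker iteration lists and sort them."""
    return sorted(i for part in parts for i in part)


n = 1024
# 8 gangs of 128 threads, as in num_gangs(8) num_workers(128) above.
gv = [gang_vector_iterations(g, t, 8, 128, n) for g in range(8) for t in range(128)]
# Same loop distributed over the 8 gangs only.
go = [gang_only_iterations(g, 8, n) for g in range(8)]
```

In both cases every iteration in [0, n) is executed exactly once; the difference is only in how many iterations each unit of parallelism receives.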

The first intent is to use this dialect from f18/flang, but it could just as well be
used by any frontend targeting MLIR. We have started work on the parsing/sema
and will look into the lowering to MLIR once f18 includes the first bits of fir/mlir
in master.

The dialect is meant to represent the full capabilities of the
OpenACC standard, and more operations will come as we design them.
acc.parallel and acc.loop are a good starting point since they
represent the most used constructs in OpenACC.

As we go, there will probably be some overlap between the OpenACC dialect and the offload
part of the OpenMP dialect. If it makes sense, they can share a common lowering.

References

[1] https://www.openacc.org

[2] RFC: OpenMP dialect in MLIR - #9 by kiranchandramohan

Just bringing this topic back since OpenACC 3.0 parsing and semantic checking have now landed in Flang. The idea of the OpenACC dialect remains the same.
It can live in the Flang directory, but it would then work only with the FIR dialect.
If it is part of core MLIR, we plan to make it work with the core dialects as well. This will also allow some interaction/sharing between the OpenMP and OpenACC dialects. Any comments on this?

To add some context, we just received some funding from the Exascale Computing Project (ECP) to do this development.

I just posted a first potential patch that shows 3 main operations (parallel, loop, data). It can be found here:
https://reviews.llvm.org/D84268

Flang is already able to parse and do semantic checking for OpenACC 3.0.

The overall goal looks good to me. I don’t have many comments here otherwise: keeping it independent of the frontend and reusable seems like a valuable goal and makes it a good fit here.

Thanks @clementval for posting this and the patches to Flang and MLIR for supporting OpenACC.

It would be great if OpenACC is part of MLIR and hopefully we will be able to share code and ideas with the OpenMP dialect.

The only question we had is whether there is a runtime for OpenACC and how it affects lowering to other dialects?

We can help with reviews tomorrow or early next week.

@kiranchandramohan Thanks for your feedback and question. Some comments inline.

It would be great if OpenACC is part of MLIR and hopefully we will be able to share code and ideas with the OpenMP dialect.

Yeah, that would be the goal. I’m seeing various places where OpenMP and OpenACC can share some common infrastructure and code. Maybe even a translation from one dialect to the other and vice versa.

The only question we had is whether there is a runtime for OpenACC and how it affects lowering to other dialects?

There are various ways to tackle this. As you know, there is no OpenACC runtime in LLVM. So we have several ideas:

  • One idea could be to make the offloading part of the OpenMP runtime more generic, so that it is usable by OpenACC and by other accelerator-targeted languages. This is more or less what was done in GCC. Of course, this implies proposing an RFC to the OpenMP runtime community and having it accepted. The CLACC project is doing the translation from OpenACC to OpenMP in clang and has an extensive view of what should be done to support missing functions in the OpenMP runtime.
  • A second idea would be to use an external open-source OpenACC runtime (maybe the one from OpenARC, since it is from ORNL and we have people with knowledge of it). If it proves to be stable and efficient, some work could be done to make it LLVM-friendly and maybe propose it to the community. We could also make the lowering target the GNU Offloading and Multi Processing Runtime Library. The choice might be driven by which is more stable and more suitable to be plugged in.
  • The last choice would be to target a generic offloading runtime like PHIRE. There will certainly be some research work done in this direction by people from national labs, but this is not meant to be upstreamed, at least not in a first phase.

The choice will be mostly driven by what is possible to do and by how the community accepts changes to existing code or the addition of a new runtime. But, like the TFRT runtime, it can also stay external until it matures and reaches a wider community.

Hopefully this answers some of your questions. Let me know if you want more details on some other aspects.
