[RFC] Privatisation in OpenMP dialect

OpenMP constructs can define a variable to be a copy of another variable so that each thread executing the construct gets a separate copy and there are no dependencies. This is called privatisation in OpenMP.
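To make the definition concrete, here is a minimal C++ sketch (not taken from the RFC; the function names are made up for illustration). The first function relies on a `private` clause, the second writes the private copy by hand as a variable local to the loop body, which is roughly what "dissolving" the clause means. Both behave the same whether or not OpenMP is enabled at compile time.

```cpp
#include <cassert>
#include <vector>

// Clause form: `tmp` is declared outside the construct and privatised by the
// `private` clause, so every thread works on its own uninitialised copy.
std::vector<int> squares_with_clause(int n) {
    assert(n >= 0);
    std::vector<int> out(n);
    int tmp = 0;
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < n; ++i) {
        tmp = i * i;   // always written before it is read
        out[i] = tmp;
    }
    return out;
}

// Dissolved form: the private copy is a variable local to the loop body,
// so no clause is needed.
std::vector<int> squares_with_local(int n) {
    assert(n >= 0);
    std::vector<int> out(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int tmp = i * i;
        out[i] = tmp;
    }
    return out;
}
```

Both functions are expected to return the same vector for any non-negative `n`.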

We can handle privatisation in two different ways:

  1. Privatisation clauses can be dissolved to allocas/allocs by the frontend preceding MLIR.
  • Advantages:
    • Simpler engineering
    • Other layers need not be aware of privatisation.
  • Disadvantages:
    • All frontends would need to handle privatisation clauses themselves
    • Some clauses (allocate) specify that the allocation in privatisation has to be performed by the OpenMP runtime; the frontend would then have to insert a runtime call.
  2. Privatisation clauses can be represented in the OpenMP MLIR dialect and handled appropriately.
  • Advantages:
    • Representation in MLIR would mean frontends can leave it to MLIR to handle privatisation
    • The MLIR dialect will be more representative of OpenMP and can do some checks.
  • Disadvantages:
    • Additional engineering effort to make another layer aware of privatisation
    • Have to handle copying, constructors, destructors, etc., and the differences between C++ and Fortran privatisation

Handling in the frontend is fairly obvious, so I will discuss only privatisation in the OpenMP dialect. One way to represent a private clause is to have an operand which is of the same type as the original variable. This operand will be an argument to the entry block. A transformation pass can then perform the privatisation. This is probably straightforward.

Things become a bit more involved when there are constructors or destructors, or if the variable is marked as allocatable (in which case it might have to be allocated at runtime). This would probably require the following:

  1. The constructor, destructor functions.
  2. The operation (call op) to call the constructor/destructor.
  3. The operation to perform the allocation, deallocation.

I guess (1) can be stored in the type. The pass which performs the transformation can be aware of (2) and (3), but this would restrict the transformation to being performed in that particular dialect (e.g. in FIR for Fortran, and not in the LLVM dialect or during translation).
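As a rough illustration of items (1) to (3), here is a hand-privatised region in C++. The `Tracked` type and `run_region` function are hypothetical names, not part of any proposal here. Declaring the object inside the parallel region makes the compiler emit the allocation, the constructor call on entry and the destructor call on exit for every per-thread copy, which is the work a privatisation pass would have to reproduce.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical type with a non-trivial constructor and destructor (item 1).
struct Tracked {
    static std::atomic<int> constructed;
    static std::atomic<int> destroyed;
    int value = 0;
    Tracked() { ++constructed; }
    ~Tracked() { ++destroyed; }
};
std::atomic<int> Tracked::constructed{0};
std::atomic<int> Tracked::destroyed{0};

// Sums 0..n-1 while routing each value through a per-thread private object.
int run_region(int n) {
    int sum = 0;
    #pragma omp parallel reduction(+ : sum)
    {
        Tracked priv;  // item 3 (storage) and item 2 (constructor call)
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            priv.value = i;
            sum += priv.value;
        }
    }                  // item 2 again: destructor call for each private copy
    return sum;
}
```

After `run_region` returns, the two counters should match: one constructor and one destructor call per thread that executed the region.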

For discussion:

  • Should MLIR have a representation for private clauses?
  • Is it OK to represent a private clause as an operand which is a basic block argument of the entry block?
  • Should it be a transformation pass which performs the privatisation?
  • Can this transformation pass be generic or should it sit with the source dialects?
  • Is storing constructor, destructor information in the type the right approach?

A simple Fortran example with FIR and OpenMP dialects is given below. We have a Fortran program with an OpenMP parallel region containing a loop whose index is marked as private. Initially the representation will have a private operand. After the transformation the operand will be replaced with an alloca.

The three listings below are, in order: the Fortran source, the MLIR (FIR + OpenMP) before privatisation, and the MLIR after privatisation.
program pvt
    integer :: i
    integer :: arr(10)
    !$OMP PARALLEL PRIVATE(i)
    do i=1, 10
      arr(i) = i
    end do
    !$OMP END PARALLEL
end program
func @_QQmain() {
  %0 = fir.address_of(@_QEarr) : !fir.ref<!fir.array<10xi32>>
  %1 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QEi"}
  omp.parallel private(%pvt_i : !fir.ref<i32>) {
    %c1_i32 = constant 1 : i32
    %2 = fir.convert %c1_i32 : (i32) -> index
    %c10_i32 = constant 10 : i32
    %3 = fir.convert %c10_i32 : (i32) -> index
    %c1 = constant 1 : index
    %4 = fir.do_loop %arg0 = %2 to %3 step %c1 -> index {
      %6 = fir.convert %arg0 : (index) -> i32
      fir.store %6 to %pvt_i : !fir.ref<i32>
      %7 = fir.load %pvt_i : !fir.ref<i32>
      %8 = fir.convert %7 : (i32) -> i64
      %c1_i64 = constant 1 : i64
      %9 = subi %8, %c1_i64 : i64
      %10 = fir.coordinate_of %0, %9 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
      %11 = fir.load %pvt_i : !fir.ref<i32>
      fir.store %11 to %10 : !fir.ref<i32>
      %12 = addi %arg0, %c1 : index
      fir.result %12 : index
    }
    %5 = fir.convert %4 : (index) -> i32
    fir.store %5 to %1 : !fir.ref<i32>
    omp.terminator
  }
  return
}
func @_QQmain() {
  %0 = fir.address_of(@_QEarr) : !fir.ref<!fir.array<10xi32>>
  %1 = fir.alloca i32 {bindc_name = "i", uniq_name = "_QEi"}
  omp.parallel {
    %pvt_i = fir.alloca i32 {uniq_name = "i"}
    %c1_i32 = constant 1 : i32
    %2 = fir.convert %c1_i32 : (i32) -> index
    %c10_i32 = constant 10 : i32
    %3 = fir.convert %c10_i32 : (i32) -> index
    %c1 = constant 1 : index
    %4 = fir.do_loop %arg0 = %2 to %3 step %c1 -> index {
      %6 = fir.convert %arg0 : (index) -> i32
      fir.store %6 to %pvt_i : !fir.ref<i32>
      %7 = fir.load %pvt_i : !fir.ref<i32>
      %8 = fir.convert %7 : (i32) -> i64
      %c1_i64 = constant 1 : i64
      %9 = subi %8, %c1_i64 : i64
      %10 = fir.coordinate_of %0, %9 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
      %11 = fir.load %pvt_i : !fir.ref<i32>
      fir.store %11 to %10 : !fir.ref<i32>
      %12 = addi %arg0, %c1 : index
      fir.result %12 : index
    }
    %5 = fir.convert %4 : (index) -> i32
    fir.store %5 to %1 : !fir.ref<i32>
    omp.terminator
  }
  return
}

MLIR : @ftynse @schweitz @jeanPerier
OpenMP : @jdoerfert @Meinersbur
Team : @clementval @abidmalikwaterloo @kirankumartp @SouraVX

My overall take on this is that having a first-class representation in MLIR for a concept is necessary if we want MLIR to reason about this concept (i.e., perform analyses or transformations). The converse is not necessarily true: a representation can still make sense from a layering or simplicity perspective even if there are no transformations that require it. In the former case, the design of the representation should be driven by the analysis needs; in the latter case, by simplicity.

I also caution against involving the notion of a frontend in the design. We are not in the situation of a classical “frontend - ‘mlir IR’ - backend” compiler; in general, the stack is deeper and more heterogeneous than that. As a concrete example, Polygeist has a C++ frontend that produces a mix of Affine, SCF, ex-Standard and LLVM dialect ops. Affine gets parallelized within MLIR and lowered to SCF, which may or may not be converted to the OpenMP dialect. If we decide that “the frontend” is expected to produce some form of the IR, we need to define what “the frontend” is (it appears that Polygeist would have to do some things twice: first on the input C++ with potential pragmas, and a second time when introducing new OpenMP constructs) and make sure that this form persists across all layers of the representation stack. A more extreme example would be a TF → HLO → MHLO → Linalg → Affine → SCF → OpenMP pipeline. That being said, individual pipelines need not necessarily reimplement all of the privatization every time; the dialect can provide utility functions for them to use.

I’m inclined to say yes, but can accept it either way.

This is one possibility.

Another one is to have an operation omp.mlir.privatize that can appear in the region of an operation to which the clause is attached. The benefit of having such an operation is the ease of lowering it differently based on almost arbitrary criteria (operand types, attributes, etc.) using patterns. The drawback is that its semantics become weird should it appear under control flow, i.e. outside of the entry block of the region.

I’m generally in favor of having small passes that are easy to test and replace. However, having such a pass means being able to represent the IR before and after it, which means we would ultimately implement both the first-class support for privatization clauses and the dissolved-to-allocas form.

Having it with source dialects may not compose. Imagine having several dialects that each perform privatization differently for their types (simple example, a mix of FIR and memref dialects). The representation with a dedicated operation I mentioned above is a workaround. Another scalable alternative is a generic pass that operates on type interfaces with each supported type implementing the interface.

I’m not certain I understand what is intended here. Store function pointers to a constructor and a destructor in the runtime representation of a type? This would make it impossible to privatize any built-in types because they are not going to be modified just to support this.

Type interfaces sound like a solution here, again. Types that are willing to be privatizable by OpenMP can implement an interface that takes an OpBuilder and some extra parameters and emits the IR for construction, destruction, copying, etc. in a dialect-specific way.
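A toy C++ sketch of that idea follows. Everything in it is hypothetical: `ToyBuilder` only records strings and is not `mlir::OpBuilder`, and `PrivatizableType` is an ordinary abstract class, not an actual MLIR type interface. It only shows the shape of the design, where a generic pass calls hooks and each type decides what to emit.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an IR builder: it just records pseudo-ops.
struct ToyBuilder {
    std::vector<std::string> ops;
    void emit(const std::string &op) { ops.push_back(op); }
};

// Hypothetical interface a type implements to opt in to privatisation.
struct PrivatizableType {
    virtual ~PrivatizableType() = default;
    virtual void emitConstruct(ToyBuilder &b, const std::string &name) const = 0;
    virtual void emitDestruct(ToyBuilder &b, const std::string &name) const = 0;
};

// A scalar needs only storage; there is nothing to destroy.
struct ToyScalarType : PrivatizableType {
    void emitConstruct(ToyBuilder &b, const std::string &name) const override {
        b.emit(name + " = alloca");
    }
    void emitDestruct(ToyBuilder &, const std::string &) const override {}
};

// A class-like type also calls its constructor and destructor.
struct ToyClassType : PrivatizableType {
    std::string ctor, dtor;
    ToyClassType(std::string c, std::string d)
        : ctor(std::move(c)), dtor(std::move(d)) {}
    void emitConstruct(ToyBuilder &b, const std::string &name) const override {
        b.emit(name + " = alloca");
        b.emit("call " + ctor + "(" + name + ")");
    }
    void emitDestruct(ToyBuilder &b, const std::string &name) const override {
        b.emit("call " + dtor + "(" + name + ")");
    }
};

// Generic "pass": it knows nothing about the concrete type.
std::vector<std::string> privatize(const PrivatizableType &type,
                                   const std::string &name) {
    ToyBuilder b;
    type.emitConstruct(b, name);
    b.emit("region body using " + name);
    type.emitDestruct(b, name);
    return b.ops;
}
```

With this shape, built-in types are supported by whoever provides the interface implementation, and the generic pass never needs dialect-specific knowledge.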

Thanks @ftynse for the reply.

I am not aware of any transformations/analysis that will immediately benefit from the privatisation representation (other than privatisation itself).
Another factor that we are considering for the representation in the dialect is whether the information in the representation is needed for creating the OpenMP runtime calls. AFAIU, the privatisation information is not needed for this purpose.
Not having a representation for privatisation will mean that we cannot perform some semantic checks that are specified in the OpenMP standard in the dialect.

I felt that it is unlikely that a lowering from another dialect will use privatisation. This can happen only if there is an analysis pass which determines that privatising an SSA value (corresponding to some variable) will help in parallelisation. And unless that dialect is specifically created for lowering to OpenMP, it will have other mechanisms to do a privatisation-like transformation.
I believe the likely use case is when frontends lower to the MLIR representation. Will Polygeist use the privatisation representation in the OpenMP dialect if it is there?

Yes, the semantics will be weird and the conversion will have to hoist the allocas into the entry block of the region. Finding the entry block when multiple dialects are present is also going to be a problem. In OpenMP there are operations which are going to be outlined (like parallel, task) and these will have entry blocks. An interface is being created for these ops as per an earlier suggestion of yours. What about other dialects?

I was referring to high-level language types (like classes in C++) which have custom constructors and destructors. The high-level dialect will hopefully store information about the constructor, destructor, etc. somewhere (I was assuming it has to be in the type). If it does not, then it will not be possible to insert calls to the constructor and destructor when the privatised copy is created. Are there constructors/destructors for builtin types? Anyway, this information about constructors/destructors is optional.

OK

The lastprivate clause is implemented with a runtime call in the OpenMP worksharing loop. The generated code below uses the omp.is_last variable (initially passed to __kmpc_for_static_init_4) to check whether it is the last iteration and then stores the value of the lastprivate copy to the original variable in the .omp.lastprivate.then: basic block.

call void @__kmpc_for_static_init_4(%struct.ident_t* nonnull @1, i32 %4, i32 34, i32* nonnull
%.omp.is_last, i32* nonnull %.omp.lb, i32* nonnull %.omp.ub, i32* nonnull %.omp.stride, i32 1, i32 1) #3
...
...
omp.loop.exit:                                    ; preds = %omp.inner.for.body, %entry
  %x1.0.lcssa = phi i32 [ undef, %entry ], [ %call, %omp.inner.for.body ]
  call void @__kmpc_for_static_fini(%struct.ident_t* nonnull @1, i32 %4)
  %11 = load i32, i32* %.omp.is_last, align 4, !tbaa !6
  %.not = icmp eq i32 %11, 0
  br i1 %.not, label %.omp.lastprivate.done, label %.omp.lastprivate.then
.omp.lastprivate.then:                            ; preds = %omp.loop.exit
  store i32 %x1.0.lcssa, i32* %x, align 4, !tbaa !6
  br label %.omp.lastprivate.done
.omp.lastprivate.done:                            ; preds = %.omp.lastprivate.then, %omp.loop.exit
  call void @__kmpc_barrier(%struct.ident_t* nonnull @2, i32 %4)
  ret void

Also, if code has to be inserted in the header (firstprivate) or footer (lastprivate) rather than in the loop body, then the privatisation transformation has to be performed while the CFG is generated for the loop. The CFG is currently generated in translation.
An alternative is to insert conditional loads/stores in the body of the loop for firstprivate and lastprivate.

We are leaning towards implementing privatisation outside the OpenMP dialect, in the hope that it will lead to a simpler implementation that does not have to deal with constructors and finalizers/destructors in the dialect. I was performing an audit of privatisation to check whether there are places where the privatisation information is needed for making runtime calls. If some information is needed for creating runtime calls then that information should be represented in the OpenMP dialect. I came across the lastprivate clause in the worksharing loop (and a few others) where, in the Clang-generated code, the update to the original variable happens based on the is_last variable, which is set by the runtime. This is the code shown earlier: the omp.is_last variable passed to __kmpc_for_static_init_4 is checked after the loop, and the value of the lastprivate copy is stored to the original variable in the .omp.lastprivate.then: basic block.

I was not sure whether the is_last variable set by the runtime call is required and was considering the following lowering for privatisation. This will lead to an additional comparison on every iteration to detect the last one, and will also introduce an assignment (or a call to an assignment function and the destructor/finalizer) in the body of the loop. I was wondering whether this additional comparison and assignment might interfere with optimisations of the loop.

@Meinersbur suggested having a variable is_last, representing it as an argument of the entry block (or an operand) of the OpenMP wsloop operation, and using it both in the runtime calls and as the predicate for the lastprivate update.

Another possibility is to introduce an additional region for constructs which have lastprivate or which need finalization. This region can have is_last as the argument of the entry block. This additional region will contain the lastprivate update, finalization/destructor of lastprivate variables. While lowering during translation, this region can be fitted into the exit block of the worksharing loop. This would avoid introducing the lastprivate update into the body of the loop.
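Here is a small C++ simulation of that shape, under stated assumptions: `static_chunk` is a hypothetical stand-in for the bounds computation that the runtime performs, threads are simulated one after another, and the body `x_priv = 10 * i` is arbitrary. The point is only that the loop body stays free of lastprivate logic and the update sits in a separate exit step guarded by `is_last`.

```cpp
#include <algorithm>
#include <cassert>

// Bounds of one thread's chunk; `ub` is exclusive. `is_last` is true only for
// the chunk that contains the final iteration.
struct Chunk {
    int lb;
    int ub;
    bool is_last;
};

// Hypothetical static schedule; the real OpenMP runtime is not called here.
Chunk static_chunk(int tid, int nthreads, int n) {
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);
    int per = (n + nthreads - 1) / nthreads;  // ceil(n / nthreads)
    int lb = std::min(tid * per, n);
    int ub = std::min(lb + per, n);
    return {lb, ub, lb < ub && ub == n};
}

// Simulates each thread in turn and returns the final value of the original
// variable, which starts out as `original`.
int lastprivate_exit_region(int n, int nthreads, int original) {
    int x = original;
    for (int tid = 0; tid < nthreads; ++tid) {
        Chunk c = static_chunk(tid, nthreads, n);
        int x_priv = 0;                  // private copy for this thread
        for (int i = c.lb; i < c.ub; ++i)
            x_priv = 10 * i;             // loop body, no lastprivate check
        if (c.is_last)                   // "exit region", predicate is is_last
            x = x_priv;
    }
    return x;
}
```

A destructor or finalizer call for the private copy would sit next to the `is_last` check, outside the loop body.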

I guess we could also model the terminator op to have a region and then include the lastprivate update in that region.

@Meinersbur also raised the question of whether the private variables could be values, which would be good for optimisations. This was also something that @ftynse raised in the reduction RFC.

@ftynse Are any of these approaches OK? Did I miss something simpler? Or should we go on to have a representation for private clauses in the OpenMP dialect?

FYI @jdoerfert .
CC @clementval , @kirankumartp , @SouraVX , @abidmalikwaterloo.

Source:
integer :: x
!$omp parallel
!$omp do lastprivate(x)
do i=1,N
...
end do
!$omp end do

Source (after privatisation):
integer :: x
!$omp parallel
integer :: x_priv !Not real code
!$omp do
do i=1,N
...
if (i .eq. N) then
  x = x_priv
end if
end do
!$omp end do
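For comparison, the same in-body lowering can be written as a C++ sketch. The function names and the body `2 * i` are made up for illustration. The first function leaves lastprivate to the OpenMP implementation; the second uses a private copy plus a check on the last iteration inside the loop body, which is the extra comparison discussed above.

```cpp
#include <cassert>

// Reference: the OpenMP implementation handles lastprivate.
int with_clause(int n) {
    int x = -1;
    #pragma omp parallel for lastprivate(x)
    for (int i = 1; i <= n; ++i)
        x = 2 * i;
    return x;
}

// Hand-lowered: private copy plus an in-body check for the last iteration.
int with_in_body_check(int n) {
    int x = -1;
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i) {
        int x_priv = 2 * i;   // private copy, local to the iteration
        if (i == n)
            x = x_priv;       // only the last iteration writes the original
    }
    return x;
}
```

For a loop with at least one iteration, both functions should return the value computed by the sequentially last iteration.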