LLVM Discussion Forums

Array copy elimination optimization in affine dialect

Hello!
I am currently trying to experiment with the affine dialect of mlir for the flang project. I am working with a small example to get array copy elimination working.

!arr3d_ref = type memref<?x?x?xi8>
// c = ((a + b) * c) + (b / 2)
func @arr_compute(%a : !arr3d_ref, %b : !arr3d_ref, %c : !arr3d_ref) -> !arr3d_ref {
  %i = dim %a, 0 : !arr3d_ref
  %j = dim %a, 1 : !arr3d_ref
  %k = dim %a, 2 : !arr3d_ref
  %t1 = alloc(%i,%j,%k) : !arr3d_ref
  %t2 = alloc(%i,%j,%k) : !arr3d_ref
  // t1 = a + b
  affine.for %t = 0 to %k {
    affine.for %s = 0 to %j {
      affine.for %r = 0 to %i {
        %a_v = affine.load %a[%r, %s, %t] : !arr3d_ref
        %b_v = affine.load %b[%r, %s, %t] : !arr3d_ref
        %t1_v = addi %a_v, %b_v : i8
        affine.store %t1_v, %t1[%r, %s, %t] : !arr3d_ref
      }
    }
  }
  // t2 = t1 + c
  affine.for %t = 0 to %k {
    affine.for %s = 0 to %j {
      affine.for %r = 0 to %i {
        %c_v = affine.load %c[%r, %s, %t] : !arr3d_ref
        %t1_v = affine.load %t1[%r, %s, %t] : !arr3d_ref
        %t2_v = muli %c_v, %t1_v : i8
        affine.store %t2_v, %t2[%r, %s, %t] : !arr3d_ref
      }
    }
  }
  // c = t2 + b/2
  affine.for %t = 0 to %k {
    affine.for %s = 0 to %j {
      affine.for %r = 0 to %i {
        %t2_v = affine.load %t2[%r, %s, %t] : !arr3d_ref
        %b_v = affine.load %b[%r, %s, %t] : !arr3d_ref
        %v_2 = constant 2 : i8
        %b_vv = divi_signed %b_v, %v_2 : i8
        %c_v = addi %t2_v, %b_vv : i8
        affine.store %c_v, %c[%r, %s, %t] : !arr3d_ref
      }
    }
  }
  return %c : !arr3d_ref
}

I was wondering if there is a way to get this program to optimize t1 and t2 arrays out. I tried the --affine-loop-fusion but that doesn’t seem to do anything.
Is there another pass which performs these kind of optimizations or is there anything I am missing here?
Thank you,
Rajan

-affine-loop-fusion is meant to be able to perform such fusion here, but it currently doesn’t support symbolic bounds (there is a TODO in the check). Had %k, %j, %i been constants, it would have worked. Extending the pass to handle dynamic trip counts / memrefs would be interesting and conceptually straightforward. For your example, setting %i, %j, %k to the result of a constant op on the top yields this:

-canonicalize -affine-loop-fusion

%0 = alloc() : memref<1x1x1xi8>
    %1 = alloc() : memref<1x1x1xi8>
    %c2_i8 = constant 2 : i8
    affine.for %arg3 = 0 to 50 {
      affine.for %arg4 = 0 to 50 {
        affine.for %arg5 = 0 to 50 {
          %2 = affine.load %arg0[%arg5, %arg4, %arg3] : memref<?x?x?xi8>
          %3 = affine.load %arg1[%arg5, %arg4, %arg3] : memref<?x?x?xi8>
          %4 = addi %2, %3 : i8
          affine.store %4, %0[0, 0, 0] : memref<1x1x1xi8>
          %5 = affine.load %arg2[%arg5, %arg4, %arg3] : memref<?x?x?xi8>
          %6 = affine.load %0[0, 0, 0] : memref<1x1x1xi8>
          %7 = muli %5, %6 : i8
          affine.store %7, %1[0, 0, 0] : memref<1x1x1xi8>
          %8 = affine.load %1[0, 0, 0] : memref<1x1x1xi8>
          %9 = affine.load %arg1[%arg5, %arg4, %arg3] : memref<?x?x?xi8>
          %10 = divi_signed %9, %c2_i8 : i8
          %11 = addi %8, %10 : i8
          affine.store %11, %arg2[%arg5, %arg4, %arg3] : memref<?x?x?xi8>
        }
      }
    }

Is anyone actively working on relaxing loop fusion for the symbolic bounds case?