Linalg tiling error

Hi all,
After the help I received in this thread, I was able to build a reasonably well-performing GEMM for the architecture I am using. Now I am trying to automate the process of selecting parameters like tile sizes, unrolling factors, etc.

However, when I tile with tile sizes [9, 9, 9] (this would be the L1/L2/L3 tiling) and then apply register tiling with tile sizes [8, 8, 8], I get this error:

mlir/lib/Analysis/SliceAnalysis.cpp:57: void getForwardSliceImpl(mlir::Operation*, llvm::SetVector<mlir::Operation*>*, mlir::TransitiveFilter): Assertion `op->getNumRegions() == 0 && "unexpected generic op with regions"' failed.

This only happens if I couple the L1/L2/L3 tiling with the register tiling. If I remove register tiling, then it works.

While I understand that this is quite an edge case and probably not extremely useful, I would still like to understand what is going on, so I can get a better understanding of the compiler internals. This is the C++ snippet I am using to compile linalg.matmul:

    {
        // First level: L1/L2/L3 tiling with tile sizes [9, 9, 9], loop
        // interchange, and promotion of the two input operands.
        mlir::linalg::CodegenStrategy strategy;

        auto tilingOptions = mlir::linalg::LinalgTilingOptions()
                             .setTileSizes(tileSizes)
                             .setInterchange({0, 2, 1});

        auto promotionOptions = mlir::linalg::LinalgPromotionOptions()
                                .setOperandsToPromote({0, 1})
                                .setAlignment(getpagesize());

        strategy
                .tileIf<mlir::linalg::MatmulOp>(!tileSizes.empty(), tilingOptions)
                .promoteIf<mlir::linalg::MatmulOp>(params.promote, promotionOptions);

        strategy.transform(getFunction());
    }
    {
        // Second level: register tiling with tile sizes [8, 8, 8], promotion,
        // vectorization, and options for lowering vector transfers to SCF.
        mlir::linalg::CodegenStrategy strategyRegisters;

        auto tilingOptions = mlir::linalg::LinalgTilingOptions()
                             .setTileSizes(registerTileSizes);

        auto promotionOptions = mlir::linalg::LinalgPromotionOptions()
                                .setUseFullTileBuffersByDefault(params.promoteFullTile)
                                .setAlignment(128);

        strategyRegisters
                .tileIf<mlir::linalg::MatmulOp>(!registerTileSizes.empty(), tilingOptions)
                .promoteIf<mlir::linalg::MatmulOp>(params.promote, promotionOptions)
                .vectorizeIf<mlir::linalg::MatmulOp>(params.vectorize)
                .setVectorTransformsOptions(vectorTransformsOptions)
                .setVectorTransferToSCFOptions(
                    mlir::VectorTransferToSCFOptions().setUnroll(params.unrollVectorTransfers));

        strategyRegisters.transform(getFunction());
    }

In the snippet, tileSizes is the vector [9, 9, 9] and registerTileSizes is the vector [8, 8, 8].
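For completeness, the two size vectors are nothing special; they are set up roughly like this (a minimal sketch with the surrounding driver code omitted; if I remember correctly, setTileSizes accepts an ArrayRef<int64_t>, so a SmallVector works directly):

    #include "llvm/ADT/SmallVector.h"

    // Cache-level (L1/L2/L3) tile sizes and register tile sizes.
    llvm::SmallVector<int64_t, 4> tileSizes = {9, 9, 9};
    llvm::SmallVector<int64_t, 4> registerTileSizes = {8, 8, 8};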

My main questions are:

  1. Why is it giving the error in the first place?
  2. How can I debug it? Should I print some intermediate IR so that I can see what is going on?

Thanks for any hint,
Giuseppe

Just to add some more info: the call that makes the compilation fail is setUnroll:

.setVectorTransferToSCFOptions(mlir::VectorTransferToSCFOptions().setUnroll(params.unrollVectorTransfers));

If I remove this, it works.

Disabling vectorization altogether and sharing the IR just before vectorization would be a good place to start: that way we can reproduce what the vectorization + lowering to SCF and LLVM does without having to replicate your full setup.

Alternatively, if you can share a Phabricator diff at head and repro instructions, we can patch it in and investigate to reduce the problem to something standalone.
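If it is easier, one low-effort way to grab that intermediate IR from your own driver is to enable IR printing on the pass manager; a sketch, where pm and context stand for your mlir::PassManager and mlir::MLIRContext:

    // Module-scope IR printing requires single-threaded execution.
    context.disableMultithreading();
    // With the default arguments this dumps the IR before and after every
    // pass to stderr, so the module right before vectorization is easy to grab.
    pm.enableIRPrinting();

Running the same pipeline through mlir-opt with the IR-printing flags (-print-ir-after-all, or -mlir-print-ir-after-all in more recent revisions) gives the same output without touching the driver.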

Hi @nicolasvasilache,
Sure, this is the MLIR before any vectorization (the GEMM dimensions M, K, N are all set to 1024):

#map0 = affine_map<(d0) -> (9, -d0 + 1024)>
#map1 = affine_map<(d0, d1)[s0] -> (d0 * 1024 + s0 + d1)>
#map2 = affine_map<(d0, d1) -> (d0 * 9 + d1)>
#map3 = affine_map<(d0, d1) -> (8, d0 - d1)>
#map4 = affine_map<(d0, d1)[s0] -> (d0 * 9 + s0 + d1)>
#map5 = affine_map<(d0, d1) -> (d0 * 8 + d1)>
module  {
  func @gemm(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>, %arg2: memref<1024x1024xf32>) {
    %c1024 = constant 1024 : index
    %c9 = constant 9 : index
    %c324 = constant 324 : index
    %c8 = constant 8 : index
    %c0 = constant 0 : index
    %c256 = constant 256 : index
    %cst = constant 0.000000e+00 : f32
    %0 = alloc(%c324) {alignment = 4096 : i64} : memref<?xi8>
    %1 = alloc(%c324) {alignment = 4096 : i64} : memref<?xi8>
    %2 = std.view %0[%c0][] : memref<?xi8> to memref<9x9xf32>
    %3 = std.view %1[%c0][] : memref<?xi8> to memref<9x9xf32>
    %4 = alloc(%c256) {alignment = 128 : i64} : memref<?xi8>
    %5 = alloc(%c256) {alignment = 128 : i64} : memref<?xi8>
    %6 = alloc(%c256) {alignment = 128 : i64} : memref<?xi8>
    scf.for %arg3 = %c0 to %c1024 step %c9 {
      %7 = affine.min #map0(%arg3)
      %8 = affine.min #map0(%arg3)
      scf.for %arg4 = %c0 to %c1024 step %c9 {
        %9 = affine.min #map0(%arg4)
        %10 = subview %arg0[%arg3, %arg4] [%7, %9] [1, 1] : memref<1024x1024xf32> to memref<?x?xf32, #map1>
        %11 = affine.min #map0(%arg4)
        %12 = subview %2[0, 0] [%7, %9] [1, 1] : memref<9x9xf32> to memref<?x?xf32, #map2>
        scf.for %arg5 = %c0 to %c1024 step %c9 {
          %13 = affine.min #map0(%arg5)
          %14 = subview %arg1[%arg4, %arg5] [%11, %13] [1, 1] : memref<1024x1024xf32> to memref<?x?xf32, #map1>
          %15 = affine.min #map0(%arg5)
          %16 = subview %arg2[%arg3, %arg5] [%8, %15] [1, 1] : memref<1024x1024xf32> to memref<?x?xf32, #map1>
          %17 = subview %3[0, 0] [%11, %13] [1, 1] : memref<9x9xf32> to memref<?x?xf32, #map2>
          linalg.copy(%10, %12) : memref<?x?xf32, #map1>, memref<?x?xf32, #map2>
          linalg.copy(%14, %17) : memref<?x?xf32, #map1>, memref<?x?xf32, #map2>
          scf.for %arg6 = %c0 to %7 step %c8 {
            %18 = affine.min #map3(%7, %arg6)
            %19 = affine.min #map3(%8, %arg6)
            scf.for %arg7 = %c0 to %13 step %c8 {
              %20 = affine.min #map3(%13, %arg7)
              %21 = affine.min #map3(%15, %arg7)
              %22 = subview %16[%arg6, %arg7] [%19, %21] [1, 1] : memref<?x?xf32, #map1> to memref<?x?xf32, #map1>
              scf.for %arg8 = %c0 to %9 step %c8 {
                %23 = affine.min #map3(%9, %arg8)
                %24 = subview %12[%arg6, %arg8] [%18, %23] [1, 1] : memref<?x?xf32, #map2> to memref<?x?xf32, #map4>
                %25 = affine.min #map3(%11, %arg8)
                %26 = subview %17[%arg8, %arg7] [%25, %20] [1, 1] : memref<?x?xf32, #map2> to memref<?x?xf32, #map4>
                %27 = std.view %4[%c0][] : memref<?xi8> to memref<8x8xf32>
                %28 = subview %27[0, 0] [%18, %23] [1, 1] : memref<8x8xf32> to memref<?x?xf32, #map5>
                linalg.fill(%27, %cst) : memref<8x8xf32>, f32
                %29 = std.view %5[%c0][] : memref<?xi8> to memref<8x8xf32>
                %30 = subview %29[0, 0] [%25, %20] [1, 1] : memref<8x8xf32> to memref<?x?xf32, #map5>
                linalg.fill(%29, %cst) : memref<8x8xf32>, f32
                %31 = std.view %6[%c0][] : memref<?xi8> to memref<8x8xf32>
                %32 = subview %31[0, 0] [%19, %21] [1, 1] : memref<8x8xf32> to memref<?x?xf32, #map5>
                linalg.fill(%31, %cst) : memref<8x8xf32>, f32
                linalg.copy(%24, %28) : memref<?x?xf32, #map4>, memref<?x?xf32, #map5>
                linalg.copy(%26, %30) : memref<?x?xf32, #map4>, memref<?x?xf32, #map5>
                linalg.copy(%22, %32) : memref<?x?xf32, #map1>, memref<?x?xf32, #map5>
                linalg.matmul ins(%27, %29 : memref<8x8xf32>, memref<8x8xf32>) outs(%31 : memref<8x8xf32>)
                linalg.copy(%32, %22) : memref<?x?xf32, #map5>, memref<?x?xf32, #map1>
              }
            }
          }
        }
      }
    }
    dealloc %4 : memref<?xi8>
    dealloc %5 : memref<?xi8>
    dealloc %6 : memref<?xi8>
    dealloc %1 : memref<?xi8>
    dealloc %0 : memref<?xi8>
    return
  }
}

Thank you once more for your help,
Giuseppe

Hi @nicolasvasilache,
I think I solved the issue. I tried running mlir-opt on the intermediate IR I had and it worked fine: that hinted that I was doing something wrong in my pass pipeline.

Basically, this is what I was doing:

    funcPM.addPass(mlir::createCanonicalizerPass());
    funcPM.addPass(createLinalgCodegenPass(options));
    // Compile module and print the intermediate IR
    funcPM.addPass(mlir::createConvertVectorToSCFPass());
    funcPM.addPass(mlir::createLowerAffinePass());
    funcPM.addPass(mlir::createConvertLinalgToLoopsPass());
    funcPM.addPass(mlir::createLowerToCFGPass());
    pm.addPass(mlir::createConvertVectorToLLVMPass());
    pm.addPass(mlir::createLowerToLLVMPass());
    // Compile module and translate to LLVMIR
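For context, funcPM and pm come from a single module-level pass manager, with the function passes nested under it; roughly like this (a sketch with my own details elided; context and module stand for the MLIRContext and the ModuleOp being compiled):

    // One top-level pass manager on the module, with a nested pipeline
    // anchored on func ops for the function-level passes above.
    mlir::PassManager pm(&context);
    mlir::OpPassManager &funcPM = pm.nest<mlir::FuncOp>();
    // ... the funcPM.addPass(...) / pm.addPass(...) calls shown above ...
    if (mlir::failed(pm.run(module)))
        llvm::errs() << "pass pipeline failed\n";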

And apparently I cannot run the passes multiple times on the same module. Could you confirm that?

Thanks,
Giuseppe

Hmm… I can't confirm that; I don't see what would prevent you from running these two pass pipelines, one on func and the other on module. Unless some weird intermediate state is created, I don't see a problem with your pass pipelines offhand.