I’ve been trying to follow this paper to learn how to use MLIR to generate a BLIS-like strategy for matrix multiplication. However, I’m specifically trying to do this with the linalg codegen strategy, and I’m having a hard time reproducing IR equivalent to what some of the paper’s intermediate passes produce.

At the moment, I’m trying to reproduce the IR in section 2.6. I want to start from the following MLIR file with a simple call to linalg.matmul:

```mlir
func @matmul(%arg0: memref<2088x2048xf64>,
             %arg1: memref<2048x2048xf64>,
             %arg2: memref<2088x2048xf64>) {
  linalg.matmul
    ins(%arg0, %arg1: memref<2088x2048xf64>,
                      memref<2048x2048xf64>)
    outs(%arg2: memref<2088x2048xf64>)
  return
}
```

and end up with something equivalent to the following:

```mlir
#map4 = affine_map<(d0) -> (256 * d0)>
#map5 = affine_map<(d0) -> (256 * d0 + 256)>
#map6 = affine_map<(d0) -> (64 * d0)>
#map7 = affine_map<(d0) -> (2088, 64 * d0 + 64)>
#map9 = affine_map<(d0) -> (8 * d0)>
#map10 = affine_map<(d0) -> (8 * d0 + 8)>
#map16 = affine_map<(d0) -> (16 * d0)>
#map17 = affine_map<(d0) -> (522, 16 * d0 + 16)>
func @matmul(%A: memref<2088x2048xf64>, %B: memref<2048x2048xf64>, %C: memref<2088x2048xf64>) {
  affine.for %arg3 = 0 to 8 {
    affine.for %arg4 = 0 to 33 {
      %0 = memref.alloc() : memref<64x256xf64>
      // Packing %A into a 64x256 buffer.
      affine.for %arg5 = #map6(%arg4) to min #map7(%arg4) {
        affine.for %arg6 = #map4(%arg3) to #map5(%arg3) {
          %1 = affine.load %A[%arg5, %arg6] : memref<2088x2048xf64>
          affine.store %1, %0[%arg4 * -64 + %arg5, %arg3 * -256 + %arg6] : memref<64x256xf64>
        }
      }
      affine.for %arg5 = 0 to 256 {
        %1 = memref.alloc() : memref<256x8xf64>
        // Packing %B into a 256x8 buffer.
        affine.for %arg6 = #map4(%arg3) to #map5(%arg3) {
          affine.for %arg7 = #map9(%arg5) to #map10(%arg5) {
            %2 = affine.load %B[%arg6, %arg7] : memref<2048x2048xf64>
            affine.store %2, %1[%arg3 * -256 + %arg6, %arg5 * -8 + %arg7] : memref<256x8xf64>
          }
        }
        affine.for %arg6 = #map16(%arg4) to min #map17(%arg4) {
          // This is multiplying a packed 64x256 LHS panel with a packed 256x8 RHS panel.
          affine.for %arg7 = 0 to 256 {
            affine.for %arg8 = 0 to 8 {
              affine.for %arg9 = 0 to 4 {
                %2 = affine.load %0[%arg4 * -64 + %arg6 * 4 + %arg9, %arg7] : memref<64x256xf64>
                %3 = affine.load %1[%arg7, %arg8] : memref<256x8xf64>
                %4 = affine.load %C[%arg6 * 4 + %arg9, %arg5 * 8 + %arg8] : memref<2088x2048xf64>
                %5 = mulf %2, %3 : f64
                %6 = addf %4, %5 : f64
                affine.store %6, %C[%arg6 * 4 + %arg9, %arg5 * 8 + %arg8] : memref<2088x2048xf64>
              }
            }
          }
        }
        memref.dealloc %1 : memref<256x8xf64>
      }
      memref.dealloc %0 : memref<64x256xf64>
    }
  }
  return
}
```

I don’t care about vectorization or unrolling yet, nor about the specific dialect the resulting IR uses. I would just like something that yields the same or comparable performance via passes using the linalg codegen strategy, if possible.

I’ve tried many different things in an attempt to reproduce this, and I get something close-ish by using tiling, interchanges, and promotions similar to slide 133 here. However, the resulting performance is about a third of what the above strategy yields.
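In case it helps frame the question, the shape of what I’ve been trying looks roughly like this. This is a pseudocode sketch mirroring the `mlir::linalg::CodegenStrategy` builder API as I understand it; the tile sizes come from my reading of the IR above, the interchange is just an example permutation, and the exact option/method names vary between MLIR versions:

```cpp
// Pseudocode sketch only -- not a known-good configuration.
linalg::CodegenStrategy strategy;
strategy
    .tile<linalg::MatmulOp>(
        linalg::LinalgTilingOptions()
            .setTileSizes({64, 8, 256})      // MC/NR/KC-style cache tiles
            .setInterchange({1, 2, 0}))      // example permutation; the right
                                             // loop order is part of my question
    .promote<linalg::MatmulOp>(
        linalg::LinalgPromotionOptions()
            .setOperandsToPromote({0, 1}));  // pack the LHS/RHS panels
strategy.transform(funcOp);                  // funcOp: the @matmul func
```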

Sorry if this is too broad a question, but any pointers would be greatly appreciated. I can also post what I have tried, although I’d like to first see whether someone has a simple suggestion and go from there. I imagine someone out there already knows exactly how to do this.

Thanks!