Hi folks, I’m just getting started with MLIR and trying to learn what is possible. I hope this is the right place to ask usability questions (if not, please pardon my ignorance).

I’m looking at some simple examples to see what the MLIR framework currently supports for loop fusion.

I wrote up a simple loop with affine.for and affine.load/affine.store, and I was able to get --affine-loop-fusion to fuse my loops (hooray!).
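For reference, here is a minimal sketch of the kind of producer/consumer loop pair meant here. It is not the exact code from that experiment: the function name, value names, and shapes are hypothetical, and it assumes the same older standard-dialect syntax as the example further down.

```
// Hypothetical example; names and shapes are illustrative only.
func @producer_consumer(%in : memref<128xf32>, %out : memref<128xf32>) {
  %tmp = alloc() : memref<128xf32>
  %cst = constant 2.0 : f32
  // Producer loop: writes %tmp through affine.store.
  affine.for %i = 0 to 128 {
    %v = affine.load %in[%i] : memref<128xf32>
    %d = mulf %v, %cst : f32
    affine.store %d, %tmp[%i] : memref<128xf32>
  }
  // Consumer loop: reads %tmp through affine.load, so the dependence
  // on the producer is expressed purely in affine memory operations.
  affine.for %i = 0 to 128 {
    %v = affine.load %tmp[%i] : memref<128xf32>
    %s = addf %v, %v : f32
    affine.store %s, %out[%i] : memref<128xf32>
  }
  return
}
```

Running something like `mlir-opt --affine-loop-fusion` on a file containing this function (the file name is up to you) should make the two loops candidates for fusion, since every memory access goes through affine.load/affine.store.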

I then tried something a bit different, and maybe something not intended to be supported: I mixed linalg operations (slice/matmul) inside my affine.for loops, and loop fusion no longer happens. Stepping through the code, I see that the affine loop fusion code looks only for affine.load/affine.store. This kind of makes sense, as I suppose those are the only memory operations it fully understands.

So my question for folks here is: what are some suggestions on how to accomplish what I’m trying to do?

- I want to be able to track the dependencies between the loops.
- I want to enable loop fusion.
- I do *not* want to lower my matrix multiplication into linear operations.

My example MLIR is below:

```
func @test(%kernel1 : memref<256x128xf32>, %kernel2 : memref<256x256xf32>, %P0 : memref<128x1xf32>, %out : memref<256x1xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c2 = constant 2 : index
  %c128 = constant 128 : index
  %c256 = constant 256 : index
  %r0to128 = linalg.range %c0:%c128:%c1 : !linalg.range
  %r0to256 = linalg.range %c0:%c256:%c1 : !linalg.range
  %r0to1 = linalg.range %c0:%c1:%c1 : !linalg.range
  // do a CrossProduct (256x128, 128x1) -> 256x1
  // in two iterations, doing 128 elements at a time
  affine.for %0 = 0 to 2 {
    %min = muli %c128, %0 : index
    %max = addi %min, %c128 : index
    %r0 = linalg.range %min:%max:%c0 : !linalg.range
    %sub_kernel1 = linalg.slice %kernel1[%r0, %r0to128] : memref<256x128xf32>, !linalg.range, !linalg.range, memref<128x128xf32>
    %sub_out = linalg.slice %out[%r0, %r0to1] : memref<256x1xf32>, !linalg.range, !linalg.range, memref<128x1xf32>
    linalg.matmul(%sub_kernel1, %P0, %sub_out) : memref<128x128xf32>, memref<128x1xf32>, memref<128x1xf32>
  }
  // do a CrossProduct (256x256, 256x1) -> 256x1
  // in two iterations, doing 128 elements at a time
  affine.for %0 = 0 to 2 {
    %min = muli %c128, %0 : index
    %max = addi %min, %c128 : index
    %r0 = linalg.range %min:%max:%c0 : !linalg.range
    %sub_kernel2 = linalg.slice %kernel2[%r0, %r0to256] : memref<256x256xf32>, !linalg.range, !linalg.range, memref<128x256xf32>
    %sub_out = linalg.slice %out[%r0, %r0to1] : memref<256x1xf32>, !linalg.range, !linalg.range, memref<128x1xf32>
    linalg.matmul(%sub_kernel2, %out, %sub_out) : memref<128x256xf32>, memref<256x1xf32>, memref<128x1xf32>
  }
  return
}
```

Thanks,

Ian Bearman

Principal Software Engineering Manager

Microsoft Visual C++ Team: Optimization & Code Generation

`/* Making your code faster, smaller, smarter! */`