Error in lowering vectorized code

Hello,

I am trying to write a 256x256x256 matrix multiplication based on the short tutorial provided by Aart Bik and Nicolas Vasilache.
Based on their examples, I wrote the following function, meant to iterate their kernel over the whole matrices.

func @matmul(%C:memref<256x256xf32>,
             %A:memref<256x256xf32>,
             %B:memref<256x256xf32>) {
  affine.for %i = 0 to 256 step 4 {
    affine.for %j = 0 to 256 step 4 {
      affine.for %k = 0 to 256 step 4 {
        %av= vector.load %A[%i,%k] : memref<256x256xf32>, vector<4x4xf32>
        %bv= vector.load %B[%k,%j] : memref<256x256xf32>, vector<4x4xf32>
        %cv= vector.load %C[%i,%j] : memref<256x256xf32>, vector<4x4xf32>

        %3 = vector.transpose %av,[1,0]: vector<4x4xf32> to vector<4x4xf32>
        %4 = vector.extract %3[0]: vector<4x4xf32>
        %5 = vector.extract %bv[0]: vector<4x4xf32>
        %6= vector.outerproduct %4,%5,%cv: vector<4xf32>, vector<4xf32>
        %7= vector.extract %3[1]: vector<4x4xf32>
        %8= vector.extract %bv[1]: vector<4x4xf32>
        %9= vector.outerproduct %7,%8,%6: vector<4xf32>, vector<4xf32>
        %10= vector.extract %3[2]: vector<4x4xf32>
        %11= vector.extract %bv[2]: vector<4x4xf32>
        %12= vector.outerproduct %10,%11,%9: vector<4xf32>, vector<4xf32>
        %13= vector.extract %3[3]: vector<4x4xf32>
        %14= vector.extract %bv[3]: vector<4x4xf32>
        %15= vector.outerproduct %13,%14,%12: vector<4xf32>, vector<4xf32>

        vector.store %15, %C[%i,%j] : memref<256x256xf32>, vector<4x4xf32>
      }
    }
  }
  return 
}

I try to compile this code with:

mlir-opt --lower-affine --convert-scf-to-std --convert-vector-to-llvm --convert-std-to-llvm  vector.mlir

The result is an error:

vector.mlir:12:14: error: failed to legalize operation 'llvm.mlir.cast' that was explicitly marked illegal
        %3 = vector.transpose %av,[1,0]: vector<4x4xf32> to vector<4x4xf32>

The last lowering phase somehow cannot deal with the llvm.mlir.cast operation synthesized above.

Is this normal?
Dumitru

ADDENDUM1: The following code also breaks, with a different error message:

func @test(%C:memref<256x256xf32>,
             %A:memref<256x256xf32>,
             %B:memref<256x256xf32>) {
  %0 = constant 0.0 : f32
  %1 = vector.broadcast %0 : f32 to vector<4xf32>
  %2 = vector.broadcast %1 : vector<4xf32> to vector<4x4xf32>
  affine.for %i = 0 to 256 step 4 {
    affine.for %j = 0 to 256 step 4 {
      vector.store %2, %C[%i,%j] : memref<256x256xf32>, vector<4x4xf32>
    }
  }
  return 
}

ADDENDUM2: vector.transfer_write does not work either, as it seems to accept only 1-D vectors (for vectors with more dimensions, the conversion to LLVM breaks). In other words, it would seem that the only way to load/store vectors from larger matrices is to do it scalar value by scalar value. And even that is difficult, because it’s not obvious how to extract a scalar value from a vector.
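(For what it’s worth, a scalar can be pulled out of an n-D vector by giving vector.extract a full set of indices; with fewer indices it yields a sub-vector. A sketch, assuming the vector dialect syntax of this MLIR version:

```
// %v : vector<4x4xf32>
%row = vector.extract %v[1] : vector<4x4xf32>     // yields vector<4xf32>
%elt = vector.extract %v[1, 2] : vector<4x4xf32>  // yields f32
```
)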

Using vector.transfer operations would require running convert-vector-to-scf before lower-affine (see e.g. mlir/test/Conversion/VectorToSCF/vector-to-loops.mlir).
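Concretely, that would turn the pipeline above into something like the following (a sketch; the exact pass names depend on the MLIR version in use):

```
mlir-opt --convert-vector-to-scf --lower-affine --convert-scf-to-std \
         --convert-vector-to-llvm --convert-std-to-llvm vector.mlir
```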

The vector.load / vector.store op lowering was introduced in D96185 ([mlir][Vector] Introduce 'vector.load' and 'vector.store' ops). ConvertVectorToLLVM.cpp shows the lowering bails out in the >1-D case, so I guess you’re back to a solution with vector.transfer for now.

Note that this is what convert-vector-to-scf does (there are variations depending on whether we go through memory/use indirect addressing or unroll/use vector.insert/extract). The current implementation is not progressive enough and @matthias-springer is in the process of revisiting it now that he extended vector.transfer to also work with an explicit mask operand.


Thanks! We managed to make it work. I suggest adding calling code like this to the tutorial you wrote (which is very useful, BTW). We can contribute such examples if you prefer to spend your time on better things; for us it is natural to write them, as it is part of learning how to use the tooling.

Please do, contributions are most welcome, the more the merrier! 🙂

Here it is:

func @matmul_vectorized(%C:memref<256x256xf32>,
                        %A:memref<256x256xf32>,
                        %B:memref<256x256xf32>) {
  %cf0 = constant 0.0 : f32
  affine.for %i = 0 to 256 step 4 {
    affine.for %j = 0 to 256 step 4 {
      affine.for %k = 0 to 256 step 4 {
        %av = vector.transfer_read %A[%i, %k], %cf0 : memref<256x256xf32>, vector<4x4xf32>
        %bv = vector.transfer_read %B[%k, %j], %cf0 : memref<256x256xf32>, vector<4x4xf32>
        %cv = vector.transfer_read %C[%i, %j], %cf0 : memref<256x256xf32>, vector<4x4xf32>

        %3  = vector.transpose %av, [1, 0] : vector<4x4xf32> to vector<4x4xf32>
        %4  = vector.extract %3[0] : vector<4x4xf32>
        %5  = vector.extract %bv[0] : vector<4x4xf32>
        %6  = vector.outerproduct %4, %5, %cv : vector<4xf32>, vector<4xf32>
        %7  = vector.extract %3[1] : vector<4x4xf32>
        %8  = vector.extract %bv[1] : vector<4x4xf32>
        %9  = vector.outerproduct %7, %8, %6 : vector<4xf32>, vector<4xf32>
        %10 = vector.extract %3[2] : vector<4x4xf32>
        %11 = vector.extract %bv[2] : vector<4x4xf32>
        %12 = vector.outerproduct %10, %11, %9 : vector<4xf32>, vector<4xf32>
        %13 = vector.extract %3[3] : vector<4x4xf32>
        %14 = vector.extract %bv[3] : vector<4x4xf32>
        %15 = vector.outerproduct %13, %14, %12 : vector<4xf32>, vector<4xf32>

        vector.transfer_write %15, %C[%i, %j] : vector<4x4xf32>, memref<256x256xf32>
      }
    }
  }
  return
}

It’s the simplest one I could build (but clearly not the most efficient).
D.