Error in lowering vectorized code

Hello,

I am trying to write a 256x256x256 matrix multiplication based on the short tutorial provided by Aart Bik and Nicolas Vasilache.
Based on their examples, I wrote the following function, meant to iterate their kernel over the whole matrices.

func @matmul(%C:memref<256x256xf32>,
             %A:memref<256x256xf32>,
             %B:memref<256x256xf32>) {
  affine.for %i = 0 to 256 step 4 {
    affine.for %j = 0 to 256 step 4 {
      affine.for %k = 0 to 256 step 4 {
        %av= vector.load %A[%i,%k] : memref<256x256xf32>, vector<4x4xf32>
        %bv= vector.load %B[%k,%j] : memref<256x256xf32>, vector<4x4xf32>
        %cv= vector.load %C[%i,%j] : memref<256x256xf32>, vector<4x4xf32>

        %3 = vector.transpose %av,[1,0]: vector<4x4xf32> to vector<4x4xf32>
        %4 = vector.extract %3[0]: vector<4x4xf32>
        %5 = vector.extract %bv[0]: vector<4x4xf32>
        %6= vector.outerproduct %4,%5,%cv: vector<4xf32>, vector<4xf32>
        %7= vector.extract %3[1]: vector<4x4xf32>
        %8= vector.extract %bv[1]: vector<4x4xf32>
        %9= vector.outerproduct %7,%8,%6: vector<4xf32>, vector<4xf32>
        %10= vector.extract %3[2]: vector<4x4xf32>
        %11= vector.extract %bv[2]: vector<4x4xf32>
        %12= vector.outerproduct %10,%11,%9: vector<4xf32>, vector<4xf32>
        %13= vector.extract %3[3]: vector<4x4xf32>
        %14= vector.extract %bv[3]: vector<4x4xf32>
        %15= vector.outerproduct %13,%14,%12: vector<4xf32>, vector<4xf32>

        vector.store %15, %C[%i,%j] : memref<256x256xf32>, vector<4x4xf32>
      }
    }
  }
  return 
}

I try to compile this code with:

mlir-opt --lower-affine --convert-scf-to-std --convert-vector-to-llvm --convert-std-to-llvm  vector.mlir

The result is an error:

vector.mlir:12:14: error: failed to legalize operation 'llvm.mlir.cast' that was explicitly marked illegal
        %3 = vector.transpose %av,[1,0]: vector<4x4xf32> to vector<4x4xf32>

The last lowering phase somehow cannot deal with the llvm.mlir.cast operation synthesized above.

Is this normal?
Dumitru

ADDENDUM1: The following code also breaks, with a different error message:

func @test(%C:memref<256x256xf32>,
             %A:memref<256x256xf32>,
             %B:memref<256x256xf32>) {
  %0 = constant 0.0 : f32
  %1 = vector.broadcast %0 : f32 to vector<4xf32>
  %2 = vector.broadcast %1 : vector<4xf32> to vector<4x4xf32>
  affine.for %i = 0 to 256 step 4 {
    affine.for %j = 0 to 256 step 4 {
      vector.store %2, %C[%i,%j] : memref<256x256xf32>, vector<4x4xf32>
    }
  }
  return 
}

ADDENDUM2: vector.transfer_write does not work either, as it seems to accept only 1-D vectors (for vectors with more dimensions, the conversion to LLVM breaks). In other words, it would seem that the only way to load/store vectors from larger matrices is to do it scalar value by scalar value. And even that is difficult, because it’s not obvious how to extract a scalar value from a vector.
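(For what it’s worth, a scalar can be pulled out of an n-D vector by giving vector.extract a full set of indices; with fewer indices it yields a sub-vector. A sketch, assuming the vector dialect syntax of this MLIR version:

```
// %v : vector<4x4xf32>
%row = vector.extract %v[1] : vector<4x4xf32>     // yields vector<4xf32>
%elt = vector.extract %v[1, 2] : vector<4x4xf32>  // yields f32
```
)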

Using vector.transfer operations would require running convert-vector-to-scf before lower-affine (see e.g. mlir/test/Conversion/VectorToSCF/vector-to-loops.mlir).
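Concretely, that would turn the pipeline above into something like the following (a sketch; the exact pass names depend on the MLIR version in use):

```
mlir-opt --convert-vector-to-scf --lower-affine --convert-scf-to-std \
         --convert-vector-to-llvm --convert-std-to-llvm vector.mlir
```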

The vector.load / vector.store op lowering was introduced in D96185 ([mlir][Vector] Introduce 'vector.load' and 'vector.store' ops). ConvertVectorToLLVM.cpp shows the lowering bails out in the >1-D case, so I guess you’re back to a solution with vector.transfer for now.

Note that this is what convert-vector-to-scf does (there are variations depending on whether we go through memory/use indirect addressing or unroll/use vector.insert/extract). The current implementation is not progressive enough and @matthias-springer is in the process of revisiting it now that he extended vector.transfer to also work with an explicit mask operand.


Thanks! We managed to make it work. I suggest adding calling code like this to the tutorial you wrote (which is very useful, BTW). We can contribute such examples if you prefer to spend your time on better things; for us it is natural to write them, as it is part of learning how to use the tooling.

Please do, contributions are most welcome, the more the merrier! 🙂

Here it is:

func @matmul_vectorized(%C:memref<256x256xf32>,
                        %A:memref<256x256xf32>,
                        %B:memref<256x256xf32>) {
  %cf0 = constant 0.0 : f32
  affine.for %i = 0 to 256 step 4 {
    affine.for %j = 0 to 256 step 4 {
      affine.for %k = 0 to 256 step 4 {
        %av = vector.transfer_read %A[%i, %k], %cf0 : memref<256x256xf32>, vector<4x4xf32>
        %bv = vector.transfer_read %B[%k, %j], %cf0 : memref<256x256xf32>, vector<4x4xf32>
        %cv = vector.transfer_read %C[%i, %j], %cf0 : memref<256x256xf32>, vector<4x4xf32>

        %3  = vector.transpose %av, [1, 0] : vector<4x4xf32> to vector<4x4xf32>
        %4  = vector.extract %3[0] : vector<4x4xf32>
        %5  = vector.extract %bv[0] : vector<4x4xf32>
        %6  = vector.outerproduct %4, %5, %cv : vector<4xf32>, vector<4xf32>
        %7  = vector.extract %3[1] : vector<4x4xf32>
        %8  = vector.extract %bv[1] : vector<4x4xf32>
        %9  = vector.outerproduct %7, %8, %6 : vector<4xf32>, vector<4xf32>
        %10 = vector.extract %3[2] : vector<4x4xf32>
        %11 = vector.extract %bv[2] : vector<4x4xf32>
        %12 = vector.outerproduct %10, %11, %9 : vector<4xf32>, vector<4xf32>
        %13 = vector.extract %3[3] : vector<4x4xf32>
        %14 = vector.extract %bv[3] : vector<4x4xf32>
        %15 = vector.outerproduct %13, %14, %12 : vector<4xf32>, vector<4xf32>

        vector.transfer_write %15, %C[%i, %j] : vector<4x4xf32>, memref<256x256xf32>
      }
    }
  }
  return
}

It’s the simplest one I could build (but clearly not the most efficient).
D.