I am trying to follow the very nice tutorial on Vector Dialect AArch64 Codegen for Matrix-Matrix Multiplication. However, I’m not generating code for AArch64, but for my x86_64 PC (running macOS).
When compiling the first example (on page 2), I noticed that the generated assembly contains no fused multiply-add (FMA) instructions. I thought x86_64 had such instructions for vectorized code.
The compilation command I’m using is:
mlir-opt --convert-vector-to-llvm='reassociate-fp-reductions=1' \
    --convert-std-to-llvm sample.mlir | \
  mlir-translate --mlir-to-llvmir | \
  llc -o sample.s
Is the absence of FMA instructions expected with this command, or am I missing an option?
PS: Would it be possible to also provide the code that calls the function @vector_outerproduct_matmul_2d_4x4x4xf32_kernel to perform a 32x32x32 or 512x512x32 matrix multiplication? I’d like to see what the correct way of loading the vector registers is.
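To make the request concrete, here is a minimal sketch in plain C of the kind of driver loop I have in mind. Everything in it is my own assumption, not the tutorial's code: kernel_4x4x4 is a hypothetical scalar stand-in for the MLIR kernel, the matrices are row-major, and the sizes are fixed at 32x32x32. What I am missing is how the equivalent tile loads and stores should be written on the MLIR side.

```c
#include <assert.h>

/* Problem and tile sizes: assumptions made for this sketch only. */
enum { M = 32, N = 32, K = 32, T = 4 };

/* Hypothetical scalar stand-in for the 4x4x4 kernel:
 * c[4x4] += a[4x4] * b[4x4], each tile addressed through a row stride. */
static void kernel_4x4x4(const float *a, int lda,
                         const float *b, int ldb,
                         float *c, int ldc) {
  assert(lda >= T && ldb >= T && ldc >= T);
  for (int k = 0; k < T; ++k)
    for (int i = 0; i < T; ++i)
      for (int j = 0; j < T; ++j)
        c[i * ldc + j] += a[i * lda + k] * b[k * ldb + j];
}

/* Outer loops: walk the row-major matrices tile by tile and call the kernel. */
static void tiled_matmul(const float *A, const float *B, float *C) {
  for (int i0 = 0; i0 < M; i0 += T)
    for (int j0 = 0; j0 < N; j0 += T)
      for (int k0 = 0; k0 < K; k0 += T)
        kernel_4x4x4(&A[i0 * K + k0], K,
                     &B[k0 * N + j0], N,
                     &C[i0 * N + j0], N);
}

/* Straightforward triple loop, used only as a reference result. */
static void naive_matmul(const float *A, const float *B, float *C) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < K; ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
}

/* Fills A and B with small integers (so float sums are exact), runs both
 * versions, and returns the largest absolute difference between them. */
static float max_abs_diff_tiled_vs_naive(void) {
  static float A[M * K], B[K * N], C_tiled[M * N], C_ref[M * N];
  for (int i = 0; i < M; ++i)
    for (int k = 0; k < K; ++k)
      A[i * K + k] = (float)((i + k) % 7);
  for (int k = 0; k < K; ++k)
    for (int j = 0; j < N; ++j)
      B[k * N + j] = (float)((k + 2 * j) % 5);
  for (int n = 0; n < M * N; ++n) {
    C_tiled[n] = 0.0f;
    C_ref[n] = 0.0f;
  }
  tiled_matmul(A, B, C_tiled);
  naive_matmul(A, B, C_ref);
  float max_diff = 0.0f;
  for (int n = 0; n < M * N; ++n) {
    float d = C_tiled[n] - C_ref[n];
    if (d < 0.0f) d = -d;
    if (d > max_diff) max_diff = d;
  }
  return max_diff;
}
```

Calling max_abs_diff_tiled_vs_naive() from a small main function checks the tiled loops against the reference triple loop. The sketch says nothing about vector registers, which is exactly the part I would like to see done properly.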
PS2: I noticed (in the last table) a performance loss when moving the kernel size from 64 to 512. Does this mean that the outer loops are not well optimized?