No fused multiply-add on x86_64 for vectorized code?


I’m trying to follow the very nice tutorial on Vector Dialect AArch64 Codegen for Matrix-Matrix Multiplication. However, I’m not generating code for AArch64, but for my x86_64 PC (under macOS).

When compiling the first example (on page 2), I noticed that the generated code contains no fused multiply-add instructions. I thought x86_64 had such instructions for vectorized code.

The compilation command I’m using is:

    mlir-opt --convert-vector-to-llvm='reassociate-fp-reductions=1' \
             --convert-std-to-llvm sample.mlir | \
        mlir-translate --mlir-to-llvmir | \
        llc -o sample.s

Is this normal?


PS: Would it be possible to also provide the code that calls the function @vector_outerproduct_matmul_2d_4x4x4xf32_kernel to perform a 32x32x32 or 512x512x32 matrix multiplication? I’d like to see the correct way of loading the vector registers.
PS2: I noticed (in the last table) the performance loss when moving the kernel size from 64 to 512. Does this mean that the outer loops are not well optimized?

You’ll need --fp-contract=on with llc; FMA instructions won’t be used by default. (You’ll find more information in the LLVM Language Reference Manual or in llc --help.)
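For reference, a sketch of the pipeline from the original post with that flag added (same sample.mlir and passes as above; whether fusion actually happens still depends on the IR and on the target supporting FMA):

```shell
# Same pipeline as above, with --fp-contract=on passed to llc.
# --fp-contract controls when the backend may fuse fmul+fadd pairs,
# but the selected target must also provide FMA instructions.
mlir-opt --convert-vector-to-llvm='reassociate-fp-reductions=1' \
         --convert-std-to-llvm sample.mlir | \
    mlir-translate --mlir-to-llvmir | \
    llc --fp-contract=on -o sample.s
```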

I added --fp-contract=on to the llc call; still no change. The beginning of the generated code looks like this:

	movq	16(%rsp), %rax
	movaps	(%rsi), %xmm4
	movaps	16(%rsi), %xmm5
	movaps	32(%rsi), %xmm6
	movaps	48(%rsi), %xmm11
	movaps	(%r8), %xmm7
	movaps	16(%r8), %xmm10
	movaps	32(%r8), %xmm9
	movaps	48(%r8), %xmm8

	movaps	%xmm4, %xmm0
	shufps	$0, %xmm4, %xmm0                ## xmm0 = xmm0[0,0],xmm4[0,0]
	mulps	%xmm7, %xmm0
	addps	(%rax), %xmm0

	movaps	%xmm5, %xmm1
	shufps	$0, %xmm5, %xmm1                ## xmm1 = xmm1[0,0],xmm5[0,0]
	mulps	%xmm7, %xmm1
	addps	16(%rax), %xmm1

There are 16 mulps/addps pairs in the whole code, I didn’t copy all of them here.

I’m on the 71b823dd68f67d9594d83f8b33c46f7a60d1b305 commit of llvm-project, from March 22.

PS: Is it correct to assume that the code generator uses SSE instead of AVX because the input vector size is 4x4? If the input vectors were 8xf32, would/should it have used the ymm registers?

It’s probably because of the target: you are running llc with its default options, so check which architecture and CPU it’s generating instructions for.


Thanks! With --march=x86-64 -mcpu=core-avx2 (even without --fp-contract=on) it issues fused multiply-add instructions. However, it still only uses XMM registers (SSE width). What can I do to move to YMM (AVX)?
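For completeness, the full invocation with a target CPU selected might look like the sketch below (flag spellings as in the thread; check llc --help for your build). Note that the register width follows the vector types in the IR: 4xf32 operations fit in a single XMM register, so llc has no reason to use YMM. Widening the kernel’s vector types (e.g. to vector<8xf32> in the MLIR source) is what would allow 256-bit YMM registers to be used; llc will not widen them for you.

```shell
# Pipeline with an AVX2-capable CPU selected, so the backend can emit
# VEX-encoded FMA instructions. YMM registers only appear if the IR
# itself operates on 8-wide (256-bit) float vectors.
mlir-opt --convert-vector-to-llvm='reassociate-fp-reductions=1' \
         --convert-std-to-llvm sample.mlir | \
    mlir-translate --mlir-to-llvmir | \
    llc --march=x86-64 -mcpu=core-avx2 --fp-contract=on -o sample.s
```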

I think you should really post this on the LLVM forum - at this point it has nothing to do with MLIR, only with LLVM code generation! :slight_smile: