Solved: How to make llc use AVX2 and FMA?

Hello,

Starting from an MLIR program, I am trying to generate AVX2/FMA vector code
that works on the 256-bit %ymm registers.

For a reason I don't understand, I can only get assembly that uses
128-bit SSE instructions on the %xmm registers, with no FMA.

Things seem to go astray during the call to llc. Its input contains
the following call to the llvm.fmuladd intrinsic:

%44 = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> %37, <8 x float> %40, <8 x float> %43), !dbg !35

However, the output instead contains the following code, which works on 128-bit registers:

movaps (%rsi,%rdx), %xmm0
movaps 16(%rsi,%rdx), %xmm1
mulps 16(%rax,%rdx), %xmm1
addps 16(%rcx,%rdx), %xmm1
mulps (%rax,%rdx), %xmm0
addps (%rcx,%rdx), %xmm0
movaps %xmm0, (%rcx,%rdx)
movaps %xmm1, 16(%rcx,%rdx)

The command line I use is:

~/llvm/bin/mlir-opt --lower-affine --convert-scf-to-std --convert-std-to-llvm --convert-vector-to-llvm try.mlir | ~/llvm/bin/mlir-translate --mlir-to-llvmir | ~/llvm/bin/llc -O1 --fp-contract=fast

If you want to reproduce the behavior, here is the MLIR source code:

func @test(%a:memref<128xvector<8xf32>>,
           %b:memref<128xvector<8xf32>>,
           %c:memref<128xvector<8xf32>>) {
  affine.for %idx = 0 to 128 {
    %av = memref.load %a[%idx] : memref<128xvector<8xf32>>
    %bv = memref.load %b[%idx] : memref<128xvector<8xf32>>
    %cv = memref.load %c[%idx] : memref<128xvector<8xf32>>
    %xv  = vector.fma %av, %bv, %cv : vector<8xf32>
    memref.store %xv, %c[%idx] : memref<128xvector<8xf32>>
  }
  return
}
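For comparison, the kernel above computes c[i] = a[i] * b[i] + c[i] element-wise over 128 vectors of 8 floats. Below is a minimal C++ sketch of the same computation, assuming each memref<128xvector<8xf32>> is laid out as a flat array of 128 * 8 floats; the name test_kernel is hypothetical. Compiling something like this with clang++ -O2 -mavx2 -mfma can let the compiler vectorize the loop onto %ymm registers, for the same reason llc needs -mattr: the target features must be known to the backend.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical scalar equivalent of the MLIR @test function.
// Each memref<128xvector<8xf32>> is modeled as 128 * 8 contiguous floats (assumption).
constexpr std::size_t kRows = 128;
constexpr std::size_t kLanes = 8;

void test_kernel(const float* a, const float* b, float* c) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t idx = 0; idx < kRows; ++idx) {          // affine.for %idx = 0 to 128
        for (std::size_t lane = 0; lane < kLanes; ++lane) {  // the 8 lanes of vector<8xf32>
            std::size_t i = idx * kLanes + lane;
            // vector.fma: a * b + c, computed with a single rounding.
            c[i] = std::fma(a[i], b[i], c[i]);
        }
    }
}
```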

Best,
Dumitru

PS: Found the solution: add -mattr=avx2 -mattr=fma to the llc command line.

This is likely a problem with the target/subtarget options in the generated LLVM IR: the backend does not know that AVX2 is legal here.
You should be able to override this with llc by adding: -mattr=+avx2


To see how clang passes this information to LLVM, compare the output of:
echo "int foo() { return 0;}" | clang -x c++ - -o - -S -emit-llvm -O2
and:
echo "int foo() { return 0;}" | clang -x c++ - -o - -S -emit-llvm -O2 -march=native

On my machine, the function attributes go from:

"target-cpu"="x86-64"
"target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87"

to:

"target-cpu"="skylake-avx512"
"target-features"="+64bit,+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cmov,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+popcnt,+prfchw,+rdrnd,+rdseed,+rtm,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves,-amx-bf16,-amx-int8,-amx-tile,-avx512bf16,-avx512bitalg,-avx512er,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,-avx512vnni,-avx512vp2intersect,-avx512vpopcntdq,-cldemote,-clzero,-enqcmd,-fma4,-gfni,-lwp,-movdir64b,-movdiri,-mwaitx,-pconfig,-pku,-prefetchwt1,-ptwrite,-rdpid,-serialize,-sgx,-sha,-shstk,-sse4a,-tbm,-tsxldtrk,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop"
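The "target-features" value is a comma-separated list of +name / -name entries, the same syntax llc accepts in -mattr (for example -mattr=+avx2). As a rough way to check whether a module already tells the backend it may use 256-bit FMA, here is a small C++ sketch; parse_features, has_feature and can_use_avx2_fma are hypothetical helpers, not LLVM APIs. Treating a bare name as enabled is an assumption, mirroring the PS above where -mattr=avx2 without a sign worked.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parse a "target-features" string such as "+avx,+avx2,-fma4"
// into feature name -> enabled. If a name repeats, the last occurrence wins.
std::map<std::string, bool> parse_features(const std::string& features) {
    std::map<std::string, bool> result;
    std::stringstream ss(features);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty()) continue;
        if (item[0] == '+' || item[0] == '-') {
            result[item.substr(1)] = (item[0] == '+');
        } else {
            result[item] = true;  // assumption: a bare name means "enabled"
        }
    }
    return result;
}

// True only if the feature is present and enabled. Note that "fma" and "fma4"
// are distinct keys, so "-fma4" does not disable "fma".
bool has_feature(const std::map<std::string, bool>& f, const std::string& name) {
    assert(!name.empty());
    auto it = f.find(name);
    return it != f.end() && it->second;
}

// Does the feature string allow 256-bit AVX2 code with FMA?
bool can_use_avx2_fma(const std::string& features) {
    auto f = parse_features(features);
    return has_feature(f, "avx2") && has_feature(f, "fma");
}
```

With the default x86-64 string shown above, can_use_avx2_fma returns false, which is consistent with llc falling back to 128-bit SSE code.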