I just followed this tutorial, llvm-project/HighPerfCodeGen.md at hop · bondhugula/llvm-project · GitHub by @bondhugula,
and used mlir-cpu-runner to run the LLVM dialect on my AVX2 machine. It works fine and I got about 50 GFLOPS.
However, when I run the exact same LLVM dialect (with the same optimizations) on LLVM version 13.0.0git (from `llvm-config --version`) and dump the result with llvm-objdump, I find that the code is not using AVX2's FMA instructions (VFMADD213, VFMADD231) as in the tutorial, and performance is very poor (almost 10x worse). So I suspect it's not only a vectorization problem.
I'm wondering if this is because of the MLIR version, or whether I forgot to set some flags?
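For reference, this is roughly how I check whether FMA instructions are being emitted outside the JIT. This is just a sketch; `input.mlir` is a placeholder for my lowered LLVM-dialect file, and I'm assuming the target-feature flags are what matter here:

```shell
# Translate the LLVM dialect to LLVM IR
mlir-translate --mlir-to-llvmir input.mlir -o input.ll

# Compile for the host CPU; -mcpu=native should enable AVX2/FMA on my machine
llc -O3 -mcpu=native input.ll -o input.s

# Alternatively, force the features explicitly
llc -O3 -mattr=+avx2,+fma input.ll -o input_fma.s

# Check whether any vfmadd* instructions were generated
grep -c 'vfmadd' input.s
```

If `grep` finds no `vfmadd` here either, the problem would be in the IR or the lowering rather than in the JIT's target configuration.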