mlir-cpu-runner auto-vectorization

Hi all,

I just followed this tutorial (llvm-project/HighPerfCodeGen.md at hop · bondhugula/llvm-project · GitHub) by @bondhugula
and used mlir-cpu-runner to run the LLVM dialect on my AVX2 machine. It works fine and I got about 50 GFLOPS.
However, when I try to run the exact same LLVM dialect (with the same optimizations) on LLVM version 13.0.0git (reported by llvm-config --version) and dump the result with llvm-objdump, I found that the generated code was not using AVX2’s FMA instructions (VFMADD213, VFMADD231) as in the tutorial, and performance was very poor (almost 10 times worse). I guess it’s not only a vectorization problem.

I’m wondering if it’s because of the MLIR version, or whether I just forgot to set some flags?

Thanks

It is hard to tell what the reason is right now. Can you elaborate on the steps you used to “run the exact same llvm dialect (with same optimizations) on llvm”?


This is one that I wrote. I just applied tiling and vectorization,
after that I used:

  • mlir-opt -convert-vector-to-llvm -lower-affine -convert-scf-to-std -convert-std-to-llvm

to lower to the LLVM dialect, and used mlir-cpu-runner to run this program.
I got about 3.7 GFLOPS.


This is the example from the tutorial, also with tiling and vectorization, lowered with the same mlir-opt pipeline as the previous one; however, this got 7.5 GFLOPS, about 2x faster.

Update:
I just used mlirx (GitHub - polymage-labs/mlirx, an LLVM fork with a strict superset of its upstream MLIR component) to build LLVM, and ran the tiled-only version using mlir-cpu-runner from the above version and the mlirx version respectively; both report 13.0.0git (from llvm-config --version).
A strange thing happened: I got 1.77 GFLOPS on the former and 9.2 GFLOPS on the latter (mlirx), and I just can’t figure out why.

Note: I applied mlir-opt -lower-affine -convert-scf-to-std -convert-std-to-llvm to the code below and fed the same lowered LLVM dialect into the two versions of mlir-cpu-runner to get these results.

You are using different versions of MLIR on different inputs, so why would you expect identical performance characteristics? Start by reducing the differences in the setup and see which changes impact performance more than others…

Two random things I see are alignment (the allocation in your code has unspecified alignment, which may result in poor vectorization) and the lack of a CPU spec (target instructions are produced by LLVM, not MLIR, and it may need to receive an equivalent of -mcpu=skylake or -march=native to produce the specific instruction set). You may also want to pass the enable-x86vector option, or whatever it was called before, to the vector-to-llvm lowering.
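
For the alignment part, a minimal sketch of what an explicitly aligned allocation looks like in the memref dialect (the shape and the alignment value here are just placeholders):

    // Request a 32-byte-aligned buffer at allocation time; without the
    // alignment attribute the allocation's alignment is unspecified.
    %A = memref.alloc() {alignment = 32 : i64} : memref<2048x2048xf32>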

The version reported by llvm-config only changes at releases; thousands of commits will be covered by the same version string.

Note: the suffix “git” means it was built from a development branch. Without the exact git hash it does not say much (two builds can be months apart and still show the same “13.0.0git”).


Ok, thanks, that’s really helpful. I’ll try to reduce the differences and figure out why. So is MLIR able to find fmul and fadd patterns that can be fused into FMA? Or is this done by LLVM in practice?

Btw, after vectorization the affine loads and stores no longer exist. Does that mean vectorization should be performed in a stage after the affine optimizations? Or is there a way to transform back to affine?

I see… Thanks, Mehdi!

MLIR does not fuse fmul and fadd itself. However, it has an fma operation ('std' Dialect - MLIR) that will get lowered appropriately. IIRC, some Linalg/Vector conversions produce it.
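
As an illustration, the vector dialect has an analogous vector.fma op; a rough sketch (operand names are made up):

    // Elementwise fused multiply-add: %d = %a * %b + %c.
    %d = vector.fma %a, %b, %c : vector<8xf32>

Lowered through the LLVM dialect, this is what typically becomes the VFMADD* instructions on an FMA-capable target.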

There are multiple ways of performing vectorization in MLIR. You are probably referring to the early vectorization of affine loops (aka super-vectorization). It indeed replaces the affine.load/store with vector.transfer_read/write. Affine memory operations are not sufficiently expressive for the kind of memory access that is needed after vectorization. Converting those reads/writes back to affine will likely introduce additional, potentially non-affine, loops and revert some of the vectorization decisions. You may try teaching the affine analysis about those operations though.
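
Roughly, super-vectorization rewrites accesses like the first form below into the second (names, shapes, and the padding constant are illustrative):

    // Before: a scalar affine access.
    %x = affine.load %A[%i, %j] : memref<2048x2048xf32>

    // After: a 1-D vector read along the vectorized dimension, with a
    // padding value used for out-of-bounds lanes.
    %cst = constant 0.0 : f32
    %v = vector.transfer_read %A[%i, %j], %cst : memref<2048x2048xf32>, vector<8xf32>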


Did you mean teaching the affine analysis about vectorization? Like writing another pass that is in charge of vectorizing the loop?

No, I mean teaching affine analysis about the meaning of vector.transfer_read/write. These operations read/write memory similarly to affine load/store. This will let you use affine analyses after vectorization if you want.

Oh ok, it seems I have to handle the transfer_read/write pattern when walking through operations so that the affine transformations recognize them? And also add the rewriting rules.

MLIRX uses a memref.vector_cast operation: instead of using vector.load/store or any vector transfer_read/write, it turns the memref into a memref of vector element type via memref_vector_cast (with checks to ensure that that’s valid), and then the affine.load/stores just happen on vector memrefs - you get completely aligned load/stores (but aligned vs. unaligned isn’t really the difference here). This is really the ideal code on this part (w.r.t. load/stores) AFAIK.
OTOH, memref.vector_cast doesn’t exist upstream, nor does the affine-vectorize pass of MLIRX. You may get nearly the same performance when using affine.vector_load/store on AVX2, but those load/stores won’t be aligned.
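
For reference, the upstream affine.vector_load/store forms look roughly like this (shapes and names are placeholders):

    // Read/write a vector of 8 floats starting at (%i, %j); without an
    // aligned buffer these become unaligned vector loads/stores.
    %v = affine.vector_load %A[%i, %j] : memref<2048x2048xf32>, vector<8xf32>
    affine.vector_store %v, %B[%i, %j] : memref<2048x2048xf32>, vector<8xf32>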

Hi Uday, thanks for answering. I’m just wondering whether the affine dialect will use isl for the analysis behind some transformations in the future, or keep using heuristics?

There is some support for Presburger sets already - if you have specific transforms and analyses in mind, it may be good to see what exactly is needed.


Ok, maybe I don’t actually need very complicated analysis right now.
Btw, I have a two-dimensional array that I’d like to convert to an MLIR memref type. I’ve gone through several ways such as memref.globalOp, memref.allocOp, etc. However, memref.globalOp is able to create a buffer and initialize it but doesn’t have an alignment attribute; in contrast, memref.allocOp is able to create a buffer with a specified alignment but can’t use a DenseFPElementsAttr to initialize it.
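
To make the trade-off concrete, a rough sketch of the two options (names, values, and shapes are made up):

    // Option 1: a global with an initializer, but no way to request alignment.
    memref.global "private" constant @weights : memref<2x3xf32> = dense<[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]>

    // Option 2: an aligned allocation, but no initializer; the data has to
    // be filled in with stores or a copy afterwards.
    %buf = memref.alloc() {alignment = 32 : i64} : memref<2x3xf32>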

Is there any way to convert the external array to a buffer with memref type and alignment in MLIR? Or do I have to use memref.allocOp and fill it in manually?

Thanks~