Can't get the dot-product x86 CPU instruction in assembler code

Hello all!
I am facing the following problem.
I need to sum all elements of an f32 vector (vector<8xf32>) into an f32 scalar value.
I used vector::ContractionOp for this, i.e. I compute the dot product of the given vector with a vector of ones:

// Type of the input vector: vector<8xf32>.
auto vecMemRefEltType = VectorType::get({vectorSize}, rewriter.getF32Type());
// Splat vector of ones to take the dot product with.
auto one = std_constant_float(llvm::APFloat(1.0f), rewriter.getF32Type());
auto oneVec = rewriter.create<mlir::vector::BroadcastOp>(loc, vecMemRefEltType, one);

// Single "reduction" iterator.
llvm::StringRef contrIter("reduction");
auto zzz = rewriter.getStrArrayAttr(ArrayRef<StringRef>{contrIter});

// Indexing maps: (d0) -> (d0) for both vector operands, (d0) -> () for the scalar accumulator.
auto myMap = AffineMap::get(1, 0, {getAffineDimExpr(0, ctx)}, ctx);
auto myMap2 = AffineMap::get(1, 0, {}, ctx);
auto xxx = rewriter.getAffineMapArrayAttr(ArrayRef<AffineMap>{myMap, myMap, myMap2});

// opFMAO is the vector<8xf32> to reduce; zero is the f32 accumulator.
auto scalarRes = rewriter.create<vector::ContractionOp>(loc, opFMAO, oneVec, zero, xxx, zzz);

Before running the ConvertVectorToLLVM pass, the IR looked like this:

 #map0 = affine_map<(d0) -> (d0)>
 #map1 = affine_map<(d0) -> ()>
 %59 = vector.contract {indexing_maps = [#map0, #map0, #map1], iterator_types = ["reduction"], kind = #vector.kind<add>} %58, %cst_1, %cst : vector<8xf32>, vector<8xf32> into f32

Looking at the resulting assembly, I can see the following sequence of instructions that my vector.contract has been lowered to:

de2d: c5 ea 58 1d ab 29 ff ff   vaddss -54869(%rip), %xmm2, %xmm3  # 7e0 <strdup+0x7e0>
de35: c5 fa 16 e2               vmovshdup %xmm2, %xmm4
de39: c5 e2 58 dc               vaddss %xmm4, %xmm3, %xmm3
de3d: c4 e3 79 05 e2 01         vpermilpd $1, %xmm2, %xmm4
de43: c5 e2 58 dc               vaddss %xmm4, %xmm3, %xmm3
de47: c4 e3 79 04 e2 ff         vpermilps $255, %xmm2, %xmm4
de4d: c5 e2 58 dc               vaddss %xmm4, %xmm3, %xmm3
de51: c4 e3 7d 19 d2 01         vextractf128 $1, %ymm2, %xmm2
de57: c5 e2 58 da               vaddss %xmm2, %xmm3, %xmm3
de5b: c5 fa 16 e2               vmovshdup %xmm2, %xmm4
de5f: c5 e2 58 dc               vaddss %xmm4, %xmm3, %xmm3
de63: c4 e3 79 05 e2 01         vpermilpd $1, %xmm2, %xmm4
de69: c5 e2 58 dc               vaddss %xmm4, %xmm3, %xmm3
de6d: c4 e3 79 04 d2 ff         vpermilps $255, %xmm2, %xmm2
de73: c5 e2 58 d2               vaddss %xmm2, %xmm3, %xmm2

In other words, the task has been solved by a plain sequence of permutations and additions of vector elements, while the x86 instruction set contains a dot-product instruction (the _mm_dp_ps intrinsic; see the Intel® Intrinsics Guide), and I need it to be used, because it should be faster than this ugly set of permutation and addition ops.

The CPU features string obtained from my PC looks like this:

"+sse2,-tsxldtrk,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,-amx-tile,-uintr,+popcnt,-widekl,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,-avxvnni,-avx512vnni,-amx-bf16,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,+xsavec,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,+rdseed,-waitpkg,-kl,-movdir64b,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,-serialize,-hreset,+invpcid,-avx512cd,+avx,-vaes,-avx512bf16,+cx8,+fma,+rtm,+bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+fxsr,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,-amx-int8,+movbe,-avx512vp2intersect,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"

so it contains the "+sse4.1" flag, and it seems I can expect the _mm_dp_ps intrinsic to be usable in my generated assembly…
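
For illustration, this is roughly what I expect the compiler to use for the 4-wide case (a minimal C++ intrinsics sketch I wrote for this post, not generated code; the helper name hsum4 is mine):

#include <immintrin.h>

// Horizontal sum of four f32 lanes via a single SSE4.1 dpps:
// the 0xF1 mask multiplies all four lane pairs, sums the products,
// and stores the result in lane 0.
static float hsum4(__m128 v) {
  const __m128 ones = _mm_set1_ps(1.0f);
  return _mm_cvtss_f32(_mm_dp_ps(v, ones, 0xF1));
}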
P.S. When using vector::ReductionOp

auto scalarRes = rewriter.create<vector::ReductionOp>(loc, rewriter.getF32Type(), rewriter.getStringAttr("add"), opFMAO, ValueRange{});

the same assembly was produced…

Why is it not possible to get an x86 dot-product instruction in the output assembly?

P.S. 2: With AArch64 as the target platform the effect was the same, i.e. a set of permutation/addition instructions was emitted instead of the natural native dot-product instruction.

You don’t give the flags you use for lowering, but I suspect you don’t tell the vector dialect lowering that FP reassociation is allowed in reductions. For a vector.contract and similar operations, a plain --convert-vector-to-llvm preserves the exact order implied by the operation:

 ... 
 vaddss  %xmm1, %xmm0, %xmm2   ; scalar
 ...

but --convert-vector-to-llvm='reassociate-fp-reductions=1' allows reassociation to get more SIMD:

  ... 
  vaddps  %xmm1, %xmm0, %xmm0   ; SIMD 
  ...

You can read more about this in the AVX512 case studies.

Thank you for the recommendation.
I have just tried it and, unfortunately, it doesn’t help…
_mm_dp_ps doesn’t appear :frowning:
To be sure, I tried both setEnableAVX512(0) and setEnableAVX512(1).

mlir::LowerVectorToLLVMOptions ZZZ;
ZZZ.setReassociateFPReductions(1);
ZZZ.setEnableIndexOptimizations(1);
ZZZ.setEnableArmNeon(0);
ZZZ.setEnableArmSVE(0);
ZZZ.setEnableAVX512(1);
pm.addPass(mlir::createConvertVectorToLLVMPass(ZZZ));

It did reduce the size of the corresponding assembly a little, though:

de27: c4 e3 7d 19 d3 01         vextractf128 $1, %ymm2, %xmm3
de2d: c5 e8 58 d3               vaddps %xmm3, %xmm2, %xmm2
de31: c4 e3 79 05 da 01         vpermilpd $1, %xmm2, %xmm3
de37: c5 e8 58 d3               vaddps %xmm3, %xmm2, %xmm2
de3b: c5 fa 16 da               vmovshdup %xmm2, %xmm3
de3f: c5 ea 58 d3               vaddss %xmm3, %xmm2, %xmm2
de43: c5 ea 58 15 95 29 ff ff   vaddss -54891(%rip), %xmm2, %xmm2  # 7e0 <strdup+0x7e0>

Well, “doesn’t help” is a misrepresentation, since the resulting assembly exploits more SIMD than the original code you had. Did you compare the performance of the code you want and/or expect with the code generated by LLVM? MLIR tries to express SIMD using only generic intrinsics where possible, since the backend is pretty good at selecting the right instruction sequences for the right construct. If you see a huge performance gain left on the table, please file a bug on this and we will look into it!

Also, your original code (multiplying with ones followed by a contraction) just to sum-reduce a vector seems rather elaborate. Why not just use a plain vector.reduction? In general, I think you should really rely on the compiler to do the right thing rather than trying to force codegen into a very specific direction.

Given

func @foo(%arg0: vector<8xf32>, %arg1: f32) -> (f32) {
  %0 = vector.reduction "add", %arg0, %arg1 : vector<8xf32> into f32
  return %0 : f32
}

this looks rather decent, right?

vextractf128    xmm2, ymm0, 1
vaddps  xmm0, xmm0, xmm2
vpermilpd       xmm2, xmm0, 1  
vaddps  xmm0, xmm0, xmm2
vmovshdup       xmm2, xmm0   
vaddss  xmm0, xmm0, xmm2
vaddss  xmm0, xmm1, xmm0

I’m sorry, but that doesn’t look good to me.
The assembly should contain a couple of _mm_dp_ps instructions (I don’t know what prevented a dp_ps form that processes an 8-element ymm vector at once), one f32 add, and nothing more.
Every CPU clock matters when you are developing a DSP application.
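
Concretely, for the 8-element case I have in mind something like this (again just a C++ intrinsics sketch of my own, assuming the two 128-bit halves are reduced separately; hsum8 is a name I made up):

#include <immintrin.h>

// Sum all eight f32 lanes with two dpps instructions plus one scalar add.
static float hsum8(__m256 v) {
  const __m128 ones = _mm_set1_ps(1.0f);       // could be hoisted out of any loop
  __m128 lo  = _mm256_castps256_ps128(v);      // lower four lanes
  __m128 hi  = _mm256_extractf128_ps(v, 1);    // upper four lanes
  __m128 slo = _mm_dp_ps(lo, ones, 0xF1);      // sum of lower half in lane 0
  __m128 shi = _mm_dp_ps(hi, ones, 0xF1);      // sum of upper half in lane 0
  return _mm_cvtss_f32(_mm_add_ss(slo, shi));
}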
Please take a look at the MLIR documentation for the 'vector' dialect. You can find the following example there:

// Simple DOT product (K = 0).
#contraction_accesses = [
 affine_map<(i) -> (i)>,
 affine_map<(i) -> (i)>,
 affine_map<(i) -> ()>
]
#contraction_trait = {
  indexing_maps = #contraction_accesses,
  iterator_types = ["reduction"]
}
%3 = vector.contract #contraction_trait %0, %1, %2
  : vector<10xf32>, vector<10xf32> into f32

The MLIR code I showed above is the same, but where is the promised dot product?))
There is a dot-product instruction in the x86 instruction set, so why can't it be used?

Yes, thank you for that. I wrote a big chunk of that :wink:

Can you please read my response carefully? Just because you expect a particular instruction doesn’t mean that instruction should really be there: either MLIR or LLVM may conclude there are better instruction sequences (but if you can actually time the difference and show results in cycles and/or seconds that demonstrate we miss opportunities, we would indeed like to hear about that). But you still have not answered why you would like to sum-reduce the elements of a vector through a dot product.

If I use the vector.fma operation, in the end I get the corresponding fma instruction in the assembly, a single instruction rather than a multiply/add pair, and that is a very good example.)
I have already said that my task is to “sum all elements of an f32 vector (vector<8xf32>) into an f32 scalar value”. It is part of the algorithm I am implementing. One way to do that fast is a dot product. I would be very thankful if you could point out another cheap way to implement it.

The dot-product instruction seems the most suitable and natural choice for this case, but I don't know all the details and may be wrong about that…

Ah, I think you are under the impression that one instruction always performs better than several. Well, as they say in Dutch, “meten is weten” (measuring is knowing), so I wrote a small benchmark with the following three implementations of a sum-reduce (summing 4xf32 to be in line with the exact instruction you are interested in).

The first two, foo() and bar(), are a direct reduce and a contraction:

  func @foo(%arg0: vector<4xf32>, %arg1: f32) -> (f32) {
    %0 = vector.reduction "add", %arg0, %arg1 : vector<4xf32> into f32
    return %0 : f32
  }

  func @bar(%arg0: vector<4xf32>, %arg1: f32) -> (f32) {
    %c = std.constant dense<[1.0, 1.0, 1.0, 1.0]> : vector<4xf32>
    %0 = vector.contract #contraction_trait %arg0, %c, %arg1
     : vector<4xf32>, vector<4xf32> into f32
    return %0 : f32
  }

The third one is a hack in my sandbox, where the “aart” dialect supports the dpps instruction (don’t try this at home, it won’t work).

  func @baz(%arg0: vector<4xf32>, %arg1: f32) -> (f32) {
    %i = constant 0 : i32
    %c = std.constant dense<[1.0, 1.0, 1.0, 1.0]> : vector<4xf32>
    %0 = aart.dot %arg0, %c : vector<4xf32>, vector<4xf32>, vector<4xf32>
    %1 = vector.extractelement %0[%i : i32]: vector<4xf32>
    %2 = addf %arg1, %1 : f32
    return %2 : f32
  }

Compiling this gives the following assembly for foo():

            vpermilpd       xmm2, xmm0, 1  
            vaddps  xmm0, xmm0, xmm2
            vmovshdup       xmm2, xmm0
            vaddss  xmm0, xmm0, xmm2
            vaddss  xmm0, xmm1, xmm0

For bar(), the result is very similar code: it is smart enough to see that 1 * a = a, but it does not optimize 0 + a = a (this is due to a recent change I actually was not in favor of; I made a note that we need to work on this).

        vpermilpd       xmm2, xmm0, 1 
        vaddps  xmm0, xmm0, xmm2
        vmovshdup       xmm2, xmm0  
        vaddss  xmm0, xmm0, xmm2
        vxorps  xmm2, xmm2, xmm2
        vaddss  xmm0, xmm0, xmm2
        vaddss  xmm0, xmm0, xmm1

Then finally, baz() gives your favorite instruction:

        vbroadcastss    xmm2, dword ptr [rip + .LCPI2_0] # xmm2 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]
        vdpps   xmm0, xmm0, xmm2, 255
        vaddss  xmm0, xmm0, xmm1

Running this many times in a loop (but in such a manner that it is not optimized away) gives the following timings in “clocks” (lower is better):

foo():  912 clocks
bar():  939 clocks
baz():  925 clocks

So, although all very close, the code initially proposed (vector.reduction) works best (and probably even better for the original 8xf32 case).


Hello Mr. Aart,
Have a nice day!
In my case the broadcast instruction and the load of the constant vector of ones for baz() can be hoisted out of the loop; there is no need to include them in the loop/function body.
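
Schematically, what I mean is something like the following (a C++ sketch of the loop structure only, my own illustration rather than anything generated; reduce_all is a made-up name):

#include <immintrin.h>

// The vector of ones is materialized once, outside the hot loop, so each
// iteration only pays for one dpps and the scalar accumulation.
float reduce_all(const __m128 *vecs, int n) {
  const __m128 ones = _mm_set1_ps(1.0f);  // hoisted out of the loop
  float acc = 0.0f;
  for (int i = 0; i < n; ++i)
    acc += _mm_cvtss_f32(_mm_dp_ps(vecs[i], ones, 0xF1));
  return acc;
}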
And the fact that an instruction as important for DSP as the dot product is only usable as a hack doesn't look splendid :))

I agree. My experiment above actually convinced me we should make this instruction more easily accessible to the vector dialect, since it performs remarkably well.

Okay, thank you. When will a new MLIR release containing this fix be available? Should I report this case as a bug?

We are in the process of renaming the AVX512 dialect to X86Vector. After that, I can make the vdpps intrinsic more easily available directly. Then we need to unify support across common platforms. Sounds good?

Thank you very much! :pray:
Yes, it sounds good; let it be implemented well, too.
I hope my assembly code will be at least ~25% faster with this fix.