Hello all!

I ran into the following problem.

I need to sum all elements of an f32 vector (`vector<8xf32>`) into a scalar f32 value.

I used `vector::ContractionOp` for this, i.e. I computed the dot product of the given vector with a vector of ones:

```
// opFMAO is the vector<8xf32> value to reduce and zero is the f32 accumulator
// (both defined earlier); vectorSize == 8, ctx is the MLIRContext.
auto vecType = VectorType::get({vectorSize}, rewriter.getF32Type());
auto one = std_constant_float(llvm::APFloat(1.0f), rewriter.getF32Type());
auto oneVec = rewriter.create<mlir::vector::BroadcastOp>(loc, vecType, one);
// A single iterator over d0, of kind "reduction".
auto iterTypes = rewriter.getStrArrayAttr({"reduction"});
// lhs/rhs are indexed by (d0) -> (d0); the scalar accumulator by (d0) -> ().
auto idMap = AffineMap::get(1, 0, {getAffineDimExpr(0, ctx)}, ctx);
auto scalarMap = AffineMap::get(1, 0, {}, ctx);
auto indexingMaps = rewriter.getAffineMapArrayAttr({idMap, idMap, scalarMap});
auto scalarRes = rewriter.create<vector::ContractionOp>(loc, opFMAO, oneVec, zero,
                                                        indexingMaps, iterTypes);
```
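For reference, the computation I am after is just a sum, i.e. a dot product with a vector of ones; in plain scalar C++ (function name mine):

```cpp
#include <cstddef>

// Scalar reference: summing all elements == dot product with a ones vector.
float sum8_ref(const float *v) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < 8; ++i)
    acc += v[i] * 1.0f; // the "* 1.0f" is the broadcast ones operand
  return acc;
}
```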

Before running the ConvertVectorToLLVM pass, the IR looked like this:

```
#map0 = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> ()>
%59 = vector.contract {indexing_maps = [#map0, #map0, #map1], iterator_types = ["reduction"], kind = #vector.kind<add>} %58, %cst_1, %cst : vector<8xf32>, vector<8xf32> into f32
```

Looking at the resulting assembly, I can see that my reduction has been lowered to the following sequence of instructions:

```
de2d: c5 ea 58 1d ab 29 ff ff   vaddss  -54869(%rip), %xmm2, %xmm3  # 7e0 <strdup+0x7e0>
de35: c5 fa 16 e2               vmovshdup %xmm2, %xmm4
de39: c5 e2 58 dc               vaddss  %xmm4, %xmm3, %xmm3
de3d: c4 e3 79 05 e2 01         vpermilpd $1, %xmm2, %xmm4
de43: c5 e2 58 dc               vaddss  %xmm4, %xmm3, %xmm3
de47: c4 e3 79 04 e2 ff         vpermilps $255, %xmm2, %xmm4
de4d: c5 e2 58 dc               vaddss  %xmm4, %xmm3, %xmm3
de51: c4 e3 7d 19 d2 01         vextractf128 $1, %ymm2, %xmm2
de57: c5 e2 58 da               vaddss  %xmm2, %xmm3, %xmm3
de5b: c5 fa 16 e2               vmovshdup %xmm2, %xmm4
de5f: c5 e2 58 dc               vaddss  %xmm4, %xmm3, %xmm3
de63: c4 e3 79 05 e2 01         vpermilpd $1, %xmm2, %xmm4
de69: c5 e2 58 dc               vaddss  %xmm4, %xmm3, %xmm3
de6d: c4 e3 79 04 d2 ff         vpermilps $255, %xmm2, %xmm2
de73: c5 e2 58 d2               vaddss  %xmm2, %xmm3, %xmm2
```
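For clarity, here is my reading of that instruction sequence expressed with AVX intrinsics (the function name is mine, and I am assuming the memory operand of the first `vaddss` is the f32 accumulator constant):

```cpp
#include <immintrin.h>

// Equivalent of the disassembly above: shuffle each lane down to
// element 0 and chain scalar vaddss additions.
__attribute__((target("avx")))
float hsum8_like_asm(float init, const float *p) {
  __m256 v = _mm256_loadu_ps(p);
  __m128 x = _mm256_castps256_ps128(v);          // low lanes v0..v3 (%xmm2)
  __m128 a = _mm_add_ss(_mm_set_ss(init), x);    // vaddss mem,%xmm2: init + v0
  a = _mm_add_ss(a, _mm_movehdup_ps(x));         // vmovshdup:      + v1
  a = _mm_add_ss(a, _mm_castpd_ps(
          _mm_permute_pd(_mm_castps_pd(x), 1))); // vpermilpd $1:   + v2
  a = _mm_add_ss(a, _mm_permute_ps(x, 0xFF));    // vpermilps $255: + v3
  x = _mm256_extractf128_ps(v, 1);               // vextractf128: high lanes v4..v7
  a = _mm_add_ss(a, x);                          //                 + v4
  a = _mm_add_ss(a, _mm_movehdup_ps(x));         //                 + v5
  a = _mm_add_ss(a, _mm_castpd_ps(
          _mm_permute_pd(_mm_castps_pd(x), 1))); //                 + v6
  a = _mm_add_ss(a, _mm_permute_ps(x, 0xFF));    //                 + v7
  return _mm_cvtss_f32(a);
}
```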

That is, the reduction was implemented as a plain sequence of permute/add instructions, even though the x86 instruction set provides a dot-product instruction, `dpps`, exposed as the `_mm_dp_ps` intrinsic (see the Intel® Intrinsics Guide). I would like that instruction to be used, since I expect it to be faster than this ugly set of permutation and summation ops.

The CPU features string obtained from my machine looks like this:

```
"+sse2,-tsxldtrk,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,-amx-tile,-uintr,+popcnt,-widekl,+aes,-avx512bitalg,-movdiri,+xsaves,-avx512er,-avxvnni,-avx512vnni,-amx-bf16,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,+xsavec,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,+rdseed,-waitpkg,-kl,-movdir64b,-sse4a,-avx512bw,+clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,-serialize,-hreset,+invpcid,-avx512cd,+avx,-vaes,-avx512bf16,+cx8,+fma,+rtm,+bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+fxsr,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,-amx-int8,+movbe,-avx512vp2intersect,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"
```

It contains the `+sse4.1` tag, so it seems I can expect `_mm_dp_ps` (the SSE4.1 `dpps` instruction) to be usable in the generated assembly…
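For comparison, the `dpps`-based code I was hoping for would look roughly like this (my sketch, assuming SSE4.1; since `_mm_dp_ps` only operates on 128 bits, an 8-element sum needs two of them plus one final scalar add):

```cpp
#include <immintrin.h>

// Hoped-for shape: sum 8 floats via two dpps against a vector of ones.
__attribute__((target("sse4.1")))
float sum8_dpps(const float *v) {
  __m128 ones = _mm_set1_ps(1.0f);
  // imm8 = 0xF1: multiply all four lanes, write the sum into lane 0.
  __m128 lo = _mm_dp_ps(_mm_loadu_ps(v),     ones, 0xF1);
  __m128 hi = _mm_dp_ps(_mm_loadu_ps(v + 4), ones, 0xF1);
  return _mm_cvtss_f32(_mm_add_ss(lo, hi));
}
```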

P.S. When using `vector::ReductionOp` instead:

```
auto scalarRes = rewriter.create<vector::ReductionOp>(loc, rewriter.getF32Type(), rewriter.getStringAttr("add"), opFMAO, ValueRange{});
```

the same assembly was produced…

Why is it not possible to get the x86 dot-product instruction in the output assembly?

P.S.2: With AArch64 as the target platform the effect was the same: a sequence of permutation/summation instructions was emitted instead of a single native dot-product/reduction instruction.