Vector Dialect Ops for Intel Intrinsics - Optimizing linalg.copy

I am working on optimizing the performance of linalg.copy (for n-D transposition), based on the HPTT library [1]. HPTT decomposes a multidimensional tensor into two-dimensional b×b macro-tiles and w×w micro-tiles (where w is the vector width and b = 4w). The micro-tiles are handled by a microkernel that performs in-register transposition with explicit vectorization. I am attempting to rewrite this microkernel in MLIR (using vector ops, after lowering from the linalg to the vector dialect) for single-precision elements and AVX-512. The microkernel mainly involves the following Intel vector intrinsics:

  1. _mm256_loadu_ps (float const * mem_addr) [2]
  2. _mm256_unpacklo_ps (__m256 a, __m256 b) [3]
  3. _mm256_unpackhi_ps (__m256 a, __m256 b) [4]
  4. _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8) [5]
  5. _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8) [6]
  6. _mm256_storeu_ps (float * mem_addr, __m256 a) [7]
    (Please refer to the microkernel implementation here: [8], lines 139-225)

I have been able to use MLIR's vector.extract_slices and vector.insert_slices ops for intrinsics 1 and 6, respectively. I am wondering whether the existing MLIR ops could be used to implement the remaining intrinsics 2-5: would vector.shuffle (or other vector ops) be sufficient, or would I have to write new vector dialect ops for this?
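Concretely, here is my reading (untested) of how intrinsics 2-5 would map onto static vector.shuffle masks. The mask indices select from the concatenation of the two operands (%a contributes elements 0-7, %b elements 8-15), and the imm8 values shown are just example constants:

```mlir
// 2. _mm256_unpacklo_ps: interleave the low halves of each 128-bit lane.
%lo = vector.shuffle %a, %b [0, 8, 1, 9, 4, 12, 5, 13] : vector<8xf32>, vector<8xf32>

// 3. _mm256_unpackhi_ps: interleave the high halves of each 128-bit lane.
%hi = vector.shuffle %a, %b [2, 10, 3, 11, 6, 14, 7, 15] : vector<8xf32>, vector<8xf32>

// 4. _mm256_shuffle_ps, e.g. with imm8 = 0x4E: per lane [a2, a3, b0, b1].
%sh = vector.shuffle %a, %b [2, 3, 8, 9, 6, 7, 12, 13] : vector<8xf32>, vector<8xf32>

// 5. _mm256_permute2f128_ps, e.g. with imm8 = 0x20: low 128 bits of %a
//    followed by the low 128 bits of %b.
%pm = vector.shuffle %a, %b [0, 1, 2, 3, 8, 9, 10, 11] : vector<8xf32>, vector<8xf32>
```

If that reading is right, intrinsics 2-5 are all expressible as 1-D vector.shuffle ops with constant masks, and the question becomes whether the lowering turns each one into the intended single instruction.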

Any suggestions/feedback would be helpful.

I am also interested in knowing if there are any similar efforts for optimizing multidimensional transposition in MLIR.

Thanks!
-Mahesh

Cc: @MaheshRavishankar

Ref:
[1] Springer, Paul, Tong Su, and Paolo Bientinesi. “HPTT: A high-performance tensor transposition C++ library.” In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pp. 56-62. 2017.

[2] Intel® Intrinsics Guide: _mm256_loadu_ps

[3] Intel® Intrinsics Guide: _mm256_unpacklo_ps

[4] Intel® Intrinsics Guide: _mm256_unpackhi_ps

[5] Intel® Intrinsics Guide: _mm256_shuffle_ps

[6] Intel® Intrinsics Guide: _mm256_permute2f128_ps

[7] Intel® Intrinsics Guide: _mm256_storeu_ps

[8] The HPTT microkernel implementation: github.com/springer13/hptt/blob/master/src/transpose.cpp#L134

As a general strategy, I would suggest you add ops that correspond exactly to these intrinsics to the avx512 dialect (and its llvm_avx512 counterpart, although these two may get merged soon). Then we can generalize the desirable behavior of these ops into the vector dialect if necessary. The vector dialect tends to be higher-level than most machine instructions and supports, e.g., multi-dimensional vectors.
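For instance (purely illustrative names and syntax; none of these ops exist in the dialect today), such one-to-one counterparts could look like:

```mlir
// Hypothetical ops mirroring the intrinsics one-to-one; names and
// attribute syntax are illustrative only, not existing dialect ops.
%0 = avx512.unpacklo %a, %b : vector<8xf32>
%1 = avx512.unpackhi %a, %b : vector<8xf32>
%2 = avx512.shuffle %a, %b {imm8 = 78 : i32} : vector<8xf32>       // 0x4E
%3 = avx512.permute2f128 %a, %b {imm8 = 32 : i32} : vector<8xf32>  // 0x20
```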


Hi @maheshl,

+1 to what @ftynse suggested. Are you using the LLVM x86 backend or something else? If you want to keep the approach a bit more portable, you could check whether the LLVM backend is able to generate what you want out of regular vector shuffles. I would expect that to work for #2, #3 and #4; I am not so sure about #5. You could write a simple LLVM or MLIR test with vector shuffles describing your permutations, compile it for AVX-512, and check the generated assembly.
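For example (just a sketch; adapt the function and masks to your kernel), a test like this:

```mlir
// If the x86 backend recognizes this constant mask, the generated
// assembly should contain a single vunpcklps instruction.
func @unpacklo(%a: vector<8xf32>, %b: vector<8xf32>) -> vector<8xf32> {
  %r = vector.shuffle %a, %b [0, 8, 1, 9, 4, 12, 5, 13] : vector<8xf32>, vector<8xf32>
  return %r : vector<8xf32>
}
```

lowered with mlir-opt -convert-vector-to-llvm and mlir-translate --mlir-to-llvmir, then compiled with llc -mtriple=x86_64 -mattr=+avx512f, would show whether you get the single instruction you want.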


Thanks @ftynse and @dcaballe.
I have been trying to use the MLIR vector ops so far. I will write tests with LLVM's shufflevector to see whether I can generate code similar to the Intel intrinsics; based on that, I can add the necessary ops to the avx512/llvm_avx512 dialects.

Please also have a look at some of the earlier docs on the vector dialect, where we talk about keeping the vector dialect as the proper bridge between the "virtual vector level" and the "hardware vector level", using progressive lowering. As stated above, we want to avoid the vector dialect itself getting too close to the hardware too soon.