LLVM Discussion Forums

Use MLIR/IREE for GPU CodeGen


I am working on GPU CodeGen for PowerVR platform. I started with IREE because IREE has implemented many interesting conversions for LinalgToSPIRV. I have a question with regard to IREE (5a51d2a57 Jul 18), it looks like LinalgTilingOnBuffers skips ‘gpu.func’, for example:

  gpu.module @kernels {
    gpu.func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
      linalg.matmul %arg0, %arg1, %ret0 :
        (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>)

I run the command

iree-opt -split-input-file \
         -iree-codegen-linalg-tile-and-fuse \
         -iree-codegen-convert-to-gpu -canonicalize -cse \

But the output is unchanged. So I change my work flow to this: I first use iree-opt to compile the non gpu MLIR function, then use mlir-opt to lower affine IR to standard IR, for example

iree-opt -split-input-file \
         -iree-codegen-linalg-tile-and-fuse \
         -iree-codegen-convert-to-gpu -canonicalize -cse \
         test.mlir | \
   mlir-opt -lower-affine 
// test.mlir
module attributes {
  spv.target_env =
    #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>,
    {max_compute_workgroup_invocations = 128 : i32,
     max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
    linalg.matmul %arg0, %arg1, %ret0 :
      (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>)

Then I Manually copy the result to a stub code, for example:

#map0 = affine_map<(d0, d1)[s0] -> (d0 * 16 + s0 + d1)>
module attributes {
  gpu.container_module, spv.target_env = #spv.target_env<#spv.vce<v1.3, [Shader],
  {max_compute_workgroup_invocations = 128 : i32,
   max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  gpu.module @kernels {
    gpu.func @matmul(..) kernel .. { .. }

      .. (manual copy paste) ..


  func @main() {
    "gpu.launch_func"(%cst4, %cst4, %cst1, %cst1, %cst1, %cst1,
                      %arg0, %arg1, %arg2)
      { kernel = @kernels::@matmul }
        : (index, index, index, index, index, index,
           memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()

And execute the stub code with vulkan runner

mlir-vulkan-runner \
  --shared-libs=libvulkan-runtime-wrappers.so,libmlir_runner_utils.so \
  --entry-point-result=void -print-ir-after-all \

I wonder if there is a better way to prevent this manyal copy? Or is there any option in MLIR or IREE to mark certain functions as gpu.func

Thanks :slight_smile:

If your question is specific to IREE, you like should have a look at the dedicated communication channels they have: https://github.com/google/iree#communication-channels

Hi @cycheng. As @joker-eph mentioned, this is IREE specific question better addressed on the IREE forums. Reach out to us on the IREE Discord server (https://discord.com/invite/26P4xW4) and we can walk through what you want.

Specifically here, IREE codegen doesn not work with gpu.module or gpu.func.

If you are interested in looking into how IREEs codegen would work, you can look at description( here).
This shows what the IREE codegen pipeline expects as input and all the transformations applied to generate the SPIR-V dialect code and is for the linalg.matmul example itself.

If you want to see the entire IREE pipeline, you should be able to try this

func @matmul(%arg0: tensor<16x16xf32>, %arg1 : tensor<16x16xf32>) -> tensor<16x16xf32> {
  %0 = "mhlo.dot"(%arg0, %arg1) : (tensor<16x16xf32>, tensor<16x16xf32>) -> tensor(16x16xf32)
  return %0 : tensor<16x16xf32>


iree-translate -iree-mlir-to-vm-bytecode-module -iree-hal-target-backends=vulkan-spirv -print-ir-after-all -mlir-disable-threading test.mlir

Hey @cycheng, thanks for your interest in IREE and GPU CodeGen! It’s interesting to see that you are trying to mix mlir-vulkan-runner together with IREE. One thing to point out is that IREE has its own way of handling runtime-kernel ABI, which is different from the GPU dialect and various runners in MLIR core. The passes in IREE are assuming that and that’s why they are staying in IREE’s codebase (some pattern might be upstreamable and we are generally gradually upstreaming them); so they might not be directly applicable to the contract expected by the GPU dialect. mlir-vulkan-runner follows the convention of the GPU dialect. So that’s why you need the “manual” copying to make it work. @MaheshRavishankar just landed a very nice write up on the code generation flow in IREE today that you might be interested to give a read.

If you’d like to leverage the existing functionality in IREE I’d recommend you start with IREE. At the moment it’s typical to start with HLO level when entering IREE’s flow; you can find various examples by following the Get Started pages. It’s not easy to enter IREE’s system at a random layer at the moment but we have plans to improve that and hopefully later it would also be easy to just drop in some SPIR-V code and leverage IREE’s runtime to run it and see how it performs.

OTOH, if you are interested to start with GPU level and use mlir-vulkan-runner because its simplicity, etc., you are also certainly very welcome to play with it and extend it in the way you’d like. :slight_smile:

The flow in the GPU dialect is different. We do not start out with a gpu.func at the LinAlg level. What the above code essentially would say is to run a linalg.matmul on every thread of the gpu. What you likely want is to run a linalg.matmul using all threads of the gpu.

To get closer to what you want, you would use your test input file and then transform it using mlir-opt to gpu code. I used the passes -convert-linalg-to-parallel-loops --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --gpu-kernel-outlining which yields

#map0 = affine_map<()[s0, s1, s2] -> ((s0 - s1) ceildiv s2)>
#map1 = affine_map<(d0)[s0, s1] -> (d0 * s0 + s1)>

module attributes {gpu.container_module, spv.target_env = #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>, {max_compute_workgroup_invocations = 128 : i32, max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %arg2: memref<16x16xf32>) {
    %c16 = constant 16 : index
    %c0 = constant 0 : index
    %c1 = constant 1 : index
    %c1_0 = constant 1 : index
    %0 = affine.apply #map0()[%c16, %c0, %c1]
    %1 = affine.apply #map0()[%c16, %c0, %c1]
    "gpu.launch_func"(%0, %1, %c1_0, %c1_0, %c1_0, %c1_0, %arg0, %arg1, %arg2) {kernel = @matmul_kernel::@matmul_kernel} : (index, index, index, index, index, index, memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()
  gpu.module @matmul_kernel {
    gpu.func @matmul_kernel(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %arg2: memref<16x16xf32>) kernel {
      %0 = "gpu.block_id"() {dimension = "x"} : () -> index
      %1 = "gpu.block_id"() {dimension = "y"} : () -> index
      %2 = "gpu.block_id"() {dimension = "z"} : () -> index
      %3 = "gpu.thread_id"() {dimension = "x"} : () -> index
      %4 = "gpu.thread_id"() {dimension = "y"} : () -> index
      %5 = "gpu.thread_id"() {dimension = "z"} : () -> index
      %6 = "gpu.grid_dim"() {dimension = "x"} : () -> index
      %7 = "gpu.grid_dim"() {dimension = "y"} : () -> index
      %8 = "gpu.grid_dim"() {dimension = "z"} : () -> index
      %9 = "gpu.block_dim"() {dimension = "x"} : () -> index
      %10 = "gpu.block_dim"() {dimension = "y"} : () -> index
      %11 = "gpu.block_dim"() {dimension = "z"} : () -> index
      br ^bb1
    ^bb1:  // pred: ^bb0
      %c1 = constant 1 : index
      %c0 = constant 0 : index
      %c16 = constant 16 : index
      %12 = affine.apply #map1(%0)[%c1, %c0]
      %13 = affine.apply #map1(%1)[%c1, %c0]
      scf.for %arg3 = %c0 to %c16 step %c1 {
        %14 = load %arg0[%12, %arg3] : memref<16x16xf32>
        %15 = load %arg1[%arg3, %13] : memref<16x16xf32>
        %16 = load %arg2[%12, %13] : memref<16x16xf32>
        %17 = mulf %14, %15 : f32
        %18 = addf %16, %17 : f32
        store %18, %arg2[%12, %13] : memref<16x16xf32>

Next step would be to do the mapping to spirv. However, when I used the --convert-gpu-to-spirv pass, I got an error about a missing abi attribute. I don’t know whether we have a pass to insert these. @antiagainst do you know?

That would be the final step using the above pipeline, too.

Hey @joker-eph, @MaheshRavishankar, thanks for reminding, I will move to the IREE community for IREE specific questions!

@MaheshRavishankar, @antiagainst, thanks for pointing out the references:

  1. IREE CPU/GPU Code Generation Pipeline
    (or) https://google.github.io/iree/design-docs/codegen-passes
  2. Progressive lowering of Matmul operation from HLO to SPIR-V
  3. Improve and demonstrate hackability at different levels of IR

It’s very useful to help me understand the IREE codegen pipeline :grin:

@antiagainst, thanks for the explanation! There are two reasons leading me to use mlir-vulkan-runner:

  1. Yes! For simplicity :wink:
  2. I managed to cross compile aarch64 iree on x86 pc but still can’t make it work. (I can cross compile MLIR for aarch64 in two stages, first build x86, then build aarch64 with the help of x86 llvm/mlir tablegen),

Our device is running on aarch64 linux, so I need an aarch64 based runtime, but I also want to leverage IREE’s transformation passes : P
I will follow your advice starting from IREE, and try to figure out how to cross build for IREE in IREE community.

Many thanks for your help :slight_smile:

Oh… right, I understand, then it is not the right level to perform tiling!

The idea is to first tile the linalg operation and then map it to GPU. If you wanted to perform some tiling, the invocation would be mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=0,2" -convert-linalg-to-parallel-loops --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --gpu-kernel-outlining.

A summary of the passes would be

  1. tile the linalg operation using parallel loops
  2. transform the tiled linalg operations to loops, as well
  3. distribute to GPU blocks/threads (this is done using annotations on the loops)
  4. convert the annotated loop nest to a GPU function
  5. outline a kernel

You can run it with your example step by step to see the tiling happening.

1 Like

Right now not. We had discussions in a previous thread and agreed on adding an attribute to gpu.func for workgroup size, which will address the issue. It’s just that nobody has gotten to implement it yet. :wink:

Actually supporting mobile and edge devices is one of IREE’s primary goals. So we are certainly aligned here and IREE cares about cross compilation. :slight_smile: At the moment we have cross compilation flow for Android working and you can find the steps in Get Started on Android with CMake page. If it’s general AArch64 Linux, I wouldn’t expect it to be super hard to get it working there too by selecting proper toolchains and probably tweaking some configurations. We have some extra documentation on cross compilation here. Please certainly feel free to ask on IREE’s channel and tweak things to make your case work!

1 Like

@cycheng to follow up on comment from @herhut earlier.

From here to generate SPIR-V code you need to

  1. There is one manual step where the gpu.func generated needs to have a spv.entry_point_abi attribute added to the gpu.func that needs to be added to specify the workgroup size to use in the lowering to SPIR-V. So you would have to do something like
gpu.func (...) 
attributes {spv.entry_point_abi = {local_size = dense<[32, 4, 1]> : vector<3xi32>} {

(more info about the ABI attributes here).

  1. Then you can generate the spv.module by appending -lower-affine -canonicalize -legalize-std-for-spirv -convert-gpu-to-spirv to the flags above. Though while trying this out, realized there is a bug in the -lower-affine pass which is fixed after this patch.

Now that I tried this out, I think it is worth adding a pass to remove the manual step (1) above (i.e. allowing setting that through command line). Will take a stab at doing that and will update this post.

Thanks for bringing this up. This part of the codebase hasnt been in active use so things were not integrated as well as it should be. Will update this post as I try to stream-line this.

1 Like

With this patch (and its dependent patches) the following module

$ cat test.mlir
module attributes {
  spv.target_env =
    #spv.target_env<#spv.vce<v1.3, [Shader], [SPV_KHR_storage_buffer_storage_class]>,
    {max_compute_workgroup_invocations = 128 : i32,
     max_compute_workgroup_size = dense<[128, 128, 64]> : vector<3xi32>}>} {
  func @matmul(%arg0: memref<16x16xf32>, %arg1: memref<16x16xf32>, %ret0: memref<16x16xf32>) {
    linalg.matmul %arg0, %arg1, %ret0 : (memref<16x16xf32>, memref<16x16xf32>, memref<16x16xf32>) -> ()

can be lowered to spir-v with the following command line

$ mlir-opt -linalg-tile-to-parallel-loops="linalg-tile-sizes=8,4" -convert-linalg-to-parallel-loops
    -test-gpu-greedy-parallel-loop-mapping -convert-parallel-loops-to-gpu -gpu-kernel-outlining
    -lower-affine -canonicalize -legalize-std-for-spirv
    -test-spirv-entry-point-abi="workgroup-size=4,8" -convert-gpu-to-spirv test.mlir

The additional flags are for

  • Lowering affine.apply instructions inserted by the convert-parallel-loops-to-gpu pass.
  • Canonicalization to just get around some missing lowering for std.br` instruction
  • Some ops like subview cannot be lowered directly to SPIR-V, so they are folded into their load/store uses
  • Set workgroup size for the gpu.func kernel function (here [8, 4, 1])
  • Lower gpu.func and its body to spirv dialect.
1 Like

Thanks @MaheshRavishankar for adding this. Nice to see this work end-to-end again.

In this example where the operands to the launch are static, couldn’t we derive the ABI attributes from the call, as well? Not saying you should do this, just curious.

On a different note, if you only want to experiment with kernel side code generation you can use the ModelBuilder code which is in IREE. It allows using directly IREE lowering passes and run the generated code through Vulkan runner without having to do any extra steps. The following example for instance generated code for linalg.matmul: https://github.com/google/iree/blob/main/experimental/ModelBuilder/test/TestMatMulVulkan.cpp

You can build the example into a standalone app, you just need to add -DIREE_BUILD_EXPERIMENTAL=ON to CMake command line and follow the normal IREE build process:


cmake --build build/

and run:

./build/experimental/ModelBuilder/test/test-matmul-vulkan -vulkan-wrapper=build/third_party/llvm-project/llvm/lib/libvulkan-runtime-wrappe

On a slightly different topic if you have some ideas of a good strategy for matmul code generation that would work well for PowerVR, I would be interested to discuss it. I’m currently looking at different mobile GPU architectures to try to identify the different strategies needed to get efficient for different architectures.

@herhut and @MaheshRavishankar Thanks for sharing me another approach, and come out the patches so quickly, the MLIR community is awesome!!

I will try this approach, and compare against IREE’s transformation (Sorry I am still struggling on cross build IREE for aarch64).

@antiagainst I am very glad to be in the same direction with IREE/MLIR (unofficial, not represent powervr :stuck_out_tongue:) Thanks for your info, I am still working on cross build IREE for aarch64) Could you have a look and give me some hints, many thanks :slight_smile:

@ThomasRaoux Yes, I only want to experiment linalg dialect codegen, so I assume graph compiler has generated linalg dialect for me.
The cooperative matrix patch and MatmulCodegenStrategy are quite interesting!
My plan of the following weeks:
(1) Benchmarking single linalg.matmul on powervr.
(2) Benchmarking producer/consumer linalg.matmul on powervr.
(3) Tune the performance:
(3.1) Come out the optimization strategy for powervr, try to feedback (patches or discussions) to community if my bosses agree (Sorry I have to be conservative here, but I think powervr will become more open source friendly, and I am also working on this)
Unfortunately, MLIR/IREE has done many fantastic work, so I am not sure how much we can actually contribute :wink:

Aww, thanks! Do know that more work and focus is extremely helpful! As fairly generic solutions, there is no making up for heart and expertise on specific platforms, and that is the difference between a mediocre outcome and something great for each. So thank you for your perseverance, and don’t be too intimidated by all that appears to be going on. There is a lot still undone, especially for platform specific tuning and specialization.

Yes the cooperative matrix benchmark test is probably the most interesting, it only works for GPU supporting cooperative matrix extension right now. I’m planning to add an alternative lowering for other GPUs using subgroup operations or falling back to naive code generation. I’ll hopefully get to it in the next few weeks.

Some interesting characteristics that influence the lowering so far are whether GPU has fast shared local memory, if it supports efficient subgroups shuffle instructions and if it has any dedicated hardware (tensorcore kind of thing).
If you are able to share, it would great to know if PowerVR can take advantage of any of those features.

THe patches have landed and the command line (here) should work. In due course will add a test for that pipeline. We probably can add a GPUCodegenPassPipeline somewhere for at least testing purposes.

I am not sure how we would do that. In general there is a non-trivial relationship between tile sizes and workgroup sizes. For the purposes of having some working flow, just allow setting this from command line for now.

There is definitely a lot of work to be done. @ThomasRaoux work on using cooperative matrix instructions is what we are focussing on. Maybe the same strategy (that has previously been demonstrated on CPU by @nicolasvasilache) but for powerVR would be really awesome!