Lowering GPU dialect

According to the post “GPU code generation status: NVidia, OpenCL”, the GPU dialect is aimed at providing support for the host-device ABI and for launching kernels.

I’ve been playing with the GPU dialect, but I don’t understand how I can lower it. Right now I have a pass where I add a gpu.alloc op, like this:

%memref = gpu.alloc  () : memref<200x100x600xf32>
...
gpu.dealloc  %memref : memref<200x100x600xf32>

Then I tried to lower this gpu.alloc/dealloc to the MLIR LLVM dialect using:

pm.addPass(mlir::createGpuToLLVMConversionPass());

but it does nothing: the pass runs without errors, yet I get the same MLIR code back. Should I assume that the GPU dialect cannot be converted to a lower-level dialect? It can’t be converted to LLVM either, because if I try to do so, I get:

cannot be converted to LLVM IR: missing `LLVMTranslationDialectInterface` registration for dialect for op: gpu.alloc

How am I supposed to lower the GPU dialect, then?

PS: I expected gpu.alloc to be lowered to some llvm.call to cudaMalloc (allocating on the device). However, because the GPU dialect should work with AMD too, I don’t know where I should specify that I want to lower to NVIDIA.

See an example of this pass in action here: llvm-project/lower-alloc-to-gpu-runtime-calls.mlir at main · llvm/llvm-project · GitHub

In general, looking at the test directory is a good way to find examples of how the code is used.

The GPU dialect represents both host code and device code, which have to be converted differently. You need to create a (host) module that contains gpu modules, then convert the gpu modules to either NVVM or ROCDL and compile them to a binary with the appropriate passes. This gives you the (host) module with the kernels as binary blobs, which can be compiled further. In addition to @mehdi_amini’s example, here is an end-to-end integration test: llvm-project/gpu-to-cubin.mlir at main · llvm/llvm-project · GitHub.
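
For the NVIDIA path, the whole sequence typically looks something like the invocation below (a sketch based on the integration tests of that era; exact pass names and flags may differ in your version of MLIR):

./build/bin/mlir-opt input.mlir \
    --gpu-kernel-outlining \
    --pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)' \
    --gpu-to-llvm

For the AMD path, the NVVM/cubin steps are replaced by their ROCDL/HSACO counterparts.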

It can be converted to a mix of LLVM and target-specific dialects (NVVM, ROCDL). It cannot be translated to LLVM IR, which is where the registration error comes from. Be careful not to confuse conversion with translation.

It is lowered to a call into a runtime wrapper library that we implement: llvm-project/CudaRuntimeWrappers.cpp at main · llvm/llvm-project · GitHub or llvm-project/RocmRuntimeWrappers.cpp at main · llvm/llvm-project · GitHub.
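
For reference, the CUDA wrappers are thin extern "C" shims over the CUDA driver API. Simplified, the ones relevant here look roughly like this (a sketch; see the linked file for the exact code, including error handling):

#include <cstdint>

#include "cuda.h"

// Simplified sketch of the CUDA runtime wrappers; error checking omitted.
extern "C" CUstream mgpuStreamCreate() {
  CUstream stream = nullptr;
  cuStreamCreateWithPriority(&stream, CU_STREAM_NON_BLOCKING, /*priority=*/0);
  return stream;
}

extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, CUstream /*stream*/) {
  CUdeviceptr ptr;
  cuMemAlloc(&ptr, sizeBytes);
  return reinterpret_cast<void *>(ptr);
}

extern "C" void mgpuMemFree(void *ptr, CUstream /*stream*/) {
  cuMemFree(reinterpret_cast<CUdeviceptr>(ptr));
}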

Thanks to both of you. With the example provided by @mehdi_amini, I have now found where the problem is. Let me show you an example.

  1. I have created a simple example that shows how I currently create a GPU alloc/dealloc:
func @main() {
  %memref = gpu.alloc  () : memref<10xf32>
  gpu.dealloc  %memref : memref<10xf32>
  return
}

With this code, I am unable to lower it to LLVM:

[user@machine llvm]$ ./build/bin/mlir-opt --gpu-to-llvm  gpu.mlir 
module  {
  llvm.func @main() {
    %memref = gpu.alloc  () : memref<10xf32>
    gpu.dealloc  %memref : memref<10xf32>
    llvm.return
  }
}
  2. With very similar code (based on the example posted by @mehdi_amini):
func @main() {
  %0 = gpu.wait async
  %1, %2 = gpu.alloc async [%0] () : memref<10xf32>
  %3 = gpu.dealloc async [%2] %1 : memref<10xf32>
  gpu.wait [%3]
  return
}

I am able to lower it to LLVM:

[user@machine llvm]$ ./build/bin/mlir-opt --gpu-to-llvm  gpu_async.mlir 
module  {
  llvm.func @main() {
    %0 = llvm.call @mgpuStreamCreate() : () -> !llvm.ptr<i8>
    ...
    %17 = llvm.call @mgpuMemFree(%16, %0) : (!llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.void
    %18 = llvm.call @mgpuStreamSynchronize(%0) : (!llvm.ptr<i8>) -> !llvm.void
    %19 = llvm.call @mgpuStreamDestroy(%0) : (!llvm.ptr<i8>) -> !llvm.void
    llvm.return
  }
  llvm.func @mgpuStreamCreate() -> !llvm.ptr<i8>
  llvm.func @mgpuMemAlloc(i64, !llvm.ptr<i8>) -> !llvm.ptr<i8>
  ...
}

So, according to these minimal examples, the GpuToLLVMConversionPass pass only works if the code contains alloc/dealloc with async. Why does this happen?

This may be something that got broken when the async support was added, but I didn’t follow the transition closely enough. It does not seem expected to me that the non-async variants would just be silently ignored here (otherwise, what’s the point of having them?).

@csigg worked on this I think?

mgpuMemAlloc/Free() are intended to map to cuMemAllocAsync() (they don’t yet because we haven’t upgraded to CUDA 11.3) and therefore take a stream argument. The stream is converted from !gpu.async.token, which is missing from the operands in your initial code.

The gpu-async-region pass gets you from your initial code to the async variant.
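
For example, chaining it in front of the conversion should take the first snippet all the way to LLVM in one go (a sketch reusing the same mlir-opt flags as above):

./build/bin/mlir-opt --gpu-async-region --gpu-to-llvm gpu.mlir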

It does not seem expected to me that the non-async variants would just be silently ignored here (otherwise, what’s the point of having them?).

I guess the gpu-to-llvm conversion target should mark the gpu dialect illegal?
Like ⚙ D104208 [mlir] Mark gpu dialect illegal in gpu-to-llvm conversion

Thanks for the tip. I can make it work like this:

optPM.addPass(mlir::createGpuAsyncRegionPass());
pm.addPass(mlir::createGpuToLLVMConversionPass());

I would like to ask a final question. Why doesn’t the GPU dialect have something like populateGPUToLLVMConversionPatterns, as Affine, Std, etc. do? If I’m not wrong, the only way to lower GPU to LLVM is to add a pass like the one I have written above, and it is not possible to use transitive lowering with the GPU dialect. Please correct me if I’m mistaken.

I guess this behaviour is by design, probably because it makes sense to have a dialect that represents GPU code and then lower this GPU code with a dedicated pass (createGpuToLLVMConversionPass).

Some of the transformations in the GPU dialect lowering are not rewrite-pattern based but walk the IR instead. For example, the passes around introducing async behavior work that way.

I think for the GPU to LLVM transformation in particular there is no good reason. If you would like those patterns to be exposed, feel free to add a populate method. Note that you also need to configure the type converter and legality.
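
For illustration, a custom conversion pass using such a populate method could be wired up roughly as follows. This is only a sketch: populateGPUToLLVMConversionPatterns is the hypothetical method being discussed (it did not exist upstream at the time), and the includes and pass boilerplate are omitted.

// Sketch of a custom conversion pass body; populateGPUToLLVMConversionPatterns
// is the hypothetical populate method discussed above, not an upstream API.
void MyGpuLoweringPass::runOnOperation() {
  MLIRContext *ctx = &getContext();
  LLVMTypeConverter converter(ctx);
  RewritePatternSet patterns(ctx);

  // Patterns from other dialects lowered in the same sweep.
  populateStdToLLVMConversionPatterns(converter, patterns);
  // The hypothetical GPU populate method.
  populateGPUToLLVMConversionPatterns(converter, patterns);

  LLVMConversionTarget target(*ctx);
  // Without marking gpu illegal, leftover gpu ops are silently kept,
  // which is exactly the behaviour observed earlier in this thread.
  target.addIllegalDialect<gpu::GPUDialect>();

  if (failed(applyPartialConversion(getOperation(), target,
                                    std::move(patterns))))
    signalPassFailure();
}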

Sure, I’ve been playing with it and I have just created the populate method. It works pretty well.

Sadly, I’m still having issues with the GPU dialect. I’m able to generate pure LLVM code, but when generating the final executable I don’t know how to link my code with the mgpu functions (e.g., mgpuStreamCreate); right now I get undefined reference to ... errors. @ftynse mentioned that there are two backends (NVIDIA and AMD), but I don’t even know how to choose between them.

I honestly think that some aspects of the GPU dialect that we have been discussing in this thread should be documented somewhere (maybe they are and I just didn’t find them)…

There is an implementation of these helper functions for CUDA in llvm-project/CudaRuntimeWrappers.cpp at 9c21ddb70ab56eb3ca5b0f99faa18bb3af17b3df · llvm/llvm-project · GitHub, and you have to link your final executable against it.

You choose the runtime implementation by linking against a different library. We do not support mixed environments.
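
For example, the CUDA integration tests execute the lowered module with mlir-cpu-runner and pull in the wrapper implementations as shared libraries (a sketch; the library paths depend on your build tree):

mlir-cpu-runner lowered.mlir \
    --shared-libs=build/lib/libmlir_cuda_runtime.so \
    --shared-libs=build/lib/libmlir_runner_utils.so \
    --entry-point-result=void

For the ROCm path, you would link against the ROCm runtime wrapper library instead.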

I agree. The documentation is lacking. Would you be interested in writing a mini tutorial with your learnings?