[RFC] Converting multi-threaded SPIR-V to LLVM dialect: overview

Hi all,

I have been working on the SPIR-V to LLVM dialect conversion over the summer, focusing on single-threaded code. I am now interested in investigating how multi-threaded SPIR-V can target CPUs. Some related links:

This post intends to give a high-level overview of the project ideas, and to spark the related discussion. Feel free to make suggestions/corrections/comments - any contribution is very welcome! :blush: After gathering some feedback I can start prototyping and add more details regarding the implementation side.

Goals
Ultimately, the idea is to be able to convert a multi-threaded SPIR-V kernel with synchronisation constructs to LLVM IR. We want to preserve inter-thread ordering and memory coherence in a way that running LLVM’s optimisations does not violate GPU-specific constraints. The kernel can look like

spv.module @foo Logical GLSL450 ... {
  spv.globalVariable @id built_in("WorkgroupId")

  spv.func @bar() "None" {
    %id = spv.mlir.addressof @id : ...
    // Code #1
    spv.ControlBarrier "Workgroup", "Device", "Acquire|UniformMemory"
    // Code #2
  }
}

Mapping to CPU
The main question is how to model multi-threaded GPU execution on CPUs. For that, we need to map the GPU thread organisation hierarchy onto CPU concepts. I am thinking of the following mapping (inspired by Intel’s OpenCL-to-CPU compilation):

  • Each workgroup can be considered as a single CPU thread
  • Subgroups and invocations can be packed into SIMD vectors, as a wavefront

This approach avoids launching hundreds of threads on the CPU and uses SIMD to mimic SIMT concepts.
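As a rough illustration of this mapping (all names here are hypothetical, not a proposed API), each workgroup could become one CPU thread that runs its invocations in a plain lane loop, which a vectoriser can later turn into SIMD ops:

```cpp
#include <thread>
#include <vector>

// Hypothetical sketch: run `num_workgroups` workgroups, each as one CPU
// thread. Within a workgroup, the invocations become a loop over lane
// indices; a vectoriser can map that loop onto SIMD lanes later.
void run_kernel(int num_workgroups, int workgroup_size,
                void (*kernel)(int wg, int lane)) {
  std::vector<std::thread> threads;
  threads.reserve(num_workgroups);
  for (int wg = 0; wg < num_workgroups; ++wg) {
    threads.emplace_back([=] {
      // "Wavefront": all invocations of this workgroup, executed in order.
      for (int lane = 0; lane < workgroup_size; ++lane)
        kernel(wg, lane);
    });
  }
  for (auto &t : threads)
    t.join();
}
```

This only covers barrier-free kernels; barriers inside `kernel` would force splitting the lane loop, as discussed below in the thread.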

Synchronisation
One of the challenges is modelling barriers in the LLVM dialect. While spv.MemoryBarrier is essentially LLVM’s fence instruction, spv.ControlBarrier requires extra handling. A naive mapping can be similar to a spin lock: each thread blocks until a counter reaches the number of participating threads.
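A minimal sketch of such a naive counting barrier (assuming a fixed, known thread count; the class and member names are hypothetical, and a real lowering would also have to honour the barrier’s memory semantics):

```cpp
#include <atomic>

// Hypothetical sketch of a naive spin barrier: each arriving thread bumps a
// counter and spins until all `num_threads` participants have arrived. A
// generation counter lets the same barrier be reused across phases.
class SpinBarrier {
public:
  explicit SpinBarrier(int num_threads) : num_threads_(num_threads) {}

  void wait() {
    int gen = generation_.load(std::memory_order_acquire);
    if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == num_threads_) {
      // Last thread to arrive: reset the counter and release everyone.
      arrived_.store(0, std::memory_order_relaxed);
      generation_.fetch_add(1, std::memory_order_release);
    } else {
      // Spin until the generation changes, i.e. all threads have arrived.
      while (generation_.load(std::memory_order_acquire) == gen) {
      }
    }
  }

private:
  const int num_threads_;
  std::atomic<int> arrived_{0};
  std::atomic<int> generation_{0};
};
```

The acquire/release pairing on the generation counter is what gives the fence-like visibility guarantee: writes made before `wait()` in one thread are visible after `wait()` returns in the others.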

Transformations
With this mapping, each multithreaded kernel would have to

  • undergo “unrolling” (workgroup >> single CPU thread/SIMD ops) transformation
  • undergo “analysis” transformation to handle convergence/synchronisation issues (like variables crossing the barriers)

This transformation can be written as a separate stand-alone SPIR-V dialect pass: e.g. -spirv-to-cpu.

Can you use the OpenMP dialect in MLIR?
Can you use the OpenMPIRBuilder in LLVM?

This is to handle a ControlBarrier across multiple workgroups, right? Since you mentioned that you would map a workgroup to a single CPU thread, you can’t really spin lock there. Instead, it seems to me that you need to do something more complex if the number of invocations in the workgroup doesn’t match the SIMD width.

Hi George, sorry about the late reply; just got back from a long leave. :slight_smile: It’s great to see that you are still interested in pushing this forward after the GSoC project!

I’d say before jumping into decisions, it would be nice to do some preparation and define steps ahead. I suspect many of the issues we are gonna handle won’t be trivial; they will involve taking inspiration from existing solutions and discussing with the community to weigh trade-offs. So making sure we survey enough is important. I believe you have already been doing this (hence this thread!), just without writing all the stuff down explicitly. :slight_smile:

Specifically, I’d suggest to understand more around 1) CPU/GPU concurrency hierarchies and differences, 2) existing concurrency programming primitives/models for CPU (threads, coroutines, fibers, SIMD, OpenMP, etc.), 3) CPU emulation of GPU (OpenCL, OpenGL, Vulkan, etc.) kernel execution in existing projects, 4) etc. It would be a great contribution already to write them down somewhere, as it provides a background for discussion and also gets everybody interested to be on the same page.

Then we can start biting the whole task off piece by piece. This could be driven by dismantling some concrete kernel in reality so we have a clear target. We won’t be handling everything in one go because there are so many cases. For the cases in which we already have a direct mapping we can just do it. For others we might need improvement that takes longer to land. Having dedicated threads for discussing specific issues to keep focused is typically more productive. (This isn’t saying the overview thread isn’t useful; quite the opposite—we need it too.) It would be great if you could drive formalizing such steps and plans.

Sorry if this is being a bit vague; but I guess that’s what should be for overviews. :slight_smile:

Thanks,
Lei

Sorry for late reply! I was also on a holiday :slightly_smiling_face:.

That’s right. spv.ControlBarrier has different scopes, particularly:

  1. Device: a device can have multiple workgroups that should all reach the barrier before proceeding. Since we model a workgroup as a single thread, this is essentially a thread barrier (e.g. a spin lock)

  2. For other scopes, smaller than a workgroup, we pack instructions into SIMD vectors. Say the workgroup size is W and the SIMD width is S. I see the following options:

  • if W == S, then it is just a vector instruction.
  • if W < S, we can still use a vector but with predication, turning off the unused lanes. (Vector predication is currently a work in progress in LLVM as far as I know; I will look more into this.) Alternatively, we can assume that W must be a multiple of S and not consider this case.
  • if W > S, we now need to create multiple (vector) instructions and make sure that all instructions from before the barrier are executed before those that follow it. Particularly, if we had
vector_op1.1
vector_op1.2
// barrier was here
vector_op2.1
vector_op2.2

we need to make sure that vector_op1.* are executed before vector_op2.*. A possible solution is to combine the instructions before/after the barrier into non-inlined functions. The compiler then should not reorder the resulting function calls.
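A scalar sketch of that splitting (hypothetical names; a real lowering would emit actual vector ops rather than lane loops): the regions on either side of the barrier are outlined into their own non-inlined functions, so the opaque call boundary keeps the optimiser from moving side-effecting work across the barrier point:

```cpp
// Hypothetical sketch: the region before the barrier and the region after
// it are outlined into separate noinline functions, so the calls stay
// opaque and the optimiser cannot hoist `after_barrier` work above the
// calls that complete `before_barrier` for every chunk.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// One "vector instruction" worth of work on S lanes starting at `base`.
NOINLINE void before_barrier(int *data, int base, int s) {
  for (int lane = 0; lane < s; ++lane)
    data[base + lane] += 1; // vector_op1.*
}

NOINLINE void after_barrier(int *data, int base, int s) {
  for (int lane = 0; lane < s; ++lane)
    data[base + lane] *= 2; // vector_op2.*
}

// Workgroup of size W executed with SIMD width S (W > S): issue every chunk
// of the pre-barrier region before any chunk of the post-barrier region.
void run_workgroup(int *data, int w, int s) {
  for (int base = 0; base < w; base += s)
    before_barrier(data, base, s); // vector_op1.1, vector_op1.2, ...
  // -- barrier point: all lanes have finished the first region --
  for (int base = 0; base < w; base += s)
    after_barrier(data, base, s); // vector_op2.1, vector_op2.2, ...
}
```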

I agree! Thanks for the overview. So I guess the plan is the following:

  1. Outline the differences between GPUs/CPUs (SPIR-V/LLVM) to identify what has to be mapped (e.g. barriers, etc.).
  2. Gather links from existing projects (I have mostly looked at OpenCL at the moment).
  3. Decide on the kernel/cases to map (i.e. what SPIR-V/GPU dialect kernel we can consider as MVP).
  4. Decide on the actual mappings on a case-by-case basis.

I have been doing some things from the list already, but I guess it is much better to summarise them all in one place before jumping to prototypes, etc. So I will start with that :blush:.

Do you assume you always know the exact values for W and S before doing the conversion? Otherwise the code has to be generated parametrically somehow.

It isn’t clear to me why we need to prevent reordering of side-effect-free operations (outside of convergent-specific operations)? Are there specifics associated with the GPU model here?

In the SPIR-V dialect, we can use LowerABIAttributesPass to set shader interface attributes, and particularly the workgroup size. I assume the values of S should be chosen optimally based on W? I know that LLVM has a ScalableVectorType, so some inspiration can be taken from there. (I will add more on this when I put together resources, a proposal, current solutions, etc., as mentioned by @antiagainst.)

Here I meant instructions with side effects, as per the SPIR-V spec for spv.ControlBarrier:

All invocations of this module within Execution scope must reach this point of execution before any invocation will proceed beyond it.

If you want to convert to single-threaded CPU code, why would you still need fences? Does the SSA semantics not ensure the needed sequencing, if the operations of multiple SPIR-V threads are correctly interleaved into the single CPU thread? I saw your answer to @joker-eph’s similar question, but it’s not clear to me why you want to preserve more than the functional semantics when converting to CPU code.

The programming model of GPUs relies on the concurrent execution of “threads” inside a workgroup, and it may not be semantically possible to just execute them one at a time in sequence if there is a barrier or a convergent/uniform instruction.
Mapping this to a single thread is possible, but you still need to interleave the instructions of all the “threads” in between barriers. Some cooperative instructions also need specific handling.
I’d look into how Intel mapped OpenCL to CPUs, or POCL, for inspiration. Here is one past presentation: https://llvm.org/devmtg/2012-04-12/Slides/Ralf_Karrenberg.pdf and a good description of the algorithm involved in POCL: https://tutcris.tut.fi/portal/files/5075042/pocl.pdf
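To make the interleaving requirement concrete, here is a scalar sketch (hypothetical kernel, not from any of the projects above) of a kernel where each “thread” writes its own slot, hits a barrier, and then reads a neighbour’s slot. Running one thread to completion at a time would read uninitialised neighbours; instead, all threads run up to the barrier, then all threads run the region after it:

```cpp
#include <vector>

// Hypothetical sketch: emulating n GPU "threads" on a single CPU thread.
// A barrier in the kernel forces loop fission: one loop per barrier region,
// so all region-1 effects are visible before any region-2 read.
std::vector<int> emulate(int n) {
  std::vector<int> shared(n, -1), result(n);
  // Region 1 for every thread: write this thread's slot.
  for (int tid = 0; tid < n; ++tid)
    shared[tid] = tid * tid;
  // -- barrier: all region-1 writes are now visible to every thread --
  // Region 2 for every thread: read the next thread's slot.
  for (int tid = 0; tid < n; ++tid)
    result[tid] = shared[(tid + 1) % n];
  return result;
}
```

Fusing the two loops into one (i.e. running each thread to completion) would make `result[tid]` read the neighbour’s slot before it was written, which is exactly the semantic hazard the barrier exists to prevent.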

Regarding workgroup size: it is specified in the SPIR-V module as constants. They can be normal constants or specialization constants. But a specialization constant is just a mechanism for second-stage compilation in driver compilers. I’d view the SPIR-V to LLVM conversion as the “driver compiler”, so specialization constants will be frozen and constant-folded first. It’s reasonable to me to assume that we know the exact values. I’m not entirely up to speed with scalable vectors in LLVM; it would be nice to learn more via this effort. :slight_smile:

Regarding control barriers with device execution scope, I’m not sure how useful it would be to support them. At least for Vulkan, the execution scope is required to be workgroup or subgroup (see VUID-StandaloneSpirv-None-04636). We can have more workgroups than GPU cores: some issued to GPU cores, some still waiting. It isn’t clear to me how one can sync across all these workgroups during kernel execution. Specific vendors might have extensions to support it in a constrained way, but I don’t think there is a standard cross-vendor solution for it. Again, to me, we won’t support all features by working through all the details in the spec in one step (SPIR-V is for multiple APIs anyway), so having some real cases to guide the way would be nice.