Hi all,
I have been working on the SPIR-V to LLVM dialect conversion over the summer, focusing on single-threaded code. I am now interested in investigating how multi-threaded SPIR-V can target CPUs. Some related links:
- Initial discussion on mapping GPU concepts to CPUs: GSoC proposal: SPIR-V to LLVM IR dialect conversion in MLIR
- GSoC summary
This post gives a high-level overview of the project ideas and is meant to spark discussion. Feel free to make suggestions/corrections/comments - any contribution is very welcome! After gathering some feedback I can start prototyping and add more details on the implementation side.
Goals
Ultimately, the idea is to be able to convert a multi-threaded SPIR-V kernel with synchronisation constructs to LLVM IR. We want to preserve inter-thread and memory coherence so that running LLVM’s optimisations does not break GPU-specific constraints. Such a kernel can look like:
spv.module @foo Logical GLSL450 ... {
  spv.globalVariable @id built_in("WorkgroupId")
  spv.func @bar() "None" {
    %id = spv.mlir.addressof @id : ...
    // Code #1
    spv.ControlBarrier "Workgroup", "Device", "Acquire|UniformMemory"
    // Code #2
  }
}
Mapping to CPU
The main question is how to model multi-threaded GPU execution on CPUs. For that, we need to map the GPU thread hierarchy onto CPU threads. I am thinking of the following mapping (inspired by Intel’s OpenCL to CPU compilation):
- Each workgroup can be considered as a single CPU thread
- Subgroups and invocations can be packed into SIMD vectors, as a wavefront
This approach avoids launching hundreds of threads on the CPU and uses SIMD to mimic SIMT concepts.
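A minimal sketch of this mapping in plain C++, under stated assumptions: the kernel body (`out[gid] = 2 * in[gid]`) and the names `runWorkgroup` and `dispatch` are hypothetical and only illustrate the idea. Each workgroup becomes one CPU thread, and the invocations of a workgroup become a loop whose iterations are independent, so a compiler is free to vectorise it. Nothing here reflects an actual lowering in MLIR.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical kernel body, executed for a whole workgroup by one CPU thread.
// The loop over local invocation ids has no cross-iteration dependencies,
// so it is a candidate for SIMD vectorisation.
void runWorkgroup(std::size_t workgroupId, std::size_t workgroupSize,
                  const std::vector<float> &in, std::vector<float> &out) {
  for (std::size_t localId = 0; localId < workgroupSize; ++localId) {
    std::size_t globalId = workgroupId * workgroupSize + localId;
    out[globalId] = 2.0f * in[globalId];
  }
}

// Launches one CPU thread per workgroup and waits for all of them.
std::vector<float> dispatch(const std::vector<float> &in,
                            std::size_t numWorkgroups,
                            std::size_t workgroupSize) {
  assert(in.size() == numWorkgroups * workgroupSize);
  std::vector<float> out(in.size());
  std::vector<std::thread> threads;
  for (std::size_t wg = 0; wg < numWorkgroups; ++wg)
    threads.emplace_back(
        [&, wg] { runWorkgroup(wg, workgroupSize, in, out); });
  for (std::thread &t : threads)
    t.join();
  return out;
}
```

With 2 workgroups of 4 invocations each, this starts 2 CPU threads instead of 8.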
Synchronisation
One of the challenges is to model barriers in the LLVM dialect. While spv.MemoryBarrier is essentially LLVM’s fence instruction, spv.ControlBarrier requires extra handling. A naive mapping can be similar to a spin lock: each thread blocks until a counter has reached the number of participating threads.
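The naive counter-based barrier could look roughly like the C++ sketch below. The class `SpinBarrier` and the helper `runTwoPhases` are hypothetical names, and the generation counter is an addition (not mentioned above) that makes the barrier reusable. The acquire/release orderings stand in for the memory semantics that a real lowering would have to derive from the barrier's operands.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical spin barrier: threads block until `numThreads` have arrived.
class SpinBarrier {
public:
  explicit SpinBarrier(unsigned numThreads) : numThreads_(numThreads) {
    assert(numThreads > 0);
  }

  void arriveAndWait() {
    unsigned gen = generation_.load(std::memory_order_acquire);
    if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == numThreads_) {
      // Last arrival: reset the counter, then release the waiting threads.
      arrived_.store(0, std::memory_order_relaxed);
      generation_.fetch_add(1, std::memory_order_release);
    } else {
      // Spin until the last arrival bumps the generation.
      while (generation_.load(std::memory_order_acquire) == gen)
        std::this_thread::yield();
    }
  }

private:
  const unsigned numThreads_;
  std::atomic<unsigned> arrived_{0};
  std::atomic<unsigned> generation_{0};
};

// "Code #1" writes a per-thread slot, "Code #2" reads a neighbour's slot.
std::vector<int> runTwoPhases(unsigned numThreads) {
  SpinBarrier barrier(numThreads);
  std::vector<int> stage(numThreads, 0), result(numThreads, 0);
  std::vector<std::thread> threads;
  for (unsigned tid = 0; tid < numThreads; ++tid)
    threads.emplace_back([&, tid] {
      stage[tid] = static_cast<int>(tid) + 1;      // Code #1
      barrier.arriveAndWait();                     // control barrier
      result[tid] = stage[(tid + 1) % numThreads]; // Code #2
    });
  for (std::thread &t : threads)
    t.join();
  return result;
}
```

Spinning wastes CPU time when threads arrive far apart, which is one reason this mapping is only a starting point.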
Transformations
With this mapping, each multithreaded kernel would have to
- undergo “unrolling” (workgroup >> single CPU thread/SIMD ops) transformation
- undergo “analysis” transformation to handle convergence/synchronisation issues (like variables crossing the barriers)
These transformations can be written as a stand-alone SPIR-V dialect pass, e.g. -spirv-to-cpu.
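To make the "variables crossing the barriers" issue concrete, here is a hand-written C++ sketch of what the result of such transformations might look like for one workgroup. It assumes one possible strategy, splitting the invocation loop at the barrier (loop fission) and expanding each private variable that is live across the barrier into one slot per invocation. The kernel body and the name `runWorkgroupSplit` are hypothetical; this is not a description of an existing pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-invocation kernel being modelled (hypothetical):
//   tmp = in[lid] + 1; shared[lid] = tmp;     // Code #1
//   barrier
//   out[lid] = tmp + shared[(lid + 1) % n];   // Code #2
std::vector<int> runWorkgroupSplit(const std::vector<int> &in) {
  assert(!in.empty());
  const std::size_t n = in.size(); // workgroup size
  std::vector<int> shared(n), out(n);
  // `tmp` is live across the barrier, so it gets one slot per invocation.
  std::vector<int> tmp(n);

  // Code #1 for every invocation in the workgroup.
  for (std::size_t lid = 0; lid < n; ++lid) {
    tmp[lid] = in[lid] + 1;
    shared[lid] = tmp[lid];
  }

  // The barrier is implied: the first loop has finished for all invocations.

  // Code #2 for every invocation in the workgroup.
  for (std::size_t lid = 0; lid < n; ++lid)
    out[lid] = tmp[lid] + shared[(lid + 1) % n];

  return out;
}
```

In this form no run-time barrier is needed inside a workgroup, and both loops remain candidates for vectorisation.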