GPU Compute Basic Algorithms

Restating what was said above. This is a really timely and important contribution. Two things I would request:

  1. It would be good to consider how these things would get lowered to SPIR-V. @antiagainst, @ThomasRaoux, @hanchung and I can help with that. As mentioned by everyone, having blocks that can be composed to implement the above algorithms would be super useful.
  2. Within IREE specifically, we are looking at targeting SPIR-V from the Vector/Linalg dialects. For the kernel-side code generation we can go to the GPU dialect and then SPIR-V too (which is in many ways preferable). So it would be really interesting to hook up these algorithms to the IREE codegen pipeline.

Cool! We have basic prototypes of Reduction and Convolution within IREE, and as Mahesh said, we’re happy to help with this. We can prototype the rest, and also get some improvements for convolution and reduction. We recently had a brief discussion about improving reduction in IREE. We can also start from that tree-based algorithm or see how to put these optimizations together.

Great!

One thing I hope we can discuss deeply is the tradeoffs that appear between the side-effecting memory world (memrefs of scalars) and the value world (n-D vectors).
We will need both in the longer term, but note that n-D vector SSA values allow designing abstractions that map without surprises to efficient HW primitives + unroll-and-jam.
Think of it as llvm.matrix intrinsics on steroids: it becomes quite feasible to create building blocks that operate at peak, and compose, on various HW (not just GPUs :slight_smile: ).
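To make the contrast concrete, here is a minimal sketch of an element-wise add in each world (syntax is approximate, roughly current upstream MLIR; the memrefs, bounds and constants are assumed to be defined elsewhere):

```mlir
// Memory world: scalar loads/stores on memrefs, ordered by side effects.
scf.for %i = %c0 to %n step %c1 {
  %a = memref.load %A[%i] : memref<?xf32>
  %b = memref.load %B[%i] : memref<?xf32>
  %s = arith.addf %a, %b : f32
  memref.store %s, %C[%i] : memref<?xf32>
}

// Value world: one SSA value per n-D vector, no side effects in between,
// which composes naturally with unroll-and-jam and HW primitives.
%va = vector.transfer_read %A[%c0], %f0 : memref<?xf32>, vector<128xf32>
%vb = vector.transfer_read %B[%c0], %f0 : memref<?xf32>, vector<128xf32>
%vs = arith.addf %va, %vb : vector<128xf32>
vector.transfer_write %vs, %C[%c0] : vector<128xf32>, memref<?xf32>
```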

First of all, thanks for the interest and the great discussion (you can find the slides here). I summarized the ideas and questions which arose:

  • Parametrization & Tuning: Can attributes be used to have generic implementations which support different trade-offs? This includes things like memory / register pressure vs. performance, or faster but redundant parallel execution vs. slower but more energy-efficient sequential execution. So far, this is the most uncertain point. It might even be a research topic.
  • Homogenization: Can the operations be defined in a uniform way which is agnostic to the hierarchy, maybe even across different dialects? Attributes should be able to do this; I will come back to this point in my next post.
  • Composition & Reusability: Many of the proposed operations depend on / are built from one another. Can we define them in a way that reflects this? The goal is to support higher-level optimization instead of having low-level-optimized black-box implementations. This can probably be done by lowering these operations progressively inside the same dialect, but without an implementation it is hard to tell.
  • Hardware Diversity: Is this limited to GPUs or can it be extended to CPUs and other types of hardware? The consensus was that it should be done in both the GPU and the Vector dialect, because both seem to be possible and it also avoids over-fitting the design.
  • Dialects & Libraries: How are these operations exported to the outside? They probably need to go through the entire stack, as they also require memory and buffer bindings as well as optimizations such as fusion. It is not clear yet if they would fit into the existing dialects or require their own for the higher levels. However, this might become more obvious once the lower levels (GPU and Vector) are implemented.

Please correct me if there are any errors or misunderstandings in the summary.


Some questions.

  1. How does MLIR plan to implement these ops? Is it that, as part of progressive lowering, we “decompose” ops like “scan” into micro-ops in a descendant dialect which eventually realizes the op?

  2. How do we target libraries in MLIR? Say I want to use a scan implementation from a specialized library. How would that work, if we intend to support it?

It doesn’t always have to be micro-ops in descendant dialects, but indeed, progressive lowering + transformations that break up an op into smaller ops seem to be a reasonable way to get started. You could look at how vector.contract gets progressively lowered to either:

  1. vector.matrix_multiply -> llvm.matrix_multiply
  2. vector.outerproduct + vector.transpose -> insert/extract/fma
  3. directly to insert/extract/fma

From any of these levels it would make sense to go to a HW-specific dialect, e.g. GPU.
However, there are also implications on chains of ops with sources / sinks to memory (e.g. load/store and vector.transfer); see e.g. @ThomasRaoux’s commits to IREE for a bit of the gradient here.
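For concreteness, a small illustrative `vector.contract` in matmul configuration (shapes made up, syntax approximate) that the lowering paths above would start from:

```mlir
#mA = affine_map<(m, n, k) -> (m, k)>
#mB = affine_map<(m, n, k) -> (k, n)>
#mC = affine_map<(m, n, k) -> (m, n)>
// 2-D vector SSA values; the contraction is later rewritten along one of
// the paths listed above.
%d = vector.contract {indexing_maps = [#mA, #mB, #mC],
                      iterator_types = ["parallel", "parallel", "reduction"]}
       %a, %b, %c : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
```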

There is also some WIP that adds a lowering to vector.reduce.

A lot more work is needed to get a meaningful set of useful primitives such as described by @Lichtso .

For example, Linalg supports a primitive lowering to library calls for semantically named ops. This needs to be extended (e.g. as discussed here), but it shows that starting from high-level, semantically charged ops, we can mix transformations, codegen and library calls. This is also why I have been talking about ops whose semantics are captured by attributes: this allows building generic mechanisms to mix codegen + library calls.

I imagine similar mechanisms can be generalized and serve the same purpose. However, it is unclear at this point whether all/most of the ops discussed in the meeting have such attribute-based representations, but the less the compiler / analyses and transformations need to know about e.g. my_special_scan_and_conv_op, the better.

Thanks @Lichtso for this nice summary of the discussion!

Thanks Nicolas. That makes sense!

As promised, here is my take on the homogenization point from before. In my next post I want to talk more about composition and reusability.

Hierarchy Levels

I am not sure how to name these exactly, but they should definitely not be called thread, vector, lane, warp, wavefront, block, grid, etc. These terms are confusing as they can mean different things depending on the vendor, the technology, and whether they refer to CPUs or GPUs.

Nonetheless, it should be possible to define an attribute enum for these 4 levels, which can be used in the GPU and Vector dialects equally. This attribute would then be used to progressively lower the operations onto their own equivalent of the next lower level.

Single Element (Lowest Level)

  • terms (GPU): invocation (Vulkan), lane, thread
  • terms (CPU): vector element
  • Memory type: Register, temporary thus SSA-only
  • Dimensionless (always 1D)
  • Concurrency: Simultaneous / parallel and implicitly synchronized
  • Possible interactions: Bits inside the element

Single Processor

  • terms (GPU): subgroup (Vulkan), warp, wavefront
  • terms (CPU): SIMD vector, thread
  • Memory type: Register, temporary thus SSA-only
  • Dimensionless (always 1D)
  • Concurrency: Simultaneous / parallel and implicitly synchronized
  • Possible interactions: All elements in the same processor

Multi Processor

  • terms (GPU): workgroup (Vulkan), thread block
  • terms (CPU): (sliding) window
  • Memory type: Shared, temporary thus SSA-only
  • Dimensional (fixed to 3D on GPUs)
  • Concurrency: Simultaneous / parallel but requires explicit synchronization
  • Possible interactions: All elements which are currently being processed

Entire Array (Highest Level)

  • terms (GPU): global work (Vulkan), grid
  • terms (CPU): n-D vector (MLIR Vector Dialect), array
  • Memory type: Global / Buffer, persistable thus memref
  • Dimensional (fixed to 3D on GPUs)
  • Concurrency: Sequential / serial
  • Possible interactions: Only from past to future

Fusion

The temporal / sequential axis can be lowered from the array level down to the element level in some cases (such as a scan followed by some element-wise arithmetic): instead of doing one pass for the scan kernel, storing and loading to / from buffers, and then another pass for the arithmetic kernel, everything can be fused into one kernel and one pass. This lets the values stay in registers and avoids the slow access to global memory in the middle. But it also means that part of the temporal / sequential axis then happens inside each element.
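As a purely hypothetical illustration of this fusion, with invented `compute.*` ops written in MLIR’s generic syntax just to show the data flow (none of these ops exist):

```mlir
// Unfused: the scan result makes a round trip through global memory
// between the two kernels.
%tmp = "compute.scan"(%in) {kind = "add", inclusive = true}
         : (memref<1024xf32>) -> memref<1024xf32>
%out = "compute.map"(%tmp) {fn = "scale"} : (memref<1024xf32>) -> memref<1024xf32>

// Fused: one kernel; each element's scan result stays in a register and
// feeds the element-wise arithmetic directly, so part of the sequential
// axis now lives inside each element.
%fused = "compute.scan_map"(%in) {kind = "add", inclusive = true, fn = "scale"}
           : (memref<1024xf32>) -> memref<1024xf32>
```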

Fixed Dimensions in GPUs

The fixed dimensions for the two upper levels of the hierarchy are somewhat problematic:

  • They are fixed to exactly 3 axes, so they are a bad fit for generic n-D vectors / arrays.
  • The maximum number of elements along each axis can differ from the other two. This is especially problematic for transpose operations.
  • The y and z axes are sometimes almost unusable / ridiculously limited (on some mobile devices).

For these reasons it might be better to only use the x axis and simulate n-D instead.

The Multi Processor Level on CPUs

I could not find anything on multi-processor collaboration in the Vector dialect or anything else CPU-related. Is it just not planned / implemented, have I overlooked something, or is it infeasible?

Overall this looks good to me! It seems like it would be nice to have this as a doc somewhere (maybe here: https://mlir.llvm.org/docs/Dialects/GPU/ ?); I remember being confused by the Vulkan terminology coming from CUDA.

On the terminology, we should stay aligned with the GPU dialect. If we redefine the levels we should do it there accordingly.
Your single-processor / multi-processor levels are somewhat of a misnomer to me: a thread block runs on a single multiprocessor (in CUDA terminology), as far as I understand. Also, you mentioned for the multi-processor level “Possible interactions: All elements which are currently being processed”; it isn’t clear if this refers to all elements in the entire array: it should be all elements in the block.

What about LockStep and Processor for these two levels?

I am not sure what you meant by that.
Wouldn’t it be best to find names that match the GPU and CPU (Vector dialect) equally?

Lockstep goes in the right direction, but it is more of a phenomenon than a property. While the second-lowest level is definitely always in lockstep, the level above could also happen to be in lockstep, depending on the scheduling and synchronization. And yes, you are right that the term “processor” can mean totally different things depending on the vendor and implementation. So it is probably best not to use it at all.

I wanted to make it clear that the “block” / “work group” is not just a physical location but also has a temporal aspect. It logically traverses the array (at the highest level) but can stay at the same physical location. That is why I used the word “currently” in “All elements which are currently being processed”.

Maybe the perspective of scheduling (similar to your idea of lockstep) could be used for the naming. As far as I know, on GPUs the dynamic scheduling happens on the third (lowest) level. This would also be the same for the CPU side of things.

Trying to glob CPU/GPU together might not be the best way to implement the algorithms the original post described. To get efficient performance, the algorithms on CPU and GPU could vary quite significantly. So at this level (close to the GPU dialect) I would trade off portability for performance. I did see that the consensus was to do it in both, and I will go with the consensus, but my overall sense from the post is that the algorithms being implemented are GPU-first. Given that, I agree with Mehdi that it is better to keep the terminology closer to GPU / the GPU dialect. Since the GPU dialect is to map to both SPIR-V and the LLVM dialect (NVVM actually), OpenCL/Vulkan terminology is preferable (though there is some CUDA terminology in the GPU dialect like grid_dim and block_dim, which is legacy and no one has had the time to address it).

It would be better not to re-invent terminology. Vulkan terminology is agreed upon by all vendors that support Vulkan (which includes NVIDIA). We could go with workitem/subgroup/workgroup, which map to thread/warp/block in CUDA.

+1. I remembered we used to have such tables in some internal docs but they were not migrated out together with the open sourcing. Agreed that it would be nice to use this opportunity to have such a terminology map. And @Lichtso already did it nicely! :slight_smile:

+1. We already have 4+ columns (AMD, NVIDIA, OpenCL, Vulkan, …) in the terminology map; it would be nice to avoid creating another one, at least in the GPU domain. Since we are aiming to map to different lower-level IRs, settling on a set of terms that follows a standard would be preferable.

Ah, I didn’t get this angle, I see what you mean. That said, does the CPU/Vector dialect already have naming for this hierarchy?
The trade-off seems to be either introducing a new terminology, or using the GPU one and doing the mental mapping when targeting CPUs. I can see pros and cons for both.
I am fairly allergic to Vulkan terminology personally (“invocation” is a terrible term for a SIMD lane, as “invocation” is already used in many contexts, including “invoking a kernel” for example), but if we already use it extensively in the GPU dialect, it would lower the mental load for everyone to use it everywhere consistently, I think.

I have been prototyping these algorithms over the last few days to get the composition right. Apart from the 4 primitives which were already discussed (element-wise compute kernel, reduction, exclusive / inclusive scan), I found 2 additional ones to be very helpful: gather and scatter.

Gather is a permutation and scatter is an inverse permutation; each is the inverse of the other. Additionally, they can also duplicate or discard elements, not just reorder them. For gather this means that the index array can have fewer elements than the maximal index in it plus one, resulting in elements being discarded. Likewise, for scatter a condition mask can be supplied to discard some elements.

They could be implemented as:

  • register moves on the lowest hierarchy level
  • shuffle on the second level
  • load from (gather) / store to (scatter) shared memory on the third level
  • load from (gather) / store to (scatter) buffer / global memory on the fourth level

(I think the vector.shuffle is already a gather as I described it)

Gather and scatter together allow defining the exchange of information between elements without using memref, purely in SSA style. Is that what you meant by “the side-effecting memory world” and “the value world”?
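For illustration, a hedged sketch of what such value-level gather / scatter could look like, with invented `compute.*` ops in MLIR’s generic syntax (nothing here is an existing op):

```mlir
// Gather: result[i] = src[idx[i]]; the index vector may repeat or skip
// indices, so elements can be duplicated or dropped.
%g = "compute.gather"(%src, %idx)
       : (vector<8xf32>, vector<4xi32>) -> vector<4xf32>

// Scatter in value form: result = dst, with result[idx[i]] = src[i]
// wherever %mask[i] is set. No memref is involved, everything stays SSA.
%s = "compute.scatter"(%src, %idx, %mask, %dst)
       : (vector<4xf32>, vector<4xi32>, vector<4xi1>, vector<8xf32>) -> vector<8xf32>
```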

Thanks @Lichtso for digging deeper,

At this point I would be interested in seeing some of the prototypes you mention having in progress, and seeing how things would connect, from some level (it doesn’t have to be the top level; it is probably better to start with the lowest level) down to execution.

In particular, it seems that in your proposal the vector level is only at the highest level and that you have more specific entities at lower levels (new types?). Note that MLIR does not have any register guarantee; it has SSA values of different types, and memory. Any register promotion / allocation is done in LLVM, and one thing we can do from an MLIR perspective atm is to coarsen the granularity of these values (up to the point where LLVM would spill). A vector.contract that operates on 2-D vectors is an example: starting from 1x1 and coarsening yields better and better arithmetic intensity until it gets too big and starts spilling (e.g. 6x16x8xf32 on avx2 and 8x32x32 on avx512 when lowered to vector.outerproduct (i.e. vector.broadcast + vector.fma)).

I seem to understand that you are proposing ops with semantically bearing names + attributes that describe mapping to hardware as “the representation of these primitives”?

I’ll describe some of the things we have in flight to maybe help bridge the delta between what I think you propose and what we have today. As far as basic compute primitives go, I have been thinking in terms of target-independent ways of expressing the computation and how it breaks down into smaller pieces (both in some in-memory form and in SSA-values form). Then, mapping at some level of the hierarchy would be obtained by applying transformations and targeting a primitive of appropriate granularity (i.e. loops + processor ID information, memory accesses, SSA values and special instructions). Such transformations could be driven by special mapping attributes (which atm do not exist in core but may be close to what you are proposing).

As an example, let’s take the WIP flow for linalg.matmul to SPIRV. It resembles:

  1. linalg.matmul -> tile, promote (with padding) and assign loops to blocks -> loops + blockIdx... + copies to shared memory + linalg.matmul (this is done in a few steps but I cram everything in step 1.)
  2. rewrite as vectors into vector.transfer + vector.contract
  3. unroll vector operations to some good sizes for the HW (e.g. what comes in step 4.)
  4. map vector.contract to cooperative matrix instruction (e.g. 1 warp)

As these steps occur, canonicalizations, memory->value forwarding and injection of static behavior happen (e.g. the padding to fixed size which enables vectorization even on boundary tiles).
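To make the flow a bit more tangible, here is a very rough, hypothetical sketch (shapes, tile sizes and names invented; syntax approximate) of what the IR after step 2 might resemble, before the unrolling and mapping of steps 3 and 4:

```mlir
#mA = affine_map<(m, n, k) -> (m, k)>
#mB = affine_map<(m, n, k) -> (k, n)>
#mC = affine_map<(m, n, k) -> (m, n)>
// One 16x16 tile of C per block; the k-loop carries the accumulator as a
// vector SSA value, reading tiles of A and B via vector transfers.
%bx  = gpu.block_id x
%by  = gpu.block_id y
%row = arith.muli %by, %c16 : index
%col = arith.muli %bx, %c16 : index
%res = scf.for %k = %c0 to %c1024 step %c32
         iter_args(%acc = %init) -> (vector<16x16xf32>) {
  %lhs = vector.transfer_read %A[%row, %k], %f0 : memref<?x?xf32>, vector<16x32xf32>
  %rhs = vector.transfer_read %B[%k, %col], %f0 : memref<?x?xf32>, vector<32x16xf32>
  %new = vector.contract {indexing_maps = [#mA, #mB, #mC],
                          iterator_types = ["parallel", "parallel", "reduction"]}
           %lhs, %rhs, %acc : vector<16x32xf32>, vector<32x16xf32> into vector<16x16xf32>
  scf.yield %new : vector<16x16xf32>
}
vector.transfer_write %res, %C[%row, %col] : vector<16x16xf32>, memref<?x?xf32>
```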

In this scheme things are not yet connected in a generic fashion: mapping a vector.contract to threads has implications on how other dependent vector ops must be mapped (TL;DR, there seems to be a need to follow individual scalar SSA values within the vector all the way to memory). So the way this ends up materializing is that SPIR-V has a special type that propagates into the loads/stores, and at the memory level we can bitcast and “all is fine”. So what we have right now is quite a custom lowering to get started.

Note that `vector.contract` is more general because it has parallel/reduction iterators and a permutation to map vector dimensions to iterators. More generally, `vector.contract` could/should be augmented with a region, and then it becomes a "structured op on vectors", similar to linalg.generic which is a "structured op" on buffers. There are representational caveats however, related to SSA values and reductions, that also pop up in the "structured ops on tensors".

However, the general problem is by no means solved.
It seems that a “representation of mapping to resources in the type system” that informs transformations is something that would be needed and could match some of your proposal. Historically, using affine maps to represent such information has worked reasonably well.

The flow described above is “top-down” and relies on a lower-level representation for vector.contract which carries the mapping to threads / warps (i.e. the SPIRV CooperativeMatrix). That type/op “knows about” the mapping, but does not describe it in a generic way.

If I understand the way you naturally described the primitives, it seems to take a “bottom-up” approach. This is great for building unsurprising abstractions that work, and it seems like it would fit the bill for building more general representations for scan/reduce/FFT/transpose/cooperative-matrix type “mapped ops”. How the “mapping attribute” is represented, how it is used to connect to actual HW entities (I would think SSA values at some point in the process (e.g. thread/processor id)) and how it connects to the upper levels is still an open problem AFAIK. Do you already know of good representations that are known to work and compose?

In my experience iterations consisting of switching hats between “top-down” and “bottom-up” while trying to connect pieces that we understand well yields the most satisfying results by decomposing into smaller well-understood problems. Another way to view this is that top-down is transformation/decomposition-driven and bottom-up is maybe more metaprogramming/composition-driven. A lowering is something special that sits at the intersection (e.g. see how vector.contract lowers to vector.broadcast + vector.fma: it is applied as a transformation but it is really metaprogramming a contract with insert/extract/broadcast and fma).
The tradeoff seems to be that tiling, fusion, moving memory around seem more prone to generalizing in a transformation setting whereas mapping to some HW dimension, rewriting X as smaller Ys and data distribution “seem simpler” in a metaprogramming setting.

I’ll stop the digression here because it may be too disconnected from your proposal but I am looking forward to seeing the concrete examples that your lowest-level abstractions would take and how they connect to high-performance execution on HW.

Exciting stuff!

We definitely want gather and scatter; however, I am afraid vector.shuffle is not it. That instruction closely models LLVM’s, and the constraint is that the pattern is fully static.

This is one of the tricky disconnects and does relate to value vs. memory: values are generally indexed with static indices. The exception is (1-D)-flattened LLVM vectors, which can be insertelement/extractelement one scalar at a time. This relates to the deeper corners of the Vector dialect deeper dive.

I think the fundamental reason is that values boil down to registers for which one must know the number statically to emit assembly (e.g. %reg0, %reg1 and not reg[%reg0]) vs memory where load/store from a (sometimes aligned) pointer works.
You can see this materialize into spills in lmem in CUDA when “your implementation of blah” does not fully unroll. CUDA likes to give the impression of an array of registers per thread but will spill when register numbers can’t be determined statically.

The vast majority of gather/scatter semantics go through memory.
Something close is vector.transfer, which lowers to llvm.masked.load / llvm.masked.store intrinsics.
Similar instructions for gather / scatter that would lower to the proper llvm.masked.* intrinsics would be great to have.
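For reference, a minimal sketch of that existing value <-> memory bridge (syntax approximate; the memrefs, indices and padding scalar are assumed to be defined):

```mlir
// Transfer read/write with a padding value; when the bounds are not known to
// divide the vector size, the LLVM lowering can fall back to masked
// load/store style intrinsics.
%v = vector.transfer_read %A[%i], %f0 : memref<?xf32>, vector<8xf32>
vector.transfer_write %v, %B[%i] : vector<8xf32>, memref<?xf32>
```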

This is for CPU where I don’t think there is such a thing as a portable in-register gather/scatter (different ISAs may have their special intrinsics though).

For GPU, I imagine you are thinking about cross-warp + ballot at the lowest level of granularity?
Note that despite what it says, I think warp-level synchronization is just an illusion and still goes through the memory unit (at least it was when I used warp shuffles for FFT back in 2014…).

Maybe it’s all about giving the illusion of flexible memory accesses + register performance to a user (but it seems to scale to 32 for the past ~10 years?).
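(For reference, the GPU dialect already exposes a warp/subgroup-level shuffle roughly along these lines; the snippet below is only an approximate sketch of its shape:)

```mlir
// Butterfly exchange within a subgroup: each invocation reads the value held
// by the lane whose id differs by the bits of %offset; %valid flags whether
// the source lane was active.
%shfl, %valid = gpu.shuffle xor %val, %offset, %width : f32
```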

The idea is to have one abstract type (n-D vector / multidimensional array) and have it lowered to different memory / value types depending on its size, the hierarchy level it occurs in and the target architecture.

E.g:

  • a level 4 (grid) reduction operates on a memref array and decomposes into
  • a dynamic set of level 3 reductions which split the original array to operate on shared memory
  • a fixed set of level 2 reductions which split the shared memory to operate on SIMD-vectors
  • depending on the target architecture: a fixed set of level 1 reductions which split the SIMD-vectors to operate on registers, or just a single SIMD instruction.

All of these levels could still be talking about generic arrays, just different sizes of them to fit in their specific actual memory type. However, that requires marking either the values, types or operations as being on a specific level in the hierarchy. That might mean that we have to use multiple or polymorphic types to represent this marking / the attributes.
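A hedged sketch of what such level markings might look like, using invented op and attribute names in MLIR’s generic syntax (nothing here exists upstream; sizes are made up):

```mlir
// The same named op at each level, distinguished only by a "level" attribute
// that drives the progressive lowering to the next level down.
%sum  = "compute.reduce"(%buf)  {kind = "add", level = "grid"}
          : (memref<1048576xf32>) -> f32
// ...decomposes into workgroup-level reductions over shared-memory tiles...
%part = "compute.reduce"(%tile) {kind = "add", level = "workgroup"}
          : (memref<4096xf32, 3>) -> f32
// ...which decompose into subgroup-level reductions on vector SSA values.
%lane = "compute.reduce"(%vec)  {kind = "add", level = "subgroup"}
          : (vector<32xf32>) -> f32
```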

Yes, the operation name describes what it does and the attributes describe how it is done.

The way I am currently prototyping is: I started with a bunch of common high level problems and implemented them with the proposed primitives to get a decomposition. This way I hope to get the interfaces of the primitives and their expected behavior right first, and only then think about their realization on the hardware.
The primitives are inspired by this paper: Single-pass Parallel Prefix Scan with Decoupled Look-back. Take a look at section 5.2; there you will also find some of the algorithms which can be built using these primitives.

Ah, good to know! Indeed, gather and scatter only make sense if they can handle dynamic indices.

Actually, this would be the second level (sub groups) of the hierarchy.

Yes, to some degree. Often the e.g. 64 lanes are actually 16 4-lane SIMD vectors executed in sequence, or something like that. But ultimately it does not matter, as the ISA hides that away and lets them appear as happening in lockstep / at once.

I think what I want to find is a definition which enables the user to describe:

  • an abstract data flow
  • in a bundled manner (as multi-dimensional arrays)
  • and express lane-changes / swaps / swizzles / shuffles of these elements
  • without explicitly referring to memory (functional style).

Yup, naming is the #1 hard problem in computer science. :wink: OpenCL terminology is also a viable choice here, as invocation/thread/block/etc. are all very overloaded terms. I personally feel it’s clean and consistent. It’s also better established.

On the SPIR-V side we also have an unlimited number of virtual registers. The real register allocation happens in the GPU driver compilers. Generally we also don’t want to put too much pressure on VGPRs; otherwise it would hurt GPU occupancy and limit the GPU’s ability to leverage zero-cost context switches to hide memory latency.

Just wanted to say that this is not the only way to lower a tiled small matmul to SPIR-V at the subgroup level. As with many things in Vulkan/SPIR-V, cooperative matrix is an extension that requires special capabilities. It’s vendor-specific right now. We’d like to use it whenever possible, but for the cases where we don’t have it, we still need to leverage what we have at hand as much as possible. Subgroup operations are core to Vulkan 1.1 and are available on both desktop and mobile GPUs. Converting tiled small matmuls into these subgroup operations can potentially give us great performance. I believe these subgroup operations are what @Lichtso is trying to use. So this is an awesome path to build out in parallel to the cooperative matrix path.

Not sure I get this, but I might be missing something. :slight_smile: My impression is that at the subgroup level (level #2) we are modelling subgroups, so it should stay with whatever native subgroup size the hardware provides? (The size itself can be a parameter according to the hardware, for sure. In SPIR-V we have constructs allowing querying such information, and it is used to drive pattern application. We can have similar fields for subgroup size, for example, and that information is exposed by the hardware via Vulkan API calls.) Using a multiple of the native subgroup size seems to be an upper level’s concern. And the subgroup operations do have hardware support; for example, see AMD’s. IIRC, these originated from needs in graphics: it’s quite common for graphics to do sampling that needs to access neighboring pixels, so being fast really matters there.

That is correct.

I recently helped implement the VkPhysicalDeviceSubgroupProperties structure in MoltenVK. It turns out that while you can get the hardware width, there can also be varying sizes per dispatch: VK_EXT_subgroup_size_control.html.