LLVM Discussion Forums

Tile/LowerToLoops + Distribute to processors with Linalg ops

Currently, Linalg tiling and lowering to loops generates scf.parallel/scf.for operations. In IREE, for GPU code generation, tiling is used as a way to map to the different GPU compute hierarchies. Specifically, the scf.parallel loops generated for the inter-tile iteration can be distributed across processors at each level. (More info here). For IREE itself, the explicit instantiation of the scf.parallel is unnecessary. It would be much better if the tiling and lowering-to-loops transformations in Linalg would also distribute the iterations across processors. This post is to highlight the use cases we have in IREE and to collect thoughts on the best way to add this feature to Linalg. Presumably this would simplify the transformations in IREE quite a bit, and it would also be useful for similar use cases (more on this later).

Use case 1 : Distributing iterations to processors in a cyclic manner (Most general case)

Starting with a scf.parallel operation of this form

scf.parallel (%arg0, %arg1) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1) {
  ...
}

can be distributed amongst a 2D grid of processors (%nx, %ny) in a cyclic manner to get

scf.for %arg0 = (%lb0 + %pidy * %s0) to %ub0 step (%ny * %s0) {
  scf.for %arg1 = (%lb1 + %pidx * %s1) to %ub1 step (%nx * %s1) {
    ...
  }
}

Such a distribution does not assume any relationship between the number of iterations and the number of processors that each parallel loop is distributed to.
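To make the cyclic scheme concrete, here is a small Python model (an illustration of mine, not code from any patch) checking that the per-processor loops above jointly cover exactly the iterations of the original scf.parallel:

```python
def cyclic_iters(lb, ub, step, pid, nprocs):
    """Iterations executed by processor `pid` under cyclic distribution:
    lb + pid*step, lb + (pid + nprocs)*step, ... (mirrors the scf.for above)."""
    return list(range(lb + pid * step, ub, nprocs * step))

# No relationship is assumed between iteration count and processor count:
# here 7 iterations are spread over 4 processors.
lb, ub, step, nprocs = 0, 20, 3, 4
covered = sorted(i for pid in range(nprocs)
                 for i in cyclic_iters(lb, ub, step, pid, nprocs))
assert covered == list(range(lb, ub, step))  # exact cover, no duplicates
```

Each processor simply starts at its own offset and strides by `nprocs * step`, which is exactly what the distributed scf.for nest expresses.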

Use case 2 : Distributing iterations to processors in a cyclic manner with an over-subscribed number of processors

If it is known that the number of processors (%nx, %ny) is greater than or equal to the number of iterations (floordiv(%ub0 - %lb0, %s0), floordiv(%ub1 - %lb1, %s1)), then you can generate IR with much less overhead.

%n0 = floordiv(%ub0 - %lb0, %s0)
%n1 = floordiv(%ub1 - %lb1, %s1)
%0 = cmpi "slt", %pidy, %n0
%1 = cmpi "slt", %pidx, %n1
scf.if (%0 && %1) {
  %arg0 = %lb0 + %pidy * %s0
  %arg1 = %lb1 + %pidx * %s1
  ...
}
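The control flow of this use case can be modeled the same way (again, just an illustration of mine): with at least as many processors as iterations, each processor runs at most one iteration, guarded by the compare:

```python
def guarded_iter(lb, ub, step, pid):
    """Single guarded iteration per processor, mirroring the scf.if above."""
    n = (ub - lb) // step  # floordiv(%ub - %lb, %s); step divides ub - lb
                           # here, so this equals the iteration count
    if pid < n:            # cmpi "slt", %pid, %n
        return lb + pid * step
    return None            # processor is idle

# 7 iterations, 8 processors: the last processor does nothing.
iters = [guarded_iter(0, 21, 3, pid) for pid in range(8)]
assert [i for i in iters if i is not None] == [0, 3, 6, 9, 12, 15, 18]
```

Dropping the loop entirely is what makes this cheaper than the cyclic form: there is no back-edge, only a guard.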

Use case 3 : Distributing iterations to processors in a cyclic manner with matching number of processors

If it is known that the number of iterations exactly matches the number of processors, this can be simplified further to

%arg0 = %lb0 + %pidy * %s0
%arg1 = %lb1 + %pidx * %s1

Note that all of these use cases rely on some information that is not deducible from the IR.

There are a couple of existing mechanisms to map loops to processors:

  1. The mapLoopsToProcessorIds method here. This can be used to implement use case 1, but relies on materializing the original scf.parallel operation.
  2. The SCFToGPU lowering has a mapping attribute that is used to implement the distribution. This is closely tied to the gpu.launch mechanism. It is a really nice idea IMO, and the fact that it controls the number of processors as well means that the transformation is correct by construction. But in use cases similar to IREE's, where the lowering is used only for device-side code generation, this mechanism can't be used. In such situations it is much better to distribute the loop by construction.

(If there are other approaches, please let me know)

In terms of implementation, one option I can think of is to have an enum that represents the different distributions. LinalgTilingOptions can be updated to include this to distribute the inter-tile loops. This is slightly painful because it won't allow for custom distributions. Maybe the use of an AffineMap to specify the distribution is an option, but it's not clear to me how to encode the assumptions that allow for control-flow simplification. I am wondering if there is a way to let the client of the tiling/lowering-to-loops transformations provide a custom distribution (and not use some hard-coded strategies)?
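The "client-provided distribution" idea can be sketched in a language-neutral way (Python here purely for concision; all names are hypothetical and not the actual LinalgTilingOptions API): the tiling driver asks a callback how to assign iterations to processors, instead of choosing from a fixed enum of strategies.

```python
# Hypothetical sketch: the driver stays strategy-agnostic and delegates
# the iteration -> processor assignment to a client-supplied callback.
def tile_and_distribute(lb, ub, step, distribute):
    """`distribute` maps the loop bounds to a dict of pid -> iterations."""
    return dict(distribute(lb, ub, step))

def cyclic(nprocs):
    """One possible client strategy: cyclic distribution (use case 1)."""
    def fn(lb, ub, step):
        return {pid: list(range(lb + pid * step, ub, nprocs * step))
                for pid in range(nprocs)}
    return fn

# The client picks (or writes) the strategy; a custom one plugs in the
# same way, which a hard-coded enum would not allow.
sched = tile_and_distribute(0, 20, 3, cyclic(4))
assert sched[1] == [3, 15]
```

The trade-off discussed above is visible here: a callback admits arbitrary client strategies, but the driver can no longer see the assumptions (e.g., "enough processors for all iterations") that would let it simplify the generated control flow.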

Note that simplifying the control flow generated this way will make some other transformations much easier, since you might not need to move instructions across region boundaries.

(fyi : @nicolasvasilache, @ThomasRaoux, @antiagainst, @herhut, @hanchung)

I am not sure what "distribute the loop by construction" refers to here. The approach using mapping attributes was not meant to infer these attributes. Instead, the producer of the loops should supply the mapping as well. In the end, one does the tiling etc. with a specific mapping in mind, so it also makes sense to supply that mapping.

We currently have no producers for the mapping attribute except for a greedy mapper that was written mostly as a test pass. So if you want to add support to LinAlg to produce the right mapping annotations from the start, that would be awesome!

How about having a couple of hard-wired distribution strategies that generate loops with mapping attributes right away? For custom use cases, also allow tiling without mapping attributes; users can then write a pass that produces exactly the mapping they desire. That way they get to reuse the tiling and the transformation from LinAlg to loops, as well as the transformation from loops to SIMT.

scf.parallel documents what is possible rather than required behavior. E.g., it just says the loop could be executed in parallel; it need not be executed as such, but it could be. At least until it is mapped; then you have more explicit semantics and can't just lower it to a single, plain for loop.

+1 and this should be able to allow lowering to LinAlg to handle it.

@herhut, @jpienaar Thanks for the comments. I realize now that I wasn't clear that the distribution is meant to be optional. The default will be what is currently done, i.e., explicitly instantiating the scf.parallel. This is an additional option that allows you to directly generate the distributed loops.

I am working on a patch showing how this looks in Linalg. Will send it out for review not as a finished product, but as something to anchor the discussion and evolve.

I implemented a prototype of what I had in mind.
Here is the patch for that : https://reviews.llvm.org/D85147

It works in the context of Linalg (and for the use cases I listed above). The change above is a first draft and open to evolving based on input. The difference from the comments above, though, is that it distributes the loop while generating it. So you can pass the distribution as an option to Linalg tiling as follows:

            .setTileSizes({8, 8, 4})

Effectively it allows generating parallel code from Linalg using a similar specification as with sequential code, but with an extra option.
I am not saying that the explicit instantiation of scf.parallel and its distribution is never needed. Whether to use the approach above or the existing scf.parallel -> GPU lowering depends on the use case.

Thanks Mahesh for tackling mapping to SPMD loops and for making it play nicely with the patterns.

This is one of the important transformations that we have been discussing for some time but didn’t have the resources to attack yet. I am very excited to see progress on this!

This all looks great to me, my only comment would be re. “Use case 2:” and “Use case 3:”.
I think one of the longer-term goals should be that we can express everything like "Use case 1:", plus an appropriate std.assume / std.assert / new special op carrying the relevant information, and then canonicalizations should kick in.

In practice, many canonicalizations are missing to get there (for instance, scf.if doesn't even know how to shed an empty else block), so it makes total sense to start with specific implementations for now.