LLVM Discussion Forums

Tile/LowerToLoops + Distribute to processors with Linalg ops

Currently, Linalg tiling and lowering to loops generates scf.parallel/scf.for operations. In IREE, for GPU code generation, tiling is used as a way to map to the different GPU compute hierarchies. Specifically, the scf.parallel loops generated for the inter-tile iteration can be distributed across processors at each level. (More info here). For IREE itself, the explicit instantiation of the scf.parallel is unnecessary. It would be much better if the tiling and lowering-to-loops transformations in Linalg would also distribute the iterations across processors. This post is to highlight the use cases we have in IREE and to collect thoughts on the best way to add this feature to Linalg. Presumably this would simplify the transformations in IREE quite a bit, and it would also be useful for similar use cases (more on this later).

Use case 1 : Distributing iterations to processors in a cyclic manner (Most general case)

Starting with a scf.parallel operation of this form

scf.parallel (%arg0, %arg1) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1) {
  ...
}

can be distributed amongst a 2D grid of processors (%nx, %ny) in a cyclic manner to get

scf.for %arg0 = (%lb0 + %pidy * %s0) to %ub0 step (%ny * %s0) {
  scf.for %arg1 = (%lb1 + %pidx * %s1) to %ub1 step (%nx * %s1) {
    ...
  }
}

Such a distribution does not assume any relationship between the number of iterations and the number of processors that each parallel loop is distributed to.
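To make the cyclic scheme concrete, here is a small Python model (an illustration of mine, not code from any patch) checking that the per-processor loops above jointly cover exactly the iterations of the original scf.parallel:

```python
def cyclic_iters(lb, ub, step, pid, nprocs):
    """Iterations executed by processor `pid` under cyclic distribution:
    lb + pid*step, lb + (pid + nprocs)*step, ... (mirrors the scf.for above)."""
    return list(range(lb + pid * step, ub, nprocs * step))

# No relationship is assumed between iteration count and processor count:
# here 7 iterations are spread over 4 processors.
lb, ub, step, nprocs = 0, 20, 3, 4
covered = sorted(i for pid in range(nprocs)
                 for i in cyclic_iters(lb, ub, step, pid, nprocs))
assert covered == list(range(lb, ub, step))  # exact cover, no duplicates
```

Each processor simply starts at its own offset and strides by `nprocs * step`, which is exactly what the distributed scf.for nest expresses.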

Use case 2 : Distributing iterations to processors in a cyclic manner with an over-subscribed number of processors

If it is known that the number of processors (%nx, %ny) is greater than or equal to the number of iterations (floordiv(%ub0 - %lb0, %s0), floordiv(%ub1 - %lb1, %s1)), then you can generate IR with much less overhead.

%n0 = floordiv(%ub0 - %lb0, %s0)
%n1 = floordiv(%ub1 - %lb1, %s1)
%0 = cmpi "slt", %pidy, %n0
%1 = cmpi "slt", %pidx, %n1
scf.if (%0 && %1) {
  %arg0 = %lb0 + %pidy * %s0
  %arg1 = %lb1 + %pidx * %s1
  ...
}
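The control flow of this use case can be modeled the same way (again, just an illustration of mine): with at least as many processors as iterations, each processor runs at most one iteration, guarded by the compare:

```python
def guarded_iter(lb, ub, step, pid):
    """Single guarded iteration per processor, mirroring the scf.if above."""
    n = (ub - lb) // step  # floordiv(%ub - %lb, %s); step divides ub - lb
                           # here, so this equals the iteration count
    if pid < n:            # cmpi "slt", %pid, %n
        return lb + pid * step
    return None            # processor is idle

# 7 iterations, 8 processors: the last processor does nothing.
iters = [guarded_iter(0, 21, 3, pid) for pid in range(8)]
assert [i for i in iters if i is not None] == [0, 3, 6, 9, 12, 15, 18]
```

Dropping the loop entirely is what makes this cheaper than the cyclic form: there is no back-edge, only a guard.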

Use case 3 : Distributing iterations to processors in a cyclic manner with matching number of processors

If it is known that the number of iterations exactly matches the number of processors, this can be simplified further to

%arg0 = %lb0 + %pidy * %s0
%arg1 = %lb1 + %pidx * %s1

Note that all of these use cases rely on some information that is not deducible from the IR.

There are a couple of existing mechanisms to map loops to processors:

  1. The mapLoopsToProcessorIds method here. This can be used to implement use case 1, but relies on materializing the original scf.parallel operation.
  2. The SCFToGPU lowering has a mapping attribute that is used to implement the distribution. This is closely tied to the gpu.launch mechanism. It is a really nice idea IMO, and the fact that it controls the number of processors as well means that the transformation is correct by construction. But in use cases similar to IREE's, where the lowering is used only for device-side code generation, this mechanism can't be used. In such situations it is much better to distribute the loop by construction.

(If there are other approaches, please let me know)

In terms of implementation, one option I can think of is to have an enum that represents the different distributions. LinalgTilingOptions can be updated to include this to distribute the inter-tile loops. This is slightly painful because it won't allow for custom distributions. Maybe the use of an AffineMap to specify the distribution is an option, but it's not clear to me how to encode the assumptions that allow for control-flow simplification. I am wondering if there is a way to let the client of the tiling/lowering-to-loops transformations provide a custom distribution (and not use some hard-coded strategies)?
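The "client-provided distribution" idea can be sketched in a language-neutral way (Python here purely for concision; all names are hypothetical and not the actual LinalgTilingOptions API): the tiling driver asks a callback how to assign iterations to processors, instead of choosing from a fixed enum of strategies.

```python
# Hypothetical sketch: the driver stays strategy-agnostic and delegates
# the iteration -> processor assignment to a client-supplied callback.
def tile_and_distribute(lb, ub, step, distribute):
    """`distribute` maps the loop bounds to a dict of pid -> iterations."""
    return dict(distribute(lb, ub, step))

def cyclic(nprocs):
    """One possible client strategy: cyclic distribution (use case 1)."""
    def fn(lb, ub, step):
        return {pid: list(range(lb + pid * step, ub, nprocs * step))
                for pid in range(nprocs)}
    return fn

# The client picks (or writes) the strategy; a custom one plugs in the
# same way, which a hard-coded enum would not allow.
sched = tile_and_distribute(0, 20, 3, cyclic(4))
assert sched[1] == [3, 15]
```

The trade-off discussed above is visible here: a callback admits arbitrary client strategies, but the driver can no longer see the assumptions (e.g., "enough processors for all iterations") that would let it simplify the generated control flow.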

Note that simplifying the control flow generated this way will make some other transformations much easier, since you might not need to move instructions across region boundaries.

(fyi : @nicolasvasilache, @ThomasRaoux, @antiagainst, @herhut, @hanchung)

I am not sure what "distribute the loop by construction" refers to here. The approach using mapping attributes was not meant to infer these attributes. Instead, the producer of the loops should supply the mapping as well. In the end, one does the tiling etc. with a specific mapping in mind, so it also makes sense to supply that mapping.

We currently have no producers for the mapping attribute except for a greedy mapper that was written mostly as a test pass. So if you want to add support to LinAlg to produce the right mapping annotations from the start, that would be awesome!

How about having a couple of hard-wired distribution strategies that generate loops with mapping attributes right away? For custom use cases, also allow tiling without mapping attributes; users can then write a pass that produces exactly the mapping they desire. That way they get to reuse the tiling and the transformation from LinAlg to loops, as well as the transformation from loops to SIMT.

scf.parallel documents what is possible rather than required behavior. E.g., it just says the loop could be executed in parallel; it need not be executed as such, but it could be. At least until it is mapped; then you have more explicit semantics and can't just lower it to a single, plain for loop.

+1 and this should be able to allow lowering to LinAlg to handle it.

@herhut, @jpienaar Thanks for the comments. I realize now that I wasn't clear that the distribution is meant to be optional. The default will be what is currently done, i.e., explicitly instantiating the scf.parallel. This is an additional option that allows you to directly generate the distributed loops.

I am working on a patch showing how this looks in Linalg. Will send it out for review not as a finished product, but as something to anchor the discussion and evolve.

I implemented a prototype of what I had in mind.
Here is the patch for that : https://reviews.llvm.org/D85147

It works in the context of Linalg (and for the use cases I listed above). The change above is a first draft and open to evolving based on input. The difference from the comments above, though, is that it distributes the loop while generating it. So you can pass the distribution as an option to Linalg tiling as follows:

            .setTileSizes({8, 8, 4})

Effectively it allows generating parallel code from Linalg using a similar specification as with sequential code, but with an extra option.
I am not saying that the explicit instantiation of scf.parallel and its distribution is never needed. Whether to use the approach above or the existing scf.parallel -> GPU lowering depends on the use case.

Thanks Mahesh for tackling mapping to SPMD loops and for making it play nicely with the patterns.

This is one of the important transformations that we have been discussing for some time but didn’t have the resources to attack yet. I am very excited to see progress on this!

This all looks great to me, my only comment would be re. “Use case 2:” and “Use case 3:”.
I think one of the longer-term goals should be that we can express everything like "Use case 1:", plus an appropriate std.assume / std.assert / new special op carrying the relevant information, and then canonicalizations should kick in.

In practice, many canonicalizations are missing to get there (for instance, scf.if doesn't even know how to shed an empty else block), so it makes total sense to start with specific implementations for now.