LLVM Discussion Forums

Deprecate use of scf.for (previously loop.for) to gpu.launch conversion

There are currently three separate passes that lower from the SCF dialect to the GPU dialect.

The first two convert scf.for to gpu.launch operations, while the last one converts scf.parallel to gpu.launch. The scf.for to gpu.launch lowering predates the scf.parallel one and has weaker semantics, since it has to assume that the loops it maps are parallel. It is worth deprecating in favor of the scf.parallel to gpu.launch conversion.
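
To make the difference concrete, here is a minimal hand-written IR sketch (the loop bodies and the index values %lb, %ub, and %c1 are illustrative placeholders, not taken from any of the passes above): scf.for carries no information about whether its iterations are independent, whereas scf.parallel states this as part of the op's semantics.

```mlir
// %lb, %ub, and %c1 are `index` values assumed to be defined elsewhere.

// scf.for: nothing in the IR says the iterations are independent, so the
// deprecated conversion had to assume it.
scf.for %i = %lb to %ub step %c1 {
  // ... loop body ...
}

// scf.parallel: independence of iterations is part of the op's semantics,
// so the scf.parallel to gpu.launch conversion needs no such assumption.
scf.parallel (%i) = (%lb) to (%ub) step (%c1) {
  // ... loop body ...
}
```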

The intent of this post is to get feedback on whether anybody is using these conversions before they are removed. The current plan is to remove them in the last week of May (after the ongoing renaming of the loop dialect to the scf dialect is complete, which also gives folks more time to respond).

This also means that the following functions would be removed

since they implement the core functionality of the above passes.

@ftynse, @herhut, @antiagainst for visibility.

Makes sense to me! It seems like an “assumption” that the loops are parallel should be conveyed in the IR by transforming them into scf.parallel? I suspect this was only done this way because it predates scf.parallel existing in the first place?

Exactly.

If non-scf.parallel loops need to be mapped, something has to implement parallelism discovery (e.g., affine analysis) and record its results as parallel loops.

Yes, that’s right. If it were possible to go back in time, it would have been better to develop scf.parallel first and then implement the lowering from that to GPU. This now retroactively gets us there?

The plan is still to tackle this next week.

Patch D80747 removes the scf.for to gpu.launch conversion. I kept the affine.for to gpu.launch conversion in. It may need to be changed to an affine.parallel to gpu.launch conversion (I am not familiar enough with it to take that on, but would like someone who is to tackle it).

@bondhugula for visibility/comments

Replicating my comment from the review:

the maximum code-reuse path would be affine.for -> (affine dependence analysis) -> affine.parallel -> scf.parallel -> gpu.kernel. I would go for it unless there are things that can be expressed at both affine and gpu but cannot be expressed at the scf level.


Makes sense. A similar change to instead convert affine.parallel to gpu.launch would be reasonable; it isn’t any different from such a switch on the scf.for/parallel side. In fact, mlir::isLoopParallel could readily be used to turn affine.for’s into affine.parallel’s. However, one may additionally want to collapse multiple 1-d affine.parallel’s into a single multi-dimensional one. What impact would that have on the conversion mechanics, and don’t you also need something similar for scf?
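
As a rough sketch of what that could look like in the IR (hand-written, with made-up constant bounds, not the output of an actual pass):

```mlir
// A 2-d affine.for nest whose iterations carry no dependences.
affine.for %i = 0 to 128 {
  affine.for %j = 0 to 64 {
    // ... dependence-free loop body ...
  }
}

// After detecting parallelism (e.g. with mlir::isLoopParallel) and
// collapsing the two 1-d parallel loops, a single multi-dimensional
// affine.parallel:
affine.parallel (%i, %j) = (0, 0) to (128, 64) step (1, 1) {
  // ... dependence-free loop body ...
}
```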

+1. affine.parallel’s can just be lowered to scf.parallel’s, and we could remove everything from affine to gpu.launch unless there is polyhedral information that may directly benefit the scf.parallel to gpu.launch lowering.
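
For completeness, a hand-written approximation of that lowering step (not actual pass output; bounds are the same made-up constants as above):

```mlir
// Input: a 2-d affine.parallel.
affine.parallel (%i, %j) = (0, 0) to (128, 64) step (1, 1) {
  // ... loop body ...
}

// After lowering out of affine (e.g. via -lower-affine), roughly the
// following scf.parallel, with %c0, %c1, %c64, %c128 materialized as
// `index` constants:
scf.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c1, %c1) {
  // ... loop body ...
}
```

From there the existing scf.parallel to gpu.launch conversion applies unchanged.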