Using attributes to specify workgroup configuration when lowering to GPU

mehdi_amini · February 11, 2020, 10:25pm

I still have a bunch of questions that I don’t feel were really answered in the thread so far?

antiagainst · February 14, 2020, 8:17pm

Beyond testing: the attribute situation isn’t great though. How do these attributes get placed on a particular IR construct? This can be fragile across the pipeline so if this becomes the pass configuration you need to insert them right before the pass. But then how do they get inserted? If there is a magic way of determining this information (we have an Oracle) then why do we need to materialize as an attribute? The Oracle can be an interface that the pass queries…

For real use cases, aren’t we already having such “oracles”? At the current moment if a pass needs some configuration to work, one needs to write that pass with a constructor requiring parameters. Still use the GPU case as an example:

github.com

llvm/llvm-project/blob/308571074c13ea2a0758aa085aa02f72150f891e/mlir/lib/Conversion/LoopsToGPU/LoopsToGPUPass.cpp#L120-L131


      
          std::unique_ptr<OpPassBase<FuncOp>>
          mlir::createSimpleLoopsToGPUPass(unsigned numBlockDims,
                                           unsigned numThreadDims) {
            return std::make_unique<ForLoopMapper>(numBlockDims, numThreadDims);
          }
          
          std::unique_ptr<OpPassBase<FuncOp>>
          mlir::createLoopToGPUPass(ArrayRef<int64_t> numWorkGroups,
                                    ArrayRef<int64_t> workGroupSize) {
            return std::make_unique<ImperfectlyNestedForLoopMapper>(numWorkGroups,
                                                                    workGroupSize);
          }

When such a pass is used, the user need to specify the configuration so the user is acting as the oracle here. The attribute approach specifies it as attributes to the IR instead. (But it has benefits as discussed in the above.)

For my use case, now I can compose a general pipeline to compile a function containing some HLO ops down to a SPIR-V module. Before invoking the pipeline, I can attach the “configuration attributes” needed to the input IR:

func @kernel(
  %arg0 : ... { spv.interface_var_abi = #spv.interface_var_abi<PushConstant, bind(0, 1)> },
  ...
) attributes {
  spv.target_env = #spv.target_env<V_1_3, [SomeExtension], [Shader, OtherCapabilities], ...>
  loop.num_workgroups = dense<...>, loop.workgroup_size = dense<...>
} {
  xla_hlo.* ...
}

And then invoke the compilation pipeline. In the compilation pipeline, patterns/passes related to loop conversion will pick up the loop.* attributes for their configuration and patterns/passes related to SPIR-V conversion will pick up the spv.* attributes. Similarly for others. I can then switch to another set of “configuration attributes” and use the same pipeline to compile towards another, say, SPIR-V target environment.

Also as Uday and others pointed out, one can also fine control each op’s configuration in this way.

mehdi_amini · February 15, 2020, 6:54am

I see this as fundamentally different: one is a global configuration while the other is at the granularity of an operation. When one set a global parameter, it isn’t an “oracle” but a heuristic: it applies identically to any piece of inputs.

I have strong concerns about this right now. I don’t believe this is a sound approach in general and this is why I wanted clarification about the plan “beyond testing a pass”. The attributes aren’t gonna be preserved across the pipeline: if an attribute is intended to be the pass configuration it has to be attached exactly before the pass runs and removed immediately after.
Even this is not robust: the pass itself would have to be careful about handling the attributes while transforming the IR to not create inconsistent state.

nicolasvasilache · February 15, 2020, 5:09pm

I’d recommend looking at how an example of a custom pass can be built as I do here. In this particular case I am interested in demonstrating pass fusion by composing local patterns. The hope is that this can start paving the road towards a world in which phase ordering is much less problematic.

The reason I mention this is because I use attributes in this case to build a sort of finite state machine that drives how patterns are applied. In such a world I think conveying information for transformations is reasonable, precisely because the attributes do not escape pass boundaries.

I definitely see mapping to virtual processor ids as an additional pattern that fits this philosophy. But I see it as a pattern that gets composed in the mix until the attributes disappear and turn into proper SSA values (e.g. %threagIdx.x and friends).

So while I am supportive of using attributes I think it is very important to have a good story about going to SSA values without escaping the scope of a particular pass.

In the same way that we don’t want quadratic or worse behavior for ops(I.e ops must almost never know about each other to transform) we have the same issue here: we must not let passes leak or we will risk having passes need to know about each other. For more discussion see the lessons from LIfT in the Linalg rationale.

Lastly if folks are interested in a glimpse of the direction this will take, I am building towards very custom fused passes with configurable knobs. We will have a bunch of passes that will match and transform specific ops, implemented as “bags of patterns + knobs“. Once we get enough of those we can throw a learning system on top and we will have this mixture of expert compilers I’ve been talking about (internally) for more than a year.

Anyway, TL;DR my position is yes to attributes, no to leaking them across pass boundaries.

ftynse · February 17, 2020, 12:36pm

I agree that we need a finer-grained control over how the pass is applied that is detachable from the heuristic. I also think it’s important to control the complexity of what gets exposed in those attributes, and what is their lifetime. Otherwise, we risk creating auxiliary data structures in attributes that, in the extreme case, become IR-building scripts that should have been implemented in the code.

What seems to be an issue here is the separation between op-scope transformations and a pass, or that between transformation “utility” functions and heuristics that control the application of those transformations. A pass would be a heuristic that controls which op-level transformations are applied. Whether it communicates to with them through attributes or through, e.g., arguments of the function that implements the transformation is a design choice. What this proposition aims at is a stronger decoupling of the heuristic from the op-level transformations, which is valuable if we want to explore multiple different heuristics. We need some general way of doing so without leading to an explosive number of similar passes and without leaking the abstraction to other passes.

mehdi_amini · February 18, 2020, 3:47am

I don’t understand why we’d ever want to use attribute in this case though: when would that be a preferred design choice?

bondhugula · February 18, 2020, 4:58am

As one example: one could print the op at an intermediate stage and take a look at all its attributes, which may specify how the op will be transformed / lowered further; similarly, a compiler developer could test the API or experiment otherwise by changing IR text (instead of using builder API). To take it to the extreme, one could encode an entire scheduling language using op attributes. It could be useful for a compiler developer’s productivity and for modularity to have this textual representation. On this topic, IMO the most important thing (I see others also making this point) is to carefully analyze what the lifetime of this op is, and all the things (transforms/analyses) that happen / that are expected to be implemented in the future during that lifetime. I don’t think there is a blanket answer on “use/don’t use” such attributes in general.

mehdi_amini · February 18, 2020, 6:04am

This could be achieved by having the test pass load the parameters from the IR and then pass them through the API to the function that perform the transformation. You don’t need to materialize them on the IR to call this function though.

Right, but since passes are modular and supposed to be composable in various order in different pipeline, that kind of exclude such analysis across passes (which I think was also @nicolasvasilache’s point above).

bondhugula · February 18, 2020, 7:47am

Having to create such test passes is exactly the thing to avoid in such cases - a useful by-product of using intermediate ops with attributes. There would be multiple lowering paths that map to these ops with attributes in different ways, and there’s typically a strong need to have a textual representation of that information/ops – otherwise, (a) you’d miss simplifications to perform on that form (for eg. attributes like maps and sets could bind to SSA value operands), and (b) the lowering would be too drastic with too much in-memory transformation. A test pass with different command line options can’t be a substitute for an intermediate op with attributes in several cases.

mehdi_amini · February 18, 2020, 8:23pm

We’re likely talking past each other, because I have no idea what you mean with your last answer right now.

antiagainst · February 18, 2020, 11:01pm

I see this as fundamentally different: one is a global configuration while the other is at the granularity of an operation. When one set a global parameter, it isn’t an “oracle” but a heuristic: it applies identically to any piece of inputs.

I’m not sure how the difference would be useful in reality here. The configurations baked in to the pass at construction time is only “global” to the input IRs that are suitable for those configurations. I would doubt that there is one set of global heuristic that works everywhere. It’s just that one looks at the current (set of) input IRs that one want to target and decides, okay, this set of heuristics is good for my case so I can bake it into the pass itself.

I have strong concerns about this right now. I don’t believe this is a sound approach in general and this is why I wanted clarification about the plan “beyond testing a pass”. The attributes aren’t gonna be preserved across the pipeline: if an attribute is intended to be the pass configuration it has to be attached exactly before the pass runs and removed immediately after.
Even this is not robust: the pass itself would have to be careful about handling the attributes while transforming the IR to not create inconsistent state.

My understanding is that patterns and passes are all reusable components that pipeline authors can pick and compose according to their needs. It’s the pipeline author’s responsibility to make sure all passes compose; sometimes it means throwing in more patterns, creating pass wrappers, etc. With the current passing-configuration-at-pass-construction scheme, it doesn’t mean the pipeline author is free of worries: it’s the pipeline author’s responsibility to make sure all “global heuristics” are properly set and consistent for the set of input IRs they care about.

So I’m not seeing why this is a great concern for using attributes but not for using pass constructor parameters. FWIW, I’m not proposing that we should have some attribute-carrying-configuration scheme that works everywhere and it just composes well with all potential pipelines without the pipeline authors’ interaction. I agree with Uday that one likely need to think about the op itself and its uses to decide and there is unlikely to have a blanket answer.

mehdi_amini · February 18, 2020, 11:14pm

This was mentioned before: isolation between passes (quote from Nicolas: “In such a world I think conveying information for transformations is reasonable, precisely because the attributes do not escape pass boundaries”). A global heuristic is something local to the pass. It also does not affect correctness in any way.
An attribute on an operation is contextual and is valid only in the context in which the attribute is added: as soon as a pass runs the context changes and you get into the problem of preserving the attributes.
There is prior art in LLVM with metadata, this is nothing new.

flaub · March 6, 2020, 9:51pm

What if we were to make a distinction between ad-hoc attributes and those that are defined on the op? I don’t know the history of issues that came from LLVM metadata, but it seems like in this case, we want to have some information that is encoded as an attribute which gets translated to a different attribute from one dialect to the next. If we had an optional attribute that is defined ‘offiically’ on the op, then it seems like it would have well defined semantics that the conversion process can use. I might be misunderstanding the issue here, if so, I apologize.

mehdi_amini · March 7, 2020, 12:09am

What if we were to make a distinction between ad-hoc attributes and those that are defined on the op?

Attributes in MLIR that are defined for the specific semantic of an op are definitely OK: this is why attributes exists in the first place.

The whole discussion here was really about using the attribute mechanism for other purpose than what you mention (a “side channel” to transport semantics information across the pipeline).

herhut · March 12, 2020, 2:07pm

antiagainst:

For constant workgroup sizes in gpu.launch it’s easy since we can just fold that into the generated SPIR-V module. For non-constant cases, we should still be able to define workgroup sizes at launch time via specialization constants. Seeing an gpu.launch with non-constant workgroup sizes, I think we should generate a SPIR-V module with specialization constants for the workgroup sizes. Then before launch, which is vkCmdDispatch in Vulkan, the Vulkan runtime should feed in the concrete values for the specialization constants for hardware driver compilers to finally compile the SPIR-V module. This cames at its costs: we need to specialize and recompile the SPIR-V module for every gpu.launch with non-constant workgroup sizes.

For IREE specifically, due to the fact that the runtime side is in IREE and the CodeGen side is in MLIR core, we need extra handshake and contracts to make the above happen. (E.g., the CodeGen need to know the SpecIds for the specialization constants to generate.)

Sorry for coming back to this quite late. I somehow stopped to get notifications

I wonder what the conclusion here was. From what I read, constants as launch bounds are sufficient for lowering to SPIR-V in the GPU dialect case and that dynamic bounds are possible but require different code generation? So, do we still need attributes?

antiagainst · March 12, 2020, 7:40pm

Thanks everyone for all the great insights thus far! Really appreciate that. My original intention is trying to see whether it makes sense to introduce some GPU dialect attributes to aid transformations. That got evolved into a general discussion on attribute as loop mapping mechanisms and attributes for transformations, and other topics. I can see the points as raised by @mehdi_amini here. Given the controversy, I’ll not push on this anymore.

@herhut: For SPIR-V workgroup size, an attribute on gpu.func is still useful. It allows gpu.func to be self-contained when used alone (without gpu.launch*) to target SPIR-V CodeGen, where we need the workgroup size in the kernel. Specialization constant does not invalidate that; it just allows some local workgroup dimensions specified differently. For example, you can just specify workgroup_size = {x = 64, y = 16} and then z must be represented a specialization constant (which requires a SpecId as a contract with the runtime). It does mean a different CodeGen path if such case. If all the workgroup dimensions are consistent then we can CodeGen something like spv.ExecutionMode @kernel "LocalSize", 64, 32, 8. With spec constant, we need to generate something like

spv.specConstant @x SpecId(0) : i32
spv.specConstant @y SpecId(1) : i32
spv.specConstant @z SpecId(2) : i32

spv.ExcutionMode @kernel "LocalSIzeId", @x, @y, @z
// or
spv.specConstantComposite @wgsize builtin(WorkgroupSize) = (@x, @y, @z) : vector<3xi32>

(LocalSIzeId and spv.specConstantComposite do not exist at the moment so it’s a mock up.)

mehdi_amini · March 12, 2020, 11:39pm

In general, having some constraints on the workgroup size as attribute on the gpu.func makes sense to me. It becomes part of the contract with a gpu.launch, but it does not requires the ability to recover this from the call-site. The kernel optimization/codegen is decoupled from this point of view.

herhut · March 13, 2020, 9:39am

antiagainst:

@herhut: For SPIR-V workgroup size, an attribute on gpu.func is still useful. It allows gpu.func to be self-contained when used alone (without gpu.launch*) to target SPIR-V CodeGen, where we need the workgroup size in the kernel. Specialization constant does not invalidate that; it just allows some local workgroup dimensions specified differently. For example, you can just specify workgroup_size = {x = 64, y = 16} and then z must be represented a specialization constant (which requires a SpecId as a contract with the runtime). It does mean a different CodeGen path if such case. If all the workgroup dimensions are consistent then we can CodeGen something like spv.ExecutionMode @kernel "LocalSize", 64, 32, 8. With spec constant, we need to generate something like
spv.specConstant @x SpecId(0) : i32
spv.specConstant @y SpecId(1) : i32
spv.specConstant @z SpecId(2) : i32

spv.ExcutionMode @kernel "LocalSIzeId", @x, @y, @z
// or
spv.specConstantComposite @wgsize builtin(WorkgroupSize) = (@x, @y, @z) : vector<3xi32>
(LocalSIzeId and spv.specConstantComposite do not exist at the moment so it’s a mock up.)

Thanks for clarifying. I think the modelling with an attribute on the gpu.func makes definitely sense in the setting where we do not have a corresponding launch. It is also useful in general and making it officially part of the gpu dialect and not a special case in SPIR-V lowering gives us the opportunity for some extra verification. So we should add it as an optional attribute on gpu.func that requires the kernel attribute and extend the verifier for gpu.launch to check that actual operands match what is annotated. We have to check the signature anyway, so this is not much extra overhead.

Topic		Replies	Views
Allocating workgroup memory for use by mlir-cuda-runner MLIR	15	942	October 20, 2020
GPU Workgroup/shared memory address space is hard coded MLIR	9	752	December 29, 2022
[MLIR][GPU][Linalg] How to find a valid lowering pipeline for workgroup memory usage via `transform.structured.promote`? MLIR linalg , gpu	2	139	April 2, 2024
How to lower the combination of async gpu ops in `gpu` Dialect MLIR gpu	6	480	August 28, 2023
How to lowering gpu.launch correctly MLIR	4	168	December 4, 2023

Using attributes to specify workgroup configuration when lowering to GPU

Related Topics