
[RFC] Making the size of `index` configurable in lowering to LLVM

This is related to the lowering to LLVM, in particular on GPUs, but it probably also applies in other settings.

In the language reference, the MLIR `index` type was defined as an unsigned integer with the size of the natural machine word of the target architecture. The updated Rationale now states that the bitwidth of `index` is undefined.

I am in favor of the latter, as it allows lowerings to target architectures to choose the most appropriate size for subscripts and size values. On some targets, even though they are 64-bit platforms, sizes and subscripts for many workloads are confined to a 32-bit range. On some GPUs, using 64-bit indices also comes with a significant performance overhead.

To support these use cases out of the box, it would be great if one could configure the size of the index type when lowering to LLVM independently of the size of the machine word. This would simply become an additional parameter to the lowering pass.
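
For illustration, here is roughly what this would look like (a sketch; the option name and the exact lowered form are illustrative, not an actual interface):

```mlir
// Input in the Standard dialect:
func @offset(%i: index, %j: index) -> index {
  %0 = addi %i, %j : index
  return %0 : index
}

// Today the lowering picks the machine word, e.g. on a 64-bit target:
//   %0 = llvm.add %arg0, %arg1 : !llvm.i64
// With a hypothetical index-bitwidth=32 option it would instead produce:
//   %0 = llvm.add %arg0, %arg1 : !llvm.i32
```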

The alternative would be to have a special pass that rewrites computations on index values to 32-bit. This cannot be done before the lowering, as many computations are not yet visible at that stage (e.g. index operations). And it is difficult after the lowering if the workload itself has 64-bit integers that have to remain 64-bit.
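
To illustrate why (a schematic sketch, not the exact output of the lowering): a multi-dimensional memref access carries no explicit index arithmetic before the lowering; the linearization is materialized only during conversion to the LLVM dialect, already in the chosen integer type.

```mlir
// Before lowering: no index arithmetic is visible in the IR.
%v = load %m[%i, %j] : memref<?x?xf32>

// After lowering (schematic): the linearized offset appears only now.
//   %s   = llvm.mul %i, %stride0 : !llvm.i64
//   %off = llvm.add %s, %j : !llvm.i64
//   %ptr = llvm.getelementptr %base[%off] : (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
//   %v   = llvm.load %ptr : !llvm<"float*">
```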

I have quickly prototyped this, and currently only a few lines of code need to change. So the main change here is conceptual, concerning the MLIR design.

You’re getting into an interesting topic, but potentially a can of worms.
Lowering to LLVM in general requires answering many ABI questions (we already played a bit with this for the memref lowering), which implies some notion of a target. The way LLVM encodes a lot of the ABI aspects is in the DataLayout (associated with the module, usually derived from the target triple).
We likely need to think about a similar concept in MLIR, and it likely applies at multiple levels: I don’t think the IR is entirely ABI-independent at higher levels, and every level of lowering may need a similar set of information about the target.

I agree with the longer-term vision. One key difference from LLVM is that MLIR is far less homogeneous in what “lowering to a target” means, so finding a single description of all aspects of lowering might be difficult. This points in the same direction as the discussion on using attributes to steer lowering across passes in the context of GPU workgroup sizes.

As a starter, I would make it an option on the lowering to LLVM dialect.

And I am aware that this opens a can of worms, but I think there is no way to avoid it for practical code generation.

+1. LLVM has no “index” type and so we need to make the decision at the point of lowering to LLVM. I don’t see a reason why it shouldn’t be configurable given the updated rationale.

+1

We have implemented a pass that converts the index computations to 32-bit after lowering to the LLVM dialect. This post-processing is a hack (it works for us since the index computations are the only 64-bit integer computations), and having an option to lower to 32-bit directly would be nice.
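
Schematically, the post-processing just rewrites the integer type of the lowered index computations (a sketch, assuming all i64 values in the module are index computations, which is exactly why it is a hack):

```mlir
// Output of the standard-to-LLVM lowering (64-bit index):
%c1 = llvm.mlir.constant(1 : i64) : !llvm.i64
%1  = llvm.add %idx, %c1 : !llvm.i64

// After the post-processing pass:
%c1 = llvm.mlir.constant(1 : i32) : !llvm.i32
%1  = llvm.add %idx32, %c1 : !llvm.i32
```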

The optimal bitwidth may also depend on the “domain”. For typical stencil applications, 32-bit indices are sufficient, and the performance benefits are significant (at least on GPUs).

FWIW, the conversion from Standard to LLVM derives the integer type to use for index from the pointer size provided by the data layout.

However, we set up the data layout rather arbitrarily in the lowering (default layout for the host target triple, which is known to be problematic).

One way forward I could suggest is to progressively add data layout-related capabilities to the conversion in the form of pattern options or conversion hooks (the two currently necessary pieces I’m aware of are the bitwidth of index and the alignment of vector loads/stores). When we have slightly more cases, e.g. conversions between dialects other than Standard and LLVM, we can try to generalize this mechanism and make it extensible. So far, we only seem to need this mechanism for lowering, but I can imagine it also being used for transformations, e.g. to estimate memory footprints or ensure alignment in our use cases.

Thanks @herhut for starting the discussion on this! This is not limited to LLVM lowering; we are hitting this in SPIR-V CodeGen too. Using bitwidths other than 32 requires special hardware capabilities, which means special capabilities/extensions requirements in the generated SPIR-V module. So we are converting it unconditionally to i32, and I’d assume we’ll keep doing it that way.

I think this is also not limited to index. In general, one may have some bitwidth in the source dialect but not in the target dialect, so we need to do the conversion there. I’m currently prototyping this using the type converter (some test examples here; still in progress, so sorry if something is obviously missing), but I’m wondering what others think.
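
As a rough illustration of what the type converter does in our prototype (a sketch; the narrowing is unconditional for now and only sound if the values actually fit in 32 bits):

```mlir
// Source: 64-bit arithmetic, which would require the Int64 capability
// in the generated SPIR-V module.
%sum = addi %a, %b : i64

// After conversion with a type converter that maps i64 to i32:
%sum32 = spv.IAdd %a32, %b32 : i32
```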

When you’re not dealing with types like “index” but with types that have a specified bitwidth, I don’t think you can change the bitwidth freely.
If you’re extending the bitwidth of a floating-point type, you likely also need to adjust for rounding (or mask for integers). In the other direction, if you want to model 16-bit integer computations on an 8-bit instruction set, you probably need to expand them into multiple operations.
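
For example (a sketch in the Standard dialect): emulating 8-bit additions with 16-bit operations needs an explicit mask to restore the 8-bit wraparound semantics:

```mlir
// Operands hold 8-bit values zero-extended into i16.
%wide = addi %a16, %b16 : i16
%mask = constant 255 : i16
// Mask off the high bits to recover modulo-256 behavior.
%res = and %wide, %mask : i16
```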

Yes, agreed. That’s the direction I’m heading; right now I’m changing bitwidths unconditionally as a temporary step to thread target environment support across components. For the immediate use cases we have, we see things like 64-bit integers coming down from upper layers that can indeed be 32-bit values, so this works. Next I’ll work on enabling proper extension or emulation of bitwidths.

Some of what Lei is referring to on our end comes from the XLA/HLO dialect, which we use as a source for a lot of things, and which is quite a bit too cavalier in using i64 for things that really should be an appropriate machine word (i.e. index). I consider these bugs in the HLO dialect, and we have some heuristics to narrow the types. I think if we can solve some more of the usability challenges with index, we’ll have a better time fixing HLO and then can remove the type hacks.

In general, for post-hoc changes to floating-point types, we intend to use the various quant ops (region and casts), which at least make it explicit that an additional approximation is being made and needs care. Although I’ve seen plenty of cases where people don’t have such a type and just blindly swap it, hoping for the best. Not saying it’s right, but it’s not uncommon to play fast and loose like that. The better we can make these types and the safe ways to transform them, the less people will need to reach for the quick substitution.

We should be able to start fixing the situation there once https://reviews.llvm.org/D76726 goes in (allowing tensor<index>).

+1 on improving usability of index.

Another thing to consider is whether index should be mapped to different bitwidths at different points in the program.

On GPUs, i64 arithmetic is expensive as Stephan said above, so we should use i32 for indices wherever we can. However, modern GPUs can hold arrays with more than 2^32 elements, so in some cases we do need i64 index arithmetic. Ideally we would teach our code generator to lower index to i32 when safe and to i64 otherwise. In some cases we might even want to multi-version the generated code on the size of the input and output arrays so that we can use i32 in the common case.
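
Schematically, the multi-versioning could look like this (a sketch with hypothetical kernel names; the 32-bit version is selected when all sizes fit):

```mlir
func @run(%buf: memref<?xf32>) {
  %n = dim %buf, 0 : memref<?xf32>
  %limit = constant 2147483647 : index  // 2^31 - 1, a conservative bound
  %fits = cmpi "ult", %n, %limit : index
  // Use the cheap 32-bit-index version in the common case,
  // fall back to 64-bit indices for huge arrays.
  cond_br %fits, ^bb32, ^bb64
^bb32:
  call @kernel_idx32(%buf) : (memref<?xf32>) -> ()
  br ^end
^bb64:
  call @kernel_idx64(%buf) : (memref<?xf32>) -> ()
  br ^end
^end:
  return
}
```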

To me this all seems to indicate that index -> integer is better done as a transformation on the standard dialect (or before) and not as part of the lowering to LLVM.

Some of the computations are not visible in the Standard dialect yet, and some operations in the Standard dialect only accept index, for example the IndexOp. So doing this before lowering is hard.

Also, I am not sure that mixing two different lowering bitwidths in the same context (like a single function) makes a lot of sense. @sanjoy_das_google’s concerns could also be addressed by lowering to different nested modules (multi-versioning) or by lowering different functions with different configurations.

We already have the issue that the host and device sides of GPU code might use different index sizes, and we have to bridge this difference at the function boundary. So some post-processing may be required, but I think that is easier at a coarser granularity, like between functions or modules.
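
Schematically, the bridge at the boundary is just an explicit cast (a sketch; @device_fn is a hypothetical kernel):

```mlir
// Host side lowered with 64-bit index, kernel expecting 32-bit sizes.
%n = dim %arg, 0 : memref<?xf32>
%n32 = index_cast %n : index to i32
call @device_fn(%n32) : (i32) -> ()
```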

What if the indices are used for different purposes? I can imagine that affine indexing and the index used for TensorFlow reshape attributes can be seen as fundamentally different enough that their lowering would differ, even if you mix them in the same “context”.

I assume here that the index type is used for values that are constrained in their size by the (expected) addressable memory; so, all things that are indices. Do you have an example where, in the same function, at lowering time, you need different sizes? What do the reshape attributes turn into when lowered to LLVM? I would expect them to get folded into some index computation, turn into loop bounds, etc.

Example: if you use an index to represent the dimension to transpose, the range of that index is limited by the rank of the tensor. This is use-case specific.
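
For instance (schematically, with an HLO-style transpose; op and attribute names are from memory, so take the details with a grain of salt): the permutation entries index dimensions, so they are bounded by the rank, not by the addressable memory size.

```mlir
// The permutation values range over [0, rank), here rank = 3.
%t = "xla_hlo.transpose"(%arg0) {permutation = dense<[2, 0, 1]> : tensor<3xi64>}
    : (tensor<4x8x16xf32>) -> tensor<16x4x8xf32>
```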