[RFC] Range analysis when evaluting/lowering affine maps

krzysz00 · November 11, 2021, 3:06am

When we expand an affine expression to runtime computations, especially with operations such as division or modulo, we always emit signed operations and checks for handling negative inputs. However, in many cases, we can obtain bounds on the arguments to these operations, either from things like the bounds on an affine.for or from domain-specific information the generator of the MLIR code is aware of.

I propose (and plan to implement) range analysis for the values in affine maps to enable certain optimizations, such as the use of unsigned remainder when the operands are known to be non-negative and more aggressive constant folding. This’ll simplify the generated code and thus make the job of any backend compilers easier.

One thing I’m not sure about is whether, and, if so, what, the syntax should be for explicitly affixing bounds in calls to affine.apply. An attribute might be reasonable, but we don’t have an obvious way to write down values that are either an integer, -infinity, or +infinity.

Mogball · November 15, 2021, 10:22pm

Is this related to Adding unsigned integer ceil and floor in Std Dialect? This seems useful; do you have a specific use-case you could share?

What if affine expressions where first transformed into a region with MLIR operations, and then the analysis can be performed directly on MLIR?

@ftynse Might have an opinion on this.

krzysz00 · November 15, 2021, 11:41pm

Yes, this is related to the linked post, at least in that they’re both stemming from the same observation/complaint.

I’ll try to elaborate on our usecase some, and hopefully I’ll strike a good balance between presenting a realistic example and avoiding extraneous details.

In our usage of MLIR, we’re generating convolution kernels that use an implicit matrix multiplication. I’m going to discuss where the signed operation emission comes up in the context of storing back the results of the multiplication, but the discussion should generalize to loading data as well.

Once the multiplication is performed, each GPU thread has a %v : vector<R x type> of values that correspond to some section of the result matrix (with R being a constant).

To simplify the usecase, we’ll be storing %v to %out : memref<N x H x W x C x type>, and we’ll have each thread’s %v be a contiguous slice of the C dimension (with R evenly dividing C). Because of this simplifying assumption, we can emit a store of %v[%i] to %out[%n0, %h0, %w0, %c0 + %i] (in reality, we generate coordinate-update code that converts the movement from the next value in %v to the non-negative offsets to (n0, h0, w0, c0), allowing for fully unrolled loops without full re-computation of the indices).

So, the problem becomes computing (%n0, %h0, %w0, %c0). We know that there is a fixed mapping from the a thread’s ID to coordinates (%i0, %j0) that indicate which value in the computed matrix is stored in %v[0]. However, since we support both NCHW and NHWC tensors, since we plan to support arbitrary or mixed layouts in the future, and since the matrix->tensor map can be quite complex (for example, due to needing to handle padding while storing the results of backwards convolutions, or when loading data), it’s unreasonable to similarly hard-code the map (%n0, %h0, %w0, %c0) = f(%i0, %j0).

Therefore, the map f is stored as an affine map, and we translate from matrix to tensor coordinates using expandAffineMap() to get the initial values corresponding to the starting coordinates of each thread’s data.

The simplest example of f is something like

f = (d0, d1) -> (d1 floordiv HW, (d1 mod HW) floordiv W, d1 mod W, d0)

(with all the capital letters being the relevant constants).

We know that the arguments d0 and d1 to this map are non-negative, though LLVM doesn’t appear to. Therefore, if affine.apply emits checks for handling negative inputs to floordiv or mod when lowering the computation above to the arithmetic dialect, the downstream compiler is less capable of optimizing expensive arithmetic to cheaper bitwise operations.

Looking more broadly, while our code currently contains extensive infrastructure to work around missing features in MLIR, we are hoping to be able to express our kernel generator in a style that more closely matches the intended use of MLIR and thus take advantage of the existing infrastructure instead of fighting against it. One such improvement would be ways to express to affine.apply (and therefore affine.for, affine.load, and so on) that some values are non-negative (or being able to use unsigned operations, as in the previous proposal), enabling the emission of more optimal code.

My proposal for a range analysis is intended to introduce a general optimization to the lowering of the affine dialect to arithmetic, which, on top of enabling the optimizations discussed above, would allow for observations like “Oh, this loop goes from 0 to 4 but the iteration variable is divided by 8? That’s just 0 now.”. However, there may be other solutions to our problem, which I’d be happy to discuss.

(On that note, @jungpark , since you headed the previous discussion)

(and while the above issue could be solved with a custom affine visitor that assumes everything is non-negative, that’s limited-purpose, duplicative, and leaves us with extra maintenance burden)

krzysz00 · November 15, 2021, 11:48pm

and to make the plan I have in mind more concrete, one idea I have is a affine-infer-ranges pass that annotates each use of an affine map with the pair of attributes lowerBounds and upperBounds, each a dictionary mapping dimension or symbol names to index. Present entries indicate the said lower/upper bound, while the absence of such an entry would indicate the bound is unknown. These bounds can either be user-specified, inferred (in the case of things like for loops with fixed bounds), or derived from the bounds present on the ops defining the arguments to the map.

I’m not entirely sure this is the right design for this analysis, but it’s what I was wanting comment on before I charge off and write a pile of code.

Mogball · November 16, 2021, 12:45am

I’d be curious to hear more about this, actually (in another thread, maybe), especially if many of them are similar problems to the one you have encountered here.

This sounds like a good idea, and which would be more generic than just “the value is non-negative”. I’m not familiar with the AffineMap structure itself, but perhaps this info can be added there.

Alternatively, the lower bound and upper bound attributes could be propagated to the generated ops, and then a second pass can apply them (e.g. ceildivsi where lowerBound >= 0 just becomes ceildivui). Wherever the data is stored, I think that inferring the ranges, lowering affine, and applying them should be separate (and thus composable).

Related: there has been a lot of discussion about attributes in passes (On querying an Operation's intrinsic (core) vs external/user-defined attributes for example) and I personally would like to see more passes use attributes to annotate metadata.

jpienaar · November 16, 2021, 1:12am

Have you considered using an IntegerSetAttr with symbols between this and symbols matching in order with dynamic dimensions in memref? (then you avoid the issue of -infinity and infinity too I believe)

Where all do you need this? E.g., why not have range analysis be an analysis ?

krzysz00 · November 16, 2021, 5:04pm

No, I haven’t, but they do seem like a reasonably tool for the job. Fortunately, they also don’t supporting ORing constraints, which means that I won’t have to deal with an overly precise analysis.

… I didn’t know that was a thing.

Given some of the above, it’s entirely possible that the real RFC is “add a range analysis to MLIR”, where we’d have

A RangeAnalysis, which knows about things like arithmetic operations (and probably stuff like affine.apply, depending)
A mechanism (attribute in tablegen?) that’ll allow annotating new or custom operations with range information (for example miopen.workgroup_id >= 0)
A simplify-arithmetic pass that can be tossed in anywhere in the pipeline one likes that uses the results of this range analysis to perform optimizations like constant folding and replacing signed operations with unsigned ones where their inputs are known-non-negative.

Mogball · November 16, 2021, 5:58pm

You could introduce a new type of Attribute in the form [operand_index, cmpi predicate, value] such as operand[0] ge 5.

Specifically for the arith dialect, ops where this information is useful could have an OptionalAttr<ArrayAttr<RangeInfoAttr>>. This doesn’t mean that other ops can’t also be annotated with range info.

ftynse · November 19, 2021, 3:12pm

In general, the data structure you want is something like isl_map in the isl library where you have the “affine map” between input/output dimensions but also the bounds on dimensions. IIRC, one of the initial versions of AffineMap allowed to specify bounds on LHS dimensions, but this was removed for the lack of use cases. Note that is is possible to model that with an IntegerSet, although it will look scary in the IR and likely require some custom operations: one can have a set defined as affine_set<(d0, d1)[s1](d0 - d1 == 0, d0 >= 0, s1 - d0 - 1 >= 0) to indicate that d0 is bounded as 0 <= d0 < s1. Specific ops (to be defined, maybe even in the affine dialect) can then interpret d0 as input dimension and d1 as output dimension and compute the set of values d1 may take given specific d0 or a set thereof (it can be seen as intersection + projection + point-sampling).

At this point, I would say that extending AffineMap to be more like isl_map (or rather isl_basic_map because it only models convex sets) would be rather challenging because there are many that assumes maps are defined everywhere in Z^N.

I have several concerns regarding the addition of attributes to affine.apply:

It can lead to contradictory annotations. One use can indicate that %0 >= 0 and another that the same %0 < 0. This is hard to catch by the verifier because verifiers of individual operations are not expected to know about siblings, and the parent operation is not necessarily expected to know that it must also check some attributes on affine.apply that it may or may not contain. If we want extra information about the value, this information must be provided at the value definition point, not at uses to avoid such contradictions. In theory, this information should be just part of the conceptual type of this value, but we have some liberty deciding whether to model something as an actual type or as an attribute on the producing operation (block arguments are trickier).
As proposed, it does not work in presence of canonicalizations. Affine applies will be folded into their users if those users also accept affine maps. This means these extra attributes will have to be added to all such operations, and the canonicalizations updated.
There is no guarantee that some further transformation won’t break the assumption conveyed by this attribute. For example, imagine that the non-negativity assumption is derived from the bounds of the surrounding affine loop, which gets lowered to SCF and than shifted by -1, which breaks the non-negativity assumption in a way that can no longer be understood by the affine framework. I don’t think this is solvable in full generality though: someone may know that their pass pipeline wouldn’t do that after the point where the unsigned optimization is introduced and chose to use the optimization, but it must remain opt-in given how loops are currently modeled.

Having an analysis, on the other hand, sounds fine. Since it’s MLIR and it should generalize, I don’t expect an analysis to just know about ops for which the ranges of resulting values can be inferred. Instead, we may have some IntegerResultBoundsInterface implemented by operations willing to opt into the range analysis. Its main method is similar to the folding hook: it receives a list of IntegerSet (or a custom structure) with bounds on the operands if known and produces a list of structures of the same type with bounds on its result. The analysis process proceeds all ops in def-use order. Thus each op can use whatever logic it wants to indicate what are the bounds on its results, including custom attributes that are specific to the op in question. We can have, for example, %1 = custom_dialect.assume_bound %0 { lower = 42, upper = 100 } and then use %1 instead of %0 to indicate user remarks; or propagate bounds through affine maps.

Passes may make use of this analysis as they see fit. For example, there is an affine.min simplification somewhere that relies on bounds-related knowledge inferred from surrounding loops. I wouldn’t agree that transforming signed to unsigned operation is necessarily a simplification: conceptually, having two kinds of operations is more complex than having one, practically, the performance will depend on the target architecture and compiler backend. But I have no objection in having a -make-int-arithmetic-unsigned-when-possible path with a clear name that does just that.

ftynse · November 19, 2021, 3:13pm

This sounds like reinveinting AffineExpr.

krzysz00 · November 22, 2021, 5:01pm

@ftynse Thanks for helpingfdlesh out the design of the bounds analysis - that does seem to be the right approach!

krzysz00 · April 19, 2022, 5:37pm

Range inference for arithmetic (and thus having nothing to do with affine at all, but I thought I’d bring the diff back to where all this started) is now the patch series

Topic		Replies	Views
Some MLIR warts we've found in our convolution kernel generator MLIR	11	663	November 17, 2021
Request for more review / further discussion on integer range analysis MLIR	6	333	May 9, 2022
[RFC] Arith dialect versions of affine.apply/min/max MLIR	20	1107	February 3, 2022
Understanding the vector abstraction in MLIR MLIR	24	3038	June 4, 2020
MLIR affine inverse MLIR affine	6	503	May 25, 2022

[RFC] Range analysis when evaluting/lowering affine maps

Related Topics