I would like to share one of the problems I’ve been struggling with when applying optimizations in the affine domain in MLIR. We currently have a good variety of optimizations working on the Affine dialect upstream. Many of them make use of affine analysis to address, for example, the legality aspect of the optimizations. This is also the case of the affine super vectorizer pass. However, an important difference with respect to other affine optimizations is that the super vectorizer is producing (memory) operations that are not part of the Affine dialect but the Vector dialect (vector.transfer_read & vector.transfer_write). Since affine analysis is only able to deal with affine memory ops, this basically means that currently there is no way to apply any other Affine optimization after applying vectorization (e.g., affine LICM or unrolling), which is a highly important limitation.
I’ve been gathering feedback from community members to address this problem and these are some of the solutions that were proposed:
Short term:
Convert Vector ops back to Affine after vectorization (Vector -> Affine -> Vector -> LLVM). This would allow us to reuse the Affine -> Vector -> LLVM conversion, which are already implemented. Alternatively, convert Vector -> Affine -> Std -> LLVM. This would require introducing the Affine to Std conversion for vector ops, which would be kind of redundant with the Affine->Vector conversion.
Introduce some customization in the super vectorizer to choose which output we want to produce: Affine memory ops or Vector memory ops. The implementation could range from a simple flag + if-else statements to providing a class with customizable call-backs for the generation of the memory operations.
Introduce op interfaces to vectorize memory operations that would produce the output that we want (Vector or Affine ops). Affine memory operations would implement this interface. We would use the same approach as before: a flag or a customization mechanism within the interface to choose between the two outputs.
Same as before but using a dialect interface for vectorization. This interface would be implemented by the Affine dialect and would provide all the knobs needed to make the vectorization transformation more generic. As before, we would need a flag or a customization mechanism since currently there is no way to register different implementations of the same interface to the same dialect and dynamically choose between them.
Long term:
Make the affine ecosystem more compatible with other dialects through op interfaces (or any other mechanism) that could help expose affine properties and affine constructs (affine maps) from any arbitrary operation.
My personal take:
Re #1, I think the Vector/Affine back-and-forth conversion would solve the problem quickly but might not be the cleanest approach to have upstream. I could definitely use this locally, but I would prefer to go with a cleaner solution.
Re #2 (customizable call-backs), #3 and #4, using a class with customizable call-backs seems to provide the same functionality as using interfaces. If we had a mechanism to register multiple implementations of the same interface into the same op/dialect and dynamically choose which one to use, maybe the interface implementation would be a cleaner one. However, since it’s not possible to do that right now, I think #2 could be good and simple enough for the short-term.
Re #5, that’s definitely interesting but it would require a lot of work, discussion and re-design so maybe that’s something we can work on for the long term.
I would appreciate your feedback before proceeding with further investigation!
Exploring #5 definitely seem interesting here.
I’m a bit wary of customizing the vectorizer itself: this seems extra complexity in a piece of code that shouldn’t have to handle this: conceptually the conversion vector->affine achieve the same thing but is at least isolated and independent.
I think converting back to affine is a reasonable option, which may also be helpful if somebody wants to extract affine representation from the IR that wasn’t affine in the first place.
Alternatively to customization in #2, I’d consider refactoring the dialect into a set of independent utility functions and defining two separate passes that use them. This leads to some duplication of high-level logic, but I personally found this easier to follow than recurrent if (flag) do-x else do-y or forwarding multiple callbacks.
I guess if we frame the vector->affine conversion as something more generic, it may have some value. We could even target standard memory ops in the future, as well. Do we have any utilities that could help with the analysis needed to determine if a memory op is affine? Maybe we could start with something simple: only support simple index expressions with IVs from affine loops.
Also, vector transfer read/write ops are modeling a few things in a single op: contiguous vector load/store, scalar load/store + broadcast, load/store with other permutations, padding, etc.). We may also want to canonicalize them into simpler ops (scalar load + vector broadcast, vector transfer read/write + vector shuffles, etc.) before the conversion to make it simpler. Not sure if @nicolasvasilache, @aartbik have something for this already. Would it make sense to add this canonicalization patterns to the canonicalizer or better have an independent intra-dialect “lowering” pass?
I’m a bit wary of customizing the vectorizer itself: this seems extra complexity in a piece of code that shouldn’t have to handle this:
We would only need customization in two places (basically where vector loads and vector stores are generated [1, 2]). However, I understand your concern. The value I see in this approach is that we would be exploring something closer to interfaces, which we will eventually need if we want to make the vectorizer more generic (e.g., vectorizing ops from other dialects). We can address that separately, though.
@dcaballe I think the posts above are partly missing the point. Pursuing #5 is generally a good thing and should anyway be done, but I don’t think that’s the cause of the difficulties you are facing - at least based on the information you provide in your post. Are you using vector.read/write_transfer to perform 1-d vectorization? If that’s the case, I do think those ops are the wrong abstraction to use for 1-d vectorization. Most of the approaches you mention of converting back to affine are just undoing things at the expense of unnecessary work (they may still be useful in other contexts like @ftynse mentions, just like scf.for to affine.for may be useful at some point/in some scenarios). Would you have the difficulties you mention had you used a memref_shape_cast + affine.load/store, i.e., first converting to a memref of vector element type (https://reviews.llvm.org/D85885 - btw, this is still pending for review where you are a reviewer!) and then do regular affine.load/stores on them? I’ve had no difficulty dealing with any downstream optimizations with this abstraction, but again, your use cases and context might be different and not completely covered in your post (for eg. multi-dim vectorization). If you are really doing multi-dim vectorization via tile_load/stores (which is what vector read/write transfers really are) and you want them to play well with affine analysis, #5 is the way to go, but you’d still need to introduce new ops in the affine dialect I feel (please see further below). This requires significant interesting design/implementation like others point out. There isn’t also much difference between dma_read/write with strides (affine.dma_start) and tile load/stores. Both are coarse-grained read/write ops that load/store data that’s not necessarily contiguous in memory, but still structured.
I did foresee this interoperability issue when the supervectorizer (someone should change this non-standard name - it’s just a vectorizer!) was built to generate other abstractions instead of pure affine.load/stores when possible.
@joker-eph - I don’t think customizing a vectorizer to generate the right ops by construction (if you know those are the right ones that you’d immediately want to convert to) is additional complexity. Instead, canonicalizing the tiled load/stores for the special case of 1-d tiles and other conditions when they could be converted to affine load/stores is additional complexity.
@dcaballe since the pass in question is of vectorizing affine loops, I think the even more appropriate design point for the multi-dim case is that of introducing affine.tile_load/tile_store ops, which could lower to vector_read/write transfer ops for the > 1 dim case. I think these would be the right op abstraction to target for an affine loop vectorizer for the general multi-dim case. Then, the op interfaces per #5 would play well.
On a minor note, I don’t see what the issue with the unroller is (unless you meant unroll-and-jam). Since unrolling is always valid, affine.for unrolling should work irrespective of what dialect ops you have in the mix (vector/affine/std load). It’s things like fusion, store to load forwarding, trip count calculation, scalar replacement, loop unswitching, anything depending on dependence information that you’ll run into issues with.
My point is that this is complexity that does not belong in the vectorizer as I understand it (other than artificial limitations).
I don’t quite get right now why to indicate “I want to load data here” the op used should be different based on what you want to do next, from the vectorizer point of view. I would expect the vectorizer to be able to express its intent in one way with one set of operations (and progressive lowering from there)
All of what you say above makes sense to me - but that doesn’t still fully specify the right choices here. Note that @dcaballe is I think vectorizing input with affine.for/load/stores here. vector.read/write transfer ops can use unrestricted subscripts/access functions for their operands like std.load/store. Also, it’s not that we are choosing a different load op based on what to do next (post #4 para 2) - the information on vectorization widths and dimensionality is already available. So, you know what the right ops are. If you don’t want to specialize and differentiate the 1-d contiguous case from the general case in the vectorizer, the new affine.tile_load/store ops was what I was proposing. But even here it doesn’t help with the difficulties expressed in the original post because you’d have to wait for the OpInterface support and for the other passes/utilities to use those where possible, and separately for a canonicalization pass to convert such tiled load/stores to regular load/stores whenever possible (1-d and contiguous). The latter canonicalization would almost always be run right after vectorization. The little specialization needed to 1-d contiguous load/store ops also allows a more incremental path to build/test things - note that all choices return a vector type.
Are you using vector.read/write_transfer to perform 1-d vectorization?
I’m mostly focusing on 1D vectorization right now.
Would you have the difficulties you mention had you used a memref_shape_cast + affine.load/store, i.e., first converting to a memref of vector element type (https://reviews.llvm.org/D85885 1 - btw, this is still pending for review where you are a reviewer!) and then do regular affine.load/stores on them?
That would work. We also have affine.vector_load and affine.vector_store in this regard, which is what we’ve been using so far.
If you don’t want to specialize and differentiate the 1-d contiguous case from the general case in the vectorizer
Actually, I don’t see why not! We could always generate an affine.vector_load/store for 1-D cases and lower them later to vector transfer ops. We already have this lowering implemented upstream. That would avoid the need for customization altogether and one could indirectly generate the current vectorizer’s output by also applying the Affine->Vector lowering. WDYAT about this as a short-term solution? It sounds the less convoluted one to me.
@dcaballe since the pass in question is of vectorizing affine loops, I think the even more appropriate design point for the multi-dim case is that of introducing affine.tile_load/tile_store ops, which could lower to vector_read/write transfer ops for the > 1 dim case. I think these would be the right op abstraction to target for an affine loop vectorizer for the general multi-dim case. Then, the op interfaces per #5 would play well.
We probably want to leave the multi-dim cases for later but… That could work. However, I’m worried about the op duplication/redundancy that we would have per dialect (we have some already). If we plan to move forward with #5 in the long term and since the Vector dialect seems to be the one responsible for vector ops, would it make more sense to delegate the vector ops to the Vector dialect for consistency?
In this regard and also answering @bondhugula’s comment on the vector transfer ops, my take is that they are great to model vectorization at a high-level of abstraction where you don’t care much about the low-level implementation details of those loads/stores. However, in order to improve the vectorizer, make compelling cost-based decisions, etc. we have to reason about low-level details of a vector load/store (contiguous, strided, gather/scatter, scalar + broadcast, etc.) and, therefore, we may want to be more explicit about how to model those in the IR instead of using a higher-level representation at that point. I wonder if we could have more explicit lower-level operations for vector memory ops in the Vector dialect and keep transfer reads/writes for higher levels of abstraction that could then be lowered to this set of more explicit ops (my previous comment on canonicalizing transfer reads/writes was towards this direction). This should help bring @bondhugula’s concept of simpler 1-d and multi-dim vector loads/stores to the Vector dialect and more consistently pave the way to #5. I think we already have other scenarios in the Vector dialect where we lower high-level ops into lower-level ops within the same dialect, right? (Vector contractions?)
Thanks for starting this thread Diego and sorry for the delay, I was out last week.
I think you put the finger on some key composability limitations.
The main problem I see is that vector.transfer is a high-level op (which you also mention in your last post) but for some reason, 1-D affine.vector_load/store “raises” to it on the way to LLVM. I understand this was done for expediency but it seems to me this is the root cause of your immediate problem. I would look into 2 things:
refactoring the lowering of affine to std to avoid going through vector.transfer.
have a legalization pattern that takes 1-D vector.transfer ops with agreeable (permutation map + alignment + layout) and rewrite those to affine.vector operations. You can then apply those patterns greedily. If you use a post-pass that comes immediately after vectorization, this could grow to legalize more things for affine analyses and transformations, in the future.
Absolutely, the affine dialect has some conflations and limitations, here is a small laundry list for such an ecosystem that I would welcome:
AffineExpr/AffineMap (resp. affine.apply/min/max) are not affine types (resp. operations) but really std/arith types (resp. operations) that apply to arbitrary index and behave differently for “symbols” and “dims” (for affine.apply, dims compose, symbols concatenate). What SSA values “symbols” and “dims” are allowed to bind to in a particular transformation setting is completely orthogonal. Longer term, this hints at a new arith dialect that can be expanded with polynomial, log/exp/shift (for recursive, FFT, scans etc).
most other affine dialect operations are just std operations whose operands are the result of arith.affine_apply/min/max with extra dim/symbol operands requirements related to AffineScope. Using the std form for these ops with additional arith.affine_apply/min/max analysis/canonicalizations would trigger the following:
a. shake up missing canonicalization infra (e.g. std.addi -> arith.affine_apply, nested loop bounds canonicalization, …).
b. mix affine and non-affine indexing within a given load/store like op: it should be pretty easy to extend dependence analysis to support some dependences that are partially *-distance (Allen & Kennedy notation) and relax existing violated dependence analysis checks (e.g. here).
c. longer-term, remove duplicated “affine-only” IR, whose existence is a byproduct of MLIR initially only being able to represent affine.load/store/for/if. Arguably, this would require a bit of “raising”/analysis to make arith.affine_apply/min/max more prominent, but this is simple in the overall picture.
extend and generalize transformations that have been implemented pessimistically in “affine only” form to also work more optimistically with only subsets of arith.affine_apply operands.
There should be no back-and-forth: vector.transfer is a higher-level op than affine.vector_load.
I’d say it’s even more isolated than vector->affine because of the narrow use case that @dcaballe has: in the short term, we only want to legalize 1-D vector.transfer ops.
Note that in the case of interest to Diego, legalize 1-D vector.transfer would suffice to unblock him.
But generally +1 to such a legalization pass. In the grander picture of arith.affine_apply/min/max ops described above, do you foresee significantly more than complex problems than proper canonicalization of these 3 ops + std.addi/muli/constant/... and some simple traversal to detect preconditions for transformations?
Not yet AFAIK but I don’t foresee major issues. I’d say it’s mostly missing canonicalizations into arith.affine_apply/min/max. For the short-term interop. you seek, I don’t think you’d even need those?
I have some trouble understanding what this would look like, could you maybe give some examples?
It seems to me this goes against the other parts because:
for your immediate problem you only need simple patterns for 1-D vector.transfer
if targeting better interplay between affine analysis and other dialects, vector->affine would seem to increase the footprint of “affine-only” abstractions.
vector.transfer_read/write is a generic higher-level abstraction, that works for n-D vectors and composes well with many other vector-level abstractions and progressive lowering. In the particular corner case of interest here (1-D + powers of 2 vectors + vector size that divides the problem size) they can easily lower to something that affine can understand today. The fact that affine does not understand anything else than “affine-only” is a deeper problem unrelated to future-proof vectorization abstractions.
The real problem is that affine.vector_load/storelowers tovector.transfer, which is a higher-level abstraction, on the way to LLVM. The correct path should be vector.transfer -> canonicalize 1-D affine.vector -> LLVM (probably via std).
%AV = memref_vector_cast %A : memref<?x?xf32> to memref<?x?xvector<8xf32>>
It is still unclear to me what happens when either:
the 1-D vector length is not a power of 2 (or whatever alignment the data layout requires) or,
? % 8 != 0 for the second memref<?x?xf32> dimension: parts of the data seems to just not be addressable?
? % 8 < 0 for the second memref<?x?xf32> dimension: seems like undefined behavior.
It is trivial to canonicalize from the higher-level vector.transfer to the lower-level subset that affine can deal with. Targeting a high-level abstraction + (very) progressively has been quite successful in MLIR so I am not sure why we’d want to target a lower level abstraction.
Indeed, the vector dialect is not limited by affine-only constraints and neither should the transformation. It turns out the implementation was limited by the IR we had at the time and it is probably time to refresh some of this. In any case, I’ll reiterate that legalizing vector.transfer for the corner case that affine analyses support is a trivial rewrite.
Yeah I’m not super satisfied with this name either, naming is hard. The initial feedback that I was sympathetic to, was that people were expecting vectorization to be 1-D only and that this was too confusing so I went for “supervectorizer” drawing inspiration from “superscalar”. Happy to revert back if this is now imprinted in people’s mind that MLIR vectors and vectorization are a multi-1D-vector abstraction.
In any case, let’s absolutely not use “tile”: it is one the most overloaded term in HPC compilers and hardware and algorithms… (otherwise should the transformation be called “tiling” ?)
This is a step further in the wrong direction: vector.transfer is still higher-level than affine.vector_load/store with N>1-D vectors and will continue to evolve this way to support gather/scatter, sparse, canonicalization/folding with other vector ops (unrolling, insert/extract, contract, transpose, future generic op, etc). Also, “tile”, really ?
Agreed.
Agreed, to break the back-and-forth cycle you mentioned, but the other way around vector.transfer is still a higher level abstraction than affine.vector_load/store or .
Yes and they are going to continue evolving towards higher-level abstractions (e.g. sparse).
Absolutely, there has already been quite some work on progressive “Vector -> Vector” lowering + canonicalization + folding.
In the fullness of time, we should indeed expose more of those + cost models. Mundane constraints such as timing and number of hands on deck get in the way… As a side note, in the particular case of memory + broadcast, it turns out that LLVM does a great job at converting std.load/std.store + std.insert/std.extract + the resulting lowering of (vector.contract -> vector.outerproduct -> vector.fma -> llvm.fmuladd) into “broadcast from memory” form (see the first assembly listing on slide 42 here):
It seems to me that adding ops with “affine-only” semantics to the vector dialect would only displace interoperability concerns into the vector dialect. So far I have been pretty happy with vector ops that compose and I am not looking at breaking that. If simplicity is what you are after and vector.transfer is too high-level, there is also std.load/std.store but of course this does not resolve the “affine dialect composition” problem so you’d still need a legalization pattern. However, the ? % 8 != 0 case still needs to be handled somehow.
Yes, we indeed have this behavior and continuing to add more progressive canonicalization and lowering to the vector dialect.
Now irrespective of the points above, there is still the fact that this pass has been written quite some time ago and needs to be refreshed, especially in light of linalg -> vector needs and the upcoming sparse needs. The affine parts (affine.for dependence analysis + affine.for rewrite + affine.load/store) can be decoupled from the op vectorization part. The op vectorization part can be reused in a) affine, b) Linalg and c) a potential scf.parallel_for + std.load/std.store if someone is interested in that. In fact some of this has been written out of core for Linalg and should be unified.
One progressive way of going could be to introduce a vector.generic op with structured ops semantics that has been discussed in the past (slide 48 shows vector.contract) and use it as a common ground. Would that sound useful to people?
For Diego’s immediate blocker, the 2 steps (that I’ll rewrite here) should do:
refactoring the lowering of affine to std to avoid going through vector.transfer.
have a legalization pattern that takes 1-D vector.transfer ops with agreeable (permutation map + alignment + layout) and rewrite those to affine.vector operations.
This is answered on the reviews page. Power of two isn’t an issue.
Yes, it’s not addressable through this cast’ed memref. They are still addressable through the original elemental memref. As an example, with full/partial tiles for vectorziation, the then branch/full tile would use the cast vector memref while the else version would continue to use the original elemental memref (accessing the partial tiles). This is what is intended and natural for this — please understand that this is a casting op, not a load/store/data moving op.
Yes, it’s undefined behavior, no different from similar scenarios that are bound to occur with dynamic memrefs. (For eg. you can do an alloc while passing a non-positive operand.) In the usage context, depending on which utility/pass is generating such ops, such dynamically undefined code wouldn’t get executed.
I think all three of your above questions are pretty much on the same issue. memref_vector_cast is a casting op, not a loading/storing data movement op.
refactoring the lowering of affine to std to avoid going through vector.transfer.
I agree with this! However, affine.vector -> Std is not that simple. affine.vector ops have semantics similar to vector.transfer in the sense that they have a return vector type that indicates the number of elements to be loaded/stored from a “scalar” memref:
There is nothing like that in Std, AFAIK. We would need std.vector ops similar to affine.vector ops or go with the memref vector cast op + Std scalar loads/stores. However, we introduced affine.vector ops to address the casting problems when using the scalar counterparts (see https://llvm.discourse.group/c/mlir/31). Note that my previous comments about introducing more explicit lower-level memory ops in the Vector dialects were exactly towards this direction. Would these ops make sense (either in the Std or the Vector dialect)?
The main problem I see is that vector.transfer is a high-level op (which you also mention in your last post) but for some reason, 1-D affine.vector_load/store “raises” to it on the way to LLVM.
have a legalization pattern that takes 1-D vector.transfer ops with agreeable (permutation map + alignment + layout) and rewrite those to affine.vector operations. You can then apply those patterns greedily. If you use a post-pass that comes immediately after vectorization, this could grow to legalize more things for affine analyses and transformations, in the future.
I agree with the high-level and raising points. However, if doing 1-D affine.vector -> vector.transfer ops is raising the abstraction level, isn’t affine.load/store -> vector.transfer (what the vectorizer is currently doing) kind of doing the same thing? If the input is an affine.load/store and the goal is to generate an affine.vector_load/store, going through an intermediate vector.transfer op would mean:
(Affine->Vector) Raising the abstraction level and dropping the affine information that the input IR has, and
(Vector->Affine) Having to retrieve the affine information again through analysis immediately after the vectorizer.
This sounds a bit redundant to me. Wouldn’t it make more sense to just generate the affine.vector ops directly instead of dropping the affine information and having to implement an analysis to retrieve that information again? What is the concern about generating affine.vector_load/store ops directly for 1-D cases given that the inputs are affine.load/store ops?
I think changing the lowering of affine.vector ops + generating affine.vector ops for supported 1-D affine.load/store ops in the vectorizer would be the cleaner and probably less problematic approach moving forward. Thoughts?
%AV = memref_vector_cast %A : memref<?x?xf32> to memref<?x?xvector<7xf32>>
Assuming %AV[0, 0] lives at address 0, what is the address of %AV[0, 1] ?
Oh I get that, what I am saying is this op semantic is not compatible with the more general vector.transfer-based vectorization. The supervectorization pass refuses to split full/partial tiles too early. The information is carried in the vector.transfer op. Later in the transformation pipeline, and depending on the HW, canonicalization patterns + unswitching can result in either:
dynamic vector masks (e.g. for ISAs with >1D vectors),
predication (e.g. for GPUs, the evolving LLVM predication path and hopefully in the future SVE-style)
cleanup loop with scalar code (the “partial vectorization” you mention)
As you well know, the claim is that the vector.transfer-based vectorization superseedes and encompasses vectorization with full/partial split as a corner case. The converse is not true as loops get split (esp. multiple nested loops). Note that this argument also appears in SVE/VLA-style programming.
This is a very important (re)design thread that I’d like to see reaching some sort of conclusion. Did you start working on #2, @dcaballe already?
I’m generally supportive of the #3->#4->#5 approach. There is no fundamental reason for affine to be restricted to specific forms of affine.load/store ops. Affine should know about the ops inducing side-effects, and figure out in linear time which of these induce “affine” side-effects. Currently it does not even need to check this and declares every non-affine-dialect memory access as UB, but I understand this is only a historical short-cut. State of the art in polyhedral compilation allows to mix and match affine and non-affine side-effects. The interface mechanism, and later modularization into a dialect, would make things much more composable and future-proof, and avoid duplicating tons of code.
Please instead directly respond on the reviews page for that PR - the thread above doesn’t include the response on the reviews page as context. I still don’t see how power of two vs non-power of two makes any difference for this op.
What does it mean for a specific op’s semantic to not be compatible with a different op’s semantics?! The former is a casting op and the latter is a data transfer op and encodes different abstractions! The subsequent explanation really doesn’t sound meaningful or on topic to me.