Pattern rewriter - apply only the specified patterns

(I posted this as an issue last year when MLIR was on tensorflow’s GitHub, but I’m bringing this up again with some more discussion.)

The greedy pattern rewriter that currently exists not only applies the supplied patterns but also performs folding and DCE on the entire function. applyPatternsGreedily I believe will benefit from having an extra boolean option (‘onlyPatterns’) to specify that only the supplied patterns are to be applied and no other folding/DCE is to be performed. Adding this would be trivial. The motivation is as follows, but this still doesn’t address three issues that I describe further below.

  1. In several utilities/passes (esp. at their entries and exits), one may want to perform specific simplification rewrites and just those specific op rewrites instead of whole IR DCE / folding. This is so that you could say canonicalize/simplify the input in order to make the pass more effective and in some cases valid. In several such cases, it’s not really viable to check if the input is already in a canonical/simplified form - it would require one to duplicate a lot of the logic from rewrite patterns in question and keep them updated/in sync, and in many cases, one can’t check if it’s in a simplified form without actually performing it to see if something happened (due to a combined effect).

  2. In some cases, running whole function DCE/folding could even delete the op that the utility was expected to work on! (not just due to one pattern but as a combined effect) How would we know if that happened when calling applyPatternsGreedily within that utility? OTOH, if a user is able to specify specific patterns, they would know whether the patterns could potentially erase the op, and not use those that would.

  3. Applying unrelated folding and DCE at the exit of the pass/utility will see unexpected output / simplifications that makes writing test cases difficult and updating those test cases based on completely unrelated updates to other folding patterns.

  4. Applying certain lightweight patterns on the ops the utility/pass works on leads to earlier simplification, more intuitive/readable pass output, and more readable/compact test cases.

  5. Using the current applyPatternsGreedily is also a compile-time issue dealing with all the ops in the function when you only wanted specific stuff.

  6. The canonicalization patterns are also in local namespaces and so they can’t be accessed other than through the rewriter out of the box

While adding a ‘onlyPatterns’ flag as proposed will help all of the above, it doesn’t address the following three issues:

(A) a change by @River707 a few months ago moved several standalone canonicalization patterns into the folding hook of the op - to allow more simplification. This however prevents one from applying those patterns via a pattern list - instead the folder / tryToFold has to be called (when you have onlyPatterns = true), and the pass/utility will have to deal with phase ordering if it needs a specific op’s folding and canonicalization patterns.

(B) the ‘onlyPatterns’ flag doesn’t address the issue of all ops in the entire function from getting into the worklist queue. Although only the ops related to the supplied patterns will match, you still have to create and process a much larger than necessary worklist.

To address these two, we could have an option / an alternate applyPattern method that only puts a list of supplied ops (often just one op) into the worklist and does both: folding/DCE and application of the supplied patterns. This will address both (A) and (B). But there would also have to be a way to indicate successful folding/erasure back to the client.

(C ) There would still be no way to run just one of the op’s patterns - it’s either all canonicalization patterns or none. But this is again is a smaller issue in the whole scheme.

Do these proposed changes and rationale sound fine?

I think that adding an “onlyPatterns” flag to this makes sense.

I’d rather not have a flag, it seems awkward applyPatternsGreedily(, /*onlyPatterns=*/true);. It would be much cleaner to have different entry points that both happen to use the same greedy driver internally, albeit configured differently. Realistically, the current behavior should be moved to an entry point with a closer name applyPatternsAndSimplify or something. It also isn’t clear to me If what you really want is to say mlir::simplifyOperations(ops) and have it run canonicalization patterns and folders on those operations(optionally with some additional patterns). If there is a better user API that we can provide, I would rather we do that than add flags.

I would be unsupportive if the folder isn’t run in whatever mode this turned out to be. It seems like that would incentivize using canonicalization patterns over the folder, which is the exact opposite of what we want.

If you always wanted to run the folder, @River707 how do you address issue (2) in my list? Concrete example: if a utility is given an if/else op to hoist up invariant loops and it so happens that the ‘if’ op provided could really be folded and eliminated because its bodies were empty or that the body which was always taken was empty), there would be no way of applying canonicalization patterns without losing the entire op. Shouldn’t that utility be just simply hoisting instead of eliminating it as dead or folding it? If you always want to run the folder, how would you simplify but not fold (which is what you want to do here - the simplify will drop unused operands, compose operands, etc. which is needed for the pass to be both effective and in some cases valid). Also, it shouldn’t be upon the caller to always run folding on it first before handing things to such utilities. If you force folding, a utility like hoist, loopUnroll or doXyz would always become hoistOrFold, loopUnrollOrFold, doXyzOrFold etc. and callers will have to be prepared to deal with that outcome. One could argue that that should be behavior because otherwise the utility is just doing useless work on something that shouldn’t exist: but what if the caller’s intention was to provide an empty body ‘if’ or in other ways an op that is not yet alive, call the utility to perform the action, and then populate it making it live. I would tend to think that the right behavior for such a utility is to still perform the intended action (hoisting in the example above) and not be doing DCE/folding on it. There are some important design choices to make on utilities here, which impact what support you want to provide in the pattern rewriter.

Applying patterns itself is already simplification - in fact, I was myself thinking of renaming applyPatternsGreedily to applyPatternsAndFoldGreedily - the current name is misleading when reading code at its call sites.

I am missing the part where the utility needs to apply canonicalization?
(To be clear: I am not against improving the situation with the APIs here, I think multiple entry point and possibly different drivers make sense)

Are you looking for an example where canonicalizations (ones that simplify the op without trying to fold) are necessary to make the pass effective / valid?

I don’t quite understand what you are getting at here. If I have an op that can be folded away, why would I want to apply canonicalizations to it? Trying to pick and choose the level of canonicalization you want seems weird and not really feasible, e.g. a canonicalization pattern has just as much capability(if not more) to delete the operation and replace it with something else. I don’t see how removing the ability to fold improves the situation at all.

Yes: this is the part that isn’t clear to me here. I don’t associate directly in my mind the internals of loop unrolling with canonicalization patterns (or if I want to canonicalize the unrolled body before unrolling an outer loop, then I would want all the folding as well anyway).

For say loop unrolling, the canonicalization for affine.for’s help you detect if the trip count is a multiple of the unroll factor and thus doing away with the cleanup part generation (composing affine.apply’s / dropping unused/duplicate bound operands, and simplifying the bounds maps used). The other example in the context of the hoisting of if/else that I mentioned earlier and that I thought would be clear: dropping unused operands on the affine.if (which bind to its integer set) allows you to hoist it more while keeping the utility simple. Second, composing in any affine.applyss that provide values for the if’s operands is necessary for the hoist check to be valid again keeping the check simple. Examples:

affine.for %i
  affine.for %j
     affine.if affine_set<(d0, d1) : (d0 - 1 >= 0, -d0 + 100 >= 0)(%i, %j)  {
         ...
      }

%j is an unused operand. canonicalizeSetAndOperands drops such operands. Hoist check becomes simple - you just have to see if the IV is among the set’s operands and you can hoist past %j.

affine.for %i
  %i_ = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
  affine.for %j
     affine.if affine_set<(d0) : (d0 - 1 >= 0, -d0 + 100 >= 0)(%i_)  {
    
}

If you canonicalize an affine.if op, it ensures that operands are always either IVs or top-level symbols (the %i_ gets composed in); as a result, the utility/hoist check becomes simple.

But such examples / use cases are pervasive. Any affine.for op / affine.if canonicalization makes nearly every utility simpler – hoisting, scalar replacement, unrolling, full/partial separation to name a few.

Did you see post #4? It might have crossed your reply.

I have the impression that you’re explaining “why is canonicalization helping a transformation”, but this isn’t controversial: what I haven’t seen clearly showed to me is when is it needed to have a targeted canonicalization in the middle of a transformation. The examples you’re showing here seem like they would be handled by running the canonicalized before the unrolling pass for example?

But a utility could be called at the middle of another utility/transformation. For eg., the utility to hoist 'if’s could be called at the middle of the one that does separation of full and partial tiles (the latter creates an if/else condition). A utility to hoist allocations could be called at the middle of a utility that creates those allocations (the one that introduces memref packing/copying for example). The utility to promote a single iteration loop body to its containing op could be called from almost anywhere. There isn’t a reason to think of these rewrite patterns as being needed at entry alone. Second, a canonicalization rewrite may be needed towards the end of a pass/utility to cleanup on an op or one or more collected ops. Third, even for the canonicalization that is needed at the the beginning of a transformation, you may not want the canonicalization to be performed before the pass / at the caller, but to be handled within the pass/utility at its entry. The caller isn’t supposed to know what patterns are needed there - it could well be a custom pattern outside of the getCanonicalizationPatterns, and even otherwise, shouldn’t need to add code there to canonicalize.

This stack of revisions addresses these as well as a couple of other issues: D77478, D77483, D77485, D77486, and D77487.