Speeding up canonicalize?

I haven’t done the archeology on this, but I think that I wrote this code originally, and I wasn’t worried about that. I just pounded out the simplest thing possible with the plan of coming back to it later (now apparently) when we had larger testcases that worked with MLIR :slight_smile:

Your point is a good one though. I think that it would be best to merge this into the prepass since there is a simple fixed point to trivial deadness, and it is easy to drive it through the SSA def-use chains.

The reason top-down simplification is important for folding / pattern rewriting is that these happen in the middle of the expression tree, and each rewrite adds the operands and users back to the worklist. Folding the operands away first can mean far fewer iterations of the worklist.
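To make the worklist dynamics concrete, here is a rough sketch of that behavior (not the actual driver code; `worklist` and `tryFoldOrRewrite` are invented names):

```cpp
// Rough sketch, not the actual driver: after folding/rewriting an op, the
// defining ops of its operands (and, via the rewriter's listeners, its users)
// get pushed back onto the worklist so they are revisited.
while (!worklist.empty()) {
  Operation *op = worklist.pop_back_val();

  // Remember the producers of the operands before the op is mutated/erased.
  SmallVector<Operation *> producers;
  for (Value operand : op->getOperands())
    if (Operation *def = operand.getDefiningOp())
      producers.push_back(def);

  if (succeeded(tryFoldOrRewrite(op)))   // invented helper name
    worklist.append(producers.begin(), producers.end());
}
// Visiting producers before consumers (top-down) tends to fold operands away
// before their users are examined, so fewer of these re-visits are needed.
```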

-Chris

I added it at one point when it was enabling some simplifications in code I was dealing with, but I’m happy to have it in another pass.

From my perspective, this is partially an effect of trying to optimize PHIs in MLIR. In LLVM it’s easier (probably one of the few upsides of modeling a PHI as an instruction), because a PHI is just another instruction in your worklist. In MLIR, because we use block arguments, it’s a different IR structure that was never really integrated well into the current simplification loop.
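As a rough illustration of the structural difference (using real MLIR iteration APIs, but `worklist` is again an invented name): block arguments are `Value`s attached to a `Block`, not `Operation`s, so they never appear on an operation worklist.

```cpp
// Block arguments are not Operations, so an Operation* worklist never visits
// them directly; simplifying them requires separate reasoning about the
// branch operands of predecessor terminators.
for (Block &block : region) {
  for (BlockArgument arg : block.getArguments())
    (void)arg; // needs dedicated handling outside the op-driven loop
  for (Operation &op : block)
    worklist.push_back(&op); // only operations are enqueued
}
```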

Yeah, several actually. Aside from those that depend on it, I’ve also tried to avoid the situation where we end up prematurely splitting and then have to do this everywhere (this applies to test2 as well):

```cpp
add(Canonicalizer());
add(ADCE());
```

(Not saying that we shouldn’t have a separate more focused ADCE pass at some point)

I 100% agree that the current loop needs to change; it is too happy to re-run the full iteration over and over again (often for no reason). I can play around with changing it and see what breaks, but I would honestly consider cases that break to be bugs (i.e. I suspect something is wrong with how the worklist is updated).

– River

@clattner I don’t know if you missed this comment, but we found a lot of cases like this internally.
This may not matter for canonicalization but it does for other “pattern-based optimizers” (using DRR for example).
In such cases they prefer to match the largest subgraph, and traversing bottom-up was allowing this kind of behavior.

Now in practice that may just be a misuse of the greedy pattern application driver, and another driver would be better suited for this.
I actually talked with River a few months ago about the need for different visitation orders, and it goes beyond bottom-up vs. top-down: it also includes the way you re-enqueue operands/results after applying a pattern (to drive a BFS- or DFS-like traversal).


As a follow-up, another interesting behavior change (I’m still dealing with breakages following this change): we used to fold constants themselves once before “remembering” them. This had the effect of invoking the dialect’s “materializeConstant” hook, which could in some way “canonicalize” the constant.
After this change we don’t do that anymore, so if a constant op had extra attributes in the first place we’ll now preserve them through the pipeline, whereas before they would have been folded away, I believe (assuming the dialect was able to materialize constants).
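For reference, the hook in question is the dialect’s materializeConstant override; a minimal sketch of how a dialect can normalize constants as it materializes them (the op name is a made-up placeholder):

```cpp
// Sketch only: a dialect gets to choose (and effectively canonicalize) the
// constant op it materializes for a folded Attribute.
Operation *MyDialect::materializeConstant(OpBuilder &builder, Attribute value,
                                          Type type, Location loc) {
  // Emit the dialect's preferred constant form; extra attributes carried on
  // the original op are dropped here. mydialect::ConstOp is a placeholder,
  // not a real op.
  return builder.create<mydialect::ConstOp>(loc, type, value);
}
```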

Honestly, I think this canonicalizer traversal change was rushed in based on a single test case (or two), without adequate experimentation for something that has farther-reaching consequences for transformation pipelines than almost anything else, and with the quick conclusion that there was no reason a bottom-up traversal could be better. Pre-order or a different approach might well turn out to be better eventually, but I don’t think a major change like this, which impacts nearly every MLIR user upstream and downstream, should be made without adequate experimentation. I feel the right course of action now is to quickly revert it.

For example, we can’t be sure it doesn’t slow down a large number of cases by small amounts upstream and downstream, or create other issues. We need a good, representative set of test cases, even a small one, and the pass timing infrastructure already exists to manually time and check that small set.

Yes, but the reason the traversal order was kept unchanged for this long was the realization that post-order collection plus processing in reverse is a natural fit for “consumer-driven” simplification. I recall reading and updating things in the rewrite driver multiple times but never considering changing the traversal order, for that reason.


Yeah, I tend to agree with Uday here: while canonicalization “should just work” regardless of the initial order, the greedy pattern rewrite driver is the only one available, so a very large number of clients have started to use it for more than canonicalization and have unfortunately become very sensitive to the order, in particular with the expectation that we match consumers before producers (which is not strictly guaranteed anyway, considering how we re-enqueue users after pattern application).

We should likely reprioritize revisiting the need for more pattern drivers than the “simple” greedy one: we’ve been talking “forever” about having drivers that can expose cost functions, or that can be driven by search or ML methods, for example. But even starting with a different “optimization” goal and visitation order would be a good first step.
I suspect this will require some non-trivial work though.

Hi all,

I’m sorry I haven’t had time to dig into this, feel free to revert the patch if it is dubious! I’m happy to spend more time discussing this and change it, or recommit it if we decide it is the right thing after all.

-Chris

As a middle ground, you could also keep the pre-folding and uniquing of constants but not change the traversal order.

Yes, the pre-folding/uniquing of constants caused us a lot of trouble internally, but I looked at those as “bugs” in the internal projects (that said, our invariants with respect to constants in a dialect aren’t documented, and folks have been “creative” about defining dialects with multiple kinds of constant ops).

On the other hand, the traversal order created more subtle issues for which I didn’t find an immediate solution to offer our clients. We actually landed a “flag” on the applyPatternsAndFoldGreedily API to let clients opt back into the previous visitation order: we couldn’t have moved our clients forward without it (it is now “dead code” upstream, and I can’t say I approve of this in general).
The issues ranged from correctness to performance (I still got a report as late as today of ML models seeing “quality issues” after this change, which triggered more uses of this flag internally).
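For context, the opt-in looks roughly like passing a configuration option to the driver; I’m not certain of the exact spelling that landed, so treat the names below as assumptions:

```cpp
// Illustrative only: the config/flag name is an assumption, not necessarily
// what was actually landed upstream.
GreedyRewriteConfig config;
config.useTopDownTraversal = false; // opt back into the old bottom-up order
(void)applyPatternsAndFoldGreedily(getOperation(), std::move(patterns), config);
```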

Reverting the visitation order but not the change to constant processing seems like a good middle ground for now.

(As an aside, in a “graph region” the visitation order should really be driven by use-def chains more than anything, right?)

Having been around ML for a long time, I get a bit squinty-eyed at “quality issues”: generally, in my experience, these are undiagnosed correctness issues due to a somewhat “pragmatic” interpretation of invariants.

Moving forward, I don’t have a strong opinion on a full rollback or further iteration based on the merits, so long as we are orderly about it.

I do think this flushed out a lot of bugs and otherwise under-constrained behavior. Given the thrash we’ve just gone through this week, if we’re going to rush further action other than a full rollback while we figure out what we really want to do, I’d feel better if we got some further impact reports to determine whether a fix forward is likely to help rather than hurt. Unless there is evidence that a fix forward will reduce some chaos, I’d vote for a pure rollback, or for keeping it as is while discussing the real path forward.

I’d prefer not rushing a further change unless there are still significant impacts reported by multiple parties.

Given Stella’s note as well, we should just revert Chris’s change. I really don’t think this should be incrementally fixed upstream. The fact that it may cause churn downstream is orthogonal to some extent, but I think fundamental tradeoffs were missed before this was committed. (For example, I hadn’t even had a chance to look at the responses to my comments on Phabricator and here before this was landed.) I’ll revert it assuming there is consensus.

Update: ⚙ D99329 Revert "[Canonicalizer] Process regions top-down instead of bottom up & reuse existing constants."

Accepted because I think this is the right action per the LLVM developer process (I would prefer an additional ack to land).

FTR, on the Google side, integrating the original version of this cost me personally a sustained 48-hour stretch of time, and the rollback will also tax me personally. But it is the right thing to do for upstream. Maybe it will also cause some to re-evaluate the style of downstream integration that biases towards this kind of non-recoverable issue (IREE, for example, is deciding now whether to switch to a more pipelined, more open-source-friendly integration process). All of that is (or should be) orthogonal to what to do with this patch.

Adding some information to this thread. In IREE, using the new traversal did speed up compilation by the same amount Chris found earlier (about 33% on compiling MobileBERT). Since there is a flag to choose which traversal to use, does the change really need to be reverted? Having the flag would allow more testing and more information about its efficacy, right? Would it be fine to use the earlier traversal by default?

Let’s talk about this in the ODM meeting in 10 minutes.

It’s a shame it wasn’t announced before, I would have participated in the live discussion.

In my opinion, virtually all upstream contributors also work on downstream projects. Being more mindful about downstream instead of the current “scorched earth” approach where we don’t care at all would likely save time for virtually everyone.

I strongly agree, and I apologize for imprecision in my comment. In this case, I think that the malfunction was that parties should have noted early that this was a “scorched earth” commit (unintentionally), stopped the line, and called for a revert when the issues became apparent. Instead, we all (across multiple companies) just descended into the chaos and only joined on it so late that we were fairly committed to the fight to make it work. I’m jumping to conclusions, but some of us are effectively on integration freight trains that can be hard to stop once they start, and I know that biased my early evaluations of the right thing to do here. I suspect it biased several of us. We should inspect that in our processes and comms and figure out how to improve for the future… that is really what I was getting at.

That is all orthogonal to what we do now.

Would it be worthwhile having a flag that purposely randomizes the processing order, to keep downstream users from making assumptions here? It’s not really clear to me whether this was just a case where sensitive tests broke but functionality was the same, or whether there were more serious regressions.
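Purely as a hypothetical sketch of what such a flag could do (none of this exists today; it just shuffles the initial worklist seeding):

```cpp
// Hypothetical: randomize the initial worklist so nothing downstream can rely
// on a particular processing order (needs <algorithm> and <random>).
// "root" and "worklist" are placeholders; this is not an existing option.
std::vector<Operation *> initialOps;
root->walk([&](Operation *op) { initialOps.push_back(op); });
std::shuffle(initialOps.begin(), initialOps.end(),
             std::mt19937(std::random_device{}()));
for (Operation *op : initialOps)
  worklist.push_back(op);
```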


You can keep the default as is with a flag, but that would make the code for the other traversal mode dead and untested upstream. One might as well take it out and add it back later with proper test cases. IMHO, speeding up a set of test cases by 33% still isn’t great justification for a change like this.

+1 on this in this case; we need to be careful though, as many (most?) changes are “breaking changes” but don’t all fit into this category.
In particular, the difference between this change and any other breaking change is one of fundamentals: the motivation for this change was canonicalization, and it was explained and justified this way. However, the changed component is pervasively used for non-canonicalization use cases (MLIR doesn’t really have an alternative driver to offer for pattern application).
Without invalidating the merit of the change in the context of canonicalization, the realization of how widely the greedy driver is used in non-canonicalization settings seems enough to me to step back and rethink the change.

What’s also interesting to me here is how little this change hurt upstream: it shows an area where we don’t have much “stress testing” of the framework, and I tend to relate this to the absence of a well-established “end-to-end” flow upstream (like an “end-to-end TOSA compiler”, for example).

Yes, we discussed it this morning: it would be doable for canonicalization, but see below:

Most of the problems we hit were in “non-canonicalization” use cases. We have many callers of applyPatternsAndFoldGreedily that “bring their own patterns” and don’t even include the canonicalization ones. The driver is used for “lowering” and other kinds of preparation inside some passes (see the sketch below).
If applyPatternsAndFoldGreedily were only used for canonicalization, I suspect we wouldn’t have had nearly as many problems.
(Add to that the issue of invariants around constant ops that I tried to describe above.)
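For example, the kind of usage being described looks roughly like a pass that supplies only its own lowering patterns; the pass and pattern names below are placeholders, and the pattern-set type may be spelled differently depending on the MLIR revision:

```cpp
// Sketch of a lowering-style pass that leans on the greedy driver with its own
// patterns and no canonicalization patterns at all; it implicitly depends on
// the driver's visitation order.
void MyLoweringPass::runOnOperation() {
  RewritePatternSet patterns(&getContext());
  patterns.add<MyLoweringPattern>(&getContext());
  if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
    signalPassFailure();
}
```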
