I will send a patch to improve some low hanging things, but here are some other observations:
I’m surprised that Value::getDefiningOp is that expensive (and almost all of the calls are from tryToFold). Is there a way to improve this? This seems like a really core operation that should be fast.
Matching in m_Constant is pretty expensive because it has do do a op->hasTrait<ConstantLike>() query which digs through a bunch of stuff. I think it would be better to cache the results in a DenseSet<OperationName> which would be faster to look up. Such a change would live in OperationFolder, which I’ll prototype to see if it helps things.
propagateLiveness is really slow. I haven’t looked into this, but I suspect there is something algorithmically wrong. My testcase does have a bunch of regions (things like if statements)
OperationEquivalence::computeHash is spending a significant amount of time in hash_short(const char*, size_t). Are we hashing the operation name instead of the uniqued OperationName pointer address?
Thanks for the report Chris! A lot of those look like things that I’ve seen before. I can take a look at some of the items if you’d like(happy to look at any of it really), but I’ll need some kind of repro though (would be awesome if there was one that worked with just upstream IR). I’ve already been optimizing quite a few of these areas recently, but I generally only have TensorFlow models to use for benchmarking (and the CIRCT code seems to be hitting different parts of the codebase).
After spending a bunch of time microoptimizing the wrong thing, it looks like there is a very low hanging thing (I’ll send a patch in the morning after local testing completes): GreedyPatternRewriteDriver populates its worklist inorder, but then grinds through it using pop_back. For cases that do a lot of folding, this is super inefficient. Reversing the worklist once built provides a 33% speedup.
I’ll send in the patch in the morning, it is only a few lines of code. There is more to be done, I’d love help looking at the bulleted items above if you’re interested. To repro, build the circt repository and use this command:
It turns out that my simple patch permutes tons of testcases by inverting the order of constants at the top of a block… which happens because we are “folding” all of the existing constants and reinserting them. While correct, this is really inefficient - it means every std.constant gets fold()'d into an attribute, recreated in the entry block, and then deleted for each iteration of canonicalize.
The test case you have for benchmarking has a single block while your patch is also changing the order for IR that has large nested blocks: common for TF models with while and if operations with large bodies. It’d be good to test these on the TF models that @River707 mentions using before?
Sure, it would be great to take a look at other cases. This iteration order should be better in general though, due to the way the “revisiting” logic works in canonicalize. If you have something like:
a = constant 1
b = add a, a
c = mul b, b
A top down traversal touches each op once, folding the add, then the mul.
Visiting bottom up will hit the mul first (no change), then the add (no change, then the constant (it folds). Folding then puts all the users of the constant back on the worklist, which causes the add to get folded. This puts all the uses of b back on the worklist, which puts the mul on the worklist.
This is only 2x slower than visiting top down in the trivial case like this, but this is much worse for high fanout cases where there are lots of uses hanging off foldable ops.
I’m not aware of any reason a bottom-up traversal would be better in general.
Thanks for the repro. I can take a look at getDefiningOp and OperationEquivalence tomorrow. For propagateLiveness, can you check to see what impact 4e02eb8014c4dd8dd21071947525926bbe8046ef has? I’m not sure if that has been picked up by CIRCT yet, and it had a noticeable impact on some other benchmarks I’ve been testing.
Yeah, I can look into it as this has been a TODO for a while. I think we should be more selective about when we run specific region simplifications, as you mention. Like, eraseUnreachableBlocks should only really be run if we touched the CFG in some way(deleted terminators, added non-entry blocks, inlined regions, etc.). runRegionDCE is similar, though I suspect there are other general algorithmic cleanups that should happen there anyways.
In case it is helpful, there is an even larger example here. It also occurs to me that it would have been easier to do this work had I disabled multithreading.
Very easily could but also a function of the user written folders. E.g., we ran into an issue which took 10 min as someone wrote a bad folder, so it is useful to break that out and see which folders are taking what time vs the pass itself.
So, I took a look at test2 today and I found something interesting. There are a few parts of RegionDCE that can be improved (as expected), but the execution time of it is actually quite fast. With some good ole printf debugging, I found that we were actually spending very little time in region simplification. The reason why it creates problematic compile time is because of the fact that it clears the OpFolder (which results in the problems that you have already identified w.r.t constant behavior) when simplification is successful. I reworked the simplification logic to track operation erasure using the RewriterBase class I added a little while ago and the execution time dropped in half, with the result being pretty much the same as when you don’t run region simplification at all.
If you did bottom-up, you’d catch the “consumer-driven” simplification early - for eg. deadness due to zero use or other erasure that simplifies your producers. I think that might have been the motivation behind doing a post-order walk and extracting from the end. Eg:
func (%a : ..)
b = add a, a
c = mul b, b
y = alloc(c)
...
return x
If you did bottom-up, you eliminate everything in one pass IIUC (note the call to isOpTriviallyDead early in the iteration or via folding). With top-down, you would be 2x worse?
For the nested blocks case, both, with the previous approach and this one, parents are processed before their descendants. I think this part is less important.
I had a two thoughts last night (pretty impressive, I know! and I’m curious what you think about these high level points:
First, it isn’t obvious to me why canonicalize is doing an optimistic DCE pass in the first place. By the same logic, we should be running SCCP (optimistic CCP) as well, and many other things. These seem better split out to their own DCE or ADCE like pass. Do you know of any clients for which this is actually important or can we just drop it entirely?
Second, and a bigger deal, I think the top level loop of GreedyPatternRewriteDriver::simplify is wrong. Boiling it down, it currently looks like this:
do {
changed = doTrivialDCEFoldingAndPatternRewriting()
changed |= doSimplifyRegions()
} while (changed && not too many iterations);
The original idea of this was that CFG simplification can unlock other canonicalizations, e.g. merging two identical blocks can make a conditional branch have the same destination for both the true and false edges. However, the simplify regions simplifications are extremely unlikely to occur in practice (this is why they are pulled out of the main workloop) and the doTrivialDCEFoldingAndPatternRewriting part of the algorithm is already iterative, so it doesn’t need yet-another iteration loop around it. I think that we should change the above loop to:
do {
doTrivialDCEFoldingAndPatternRewriting()
changed = doSimplifyRegions()
} while (changed && not too many iterations);
The implication of this is that cases we only rerun doTrivialDCEFoldingAndPatternRewriting in the unlikely case where “doSimplifyRegions” does something. This will have the effect of halving the iterations in practice.
I think that both of these problems were my bright idea back in the day, before we understood how the stack would go together.
I haven’t done the archeology on this, but I think that I wrote this code originally, and I wasn’t worried about that. I just pounded out the simplest thing possible with the plan of coming back to it later (now apparently) when we had larger testcases that worked with MLIR
Your point is a good one though. I think that it would be best to merge this into the prepass since there is a simple fixed point to trivial deadness, and it is easy to drive it through the SSA def-use chains.
The reason that top-down simplification is important for folding / pattern rewriting is that these happen in the middle of the expression tree, and doing so re-adds the operands and users back to the worklist. Folding the operands away first can cause a lot fewer iterations of the worklist.