Speeding up canonicalize?

Over at the CIRCT project, we find that canonicalize is the major time sink for us. Here’s an example --pass-timing output (release build):

$ firtool aliaksei-test1.fir -lower-to-rtl -disable-output -pass-timing
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5158 seconds

   ---User Time---   ---Wall Time---  --- Name ---
   0.0956 (  3.7%)     0.0956 (  3.8%)  CSE
   0.0029 (  0.1%)     0.0029 (  0.1%)    (A) DominanceInfo
   0.8098 ( 31.6%)     0.8098 ( 32.2%)  Canonicalizer
   0.2367 (  9.2%)     0.2367 (  9.4%)  LowerFIRRTLToRTL
   0.0844 (  3.3%)     0.0378 (  1.5%)  'rtl.module' Pipeline
   0.0844 (  3.3%)     0.0378 (  1.5%)    RTLCleanup
   0.1506 (  5.9%)     0.1506 (  6.0%)  CSE
   0.0126 (  0.5%)     0.0126 (  0.5%)    (A) DominanceInfo
   1.1852 ( 46.3%)     1.1852 ( 47.1%)  Canonicalizer
   2.5624 (100.0%)     2.5158 (100.0%)  Total

Canonicalization is good, but 77% of runtime is a bit excessive. Almost everything is in OperationFolder::tryToFold; the profile looks like this:

Weight Self Weight Symbol Name
222.00 ms 7.0% 222.00 ms mlir::Value::getDefiningOp() const
187.00 ms 5.9% 187.00 ms mlir::detail::constant_op_binder<mlir::Attribute>::match(mlir::Operation*)
123.00 ms 3.8% 123.00 ms mlir::OperationFolder::tryToFold(mlir::OpBuilder&, mlir::Operation*, llvm::SmallVectorImpl<mlir::Value>&, llvm::function_ref<void (mlir::Operation*)>)
117.00 ms 3.6% 117.00 ms processValue(mlir::Value, (anonymous namespace)::LiveMap&)
111.00 ms 3.5% 111.00 ms mlir::applyPatternsAndFoldGreedily(llvm::MutableArrayRef<mlir::Region>, mlir::FrozenRewritePatternList const&, unsigned int)
103.00 ms 3.2% 103.00 ms mlir::DictionaryAttr::get(llvm::StringRef) const
100.00 ms 3.1% 100.00 ms propagateLiveness(mlir::Region&, (anonymous namespace)::LiveMap&)
91.00 ms 2.8% 91.00 ms mlir::Region::isIsolatedFromAbove(llvm::Optional<mlir::Location>)
89.00 ms 2.8% 89.00 ms mlir::simplifyRegions(llvm::MutableArrayRef<mlir::Region>)
76.00 ms 2.4% 76.00 ms mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder)
73.00 ms 2.3% 73.00 ms bool llvm::DenseMapBase<llvm::DenseMap<mlir::Value, llvm::detail::DenseSetEmpty, llvm::DenseMapInfo<mlir::Value>, llvm::detail::DenseSetPair<mlir::Value>>, mlir::Value, llvm::detail::DenseSetEmpty, llvm::DenseMapInfo<mlir::Value>, llvm::detail::DenseSetPair<mlir::Value>>::LookupBucketFor<mlir::Value>(mlir::Value const&, llvm::detail::DenseSetPair<mlir::Value> const*&) const
71.00 ms 2.2% 71.00 ms (anonymous namespace)::LiveMap::setProvedLive(mlir::Operation*)
65.00 ms 2.0% 65.00 ms mlir::NamedAttrList::getDictionary(mlir::MLIRContext*) const
55.00 ms 1.7% 55.00 ms (anonymous namespace)::GreedyPatternRewriteDriver::addToWorklist(mlir::Operation*)

I will send a patch to improve some low hanging things, but here are some other observations:

  • I’m surprised that Value::getDefiningOp is that expensive (and almost all of the calls are from tryToFold). Is there a way to improve this? This seems like a really core operation that should be fast.
  • Matching in m_Constant is pretty expensive because it has to do an op->hasTrait<ConstantLike>() query, which digs through a bunch of stuff. I think it would be better to cache the results in a DenseSet<OperationName>, which would be faster to look up. Such a change would live in OperationFolder, which I’ll prototype to see if it helps things.
  • propagateLiveness is really slow. I haven’t looked into this, but I suspect there is something algorithmically wrong. My testcase does have a bunch of regions (things like if statements).
  • OperationEquivalence::computeHash is spending a significant amount of time in hash_short(const char*, size_t). Are we hashing the operation name instead of the uniqued OperationName pointer address?
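The ConstantLike cache in the second bullet could look something like this toy sketch (not MLIR code: OpName stands in for the uniqued mlir::OperationName, the expensive trait walk is simulated, and all names are made up for illustration):

```cpp
#include <string>
#include <unordered_map>

// OpName stands in for mlir::OperationName: names are uniqued, so the
// pointer itself serves as a cheap hash key.
using OpName = const std::string *;

struct ConstantLikeCache {
  // Memoized answers, keyed by the uniqued name pointer; a hit costs one
  // pointer hash instead of re-walking the op's trait list.
  std::unordered_map<OpName, bool> cache;
  int expensiveQueries = 0;

  // Stand-in for the real op->hasTrait<OpTrait::ConstantLike>() walk.
  bool expensiveTraitQuery(OpName name) { return *name == "std.constant"; }

  bool isConstantLike(OpName name) {
    auto it = cache.find(name);
    if (it != cache.end())
      return it->second; // cache hit: no trait walk
    ++expensiveQueries;
    bool result = expensiveTraitQuery(name);
    cache.emplace(name, result);
    return result;
  }
};
```

Since ConstantLike is a property of the op kind rather than the op instance, the expensive query runs at most once per operation name rather than once per matched op.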

-Chris

Thanks for the report Chris! A lot of those look like things that I’ve seen before. I can take a look at some of the items if you’d like (happy to look at any of it, really), but I’ll need some kind of repro (it would be awesome if there was one that worked with just upstream IR). I’ve already been optimizing quite a few of these areas recently, but I generally only have TensorFlow models to use for benchmarking (and the CIRCT code seems to be hitting different parts of the codebase).

– River

After spending a bunch of time micro-optimizing the wrong thing, it looks like there is a very low-hanging fix (I’ll send a patch in the morning after local testing completes): GreedyPatternRewriteDriver populates its worklist in order, but then grinds through it using pop_back. For cases that do a lot of folding, this is super inefficient. Reversing the worklist once it is built provides a 33% speedup.
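The effect can be sketched with a toy worklist (plain C++ for illustration, not the actual driver code):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// The driver pushes ops in program order but consumes from the back, so
// it visits them bottom-up. Reversing once after population makes the
// same cheap pop_back() consumption yield top-down order.
std::vector<std::string> processOrder(std::vector<std::string> worklist,
                                      bool reverseFirst) {
  if (reverseFirst)
    std::reverse(worklist.begin(), worklist.end());
  std::vector<std::string> visited;
  while (!worklist.empty()) {
    visited.push_back(worklist.back()); // O(1) pop from the end
    worklist.pop_back();
  }
  return visited;
}
```

The reversal is a one-time O(n) cost; the payoff is in how many times folded ops put their users back on the worklist, as discussed below.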

I’ll send the patch in the morning; it is only a few lines of code. There is more to be done, and I’d love help looking at the bulleted items above if you’re interested. To repro, build the circt repository and use this command:

firtool ~/Desktop/example-firrtl/aliaksei-test1.fir -lower-to-rtl -disable-output -pass-timing

The .fir testcase is available here.

I’m going to look at the ConstantLike cache next, and will include an updated profile in the morning.

-Chris

It turns out that my simple patch permutes tons of testcases by inverting the order of constants at the top of a block… which happens because we are “folding” all of the existing constants and reinserting them. While correct, this is really inefficient: it means every std.constant gets fold()'d into an attribute, recreated in the entry block, and then deleted, on each iteration of canonicalize.

Patch is up here: ⚙ D98609 [Canonicalizer] Process regions top-down instead of bottom up & reuse existing constants.

This doesn’t include the caching of the ‘is constant’ check. I’d rather get this piece in before making the patch any larger.
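For illustration, the constant-reuse behavior the patch title describes could be modeled roughly like this (a made-up sketch, not the actual OperationFolder API):

```cpp
#include <map>
#include <string>

// Toy stand-in for a folder-managed constant pool: fold() results are
// keyed here by a string describing the (attribute, type) pair, and an
// existing constant op (represented by an int id) is reused instead of
// being re-materialized in the entry block and deleted again on every
// canonicalize iteration.
struct ConstantPool {
  std::map<std::string, int> existing; // folded value -> constant op id
  int nextId = 0;
  int created = 0; // how many constants were actually materialized

  int getOrCreate(const std::string &foldedValue) {
    auto it = existing.find(foldedValue);
    if (it != existing.end())
      return it->second; // reuse the existing constant; no churn
    ++created;
    int id = nextId++;
    existing.emplace(foldedValue, id);
    return id;
  }
};
```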

The test case you have for benchmarking has a single block, while your patch also changes the order for IR that has large nested blocks, which is common for TF models with while and if operations with large bodies. It’d be good to test this on the TF models that @River707 mentioned using before landing?

Sure, it would be great to take a look at other cases. This iteration order should be better in general though, due to the way the “revisiting” logic works in canonicalize. If you have something like:

a = constant 1
b = add a, a
c = mul b, b

A top down traversal touches each op once, folding the add, then the mul.

Visiting bottom up will hit the mul first (no change), then the add (no change), then the constant (it folds). Folding then puts all the users of the constant back on the worklist, which causes the add to get folded. This puts all the uses of b back on the worklist, which puts the mul on the worklist.

This is only about 2x slower than visiting top-down in a trivial case like this, but it is much worse for high-fanout cases where there are lots of uses hanging off foldable ops.

I’m not aware of any reason a bottom-up traversal would be better in general.
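A toy model of that revisit accounting (not MLIR code; it dedups the worklist with an in-worklist flag, as the real driver does):

```cpp
#include <vector>

struct Op {
  std::vector<int> operands; // producer indices; empty means "constant"
  std::vector<int> users;
  bool folded = false;
};

// Pop the worklist LIFO; an op folds once all of its operands have folded,
// and folding puts its not-yet-queued users back on the worklist, mirroring
// the greedy driver. Returns the number of worklist visits to converge.
int runToFixpoint(std::vector<Op> ops, std::vector<int> worklist) {
  std::vector<bool> inWorklist(ops.size(), false);
  for (int idx : worklist)
    inWorklist[idx] = true;
  int visits = 0;
  while (!worklist.empty()) {
    int idx = worklist.back();
    worklist.pop_back();
    inWorklist[idx] = false;
    ++visits;
    Op &op = ops[idx];
    if (op.folded)
      continue;
    bool allConstant = true;
    for (int operand : op.operands)
      allConstant = allConstant && ops[operand].folded;
    if (!allConstant)
      continue; // fruitless visit; nothing to fold yet
    op.folded = true;
    for (int user : op.users)
      if (!inWorklist[user]) { // re-add users of the folded value
        worklist.push_back(user);
        inWorklist[user] = true;
      }
  }
  return visits;
}
```

On the constant/add/mul chain above, the top-down order folds each op on its first visit, while the bottom-up order burns visits on the mul and add before the constant folds and they are re-queued.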

-Chris

Thanks for the repro. I can take a look at getDefiningOp and OperationEquivalence tomorrow. For propagateLiveness, can you check to see what impact 4e02eb8014c4dd8dd21071947525926bbe8046ef has? I’m not sure if that has been picked up by CIRCT yet, and it had a noticeable impact on some other benchmarks I’ve been testing.

– River

Oh good catch, circt doesn’t have that patch yet. Here’s the profile before it:

Weight Self Weight Symbol Name
152.00 ms 6.1% 152.00 ms propagateLiveness(mlir::Region&, (anonymous namespace)::LiveMap&)
147.00 ms 5.9% 147.00 ms mlir::applyPatternsAndFoldGreedily(llvm::MutableArrayRef<mlir::Region>, mlir::FrozenRewritePatternList const&, unsigned int)
125.00 ms 5.0% 125.00 ms processValue(mlir::Value, (anonymous namespace)::LiveMap&)
93.00 ms 3.7% 93.00 ms mlir::simplifyRegions(llvm::MutableArrayRef<mlir::Region>)
80.00 ms 3.2% 80.00 ms mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder)
75.00 ms 3.0% 75.00 ms (anonymous namespace)::LiveMap::setProvedLive(mlir::Operation*)
61.00 ms 2.4% 61.00 ms mlir::OperationFolder::tryToFold(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, llvm::function_ref<void (mlir::Operation*)>, bool*)
57.00 ms 2.3% 57.00 ms (anonymous namespace)::OperationVerifier::verifyDominanceOfContainedRegions(mlir::Operation&)
55.00 ms 2.2% 55.00 ms deleteDeadness(llvm::MutableArrayRef<mlir::Region>, (anonymous namespace)::LiveMap&)
53.00 ms 2.1% 53.00 ms mlir::Region::isIsolatedFromAbove(llvm::Optional<mlir::Location>)
49.00 ms 1.9% 49.00 ms mlir::Value::getDefiningOp() const
38.00 ms 1.5% 0 s mlir::DictionaryAttr::getWithSorted(mlir::MLIRContext*, llvm::ArrayRef<std::__1::pair<mlir::Identifier, mlir::Attribute> >)
36.00 ms 1.4% 0 s malloc

And here’s the time passes output:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 1.7039 seconds

   ---User Time---   ---Wall Time---  --- Name ---
   0.0716 (  4.1%)     0.0716 (  4.2%)  CSE
   0.0015 (  0.1%)     0.0015 (  0.1%)    (A) DominanceInfo
   0.6702 ( 38.4%)     0.6702 ( 39.3%)  Canonicalizer
   0.2094 ( 12.0%)     0.2094 ( 12.3%)  LowerFIRRTLToRTL
   0.0749 (  4.3%)     0.0344 (  2.0%)  'rtl.module' Pipeline
   0.0749 (  4.3%)     0.0344 (  2.0%)    RTLCleanup
   0.1412 (  8.1%)     0.1412 (  8.3%)  CSE
   0.0091 (  0.5%)     0.0091 (  0.5%)    (A) DominanceInfo
   0.5771 ( 33.1%)     0.5771 ( 33.9%)  Canonicalizer
   1.7444 (100.0%)     1.7039 (100.0%)  Total

I’ll take a look with the patch, as the liveness stuff seems to be a big deal.

-Chris

That patch helped a lot! This is the after effect:

Weight Self Weight Symbol Name
138.00 ms 6.0% 138.00 ms propagateLiveness(mlir::Region&, (anonymous namespace)::LiveMap&)
117.00 ms 5.1% 117.00 ms mlir::applyPatternsAndFoldGreedily(llvm::MutableArrayRef<mlir::Region>, mlir::FrozenRewritePatternList const&, unsigned int)
90.00 ms 3.9% 90.00 ms mlir::simplifyRegions(llvm::MutableArrayRef<mlir::Region>)
90.00 ms 3.9% 90.00 ms mlir::Region::isIsolatedFromAbove(llvm::Optional<mlir::Location>)
77.00 ms 3.3% 77.00 ms mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder)
57.00 ms 2.4% 57.00 ms deleteDeadness(llvm::MutableArrayRef<mlir::Region>, (anonymous namespace)::LiveMap&)
55.00 ms 2.3% 55.00 ms mlir::OperationFolder::tryToFold(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, llvm::function_ref<void (mlir::Operation*)>, bool*)
51.00 ms 2.2% 51.00 ms mlir::Value::getDefiningOp() const
49.00 ms 2.1% 49.00 ms mlir::DictionaryAttr::get(llvm::StringRef) const
45.00 ms 1.9% 45.00 ms (anonymous namespace)::LiveMap::setProvedLive(mlir::Operation*)
43.00 ms 1.8% 43.00 ms mlir::wouldOpBeTriviallyDead(mlir::Operation*)
36.00 ms 1.5% 36.00 ms mlir::detail::constant_op_binder<mlir::Attribute>::match(mlir::Operation*)
36.00 ms 1.5% 36.00 ms (anonymous namespace)::OperationVerifier::verifyDominanceOfContainedRegions(mlir::Operation&)
27.00 ms 1.1% 27.00 ms malloc
27.00 ms 1.1% 27.00 ms free
26.00 ms 1.1% 26.00 ms mlir::OperationFolder::tryToFold(mlir::OpBuilder&, mlir::Operation*, llvm::SmallVectorImpl<mlir::Value>&, llvm::function_ref<void (mlir::Operation*)>)
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 1.5182 seconds

   ---User Time---   ---Wall Time---  --- Name ---
   0.0841 (  5.4%)     0.0841 (  5.5%)  CSE
   0.0017 (  0.1%)     0.0017 (  0.1%)    (A) DominanceInfo
   0.5847 ( 37.5%)     0.5847 ( 38.5%)  Canonicalizer
   0.2028 ( 13.0%)     0.2028 ( 13.4%)  LowerFIRRTLToRTL
   0.0711 (  4.6%)     0.0321 (  2.1%)  'rtl.module' Pipeline
   0.0711 (  4.6%)     0.0321 (  2.1%)    RTLCleanup
   0.1276 (  8.2%)     0.1276 (  8.4%)  CSE
   0.0085 (  0.5%)     0.0085 (  0.6%)    (A) DominanceInfo
   0.4869 ( 31.3%)     0.4869 ( 32.1%)  Canonicalizer
   1.5572 (100.0%)     1.5182 (100.0%)  Total

Do you know anything about the liveness stuff in canonicalize? Should it run each iteration, or is there a cheaper way to do this?

-Chris

Some other possibly interesting data:

  • Hacking mlir::simplifyRegions into a noop (commenting out its body) halves the time in canonicalize (both passes).
  • mergeIdenticalBlocks seems pretty cheap.
  • disabling just runRegionDCE speeds up Canonicalize by 45%
  • disabling just eraseUnreachableBlocks speeds up Canonicalize by 19% (but exposes a circt bug, which is now fixed).

I think this is where the next round of analysis should go, do you have cycles to investigate this?

-Chris

Yeah, I can look into it as this has been a TODO for a while. I think we should be more selective about when we run specific region simplifications, as you mention. For example, eraseUnreachableBlocks should only really be run if we touched the CFG in some way (deleted terminators, added non-entry blocks, inlined regions, etc.). runRegionDCE is similar, though I suspect there are other general algorithmic cleanups that should happen there anyways.
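That gating idea could be sketched as follows (all flag and function names here are made up for illustration; they are not the actual driver's API):

```cpp
// Track which kinds of IR mutations happened during the rewrite phase,
// then run each region simplification only if a relevant mutation occurred.
struct SimplificationState {
  bool erasedTerminator = false;
  bool addedNonEntryBlock = false;
  bool inlinedRegion = false;
  bool erasedOp = false;
};

// Unreachable blocks can only appear if the CFG itself was touched.
bool shouldEraseUnreachableBlocks(const SimplificationState &s) {
  return s.erasedTerminator || s.addedNonEntryBlock || s.inlinedRegion;
}

// Region DCE is only worth re-running after erasures or structural change.
bool shouldRunRegionDCE(const SimplificationState &s) {
  return s.erasedOp || s.erasedTerminator || s.inlinedRegion;
}
```

In the common case where patterns only fold values in place, both predicates stay false and the expensive simplifications are skipped entirely.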

– River

Awesome, thank you River!

-Chris

In case it is helpful, there is an even larger example here. It also occurs to me that it would have been easier to do this work had I disabled multithreading.

There are cases where the Canonicalizer dominates the execution time.
I also wonder about memory consumption? Probably a big chunk of that 800MB?

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 497.5140 seconds

   ---Wall Time---  --- Name ---
   1.6782 (  0.3%)  CSE
   0.0258 (  0.0%)    (A) DominanceInfo
  12.2527 (  2.5%)  Canonicalizer
   6.2663 (  1.3%)  'firrtl.circuit' Pipeline
   6.2663 (  1.3%)    LowerFIRRTLTypes
   5.1700 (  1.0%)  LowerFIRRTLToRTL
   1.3908 (  0.3%)  'rtl.module' Pipeline
   1.3908 (  0.3%)    RTLCleanup
   2.5205 (  0.5%)  CSE
   0.2930 (  0.1%)    (A) DominanceInfo
  468.2355 ( 94.1%)  Canonicalizer
  497.5140 (100.0%)  Total

{
  totalTime: 507.605,
  maxMemory: 867676160
}

It very easily could be, but it is also a function of the user-written folders. E.g., we ran into an issue where compilation took 10 minutes because someone wrote a bad folder, so it is useful to break that out and see which folders are taking what time vs. the pass itself.

So, I took a look at test2 today and I found something interesting. There are a few parts of RegionDCE that can be improved (as expected), but its execution time is actually quite fast. With some good ole printf debugging, I found that we were actually spending very little time in region simplification. The reason it creates problematic compile time is that it clears the OpFolder (which results in the problems that you have already identified w.r.t. constant behavior) when simplification is successful. I reworked the simplification logic to track operation erasure using the RewriterBase class I added a little while ago, and the execution time dropped in half, with the result being pretty much the same as when you don’t run region simplification at all.

I’ll try to prepare a patch series tomorrow.

– River

If you did bottom-up, you’d catch the “consumer-driven” simplification early - e.g., deadness due to zero uses, or other erasure that simplifies your producers. I think that might have been the motivation behind doing a post-order walk and extracting from the end. E.g.:

func (%a : ..)
  b = add a, a
  c = mul b, b
  y = alloc(c)
  ...
  return x

If you did bottom-up, you’d eliminate everything in one pass IIUC (note the call to isOpTriviallyDead early in the iteration, or via folding). With top-down, you would be 2x worse?

For the nested blocks case, both, with the previous approach and this one, parents are processed before their descendants. I think this part is less important.

Hey River,

I had two thoughts last night (pretty impressive, I know! :slight_smile:) and I’m curious what you think about these high level points:

First, it isn’t obvious to me why canonicalize is doing an optimistic DCE pass in the first place. By the same logic, we should be running SCCP (optimistic CCP) as well, and many other things. These seem better split out into their own DCE or ADCE-like pass. Do you know of any clients for which this is actually important, or can we just drop it entirely?

Second, and a bigger deal, I think the top level loop of GreedyPatternRewriteDriver::simplify is wrong. Boiling it down, it currently looks like this:

  do {
    changed = doTrivialDCEFoldingAndPatternRewriting()
    changed |= doSimplifyRegions()
  } while (changed && not too many iterations);

The original idea of this was that CFG simplification can unlock other canonicalizations, e.g. merging two identical blocks can make a conditional branch have the same destination for both the true and false edges. However, the simplify-regions simplifications are extremely unlikely to occur in practice (this is why they are pulled out of the main work loop), and the doTrivialDCEFoldingAndPatternRewriting part of the algorithm is already iterative, so it doesn’t need yet another iteration loop around it. I think that we should change the above loop to:

  do {
    doTrivialDCEFoldingAndPatternRewriting()
    changed = doSimplifyRegions()
  } while (changed && not too many iterations);

The implication is that we only rerun doTrivialDCEFoldingAndPatternRewriting in the unlikely case where “doSimplifyRegions” does something. This will have the effect of halving the iterations in practice.
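A toy comparison of the two loops makes the saved iteration concrete (the stand-in rewriteWork() converges after its first call and simplifyRegions() never finds anything, which is the common case described above; all names are made up):

```cpp
struct Counters {
  int rewrites = 0;
  int simplifies = 0;
};

// Stand-in for doTrivialDCEFoldingAndPatternRewriting: does work on its
// first call (already iterating to a fixed point internally), then no-ops.
bool rewriteWork(Counters &c) { return ++c.rewrites == 1; }

// Stand-in for doSimplifyRegions: in practice it rarely changes anything.
bool simplifyRegions(Counters &c) { ++c.simplifies; return false; }

// Current loop: any change, including from the rewrite phase itself,
// triggers another full outer iteration.
int runOldLoop(Counters &c) {
  bool changed;
  int iters = 0;
  do {
    ++iters;
    changed = rewriteWork(c);
    changed |= simplifyRegions(c);
  } while (changed);
  return iters;
}

// Proposed loop: only a successful region simplification triggers a rerun.
int runNewLoop(Counters &c) {
  bool changed;
  int iters = 0;
  do {
    ++iters;
    rewriteWork(c); // result intentionally ignored
    changed = simplifyRegions(c);
  } while (changed);
  return iters;
}
```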

I think that both of these problems were my bright idea back in the day, before we understood how the stack would go together.

-Chris

I haven’t done the archeology on this, but I think that I wrote this code originally, and I wasn’t worried about that. I just pounded out the simplest thing possible with the plan of coming back to it later (now apparently) when we had larger testcases that worked with MLIR :slight_smile:

Your point is a good one though. I think that it would be best to merge this into the prepass since there is a simple fixed point to trivial deadness, and it is easy to drive it through the SSA def-use chains.

The reason that top-down simplification is important for folding / pattern rewriting is that these happen in the middle of the expression tree, and doing so puts the operands and users back on the worklist. Folding the operands away first can mean far fewer trips through the worklist.

-Chris