Parallelizing the verifier

Over in the CIRCT project, the verifier (in -verify-each mode) is slower than the rest of the compiler. While we could turn off -verify-each, I like its bug-finding ability and would prefer to go as far as we can without doing this.

There are a few ways to speed up the verifier, but the lowest-hanging fruit is to parallelize it. Particularly for module passes, it would be nice to verify the functions in parallel with each other.

There are a couple of concerns about this: 1) while verification hooks are generally read-only, we don't actually have a declared policy about this; 2) the granularity of an individual verify call on an op is so small that the orchestration overhead may outweigh the benefit of parallelization.

I think the answer to this is to do an “effective in practice” hack of some sort. The OperationVerifier::verifyOpAndDominance method in Verifier.cpp walks the region tree of the IR, recursively verifying an op before its enclosed regions in preorder (it actually makes two different passes, one for op verification and one for dominance, but that’s a detail I’ll ignore here).
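A minimal sketch of that recursive structure, in case it helps (verifyOperation here is a hypothetical stand-in for the per-op checks; the real code in Verifier.cpp is more involved):

    // Simplified sketch of the preorder walk described above. The real
    // OperationVerifier splits op verification and dominance checking into
    // separate traversals.
    LogicalResult verifyOpAndChildren(Operation *op) {
      // Verify the op itself before descending into its regions (preorder).
      if (failed(verifyOperation(op)))
        return failure();
      for (Region &region : op->getRegions())
        for (Block &block : region)
          for (Operation &child : block)
            if (failed(verifyOpAndChildren(&child)))
              return failure();
      return success();
    }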

In order to address the granularity issue, I think we can have the verifier parallelize the verification of the top-most region that contains “isolated from above” ops. This would ensure that functions get verified in parallel, and addresses the (fairly theoretical) concurrency issue by aligning with our existing parallelization strategy for “isolated from above” ops.

Does anyone have any concerns about this, or a better idea?

-Chris

I think it’s not uncommon to have more levels of hierarchy, so the interesting bits to parallelize aren’t necessarily in the ‘top-most region’. How about, instead, checking any isolated-from-above operation in parallel unless it is already in a region being checked in parallel? I think that would end up being the same for functions-in-modules, but it would also allow other scenarios that aren’t necessarily at the top level.
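To make that concrete, here is a minimal sketch of this refined rule, assuming MLIR’s failableParallelForEach helper (mlir/IR/Threading.h) and the hypothetical verifyOperation from the earlier sketch:

    // Spawn parallel tasks at IsolatedFromAbove ops, but only if we are not
    // already inside a subtree that is being verified in parallel.
    LogicalResult verifyTree(MLIRContext *ctx, Operation *op,
                             bool alreadyParallel) {
      if (failed(verifyOperation(op)))  // hypothetical single-op checks
        return failure();
      SmallVector<Operation *> sequential, parallel;
      for (Region &region : op->getRegions())
        for (Block &block : region)
          for (Operation &child : block) {
            bool spawn = !alreadyParallel &&
                         child.hasTrait<OpTrait::IsIsolatedFromAbove>();
            (spawn ? parallel : sequential).push_back(&child);
          }
      for (Operation *child : sequential)
        if (failed(verifyTree(ctx, child, alreadyParallel)))
          return failure();
      return failableParallelForEach(ctx, parallel, [&](Operation *child) {
        return verifyTree(ctx, child, /*alreadyParallel=*/true);
      });
    }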

That seems like a good strategy to me!

We should just make it const! Oh wait… :wink:

Ok, it will take more restructuring of the code, but I can take a crack at this.

It turns out that checking op.hasTrait<OpTrait::IsIsolatedFromAbove>() is more expensive than the benefit we get from the parallelism. It looks like we need some sort of trait cache or something.
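For illustration, one possible shape for such a cache (hypothetical, not existing MLIR API) is to memoize the lookup per operation name, since all ops sharing an OperationName have the same traits:

    // Memoize the IsIsolatedFromAbove lookup per operation name, so the
    // per-op cost becomes one hash lookup instead of a trait-list walk.
    // (A real cache would need to be thread-local or synchronized.)
    static bool isIsolatedFromAbove(Operation *op,
                                    DenseMap<OperationName, bool> &cache) {
      auto it = cache.find(op->getName());
      if (it != cache.end())
        return it->second;
      bool isolated = op->hasTrait<OpTrait::IsIsolatedFromAbove>();
      cache.try_emplace(op->getName(), isolated);
      return isolated;
    }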

-Chris

Do we really need to work around the case of a verifier that makes modifications? I think it would be quite reasonable to have a policy that this is not allowed.

It would be even better if we could enforce that property somehow, like in the compiler

:stuck_out_tongue:

So if we drop that requirement we don’t need isolated-from-above, right? It might be worth profiling and making sure “everything in parallel” is actually bad. Assuming it is (my gut says so too), can we just use the existence of regions as the deciding factor for whether to parallelize? That should be fast to check. Not sure if there’s another notion of region size we can access quickly.
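Something like the following (a sketch of the heuristic, not existing code) would be cheap per op:

    // Only fan out into a parallel task when the op actually has a
    // non-empty nested region; leaf ops are verified inline.
    static bool shouldSpawnVerifierTask(Operation *op) {
      for (Region &region : op->getRegions())
        if (!region.empty())
          return true;
      return false;
    }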

That said, maybe a trait cache is a good idea anyway. I just think going to great lengths to support the obviously bad idea of modifying something during verification would be a bit strange.

Parallelizing the verifier SGTM in general.

For interfaces, computing a pointer hash is generally more expensive than a simple binary search. I have considered in the past storing the traits in a SmallVector on the AbstractOperation, which would make it easy to do a binary search instead of a linear scan (and would also remove the indirect call).
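As a sketch of that idea (none of this is existing MLIR API), the registered op info could carry the trait TypeIDs as a sorted array built once at registration time:

    // Hypothetical trait table: TypeIDs stored sorted (by opaque pointer)
    // at op registration time, so hasTrait() becomes a binary search rather
    // than an indirect call plus a linear scan.
    struct OpTraitTable {
      SmallVector<const void *, 8> sortedTraitIDs;  // sorted once
      bool hasTrait(TypeID id) const {
        return std::binary_search(sortedTraitIDs.begin(),
                                  sortedTraitIDs.end(),
                                  id.getAsOpaquePointer());
      }
    };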

– River

I don’t think we have anything explicitly stating it, but this has definitely been the expectation.

– River

To be clear: I already consider this the expectation; I was on board with “isolated from above” as a good heuristic for the parallelism granularity, nothing more. Of course, if checking the trait on every op is too expensive, that complicates things :slight_smile:

But that doesn’t address the perhaps more common case of an extremely large block in a function, where verification parallelism is also desired; i.e., with parallelism only across IsolatedFromAbove ops, you don’t get scalable parallelism but just some fixed constant amount of it. The case here doesn’t say how many functions there were or how large they were.

I think we should put a “#pragma omp parallel for” on the verifier loop that iterates over the ops in a block, and not just do function-level op parallelism for the IR verifier.
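In MLIR terms (MLIR doesn’t use OpenMP internally), this would look roughly like the following, again assuming the failableParallelForEach helper and a hypothetical verifyOperation:

    // Verify every op in a (potentially huge) block in parallel, instead of
    // only fanning out at function boundaries.
    LogicalResult verifyBlockInParallel(MLIRContext *ctx, Block &block) {
      return failableParallelForEach(
          ctx, block.getOperations(), [&](Operation &op) {
            return verifyOperation(&op);  // hypothetical single-op checks
          });
    }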

We should really document here that the choice not to use const takes away the enforcement mechanism for something like the verifier! Even without parallelization, we would still want that. And finally, just having a declared policy perhaps falls short. There could be a way to compute a hash of the whole IR (just on the top-level ModuleOp: print to a string buffer and hash it) that is checked before and after to see whether the IR was modified by the verifier, either when asserts are enabled or just for debugging purposes.
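A minimal sketch of that check, assuming it would live behind assertions as a debugging aid (this is not existing MLIR functionality):

    // Hash the textual form of the IR before and after verification to
    // catch verifiers that mutate; debug builds only, printing is expensive.
    static llvm::hash_code hashIR(Operation *op) {
      std::string buffer;
      llvm::raw_string_ostream os(buffer);
      op->print(os);
      return llvm::hash_value(os.str());
    }

    LogicalResult checkedVerify(Operation *module) {
      llvm::hash_code before = hashIR(module);
      LogicalResult result = verify(module);  // mlir::verify, Verifier.h
      assert(before == hashIR(module) && "verifier mutated the IR");
      return result;
    }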

You mean conceptually, right?

The doc already has:

The major advantage of allowing const on MLIR types is as a marker in APIs that indicate that the function will not modify the specified values.

Which is exactly what this is about, to me. The downsides and costs/trade-offs/impact in practice explained there still remain. Enforcement needs a different mechanism: const is not enough IMHO, even though it is very useful, especially for folks who accidentally mutate or call utility functions that do. For enforcement we could consider randomizing the order of verification when running in parallel (and even when running it sequentially), along with TSAN.
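As a sketch, the randomization piece could be as simple as shuffling the verification worklist before running it (the harness itself is hypothetical; verify is mlir::verify):

    #include <algorithm>
    #include <random>

    // Shuffle the ops before verifying so any accidental dependence on
    // verification order surfaces quickly (especially under TSAN when the
    // same list is also processed in parallel).
    LogicalResult verifyInRandomOrder(MutableArrayRef<Operation *> ops) {
      std::shuffle(ops.begin(), ops.end(),
                   std::mt19937(std::random_device{}()));
      for (Operation *op : ops)
        if (failed(verify(op)))
          return failure();
      return success();
    }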

+1. And I don’t think there have been any expectations set about verification order that would have really allowed depending on this. Of course, in practice there is an order and folks could have depended on it, so we might need an update to the documentation plus a way to flush out issues before we flip it.

Having spent a bit of time looking at this, I think the use of parallelism in MLIR can be a lot better than it is. It gets a lot of the broad strokes right, but the stuff it builds on in LLVM (e.g. TaskGroup, Support/Parallel.h) is “not great”.

Ignoring the verifier issue (e.g. turning it off between passes), I’m seeing a ~2x speedup from using multithreaded optimizations on “function passes” on my 8-core laptop, but it comes with a 2x increase in user time, which seems egregiously high. This is on a Mac running macOS 10.15, not Linux.

I haven’t had time to dig into this, but I suspect there are a few things that are just obviously wrong. If that is true, then any work to parallelize the verifier won’t be showing the speedup expected, and constant time costs like checking traits will dominate.

-Chris

Concrete examples (on a large .fir file), single-threaded:

/usr/bin/time  firtool large_core.lo.fir -verilog -o u8_maxed_out.lo.fir.v -pass-timing --mlir-disable-threading -verify-each=false
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 12.6233 seconds

   ---Wall Time---  --- Name ---
   5.7567 ( 45.6%)  'firrtl.circuit' Pipeline
   5.7567 ( 45.6%)    'firrtl.module' Pipeline
   2.8520 ( 22.6%)      CSE
   0.1146 (  0.9%)        (A) DominanceInfo
   2.9047 ( 23.0%)      SimpleCanonicalizer
   3.2858 ( 26.0%)  LowerFIRRTLToRTL
   0.0024 (  0.0%)  RTLMemSimImpl
   3.3314 ( 26.4%)  'rtl.module' Pipeline
   0.0555 (  0.4%)    RTLCleanup
   1.5444 ( 12.2%)    CSE
   0.0802 (  0.6%)      (A) DominanceInfo
   1.7315 ( 13.7%)    SimpleCanonicalizer
   0.2469 (  2.0%)  RTLLegalizeNames
  12.6233 (100.0%)  Total
       38.82 real        36.90 user         1.70 sys

Multithreaded:

/usr/bin/time  firtool large_core.lo.fir -verilog -o u8_maxed_out.lo.fir.v -pass-timing  -verify-each=false   
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 4.2981 seconds

   ---User Time---   ---Wall Time---  --- Name ---
  12.4374 ( 62.0%)     2.0203 ( 47.0%)  'firrtl.circuit' Pipeline
  12.4374 ( 62.0%)     2.0203 ( 47.0%)    'firrtl.module' Pipeline
   6.1484 ( 30.7%)     1.5556 ( 36.2%)      CSE
   0.3001 (  1.5%)     0.1175 (  2.7%)        (A) DominanceInfo
   6.2890 ( 31.4%)     0.4647 ( 10.8%)      SimpleCanonicalizer
   1.0549 (  5.3%)     1.0549 ( 24.5%)  LowerFIRRTLToRTL
   0.0025 (  0.0%)     0.0025 (  0.1%)  RTLMemSimImpl
   6.2986 ( 31.4%)     0.9556 ( 22.2%)  'rtl.module' Pipeline
   0.1195 (  0.6%)     0.0087 (  0.2%)    RTLCleanup
   2.9298 ( 14.6%)     0.7424 ( 17.3%)    CSE
   0.1950 (  1.0%)     0.0234 (  0.5%)      (A) DominanceInfo
   3.2494 ( 16.2%)     0.2045 (  4.8%)    SimpleCanonicalizer
   0.2647 (  1.3%)     0.2647 (  6.2%)  RTLLegalizeNames
  20.0582 (100.0%)     4.2981 (100.0%)  Total
       28.28 real        46.10 user         3.05 sys

This isn’t measuring some parts of firtool (notably, parsing and building IR for the 700M .fir file :wink:), but @fabianschuiki is working on improving that.

The thing that is notable to me is the big difference between wall time when single-threaded and user time when multithreaded; e.g., the first SimpleCanonicalizer pass goes from 2.9s of wall time to 6.3s of user time. It is also interesting that this effect isn’t seen in LowerFIRRTLToRTL (which is a “module pass” that internally uses parallel_for), so I suspect this has something to do with the pass manager mechanics.

-Chris

In my experience so far, there are two notable areas that are hurting us the most when it comes to multithreading:

  • TaskGroup and Support/Parallel.h
    Mostly self-explanatory, but the use cases that have been covered up until now are not really the same as what we want to do.
  • Lack of a concurrent hashmap
    A lot of things are way too lock-happy for my taste, and I chalk a lot of this up to the fact that we don’t have a good concurrent hashmap that we can use (see the sketch below).

tl;dr: IMO we are missing a lot of the nice fundamental utilities that make threading easier to implement and easier to scale.
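For illustration, the usual workaround in the absence of a true concurrent hashmap is a sharded map: a fixed array of (mutex, DenseMap) pairs selected by key hash. This is a generic sketch, not an LLVM/MLIR utility:

    #include <array>
    #include <mutex>

    // Sharded map sketch: contention drops roughly by the shard count,
    // at the cost of per-lookup locking.
    template <typename K, typename V, unsigned NumShards = 16>
    class ShardedMap {
      struct Shard {
        std::mutex lock;
        llvm::DenseMap<K, V> map;
      };
      std::array<Shard, NumShards> shards;

      Shard &getShard(const K &key) {
        return shards[llvm::DenseMapInfo<K>::getHashValue(key) % NumShards];
      }

    public:
      V lookupOrInsert(const K &key, V value) {
        Shard &shard = getShard(key);
        std::lock_guard<std::mutex> guard(shard.lock);
        return shard.map.try_emplace(key, value).first->second;
      }
    };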

– River

I haven’t looked at it recently, but it’s worth noting that parallelism requires us to clone the actual Pass instance in memory. If some passes have a lot of state, or do a lot of work in their initialization, that could show up here.

It might be worthwhile to split the pass prototype (with pieces like rewrite patterns gathered from the context) from the actual execution state of the pass (replicated per operation): something like a PassState that gets allocated per async/sync executor and can be cleared between ops (for the sake of reusing already-allocated memory in sets, vectors, etc.).
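A rough sketch of that split (all names here, including the PassState-style struct, are hypothetical, not existing pass infrastructure):

    // The prototype holds shared, immutable configuration; the per-executor
    // state is cheap to clone and cleared between ops to reuse allocations.
    struct CanonicalizeState {
      DenseSet<Operation *> worklistSet;  // scratch memory, reused across ops
      void clear() { worklistSet.clear(); }
    };

    class HypotheticalCanonicalizer {
      FrozenRewritePatternSet patterns;  // built once, shared by all clones
      CanonicalizeState state;           // replicated per async/sync executor

      void runOnOperation(Operation *op) {
        state.clear();  // keep the allocated buckets, drop the contents
        // ... apply `patterns` to `op`, using `state` as scratch space ...
      }
    };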

The SimpleCanonicalizer in @clattner’s example keeps the rewrite patterns in the pass. Not sure about the semantics of FrozenRewritePatternSet, but I guess that still makes a deep copy upon pass cloning for multithreading.

It does not make a deep copy; all copies share the same set of patterns. This is part of why FrozenRewritePatternSet was added in the first place.

– River

+1, my worry is not verifiers mutating the IR, but rather verifiers that depend on parent ops already having been verified. For example, if the verifier for ReturnOp cannot assume that the enclosing FuncOp has already been verified, it would need to check itself that the FuncOp has a “type” attribute and that it holds a FunctionType.
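To make the concern concrete, a ReturnOp-style verifier that cannot rely on its parent having been verified has to redo those checks defensively; a sketch, using the builtin FuncOp’s “type” attribute layout:

    // Without an ordering guarantee, the parent's "type" attribute may be
    // missing or of the wrong kind, so every step must be checked.
    LogicalResult verifyReturnLike(Operation *op) {
      Operation *parent = op->getParentOp();
      auto typeAttr = parent->getAttrOfType<TypeAttr>("type");
      if (!typeAttr)
        return op->emitOpError("expected parent to have a 'type' attribute");
      auto fnType = typeAttr.getValue().dyn_cast<FunctionType>();
      if (!fnType)
        return op->emitOpError("expected parent 'type' to be a FunctionType");
      // ... check the operand types against fnType.getResults() ...
      return success();
    }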

Does the pass manager copy-construct the new passes or reconstruct them?

The new passes are copy-constructed from the old passes.

– River