Introduce std.inlined_call op - proposal

Hi all,

It would be really useful to introduce an “inlined_call” operation whose semantics are to execute its region exactly once.

  1. The main goals are to (i) allow inlining in high-level IR in the presence of other ops with regions (e.g. affine.for, loop.for), and (ii) provide an abstraction for other imperative “function like” ops to lower to.

Here’s a description of the op.

The inlined_call operation executes the region it holds exactly once, with control returning to the point right after the op. The op cannot have any operands, nor does its region have any arguments. All SSA values that dominate the op can be accessed inside its region. Its region’s blocks can have terminators the same way as the blocks of MLIR functions (FuncOp) can. Control from any return op at the top level of its region returns to right after the inlined_call op. The results of the inlined_call op match 1:1 with the return values from its region’s blocks.

  Ex:
      loop.for %i = 0 to 128 {
        %y = std.inlined_call -> i32 {
          %x = load %A[%i] : memref<128xi32>
          return %x : i32
        }
      }
  1. Not having an abstraction like this would mean that one is unable to inline a function without lowering all other higher level control flow constructs to the flat list of blocks / std CFG form.

  2. The op would keep all transformations on the higher-level ops (like affine.for/if, loop.for/if) available in the presence of the inlined calls. The inlined calls continue to benefit from propagation of SSA values across their top boundary. Functions won’t have to remain outlined until the aforementioned higher-level ops have been lowered out, which could be later than desired.

  3. The op can live in the standard dialect (std.inlined_call) given the really simple implications as far as control flow goes + it could be targeted by many dialects.

  4. As examples, abstractions like affine grayboxes, lambdas with implicit captures (or even explicit when possible) could be lowered to this without first lowering out structured loops/ifs or outlining. But the initial use case is the “all implicit capture” one. In particular, an affine.graybox with > 1 block in its region is nicely lowered to such an std.inlined_call (those with 1 block can be readily inlined as is).

  5. The “inlined_call” op is then easily lowered away by the existing inlining pass/utility once the remaining non standard control flow ops have been converted to std control flow (flat list of blocks).
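As a rough illustration of point 5, here is what the example above might look like once loop.for has been converted to a flat list of blocks and the region has been spliced in. This is only a sketch: the block names and the index constants %c0, %c1, %c128 are assumptions made for illustration, and the actual output of the lowering may differ.

```mlir
// Hypothetical result after lowering loop.for and inlining the region.
// %A, %c0, %c1, %c128 are assumed to be defined earlier in the function.
  br ^header(%c0 : index)
^header(%i: index):
  %cond = cmpi "slt", %i, %c128 : index
  cond_br %cond, ^body, ^exit
^body:
  // Former region body; its `return %x` becomes a branch that
  // forwards %x as the op's result.
  %x = load %A[%i] : memref<128xi32>
  br ^cont(%x : i32)
^cont(%y: i32):
  // %y stands in for the result of the former std.inlined_call.
  %next = addi %i, %c1 : index
  br ^header(%next : index)
^exit:
  return
```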

The exact syntax with the implementation introducing it is here:

Any comments/feedback before I can submit this for review?

I’m not sure if I quite understand the point here. It seems like the goal is to introduce a standard CFG form inside a region which does not have the standard CFG form… Is this really more attractive than doing inter-procedural transformations? I’m missing why inlining the call completely into a loop.for() body is not desirable.

If I understand this correctly, the main thing this allows is shielding an outer region from the control flow abstractions used by the inlined code? For example, the outer region might assert it has only a single block, and this op allows passing that verification for arbitrary contained code.

So in a sense, this is a primitive that captures a SESE control flow invariant in the IR. That seems like a really powerful modeling tool. It seems less about inlining per se.

@mehdi_amini this is related to what we were talking about the other day w.r.t. modeling structured control flow.

I read this the same way as @_sean_silva: this seems akin to structurally nesting a SESE subset of a CFG. This can be a useful thing to have; I wonder what kind of trait/properties would define the legality of eliminating the nesting (actually “inlining” the region in the parent).

Hi Sean,

That’s right - that’s one way to look at this.

That’s right - this is just reflective of a function that has “just been inlined”. There are other ways/lowering paths on which one could end up with this op, typically from custom dialect ops that have regions. (For example, the proposed affine.graybox becomes an std.inlined_call once its explicit captures are propagated and thus eliminated.) I couldn’t immediately find a better name; std.exec_and_return, std.exec_region, etc. are other possibilities.


In addition to what Sean already mentioned, please note that:

  1. You can’t always inline the call into a loop.for because loop.for can only hold a single block while the callee could have multiple blocks. That could perhaps also have been the reason the IR was in an outlined form to start with (in the absence of an abstraction like std.inlined_call). Similarly, one can’t always inline into an affine.for or affine.if (more restrictive than loop.for/if), and in general into any op that places restrictions that inlining would break.

  2. The benefits of “inlined regions where SSA values are accessed via standard dominance” vs “outlined / interprocedural” have, I think, been discussed before here (and are one of the motivations for regions). With the former, you get the benefit of all standard SSA canonicalizations without having to maintain/update arguments, move IR across function boundaries, and run passes concurrently on functions.
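To make point 1 concrete, here is a rough sketch using the syntax from the example in the original post. The body is made up for illustration (%A, %zero and %c1 are assumed to be defined above the loop and to dominate it). The region has three blocks, so it could not be spliced directly into the single-block loop.for body, but it can sit there wrapped in the op:

```mlir
// Hypothetical sketch; %zero and %c1 are assumed i32 constants.
loop.for %i = 0 to 128 {
  %y = std.inlined_call -> i32 {
    // Entry block: values from outside are used directly via dominance.
    %x = load %A[%i] : memref<128xi32>
    %p = cmpi "slt", %x, %zero : i32
    cond_br %p, ^bb1, ^bb2
  ^bb1:  // negative value: yield zero instead
    return %zero : i32
  ^bb2:  // non-negative value: yield the value plus one
    %s = addi %x, %c1 : i32
    return %s : i32
  }
}
```

Either return hands its value back as %y, and control continues right after the op inside the loop body.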

(Sorry for the late reply. Still recovering from the new year)

I’m generally in favor of this, for some of the reasons you mentioned. I just have a few questions/comments.

  • Do you intend to replicate many/all of the ABI-ish characteristics of FuncOp: calling convention, attributes (e.g. argument attributes, and those that would be required in LLVM, like stack protector level)? I wonder if in some cases we would just temporarily inline, do some optimization, and then outline back into a function.

  • How would this compare to something like an inlined_func operation, which seems to be similar to this except for the fact that the user would use a call_indirect for the actual dispatch?

Hi River,

Reg. your two questions:

  1. This op will not have any region arguments/operands - so, argument attributes and related things aren’t applicable. (They’d have already been “propagated”.) For the other ABI kind of characteristics of func op, I haven’t thought of how they’d manifest here - it depends. The approach to temporarily turn into an std.inlined_call, do some optimization, and outline back sounds interesting. I haven’t thought about this, but it looks like a nice alternative to interprocedural optimization and with clear benefits in some cases.

  2. With call_indirect, you’d still have operands and arguments; std.inlined_call doesn’t. So, although they conceptually represent the same thing, they are vastly different in how they get values from outside, and thus for all optimization/analysis.
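A small sketch of the difference described in point 2. Since inlined_func is only a hypothetical op here, the outlined side uses an ordinary function plus `constant @body` to obtain the callee value; the function name @body and the values %A and %i are assumptions for illustration.

```mlir
// Outlined form: values cross the boundary as arguments/operands.
func @body(%A: memref<128xi32>, %i: index) -> i32 {
  %x = load %A[%i] : memref<128xi32>
  return %x : i32
}

// At the use site, inside some other function where %A and %i are defined:
%f = constant @body : (memref<128xi32>, index) -> i32
%y0 = call_indirect %f(%A, %i) : (memref<128xi32>, index) -> i32

// std.inlined_call form: no operands, no region arguments;
// %A and %i are referenced directly because they dominate the op.
%y1 = std.inlined_call -> i32 {
  %x = load %A[%i] : memref<128xi32>
  return %x : i32
}
```

In the second form, any canonicalization that reasons about %A or %i sees the uses directly, without having to look through a call boundary.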

This seems to assume a single call-site, in which case I’m not sure why we wouldn’t just inline?

There are many cases in which inlining generally isn’t profitable or desired, e.g. during partial inlining, when avoiding inlining cold code into a hot function, etc. One aspect that was interesting to me is if this can allow for further optimization intraprocedurally, without completely committing to inlining everything.

I’m just not sure the cases where you both:

  • know there is a single unique call-site
  • you don’t want to inline
  • you have optimization to benefit from inlining

Are really common enough.
In particular, I strongly suspect that the cases that could justify “you don’t want to inline” should be handled by better outlining or some code layout that would move cold basic blocks out of the way.

You mean just leave it as an std.inlined_call? Doing the actual inlining could break the parent op’s invariants (for loop.for/if, affine.for/if), so the actual inlining can only happen when such parent ops themselves get lowered away, which might be much, much later. As a result, returning the optimized std.inlined_call to an outlined form lets you have multiple smaller functions (in spite of the single call-site) if that is determined to be better, e.g. to compile multiple function ops in parallel, in addition to the reasons River lists downthread. Note that the extreme scenario of using std.inlined_call all over the place gives you bigger and fewer functions in the module (even with a single call site). AFAICS, a utility to outline an std.inlined_call appears useful nevertheless at some point down the road - but yes, in most cases, we’d just leave it as std.inlined_call.

I’m saying that “temporarily inlining and outlining back” looks like a strategy suitable only if the function has a single call-site, otherwise it is like cloning the function for each call site (the optimization will inject call-site context).

Okay, that’s right. But you also mentioned “in which case I’m not sure why we wouldn’t just inline?” - I think River and I are saying why we may want to outline it back in some cases (even if single call site). Anyway, this is all tangential on one specific use case of this op, which isn’t among the motivational/priority ones IMO.

Is there a preference as far as the naming goes? Possibilities:

inlined_call
exec_region
call_region

The latter two don’t have the implication that one arrived there via inlining. I tend to prefer exec_region.

exec_region seems best to me amongst these alternatives :slight_smile:

Adding to the bikeshed:

Maybe it is all the time I spent working on linkers… but I parse “exec_region” as a declarative statement “this is an executable region” rather than the imperative statement “execute this region”. May I propose “run_region”/“execute_region” to resolve that ambiguity? Or maybe just “run”/“execute”/“invoke”?

exec_region appeared imperative to me in line with the well-known execlp, execv family of functions. But I’m fine with execute_region as well. The “region” suffix is important here I think.

SGTM. This is super useful!

Btw, is the plan for the usual inliner to inline this op, or a dedicated pass?

Good question. I think we can start with a dedicated pass/utility - because if this is made part of the inlining, someone just wanting to lower away the execute_region will get other inlining that they didn’t want. Unlike regular inlining, inlining the execute_region is necessary to be able to lower to LLVM and other similar IRs that don’t have a concept of nested control.

Another option is to rename the currently misnamed ‘loop’ dialect, have this op live there, and implement its inlining when converting to std dialect (via a conversion pattern rewriter). So the inlining here would happen when we go from the ‘loop’ dialect to the std dialect. I think the ‘loop’ dialect can be renamed ‘region’, and this op can become region.execute. loop.for → region.for, loop.if → region.if make sense because all of the ops there have regions and they just execute regions in different ways. loop → ‘scf’ (structured control flow) was earlier thought of as a renaming but looks like ‘region’ is the more appropriate term for that structured control flow.