LLVM Discussion Forums

[RFC] Add scf.while to scf dialect

We have scf.for and scf.if and a few other structured control flow operations, but no scf.while yet.

Has this been considered in the past? I can see some design complications of receiving the value for the condition from the “outside” as well as from within the while-body, but overall this seems a useful construct to me.

Is there general interest?

2 Likes

Yes, this would be good to have. I am inclined to propose a design along the lines of the “functionalized” while form used by hlo.while with the following major deviations:

  1. Get rid of tuples in favor of list of values (for scf.while operands, results, body entry args, condition entry args, and body yield values).
  2. Operands/args/results for body/cond/op would no longer be restricted to tensor types obviously, but could be arbitrary types.
  3. Use the AffineScope trait on scf.while - that defines a new top level scope for symbols, and this is practically useful - allows more things to be modeled as sequences of affine loop nests inside.
  4. Rest all stays the same as hlo.while, i.e., the op isn’t marked IsolatedFromAbove - this allows all SSA values to be accessed via dominance both in the body and condition regions.

The dataflow and execution semantics of the hlo.while can be found in the mlir-hlo repo (mhlo dialect) or on the XLA op semantics page of the TensorFlow website (under WhileOp). This design works well for scalar opts as well as other things (since values can be returned/passed across and out).

Yes absolutely.

Allow me some historical digression. scf.for originated in Linalg and circa July 2019 came the time to refactor because the concept was more generally reusable (e.g. for mapping to GPU, having a common abstraction on the way to unstructured std basic blocks + branches, representing non-affine control-flow and for future control-flow / loop types). We went through the traditional discussion of “should we just have a std.for and std.if” or a new dialect.

A bunch of digressions and renamings later, scf is now well-established, it is indeed time to start looking at more general (but still structured) control-flow.

@aartbik I suspect you are interested in such generalizations in the context of the ongoing sparse work.
One key question for me is: what transformations will be interesting in this context (esp. for vectorization + LICM) and how does the representation help or hinder the transformation?

I’ll just paste content from the original proposal that didn’t make it publicly back then. It is largely unchanged except for replacing the original cf dialect name by scf. All this of course begs to be modernized and tested against real use cases.

Loops

The values for the lower bound, upper bound and step are precomputed when entering the scf.for operation. Alternatively we considered attaching regions to scf.for that would compute the upper bound and the step at each iteration. We do not have a concrete transformation that would warrant the extra complexity at this time. This may evolve in the future.

A generalization of the scf.for loop may resemble

scf.for %i : f32 = %init        // an induction variable of any type is declared
condition {                     // it is available in the condition block
  %0 = call @some_func(%i) : (f32) -> i1       // arbitrary computation of the condition
  cf.cond %0 : i1               // cf.cond tells cf.for about the value of the condition
}
update {                        // the induction variable is available in this block
  %cst = constant 0.05 : f32    // arbitrary computation of the new value of the
  %1 = addf %i, %cst            //       induction variable
  cf.next %1 : f32              // cf.next tells cf.for about the new value
}
body {                          // the induction variable is available in the body
  "use"(%i) : (f32) -> ()       // but cannot be changed (SSA)
}

Note that such a generalization would also have the benefit of keeping perfectly nested loop ops structure when performing e.g. tiling. To simplify analyses, an attribute could be added to specify whether condition and increment (1) can be computed only once on entry or (2) need to be updated iteratively.

A more general scf.while loop may be created along the following lines

// A set of values are declared in the "while" construct.  
// They are assigned some initial value and are available in the regions below (entry block arguments).
scf.while(%v1 = %init1 : index, %v2 = %init2 : memref<?xf32>) {
  // The condition block has arbitrary computation and terminates with "scf.cond" that
  // consumes a value of i1 type telling whether to continue the loop.
  %0 = "condition"(%v1, %v2) : (index, memref<?xf32>) -> i1
  scf.cond %0 : i1
} do {
  // The body block has arbitrary computation and terminates with "scf.next", the
  // operands of which will be used as values of the loop-defined values in the next
  // iteration.
  store %v2[%v1] : memref<?xf32>
  %cst = constant 2 : index
  %1 = addi %v1, %cst : index
  scf.next %1, %v2 : index, memref<?xf32>
}

Thanks for bringing up this topic!

1 Like

This is more or less the same as our original idea of extending SCF, as Nicolas posted. A specific design question is the interaction between the condition and the body: can the condition region have access to body arguments/results?

Should we also have it on scf.for then?

Absolutely! A single compressed storage format can still be expressed with a proper for loop. But coiterating over intersection or unions of compressed storage formats really needs a while construct.

These things are already spelled out in the hlo.while design. The values returned by the body’s terminator are received by the condition region to evaluate the predicate for the next iteration - there is a 1:1 correspondence among those values. There is no other deviation on the semantics/structure from the hlo.while other than those I listed.

Where is this design documented?

Can the condition pass values to the body as well? Otherwise, there is no way for the body to reuse values computed in the condition (this is a problem with tf.While; not sure if it affects hlo.while). It is natural for such values to be reused, because the condition dominates the body.

Chris wrote a doc during his time at google that concluded that the way to fix tf.While is to reformulate it from this:

    // LIMITING, makes many things awkward
    cond region: {T1,T2,T3…} -> bool                         // aka “T -> bool”
    body region: {T1,T2,T3…} -> {T1,T2,T3…}           // aka “T -> T”

to this:

    // Much more consistent
    cond region: {T1,T2,T3…} -> { bool, U1, U2, U3, ...}       // aka “T -> bool, U”
    body region: {U1, U2, U3, ...} -> {T1,T2,T3…}                  // aka “U -> T”

This makes me think that a do-while could be more regular:

    body region: {T} -> {bool, T}

Any while can be turned into a do-while right? Would this structure be more regular? What do we lose by mixing the condition computation and the body in the same region?
Or asking differently: what kind of properties and constraints are put on the the condition region to justify its separate existence?

do-while can’t iterate zero times. So when coming from “while-like” (but not do-while-like) abstractions, in general one will be duplicating the condition into an enclosing scf.if to catch the case of the the zeroth iteration (much as traditional loop rotation does; in the case of tiny predicates like i<n, this isn’t a big deal, but in general one needs a cost model for how much duplication is acceptable)

Since the condition dominates the body, you could just access via dominance. Do you need to explicitly pass? If dominance works works, it’s a liability to also allow explicit passing - it’s a side channel dataflow.

Furthermore, if RegionBranchOpInterface things are implemented for this, things like sparse conditional CP, bufferization, and copy removal would just work out of the box.

I would really like to see these reformulated here in a generic way that does not refer to HLO or TF. We will have to add that in the op documentation anyway.

Do we need to explicitly forward all arguments through the condition, or could we rather do something like the following?

cond region: {T1, T2, T3} -> {bool, U1, U2}
body region: {T1, T2, T3, U1, U2} -> {T1, T2, T3}

Is your idea to have two different blocks in the same region?

No. Like hlo.while, there would be a conditional region and a body region.

MLIR doesn’t allow two regions on the same op to refer to each other’s SSA values, so they would currently have to be explicitly passed. @River707 for if that could be changed.

I think that would solve the problem as well. It has the disadvantage that it forces the shared T, which artificially extends the live range of T.

See this rewrite. Notice that the “shared T” formulation forces the values T produced by the body to be live until the next invocation of the body. But their live range might end inside cond.

"shared T" formulation:
cond: T -> bool, U
body: T, U -> T
-->
Chris' original formulation:
let T' = T
let U' = T, U
// in "cond": replace `return bool U` with `return bool, T, U`, extending the live range of "T"
cond: T' -> bool, U'
body: U' -> T'

I think this was probably never needed / rarely needed. We could change it - if we don’t, then it’s just that they would have to be explicitly passed as yield operands and then received as block arguments. If the ‘then’ region of an if op can’t dominate the else region, I guess there could be an op/design where one of the op’s regions dominates another region. Maintaining forward dominance will keep things consistent and avoid explicit passing?

I don’t think XLA does very much cross-control flow optimizations, since they aren’t as relevant in ML compilers for large tensor operations. But here, we are pondering using scf.while to implement inner loops of sparse kernels and other unknown, more general use cases. Building a principled design that doesn’t prevent us from implementing certain classes of optimizations (such as eliminating redundancy between cond and body) will pay off I think.

I talked with one of the XLA developers about this some time ago, and he recognized that the current XLA design isn’t perfect. The current XLA design has the redeeming advantage that it makes their buffer allocation simpler, since they just must-alias the body args and the body return values (which itself isn’t necessarily optimal, but is simpler to implement).

Well, it’s the use of tuples that I find particularly inconvenient, but otherwise, the cross iteration flow of values requires 1:1 mapping between yield operands (body return values) and the body arguments for simplicity.

Lots of interesting discussion, thank you for that. Pragmatically, I would prefer a while-loop over a do-while loop exactly for the zero trip reason. If we need to guard a do-while with an explicit if-operator, code generation starts to resemble just coding this out with if-operators and “jumps”.

Even more pragmatically, is anybody already planning to add this to the scf dialect anytime soon, or is that something I should start myself? And if the latter, can we agree on one design, just to avoid repeating this discussion during the code review? :wink:

Yes, that would actually be good - allowing for better mixing without the need for execute_regions. It also makes things consistent because scf.while could be converted to scf.for when possible for better optimization, and one should be able to represent the contained affine nests without inserting an affine.execute_region.

I ran into a need for a while loop in a side project… Any further thoughts on this?

I am considering the following form:

%res:2 = scf.while (%arg1 = %init1, %arg2 = %init2) : (i32, f32) -> (i64, f64) {
  /* the 'before' region takes operands (i32, f32 here) as arguments */
  /* or whatever was `yield`ed by the 'after region */
  %0 = zexti %arg1 : i32 to i64
  %1 = fpext %arg2 : f32 to f64
  %2 = "some.boolean.op"() : () -> i1
  /* it yields a boolean condition (true to proceed to the body) */
  /* and whatever arguments that are forwarded to the body */
  scf.yield %2, %0, %1 : i1, i64, f64
} do {
^bb0(%arg1: i64, %arg2: f64):
  /* the 'after' region takes as arguments whatever is `yield`ed by the */
  /* 'before' region and yields the new values of 'before' region arguments */
  scf.yield %0, %1 : i32, f32
}

The result of the entire op is yielded by the ‘before’ region. If we need to yield from the ‘after’ region, we can forward the values to the ‘before’ region (it will have to be executed anyway to check the exit condition) and yield from there. This sounds flexible enough to represent both a “while” and a “do-while” loop, the ‘after’ region can be a simple forwarder.

That being said, it looks like a very thin layer of structure on top of a CFG and that may be too loose for any useful transformation. In particular, I am pondering if we want to have an explicit “condition” region and require that is has no side effects. Also, LICM would only work on one of the regions in its current form.

2 Likes