Hi all,

We are currently upstreaming our dynamic shape compiler (a.k.a. DISC). I’d like to discuss support for dynamic shape control flow in advance and would appreciate any feedback.

1, Control flow overview

- Structured control flow (e.g. mhlo/lmhlo WhileOp, IfOp). Uses regions to represent conditional/while semantics.
- CFG. The traditional form (based on branch ops). More general, but harder to optimize.

2, Dynamic shape control flow

The shapes of values inside a while/if body are unknown at compile time and may vary between iterations.

An example:

```
func @dynamic_shape_while(%arg1: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %1 = "mhlo.while"(%arg1) ( {
  // condition region: must return a scalar predicate
  ^bb0(%targ1: tensor<?x?xf32>):
    %true = mhlo.constant dense<1> : tensor<i1>
    %0 = "mhlo.multiply"(%targ1, %targ1) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %1 = "mhlo.subtract"(%targ1, %0) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %2 = "mhlo.compare"(%targ1, %1) {comparison_direction = "LT"} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xi1>
    %3 = "mhlo.reduce"(%2, %true) ( {
    ^bb0(%lhs: tensor<i1>, %rhs: tensor<i1>):
      %4 = "mhlo.and"(%lhs, %rhs) : (tensor<i1>, tensor<i1>) -> tensor<i1>
      "mhlo.return"(%4) : (tensor<i1>) -> ()
    }) {dimensions = dense<[0,1]> : tensor<2xi64>} : (tensor<?x?xi1>, tensor<i1>) -> tensor<i1>
    "mhlo.return"(%3) : (tensor<i1>) -> ()
  }, {
  // body region: the concatenation doubles dimension 0 on every iteration
  ^bb0(%targ1: tensor<?x?xf32>):
    %0 = "mhlo.multiply"(%targ1, %targ1) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    %1 = "mhlo.concatenate"(%0, %0) { dimension = 0 : i64 } : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
    "mhlo.return"(%1) : (tensor<?x?xf32>) -> ()
  }) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
  return %1 : tensor<?x?xf32>
}
```
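
To make the shape variation in the example above concrete, here is a small shape-only trace in Python. It is only a sketch: `body_shape` and `trace_shapes` are hypothetical helper names, and the fixed `trip_count` is an assumption for illustration, since the real loop condition is data dependent.

```python
from typing import List, Tuple

Shape = Tuple[int, int]


def body_shape(shape: Shape) -> Shape:
    """Shape transfer function of the loop body in the example above.

    multiply is elementwise (shape preserved); concatenating the value with
    itself along dimension 0 doubles the row count.
    """
    rows, cols = shape
    return (2 * rows, cols)


def trace_shapes(init: Shape, trip_count: int) -> List[Shape]:
    """Shapes of the loop-carried value, assuming a fixed trip count."""
    shapes = [init]
    for _ in range(trip_count):
        shapes.append(body_shape(shapes[-1]))
    return shapes


print(trace_shapes((3, 4), 3))  # [(3, 4), (6, 4), (12, 4), (24, 4)]
```

No single static shape describes the loop-carried value, so any buffer for it has to be reallocated inside the loop.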

3, Bufferization for dynamic shape control flow

3.1 Solution #1: mhlo while → CFG at the tensor level → bufferization.

To the best of my knowledge, the bufferization (and deallocation) passes do not support dynamic shapes very well. For example, the deallocation pass cannot support dynamic CFG loops according to this and this. In fact, I think it’s very hard to implement a general deallocation pass for dynamic shape control flow in CFG form. Please correct me if I’m wrong.

3.2 Solution #2: mhlo while → lmhlo while (bufferize in structured form) → CFG at the buffer level.

This is our current implementation. However, we have to extend the lmhlo.while/if ops since they do not support dynamic shapes well. Take the lmhlo.while op as an example.

```
// original definition
def LHLO_WhileOp: LHLO_Op<"while", [
    DeclareOpInterfaceMethods<RegionBranchOpInterface>,
    DeclareOpInterfaceMethods<LoopLikeOpInterface>]> {
  let summary = "While operator";
  let description = [{
    Returns the result of executing a body function until the cond body returns
    true.
    See https://www.tensorflow.org/xla/operation_semantics#while.
  }];
  let arguments = (ins
    Arg<Variadic<LHLO_PredBuffer>, "", [MemWrite]>:$cond_val,
    OptionalAttr<I64Attr>:$trip_count);
  let regions = (region SizedRegion<1>:$cond, SizedRegion<1>:$body);
}
// Dynamic shape version of LHLO_WhileOp (renamed so the two defs do not clash)
def LHLO_DynamicWhileOp: LHLO_Op<"dynamic_while", [
    DeclareOpInterfaceMethods<RegionBranchOpInterface>,
    DeclareOpInterfaceMethods<LoopLikeOpInterface>]> {
  let summary = "While operator";
  let description = [{
    Returns the result of executing a body function until the cond body returns
    true.
    See https://www.tensorflow.org/xla/operation_semantics#while.
  }];
  let arguments = (ins
    Arg<Variadic<LHLO_Buffer>, "", [MemRead]>:$args,
    OptionalAttr<I64Attr>:$trip_count);
  let results = (outs Variadic<LHLO_Buffer>:$results);
  let regions = (region SizedRegion<1>:$cond, SizedRegion<1>:$body);
}
```

The main differences are:

- The dynamic_while op does not accept or return tuple types.
- The dynamic_while op accepts a list of buffers and **returns a list of buffers**.
- Ownership of the buffers returned from the dynamic_while op is transferred to the caller, and it’s the responsibility of the user to deallocate these buffers correctly.
- The while body needs to deallocate its operands.

Some examples:

```
func @test(%arg0: memref<?x?xf32>) -> memref<?x?xf32> {
  // ownership of %0 and %1 is transferred to the caller
  %0, %1 = "lmhlo.dynamic_while"(%arg0, %arg0) ( {
    // condition region
  }, {
  ^bb0(%targ0: memref<?x?xf32>, %targ1: memref<?x?xf32>):
    // loop body
    // allocate new buffers (size may change)
    // use %targ0 and %targ1 to fill the buffers
    // deallocate %targ0 and %targ1
    // return new buffers
  }) : (memref<?x?xf32>, memref<?x?xf32>) -> (memref<?x?xf32>, memref<?x?xf32>)
  memref.dealloc %0 : memref<?x?xf32>
  return %1 : memref<?x?xf32>
}

// lower dynamic_while to CFG
^init(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>):
  %0 = memref.alloc(...) // alloc_like(%arg0)
  %1 = memref.alloc(...) // alloc_like(%arg1)
  "lmhlo.copy"(%arg0, %0) : (memref<?x?xf32>, memref<?x?xf32>) -> ()
  "lmhlo.copy"(%arg1, %1) : (memref<?x?xf32>, memref<?x?xf32>) -> ()
  br ^cond(%0, %1)
^cond(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>):
  %pred = ...
  cond_br %pred, ^body(%arg0, %arg1), ^exit(%arg0, %arg1)
^body(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>):
  // allocate new buffers: %targ0, %targ1 (size may change)
  // use %arg0 and %arg1 to fill the buffers
  // deallocate %arg0 and %arg1
  br ^cond(%targ0, %targ1)
^exit(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>):
  // return %arg0 and %arg1
```
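
The ownership rules of the lowering above can be checked with a small Python simulation. This is a sketch under stated assumptions, not our implementation: `Heap` is a hypothetical tracking allocator standing in for memref.alloc/memref.dealloc, buffers are modeled as `(id, size)` pairs, and the loop condition is replaced by a fixed trip count.

```python
from typing import Callable, List, Set, Tuple

Buffer = Tuple[int, int]  # (unique id, size); hypothetical stand-in for a memref


class Heap:
    """Hypothetical tracking allocator standing in for memref.alloc/dealloc."""

    def __init__(self) -> None:
        self.live: Set[Buffer] = set()
        self.next_id = 0
        self.copies = 0

    def alloc(self, size: int) -> Buffer:
        buf = (self.next_id, size)
        self.next_id += 1
        self.live.add(buf)
        return buf

    def dealloc(self, buf: Buffer) -> None:
        self.live.remove(buf)  # raises KeyError on a double free

    def copy_of(self, src: Buffer) -> Buffer:
        self.copies += 1
        return self.alloc(src[1])


def dynamic_while(heap: Heap, args: List[Buffer],
                  cond: Callable[[List[Buffer]], bool],
                  body: Callable[[Heap, List[Buffer]], List[Buffer]]) -> List[Buffer]:
    # ^init: copy the inputs so that the loop owns every buffer it may free
    carried = [heap.copy_of(a) for a in args]
    # ^cond / ^body: the body returns fresh buffers and frees its operands
    while cond(carried):
        carried = body(heap, carried)
    # ^exit: ownership of the carried buffers is transferred to the caller
    return carried


def doubling_body(heap: Heap, operands: List[Buffer]) -> List[Buffer]:
    results = [heap.alloc(2 * size) for (_, size) in operands]  # size changes
    for buf in operands:
        heap.dealloc(buf)  # the body deallocates its operands
    return results


def run(trip_count: int) -> Tuple[int, int, int]:
    heap = Heap()
    arg = heap.alloc(4)  # input buffer owned by the caller
    remaining = [trip_count]

    def cond(_: List[Buffer]) -> bool:
        remaining[0] -= 1
        return remaining[0] >= 0

    out0, out1 = dynamic_while(heap, [arg, arg], cond, doubling_body)
    result_size = out1[1]
    heap.dealloc(out0)  # the caller frees what it received
    heap.dealloc(out1)
    heap.dealloc(arg)
    return heap.copies, result_size, len(heap.live)


print(run(0))  # (2, 4, 0)
print(run(3))  # (2, 32, 0)
```

Each tuple is (copies made, size of the second result, buffers still live). Unlike `@test` above, the simulated caller frees both results so that a leak would show up in the last field. Both runs leave no live buffers, and both perform two copies in `^init`, including the run whose body never executes.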

Disadvantages:

- Redundant copies, especially if the loop body does not execute at all.
- Buffer management is different inside the loop region (e.g. operands need to be deallocated).

Another problem is how to compose this with the bufferize pass.

Any suggestions?

Thanks!