Redundant buffer synthesis in tf-opt


Consider this (trivial) tensorflow code:

func @myfun(%x: tensor<2xf32>) -> tensor<2xf32> {
  %c = "tf.Const"() {value = dense<[3.000000e+00, 4.000000e+00]> : tensor<2xf32>} : () -> tensor<2xf32>
  %0 = "tf.AddV2"(%c, %x) {device = ""} : (tensor<2xf32>, tensor<2xf32>) -> tensor<2xf32>
  return %0 : tensor<2xf32>
}

Converting it to linalg+std using tf-opt --tf-to-hlo-pipeline --hlo-legalize-to-linalg --mhlo-legalize-to-std produces the following code:

#map = affine_map<(d0) -> (d0)>
module  {
  func @myfun(%arg0: tensor<2xf32>) -> tensor<2xf32> {
    %cst = constant dense<[3.000000e+00, 4.000000e+00]> : tensor<2xf32>
    %0 = linalg.init_tensor [2] : tensor<2xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %cst : tensor<2xf32>, tensor<2xf32>) outs(%0 : tensor<2xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):  // no predecessors
      %2 = addf %arg1, %arg2 : f32
      linalg.yield %2 : f32
    } -> tensor<2xf32>
    return %1 : tensor<2xf32>
  }
}

This code already has its quirks (the %0 tensor being initialized by linalg.init_tensor and passed as the outs operand of linalg.generic), but my real problem is that I cannot lower this to nice low-level code. My best result is the following:

  global_memref "private" constant @__constant_2xf32 : memref<2xf32> = dense<[3.000000e+00, 4.000000e+00]>
  func @myfun(%arg0: memref<2xf32>, %arg1: memref<2xf32>) {
    %0 = alloc() : memref<1xf32>
    %1 = get_global_memref @__constant_2xf32 : memref<2xf32>
    %c0 = constant 0 : index
    %c2 = constant 2 : index
    %c1 = constant 1 : index
    scf.for %arg2 = %c0 to %c2 step %c1 {
      %2 = load %arg0[%arg2] : memref<2xf32>
      %3 = load %1[%arg2] : memref<2xf32>
      %4 = addf %2, %3 : f32
      store %4, %0[%c0] : memref<1xf32>
      %5 = load %0[%c0] : memref<1xf32>
      store %5, %arg1[%arg2] : memref<2xf32>
    }
    dealloc %0 : memref<1xf32>
    return
  }

Note that a buffer is created, then deallocated, and data is copied around needlessly.

My question: Is there a pipeline of transformations that does not allocate memory and copy data around?


Can you share some details on how you got to the code you show? Which passes did you use?

Certainly. I assume your question concerns the second transformation step (the first I already provided in my post). I have used the following command:

mlir-opt \
    --linalg-bufferize --convert-linalg-to-affine-loops \
    --func-bufferize --buffer-results-to-out-params \
    --tensor-constant-bufferize --finalizing-bufferize \
    --convert-linalg-to-affine-loops \
    --affine-loop-fusion \
    --lower-affine \
    --cse \
    --copy-removal

You can substitute mlir-opt with tf-opt, the result is (as expected) the same.

Looking at this, I think the issue is that you are running --copy-removal before --buffer-deallocation. To be able to remove copies, we rely on dealloc operations to already be inserted. I tried with

--linalg-bufferize --convert-linalg-to-affine-loops --func-bufferize --buffer-results-to-out-params --tensor-constant-bufferize --finalizing-bufferize --buffer-deallocation --copy-removal

and that gave me

module  {
  global_memref "private" constant @__constant_2xf32 : memref<2xf32> = dense<[3.000000e+00, 4.000000e+00]>
  func @myfun(%arg0: memref<2xf32>, %arg1: memref<2xf32>) {
    %0 = get_global_memref @__constant_2xf32 : memref<2xf32>
    affine.for %arg2 = 0 to 2 {
      %1 = affine.load %arg0[%arg2] : memref<2xf32>
      %2 = affine.load %0[%arg2] : memref<2xf32>
      %3 = addf %1, %2 : f32
      affine.store %3, %arg1[%arg2] : memref<2xf32>
    }
    return
  }
}

There is also no real need to use affine here; just using plain loops gives the same result.

--linalg-bufferize --convert-linalg-to-loops --func-bufferize --buffer-results-to-out-params --tensor-constant-bufferize --finalizing-bufferize --buffer-deallocation --copy-removal --convert-scf-to-std --convert-std-to-llvm

gives you a blurb of LLVM IR.


Thanks, I’ll try this tomorrow!


An extra remark here: Making copy removal depend on buffer deallocation means that you won’t be able to remove copies in code with static memory allocation.
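To make the remark concrete, here is a hypothetical sketch (function and buffer names invented for illustration): when the destination is not deallocated in the same function, there is no dealloc for the pass to anchor on, so the copy through the temporary survives.

// Hypothetical example; @step and the names are made up.
func @step(%in: memref<2xf32>, %out: memref<2xf32>) {
  %tmp = alloc() : memref<2xf32>
  // ... compute into %tmp ...
  linalg.copy(%tmp, %out) : memref<2xf32>, memref<2xf32>
  // No dealloc of %tmp here (it is freed elsewhere, or the buffers
  // are static), so --copy-removal does not fold the copy into %out.
  return
}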



Thanks @herhut for your suggestion.

And yes I agree it is not optimal at all. There are multiple directions here. Implementing another copy-removal pass at a higher level of abstraction seems like the way to go. I’m not saying it should replace the current pass though.


The current copy removal pass was created to remove the copies that are introduced by the deallocation pass. And the transformation of results to extra arguments also relies on this.

For static allocation, you would still need a copy for the latter, while the former would not introduce any copies in the first place (as it does not try to free static allocations).
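For reference, the pattern the current pass was built around looks roughly like this (a sketch, not the output of any particular pipeline): an alloc/copy/dealloc triple, which copy removal folds by writing directly into the target buffer.

// Sketch of the pattern copy removal targets; names are illustrative.
func @f(%arg0: memref<2xf32>, %out: memref<2xf32>) {
  %0 = alloc() : memref<2xf32>
  // ... compute into %0 ...
  linalg.copy(%0, %out) : memref<2xf32>, memref<2xf32>
  dealloc %0 : memref<2xf32>  // this dealloc is what lets the pass
  return                      // drop the copy and reuse %out directly
}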

But I agree that for the general problem, the current copy removal is not good enough.
