Properly using bufferization-related passes

Hi everyone,

I am currently working on a prototype to make linalg on tensors executable end-to-end in the context of tensor-based transformations. In particular, the stickiest point I am facing is avoiding copies and allocs across function boundaries more aggressively. I am having difficulties properly connecting the many bufferization-related passes. I think I may also be hitting more fundamental bufferization design issues, and I am unclear which of the following categories each issue falls into:

  • known and being worked on,
  • known but without a clear plan for resolution,
  • as yet unknown.

Setup

func @init_and_matmul(%arg0: tensor<128x128xf32>, %arg1: tensor<128x128xf32>, %arg2: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %cst = constant 0.000000e+00 : f32
  %0 = linalg.fill(%arg2, %cst) : tensor<128x128xf32>, f32 -> tensor<128x128xf32> 
  %1 = linalg.matmul ins(%arg0, %arg1 : tensor<128x128xf32>, tensor<128x128xf32>) outs(%0 : tensor<128x128xf32>) -> tensor<128x128xf32>
  return %1 : tensor<128x128xf32>
}

func @main() {
  // Some IR to create %1, %2 and %3
  %5 = call @init_and_matmul(%1, %2, %3) : (tensor<128x128xf32>, tensor<128x128xf32>, tensor<128x128xf32>) -> tensor<128x128xf32>
  // print stuff from %5.
  return
}

In the following, I’ll call my prototype pass -mystuff.

Dealloc + copy removal

I start with mlir-opt -func-bufferize -buffer-results-to-out-params -mystuff and get the following IR:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  %1 = alloc() : memref<128x128xf32>
  linalg.fill(%1, %cst) : memref<128x128xf32>, f32 
  linalg.copy(%1, %0) : memref<128x128xf32>, memref<128x128xf32> 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
  linalg.copy(%0, %arg3) : memref<128x128xf32>, memref<128x128xf32> 
  return
}

From this, running mlir-opt -buffer-deallocation -copy-removal produces:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  linalg.fill(%0, %cst) : memref<128x128xf32>, f32 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%arg3 : memref<128x128xf32>)
  return
}

The fill is now dead, the computation is incorrect, and %0 leaks: copy removal redirected the matmul output to %arg3 but left the fill writing into the now-unused %0, so %arg3 is never zero-initialized.

For completeness, I’ll add that copy, fill and named ops have their effects properly set.

Am I missing something?

Return memref semantics

To try and circumvent the above, I now remove -buffer-results-to-out-params and run -func-bufferize -mystuff. I get:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>) -> memref<128x128xf32> {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  %1 = alloc() : memref<128x128xf32>
  linalg.fill(%1, %cst) : memref<128x128xf32>, f32 
  linalg.copy(%1, %0) : memref<128x128xf32>, memref<128x128xf32> 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
  return %0 : memref<128x128xf32>
}

This looks reasonable at first sight. After running mlir-opt -buffer-deallocation -copy-removal, I see:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>) -> memref<128x128xf32> {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  linalg.fill(%0, %cst) : memref<128x128xf32>, f32 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
  return %0 : memref<128x128xf32>
}

Great, the extra alloc + copy is optimized away.

Unfortunately on the caller side I get:

func @main() {
  %cst = constant 0.000000e+00 : f32
  %cst_0 = constant 1.000000e+00 : f32
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = alloc() : memref<128x128xf32>
  %1 = alloc() : memref<128x128xf32>
  %2 = alloc() : memref<128x128xf32>
  linalg.fill(%2, %cst_0) : memref<128x128xf32>, f32 
  linalg.fill(%1, %cst_0) : memref<128x128xf32>, f32 
  %4 = call @init_and_matmul(%2, %1, %0) : (memref<128x128xf32>, memref<128x128xf32>, memref<128x128xf32>) -> memref<128x128xf32>
  dealloc %2 : memref<128x128xf32>
  dealloc %1 : memref<128x128xf32>
  dealloc %0 : memref<128x128xf32>
  // print stuff from %4.
  return
}

i.e. the program executes properly but %4 leaks.
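For this particular program, plugging the leak is just one dealloc, but note it is only correct because this callee happens to return a fresh allocation; no current pass can derive that. A sketch of what main would need:

```mlir
%4 = call @init_and_matmul(%2, %1, %0)
    : (memref<128x128xf32>, memref<128x128xf32>, memref<128x128xf32>)
      -> memref<128x128xf32>
// print stuff from %4.
dealloc %4 : memref<128x128xf32>  // only valid if %4 is a fresh alloc,
                                  // invalid if it aliases an argument
```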

More generally, I am quite unclear what the contract is when crossing function boundaries: the return of init_and_matmul could be either a new alloc that needs to be deallocated (this case) or an alias that must not be deallocated.

This makes me wonder whether -func-bufferize in the absence of -buffer-results-to-out-params should even be allowed. While we could add a ref-counted type in the future to better model many of the lifetime-related issues, it seems to me the current setup always creates leaks?

Phase ordering

It is unclear to me why small “composable” chunks of bufferization are deemed beneficial. In my experience, there is a big difference between:

  • small composable IR building blocks, patterns and transformations that compose nicely in any order
  • fine-grained passes that perform part of the job on a subset of IR operations

The latter creates phase-ordering issues and tight implicit coupling between passes that make the system very hard to use. For instance, copy removal depends on -buffer-deallocation and cannot work with static allocations. In the bigger picture, phase ordering is the one big dragon to slay.

Am I missing something?

What I would like

So what I’d really like is a mix of:

  • guaranteeing -buffer-results-to-out-params is always called (i.e. never returning memref, for correctness re alloc/alias semantics), with
  • support for folding in + out buffers into in-out buffers, plus a function attribute that tells me a buffer is safe to write in place from the perspective of the caller (i.e. when func bufferization occurred, there are no other reads / unknown uses, outside of the function, of the tensor that folded into an in-out buffer).

Internal function bufferization is responsible for doing its own inplace analysis and doing the right thing.
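For concreteness, such an attribute could look like this at the tensor level; the attribute name linalg.inplaceable is purely an assumption for illustration:

```mlir
// Hypothetical attribute (name assumed) telling the callee that %arg2
// is safe to overwrite from the caller's perspective.
func @init_and_matmul(%arg0: tensor<128x128xf32>,
                      %arg1: tensor<128x128xf32>,
                      %arg2: tensor<128x128xf32> {linalg.inplaceable = true})
    -> tensor<128x128xf32>
```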

If I locally simulate that %arg2 is safe to write to, running -func-bufferize -buffer-results-to-out-params -mystuff I get:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
  %cst = constant 0.000000e+00 : f32
  linalg.fill(%arg2, %cst) : memref<128x128xf32>, f32 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%arg2 : memref<128x128xf32>)
  linalg.copy(%arg2, %arg3) : memref<128x128xf32>, memref<128x128xf32> 
  return
}

This is as good as I can do for now: the local analysis determines that I can do whatever I want with %arg2, but I cannot get rid of linalg.copy(%arg2, %arg3) until I go modify -func-bufferize and/or -buffer-results-to-out-params.

If I simulate the attribute not being present I get:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  linalg.copy(%arg2, %0) : memref<128x128xf32>, memref<128x128xf32> 
  linalg.fill(%0, %cst) : memref<128x128xf32>, f32 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
  linalg.copy(%0, %arg3) : memref<128x128xf32>, memref<128x128xf32> 
  return
}

This is also what I expected.

All this becomes even more interesting in the presence of transformations on linalg on tensors, but that is another level of discussion (i.e. it’s all internal function bufferization).

Getting everything the way I want would require changes to -func-bufferize and/or -buffer-results-to-out-params. Given that I am considering a much simpler version of the problem (i.e. bailing on branch ops), it does not seem reasonable to do any of this in core atm. OTOH, I am interested in high-performance codegen, and I’d claim branches should be aggressively multi-versioned/hyperblock-scheduled away.

So I am wondering whether core bufferization prematurely attacked the more difficult problem and now makes it harder to make progress on the simpler problem?

The bigger question

Admittedly, what I have in my prototype is not rosy: it uses matchers for specific patterns to avoid inserting copies in the first place. However, I would claim that the problem reduces to one of the following:

  • turning destructive updates into inplace buffer writes and detecting special cases while relying on SSA use-def chains.
  • introducing copies everywhere, dropping SSA values, and throwing use-def chains out the window to instead rely on memory-based dependence analysis. As we consider cross-function-boundary issues, this also calls for aliasing information that we don’t have in MLIR.

None of these is particularly appealing to me, but I have found the first to be relatively surprise-free.
There is also the possibility of refcounting and adding alloc_if + copy_if, but this essentially gives up on static guarantees. I would claim that statically knowing whether some fine-grained function performs an alloc and copies is important enough to warrant aggressive multi-versioning/hyperblock-scheduling.
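To illustrate what giving up static guarantees would look like, here is a sketch of such ops; the op names and semantics are hypothetical:

```mlir
// Hypothetical ops: allocation and copy guarded by a runtime flag.
// Whether an alloc/copy actually happens is no longer statically known,
// so every consumer must be prepared for both outcomes.
%buf = alloc_if %needs_fresh_buffer : memref<128x128xf32>
copy_if %needs_fresh_buffer, %src, %buf : memref<128x128xf32>, memref<128x128xf32>
```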

Do people have a significantly better alternative to the binary (ternary?) choice raised in this last paragraph?

Thanks for reading!


Thanks for raising this important issue, Nicolas!

I encountered similar in-place update issues in the sparse compiler (see my presentation from last week, near slide 45), which I resolved by adding a “fast-output” bufferization flag to the sparse lowering pass. Besides the allocation complications you mention, for sparse computations it is even more important to be able to update in place, since it can make the difference between O(n + nnz) and O(nnz) kernel complexity!

I hope we can drive this to a satisfactory solution.

As I dig deeper, I’ll mention two additional things:

std-bufferize phase ordering with canonicalization

Basically, std-bufferize can fail to apply if chains of tensor_to_memref + tensor_load remain.
However, they can easily be canonicalized away, and calling -canonicalize before -std-bufferize hides the issue.
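A minimal sketch of the kind of chain I mean (types assumed):

```mlir
// Left behind by partial bufferization: -canonicalize folds %t2 back
// to %t, after which -std-bufferize no longer trips over the chain.
%m  = tensor_to_memref %t : memref<128x128xf32>
%t2 = tensor_load %m : memref<128x128xf32>
```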

buffer-results-to-out-params inserts a copy post dealloc

In one of my local experiments I have the following IR:

func @init_and_matmul(%arg0: tensor<128x128xf32>, %arg1: tensor<128x128xf32>, %arg2: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  %1 = tensor_to_memref %arg1 : memref<128x128xf32, #map>
  %2 = tensor_to_memref %arg0 : memref<128x128xf32, #map>
  %3 = alloc() : memref<128x128xf32>
  linalg.fill(%3, %cst) : memref<128x128xf32>, f32 
  linalg.copy(%3, %0) : memref<128x128xf32>, memref<128x128xf32> 
  linalg.matmul ins(%2, %1 : memref<128x128xf32, #map>, memref<128x128xf32, #map>) outs(%0 : memref<128x128xf32>)
  %4 = tensor_load %0 : memref<128x128xf32>
  dealloc %3 : memref<128x128xf32>
  dealloc %0 : memref<128x128xf32>
  return %4 : tensor<128x128xf32>
}

Calling -buffer-results-to-out-params on that creates a copy post-dealloc:

func @init_and_matmul(%arg0: memref<128x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<128x128xf32>, %arg3: memref<128x128xf32>) {
  %cst = constant 0.000000e+00 : f32
  %0 = alloc() : memref<128x128xf32>
  %1 = alloc() : memref<128x128xf32>
  linalg.fill(%1, %cst) : memref<128x128xf32>, f32 
  linalg.copy(%1, %0) : memref<128x128xf32>, memref<128x128xf32> 
  linalg.matmul ins(%arg0, %arg1 : memref<128x128xf32>, memref<128x128xf32>) outs(%0 : memref<128x128xf32>)
  dealloc %1 : memref<128x128xf32>
  dealloc %0 : memref<128x128xf32>
  linalg.copy(%0, %arg3) : memref<128x128xf32>, memref<128x128xf32> 
  return
}

I am unclear whether this is a deeper phase-ordering design issue (i.e. buffer-results-to-out-params must happen before deallocs are inserted) or just a simple bug that should be fixed. I am the one introducing the deallocs in this example (scoped with the allocs), and I’d claim that such scoping is generally a strongly desired behavior.

Taking a step back:

Upstream MLIR tries to use memref as both a “buffer abstraction that you manipulate from host / random branchy code” and an “indexable mutable array that codegen uses for emission of highly structured program fragments”. These are very different use cases, though in the case of static shapes and certain limiting assumptions on parallelism they can be conflated.

Really this is all a big mess because of that. You also seem to be worrying about certain things (like making calling conventions efficient for callees that allocate and return buffers) that are really more about host/random branchy code, which memref is really the wrong abstraction for. We are inside a huge local minimum upstream for that use case, and the more you push on it, the more problems you’ll find. There is no easy way out – it’s going to be hacky as long as we are using memref (except in static-shape, no parallelism worlds, which are not interesting to me and not consistent with the direction that a lot of folks want MLIR to go).

So, to set expectations, you are poking a hornet’s nest here. Also, I don’t think that the local minimum that we are currently in upstream really leads anywhere super useful in the long term (or even medium term). We need to incorporate ideas like those of IREE and TFRT related to modeling of host code and device kernel code generation in a more explicit way, likely using linalg on tensors as the basis, and likely never even materializing memrefs (except possibly at some very, very late stage where it is just used as a “indexable mutable array” datatype that is always provided by the caller).

func-bufferize is designed to be run independently of -buffer-results-to-out-params. Considering that buffer-results-to-out-params is fundamentally limited to static shapes (how big of a buffer should the caller allocate? it is undecidable in general), depending on it in any way is highly suspect.
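The static-shape limitation is easy to see on a signature with a dynamic result; this sketch assumes the result size depends on the callee’s internal computation:

```mlir
// How big an out param should the caller allocate here? In general the
// result size is only known once the callee has run.
func @f(%t: tensor<?xf32>) -> tensor<?xf32>
```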

It’s worth separating bufferization, which is just the process of converting tensors to memrefs, from buffer optimizations. The bufferization passes play together nicely. What is missing is that the buffer optimizations (and stuff that is needed for them to work, such as defining calling conventions and such) don’t really work well today. Simple example:

def maybeCopy(x: memref):
  if cond():
    return view(x)
  else:
    return copy(x)

Today it is legal to write a function like that, but obviously the result cannot be freed via any analysis of the caller. The only solution is to have a pass that inserts copies.
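To make this concrete, here is a small runnable model (plain Python, not MLIR) where view() aliases its argument and copy() allocates; after a hypothetical copy-inserting pass, every return path yields a buffer the caller owns and can free unconditionally:

```python
# Model memrefs as Python lists; view() aliases, copy() allocates.
def cond(flag):
    return flag

def view(x):
    return x          # alias: same underlying buffer

def copy(x):
    return list(x)    # fresh allocation

# Before the pass: the result may alias the argument, so the caller
# cannot safely free it.
def maybe_copy(x, flag):
    return view(x) if cond(flag) else copy(x)

# After the pass: a copy on every path, so the result is always owned.
def maybe_copy_fixed(x, flag):
    return copy(view(x)) if cond(flag) else copy(x)

buf = [1.0, 2.0]
assert maybe_copy(buf, True) is buf            # aliases: unsafe to free
assert maybe_copy_fixed(buf, True) is not buf  # always freshly owned
```

The price is an extra copy on the aliasing path; the benefit is a uniform ownership convention that deallocation can rely on.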

With the foundations of calling conventions and such in place, the first thing would be to create a pass that inserts copies as needed so that functions like maybeCopy above insert copies along all paths. That at least gets the program to a guaranteed “actually possible to be memory safe” state, which we currently don’t have formalized. Because bufferization happens via local rewrites, which can introduce 0, 1, or many allocations depending on the nature of their bufferization, such a pass that has a whole-function understanding of the memref lifetimes is necessary and separate from bufferization.

Once we have that, then it is possible to write a pass that interprocedurally hoists allocations into callers to create out params (for suitable private functions). This is different than buffer-results-to-out-params which is a calling convention thing that affects public functions as well; I actually wrote that pass and it was basically just to appease a very specific use case, and is actually not even used today and could be deleted – it’s not a good design.

Yes, it must happen right after bufferization on the “no deallocs exist yet” form. Again, don’t depend on that pass, it’s fundamentally problematic except in the static shape case.

Not sure what you mean. std-bufferize expands single ops, so it always applies when the ops are present. Whether there is another tensor_to_memref or tensor_load in surrounding code doesn’t matter. Perhaps you want finalizing-bufferize? (See “General structure of the bufferization process” in the Bufferization docs: Bufferization - MLIR.)


This is essentially what the buffer deallocation pass does. It inserts copies such that the function itself is in a memory safe state. What is not modeled is the behavior of functions and function calls. It could easily be extended to also handle function arguments and results if we had a way to encode what the calling convention is.

The reason why this does not work well in the example that @nicolasvasilache shared is that the buffer deallocation pass reuses the linalg.copy operation for its modeling of buffer copies. We have argued for a while for a memref.copy operation that more closely models the behavior we want, but so far there has always been pushback against adding it. Maybe with the memref dialect split this will now happen.

Indeed. The structure is that local rewrites create a block of code that does 1 to n alloc operations, then performs whatever is needed to produce the results (in @nicolasvasilache’s example, a linalg.copy and a linalg.matmul are needed to produce the result), and then the buffer remains untouched and is only read.

The deallocation pass then inserts the required memref.copy operations (today these unfortunately are linalg.copy operations) to do lifetime management of the buffers. We have decided to insert more copies and not do any mini-optimizations in this pass for simplicity. Instead, we have a cleanup pass (copy removal). That pass only works if all copy operations are inserted by the deallocation pass. More precisely, it assumes that copied buffers are not mutated.

As I said, we already have that pass. Otherwise I agree that the buffer-results-to-out-params was an intermediate step and that a generalized solution would be preferable.

The sandbox has been landed here. It depends on a few things being flushed through core to properly enable the prototypical inplace updates.

One of those things is “last-write” type of canonicalization of scf.for.

This canonicalization exhibits incorrect folding behavior: tensor_load(tensor_to_memref(x)) -> x, which ignores aliasing.
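A hypothetical instance of what can go wrong, assuming a transformation writes into the buffer between the two ops (whether such writes should even be legal is discussed further down in this thread):

```mlir
%m = tensor_to_memref %t : memref<4xf32>
linalg.fill(%m, %cst) : memref<4xf32>, f32  // mutates the buffer
%t2 = tensor_load %m : memref<4xf32>
// Folding %t2 -> %t would silently drop the effect of the fill.
```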

While having transient pass-internal assumptions can be fine when the transient behavior does not leak to the outside world, this is not the case here. The sandbox prototype has its own assumptions but it is very careful to not let them leak out.

We could approximate no-aliasing by restricting the folding to occur only when the tensor_to_memref is immediately preceded by the tensor_load in the same block. This is a conservative step back towards correctness until better alias analysis becomes available.
However, this is likely to introduce many more copies in core bufferization. Since some of this now runs in production, the change should be measured carefully if we go this route, and it should be driven by the stakeholders of the current bufferization behavior.

If this is deemed too risky, I can just create a new tensor_load-like op in linalg without the unsafe folding behavior. Once my prototype is more battle-tested I can come back with more data.

Please let me know if there are strong preferences here.

Actually… maybe I spoke too fast: this passes the OSS tests and seems to also pass all our internal tests: D97957 “[mlir] Tighten the rules around folding TensorLoadOp”.

Can you show the test case? You should not be writing to the result of tensor_to_memref, as it might be constant memory.

@herhut proposed a semantics where it was illegal to write to the operand of tensor_load after it is used, which fixes this, but I don’t know if we made progress on this.