[RFC] Dialect for bufferization-related ops

As for the Comprehensive Bufferization (we’ve been talking about renaming it to “One Shot Bufferization” or some other name), the following things would go into the new dialect:

  • ComprehensiveBufferize transform and pass. This thing is doing the bufferization. It has dependencies on the tensor and the memref dialect.
  • BufferizableOpInterface: Ops that implement this interface can be bufferized.
  • “External model” implementations of the op interface. E.g., interface implementations for ops in the vector dialect, so that these ops can be bufferized. These should ideally live in their respective dialects (e.g., vector), but it may be too early for this. Or we may even want to have them in the bufferization dialect forever, so that other dialects stay small. There is a separate build target for each supported “external” dialect, so the only build targets that have dependencies on another dialect (apart from memref and tensor) are the ones that contain the external model.
  • An attribute and its verifier.

For those unaware of what Comprehensive Bufferize is: It’s a single-pass bufferization that does a whole-function analysis, whereby ops can be from different dialects. This whole-function analysis makes it easy to detect “inplace bufferization” opportunities (i.e., avoiding buffer copies).

E.g., one main use case that we have is bufferizing matching tensor.extract_slice / tensor.insert_slice pairs inplace during tiling of Linalg ops.

%0 = tensor.extract_slice %A [%offset] [%sz] [1]
%1 = "do_something"(%0)
%2 = "do_somthing_else"(%1)
%3 = tensor.insert_slice %2 into %A [%offset] [%sz] [1]

Comprehensive Bufferize can trace back the source of a tensor.insert_slice in the reverse SSA use-def chain, until it eventually ends up at a tensor.extract_slice, which extracts from the same buffer that the insert_slice is inserting into. If that’s the case, no buffer copy is needed.

This is not something that the existing multi-pass bufferization can do at the moment, because the tensor-bufferize pass has no knowledge about ops from other dialects that are on the reverse SSA use-def chain path. It is my understanding that the existing bufferization pass would insert a buffer copy. The “copy removal” pass could be extended, so that it would remove the copy again. However, it is easier to make the right decision in the first place; reasoning about SSA values is easier than reasoning about memref aliases.

On the other hand, there are also cases that the Comprehensive Bufferize pass cannot handle. E.g., returning newly allocated buffers from a function or block. So we need both and we have to make them composable. For more details, I wrote up a high-level description of the design of Comprehensive Bufferize.

comprehensive_bufferize.pdf (353.7 KB)

2 Likes

Got it. I don’t think that “lowerable to LLVM” is a particularly important criteria, but I can see the benefit of splitting these ops out for the canonicalization layering. In any case, splitting them out makes sense to me. I agree with River’s metapoint about “we should have a plan here” and not just push stuff around without consensus on what the end game looks like.

-Chris

1 Like

Thanks for the explanation Matthias! That makes perfect sense. Is the inability to return newly-allocated buffers a fundamental technical design decision, or something that just hasn’t been implemented yet?

@River707 @_sean_silva

Additionally, to what Matthias described above the following interfaces/ops/passes will be moved to Dialect/Bufferize.

  • AllocationOpInterface - interface that declares buildClone and buildDealloc methods, that are used in buffer deallocation pass. Currently, it resides in SideEffectInterfaces.td and will be moved to Dialect/Bufferize/AllocationOpInterface.td

  • memref::CloneOp, memref::TensorLoadOp and memref::BufferCastOp operations that are specific to the bufferization passes

  • BufferOptimizations.cpp, BufferDeallocation.cpp, BufferResultsToOutParams.cpp, Bufferize.cpp, BufferizeUtils.cpp will all be moved as well.

I think this makes sense. These are all fairly opinionated things that aren’t totally core to the specifics of memrefs (and certainly aren’t at a level of generality to make sense in lib/Transforms either) and are specific to a particular flavor of end-to-end flow.

1 Like

I have strong concerns about trying to unify the bufferization paths under one dialect. It does seem like just a mash up of things without a coherent story (which makes sense cause it was developed by two completely different groups with different focus). Echo-ing Rivers comments here, this probably needs to be flushed out more in terms of what the goals of this dialect are. From IREEs perspective, there is a strong dependence on the Comprehensive bufferization in Linalg, but the general bufferization is something that we cannot use. Being forced to use both would basically make it untenable for us to use either.

This comes back to I think a more structural issue. MLIR so far has had good development around code-generation, but when it comes to modeling things that are dependent on the environment/system/runtime aspects, things have been developed with one use case in mind. As long as these pieces are separate, its not an issue, but combining the whole program bufferization with the comprehensive bufferization violates that IMO. I am sympathetic to the fact that we need to have a more coherent bufferization story in core, but seems like we need to build that story first before we try to merge things that have been developed in different silos.

Also echo-ing the earlier conversation about having a place to prototype ideas and share them would be great. I think the bar for something landing in core should be fairly high, i.e. after majority of the design aspects have been sufficiently worked out, and different use cases have been accounted for. Especially for things that depend on the structure of a particular runtime, it is not even clear to me landing this in core makes sense. Each runtime has different constraints and focus areas. So what works for one might not work for another. All this could be flushed out in a common experimental area where things might get deleted, reworked drastically. As Stella mentioned, we are trying to use sandbox for this purpose. In absence of this I think some development in core has happened with a single use case in mind. At the same time (in IREE for example), we have things that have been kept in IREE and not upstreamed cause it is experimental. Both are detrimental.

1 Like

+1 Mahesh. I hadn’t seen that @pifon2a’s suggestion included moving Comprehensive there too. I am also generally skeptical about mashing together Comprehensive and the other flow. I think that moving all the things @pifon2a suggested (minus Comprehensive) there is good, since they are all basically related to one particular flow and it centralizes what is otherwise scattered around the codebase, but all fairly related to each other.

The beauty of Comprehensive is that it is 100% runtime agnostic, essentially, since it never allocates/escapes memory and has a sequential execution model; it seems like that is a building block that essentially any target that looks like a microprocessor (CPU, GPU, DSP) needs. This leads to a nice partitioning between the “runtime bufferization”, which is highly opinionated, and the low-level loop codegen bufferization (Comprehensive) which I think is a layer we can be pretty confident about sharing in core. I want to keep that layering, as this “runtime” vs “codegen” bufferization split seems to be an emerging design pattern.

From that perspective, all the things that @pifon2a suggested moving (minus Comprehensive) are basically the building blocks for a particular “runtime bufferization”, albeit one that is not very sophisticated (no refcounting, sequential execution model, allocates so much stuff, etc.). I’m not sure what to call it. It’s basically the bufferization for the “simple memref runtime” or something.

I don’t understand. How does it force anyone to use both bufferizations? It is just a path. There will be different passes, you can just use comprehensive bufferization and not all of them.

This RFC was about moving the existing ops that are already landed in core. What does it have to do with the sandbox? If we want to have some new type of bufferization, we can prototype it there. This RFC is about untangling memref and tensor dialects by moving tensor_load, clone and buffer_cast ops out of the memref dialect. They ended up in memref dialect because of the std split. They don’t belong there.

Yes, the RFC was not about ComprehensiveBufferize. But ComprehensiveBufferization cannot stay in Linalg and also needs a place. I thought that it would be nice to keep it in Dialect/Bufferize. Otherwise, I think it is even more confusing to have comprehensive bufferization in mlir/Interfaces and mlir/Transforms and have “runtime bufferization” (inconceivable bufferization?) in Dialect/Bufferize. If you think otherwise, please suggest we can put comprehensive bufferization.

Did we reach a consensus at least about moving the “incomprehensive” bufferization-related ops/passes/interfaces to Dialect/Bufferize?

I think the suggestion being followed is to use this opportunity of moving ops as a forcing function to have a more concrete discussion about the end state – and that is going to involve talking about the details of the different things that exist in this area.

This thread has been educational for me so far. But we may also be in ODM territory too, depending on how “end state” we want to enumerate right now.

My 2 cents.

1 Like

If there are in the same dialect, then presumably there is something that ties them together. Once they are tied together the coupling between the two will only increase. If they are never going to be tied together, why even have them in the same dialect.

As long as the RFC also does not encompass moving comprehensive bufferization into this dialect, I have no concerns.

Why? It is very much related to the Linalg based lowering. Call it anything apart from ComprehensiveBufferization if that reduces the confusion, but it is very tightly related to Linalg IMO.

+1. This sounds like an ODM discussion would help.

I personally am ok moving forward with that.

I suspect that “comprehensive” will land in Transforms/OneShotBufferize (both interfaces and pass – the interface is not generic enough for living in Interfaces) as it is not tied to any particular dialect, and is generally consistent with the kind of interface-driven transforms that live there. (naming TBD)

The implementation is coupled to linalg, but is it fundamental or accidental? Could it do the same thing with an interface an be extensible on non-linalg thing? (I think Sean hints at this as well)

My understanding was that the main difference is that comprehensive is a “one shot lowering” while the other one is a “progressive lowering”.

I agree with Stella that this may also be a good opportunity to think about the desired end-state on all of this.

I don’t have a good grasp if it is possible (and/or desirable) for the two approaches to converge: what are the fundamental differences or aspects that would make it clear that the two should exist for the foreseeable future?

Sure. But I agree with Sean here that these are fundamentally two different bufferizations. The “ComprehensiveBufferization” is within a particular scope. So its more re-usable and less opinionated. The other bufferization is very opinionated, more aligned to “runtime-y” things. They convert from tensors to memrefs, but thats the end of their similarity. Putting both in one dialect somehow gives the impression they are related, but they are really solving orthogonal problems. To some extent one should not influence the other (for me the comprehensive bufferization should not start catering to needs of the other whole program bufferization).

Based on a discussion I had with Nicolas yesterday, I don’t think we should try to converge the two approaches. Comprehensive is a very scoped and reusable abstraction, for example, it has strong guarantees about not escaping or allocating memory, which is an important property for it to be reusable for a lot of use cases. The other one is not as well-defined how it can be practically reused.

At this point though, the non-Comprehensive bufferization is the only thing we have upstream that allows me to take a pile of linalg ops and just run it (albeit with a number of issues). See the Torch-MLIR RefBackend link. Comprehensive requires things to be annotated as being in-place/etc. and is not suitable for random code that returns freshly allocated tensors (i.e. most practical user programs). This is not a bad thing – it just means that Comprehensive has to be nested within another layer with lifetime management/etc. (I really think this is the right layering from a design perspective).

1 Like

Well, I was just enunciating/seconding the suggestion that River and Chris had made UT.

I also have a suspicion that 90% of the issue here is naming. “Comprehensive bufferization” is a specific algorithm to induce aliasing on a constrained vocabulary of ops (which it sounds like started specific and is being generalized to be interface-based but also has some domain specific heuristics embedded in its analysis).

There is a note in the design doc that Matthias linked (which I can’t find now), which indicated that future work could follow “comprehensive bufferize” with the more “memref/runtime bufferize” for cases that escape/are weird/etc.

Naming aside, I don’t think these are solving similar levels of the problem, and probably should both exist as part of a family of algorithms.

Sounds good. Let’s discuss whether we want to put both of them in one directory at ODM. I would still like to proceed with moving the non-Comprehensive bufferization in the meantime.

1 Like

This is quite counter-intuitive to me, as I’d expect something less “one shot” / more progressive to be more reusable and less opinionated.

If I skip the part about “what works today”, isn’t your last sentence hinting at a direction in which bufferization can converge?

1 Like

Comprehensive is “comprehensive if your program meets the requirements”. Putting the program in a form that meets the requirements is actually a majorly difficult task if done in any generality / for any realistic set of production needs. It’s best not to conflate the two.

I agree that having two bufferization passes could be confusing for users.

I still have a somewhat limited understanding of the existing bufferization pass, but from what I’ve seen, I think that Comprehensive Bufferize (or parts of it) can be combined with the existing bufferization passes.

I am seeing three main difference between the two bufferization approches.

  1. Incremental vs. one-shot bufferization. This is arguably a bit fragile and leaking bufferization details, but I can also envision situations where it may be convenient to do a partial bufferization (and do the rest later). Comprehensive Bufferize bufferizes all ops that implement the BufferizableOpInterface and fails when it encounters an op that does not.

  2. Inter-function analysis. Comprehensive Bufferize does a simple call graph analysis to decide which function args can be modified and to find out if there is a mapping between function args and return values.

  3. Returning memrefs from functions/blocks. This is currently not supported in Comprehensive Bufferize, but I also don’t understand how the existing bufferization passes support this use case. If someone could chime in… (see example below)

  4. Detecting read-after-write conflicts before bufferization vs. making lots of copies and removing them after bufferization.

Regarding point (1), I think instead of failing bufferization, Comprehensive Bufferize should generate memref.buffer_cast and/or memref.tensor_load ops (controllable via a flag). In the same way as the existing bufferization passes. In the short term, I think this would make both bufferizations composable. In the medium term, we can replace the implementation of bufferization passes such as -tensor-bufferize with Comprehensive Bufferize by simply registering only the external models of tensor dialect ops etc. (Other ops will be ingored by the bufferization.) In the long term, we could clean up existing code and test cases, so that it uses a single-pass bufferization instead of multiple partial bufferization passes. (And we should get better bufferization quality, because we will generate fewer copies in the first place; point (4).)

Regarding point (2), let’s ignore Comprehensive Bufferize’s inter-function analysis for the moment and leave it in the Linalg dialect. Comprehensive Bufferize’s main advantage is its intra-function analysis. I think the inter-function analysis was needed only to make our test cases run. We can think about this later and just keep reusing -func-bufferize and whatever else is needed.

Regarding point (3), my understanding of the existing bufferization passes is not good enough. Comprehensive Bufferize could actually quite easily be changed to allow returning memrefs. But that would create a memory leak. On the other hand, I am seeing the exact same problem with the existing bufferization passes. Maybe I am missing some pass in the pass pipeline?


// RUN: mlir-opt %s -scf-bufferize -tensor-bufferize -std-bufferize -func-bufferize --finalizing-bufferize

func @if(%pred: i1, %true_val: tensor<?xf32>, %false_val: tensor<?xf32>, %elem:f32, %sz:index) -> tensor<?xf32> {
  %0 = scf.if %pred -> (tensor<?xf32>) {
    scf.yield %true_val : tensor<?xf32>
  } else {

    %result = tensor.generate %sz {
    ^bb0(%i : index):
      tensor.yield %elem : f32
    } : tensor<?xf32>

    scf.yield %result : tensor<?xf32>
  }
  return %0 : tensor<?xf32>
}

The bufferizes code has no dealloc, so it leaks memory.

module  {
  func @if(%arg0: i1, %arg1: memref<?xf32>, %arg2: memref<?xf32>, %arg3: f32, %arg4: index) -> memref<?xf32> {
    %0 = scf.if %arg0 -> (memref<?xf32>) {
      scf.yield %arg1 : memref<?xf32>
    } else {
      %1 = memref.alloc(%arg4) : memref<?xf32>
      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      scf.parallel (%arg5) = (%c0) to (%arg4) step (%c1) {
        memref.store %arg3, %1[%arg5] : memref<?xf32>
        scf.yield
      }
      scf.yield %1 : memref<?xf32>
    }
    return %0 : memref<?xf32>
  }
}
1 Like