LLVM Discussion Forums

What is the strategy for tensor->memref conversion? (bufferization)

I’ve been seeing a couple of random patterns popping up in the codebase related to tensor->memref conversion (also sometimes called “bufferization” in some parts of the codebase?). Examples 1, 2, 3.

Do we have an RFC for this? What’s the plan? Having recently implemented (and then reimplemented) this in the npcomp reference backend pass pipeline I know that there’s a design space here that requires taking opinions, and MLIR core seems to be incrementally taking a design opinion without very much discussion. Can we have that discussion?

Speaking from the npcomp side, I am happy to give a deeper dive into how we do the lowering there if folks would find that useful. We have a working reference backend that can already run some nontrivial programs. I’d be happy to rebase on whatever core is doing, but I don’t really see the design or what opinions it is taking.

CC @pifon2a, @tpopp, @nicolasvasilache, @herhut, @dfki-mako and probably others I’ve missed


I am a little confused here, we had a full presentation by DFKI folks a few ODMs back about this topic: https://drive.google.com/corp/drive/u/0/folders/1ztiyJadS-i_N2gRL_XBTPOXE6mgmfPMJ.

This describes how things work with control-flow, function boundaries, buffer placement, copy elision etc.
My understanding from the discussion in the presentation is that improvements are available by passing some booleans around instead of copying too much in certain cases.

Linalg on tensors + init args is now connected to this and work is ongoing to connect the pieces and make it run end-to-end, starting with matmul.

Offhand I can see 3 things:

  1. Constant tensors to buffers + the discussion about llvm.global. I think it is also possible to just go through vectors. I’ll modify the revision accordingly so that it works more generally without involving an opinion about llvm.global.
  2. Shape computation (for dynamic allocation): for a reasonable class of ops, a simple iterative elimination process has been shown to work in the past. I think @pifon2a has signed up for implementing something similar.
  3. Signature of external functions: right now I am being a bit cavalier here, declaring essentially print_memref_f32(tensor<*xf32>) and letting the conversion turn it into the right thing. In the future we’ll need something more robust; a function attribute that triggers signature conversion for functions subject to external ABI considerations seems reasonable.
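To make point 3 concrete, here is a rough IR sketch of the external-declaration problem. The attribute name is purely hypothetical; only the before/after signature shapes are from the discussion:

```mlir
// Before bufferization: the external function is declared against tensor
// types, with a (hypothetical) attribute requesting signature conversion.
func @print_memref_f32(tensor<*xf32>) attributes {bufferize_signature}

// After bufferization: the declaration must match the actual external ABI.
func @print_memref_f32(memref<*xf32>)
```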

It seems to me a lot of the work in that space has been covered by DFKI and connecting the missing pieces seems relatively straightforward.

Could you maybe list the points where your experience suggests taking new/different opinions is required, and/or points that are not covered by the above?


I saw the DFKI presentation, but it seemed to be mostly about deallocation / elision / optimization, not about the actual conversion process. There were a couple slides briefly touching on the conversion process, but it seemed like a light treatment and it wasn’t clear how settled the design was.

Especially as I was seeing varied and confusing terminology, it didn’t seem like the pieces were fitting into a coherent framework. E.g. various places use “bufferization”, “TensorToMemref”, “TensorsToBuffers”, or BufferPlacement (what does “placement” have to do with the conversion patterns themselves? e.g. the changes to “placement” files in this commit). Also, the core conversion patterns need to inherit from BufferAssignment{OpConversionPattern,TypeConverter}, but it’s not clear why conversions are associated with “assignment”. Is “assignment” different from “placement”? I don’t know.

Sorry if these are dumb questions, but I really find the current state very difficult to parse, even after paying close attention and asking multiple questions in the DFKI presentation.

There is some history there. I believe one early approach was to pre-compute the position for alloc/dealloc before the conversion and use this during the conversion: this led to the “Placement” terminology, I believe. “Assignment” makes more sense to me in a use case like XLA, where the memory envelope is entirely computed and you just have to “assign” each value to a position in the single buffer.

“Bufferization” makes more sense to me than “TensorToMemref” because it refers to the abstract concept/transformation, and we may have data types other than Tensor that we want to participate in this process.

I’m all for improving consistency though, so thank you for pushing on this!


Ah, that makes sense! I remember seeing that on-the-fly placement stuff in the past, which is some baggage that probably contributed to my confusion in parsing the current state.

I’ll do a pass to normalize the terminology to “bufferize” if there isn’t any dissent.

I think the term “assignment” was used with the understanding that you are “assigning buffers to tensors”. The BufferAssignmentOpConversionPattern is used to do the tensor to memref conversion. I tend to agree with your concern on naming accuracy and with what @joker-eph says above.

These conversions are extensively used for xla_hlo to xla_lhlo (mhlo to lmhlo now). As you know, -hlo-legalize-to-lhlo is really a tensor to memref conversion with allocs being placed in a straightforward way. That keeps the pass simple and keeps other concerns (like lifetimes, reuse, or copies) out of it. The -copy-removal pass gets rid of unnecessary copies and is really solid - I like the fact that it has moved into MLIR proper and uses RegionBranchOpInterface - so it could be made to work with various if/while ops. All of these passes have worked for me satisfactorily on really large models. I’m curious to know: other than the naming clarity, what exactly do you find missing on the npcomp side?
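For readers unfamiliar with that pass, the shape of the conversion is roughly the following. This is a hedged sketch: operand names are illustrative, and op spellings track the mhlo/lmhlo dialects of the time:

```mlir
// Tensor form: a value-producing op.
%0 = "mhlo.add"(%a, %b) : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>

// After -hlo-legalize-to-lhlo: the result becomes an out-parameter backed by a
// straightforwardly placed alloc; lifetimes, reuse, and copy elision are left
// to later passes.
%out = alloc() : memref<4xf32>
"lmhlo.add"(%a_buf, %b_buf, %out) : (memref<4xf32>, memref<4xf32>, memref<4xf32>) -> ()
```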

Besides, the -buffer-placement pass is, I believe, really an end-of-lifetime marker and live-range adjustment - it mainly places deallocs in IR that didn’t have any deallocs at all. So it’s meant to run right after the tensor to memref conversion, which itself isn’t aware of memref lifetimes. So “placement” here could be confusing as well.

Yes, there is a fair amount of history in the naming that is in need of cleaning up. The team at DFKI is also working on splitting the insertion of de-allocations and optimization of alloc placement into separate passes, so that one does not have to use the pipeline exactly as the hlo to lhlo lowering does.

I quite like the bufferize name (and have also started using it within kernel generator). So I would be all for standardizing on that.

@dfki-mako can you prioritize the cleanup, so that new uses can directly use better naming?

For patterns that implement the tensor to memref transformation, owning dialects can move them to a bufferize.cc and maybe expose a populateBufferizationPatterns so that the patterns can be mixed into a larger bufferization pass.

Now that I understand it better, maybe nothing is missing 🙂

One small thing that still seems to be missing is being able to split the bufferization process into multiple passes for clarity (I really don’t like having mega-conversion passes; it’s harder to debug and test). The key thing needed for that is source/target TypeConverter materializations. Unfortunately, MLIR core doesn’t have suitable ops for that. tensor_load is a suitable source materialization, but tensor_store is not suitable for use as a target materialization as it doesn’t create a new SSA value of memref type (I guess you could insert a bunch of std.dim ops + std.alloc + tensor_store, but that’s bulky and doesn’t naturally fold away at the end of the conversion process).
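To illustrate the asymmetry (a sketch using the std ops of the time): tensor_load produces a fresh SSA value in one op, while tensor_store only writes into an existing buffer, so a target materialization has to conjure the buffer itself:

```mlir
// Source materialization (memref -> tensor): one op, yields a new SSA value.
%t = tensor_load %m : memref<?xf32>

// Target materialization (tensor -> memref) via tensor_store is bulky: the
// buffer must be allocated by hand from recovered dynamic sizes, and this
// pattern doesn't naturally fold away at the end of conversion.
%c0 = constant 0 : index
%d = dim %t, %c0 : tensor<?xf32>
%buf = alloc(%d) : memref<?xf32>
tensor_store %t, %buf : memref<?xf32>
```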

In npcomp, I locally created tensor_to_memref and memref_to_tensor ops that I use for the materializations (example). I would love to be able to upstream these, but we don’t really have a good place to put them. Since I can easily keep these locally, it’s not a huge problem though.

To be precise, we would define “bufferize” as “the process of converting to buffer types”, and hence mostly refers to a set of conversion patterns that do type conversion (which might involve inserting allocs).

This is then distinct from buffer optimizations and insertion of deallocations, etc. Is that what you were thinking?

Quick question: Is deallocation-insertion and buffer optimizations expected to be running before and/or after linalg tile-and-fuse on buffers?

Also, what is the expected ordering of deallocation-insertion with respect to buffer optimizations?

Def +1 from me on the tensor_to_memref and back.

Thanks. Could you expand the op descriptions there: https://github.com/llvm/mlir-npcomp/blob/master/include/npcomp/Dialect/Refback/IR/RefbackOps.td#L45 ? Are those ops functionally equivalent to tensor_load and alloc + tensor_store respectively, but keeping all the relevant IR together? I’ve always thought that memref to tensor could be done by using a mem2reg equivalent (treating the entire buffer as one thing), which we may need at some point anyway and which would be control-flow aware across ops with regions.

Correct. They could be lowered to tensor_load and alloc+tensor_store. In the case of npcomp, we convert all tensors to memrefs eventually, so they all get eliminated by folding tensor_to_memref(memref_to_tensor(m)) -> m.
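In IR terms the round-trip looks roughly like this (the exact syntax of the npcomp ops is illustrative; only the fold itself is from the discussion):

```mlir
// A bufferized producer materializes a tensor for a not-yet-converted
// consumer, which then materializes a memref again:
%t  = memref_to_tensor %m : memref<?xf32> -> tensor<?xf32>
%m2 = tensor_to_memref %t : tensor<?xf32> -> memref<?xf32>

// Once everything is converted, the fold
//   tensor_to_memref(memref_to_tensor(%m)) -> %m
// replaces %m2 with %m and both materialization ops go away.
```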

I would think they would run “after” linalg tile (and fuse) on buffers. Basically, do a straightforward allocation, then tile + fuse; at that point some buffers are unnecessary, and buffer optimizations can run on the remaining ones. Doing it before might lead to some weird dependencies that are not fundamental to the computation.

Also, if you do linalg tile + fuse on tensors, this whole point is moot.


Ok, I’ve been reading the code and have come up with a laundry list of refactorings that I’ll do over the next few working days (@dfki-mako, don’t worry about it). Overall this looks pretty nice, modulo the confusing naming.

  • Transforms/BufferPlacement.h -> Transforms/Bufferize.h, and clearly split those components out from BufferPlacement.cpp, which will just contain the buffer placement pass itself (unrelated to the conversion infra stuff)
  • BufferAssignmentTypeConverter -> BufferizeTypeConverter
  • move bufferization patterns/passes consistently to lib…MyDialect/Transforms/Bufferize.cpp and include…MyDialect/Transforms/Bufferize.h and test…MyDialect/bufferize.mlir
  • add tensor_to_memref ops and upstream the source/target materialization from npcomp into BufferizeTypeConverter. This makes the various bufferization passes (currently added only for testing the bufferization patterns) composable with each other in production pipelines.
  • upstream scf.if/scf.for type conversion from npcomp (not tied to bufferization per se, but is a drop-in pattern for those conversions, and bufferization is a nice test pass which also happens to be useful for composing into pipelines)
  • upstream some extract_element and some other miscellaneous bufferization patterns.

All of this sounds good to me. But for the third bullet: which dialect’s transforms are you moving them to? These passes already depend on (or in the future would depend on) interfaces used by region-holding ops, and the Std dialect won’t depend on such. Why can’t they stay in lib/Transforms?

Reg. bullet 5: could the type conversion be implemented on LoopLikeOpInterface instead of on scf.for? Both affine.for and scf.for are equivalent in that regard - they both support return values. This is also a general question I had, though, about whether conversions and op rewrite patterns can be implemented on op interfaces - I think there are no examples of this.

See the last 2 commits in Linalg.

I’d like to move to a model where each dialect is “self-sufficient” w.r.t. bufferization. I was referring to e.g. https://reviews.llvm.org/D88083 which will be moved to Dialect/Shape/Transforms/Bufferize.cpp

Interesting idea. I’ll look into it.

Thanks for cleaning this up.

There is work underway to separate the bufferization into

  • the patterns (which should be bufferize)
  • a set of optimizations (loop hoisting, dominance based hoisting to avoid frees)
  • deallocation insertion

See https://reviews.llvm.org/D87756 for the WIP change.

Our approach so far has been to combine the patterns from various dialects into a single lowering step to buffers. Not arguing against the materialization approach you put forward, just mentioning this so that both variants remain available. In essence, this requires access to the registered patterns.

Nice. @dfki-mako FYI to avoid duplicating work.


Generally the idea is that you can bufferize (also partially) early (if you want), then keep optimizing the IR on buffers, which includes optimizing allocations but can also mean tiling. The placement of deallocations should be relatively late, as it finalizes lifetimes of buffers and makes further optimizations harder.

The buffer optimizations are designed to run before inserting deallocations.
