[RFC] Promoting `linalg.init_tensor` to the `bufferize` dialect

Drawing on some recent discussions regarding relayering/splitting Linalg, one of the ops that should graduate out of linalg is linalg.init_tensor.

Background

linalg.init_tensor was added in D93374 (“[mlir][Linalg] Define a linalg.init_tensor operation”) and, roughly speaking, the op creates an “empty” tensor of a given shape. However, the ODS is not really precise about what this means. The commit message gives more insight into the intended use, but uses semantically-confusing phrases like:

  • “is expected to overwrite the entire tensor” - the tensor type is value-semantic, so talking about “overwriting” it doesn’t make any sense from the perspective of defining the op semantics.
  • “will be used as the init tensor of such Linalg operations” - having an op that can only be used as an operand to another specific op doesn’t really fit well with the data model of value-semantic tensors – I can pass any tensor to anything that accepts a tensor and that should be ok (same for block arguments).

The use case

Before tightening up the semantics of this op, I would like to provide concrete insight into the most important use case for this op. Basically, linalg.init_tensor is the equivalent of memref.alloc. However, it has proven very useful to not jump to memrefs immediately, but rather to stay in value-semantic tensor land to retain SSA use-def chains. One key trick that makes this work well is the use of “tied” operands (which in linalg take the form of outs operands), which simplify analyses and transformations and can be used to model a restricted form of in-place updates in a program that otherwise has value-semantics. The burden then falls on “bufferization” to make the updates truly in place if possible (and make copies if not). This is a reuse analysis somewhat analogous to register allocation in a traditional compiler. In fact, LLVM’s MachineIR has the same kind of “tied” operands, for modeling so-called “2 address” instructions like add eax, edx in a “3 address” SSA IR.
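To make the “tied” outs idea concrete, here is a minimal hand-written sketch (%A and %B are assumed tensor<?x?xf32> inputs; the names and shapes are made up for illustration):

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%cst = arith.constant 0.0 : f32
%d0 = tensor.dim %A, %c0 : tensor<?x?xf32>
%d1 = tensor.dim %B, %c1 : tensor<?x?xf32>
// A "destination" of the right shape, zero-filled to initialize the accumulator.
%init = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%fill = linalg.fill(%cst, %init) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
// The outs operand is tied to the result; bufferization may later turn this
// into an in-place update of the buffer backing %fill.
%res = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
                     outs(%fill : tensor<?x?xf32>) -> tensor<?x?xf32>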

Another word for this value-semantic modeling of an underlying mutable program is “destination-passing style”. This destination-passing style gets the advantages of value-semantics while maintaining enough predictability w.r.t. in-place updates to be useful for codegen. There isn’t a specific, bulletproof contract for what becomes in-place at the buffer level, but the reality is that bufferization algorithms are sufficiently sophisticated that in practice the results are unsurprising.

Non-goals

We don’t intend for this op to be an “undef” in the LLVM sense (uses seeing different values), or touch on any complex UB topics that are needed to unlock certain aggressive optimizations (e.g. counterfactual reasoning with poison values) – it has been found that these optimizations are not important at the level of abstraction / use cases that this op is used for.

This allows us to propose a fairly relaxed semantics that sidesteps “the hard parts” of UB/poison/undef discussions.

Proposed semantics / new op name

Rename linalg.init_tensor to bufferize.unspecified_tensor (please bikeshed), and give it the following semantics:

"
bufferize.unspecified_tensor produces a tensor of the given shape with unspecified values.
Each instance of the bufferize.unspecified_tensor op potentially produces different values.
"
Additionally:

  1. The op has an “allocation” side effect. This prevents it from being CSE’d (for the use cases of this op, CSE is not really useful and I hear that the existing CSE behavior of linalg.init_tensor has actually caused some headaches)
  2. The op is restricted to memref-compatible element types
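To make this concrete, here is a hypothetical usage sketch (the op does not exist yet, so the name and the init_tensor-like syntax below are assumed, not implemented):

// Two instances with the same shape: each may hold different (unspecified)
// values, and the allocation effect keeps CSE from folding them into one.
%0 = bufferize.unspecified_tensor [%d0, %d1] : tensor<?x?xf32>
%1 = bufferize.unspecified_tensor [%d0, %d1] : tensor<?x?xf32>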

The reason this op fits well in the “bufferize” dialect is that, as I mentioned, the op is really “memref.alloc” but for the tensor-level destination-passing style. This is very intimately related to bufferization – in fact, it is so intimately related that people who use this op are usually thinking about the underlying mutable program they expect it to bufferize to rather than the nominal value-semantic program actually written in the IR.
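For intuition, here is a hand-written sketch (not the output of any particular bufferization pass) of the kind of buffer-level program the matmul sketch above is expected to lower to; %Abuf and %Bbuf stand for the buffers backing %A and %B, and an actual pass may elide the alloc entirely by reusing an operand’s buffer instead:

// The init_tensor becomes an allocation; the tied outs becomes an in-place update.
%buf = memref.alloc(%d0, %d1) : memref<?x?xf32>
linalg.fill(%cst, %buf) : f32, memref<?x?xf32>
linalg.matmul ins(%Abuf, %Bbuf : memref<?x?xf32>, memref<?x?xf32>)
              outs(%buf : memref<?x?xf32>)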

I believe that the proposed semantics here are fully-defined and well-specified at the value-semantic tensor level, while still having the right properties for the main use case of this op.

Thoughts?

@nicolasvasilache @MaheshRavishankar @mehdi_amini @herhut


Just as an extra data point, I had the need for a similar operation in the sparse tensor dialect for cases where a sparse tensor materializes during a computation. It was subject to similar conditions as you sketched above, i.e. it has an underlying implied allocation side effect (technically, though, the op is purely SSA-based without side effects). I would be strongly in favor of merging the ops at some point (although e.g. tensor seems a better dialect for this than bufferize; I do understand your reasoning, though).

    %c4 = arith.constant 4 : index
    %C = sparse_tensor.init [%c4, %c4] : tensor<4x4xf64, #CSR>
    %D = linalg.matmul
      ins(%A, %B: tensor<4x8xf64, #CSR>, tensor<8x4xf64, #CSR>)
         outs(%C: tensor<4x4xf64, #CSR>) -> tensor<4x4xf64, #CSR>

Adding a memory side effect for this op seems a bit weird, since I don’t think it really has such a side effect. I get that this is convenient because we don’t want it CSE’ed, but that seems like something to address in a cost model, not by having a pretend side effect. If it’s the best path forward for now, then it seems like a pragmatic tradeoff, but I think it should be discussed and acknowledged as such.

I have concerns with this: as specified, this reminds me of “stateful” operations in TensorFlow. Strictly speaking you can’t duplicate any such operation, so technically (without more info in the spec) this forbids sinking, loop unrolling, inlining, …

This makes me quite uneasy: it seems that we’re not describing the operation’s behavior properly here, but rather shaping it around the limitations of the passes. Any such value-based op should be able to be CSE’d. Whether it is profitable is another question, and I believe this is more the role of a phase that would “undo” any CSE for profitability. Do you have a reference to the “headaches” this caused so far? The few cases I saw were more pointing to mistaken assumptions in the way the bufferization passes were written than anything else (basically implementation/design bugs).


First, using linalg.init_tensor as a proxy for memref.alloc was never its intended use case. The wording of the patch that you point to explicitly says:

The tensor is materialized only for the shape of the output 

This is the load-bearing part. I agree I should have worded the patch better, but it was written at a point where (my) understanding of Linalg on tensors was still evolving. So let’s leave that point aside.

linalg.init_tensor was added to get around a missing type. What is really needed is:

  1. A type that carries the shape of a tensor and not the tensor itself.
  2. Operations that need a tensor operand only for its shape should take as an operand a Value of this type (and not the tensor type).

For example

%1 = linalg.fill(%cst, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>

Here the linalg.fill operation does not actually need a tensor. It needs the shape of the tensor to fill into, and it “creates” a new tensor from it. That type was missing (and is still missing). linalg.init_tensor was added explicitly as a placeholder until we had that type, and of course we still don’t have it. The fact that linalg.init_tensor has been used as a way to signify allocation at the tensor level is really something specific to bufferization.
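For illustration, a purely hypothetical sketch of what IR with such a shape-only type could look like (none of these names exist upstream, and this would not verify today):

// Hypothetical: a value carrying only the shape ?x?xf32, never any data.
%shape = "hypothetical.make_shape"(%d0, %d1) : (index, index) -> !hypothetical.shape<?x?xf32>
// linalg.fill would then take the shape as its "destination" and produce a real tensor.
%1 = linalg.fill(%cst, %shape) : f32, !hypothetical.shape<?x?xf32> -> tensor<?x?xf32>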

So coming to this RFC: I do understand that the current LinalgComprehensive bufferization requires a bufferization.unspecified_tensor. But I’d say that is a requirement of the current algorithm and its implementation; there is potentially a different algorithm/implementation that doesn’t need it. That is not to say we have to change the current implementation. Having such an operation does make the implementation easier. So +1 from me for having such an operation in the bufferization dialect.

What I am strongly -1 on is removing linalg.init_tensor. All lowerings from TOSA/MHLO to Linalg use this operation in the way it is intended to be used: to provide information about the shape of the tensor to the operations that only need the shape. If the RFC were about adding such a type and then removing linalg.init_tensor, I would have been a strong +1 on it. But exposing the bufferization operation all the way up to TOSA/MHLO would be a violation of layering, and I have strong concerns about this. One of the advantages of a tensor-based approach is that it is side-effect free. Changing that contract can have undesirable downstream effects.


This is precisely what I am trying to work out. The current bufferization just allocates a buffer when it sees a linalg.init_tensor. I don’t have enough data to say that this approach is wrong. It might be fine. But CSE of linalg.init_tensor creates false sharing. So what I am trying out as I move IREE to use the upstream bufferization is to effectively un-CSE the operations. Still trying to figure it out, though.

+1 - this matches my thinking. If this is important enough to the bufferization algorithm, then it should be conducting its own profitability analysis/transformation vs just relying on some latent structure in the program (which CSE can foul up, but so can anything).

+1 - I feel like this is where the design went wrong and where we should focus on fixing it. An abstract_tensor which is legal for metadata ops but cannot be used for data access seems like an important component that I have seen nicely rationalize other representations. Having such a thing be coherent with the tensor/ShapedType hierarchy would help a lot at this level.

Thanks Stella for introducing this term here (I was hoping you would, based on our discussion). A relevant example here is JAX abstract values (I don’t have a good pointer, but this one: How JAX primitives work — JAX documentation). This passage in particular is informative:

In particular, a JAX-traceable function is sometimes invoked by JAX with abstract arguments. An example of a JAX abstract value is ShapedArray(float32[2,2]), which captures the type and the shape of values, but not the concrete data values. JAX primitives know how to operate on both concrete data values and on the JAX abstract values.

My understanding, based on what Nicolas described to me, is that the use of init_tensor as “just a shape” is not actually accurate anymore. What I remember hearing is that the moment you start tiling/etc. on tensors, the “shape-only” interpretation falls apart (even for elementwise/broadcast ops like linalg.fill), and that it really is a placeholder for a “memref.alloc under the hood” when init_tensor is created by linalg transformations.

@nicolasvasilache can you explain more?

Which hints that we are missing something, in my view. At the outset, the program can be correctly represented with pure tensor / abstract_tensor semantics, and that is a valuable property. However, after a certain point of transformation, you may need to introduce something semantically closer to where you are going. It seems like we just haven’t defined that transition point robustly and are relying on an off-label use of an op that we just had lying around – and are paying the price of fudging that.

I can’t say precisely what the missing piece/step is, but it seems like we got here by shortcutting the strict value semantics at the top level and if we want to fix it, we need to walk it forward and find the precise addition at the transition point where the program becomes partially buffer-aware.


Yes, that’s part of the challenge to fix. Let’s say you have this example:

%0 = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill(%cst, %0) : f32, tensor<?x?xf32> -> tensor<?x?xf32>

If you tile this, you get:

%0 = linalg.init_tensor ...
%1 = scf.for %iv0 = ... iter_args(%arg0 = %0) -> (tensor<?x?xf32>) {
   %2 = tensor.extract_slice %arg0[%iv0] [...][...]
   %3 = linalg.fill(%cst, %2)
   %4 = tensor.insert_slice %3 into %arg0[%iv0]...
   scf.yield %4 : tensor<?x?xf32>
}

If we use a type that is just a tensor shape, this IR falls apart. In particular, the scf.for requirement that the iter_args, the scf.yield operand, and the returned value all have the same type no longer holds. So the representation of tiling on tensors using scf.for falls apart. But I don’t think that implies that linalg.init_tensor as a proxy for a shape falls apart.


I fully agree that it’d be nice to address this part, but there might be a wrinkle in that the “outs” on a linalg generic can be used for the shape, but also for its value as an initial value when there is a reduction involved. If I remember correctly, Nicolas was concerned about the code complexity in having to handle all these cases separately (just a shape or a tensor, or both).

+1 I was just thinking that. Though based on my observations we aren’t really “paying the price” for anything. AFAICT linalg.init_tensor basically works and is a good design point. It happens to cover both use cases really well. Is that not the case?

I think that will not be the case if you introduce a side-effecting operation instead of linalg.init_tensor.

Ugh, yeah you’re right :confused: That kind of kills this whole proposal and makes linalg.init_tensor really no different from a tensor full of zeros (!), in the sense that all linalg.init_tensor ops of the same shape must have some specific, identical contents to be well-defined w.r.t. value semantics without triggering this “stateful op” kind of thing.

An allocation side effect on an op with value semantics generally doesn’t prevent anything you could do with a side-effect-free op, besides CSE (but see Mehdi’s point about the “stateful op” thing – that’s actually a very serious concern and basically kills the idea of linalg.init_tensor being any different from a tensor full of zeros).

Leaving aside my dislike of the term outs for all of this, I think that the additional type actually could make some of those distinctions clearer and verifiable. For example, not all constructs require a data-carrying tensor (e.g. parallel loops), but for those that do (e.g. reduction loops), it is not easy without analysis to determine whether the out you are given is metadata-only (e.g. the result of an init_tensor) or data-carrying (e.g. the result of a fill). If all of this were encoded in the type system, it should turn that into a local verification, but not really change the structure of things otherwise.

I haven’t worked this through in detail at all of the layers. I’m more just noting that we got here by fudging the line, and we may be in the territory of popping back to the top of the stack and seeing if this core modeling needs repair vs a more point fix to post-hoc accept the existing structure.


Well, when I said “paying the price”, I meant that we may not have been completely honest with the system with respect to this being a pure/no-side-effects/value-based op (i.e. after a certain point, we don’t use it that way), and then the completely reasonable decisions that CSE makes end up violating downstream assumptions. Either we can make a point repair to the algorithms that are dealing with this, which may be the right thing, or there is a representation bug to fix.


I was actually never sure why we don’t just use a tensor.zeroes() instead of init_tensor and rely on the optimizer to “dead-store-eliminate” the zero initialization?
The original introduction of the op was motivated by something like “we don’t have a type to model just the shape/metadata so we’ll use a ‘fake’ tensor with no data” IIRC, but using zeroes would work for this as well in terms of functionality I think.
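A minimal sketch of the idea (static shapes only, since a dense constant needs a static type; tensor.zeroes doesn’t exist, so a splat constant stands in here, and nothing in this snippet is a concrete proposal):

// A tensor of zeroes standing in for init_tensor as the "destination".
%zeroes = arith.constant dense<0.0> : tensor<4x4xf32>
// The fill overwrites every element, so the zero-initialization is a dead
// store that bufferization / later cleanups could eliminate.
%filled = linalg.fill(%cst, %zeroes) : f32, tensor<4x4xf32> -> tensor<4x4xf32>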


As a rule of thumb, note that the whole notion of structured codegen on tensors is modeled after LLVM’s manipulations of structs “but more dynamic”:

  1. create a struct with no load bearing value
  2. insertvalue/extractvalue into/out of it (insertion produces a copy of the value with some field changed)
  3. pass around and compute with it

In the tensor domain, the operations to insert / extract operate on full subsets (mostly regular dense subtensors for now) rooted at dynamic offsets and with dynamic sizes. Similarly, insert returns a new tensor with some entries updated.
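Concretely, the tensor-side counterparts look like this (a small sketch; the offsets and sizes are made up):

// Extract a dynamically-sized subtensor rooted at a dynamic offset...
%slice = tensor.extract_slice %t[%off, 0] [%sz, 4] [1, 1] : tensor<?x?xf32> to tensor<?x4xf32>
// ...and insert returns a whole new tensor value with that subset updated,
// analogous to LLVM's insertvalue returning a new struct value.
%updated = tensor.insert_slice %slice into %t[%off, 0] [%sz, 4] [1, 1] : tensor<?x4xf32> into tensor<?x?xf32>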

Struct manipulations resolve into register and memory manipulations; tensor manipulations resolve into alloc and memory manipulations.

In this codegen/transformations-first model, ops do not materialize new values out of thin air (unlike in more frontend-y things like e.g. XLA) but rather behave just like ops on LLVM structs.

Materializing new values out of thin air is a fundamentally different abstraction from this “destination style” that we have been using to allow breaking ops into smaller n-D pieces (which is the key interface needed for all transformations). Note that producing values out of thin air means the “alloc” is internal to the op, which has some scope-breaking implications. The “it’s just a shape” thinking is not suited to the way we are doing transformations in the tensor domain: it does not compose with scf.for ops that return tensors, for instance.

In any case, there is no magic: “someone/something” has to create allocs.
Having one load-bearing operation for that purpose has been very useful for composing transformations and avoiding rabbit holes of IR design (i.e. “let’s add just one more op…”).
