Do we need all ops to take tensors?

Here is an interesting design question w.r.t. a more linalg-y direction or a more xla_hlo-y representation for Tensor Compute Primitives. Consider a simple pointwise add, and these two IR representations:

the “tensor” approach:

%add = "tensor.add"(%lhs, %rhs) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>

the “lifting” approach:

%add = "lifting.pointwise"(%lhs, %rhs) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32> {
  ^bb0(%0: f32, %1: f32)
    %2 = "scalar.add"(%0, %1) : (f32, f32) -> f32
    yield %2
}

(note that “lifting.pointwise” is a very special case of linalg.generic)
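
For concreteness, here is a rough sketch of what that special case looks like as a linalg.generic on tensors. This is only an approximation of current upstream syntax; the explicit %init output operand and the arith.addf/linalg.yield ops are assumptions of that form, not part of the hypothetical dialects above:

#id = affine_map<(d0) -> (d0)>
%add = linalg.generic {
    indexing_maps = [#id, #id, #id],
    iterator_types = ["parallel"]
  } ins(%lhs, %rhs : tensor<?xf32>, tensor<?xf32>)
    outs(%init : tensor<?xf32>) {
  ^bb0(%0: f32, %1: f32, %out: f32):
    %2 = arith.addf %0, %1 : f32
    linalg.yield %2 : f32
} -> tensor<?xf32>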

The lifting approach has the appealing property that we only need to define scalar versions of the operations, and all logic related to shape propagation, shape functions, broadcasting, tensor-vs-memref, etc. is concentrated in just the “lifting.pointwise” op.

However, the lifting approach has the downside that (at least naively) it relies on the ability to fuse large amounts of computation into the body of the region in order to do even simple arithmetic simplification. Consider the user code:

x = add(t0, t1)
if cond():
  y = sub(x, t1)
  print(y)
else:
  y = sub(x, t0)
  print(y)
# No other uses of `x` or `y`.

We would like to optimize this into the code

if cond():
  print(t0)
else:
  print(t1)

With the “tensor” approach, this is a relatively straightforward application of a fold like sub(add(t0, t1), t1) -> t0. However, with the “lifting” approach the IR for this becomes much harder to follow and requires looking through regions to find this simple identity. Here, control flow presented a barrier that made it impossible to fuse everything into a single “lifting.pointwise” (at which point the scalar application of the sub(add(t0, t1), t1) -> t0 pattern would trivially apply).
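
For illustration, here is roughly what the straightforward tensor-level rewrite looks like for one branch ("tensor.sub" and "print" are hypothetical ops, mirroring "tensor.add" above):

// Before:
%x = "tensor.add"(%t0, %t1) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
%y = "tensor.sub"(%x, %t1) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
"print"(%y) : (tensor<?xf32>) -> ()

// After applying sub(add(t0, t1), t1) -> t0 (plus DCE of the now-unused %x):
"print"(%t0) : (tensor<?xf32>) -> ()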

Thoughts? In some ways, I feel like the “lifting” approach feels more principled, and linalg shows that it scales to a large family of ops. However, it seems hard to beat the intuitive simplicity of the “tensor” approach when it comes to extending many basic optimizations to the tensor domain. On the other hand, I feel like the “tensor” approach might become less “intuitively simple” once broadcasting, shape propagation, tensor-vs-memref, etc. come into the picture. With the right traits though, perhaps it will be palatable? Maybe we need to discuss those other topics first.

Thanks for starting this first thread branching off from the announcement @_sean_silva.

Why not both? :slight_smile: At the moment I am experimenting with both forms in Linalg with the notion of named ops, and creating a mapping between one form and the other.

More generally, different types of fusion can be done at multiple levels (reusing your terminology):

  1. “Tensor” approach (a.k.a. TASO, TensorRT, TF, …), which operates on semantically charged named things
  2. “Lifting” approach (e.g. Linalg-style on tensors), which sits in between 1. and 3.
  3. “Lifting approach on buffers” (e.g. Linalg-style on buffers / Halide-style / TVM-style)
  4. Loops (affine, scf)
  5. CFG

The lower you go, the harder it becomes to perform fusion and the more of a raising / inverse problem you have.

Add a layer of progressive-lowering motherhood and apple pie, and I think the lesson is: do what you can at the level of abstraction where it is easiest, if it makes sense. However, we’re still only touching the notions of expressiveness and fusion in a very simple case.

IMO, we need to think more broadly about both expressiveness (reductions, gather/scatter, etc.) and the impact on key transformations (fusion, tiling, ability to recreate a library call, vectorization).

To sum it up, I think it is great to start scoping out subpieces/projections of the general problem, but there are other considerations to keep in mind. For the particular problem you defined, I’d venture that:

  1. we should have both forms
  2. we should be able to map between both forms

The mapping could be done in a “direct” way:

  1. a named op in Tablegen constructs its region from a declarative form (e.g. a language-level annotation)
  2. matchers are generated by having a function that builds the region, to go from the “tensor approach” to the “lifting approach”. “isa” consists of building the region for X and doing a region comparison with the “lifted form” (sketched below).
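
As a concrete illustration with an existing named op (a sketch only; the syntax is an approximation of current upstream linalg), linalg.matmul and its region form carry the same information, so a generated matcher can build the region form and structurally compare regions:

// Named form:
%0 = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
                   outs(%C : tensor<?x?xf32>) -> tensor<?x?xf32>

// Region ("lifted") form that the declarative spec builds, and that a
// generated matcher would compare a candidate op against:
%0 = linalg.generic {
    indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                     affine_map<(m, n, k) -> (k, n)>,
                     affine_map<(m, n, k) -> (m, n)>],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
    outs(%C : tensor<?x?xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %mul = arith.mulf %a, %b : f32
    %acc = arith.addf %c, %mul : f32
    linalg.yield %acc : f32
} -> tensor<?x?xf32>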

I think this should allow avoiding a good part of the inverse/raising problems.

Of course this does not work magically after fusion and other transformations; there are many orthogonal problems to solve if you want this information to survive fusion (which is why Linalg does tileAndFuse + library calls in the buffer world, etc.).

After I saw your proof-of-concept for generating the named ops (please add link when it is open source), a light went on in my head. I think that’s really promising :slight_smile:

Even if we don’t go exactly with the “tensor” approach being literally associated with the linalg dialect via “named ops”, it still strongly implies a set of traits that we should consider having for all such ops, even in a pure “tensor approach” dialect, so that we can translate them to linalg generically and also centralize all handling of shapes, broadcasting, etc.

Absolutely, the “named ops” approach is the first one that came to mind and, as we iterate, I expect we’ll learn about those traits.

+100 I think we can do significantly better re. single idiomatic source of truth and those other aspects you list while automating the progressive lowering to Linalg and other loop forms.

+1. I thought it was clear one needs both. :slight_smile: Besides the scenario @_sean_silva points out (i.e., SSA optimizations extended to tensor ops and values), there are other scenarios: the operations in tensor form lend themselves to more domain-specific transformations that rely on inspecting attributes as well as exploiting the op’s algorithmic aspects. For example, if you have enough information to decide that you’d implement a convolution with certain sizes (say 5x5 weight windows or larger) with an FFT instead of in the time domain, such a decision doesn’t require looking at the memory form or any other expanded-out form – it would be made on the tensor-operand form of the op with its attributes. (Just an example scenario.) Other things like folding a transpose of a transpose or folding successive reshapes in several cases don’t have to wait for expansion out of tensor type form. The op’s semantics can be exploited in pattern rewrites on the tensor SSA form. Enabling all of this has been part of MLIR’s goals right from the early days of its design, but the choice and definition/details of these tensor compute ops haven’t been put in a dialect reusable by ML frontends.
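
For instance, a transpose-of-transpose fold needs nothing beyond the op’s semantics and attributes on the tensor SSA form (the op name and attribute here are hypothetical):

%t = "tensor.transpose"(%x) {permutation = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
%u = "tensor.transpose"(%t) {permutation = [1, 0]} : (tensor<?x?xf32>) -> tensor<?x?xf32>
// The permutations compose to the identity, so %u folds to %x.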

There are, however, also going to be several cases where you’ll have to choose between implementing a transformation on tensor types or memref types – and you may wind up having it work on both forms.

+1. I also would hope that we get both in the end.

The original add example is probably not very informative, as in both forms you ultimately rely on semantic knowledge of what the operation does. You need to know what tensor.add and scalar.add do to understand both forms. The only declarative information you have gained going to linalg is that it is element-wise.

The flip-side of this, as @bondhugula stated, is that you nail down more details of the actual implementation of the operation. Knowing that it is element-wise is probably always beneficial while, staying with the other example, for a convolution you might want to keep the used algorithm unspecified.

For that reason, I see linalg at a lower level than a tensor level dialect. And named operations are essentially tensor dialect ops that have a (declaratively) specified lowering to linalg. That is of great value but in my opinion we should not gate tensor level operations on the existence of such a lowering.

I’m on the same page here: I would express a named op in terms of its lowering to Linalg when possible, but the high-level dialect may have more concepts than what is representable in Linalg if needed.

@herhut @mehdi_amini Can you give specific examples of such ops in the higher-level dialect which you don’t expect to be representable in linalg?

One example that I can think of is softmax, which internally hides a reduction along with other stuff, so it isn’t declaratively lowerable to a single linalg generic op. On the other hand, I notice that in our prior art, XLA’s HLO doesn’t have an op for that, so it requires frontends to split softmax into multiple ops (all of which, to my knowledge, are individually lowerable to a single linalg generic).
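
A hedged sketch of that split at the tensor level, with hypothetical op names; each piece should individually be expressible as a single linalg generic (broadcasting of the reduced operands is assumed to be implicit here):

// softmax of %x : tensor<?x?xf32> along dimension 1:
%max = "tensor.reduce_max"(%x) {dims = [1]} : (tensor<?x?xf32>) -> tensor<?xf32>
%shifted = "tensor.sub"(%x, %max) : (tensor<?x?xf32>, tensor<?xf32>) -> tensor<?x?xf32>
%exp = "tensor.exp"(%shifted) : (tensor<?x?xf32>) -> tensor<?x?xf32>
%sum = "tensor.reduce_sum"(%exp) {dims = [1]} : (tensor<?x?xf32>) -> tensor<?xf32>
%softmax = "tensor.div"(%exp, %sum) : (tensor<?x?xf32>, tensor<?xf32>) -> tensor<?x?xf32>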

(btw, I notice that XLA has a fused batchnorm op which is curious; is it just a historical artifact?)

I guess I’m curious whether we can use declarative-specified, linalg-compatible, semantically-charged named ops as a well-defined subproblem of the larger TCP story. Or to put it another way: suppose we had a declaratively specified dialect of all common ML operators that are expressible as linalg generic. What would be left to solve for the overall TCP story? Are there any remaining parts that are non-orthogonal to said dialect?

If we just start with HLO, the most obvious ones that come to my mind are the set of RNG ops, TriangularSolve, and Sort.
If we take more than HLO, there are things like tf.Unique or tf.Where.

This is mostly a historical artifact (it’s canonicalized into multiple ops right now on all backends), but if, e.g., cuDNN batchnorm were to be significantly faster than XLA-generated code on a new GPU architecture, that canonicalization might be switched in the other direction for that backend.

Each of these XLA ops exists because it benefits from special codegen or a library call on one or another backend (e.g. cuSOLVER for the linear algebra ops on GPU)

To put this in the perspective of Linalg:

The generic ops have the semantics of “Perfectly Nested Writes To The Whole Output Operands”, and we use transformations + LICM and conditional hoisting to create imperfectly nested control flow as we progressively lower away from Linalg.

Some basic modeling of a DAG is expected to be useful as an addition.
I expect something like linalg.batchnorm = seq/dag(linalg.reduce, linalg.mean, linalg.variance) (I forget the details of batchnorm offhand, but we had it as a few TC ops).
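
For reference, inference-mode batchnorm computes roughly gamma * (x - mean) / sqrt(variance + eps) + beta per channel, so a hedged sketch of such a DAG, reusing the hypothetical op names above, could be:

// Per-channel statistics over the non-channel dimensions (NCHW layout assumed):
%mean = "linalg.mean"(%x) {dims = [0, 2, 3]} : (tensor<?x?x?x?xf32>) -> tensor<?xf32>
%var  = "linalg.variance"(%x) {dims = [0, 2, 3]} : (tensor<?x?x?x?xf32>) -> tensor<?xf32>
// ... followed by pointwise ops computing gamma * (%x - %mean) / sqrt(%var + eps) + beta.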

Once we have basic linalg “named ops” support, we can push this exploration further.