[RFC] std elementwise ops on tensors

Our std elementwise ops like std.addf accept tensors as operands. However, AFAIK this capability has gone essentially unused, so it needs a little love to make it great.

I have recently started writing a conversion to linalg from std elementwise ops on tensors, and it is working really nicely (one small MatchAnyOpTypeTag pattern can convert all ~30 ops to linalg): https://reviews.llvm.org/D90354

So I’d like to formalize a couple things to make transformations like these easier/possible to write.

1. Semantics in case of dynamic tensor size mismatches

Currently, elementwise ops on tensors don’t have a clearly specified behavior in the case of a dynamic shape mismatch. E.g. consider

%0 = addf %lhs, %rhs : tensor<?xf32>
// Generic form:
%0 = "std.addf"(%lhs, %rhs) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>

Due to dynamic shapes, at runtime %lhs could have type tensor<2xf32> and %rhs could have type tensor<7xf32>. (This is not an issue for vectors, since they are statically shaped.)

Proposal: We specify that std elementwise ops on tensors have undefined behavior in the case of dynamic shape mismatches of the operands. This is consistent with being able to lower them to linalg and other code generation systems.

Alternatives

Rejected alternative: in case of dynamic shape mismatch, the ops must safely terminate the program. This is problematic, because then we need to reify error handling logic before lowering them to code generation systems.

Rejected alternative: Disallow tensor types in std elementwise ops. While it is not hard to directly create a linalg op with a std elementwise op in its payload, this does incorporate an early decision to commit to linalg. With std elementwise ops on tensors, bidirectional conversions between the elementwise ops of HLO and TOSA become pretty easy (possibly even removing the need for some of the ops in those respective dialects).

A good amount of thought went into the design of our std elementwise ops, and I can’t think of a compelling reason why any aspect of their design rationale (such as splitting addf from addi) would not apply equally to the tensor case (especially as it already applies to the vector case!).

2. Annotations that these ops behave “elementwise”

In my current std.elementwise → linalg patch, I need to have a big nasty “isa” check enumerating all the elementwise ops that are handled (see function isStdElementwiseOpOnRankedTensors).

This is problematic: the list is difficult to keep up to date, and the “elementwise” property would be useful for other analyses/transformations too.

Proposal: Add an Elementwise trait with the following semantics for any op that has this trait:

  1. If any result is a tensor, there must be at least one operand which is a tensor.
  2. If any operand is a tensor, then there must be at least one result which is a tensor.
  3. The static types of all tensor operands and results must have the same shape (element type can vary, such as std.select predicate having i1 element type).
  4. The dynamic shapes of all tensor operands and results must be the same, otherwise the op has undefined behavior.
  5. (“systematic scalarizability” property) All Elementwise ops that have operands/results of tensor types must also be valid with the same operands/results changed to their respective element types. This creates the “scalarized” form of the op (see the example below).
  6. The semantics of the op on tensors must be the same as applying the scalarized op at each corresponding element of the tensor operands/results in parallel.
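
For concreteness, here is the scalarized form from point 5, using addf (illustrative generic form):

%0 = "std.addf"(%lhs, %rhs) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
// Scalarized form: the same op with tensor operands/results replaced by their element types.
%1 = "std.addf"(%a, %b) : (f32, f32) -> f32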

The above would apply to vectors as well (the definition needs wordsmithing to include this), and can be used to derive transformations such as a “scalarization” pass, or to implement interfaces such as VectorUnrollOpInterface.

Some care has been taken in this definition. For example, a std.select op with scalar predicate and tensor true/false operands still abides by the Elementwise trait, because the semantics are still the same as applying the scalarized op at each corresponding element. Also, we don’t need to specify which operands/results are allowed to be tensors: the “systematic scalarizability” property is all we need.
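
For example (illustrative generic form):

%0 = "std.select"(%pred, %t, %f) : (i1, tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
// Scalarized form: only the tensor operands/results change to their element types;
// the scalar predicate stays i1.
%1 = "std.select"(%pred, %a, %b) : (i1, f32, f32) -> f32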

Additionally, any op (not just std ops) can implement Elementwise, and the above convert-std-to-linalg pass can convert it to linalg (in fact, we can rename the pass “elementwise-to-linalg”). For example, a downstream dialect might want to have an “add” op that lowers to addf or addi depending on the type; such an op can be easily adapted to be Elementwise.
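
For instance (the “mydialect.add” op is hypothetical):

%0 = "mydialect.add"(%a, %b) : (tensor<?x?xi32>, tensor<?x?xi32>) -> tensor<?x?xi32>
// The scalarized form required by the trait; the dialect lowers this one to addi:
%1 = "mydialect.add"(%x, %y) : (i32, i32) -> i32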

Alternatives

Alternative considered: Have Elementwise be an OpInterface (instead of a trait) with a method “createScalarElementwiseComputation” that replaces the “systematic scalarizability” property above with a totally customizable hook. This could work as well, but seems bulkier for little added benefit.

The added benefit would be that it would allow e.g. a tensor “maxf” op to implement createScalarElementwiseComputation, creating scalar std.cmpf/std.select in the method. However, it seems preferable to adopt the proposal here, and instead legalize the tensor “maxf” op into tensor “std.cmpf” + tensor “std.select”, which then both implement Elementwise and can be trivially fused. Alternatively, the tensor “maxf” op could also adopt Elementwise by simply defining its behavior on scalars, which seems like very little effort.
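
A sketch of that legalization (the tensor “maxf” op itself is hypothetical):

// maxf %a, %b on tensor<?xf32> legalizes to:
%pred = cmpf "ogt", %a, %b : tensor<?xf32>
%max = "std.select"(%pred, %a, %b) : (tensor<?xi1>, tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
// Both results are Elementwise ops on tensors and fuse trivially.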

Another alternative could have been to disallow tensor types in std and split this out into a Tensor dialect, which would define all the ops manipulating tensors, possibly with additional attributes (Should an op have an attribute to enable/disable broadcasting behavior? Could we model layout for the produced tensors on the operations themselves? …). This would also separate the tensor domain more clearly from the scalar/vector operations.

Yeah, I think there is a lot of merit in splitting this out to another dialect.

Very much +1 on the ElementWiseInterface; as you noted, there are also a bunch of use cases on the vectorization/vector transformation side.

But that would make it nearly a replica of the mhlo dialect for the ops in question, and the latter already has all the op definitions, canonicalizations, and folding patterns implemented. Instead of splitting these into another dialect, we could just delete the tensor type support on the std dialect arithmetic ops and bootstrap the new dialect with code from mhlo.

Curious to know in what ways it would be different from the templated patterns that exist for mhlo to lmhlo or mhlo/lmhlo to linalg conversions? They too have patterns that are heavily shared by all the ops.

Why the desire to separate the tensor domain from scalar/vector for elementwise? I think for elementwise ops, there is no need to separate them. In fact, we could in theory even define these ops for any type implementing ElementwiseTraversableTypeInterface, if we added such a thing.

Touching on some specifics:

Broadcasting

This can mean various things. I’ll break it down.

There are the following design axes:

  1. Are disallowed shapes at runtime UB or a guaranteed error? Example: the dynamic operand types for addf are (tensor<7xf32>, tensor<8xf32>) -> error or UB?
  2. Is rank-difference broadcasting allowed? Example: dynamic types (tensor<7xf32>, tensor<8x7xf32>) -> tensor<8x7xf32>.
  3. Is broadcasting of size-1 dimensions allowed? Example: dynamic types (tensor<1xf32>, tensor<7xf32>) -> tensor<7xf32>.

Answers:

  1. It should be UB. Otherwise we cannot lower directly to linalg or other code generation technologies (we would need to reify error handling code). This is not to say that a higher-level dialect that wants to model guaranteed errors (“TCF” :slight_smile: ) wouldn’t make sense.
  2. Experience with XLA shows that the very first thing that we want to do with rank-difference broadcasting ops is to split them into non-broadcasting variants and explicit broadcast ops. So the answer here is we don’t want that. Linalg can trivially fuse away the explicit broadcasts.
  3. Broadcasting of size-1 dimensions should be disallowed, otherwise in the dynamic shape case it’s not possible to lower this to linalg or other codegen technologies in a general way. There are some tricks floating around by playing with memref strides, but they seem to break important invariants. The null hypothesis on supporting this is “no, don’t support it; maybe someday when we know how to handle this situation better in codegen”.

There is a good, clear reason for 2 and 3 as well: once various forms of broadcasting are allowed, it becomes impossible to write simple patterns like add(a,sub(b, a)) -> b without auditing every single pattern to make sure that it checks the broadcast attributes correctly. Specifically in that example, if a is higher rank than b then the correct pattern is add(a,sub(b, a)) -> broadcast_to(b, shape_of(a)). On the other hand, that fold is incorrect in the case of b having higher rank than a. This gets even harder to reason about in the case of size-1 broadcasting.
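
To make that concrete with hypothetical broadcasting ops (the “tcf.*” names are made up for illustration):

// a : tensor<8x7xf32>, b : tensor<7xf32>, rank broadcasting allowed:
%0 = "tcf.sub"(%b, %a) : (tensor<7xf32>, tensor<8x7xf32>) -> tensor<8x7xf32>
%1 = "tcf.add"(%a, %0) : (tensor<8x7xf32>, tensor<8x7xf32>) -> tensor<8x7xf32>
// Folding %1 to %b is incorrect: %1 has type tensor<8x7xf32>, %b has type tensor<7xf32>.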

Conclusion: broadcasting is a problem clearly separable from what std elementwise ops on tensors, as defined in this RFC, provide. There is room for an abstraction layer above what is being discussed in this RFC where it might make sense to provide it.

By avoiding broadcasting, this RFC provides:

  1. All folds “just work” across scalars, vectors, tensors – no duplicated work, and no need to audit vast numbers of patterns for correct checking of all broadcasting cases.
  2. With basically zero code, except an “elementwise” annotation that we would want anyway even for the scalar/vector case, we get a pretty comprehensive set of elementwise ops on tensors.
  3. Trivial lowering to codegen technologies that want to express loop iteration + a “scalar payload”, like linalg, or MDH (which we saw in today’s ODM). There is no need to tediously maintain a mapping to a separate set of scalar ops.

Layout

Layout specifications (either on tensors or on the ops themselves) are problematic because they make common-sense folds fail to work, similarly to what I described for broadcasting. E.g. add(a, sub(b, a)) -> b becomes impossible to write without a dedicated pattern that is “layout-aware”. With the semantics proposed here, all folds are shared with scalars and vectors and “just work”.

This does not preclude having a lower-level abstraction that includes the possibility of holding this information. However, it seems more likely that things like layout transformation will want the extended analysis abilities of technologies like linalg. There doesn’t seem to be a compelling reason for std.addf to need to track a layout annotation.


Please let me know any other concerns. Because this RFC involves no code or other added complexity besides an Elementwise trait that we seem to want anyway, I think the most compelling types of concerns are of the form “if we allow std ops to operate on tensors in the way described in this RFC, then it will prevent us in the future from doing X / will make Y harder to do”.

I can give an example of such a concern: if addf as it is defined today is restricted to scalars/vectors, then it cannot produce UB. But with this extension, it can. So this extension prevents us from knowing a priori that any instance of this op is “speculatable”.

The answer to that concern is that in the fullness of time, addf will likely grow fast-math flags like LLVM’s, in which case, it will potentially have UB if, say, the “UB if the input is a NaN” attribute is enabled. Thus, we will need to add a dynamic callback for checking speculatability depending on the op properties, and this callback can easily check for the tensor case as well. Similarly, addi will probably grow a “nsw” attribute or similar.

I agree with this. I’m much more in favor of std.addf and friends working on tensor types if broadcasting is not part of the model.

I’m not sure this follows. Layout specifications should be part of the tensor type itself, and there should be explicit reshape-like operators that change layout. This means that the result of an add feeding into a sub (in the SSA sense) has to be the same type, and therefore has to be the same layout. I don’t see a problem with doing simple folds between two ops with the same layout - the result must be the same type, so the layout will be preserved.

Does this make sense to you?

-Chris

That direction definitely makes sense and seems consistent with this RFC!

I was also trying to cover some of the other points in the design space, such as the addf {layout=...} that Mehdi hinted at, which would be inconsistent with this RFC. I agree: if we wanted an addf {layout=...} op, a separate dialect is the right choice.

This RFC is mostly a recognition that a very tightly specced extension of scalar/vector std.addf and friends naturally extends the behavior to tensors. The key additional semantics that extends the vector case to tensors is “UB in case of dynamic shape mismatch”. Dynamic shape mismatch is not an issue with vectors because they are statically shaped.

Even if we reject the “std.addf on tensors” parts of this RFC, I believe that the Elementwise trait as I’ve defined it is merely a formalization of the existing semantics and would be useful to incorporate if only to better define the vector/scalar correspondence.

It’s a little ugly that the correspondence to std elementwise ops is hidden deep in a bunch of template partial specializations, and special-cased to the lmhlo conversion. For example, this wouldn’t compose nicely for a future mhlo -> std elementwise -> tosa kind of scenario.

I think in a future world, we can directly convert mhlo elementwise ops to std elementwise ops on tensors with a nice set of “normal” patterns (not something hidden in a template specialization), and then use that single pattern to convert to linalg on tensors, which we can then bufferize with a single pattern as well. This flow + linalg named ops could subsume much of the need for the lmhlo dialect, barring legacy concerns.
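
Sketched on one op (generic form for the mhlo op; illustrative):

%0 = "mhlo.add"(%a, %b) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>
// ...converts 1:1 to:
%1 = addf %a, %b : tensor<?xf32>
// ...which the single elementwise-to-linalg pattern then converts to a linalg.generic on tensors.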

I must have missed that proposal, but such a thing should definitely not be in the standard dialect IMO, because then everyone that touches the std dialect would have to be aware of that attribute.

I thought the plan was to add a layout string to TensorType. We talked about that in a design meeting ages ago.

-Chris

There are some concerns with a string for a layout on tensors, in that it is “opaque”.
Another question was about the semantics of such an attribute on tensor: a tensor can be seen as not having a memory representation, which contradicts the notion of layout in the first place. But I don’t want to digress from this thread; we should likely bring this back separately and reconsider it carefully!

Why shouldn’t we? What is the advantage of bundling tensors inside the standard dialect, in particular with only element-wise support?
To me the tensor-domain with only element-wise is not very useful in itself: you need much more (see mhlo/TOSA/… TCP).
Then in terms of layering and progressive lowering: standard operations on scalars and vectors seem much lower level than tensors; they don’t really belong together. Tensors are also special beasts in that they are loosely specified and there are various lowering options: it seems easier to me if standard is not ambiguous this way, and we express the implementation choice for a tensor during the lowering from the tensor domain to standard.
As a consequence, it also seems easier to me to form a mental model around a dialect that focuses on the tensor domain in a more comprehensive way. At least I can draw a graph of progressive lowering and how dialects are involved, and manage to explain it to people :wink:

This was the idea with TCP as well: taking inspiration from MHLO and developing a comprehensive solution in the tensor domain. It isn’t clear to me that lowering from such a dialect to linalg/loops would need to include tensors. So again, going from the tensor domain and lowering, it means that we would go from standard dialect on tensors to a mix of linalg/affine/vector/standard-on-scalar.
In practice, the various tensor dialects will redefine all the element-wise operations anyway and implement the natural lowering themselves (again, it fits a mental model of layering) instead of using standard-on-tensor; at least that’s what I observe at the moment.

Coming back to a detail of the proposal: how do you handle zero-rank tensors in the standard dialect by the way? Canonicalize to scalar?

Vectors aren’t very useful without vector.transfer_read/vector.transfer_write, but we still have std elementwise ops on them. I don’t see why this same logic doesn’t apply to that case.

I think the vector folks have been very successful here. They rely on std elementwise ops as a base, + a “vector” dialect that contains vector-specific constructs. I think we can apply the same kind of IR design principles to the low-level tensor domain as well. Do you see a reason why this is different? I understand that there are different options for lowering tensors, but I don’t see a compelling reason why basic elementwise ops need to care.

Also, as I said, I think that defining an ElementwiseTraversableTypeInterface is a pretty cool idea we can get to down the road, so that !sparse.tensor or !distributed.tensor or mydialect.tensor_that_tracks_dimension_upper_bounds “just works”. We are already thinking along that direction in linalg, where all that linalg has to know is how to iterate the data types.

Over time, I think it has become clear that one of the great strengths of MLIR is the ability to mix multiple dialects that mutually reinforce each other and can coexist in a heterogeneous-dialect representation.

For example, in the original TCP kickoff thread, we were pondering questions like control flow, side effects, or differentiability. Now, we’ve learned as a community that these are really better done as traits/interfaces over “open” op sets. There’s no need for the compiler’s primary transformation IR to be a “closed system” in any way.

The only time we need a “closed system” is for import/export formats that we can treat as legalization targets (e.g. LLVM IR, SPIRV, etc.).

As we described in the recent npcomp talk, it is becoming clear that TCP is not a fixed dialect, but rather a description for a family of ops at a similar level of abstraction. We can mix and match linalg named ops on tensors, std elementwise ops, mhlo.send/recv, and tosa.rescale all in the same program if we want (and it doesn’t seem entirely crazy that a next-gen distributed, quantized, sparse ML compiler might actually look like that!).

I fully agree that we should aim for consolidation and providing a nice “comprehensive” solution as much as possible. I just don’t think that this RFC does anything to move us away from that vision. In fact, by providing more “batteries” (that are actually specced and documented) upstream, we will kickstart forward progress on this.

They might do that for now. But if we provide a bunch of nice folding/canonicalization on std elementwise ops, I think they will be adopted. Besides abiding by some external import/export restrictions (which we have no control over, and shouldn’t influence our decisions upstream), I think there is no reason to assume that the MLIR ecosystem will remain fragmented when nice batteries are provided.

I don’t think there is anything special to do. They are handled consistently with the case for all other ranks. For example, when lowering to linalg, there is an empty affine map, and no iterator types. Everything just works.
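
Roughly (modulo the exact linalg.generic syntax), addf on rank-0 tensors becomes:

#map = affine_map<() -> ()>
%0 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = []}
    ins(%lhs, %rhs : tensor<f32>, tensor<f32>) {
^bb0(%l: f32, %r: f32):
  %sum = addf %l, %r : f32
  linalg.yield %sum : f32
} -> tensor<f32>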

I don’t think that canonicalizing them to scalars is a good choice, because that changes the output type which might make ops that use the result become invalid.

On the argument of “element-wise usefulness” alone you’re right, but that’s only a small part of the system:

I see a fundamental difference with vectors in that they are intended to map to “registers”: they are “packed” and have a “defined” memory layout, which isn’t the case for tensors.
Vectors are meant to be mixed with scalars, and correspond to a somewhat well-identified layer of the progressive lowering.

Sure, but this does not seem to me to be an argument for mixing tensors into the standard dialect instead of separating the tensor domain. Quite the opposite, actually: if I design a system that works on the tensor domain, I’d like to be able to use ops that are restricted to the tensor domain and that I can interact with more safely.

In terms of getting to well-identified and separate components that can be isolated, I really believe that a smaller scope for the standard dialect gets us to a better place.
A dedicated dialect would allow both this dialect and standard to be more strict, which ultimately leads to a more robust system: the narrower the API, the easier it is for users to avoid surprises and implement the contract correctly.

This is also consistent with the general IR design: for similar reasons to why std.addi is split from std.addf, when I match an operation I need to handle it consistently. We don’t want passes to crash when they are fed valid IR, and if the standard dialect operations accept tensors then we need to defend against that everywhere.

Again, I think everything here applies to vectors as well – a pass that doesn’t handle std.addf on vectors could crash. Eliminating vectors would allow making std more strict, etc. It seems pretty cheap to write a simple pass verifying that every Elementwise op only operates on scalars/vectors; that pass can be dropped into any pipeline that might care.

The only compelling argument I hear is that “vectors are like registers, and therefore somehow different from tensors”. But from a semantics perspective, elementwise is just elementwise. I see it as follows: we provide std ops on scalars, and then those ops “map” to larger elementwise traversable types like vector and tensor (and eventually !sparse.tensor, !distributed.tensor, !compressed.tensor, …) in a straightforward way.


Let’s look at the alternative to this RFC. No aspect of the design rationale of the std elementwise ops (like splitting addf from addi) seems to really need to be different for the tensor case. So we are really looking at the same set of operators for std and a new tensor dialect. Thus, the next step is going to be to create ElementwiseOpsBase.td with a bunch of base classes for these ops, and stamp them out in the std dialect on scalars/vectors and then create a nearly identical set of ops in the tensor dialect. We will then have to deal with having folders somehow shared between them, and nearly identical testing. That really doesn’t seem like a great state to me.

Additionally, it doesn’t set a good composability precedent. Suppose we split exp/atan2/etc. out of the std dialect into a transcendental dialect. Do we also need to create a tensor_transcendental dialect now?

And suppose a downstream dialect has a bitops.popcount scalar op, and they want to incorporate that into tensor-level compute (perhaps they are doing research into 1-bit neural networks). They will look upstream and find that the precedent is to create tensor_bitops dialect with a separate op, duplicate their TableGen definitions, etc.

With the Elementwise trait as I’ve defined it, that downstream user would simply add Elementwise onto bitops.popcount and then the entire codegen stack would light up and seamlessly interoperate with their op, including eventually sparse, distributed, etc. tensors. Or possibly even other ElementwiseTraversable types, like a database column or analytics data stream.
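
Concretely, nothing beyond the trait is needed:

// The downstream op from the example above:
%0 = "bitops.popcount"(%t) : (tensor<?x?xi8>) -> tensor<?x?xi8>
// Its scalarized form, which ends up as the payload of the generated linalg op:
%1 = "bitops.popcount"(%x) : (i8) -> i8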

I don’t think I agree with “composability”: to me this has nothing to do with “composability” of dialects here, merely “code reuse” / “refactoring”. I sympathize with the goal, but the technical conclusion is not as obvious to me as it is to you right now.
“Composability” for example is that you can use the standard op on scalar in the body of a linalg op which maps the element of the tensor. Another example is that scf composes with std.
I always considered that dialects compose quite naturally but the salient point of the design is in the type system.
Otherwise it is purely a judgment call on layering.

I think I may have come up with a more articulated version of the “register-level” intuition.

The “register-level” ops can bound their undefined behavior to a “poison value” in the case of e.g. a violated nsw. This allows LLVM’s poison/freeze style of modeling, and allows e.g. an add nsw to be speculated (I misspoke earlier saying that it couldn’t). But applying elementwise ops to tensors whose sizes mismatch can cause nasal-demons-type undefined behavior (e.g. segfault or worse).

Thus, std ops on tensors and other more complex data types are very different in that we cannot reason about them as poison values in the case of UB.

That’s a pretty strong argument for keeping tensors separate from “machine” level data types.

Actually, for that, I think my reasoning from earlier applies, in a modified form.

For checking e.g. udiv speculatability (LLVM equivalent code), we are still going to want a dynamic callback, so incorporating a data type check into our speculatability infrastructure doesn’t seem too onerous.

Also, it’s not clear to me whether scalable/variable length/symbolic size vectors (which I think is an extension we are pursuing) will share this property as well, so we might need to do a more elaborate speculatability check based on datatype anyway.

Sure, but some form of being “opaque” is natural for this. There are lots of different sorts of layouts, and I believe we want an “open” system here, so people can experiment with new layouts (e.g. sparse formats) without having to hack the compiler. An open system means that you need an opaque form, and stuff that is touching TensorType will have to conservatively handle things it doesn’t know.

While there is a possibility that we could do a structured system like attributes, I think that strings will work well for this. We could have a namespaced system with a few standard things like "std.nchw" or whatever, while allowing people to invent their own things. A system of interfaces can then be developed to decode and produce them for imperative code that wants to reason about strides or whatever.
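
E.g. (hypothetical syntax, purely to illustrate the namespacing):

// A “standard” layout string and a user-invented one, both opaque on the type:
tensor<?x?x?x?xf32, "std.nchw">
tensor<?x?xf32, "mydialect.morton">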

Is there another reasonable proposal for how to handle this? I don’t think that putting attributes on operations is comparable. That just moves the problem, and encoding this in the type is more natural, given that a tensor is a value-semantic SSA value that carries a layout.

-Chris

I don’t think going from structured to string is the only option here. We had the discussion and proposal last year around options for a group of common ones too. Having layout simply be an Attribute would give opaqueness/extensibility too, and would avoid hard-coding domain-specific/non-general terminology into standard (e.g., NCHW doesn’t generalize to larger dimensions, and coining terms for every dimension in a non-uniform manner is not intuitive [DHW? MKL layout? Morton layout? row & column major do scale] and requires looking things up in a glossary vs. a mapping: have as convention that the mapping is logical-to-physical plus a mapping function, and anyone can read arbitrary dimensions and expand it to blocked layouts etc.).

Then one could also have dialect attributes, which could hook in more easily with interfaces (so definite +1 on a system of interfaces).

Now an issue is how these propagate. The above sounds like it requires that all types & layouts match when doing a std.addi, and that the result has the same layout as both inputs. So if we have a zero-splat tensor constant, we may need it in multiple different layouts.

But perhaps we should split this out into a separate convo (also related to Types & Attributes).

Why is elementwise particular to tensors? What is required there that would not work for vectors or memrefs? Elementwise to me just means these operations are/can be represented as a map of scalar functions (a restriction on the iteration space).


For now, I was thinking tensors and vectors would be hardcoded (the final wording would cover vectors and tensors, as I mentioned in the OP), because this RFC is about formalizing the current behavior; we can then build out from that.

We don’t currently have a type trait for ElementwiseTraversable/MappableElementwise. But yes, I’d like to make it a more open system. This probably shouldn’t apply to memrefs because they are not value types. E.g. you can’t CSE an “addf” on memrefs without seeing if anybody has written to the memref between the dominating addf and the dominated addf.
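
A sketch of the hazard (hypothetical, since addf does not accept memrefs):

%0 = "std.addf"(%m1, %m2) : (memref<4xf32>, memref<4xf32>) -> memref<4xf32>
store %cst, %m1[%i] : memref<4xf32>
// Cannot be CSE’d with %0 because of the intervening store into %m1:
%1 = "std.addf"(%m1, %m2) : (memref<4xf32>, memref<4xf32>) -> memref<4xf32>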


And yes, let’s keep the layout discussion (as interesting as it is) in its own thread and not derail this one.


I had a video call with Mehdi on Friday and he seemed supportive. Wondering if folks have any objections or if we can move forward with this.