Is the TCP "matmul" op marked NoSideEffect?

Consider this op:

"tcp.matmul"(%lhs, %rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

If the matmul dimensions dynamically mismatch, what happens?

I see two solutions:

  1. It is undefined behavior if the shapes mismatch. We can just assume they don’t. Thus tcp.matmul can be considered NoSideEffect

  2. All tensor<...> values in TCP are logically ErrorOr<tensor<...>>, and we need explicit bridging ops that manifest the error at the boundaries of TCP computations. However, we should probably use a an explicit ErrorOr type in that case.

I think we should take option 1., and it is the responsibility of lowering passes to reify the shape checks and ensure that the shapes are compatible. That is, lowering would look something like:

%result = "frontend.matmul"(%unclean_lhs, %unclean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

–>

%lhs, %rhs = "tcp.clean_for_matmul"(%unclean_lhs, %unclean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>)  -> (tensor<?x?xf32>, tensor<?x?xf32>) 
%result = "tcp.matmul"(%lhs, %rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

Note that the same applies for even simple elementwise operations. E.g. tcp.add(%0, %1) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32> needs similar treatment with a corresponding tcp.clean_for_pointwise op.

The tcp.clean_for_* ops potentially “throw an error” and are not marked NoSideEffect. We could then have optimization passes that attempt to hoist the tcp.clean_for_* ops and eliminate redundancy, creating large subgraphs that we can optimize cleanly.

Thoughts?

1 Like

Undefined behavior is a side effect (specifically: it is every possible side effect). As an extreme example, if "tcp.matmul" were NoSideEffect then it should be legal for the optimizer to introduce "tcp.matmul"(%x, %y) for arbitrary %x and %y. However that can introduce UB in a program with no UB.

However, we can say something weaker: a particular instance of "tcp.matmul" is NoSideEffect if its inputs are the result of a "tcp.clean_for_matmul" op.

We have discussed this in the context of the shape dialect work, as well. The idea was to make all assertions explicit and separate (yet not independent) of the shape computations. This has the advantage that you can optimize them separately of the operations and hence might unlock optimization potential.

We also wanted to treat the side-effects of these operations differently. When producing optimized code, it should be legal to reorder the assertions. That will give you programs that are still correct but might fail in interesting ways, as assertions from later in the program might fire before the ones earlier, creating misleading error messages. Hence, in debug mode, one would disallow the reordering.

For optimizations, one probably would want to model this at a lower level of abstraction. So instead of tcp_clean_* spell the constraints out. That can also be done as a lowering step from the tcp_clean_* ops, but then it is not clear to me what their value is.

A constraint language on shapes that can be shared seems like a good ultimate goal for this.

This reminds me a bit of the discussion about explicit vs. implicit broadcasts in HLO.

2 Likes

+1, I think semantics like that make a lot of sense. @River707 do we have a way to model this in the side effects framework? Like a special effect category for this special kind of “reorderable” effect?

Agreed. I used the clean_* ops purely syntactically to get the idea across. The exact final design won’t necessarily look like that.

Thanks Sanjoy, that makes a lot of sense. So the semantics would be:

  1. tcp.matmul has no side effects if its operands satisfy the preconditions established by tcp.clean_for_matmul.
  2. tcp.matmul has undefined behavior otherwise

Thus, it is the responsibility of passes that create tcp.matmul ops to insert the necessary tcp.clean_for_matmul (or some equivalent set of checks establishing the precondition, or not inserting anything if there is special knowledge that the preconditions are satisfied). We can then provide utility passes that optimize the preconditions.

Do folks agree with that basic template for defining TCP ops?

I don’t understand what the proposal is, can you expand with an example?

Note that this is not a strict win for the compiler since reordering means that we can’t use a check to strength reduce later checks.

For instance, say you had the following (I’m assuming checks are all of the form Assert(condition) for simplicity):

%a_witness = Assert(x > 100);
%b_witness = Assert(x > 90 && x < 500);

we cannot optimize the second assert to Assert(x < 500) since the optimizer could (assuming the semantics you suggest) later reorder the second Assert to happen before the first, creating a weaker %b_witness.

I am somewhat unclear what this discussion is really about.

Here are some of my priors in this space, based on linalg:

  1. linalg.generic and all “named ops” have requirements that all sizes match. If they don’t it’s UB and one exposes themselves to out of bounds, shooting themselves in the foot etc. I don’t see any of this as different from a bunch of for + load + store (to which this ends up lowering in many cases) and I don’t see a reason to treat them differently: they have side effects but they can still be DCE’d in the same way that computations into local buffers that don’t escape can be DCE’d.
  2. from the linalg.generic specification is is easy to generate assertions that can either be verified statically when possible or must be computed at runtime.
  3. I don’t see optimization of these shape computations / assertions being something worth considering in the short-term. matmul is O(N^3), runtime shape calculation is O(1). Has this been measured and determined to be worth considering now ? I can’ shake the feeling that this looks like premature optimization to me.

What am I fundamentally missing here?

1 Like

I agree with all your points, and I think it basically plays in nicely with what I’m saying.

I’m just thinking about how we can model things such that linalg/TCP can optimize as it wants to (i.e. UB/“all bets are off”, at the linalg level, if things like dimensions mismatch), yet we can still have the safe program execution that higher levels of abstraction expect. This requires that the lowering reify safety checks (that can possibly be optimized away) to guard any part of the program that uses linalg/TCP.

Happy to expand on this, but the short answer is that yes, these kinds of overheads and error checks are very important, especially as one considers smaller ML models running on edge devices. E.g. a size-128 LSTM running at batch-1; with modern ISA’s, let alone accelerators, the raw MACs for each iteration can be done in 100’s – low-1000’s of cycles. One of the main design theses of IREE is that by surfacing these at compile time we can avoid a lot of the cost.

I believe the OP is an example. Do you have anything else you would like elaboration on?

I don’t think the focus is on optimizing the shape computation here, but handling the error semantics. Without it you can’t really fuse (or expose sequence of computation to Linalg without these error checking “barrier”). For example:

%result0 = "frontend.add"(%unclean_lhs, %unclean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 
%result = "frontend.add"(%result0, %unclean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

If the second add has to error out on shape mismatch, you can’t fuse them until you reason about the shape computation and rework the computation I believe. The code above would be lowered to something like:

%lhs0, %rhs0 = "tcp.clean_for_add"(%unclean_lhs, %unclean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>)  -> (tensor<?x?xf32>, tensor<?x?xf32>) 
%result0 = "tcp.matmul"(%lhs0, %rhs0) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 
%lhs1, %rhs1 = "tcp.clean_for_add"(%result0, %unclean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>)  -> (tensor<?x?xf32>, tensor<?x?xf32>) 
%result0 = "tcp.matmul"(%lhs1, %rhs1) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

The goal is to have the two add together and hoist as early as possible all the error computation, so that you can expose to Linalg a large sequence of operations that fit the “requirements that all sizes match”.

I don’t know how to form it though, it would look something like this:

%lhs0_, %rhs0 = "tcp.clean_for_add"(%unclean_lhs, %unclean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>)  -> (tensor<?x?xf32>, tensor<?x?xf32>) 
%lhs0 %rhs1 = "tcp.clean_for_add"(%lhs0, %unclean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>)  -> (tensor<?x?xf32>, tensor<?x?xf32>) 
%result0 = "tcp.matmul"(%lhs0, %rhs0) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 
%result0 = "tcp.matmul"(%result0, %rhs1) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

For some reason I missed half of the original post and read the whole thread, I just re-read and it is very clear I think, thanks!

2 Likes

I think that when we actually code this up, the clean_for_* will turn into (either being lowered to, or literally start out that way) into open-coded shape calculations. So concretely it will look like:

%result0 = "frontend.add"(%unclean_lhs, %unclean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 
%result = "frontend.add"(%result0, %unclean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> 

will expand first to something like this. Note that it is just the same fully-general pattern inserted twice, once for each frontend.add op.

%ul_s = shape.shape_of %unclean_lhs
%ur_s = shape.shape_of %unclean_rhs
%broadcasted0_s = "shape.broadcast_shape"(%ul_s, %ur_s)
"shape.assert_no_error"(%broadcasted0_s)
%clean_lhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%clean_rhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%result0 = "tcp.add"(%clean_lhs, %clean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
%r0_s = shape.shape_of %result0
%ur2_s = shape.shape_of %unclean_rhs2
%broadcasted_s = "shape.broadcast_shape"(%r0_s, %ur2_s)
"shape.assert_no_error"(%broadcasted_s)
%clean_result0 = "tcp.broadcast_to"(%result0, %broadcasted_s)
%clean_rhs2 = "tcp.broadcast_to"(%unclean_rhs2, %broadcasted_s)
%result = "tcp.add"(%clean_result0, %clean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>

We will then run a pass that uses shape transfer functions on tcp.add and tcp.broadcast_to to RAUW the %result0_s with %broadcasted0_s, breaking the data dependency between the two tcp.broadcast_to/tcp.add ops.

%ul_s = shape.shape_of %unclean_lhs
%ur_s = shape.shape_of %unclean_rhs
%broadcasted0_s = "shape.broadcast_shape"(%ul_s, %ur_s)
"shape.assert_no_error"(%broadcasted0_s)
%clean_lhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%clean_rhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%result0 = "tcp.add"(%clean_lhs, %clean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
%ur2_s = shape.shape_of %unclean_rhs2
%broadcasted_s = "shape.broadcast_shape"(%broadcasted0_s, %ur2_s)
"shape.assert_no_error"(%broadcasted_s)
%clean_result0 = "tcp.broadcast_to"(%result0, %broadcasted_s)
%clean_rhs2 = "tcp.broadcast_to"(%unclean_rhs2, %broadcasted_s)
%result = "tcp.add"(%clean_result0, %clean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>

Then we use:

  • the fact that shape.shape_of and shape.broadcast_shape is NoSideEffects to hoist it
  • the fact that tcp.broadcast_to and tcp.add have UB for the error cases to justify hoisting the second shape.assert_no_error past the first block of tcp.broadcast_to/tcp.add
  • use the UB behavior of tcp.broadcast_to and tcp.add to hoist “tcp.broadcast_to”(%unclean_rhs2, %broadcasted_s)
%ul_s = shape.shape_of %unclean_lhs
%ur_s = shape.shape_of %unclean_rhs
%ur2_s = shape.shape_of %unclean_rhs2
%broadcasted0_s = "shape.broadcast_shape"(%ul_s, %ur_s)
%broadcasted_s = "shape.broadcast_shape"(%broadcasted0_s, %ur2_s)
"shape.assert_no_error"(%broadcasted0_s)
"shape.assert_no_error"(%broadcasted_s)
%clean_lhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%clean_rhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%clean_rhs2 = "tcp.broadcast_to"(%unclean_rhs2, %broadcasted_s)

%result0 = "tcp.add"(%clean_lhs, %clean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
%clean_result0 = "tcp.broadcast_to"(%result0, %broadcasted_s)
%result = "tcp.add"(%clean_result0, %clean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>

Now we have all the tcp.add ops in their own nice little SSA use-def subgraph, except for the intervening tcp.broadcast_to %clean_result0 = "tcp.broadcast_to"(%result0, %broadcasted_s). That one is seems nontrivial to remove. For example, suppose that dynamically the types are:
%unclean_lhs: tensor<1x1>
%unclean_rhs: tensor<1x1>
%unclean_rhs2: tensor<3x3>

Then given the lowering as it stands now, the first tcp.add will return tensor<1x1> and the intervening tcp.broadcast_to will in fact need to dynamically perform a 1x1 → 3x3 broadcast. However, it could also be:

%unclean_lhs: tensor<3x3>
%unclean_rhs: tensor<3x3>
%unclean_rhs2: tensor<3x3>

Then there is no broadcast needed.

@nicolasvasilache how would it be best to model in linalg this possibility of dynamically different broadcasting behavior depending on whether any of the inputs have dimensions which dynamically are of size 1?

1 Like

Is it dynamically different broadcast behavior or is it a dynamic broadcast that you may only have enough information dynamically to elide completely (but is still valid to perform/fuse)?

The question to me is whether you can represent the intervening %clean_result0 statically in a way that is legal for both degenerate and non- degenerate.

If so, I suspect the most valuable mechanism for detecting whether it is safe to elide the dynamic broadcast in real programs is based on whether you can prove somewhere in the dataflow that the potentially degenerate dimensions are dynamically equal.

Yes, you can. But you have to encode the dependency explicitly. What you want to say here is “given the first assert, the second one can be expressed as” which can be encoded as

%a_witness = Assert(x > 100);
%b_witness = Assert(a_witness && x < 500);

This gives you finer granularity.

Thanks for doing a great example!

If this means they have no side-effects (because we allow them to have UB in the error case) then I agree. Compartmentalizing the side effects into separate assertions for error conditions is the enable here.

If it is worthwhile to fuse the two adds one could do multi-versioning at this point and create two versions. If we had a trivially_broadcastable helper (just for the sake of an example) one could do

%ul_s = shape.shape_of %unclean_lhs
%ur_s = shape.shape_of %unclean_rhs
%ur2_s = shape.shape_of %unclean_rhs2
%broadcasted0_s = "shape.broadcast_shape"(%ul_s, %ur_s)
%broadcasted_s = "shape.broadcast_shape"(%broadcasted0_s, %ur2_s)
"shape.assert_no_error"(%broadcasted0_s)
"shape.assert_no_error"(%broadcasted_s)
%clean_lhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%clean_rhs = "tcp.broadcast_to"(%unclean_lhs, %broadcasted0_s)
%clean_rhs2 = "tcp.broadcast_to"(%unclean_rhs2, %broadcasted_s)

// Assuming here that %broadcasted0_s is the shape of %result0.
%pred = "tcp.trivially_broadcastable"(%broadcasted0_s, %broadcasted_s)
loop.if %pred {
  %result0 = "tcp.add"(%clean_lhs, %clean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
  %clean_result0 = "tcp.trivial_broadcast_to"(%result0, %broadcasted_s)
  %result = "tcp.add"(%clean_result0, %clean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
} {
  %result0 = "tcp.add"(%clean_lhs, %clean_rhs) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
  %clean_result0 = "tcp.broadcast_to"(%result0, %broadcasted_s)
  %result = "tcp.add"(%clean_result0, %clean_rhs2) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?
}

Where trivial_broadcast_to would be a broadcast that we can handle in fusion, likely one that does not expand along an axis or maybe the identity broadcast case.

This is assuming that we cannot model the full broadcast or that the full broadcast is less efficient.

After thinking about it over the weekend, one way to look at this is that it is a contstrained form of dynamic ranks. That is, when seeing "frontend.add"(%lhs, %rhs) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32> you really are dealing with one of four computations:

// `&` is "`?` but known to not be of size 1"
"frontend.add"(%lhs, %rhs) : (tensor<&xf32>, tensor<&xf32>) -> tensor<&xf32>
"frontend.add"(%lhs, %rhs) : (tensor<&xf32>, tensor<1xf32>) -> tensor<&xf32>
"frontend.add"(%lhs, %rhs) : (tensor<1xf32>, tensor<&xf32>) -> tensor<&xf32>
"frontend.add"(%lhs, %rhs) : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>

(notice that the “truth table” here is that of logical-OR, if you consider 1 to be False and & to be True; that is if either input extent is & then the output extent is &).

In terms of the loop nest that needs to be emitted, considering the semantics of 1 vs &, then the preceding 4 options are actually identical to the following:

// Increments index by 1 for LHS and RHS on each loop iteration.
"frontend.add"(%lhs, %rhs) : (tensor<&xf32>, tensor<&xf32>) -> tensor<&xf32>
// Increments index by 1 for LHS and 0 for RHS on each loop iteration.
"frontend.add"(%lhs, %rhs) : (tensor<&xf32>, tensor<f32>) -> tensor<&xf32>
// Increments index by 0 for LHS and 1 for RHS on each loop iteration.
"frontend.add"(%lhs, %rhs) : (tensor<f32>, tensor<&xf32>) -> tensor<&xf32>
// Increments index by 0 for LHS and for RHS on each loop iteration (or there can actually be no loops).
"frontend.add"(%lhs, %rhs) : (tensor<f32>, tensor<f32>) -> tensor<f32>

That is, size-1 broadcasting is effectively a limited form of rank-dynamism where ?'s can dynamically end up squeezed out of the dimension list and replaced with rank broadcasting.

Or we can flip this around… we can represent static-rank rank broadcasting by adding extra “known to be size 1” dimensions to lower-rank operands.

The only thing that seems to paper of this is using strided tensors, and setting a stride of 0 for the “size 1” dimensions. But I think strided tensors that dynamically could have stride 0 is unfriendly to optimize for, though I’d like confirmation from experts that this is the case.

As Sanjoy corrected me in the past, UB is “potentially every side effect”. So we cannot simply mark these as NoSideEffect because that would mean that you can arbitrarily insert the ops anywhere in the program. But obviously we cannot insert these ops anywhere in the program since doing so might introduce UB to a program that does not have UB.

So I think we need a category separate from NoSideEffect, perhaps uncreatively named UBOrNoSideEffect. It would still permit many transformations. In particular, UBOrNoSideEffect ops can be freely reordered with each other. Additionally, side-effecting ops can be hoisted above UBOrNoSideEffect, since the side effect from the hoisted op is a valid behavior for a program with UB.

+1

This is indeed a pandora box I would prefer not to open: representing “internal aliasing” in the data type “hides” information that should be visible in the control-flow abstraction into the data type. Yes, it is (somewhat) recoverable but I think it opens up a bunch of corner cases and is generally not well understood in combination with HPC codegen.

I’d characterize it as “PL and library”-level thinking leaking into the compiler world :slight_smile:

I’m not saying it’s impossible to start designing codegen that would also support such cases, more that it displaces problems into a “less known” world (at least less known by me).

I see, so only operations that use the witness are guaranteed to see its effect. Makes sense.

I agree with what @sanjoy_das_google says about UB implying side-effecting, as undefined includes any kind of behavior. What I am wondering about is why do we need the UB here? Do these operations really have undefined behavior or do they potentially produce an undefined result but still remain pure (as in side-effect free)?

Practically, this would mean that we would need to guarantee that the operations do not behave in a perceivable erroneous way (like, triggering a segfault). One way to do this is to prevent them from executing at all when their corresponding guards fail. So we would need to ensure they do not get moved before their guards, either by modelling this with side-effects or by passing explicit witness values.

In the former case, both the guard and computation have a shared side-effect that prevents reordering. For example, they could both have an effect on the correctness state of the program (the guard “writes” it, the computation “reads” it). This is very coarse grained, unless we have a fine granular model for this where the guard only affects some sub-state and might prevent re-ordering of code where only some guards could be moved early. A theoretical example

guard_a
%v1 = op_a
guard_b
%v2 = op_b(%v1, ...)
guard_c
%v3 = op_c(%v1, ...)

If we can combine two guards a and c or prove c to be always true, we would get

guard_a_and_c
%v1 = op_a
guard_b
%v2 = op_b(%v1, ...)
%v3 = op_c(%v1, ...)

yet we would still not be able to move op_c before guard_b.

With explicit witness values, this would all be encoded into data flow, so we would have

%wa = guard_a
%v1 = op_a(%wa)
%wb = guard_b
%v2 = op_b(%wb, %v1, ...)
%wc = guard_c
%v3 = op_c(%wc, %v1, ...)

Proving guard_c to be always true would free op_c to be moved freely.