Is the TCP "matmul" op marked NoSideEffect?

By “any side effect” I mean “any side effect of the compiler’s choosing”. This is why we say (for instance) that if a program has UB then any behavior is valid behavior.

I don’t think this breaks reordering though, as long as we don’t introduce UB to a program that doesn’t have UB.

Just so we’re on the same page, even if we use guards with witness values, matmul still could not be tagged as side-effect free.

For instance, a “malicious” optimizer could translate:

%w = unrelated_witness

to

%w = unrelated_witness
matmul(%a, %b, %w) // segfaulting matmul, %w has nothing to do with %a and %b

if matmul were marked as side-effect free.

How do we model this in LLVM, e.g. dividing by zero with the udiv instruction? Where in LLVM is the code/predicate/property that allows us to insert an add (with fully defined wrapping) at an arbitrary place but not a udiv? Is it just llvm::isSafeToSpeculativelyExecute, with the burden on pass authors to make sure they remember to call it?
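To illustrate the kind of check in question (isSafeToSpeculativelyExecute is the real LLVM function, declared in llvm/Analysis/ValueTracking.h; the wrapper around it is just a sketch):

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Sketch of the check a hoisting/speculation pass is expected to make.
// A wrapping `add` passes it; a `udiv` whose divisor is not known to be
// non-zero does not, because speculating it could introduce a trap that the
// original program never executed.
bool okToInsertOrHoistHere(Instruction &I) {
  return isSafeToSpeculativelyExecute(&I);
}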

Btw, it seems to me that we currently model llvm.udiv in the LLVM dialect incorrectly. It is currently marked NoSideEffect:

LLVMOps.td

class LLVM_ArithmeticOp<string mnemonic, string builderFunc,
                        list<OpTrait> traits = []> :
    // !!!!! Incorrectly marked NoSideEffect !!!!!
    LLVM_OneResultOp<mnemonic,
           !listconcat([NoSideEffect, SameOperandsAndResultType], traits)>,
    ... {
...
}
...
def LLVM_UDivOp : LLVM_ArithmeticOp<"udiv", "CreateUDiv">;

I believe so.

Only if you assume that UB is a side effect, and I am not convinced by that. If we were looking at LLVM, for example, we could also have a “safe to speculate” trait/interface independent of side effects.

I think we’re subtly talking about two different things. It will take a blog post (or “one pager” :slight_smile: ) to elaborate, but roughly:

  1. I’m not saying “UB is a side effect”. That would indeed be a problem.
  2. Programs can have multiple legal behaviors and the compiler can arbitrarily “choose” a particular behavior out of this set. In LLVM the canonical example is undef: print((i8)undef) has the following set of behaviors {"prints 0", "prints 1", ..., "prints 0xff"}. This choice is usually called “refinement”. There is some more discussion about refinement here.
  3. UB can be modeled as an extreme version of such an operation with multiple possible behaviors: it can have any possible behavior. So *(int*)0 has the infinite set of behaviors {"prints 0", "prints 1", ..., "formats hard drive", "invokes nasal demons"}. The compiler is free to choose any such behavior to “implement” the UB, and since this set contains all possible behaviors, the compiler cannot choose incorrectly.

NoSideEffect means “it is guaranteed to have no side effect”. If there is the possibility of UB, then that statement is not true, since UB can result in a side effect. To break this down into unambiguous logical implications:

(UB is possible when executing FooOp && UB can trigger a side effect)
==> the statement “executing FooOp is guaranteed to have no side effect” is false
==> FooOp cannot be marked NoSideEffect

This does not involve claiming that “UB is a side effect”.

As I understand it, UB is just not a valid execution of the program. So I have a hard time following your reasoning when it starts with “UB is possible when executing FooOp”, at least in the usual sense of UB we manipulate when coming from C/C++.

If I make a parallel with C again, my claim is that any call to this function: int foo(int i) { return i + 1; } does not involve a side effect. There are calls to this function that are UB (signed overflow), but they just aren’t valid program executions; I don’t consider them as “having a side effect”.
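Spelled out as a sketch (the INT_MAX call is hypothetical, just to make the point concrete):

#include <climits>

int foo(int i) { return i + 1; }   // never performs a side effect

void demo() {
  int a = foo(1);        // well-defined
  int b = foo(INT_MAX);  // UB (signed overflow), but not a "side effect":
                         // this call is simply not a valid program execution
  (void)a;
  (void)b;
}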

This is why LLVM has the llvm::isSafeToSpeculativelyExecute check I believe.

Okay, it sounded like we were using a definition of NoSideEffect that implied isSafeToSpeculativelyExecute (at least, @sanjoy_das_google’s examples of inserting erroneous operations seemed to imply that). Is this the behavior we want?

It would seem then that the canBeHoisted function in LoopInvariantCodeMotion is buggy under “NoSideEffect does not necessarily imply isSafeToSpeculativelyExecute”.
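To make the distinction concrete, here is a rough sketch of the two separate questions a hoisting decision has to answer; the helper names below are hypothetical stand-ins, not the actual canBeHoisted implementation:

#include "mlir/IR/Operation.h"

// Hypothetical predicates, standing in for "NoSideEffect" and an
// "isSafeToSpeculativelyExecute"-style check; these are not real MLIR APIs.
bool hasNoObservableSideEffects(mlir::Operation *op);
bool isSafeToSpeculate(mlir::Operation *op);

bool canHoistOutOfLoop(mlir::Operation *op) {
  // 1. Moving the op must not reorder observable effects (stores, I/O, ...).
  if (!hasNoObservableSideEffects(op))
    return false;
  // 2. The op may now run on iterations/paths the original program never
  //    executed, so that execution must not introduce UB or a trap
  //    (udiv by zero, a failed guard, ...).
  if (!isSafeToSpeculate(op))
    return false;
  return true;
}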

@River707 does NoSideEffect imply isSafeToSpeculativelyExecute?

I actually think this problem goes deeper. Consider

func @f(%arg0: tensor<?x1xf32>, %arg1: tensor<1x?xf32>) -> tensor<?x?xf32> {
  %0 = "tcp.matmul"(%arg0, %arg1) : (tensor<?x1xf32>, tensor<1x?xf32>) -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}

It’s trivially easy for a user to pass in dynamic arguments of size 1M, which results in a 1M*1M == 1T element output, likely to OOM any end system. So when expanding this tcp.matmul op into “alloc_result_memref; matmul_on_memrefs”, the “alloc_result_memref” could fail. The end-user API is unlikely to allow UB in this case. The implication is that tcp.matmul still needs to encapsulate some sort of defined error behavior. The compilers I’ve seen that handle the simpler fully static shape case don’t have to deal with this because they compute a static memory plan ahead of time, and errors arise deterministically at the initial “allocate all the memory for the model” step (possibly spuriously, e.g. if the program contains if dynamically_false(): do_useless_matmul_that_results_in_ridiculously_large_output()).
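(For scale: a 1M × 1M output of f32 values is 10^12 elements, i.e. roughly 4 TB, far more memory than typical accelerators or hosts have.)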

Sadly, my background is perhaps too heavily on the C/C++ side of the world, which leans heavily on UB for these cases; most of our frontends, when talking about tensor operations, don’t have that.

Would love to hear from folks with more experience with safe languages (@sanjoy_das_google, @rjmccall ?). E.g. how do Java/Swift/etc. model the notion of reporting “required errors” for even “basic” ops? Surely some amount of reordering is allowed. E.g. does an overflow check prevent hoisting an integer operation past a print? Does a potential OOM prevent hoisting any kind of allocation past a print?

Out-of-bounds and null-pointer exceptions are “strict” and cannot be reordered. But the JVM spec also defines “virtual machine errors” that “may be thrown at any time during the operation of the Java Virtual Machine”. That still isn’t UB, though.

Why is this a problem? Won’t we have to have error semantics for mismatched shapes anyway?

You generally shouldn’t hoist allocations unless they’re “provably” not going to fail, at least within some reasonable model.

The mismatched shape checks can be reified in IR and separated. But having a “tcp.matmul” op that takes tensors and returns a tensor intrinsically hides an allocation of the result (which might fail). I can’t think of any predicate we can reasonably reify and separate from the tcp.matmul op that would guarantee the allocation of its result cannot fail.

I see, that’s a good catch! For now I have an intellectually unsatisfying solution: let’s not provide any hard guarantees around not introducing OOMs; speculating a matmul that causes an OOM that would not have otherwise happened is a QoI issue, not a correctness issue. This is probably fine since we almost never want to (say) launch a non-trivial CUDA kernel that would not have otherwise been launched.

Providing such a guarantee on maximum memory usage would be really hard: any change in the schedule of these operations could break it by changing the live ranges of the tensors.

If this becomes a real problem, we might have to go TF’s way: don’t force a total order on all operations, and make alloc an async operation that can complete at some future time when more memory is available.

The shape checks are a correctness issue for the matmul operation and are a required part of its semantics. Memory allocation failures on the other hand are not intrinsic to the matmul operation (it does not even say anything about allocation) but are an implementation detail of the machine that executes them. The failure is also not a local property of the operation but a global property of the executing machine.

Hence, I would argue that we should not model allocation failures as operation-dependent errors but as non-recoverable (or, with @sanjoy_das_google’s suggestion of async alloc, maybe recoverable). Also, the system cannot make any promises about memory use other than trying to optimize such that the program gets executed. In particular, we should not codify the memory semantics of operations on values.


I generally agree with you, but just to push back on this a bit: machine learning systems are different from more traditional virtual machines in this respect. For a standard JVM (say) there isn’t a big difference between using 16G vs 16.01G, but for an ML model running on an accelerator this can make the difference between the model running or not running (or running very slowly because we’re swapping between host and GPU). So I think we need to do one of two things (or both):

  1. Have some guarantees around memory usage. This may be semantically modeled in the IR somehow (which would be unusual, as you suggest) or could be an informal “handshake agreement” between the various MLIR passes.
  2. Make the performance cliff from using more memory than the accelerator has less steep. Perhaps a sufficiently smart rematerialization pass that also knows to swap to the host when needed will be enough.

I love your framing here Stephan. In the C analogy that I’m more familiar with (and other folks here might enjoy), the C standard describes an abstract machine that does not really have a notion of running out of memory. For example, a C program can in the abstract machine create an arbitrary amount of live memory on the stack with a simple recursive routine, but implementations obviously break down at some point in an unspecified machine-specific way.

Actually, even on CPU if you go over the limit and end up swapping to disk your performance can fall off a cliff (or you can even get killed by the OOM killer). Computer systems have grown though and so this is less of an issue for typical programs these days, to the point that we often don’t even consider it.

I believe that ML systems will end up in the same scenario, though of course a subset of power users will know how to push their systems to the limit and understand the consequences; we should assume power users are only a relatively small subset of users (otherwise, I argue, we haven’t built a very usable system).

Mehdi’s example of live ranges shows that it can be NP-hard even in the simplest case to provide true guarantees. And throw in a little bit of dynamicity, control flow, etc. and it rapidly becomes undecidable what can trigger memory exhaustion, to the point that even the simplest form of CSE can provide no real guarantees. Therefore, designing our systems to respect some semantically load-bearing notion of guaranteed memory usage seems like it will be a) futile and b) overconstraining.

So yeah, this isn’t something that falls only on us in the compiler to solve. It’s up to hardware, runtime, etc. folks in collaboration with us to make a usable practical system.

I would argue C++ compilers already do a version of (1), where they implicitly guarantee that they won’t make memory consumption much worse. So spilling registers to increase the size of a stack frame is OK (since that typically only “slightly” increases physical memory usage), but reordering “free(some_ptr); malloc(4096);” into “malloc(4096); free(some_ptr);” is not OK, even if it is technically allowed by the spec.
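A tiny sketch of why that reordering is a problem (the buffer sizes are made up):

#include <cstdlib>

// Before: the big buffer is released before the small one is requested, so at
// most one of the two is live at any point in time.
void before(void *big_buffer /* say, 1 MiB */) {
  std::free(big_buffer);
  void *small = std::malloc(4096);
  std::free(small);
}

// After the "technically legal" reordering: both buffers are briefly live at
// once, so peak memory usage grows by the size of the big buffer.
void after(void *big_buffer /* say, 1 MiB */) {
  void *small = std::malloc(4096);
  std::free(big_buffer);
  std::free(small);
}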

Agreed, but this collaboration does not mean we get to ignore the issue in our design. :slight_smile: