LHLO is a variant of HLO that appears after buffer allocation. Buffers are allocated externally to the function, so one needs to pass into the function the address of a buffer (that is, a memref) where the result should be stored. Instead of returning the tensor, the function in LHLO will store its contents in the provided buffer. Without this transformation, each function would have to allocate the memory for its results, which is often incompatible with the runtime that manages all memory itself.
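To make the two conventions concrete, here is a minimal Python sketch rather than actual MLIR or XLA code. The names `add_returning` and `add_into` are hypothetical, and `array` objects stand in for tensors and memrefs.

```python
from array import array


def add_returning(lhs, rhs):
    """HLO-like convention: the function allocates its result and returns it."""
    return array("f", (a + b for a, b in zip(lhs, rhs)))


def add_into(lhs, rhs, out):
    """LHLO-like convention: the caller supplies the result buffer.

    Nothing is returned; the result is stored into `out`.
    """
    assert len(lhs) == len(rhs) == len(out), "caller must pass a matching buffer"
    for i in range(len(out)):
        out[i] = lhs[i] + rhs[i]


lhs = array("f", [1.0, 2.0, 3.0])
rhs = array("f", [10.0, 20.0, 30.0])

# The "runtime" owns the allocation; the function only fills the buffer.
out = array("f", [0.0] * len(lhs))
add_into(lhs, rhs, out)
```

Note that in the second form the caller has to know the result size before the call, which is what the follow-up questions are about.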
Thanks Alex, I have another question. If buffers are allocated externally, then we must infer the shapes every time. Maybe we could allocate the memory just before ‘xla_hlo.copy’?
I also have a corner case: the compiled subgraph contains a ‘Unique’ operation. We cannot infer the shape of the Unique output from the input shapes alone; we have to run the operation to learn the output shape. So in this case, we cannot even infer the shape externally.
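A small Python illustration of why the output shape is data dependent. The `unique` helper here is a hypothetical stand-in, not the TensorFlow op: two inputs with the same shape produce outputs of different sizes.

```python
def unique(values):
    """Return the distinct values in first-seen order (stand-in for a Unique op)."""
    return list(dict.fromkeys(values))


a = [1, 1, 1, 1]
b = [1, 2, 3, 4]

# Same input shape, different output shapes: the size depends on the data.
size_a = len(unique(a))
size_b = len(unique(b))
```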
It isn’t clear to me why this is intrinsic: the function could instead allocate its result itself and return it, leaving deallocation up to the caller.
It is true that XLA today relies on static shapes and compile-time buffer allocation, but we want to support more dynamism in the future. I think the current implementation was just an easy way to get things started, but we need to evolve this.
It is not. I’m merely explaining how it works today.
I would find it interesting to explore ownership semantics on types in MLIR, e.g., to make it clear when a function allocated the buffer it returned and expects the caller to manage deallocation as opposed to when it returned the buffer it took as argument.
We are currently revamping the buffer allocation and are splitting it into separate pieces so that users can make these trade-offs themselves. The signature rewriting that is currently hard-coded in the buffer allocation is just one way to do it and we are aware of this.
To support allocating results in the function and passing them out, one would need to rewrite the signature differently, i.e., into a function that returns a memref, and replace the rewrite for return accordingly.
In full generality, it can be fairly complicated and require adapting not just allocation but also synchronization, based on what the invoker needs for maximum efficiency. On the IREE side, we’ve been leaning towards treating the equivalent of what you have now as the “raw” function that users are not expected to invoke directly (but we still export them with a mangled name for the adventurous) and then generating stub functions with specific types and calling conventions for the various ways that callers will typically need to invoke the function (i.e. pre-allocated/checked, internally allocated, semaphore guarded, etc.). We tend towards using higher-level types at these interface boundaries than we use internally (i.e. full shapes, strides, arc/ref types, etc.). It’s still evolving a lot as we try to get the balance right.
In a full system, these stub functions can also do dispatches for layout conversion if necessary and have access to enough of the lower level details to make this very efficient without presenting a complicated public ABI to outside invokers.
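As a rough illustration of the raw-function-plus-stubs idea, here is a Python sketch. It is not IREE code; all names (`_raw_scale`, `scale_preallocated`, `scale_allocating`) are hypothetical, and plain lists stand in for buffers.

```python
def _raw_scale(src, factor, out):
    """"Raw" entry point: no checks, caller-provided buffers only."""
    for i, value in enumerate(src):
        out[i] = value * factor


def scale_preallocated(src, factor, out):
    """Stub for callers that pre-allocate: validates the buffer, then dispatches."""
    if len(out) != len(src):
        raise ValueError("result buffer has the wrong size")
    _raw_scale(src, factor, out)


def scale_allocating(src, factor):
    """Stub for callers that want the function to allocate the result."""
    out = [0.0] * len(src)
    _raw_scale(src, factor, out)
    return out  # ownership passes to the caller
```

Both stubs share the same raw body, so the public calling conventions can vary without duplicating the computation.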
We’ve also toyed with the idea of generating allocation stubs so that pipelined invokers could generate deeper pipelines by having an entrypoint to the function that only does allocation for subsequent invocations. For truly data-dependent cases, pre-allocation in this way would not work, and you may get a pipeline bubble. But cases like unique are interesting because the upper-bound size may be sufficient for these pipeline pre-allocated cases.
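A minimal Python sketch of the upper-bound idea for unique, assuming the caller pre-allocates a buffer as large as the input and the function reports how many elements are valid. The name `unique_into` and the returned count are assumptions made for illustration.

```python
def unique_into(values, out):
    """Write distinct values (first-seen order) into `out`; return the valid count.

    `out` must hold at least len(values) elements, the upper bound on the result.
    """
    if len(out) < len(values):
        raise ValueError("buffer is smaller than the upper bound")
    seen = set()
    count = 0
    for value in values:
        if value not in seen:
            seen.add(value)
            out[count] = value
            count += 1
    return count


values = [3, 1, 3, 2, 1]
out = [0] * len(values)  # pre-allocated at the upper bound, before running
n = unique_into(values, out)
result = out[:n]  # only the first n elements are meaningful
```

The trade-off is wasted memory when the actual result is much smaller than the bound.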
Explorations into any of that would need quite a few more opinions taken (and types needed) vs what is in core, and it pushes you a lot closer to territory that is usually deemed to belong to whatever runtime the compiler is generating code for (which is why we’re exploring it in IREE, where we have a target and have insights into what opinions will be necessary to make it efficient).
Thanks Stephan, could you share some examples or some pseudocode?
I think this involves two things:
The first is where we allocate the buffer. If it is allocated in TensorFlow, then we should pass the buffer as an argument; if it is allocated in MLIR, then we need to consider the second point below.
The second is how to allocate the result buffer in MLIR. Maybe we should have a unified allocator for TensorFlow and MLIR. This could be solved by implementing a new alloc pass that, for example, lets TensorFlow and the MLIR AllocOp use the same allocator. This is important because we need to free the result buffer outside of MLIR.
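The shared-allocator idea could look roughly like the following Python sketch. `TrackingAllocator` and `compiled_negate` are hypothetical names, not TensorFlow or MLIR APIs. The point is only that the compiled function allocates through the same allocator object the caller later uses to free the result.

```python
class TrackingAllocator:
    """Toy allocator shared by the caller and the compiled code."""

    def __init__(self):
        self._live = {}
        self._next_handle = 0

    def alloc(self, size):
        handle = self._next_handle
        self._next_handle += 1
        self._live[handle] = [0.0] * size
        return handle

    def buffer(self, handle):
        return self._live[handle]

    def free(self, handle):
        del self._live[handle]

    def live_count(self):
        return len(self._live)


def compiled_negate(allocator, src):
    """Stand-in for compiled code: allocates its own result and returns a handle."""
    handle = allocator.alloc(len(src))
    out = allocator.buffer(handle)
    for i, value in enumerate(src):
        out[i] = -value
    return handle


allocator = TrackingAllocator()
handle = compiled_negate(allocator, [1.0, -2.0])
result = list(allocator.buffer(handle))
allocator.free(handle)  # freed by the caller, outside the compiled function
```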