This is RFC to introduce an interface that allows for tiling + distribution of a class of operations that do not naturally fall within the Structured Op interface, but still have semantics that do allow them to be tiled and distributed (and fused in a manner similar to tile + fuse works on Structured Ops). The Linalg tiling algorithm uses the indexing maps, i.e. affine maps that describe the access pattern from the iteration space of the operation to data space of each individual operand, that allow Structured ops to
- Compute the loop bounds based on the size of the operands
- Allows description of a tiling algorithm (and by extension tile + fuse) that is purely based on the affine maps (and their inversion). The tiling algorithm itself uses the indexing maps to
- Compute the tile of the iteration space that is required to compute a tile of the result of an operation
- Given the tile of the iteration space, compute the tile of the operands needed to compute the tile of the result of the operation
- If the producer of an operand is also a Structured op, the same algorithm can be used to compute a tile of the operand in place by tiling the producer.
Many operations do not have an easy way of specifying the indexing maps, or at least and indexing map that is invertible. In absence of this, the rest of the document describes an interface that allows each operation to specify the information needed by the tiling algorithm to implement tiling in a manner similar to Structured ops.
The interface described here only deal with generating a tiled implementation of the operation, but the same should be usable for tile + fuse but is not explicitly part of this RFC. The tiling algorithm is not described in detail here. It has been prototyped within IREE here and this RFC is to start the process of upstreaming this interface to the MLIR code base. I will also be sending out the patch to move this upstream in a few days.
The algorithm for tiling itself is fairly straight-forward. Without describing the algorithm itself, the rest of this RFC describes the interface and shows the IR that would be generated for some operations that do not fit into the Structured op criteria, i.e.
- scatter
- gather
- pad
Similar to Structured operations, the operations here can have either tensor
operands (i.e. has tensor semantics) or memref
operands (i.e. has buffer semantics). The interface is intended to support tiling and distribution of both. Note that there is already a linalg.pad_tensor
operation which can be tiled and fused (even though the operation is not a Structured operation). Here it is used for illustration purposes, and the current implementation can be adapted to use this interface.
Disclaimer: This is very much an experimental interface, and is expected to evolve over time. This is being upstreamed so that different teams working on similar things can collaborate more effectively.
Intererface methods
SmallVector<StringRef> getLoopIteratorTypes();
This method allows the operation to specify the number of (mutually independent) loops that are required to implement the operation, and the dependence of each loop. This is similar to the iterator_types
of Structured operations.
SmallVector<Range> getLoopBounds(OpBuilder &b)
This method returns the SSA values the [lower_bound
, upper_bound
, step
] of each of the loops of the operation. The size of this list must be same as the size of the list returned by getLoopIteratorTypes()
LogicalResult getTiledImplementation(OpBuilder &b,
ValueRange dest, ValueRange offset, ValueRange tileSizes,
SmallVectorImpl<Value> &tiledResults,
SmallVectorImpl<SmallVector<Value>> &resultOffsets,
SmallVectorImpl<SmallVector<Value>> &resultSizes);
The algorithm used to the tile the operations is expected to use the first two methods to generate the inter-tile loops. Note for the initial implementation, a simplifying assumption is made that only the "parallel"
loops are tiled and distributed. For the body of the tiled operation, the above method will be invoked.
Here
-
dest
is theValue
s into which the result of the tiled operation needs to be computed into. When the operation has tensor semantics, these are the values into which the results of the tiled operations are inserted into. When the operation has buffer semantics, these are the values into which the results of the tiled operations are written into using in-place update. -
offsets
are theValue
s that specifies the offsets along each dimension of the iteration space that locate the tile. -
tileSizes
are theValue
s that specifies the sizes of along each dimension of the iteration space that define the tile. -
tiledResults
, when the tiling is performed on an operation with tensor semantics, returns theValue
that represents the tile of the result that is computed by the tiled operation. This can be left empty when tiling is performed on an operation with buffer semantics. -
resultOffsets
, when the tiling is performed on an operation with buffer semantics, returns theoffsets
for each result that locate the tile of the result produced by the tiled operation within the final result tensor. This can be left empty when tiling is performed on an operation with buffer semantics since the update happens in-place -
resultSizes
, when the tiling is performed on an operation with buffer semantics, returns thesizes
for each result specifies the size of the tile of the result produced by the tiled operation. This can be left empty when tiling is performed on an operation with buffer semantics since the update happens in-place
Using these interface methods, below are example IR of scatter, gather and pad operations after tiling
Scatter Operation
The following is a sample scatter
operation (this operation does not exist in MLIR and is only for illustrative purposes).
func @scatter(
%updates: tensor<?xi32>, %indices: tensor<?xindex>, %dest: tensor<?xi32>)
-> tensor<?xi32> {
%result = scatter
ins(%updates, %indices : tensor<?xi32>, tensor<?xindex>)
outs(%dest : tensor<?xi32>) -> tensor<?xi32>
return %result : tensor<?xi32>
}
Values from %update
are written into %dest
at locations specified by %indices
. %update
and %indices
have the same shape.
The loop iterators types for this op is {"parallel"}
The loop bounds of this op is {[0, tensor.dim(%indices, 0), 1]}
The tiled implementation generated would be
func @scatter_update_slice(
%updates: tensor<?xi32>, %indices: tensor<?xindex>, %dest: tensor<?xi32>)
-> tensor<?xi32> {
%c0 = constant 0 : index
%ub = tensor.dim %indices, %c0 : tensor<?x?xi32>
%ds = tensor.dim %dest, %c0 : tensor<?x?xi32>
%ts = “get_tilesize_op”(%c0) : (index) -> index
%result = linalg.tiled_loop (%iv) = (%c0) to (%ub) step (%ts)
ins(%arg0 = %updates, %arg1 = %indices : tensor<?xi32>, tensor<?xindex>)
outs(%arg2 = %dest : tensor<?xi32>) -> tensor<?xi32> {
%update_tile = tensor.extract_slice %arg0[%iv] [%ts] [1] :
tensor<?xi32> into tensor<?xi32>
%indices_tile = tensor.extract_slice %arg1[%iv] [%ts] [1] :
tensor<?xi32> into tensor<?xi32>
%scatter_tile = scatter
ins(%update_tile, %indices_tile : tensor<?xi32>, tensor<?xindex>)
outs(%arg2 : tensor<?xi32>) -> tensor<?xi32>
%yield = tensor.insert_slice %scatter_tile into %arg2[0] [%ds] [1] : tensor<?xi32> into tensor<?xi32>
linalg.yield %yield : tensor<?xf32>
}
return %result : tensor<?xi32>
}
The invocation of getTiledImplementation
for this op would
- specify
offsets
to be{%iv}
- specify
tileSizes
to be{%ts}
On return from this call, the interface implementation would populate
-
tiledResults
with{%scatter_tile}
-
resultOffsets
with{{%c0}}
-
resultSizes
with{{%ds}}
Note that the size of the result tile produced is same size as the size of %dest
tensor of the untiled op. This is assuming that the semantics of the scatter operation disallows multiple updates to the same location using %indices
. This is implied by the iterator types being specified as {"parallel"}
.
Gather operation
The inverse of the scatter operation is a gather operation.
func @gather(%input: tensor<?xf32>, %indices: tensor<?xindex>) -> tensor<?xf32> {
%c0 = constant 0 : index
%d0 = tensor.dim %indices, %c0 : tensor<?xi32>
%init = linalg.init_tensor [%d0] : tensor<?xf32>
%result = gather
ins(%input, %indices : tensor<?xf32>, tensor<?xindex>)
outs(%init : tensor<?xf32>) -> tensor<?xf32>
return %result : tensor<?xf32>
}
The loop iterators types for this op is {"parallel"}
The loop bounds of this op is {[0, tensor.dim(%indices, 0), 1]}
The tiled implementation generated would be
func @gather(%input: tensor<?xf32>, %indices: tensor<?xindex>) -> tensor<?xf32> {
%c0 = constant 0 : index
%d0 = tensor.dim %indices, %c0 : tensor<?xindex>
%ts = “get_tilesize_op”(%c0) : (index) -> index
%init = linalg.init_tensor [%d0] : tensor<?xf32>
%result = linalg.tiled_loop (%iv) = (%c0) to (%d0) step (%ts)
ins(%arg0 = %input, %arg1 = %indices : tensor<?xf32>, tensor<?xindex>)
outs(%arg2 = %init : tensor<?xf32>) -> tensor<?xf32> {
%init = linalg.init_tensor [%ts] : tensor<?xf32>
%indices_tile = tensor.extract_slice %arg1[%iv] [%ts] [1] :
tensor<?xi32> into tensor<?xf32>
%result_tile = gather
ins(%arg0, %indices_tile : tensor<?xf32>, tensor<?xindex>)
outs(%init : tensor<?xf32>) -> tensor<?xf32>
%result = tensor.insert_slice %result_tile into %arg2[%iv] [%ts] [1] :
tensor<?xf32> into tensor<?xf32>
linalg.yield %result : tensor<?xf32>
}
return %result : tensor<?xf32>
}
The invocation of getTiledImplementation
for this op would
- specify
offsets
to be{%iv}
- specify
tileSizes
to be{%ts}
On return from this call, the interface implementation would populate
-
tiledResults
with{%gather_tile}
-
resultOffsets
with{{%iv}}
-
resultSizes
with{{%ts}}
Note that converse to the scatter, the gather access all of the %input
for every tile of the operation. The is not really overly arduous. The eventual generated code would just read from the location of %input
needed to compute the result within each tile, i.e. there is no actual copy or redundant reads of the %input
. The entire slice is extract for correct representation of the computation.
Pad operation
func @pad(
%input: tensor<?x?xf32>, %low0: index, %low1 : index, %high0 : index,
%high1 : index, %pad_value : f32) -> tensor<?x?xf32> {
// Has two loops that are [“parallel”, “parallel”]
// Loop bounds are [0, memref.dim(%result, 0)), [0, memref.dim(%result, 1))
%result = pad %input, %pad_value,
low [%low0, %low1], high [%high0, %high1] :
tensor<?x?xf32> to tensor<?x?xf32>
return %result : tensor<?x?xf32>
}
The loop iterators types for this op is {"parallel", "parallel"}
The loop bounds of this op is {[0, tensor.dim(%result, 0), 1], [0, tensor.dim(%result, 1), 1]}
. Note that this could be re-rewritten in terms of shape of %input
The tiled implementation generated would be
func @pad(
%input: tensor<?x?xf32>, %low0: index, %low1 : index, %high0 : index,
%high1 : index, %pad_value : f32) -> tensor<?x?xf32> {
%d0 = … : index // evaluates to memref.dim(%result, 0)
%d1 = … : index // evaluates to memref.dim(%result, 1)
%in_d0 = memref.dim %input, %c0 : index
%in_d1 = memref.dim %input, %c1 : index
%init = linalg.init_tensor [%d0, %d1] : tensor<?x?xf32>
%result = linalg.tiled_loop (%iv0, %iv1) =
(%c0, %c0) to (%d0, %d1) step (%ts1, %ts2)
ins(%arg0 = %input : tensor<?x?xf32>)
outs(%arg1 = %init : tensor<?x?xf32>) {
%init_tile = linalg.init_tensor[%ts1, %ts2] : tensor<?x?xf32>
// The next few lines represent the "scalar" implementation of pad operation. You can also use a
// represent this computation using "slices" as done here :
// https://github.com/llvm/llvm-project/blob/main/mlir/test/Dialect/Linalg/tile-pad-tensor-op.mlir
%result_tile = scf.for %iv2 = %c0 to %ts1 step %c1 init(%arg2 = %init_tile) {
%y0 = scf.for %iv3 = %c0 to %jt step %c1 init(%arg4 = %arg2) {
%cond = call @is_in_bounds(...)
%4 = scf.if %cond {
%id0 = subi %iv2, %low0 : index
%id1 = subi %iv3, %low1 : index
%5 = tensor.extract %arg0[%id0, %id1] : tensor<?x?xf32>
scf.yield %5 : f32
} else {
scf.yield %pad_value : f32
} -> f32
%5 = tensor.insert %4 into %arg4[%iv2, %iv3] : tensor<?x?xf32>
scf.yield %5 : tensor<?x?xf32>
}
scf.yield %y0 : tensor<?x?xf32>
}
%result =
subtensor_insert %result_tile into %arg1[%iv0, %iv1] [%ts1, %ts2] [1, 1] :
tensor<?x?xf32> into tensor<?x?xf32>
linalg.yield %result : tensor<?x?xf32>
}
return %result : tensor<?x?xf32>
}
Future enhancements
(a.k.a. things not part of the initial RFC)
-
While fusion of such operations in a manner similar to tile + fuse of Structured ops is a prime motivation factor, the initial RFC does not cover these aspects. Some rough ideas on this
- The tiled implementations are expected to generated
tensor.extract_slice
s operations to get the slice of the operands needed for the tiled operations. These give theoffsets
andsizes
needed to further tile the produces “in-place” using the same interface above, similar to how tile + fuse works on Structured operations. - For examples like
gather
above, thetensor.extract_slice
is the entire input for every tile. If fusing through that using that directly is wasteful. Instead you would need some pre-preocessing to compute the “locations” of the%input
needed to compute the tile of the gather. This is now approaching inspector-executor techniques and other techniques from sparse computations that are used for such purposes. Those could be leveraged here without any fundamental changes to the interface. (HT @herhut for discussions on this)
- The tiled implementations are expected to generated
-
The current RFC does not deal with tiling and distribution of iterator types that are not
"parallel"
. There is another RFC that is addressing a similar shortcoming of the tiling + distribution of Structured operations. Extensions to this interface should be able to achieve the same for operations that are not structured operations.