[RFC] Changes to linalg::TiledLoopOp to unblock reductions.
Background
linalg.tiled_loop is an operation used for data tiling of linalg ops that enables fusion, tiling, padding, distribution and other transformations on tensors. In the original RFC it was proposed to explicitly specify subtensor/tensor.extract_slice and subtensor_insert/tensor.insert_slice operations. That way the body of the tiled_loop is structured as get tiles / compute / insert tiles back.
%sum = linalg.tiled_loop (%i) = (%c0) to (%size_0) step (%c10)
ins (%in_ = %in: tensor<100xf32>)
outs (%out_ = %out: tensor<100xf32>)
iterator_types ("parallel") {
// Extract tiles.
%in_sub = tensor.extract_slice %in_[%i][%c10][%c1]
%out_sub = tensor.extract_slice %out_[%i][%c10][%c1]
// Actual computation.
%add_sub = linalg.generic {
indexing_maps = [affine_map<(d0) -> (d0)>],
iterator_types = ["parallel"]}
ins(%in_sub: tensor<10xf32>)
outs(%out_sub: tensor<10xf32>) {
^bb0(%in_elem: f32, %out_elem: f32):
linalg.yield %in_elem : f32
} -> tensor<10xf32>
// Insert tiles.
%insert_result = tensor.insert_slice %add_sub into %out_[%i][%c10][%c1]
// Yield the whole tensor after destructive update.
linalg.yield %insert_result : tensor<100xf32>
}
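To make the destructive-update pattern concrete, here is a hypothetical Python sketch of what the loop above does (the function name and the copy computation are illustrative, not part of the proposal):

```python
def tiled_copy(in_t, out_t, size=100, step=10):
    # Sketch of the tiled_loop above. Each iteration extracts tiles,
    # computes on them, inserts the result tile back, and "yields"
    # the whole updated tensor (the destructive-update pattern).
    for i in range(0, size, step):
        in_sub = in_t[i:i + step]      # tensor.extract_slice
        add_sub = list(in_sub)         # the linalg.generic (a copy)
        # tensor.insert_slice + linalg.yield of the FULL tensor:
        out_t = out_t[:i] + add_sub + out_t[i + step:]
    return out_t

print(tiled_copy(list(range(100)), [0] * 100)[:5])
```

Note that each iteration yields a value of the full tensor's shape, which is exactly the first disadvantage listed below.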
There are several disadvantages to this approach:

- linalg.yield returns the whole tensor after the destructive update, not the computed tile, i.e. the yielded value has the type of the original tensor, not of the tile.
- There is no perfect nesting of linalg.tiled_loop operations.
- It is not clear how to support reductions.
Analogy to linalg.generic
linalg.generic contains a region that represents a scalar computation on elements of its ins and outs tensor/memref arguments. The exact indices of the elements are specified by the indexing affine maps. Reductions are implicit. There are two indicators of whether an instance of linalg.generic is a reduction or not. First, one has to look for a "reduction" iterator type. Second, there should be a read of the output element.
#id = affine_map<(d0) -> (d0)>
#red = affine_map<(d0) -> ()>
%sum = linalg.generic {indexing_maps = [#id, #red],
iterator_types = ["reduction"]}
ins(%in: tensor<?xf32>)
outs(%out: tensor<f32>) {
^bb0(%in_elem: f32, %out_elem: f32):
%0 = addf %in_elem, %out_elem : f32
linalg.yield %0 : f32
} -> tensor<f32>
So, linalg.generic
is a loop that specifies how to get elements from the tensors/memrefs and what to do with them. When lowered to SCF or Affine loops, the necessary load operations are inserted using the affine maps for both input and output arguments and linalg.yield
op results in a store operation of the computed values to the SAME output locations.
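As an illustration (a hand-written sketch, not actual compiler output), the reduction generic above lowers to a loop along these lines:

```python
def lowered_reduction(in_t, out_t):
    # Sketch of the SCF lowering of the linalg.generic reduction:
    # each iteration loads the input element, loads the current value
    # at the SAME output location, combines them, and stores back.
    for i in range(len(in_t)):
        in_elem = in_t[i]        # load from the input via its map
        out_elem = out_t[0]      # the implicit read of the output
        out_t[0] = in_elem + out_elem  # linalg.yield -> store
    return out_t
```

Here out_t models the 0-d output tensor as a one-element buffer; the load of out_elem is the "read of the output element" that marks the op as a reduction.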
linalg.tiled_loop can be remodeled to become consistent with the structure of linalg.generic. It can be thought of as a linalg.generic on non-scalar subsets of the inputs and outputs. Instead of indexing affine maps, it has tensor.extract_slice operations to specify the tiles. It can have a terminator that yields the computed tile for some tile of the output. That would make tensor.insert_slice implicit.
Remark: later, we might be able to introduce operations other than tensor.extract_slice to specify the partitioning of the arguments.
Proposed changes
Introduce linalg.tiled_yield
that takes two groups of SSA values. The first group is the computed results, the second one is the "destination" tiles of the output.
Transpose example
#id = affine_map<(d0, d1) -> (d0, d1)>
#tr = affine_map<(d0, d1) -> (d1, d0)>
%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
step (%c10, %c10)
ins (%in_ = %in: tensor<100x100xf32>)
outs (%out_ = %out: tensor<100x100xf32>)
iterator_types ("parallel", "parallel") {
%in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
%out_sub = tensor.extract_slice %out_[%j, %i][%c10, %c10][%c1, %c1]
%transpose_sub = linalg.generic {
indexing_maps = [#id, #tr],
iterator_types = ["parallel", "parallel"]}
ins(%in_sub: tensor<10x10xf32>)
outs(%out_sub: tensor<10x10xf32>) {
^bb0(%in_elem: f32, %out_elem: f32):
linalg.yield %in_elem : f32
} -> tensor<10x10xf32>
linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>
}
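A hypothetical Python model of the proposed semantics may help (matrix size, tile size and names are illustrative): each iteration computes a transposed tile, and yielding it "in %out_sub" is what makes the tensor.insert_slice implicit:

```python
def tiled_transpose(in_t, n, step):
    # Sketch of the tiled_loop above on an n x n matrix (list of rows).
    out_t = [[0] * n for _ in range(n)]
    for i in range(0, n, step):
        for j in range(0, n, step):
            # %in_sub = in[i:i+step, j:j+step] (tensor.extract_slice)
            in_sub = [row[j:j + step] for row in in_t[i:i + step]]
            # linalg.generic with the #tr map: transpose the tile.
            transpose_sub = [[in_sub[a][b] for a in range(step)]
                             for b in range(step)]
            # linalg.tiled_yield %transpose_sub in %out_sub:
            # the tile lands in out[j:j+step, i:i+step] implicitly.
            for b in range(step):
                out_t[j + b][i:i + step] = transpose_sub[b]
    return out_t
```

The loop body never materializes the full output tensor; only tiles flow through it, which restores perfect nesting and tile-typed yields.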
Reduction example
#id = affine_map<(d0) -> (d0)>
%sum = linalg.tiled_loop (%i) = (%c0) to (%size) step (%c10)
ins (%in_ = %in: tensor<100xf32>)
outs (%out_ = %out: tensor<f32>)
iterator_types ("reduction") {
%in_sub = tensor.extract_slice %in_[%i][%c10][%c1]
%sub_sum = linalg.generic {
indexing_maps = [#id, affine_map<(d0) -> ()>],
iterator_types = ["reduction"]}
ins(%in_sub: tensor<10xf32>)
outs(%out_: tensor<f32>) {
^bb0(%in_elem: f32, %out_elem: f32):
%0 = addf %in_elem, %out_elem : f32
linalg.yield %0 : f32
} -> tensor<f32>
linalg.tiled_yield %sub_sum in %out_ : tensor<f32>
}
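In Python terms (an illustrative sketch, with names made up for exposition), the proposed reduction reads:

```python
def tiled_sum(in_t, size, step):
    # Sketch of the reduction tiled_loop: the output is a 0-d tensor,
    # modeled here as a plain accumulator.
    out = 0
    for i in range(0, size, step):
        in_sub = in_t[i:i + step]    # tensor.extract_slice
        sub_sum = out                # outs(%out_) is read by the body
        for x in in_sub:             # the inner linalg.generic reduction
            sub_sum += x
        out = sub_sum  # linalg.tiled_yield %sub_sum in %out_
    return out
```

The terminator now yields a value of the output tile's type (here a scalar), so reductions fit the same tile-in/tile-out structure as the parallel case.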
The patch that implements these changes to linalg::TiledLoopOp and updates the associated passes (fusion, tiling, distribution, bufferization) can be found at https://reviews.llvm.org/D106066.