[RFC] Add Linalg TileOp

Motivation

Add support for tiling and related transformations, such as tile-and-fuse and tile-and-distribute, for Linalg operations on tensors.

Tiling for a Linalg operation on buffers results in an scf.for or scf.parallel op with a region that contains a number of subview operations to transform the original input and output arguments of the op.

Tiling for a Linalg operation on tensors results in a number of subtensor ops for the inputs and subtensor_insert ops for the outputs of the op, i.e. the elements of the output tensor are populated via destructive updates. Neither scf.for nor scf.parallel are suitable for containing subtensor_insert in their bodies.
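
For illustration, the destructive-update pattern looks roughly like this (a rough sketch only; the shapes, bounds and the actual tile computation are left abstract, and all names here are placeholders):

%res = scf.for %i = %c0 to %dim step %c4 iter_args(%arg = %init) -> (tensor<?x?xf32>) {
  // Slice the input and the current value of the output.
  %in_sub  = subtensor %in[%i, 0] [%size, %n] [1, 1]
      : tensor<?x?xf32> to tensor<?x?xf32>
  %out_sub = subtensor %arg[%i, 0] [%size, %n] [1, 1]
      : tensor<?x?xf32> to tensor<?x?xf32>
  // ... compute the tile, producing %tile_res : tensor<?x?xf32> ...
  // Destructive update: write the tile back into the full output tensor.
  %upd = subtensor_insert %tile_res into %arg[%i, 0] [%size, %n] [1, 1]
      : tensor<?x?xf32> into tensor<?x?xf32>
  scf.yield %upd : tensor<?x?xf32>
}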

The goal is to design a new op that has semantics similar to scf.parallel, but at the same time yields subtensors for the specified outputs, making subtensor_insert implicit.

The new op

The linalg.tile operation represents a loop nest. It takes the usual lower-bound, upper-bound and step arguments, plus input/output tensor arguments as in linalg.generic. It has one region capturing the loop body; the body region must contain exactly one block, which terminates with linalg.subtensor_yield whose arguments are matched to the output tensors.

linalg.tile (%i0, %i1) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1)
    ins (%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
    outs (%C, %D : tensor<?x?xf32>, tensor<?x?xf32>)  {
  ...
  %C_subtensor = 
  %D_subtensor = 
  ...
  linalg.subtensor_yield %C_subtensor into [offsets][sizes][strides],
                         %D_subtensor into [offsets][sizes][strides]
                         : type(%C_subtensor), type(%D_subtensor) 
}

Tile-and-fuse example

Let’s consider a tiling example for a matmul:

func @matmul(%lhs: tensor<24x64xi8>, %rhs: tensor<64x192xi8>,
             %uninit_out: tensor<24x192xi32>) -> tensor<24x192xi32> {
  %c0 = constant 0 : i32
  %c42 = constant 42 : i32

  %out = linalg.fill(%uninit_out, %c0)
      : tensor<24x192xi32>, i32 -> tensor<24x192xi32> 

  %prod = linalg.matmul_i8_i8_i32
      ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>) 
      outs(%out : tensor<24x192xi32>) -> tensor<24x192xi32>
  return %prod : tensor<24x192xi32>
}

Tile the first dimension by 4:

func @matmul(%lhs: tensor<24x64xi8>, %rhs: tensor<64x192xi8>,
             %uninit_out: tensor<24x192xi32>) -> tensor<24x192xi32> {
  %c0_i32 = constant 0 : i32
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c3 = constant 3 : index
  %c4 = constant 4 : index

  %lhs_d0 = dim %lhs, %c0 : tensor<24x64xi8>
  %lhs_d1 = dim %lhs, %c1 : tensor<24x64xi8>
  %rhs_d0 = dim %rhs, %c0 : tensor<64x192xi8>
  %rhs_d1 = dim %rhs, %c1 : tensor<64x192xi8>
  %out_d0 = dim %uninit_out, %c0 : tensor<24x192xi32>
  %out_d1 = dim %uninit_out, %c1 : tensor<24x192xi32>

  %out = linalg.fill(%uninit_out, %c0_i32)
      : tensor<24x192xi32>, i32 -> tensor<24x192xi32> 
  %prod = linalg.tile (%i) = (%c0) to (%lhs_d0) step (%c4)
      ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>) 
      outs(%out : tensor<24x192xi32>) {
    %lhs_d0_size = affine.min affine_map<(d0)[s0] -> (4, -d0 + s0)>(%i)[%lhs_d0]
    %lhs_sub = subtensor %lhs[%i, 0] [%lhs_d0_size, %lhs_d1] [1, 1]
        : tensor<24x64xi8> to tensor<?x?xi8>

    %out_d0_size = affine.min affine_map<(d0, d1) -> (4, d0 - d1)>(%out_d0, %i)
    %out_sub = subtensor %out[%i, 0] [%out_d0_size, %out_d1] [1, 1]
        : tensor<24x192xi32> to tensor<?x?xi32>

    %prod_sub = linalg.matmul_i8_i8_i32
        ins(%lhs_sub, %rhs : tensor<?x?xi8>, tensor<64x192xi8>)
        outs(%out_sub : tensor<?x?xi32>) -> tensor<?x?xi32>

    linalg.subtensor_yield
        %prod_sub into [%i, 0][%out_d0_size, %out_d1][1, 1] : tensor<?x?xi32>
  }
  return %prod : tensor<24x192xi32>
}

After the consumer operation has been tiled, we can fuse it with the linalg.fill.

func @matmul(%lhs: tensor<24x64xi8>, %rhs: tensor<64x192xi8>,
             %uninit_out: tensor<24x192xi32>) -> tensor<24x192xi32> {
  %c0_i32 = constant 0 : i32
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c3 = constant 3 : index
  %c4 = constant 4 : index

  %lhs_d0 = dim %lhs, %c0 : tensor<24x64xi8>
  %lhs_d1 = dim %lhs, %c1 : tensor<24x64xi8>
  %rhs_d0 = dim %rhs, %c0 : tensor<64x192xi8>
  %rhs_d1 = dim %rhs, %c1 : tensor<64x192xi8>
  %out_d0 = dim %uninit_out, %c0 : tensor<24x192xi32>
  %out_d1 = dim %uninit_out, %c1 : tensor<24x192xi32>

  %prod = linalg.tile (%i) = (%c0) to (%lhs_d0) step (%c4)
      ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>) 
      outs(%uninit_out : tensor<24x192xi32>) {
    %lhs_d0_size = affine.min affine_map<(d0)[s0] -> (4, -d0 + s0)>(%i)[%lhs_d0]
    %lhs_sub = subtensor %lhs[%i, 0] [%lhs_d0_size, %lhs_d1] [1, 1]
        : tensor<24x64xi8> to tensor<?x?xi8>

    %out_d0_size = affine.min affine_map<(d0, d1) -> (4, d0 - d1)>(%out_d0, %i)
    %uninit_out_sub = subtensor
        %uninit_out[%i, 0][%out_d0_size, %out_d1][1, 1]
        : tensor<24x192xi32> to tensor<?x?xi32>
    %out_sub = linalg.fill(%uninit_out_sub, %c0_i32)
        : tensor<?x?xi32>, i32 -> tensor<?x?xi32>

    %prod_sub = linalg.matmul_i8_i8_i32
        ins(%lhs_sub, %rhs : tensor<?x?xi8>, tensor<64x192xi8>)
        outs(%out_sub : tensor<?x?xi32>) -> tensor<?x?xi32>

    linalg.subtensor_yield
        %prod_sub into [%i, 0][%out_d0_size, %out_d1][1, 1] : tensor<?x?xi32>
  }
  return %prod : tensor<24x192xi32>
}

I think tile isn’t the right name for this. What you have is neither a tile nor does it mean the imperative / verb tile. tiled_generic? The op is separating the space of tiles from a tile in its modeling. (The tiling has already happened.)

Funnily enough, the original name was tiled_generic. I did not like linalg.tiled_generic because it resembles linalg.indexed_generic and linalg.generic, which can be confusing.

??!! If you think tiled_generic resembles indexed_generic and generic, nearly everything would resemble everything else! I don’t see any source for confusion - in fact, the commonality is useful.

The source of the confusion is that users might think that only *generic ops are allowed in the body of this new op, which is not true: any Linalg op (or even a non-Linalg op) would do. tiled_generic is much closer to the SCF loop ops than to linalg.generic.

I don’t really want to wade into a naming discussion (I am severely hampered myself in this), but based on @pifon2a’s comment above (my understanding/requirement matches that), it is really a representation of the nested loops over the inter-tile iteration space, with the body being the computation of a single tile. With that in mind, would linalg.tiled_loop be more appropriate?

Some other questions:

  1. It seems like this op might require an iterator_types attribute to designate the dependences between the tiles. Without it, you would lose dependence information about the loops. For example, for a 3D linalg.matmul operation, the iterator_types on the linalg.tiled_loop op would be [parallel, parallel, reduction].

  2. I think this is possible, but just to be sure: we should be able to do an “interchange” transformation on the linalg.tiled_loop operation fairly trivially (given the iterator types of the represented loops).

Apart from this, the spec looks good to me. I am really happy this will go a long way in killing subtensor_insert.

Sounds good. I will add the iterator_types attribute. In the pretty form, I can omit it if every dimension is parallel and print it otherwise.
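
For illustration only, the pretty form with the attribute might look something like the following (the `iterators` keyword and its placement are just a strawman spelling, not part of the proposal above):

linalg.tile (%i, %j, %k) = (%c0, %c0, %c0) to (%d0, %d1, %d2) step (%s0, %s1, %s2)
    ins (%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
    outs (%C : tensor<?x?xf32>)
    iterators ["parallel", "parallel", "reduction"] {
  ...
  linalg.subtensor_yield %C_sub into [offsets][sizes][strides] : tensor<?x?xf32>
}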

Thanks Alex for pushing on this!

scf.for with sequential semantics is actually fine (parallel is something else though). The destructive updates are a bit annoying to work with but still end up being a welcome simplification compared to memref + load + store. Basically, in traditional polyhedral analysis, one would worry about all possible interleavings of possibly aliasing loads and stores in the context of the k common surrounding loops. With scf.for + subtensor + subtensor_insert + yield there is a lot more ordering and dealiasing thanks to SSA use-def chains. Still, we can indeed simplify things and make them nicer to work with by using an abstraction like the one you propose.

There are some interesting details that are swept under the rug here and that are useful to unpack, especially if we also think of this in the context of data types other than dense tensors and memrefs.
Linalg ops that conform to the structured op interface fully derive their iterator bounds from their operands and come with a number of structural properties.
Applying tiling and fusion to such ops introduces control flow such as scf.for and other ops to capture information that cannot be represented with the op semantics alone.

It would have been nice to keep similar structural properties in the op but this comes with a nasty inverse problem that we don’t think we want to have to solve.

Essentially:

%0 = scf.for %i = 0 to 42 step 5 iter_args(%arg0 = %C) -> tensor<?x?xf32> {
  %tiledA = subtensor %A ...
  %tiledB = subtensor %B ...
  %tiledC = subtensor %arg0 ...
  %tiledD = some_op_1(%tiledA, %tiledB) ...
  %tiledE = some_op_2(%tiledC, %tiledD) ...
  %res = subtensor_insert %tiledE into %arg0 ...
  scf.yield %res : tensor<?x?xf32>
}

is really: tile(some_op_2(some_op_1(%A, %B), %C), 5).

And we would like to express this as an op with a region that I’ll call “region after”-tiling:

linalg.some_fancy_op (%A, %B, %C) <some_extra_attributes_and_operands> {
^bb(%tiledA, %tiledB, %tiledC):
  %tiledD = some_op_1(%tiledA, %tiledB) ...
  %tiledE = some_op_2(%tiledC, %tiledD) ...
  linalg.yield %tiledE ...
}

Unfortunately, we could not find a good generic form for this that works in the general case as it raises some nasty inverse problems. The options we have discussed are:

  1. explicitly represent subtensor / subtensor_insert or some slightly re-spelled version in the body of the new nesting linalg op (current proposal).
  2. keep a “region before”-tiling and a “region after”-tiling + enough info to apply tiling to derive the subtensor / subtensor_insert (leaning towards no).
  3. pass whatever info is necessary to <some_extra_attributes_and_operands> to only keep the “region after”-tiling (abandoned).

The issue with 3. is that <some_extra_attributes_and_operands> will quickly grow, and we don’t even know how to solve some cases (i.e. given the “region after”-tiling form, with ops that have their own indexing maps and parallelism semantics, plus a bag of <some_extra_attributes_and_operands>, it is unclear whether we can recover the subtensor / subtensor_insert with their permutations in the general case). Additionally, it will not compose properly with other things (e.g. mapping to an SPMD model).

The issue with 1. is that the control of the op becomes “iteration-centric” (i.e. loop-like) as opposed to “data-centric” as for the structured linalg ops. Longer term, things like tiling and distributing non-dense-strided-memref data types will not work on this abstraction (e.g. it only models hyperrectangular iteration spaces). Still, we can make progress with it and reevaluate when more pieces are connected.

The issue with 2. is that it keeps around some old IR form and is generally counterintuitive.
In the future we will likely reopen the work to extend the linalg.range abstractions so that we can have 2. without the inverse problem, in a way that would also work with other data types. This will take some time to design and discuss.

This will quickly grow to become unmanageable.
I’d split that into multiple terminators or multiple ops + yields, something close to subtensor_insert + yield. This will require a little bit of non-local behavior for the validation and has somewhat awkward semantics:

%rX = insert_like %x into %X[offsets][sizes][strides]: typeof(%x) into typeof(%X)
yield %rX : typeof(%rX) // But we actually really yield %x@[offsets][sizes][strides]

Would it be reasonable to have blocks with multiple terminators or is this something generally not worth exploring?

+1, we will need that to remain closer in spirit to a “data-centric” abstraction.

subtensor_insert is fine, and if you look closer this proposal does not remove the abstraction but rather its interaction with scf.for for the parallel case. However we end up spelling this op and its version of subtensor_insert that yields, the underlying tradeoff I wrote about at the top of the post still persists: in SSA-value land the destructive-update pattern is useful; we use it for vector.transfer + scf.for + scf.yield hoisting too. This hides it a little more from the user but comes with non-trivial tradeoffs.

Can you clarify which aspect exactly? It isn’t clear to me what the issue is here.

That indeed seems like quite strange semantics to me. Why do we need both an insert and a yield? Why isn’t the yield enough, with the insert embedded in the parent op’s semantics?

I’m not convinced by the concept of multiple terminators: a terminator is about transferring control, and it isn’t clear to me how you spread this across multiple ops or what it enables.

I think Nicolas meant that it’s a lot of variadic arguments to have and it would duplicate most of the code implemented for the ops with OffsetSizeAndStrideOpInterface, which have offsets, sizes and strides for a single tensor argument. And here we would have all of that for every tensor.

This would be somewhat similar to an scf.parallel loop that has several reductions in its body; see the sketch of an scf.parallel with two scf.reduce ops below.
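
A rough sketch of that pattern (the buffer, bounds and init values are made up for illustration):

%sum, %prod = scf.parallel (%i) = (%c0) to (%n) step (%c1)
    init (%zero, %one) -> (f32, f32) {
  %elem = load %buffer[%i] : memref<?xf32>
  // First reduction: sum of the elements.
  scf.reduce(%elem) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %0 = addf %lhs, %rhs : f32
    scf.reduce.return %0 : f32
  }
  // Second reduction: product of the elements.
  scf.reduce(%elem) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %1 = mulf %lhs, %rhs : f32
    scf.reduce.return %1 : f32
  }
}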

I would change it to

linalg.subtensor_yield
        %produced_subtensor into [%i, 0][%out_d0_size, %out_d1][1, 1] : tensor<?x?xi32>

to remove the target tensor from it.

Multiple terminators are strange indeed. I am actually OK with multiple yield values even if it “looks” verbose. (It’s IR; being pretty is secondary IMO.)

Agreed that the bigger problem is the interaction of subtensor_insert with scf.for. That is addressed here. Maybe a separate discussion, but subtensor_insert itself is problematic. Being in the tensor world, it is actually kicking the cost of this abstraction down the road. When you bufferize, you need to handle this correctly to avoid the humongous cost that a naive translation would entail. (The same issue actually happens in LLVM with structs being SSA values: the deSSA passes need to be smart about handling structs, and that becomes load-bearing for performance.) I understand the purpose it is serving. My hope is that this tile_* operation will allow us to avoid pitfalls during bufferization and make handling of this much easier.

Actually from my perspective it would be good to avoid pushing the insert semantics to the parent. It would get us back to having to handle subtensor_insert as I just mentioned above.

@MaheshRavishankar @nicolasvasilache Can I assume that the offsets-sizes-strides args for the subtensor_insert are always the same as the offsets-sizes-strides args of the subtensor op that was applied to the “init tensor” in the outs section? In that case, it would be possible to omit the subtensor_insert ops and just have `linalg.yield %subtensor_1, %subtensor_2`.
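
For concreteness, under that assumption the body of the first tiled matmul above could hypothetically shrink to something like this (illustration only; %lhs_sub and the size computations are as in the earlier example and elided here):

%prod = linalg.tile (%i) = (%c0) to (%lhs_d0) step (%c4)
    ins(%lhs, %rhs : tensor<24x64xi8>, tensor<64x192xi8>)
    outs(%out : tensor<24x192xi32>) {
  ...
  // The offsets/sizes/strides are implied by the subtensor of the out operand.
  %out_sub = subtensor %out[%i, 0] [%out_d0_size, %out_d1] [1, 1]
      : tensor<24x192xi32> to tensor<?x?xi32>
  %prod_sub = linalg.matmul_i8_i8_i32
      ins(%lhs_sub, %rhs : tensor<?x?xi8>, tensor<64x192xi8>)
      outs(%out_sub : tensor<?x?xi32>) -> tensor<?x?xi32>
  linalg.yield %prod_sub : tensor<?x?xi32>
}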

Tile and fuse n ops with n large enough, each of them with their own flavor of mixed static and dynamic operands (yes, we want canonicalizations to work as well as they do with the other ops of the interface).
Add higher-rank operations (e.g. 7+D conv) and rank-reducing subtensors and you’ve got yourself a nice mess.

Not sure that works. Lowering the yield would then require you to carry state over from lowering the subtensor op in the region. That would be painful.

I don’t want to hold up this line of work, but since you mention it here: would it suffice to change scf.parallel so that, beyond reductions, it also has a variant for producing tensors? In essence, adding something like scf.subtensor_yield.

One difference I can think of is that scf.parallel has no notion of parallel vs. reduction dimensions. But ultimately, isn’t that just required here to derive the shape of the produced output tensor? That could be specified explicitly.
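
To visualize the suggestion (purely hypothetical: neither the tensor-producing scf.parallel variant nor scf.subtensor_yield exists), the tiled matmul from above might then read:

%prod = scf.parallel (%i) = (%c0) to (%lhs_d0) step (%c4)
    init (%out : tensor<24x192xi32>) -> tensor<24x192xi32> {
  ...
  %prod_sub = linalg.matmul_i8_i8_i32
      ins(%lhs_sub, %rhs : tensor<?x?xi8>, tensor<64x192xi8>)
      outs(%out_sub : tensor<?x?xi32>) -> tensor<?x?xi32>
  // Hypothetical terminator mirroring linalg.subtensor_yield.
  scf.subtensor_yield
      %prod_sub into [%i, 0][%out_d0_size, %out_d1][1, 1] : tensor<?x?xi32>
}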

Good question. The main reason to have this op in LinalgOps.td is to understand what sort of abstraction is actually needed. If we converge on a loop-like op, I think it would make sense to move it to SCF, extend scf.parallel, or even replace scf.parallel with it. Until then, we keep it in Linalg.

The terminator can have a region attached to it.