[RFC] Changes to linalg::TiledLoopOp to unblock reductions

Background

linalg.tiled_loop is an operation used for data tiling of linalg ops that enables fusion, tiling, padding, distribution and other transformations on tensors. The original RFC proposed to explicitly specify subtensor/tensor.extract_slice and subtensor_insert/tensor.insert_slice operations.

That way, the tiled_loop body is structured as a get-tiles/compute/insert-tiles-back sequence:

%sum = linalg.tiled_loop (%i) = (%c0) to (%size_0) step (%c10)
    ins (%in_ = %in: tensor<100xf32>)
    outs (%out_ = %out: tensor<100xf32>)
    iterator_types ("parallel")  {
  // Extract tiles.
  %in_sub = tensor.extract_slice %in_[%i][%c10][%c1] 
  %out_sub = tensor.extract_slice %out_[%i][%c10][%c1]
  
  // Actual computation.
  %add_sub = linalg.generic {
      indexing_maps = [affine_map<(d0) -> (d0)>,
                       affine_map<(d0) -> (d0)>],
      iterator_types = ["parallel"]}
     ins(%in_sub: tensor<10xf32>)
     outs(%out_sub: tensor<10xf32>)  {
   ^bb0(%in_elem: f32,  %out_elem: f32):
     linalg.yield  %in_elem : f32
    } -> tensor<10xf32>
    
  // Insert tiles.
  %insert_result = tensor.insert_slice %add_sub into %out_[%i][%c10][%c1]
  
  // Yield the whole tensor after destructive update.
  linalg.yield %insert_result : tensor<100xf32> 
}

There are several disadvantages with this approach:

  • linalg.yield returns the whole tensor after the destructive update, not the computed tile, i.e. the yielded value has the type of the original tensor, not of the tile.
  • No perfect nesting of linalg.tiled_loop operations.
  • It is not clear how to support reductions.

Analogy to linalg.generic

linalg.generic contains a region that represents a scalar computation on elements of the ins and outs tensor/memref arguments. The exact indices of the elements are specified by the affine indexing maps. Reductions are implicit. There are two indicators that an instance of linalg.generic is a reduction. First, one has to look for a “reduction” iterator type. Second, there should be a read of the output element in the body.

#id = affine_map<(d0) -> (d0)>
#scalar = affine_map<(d0) -> ()>
%sum = linalg.generic {indexing_maps = [#id, #scalar],
    iterator_types = ["reduction"]}
    ins(%in: tensor<?xf32>)
    outs(%out: tensor<f32>)  {
  ^bb0(%in_elem: f32,  %out_elem: f32):
    %0  = addf %in_elem,  %out_elem : f32
    linalg.yield  %0  : f32
} -> tensor<f32>

So, linalg.generic is a loop specification: it describes how to get elements from the tensors/memrefs and what to do with them. When lowered to SCF or Affine loops, the necessary load operations are inserted using the affine maps for both input and output arguments, and the linalg.yield op results in a store of the computed values to the SAME output locations.
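For instance, after bufferization the reduction example above lowers to roughly the following (a minimal sketch; the memref types and the names %size, %in, %out are assumed):

scf.for %i = %c0 to %size step %c1 {
  // Loads inserted according to the indexing maps.
  %in_elem = memref.load %in[%i] : memref<?xf32>
  %out_elem = memref.load %out[] : memref<f32>
  %0 = addf %in_elem, %out_elem : f32
  // linalg.yield becomes a store to the SAME output location as the load.
  memref.store %0, %out[] : memref<f32>
}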

linalg.tiled_loop can be remodeled to become consistent with the structure of linalg.generic. It can be thought of as a linalg.generic on non-scalar subsets of the inputs and outputs. Instead of indexing affine maps, it has tensor.extract_slice operations to specify the tiles, and it can have a terminator that yields the computed tile for a given tile of the output. That would make tensor.insert_slice implicit.

Remark: later, we might be able to introduce operations other than tensor.extract_slice to specify the partitioning of the arguments.

Proposed changes

Introduce linalg.tiled_yield, which takes two groups of SSA values. The first group contains the computed results; the second contains the “destination” tiles of the outputs.

Transpose example

#id = affine_map<(d0, d1) -> (d0, d1)>
#tr = affine_map<(d0, d1) -> (d1, d0)>

%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ =  %in: tensor<100x100xf32>)
    outs (%out_ =  %out: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel")  {
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
  %out_sub = tensor.extract_slice %out_[%j, %i][%c10, %c10][%c1, %c1]
 
  %transpose_sub = linalg.generic {
      indexing_maps =  [#id, #tr],
      iterator_types =  ["parallel", "parallel"]}
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>)  {
    ^bb0(%in_elem: f32,  %out_elem: f32):
      linalg.yield  %in_elem : f32
  } -> tensor<10x10xf32>
  linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>
}
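For comparison, under the previous modeling this terminator would have been spelled as an explicit insert plus a whole-tensor yield, roughly (assuming the same offsets/sizes/strides as the %out_sub extract_slice above):

%insert_result = tensor.insert_slice %transpose_sub into %out_[%j, %i][%c10, %c10][%c1, %c1]
linalg.yield %insert_result : tensor<100x100xf32>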

Reduction example

#id = affine_map<(d0) -> (d0)>
#scalar = affine_map<(d0) -> ()>
%sum = linalg.tiled_loop (%i) = (%c0) to (%size) step (%c10)
    ins (%in_ = %in: tensor<100xf32>)
    outs (%out_ = %out: tensor<f32>)
    iterator_types ("reduction")  {
  %in_sub = tensor.extract_slice %in_[%i][%c10][%c1]
  %sub_sum = linalg.generic {indexing_maps = [#id, #scalar],
    iterator_types =  ["reduction"]}
    ins(%in_sub: tensor<10xf32>)
    outs(%out_: tensor<f32>)  {
  ^bb0(%in_elem: f32,  %out_elem: f32):
    %0  = addf %in_elem,  %out_elem : f32
    linalg.yield  %0  : f32
  } -> tensor<f32>
  linalg.tiled_yield %sub_sum in %out_ : tensor<f32>
}

The patch with the changes to linalg::TiledLoopOp and the associated passes, i.e. fusion, tiling, distribution and bufferization, can be found at https://reviews.llvm.org/D106066.

Two questions:

  • Wasn’t a design like that, with a yield for each tile, considered originally? I kind of remember @nicolasvasilache objecting to this.
  • I’m puzzled by the in %out_sub: what is %out_sub doing here? The linalg.generic in the body should already produce a complete tensor with %transpose_sub, and it already consumes the %out_sub tile as needed. Why wouldn’t any update be done in the body, with the yield just producing a complete tile?

The objection was to a terminator that reproduces the logic of multiple subtensor_insert ops, all crammed together within a single op + attributes.

I’ll repost from my reply at the time:

The difference in this new RFC is that the “insert_like” becomes unnecessary and Alex proposes to instead tie the yield to the “extract_like” op. The same discussion about non-local behavior for validation applies.

The %out_sub + yielding a tile instead of the whole tensor is what enables reduction semantics at the tile level. The one thing that is not ideal is the “non-local verification behavior” I mentioned above.

The crux of the problem is that an interesting abstraction gap appears when one mixes all of the constraints below:

  1. tiling (and other transformations)
  2. parallel iterators
  3. reduction iterators
  4. tensor SSA values

This is related to the semantics of the yield and the type of the yielded value vs the returned value.
That abstraction gap disappears once bufferization is performed.

The only job of the tiled_loop abstraction is to hide that abstraction gap and allow transformations on tensors + late bufferization.

The abstraction we have today uses

`linalg.tiled_loop` + `extract` + `insert` + `yield the whole tensor`

This works fine for 1. + 2. + 4. but not for 3. Yielding the tile appears essential to encoding parallel + reduction at the tile level.

This RFC proposes to drop the usage of insert_like in this case, which adds an extra simplification. The representation would resemble:

`linalg.tiled_loop` + `extract` + `yield the tile`

As a reminder, the ideal form we’d have loved to have is mentioned in that previous post and can be most naturally expressed as the result of a transformation.

Long story short, there are some deep representational considerations here that are driven by transformations. The solution proposed in this RFC is a nice incremental step forward. Compared to the ideal-but-still-missing-stuff form of linalg.some_fancy_op, it simplifies the “insert_like” part, but the “extract_like” part is still obtained through “nested extract op + SSA values” and not “enclosing op semantics”. There is a strong suspicion that we cannot get rid of the “nested extract op + SSA” part without significant other degradations.

Still, I don’t think this is the end of the road and this is likely to continue evolving as we learn more from practice once we can also handle parallel codegen across reductions on tensors (mouthful).

Wasn’t a design like that, with a yield for each tile, considered originally? I kind of remember @nicolasvasilache objecting to this.

@mehdi_amini, in the original RFC it was:

linalg.subtensor_yield %C_subtensor into [offsets][sizes][strides],
                         %D_subtensor into [offsets][sizes][strides]
                         : type(%C_subtensor), type(%D_subtensor)

It has all arguments of a subtensor_insert inlined in the terminator except for the destination arg. The destination tensor was supposed to be taken from the outs args of the enclosing tiled_loop.

It has several downsides:

  1. the read-modify-write semantics are not so obvious, since one would need to compare all the offsets, sizes and strides in the subtensor_yield and the subtensor operations to make sure that the same slice is read and written (see the sketch after this list).
  2. it is tied to the subtensor operation, and it is not clear how to extend it to support gather and other subset computations. The tiled_yield op already supports extract_slice and identity as subset ops.
  3. it is ugly and clumsy. Reading such IR is hard if you have several results.
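To illustrate the first downside, a minimal sketch using the old op names (%C_ and %C_new are hypothetical values): the verifier would have to match the two offset/size/stride lists to establish that the terminator writes back exactly the slice that was read:

%C_sub = subtensor %C_[%i, %j][%c10, %c10][%c1, %c1] : tensor<100x100xf32> to tensor<10x10xf32>
// ... compute %C_new from %C_sub ...
linalg.subtensor_yield %C_new into [%i, %j][%c10, %c10][%c1, %c1] : tensor<10x10xf32>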

I’m puzzled by the in %out_sub: what is %out_sub doing here? The linalg.generic in the body should already produce a complete tensor with %transpose_sub, and it already consumes the %out_sub tile as needed. Why wouldn’t any update be done in the body, with the yield just producing a complete tile?

I am not sure I understood the question. There might be several linalg.generic ops inside of the tiled_loop. The %out_sub in the yield represents the subset computation for the RW access to the output. So, extract_slice here is roughly what an affine_map is in linalg.generic: it is just a location. We are also working on a different set of ops which would have a block for every index or subset computation for the inputs/outputs. In that case the yield would only have a list of computed tiles to return, because we would know how to associate it with the computed subsets/indices. For the current tiled_loop we don’t really know how to associate the produced tile with the slice of the output without explicitly listing it. Of course, we could omit the output tile, assume that there is a single extract_slice for every output and then find it. But there is another problem here: for example, in the IR for the simple reduction above, we do not apply extract_slice to the output at all.

Just leaving a note of support here. I think this is a great step towards a better modeling of tiles and strictly better than what we have today.

I agree with what @nicolasvasilache said, this is not the end of the road but unblocks further exploration.

Thanks! I didn’t remember the exact aspect that was bothering you; that makes it very clear what the difference is :slight_smile: I was anchored to the “yield a tile” vs “yield the tensor” distinction.

I don’t understand what this last sentence means actually (it is also harder for me to reason about semantics any time “RW access” is used in conjunction with immutable tensors).

Maybe you can provide a better motivating example? Right now you have:

  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
  %out_sub = tensor.extract_slice %out_[%j, %i][%c10, %c10][%c1, %c1]

I read this as producing two tensors for the current tile: the first one is the input data, and the second is the “init” data in case of reductions (it may be used just to carry the shape if its value isn’t used).

But then when you yield the tile, %transpose_sub covers the entire tile, so I don’t get what the second operand brings here: it is just a tensor whose shape is the tile size, like the first operand.

Looking at the reduction example, it seems to be the same case (the output tile is a scalar here, but you still have the first yielded operand covering the entire output tile).

That’s also confusing to me: isn’t extract_slice actually returning a slice of the input? What do you mean by “it is just a location”? It seems to me like a copy of (a slice of) the content of the input.

In general I think it makes sense to have the body yield a tile instead of the whole tensor, so +1 for the direction. (Note this is kind of what we do in IREE with flow.dispatch.tensor.load and flow.dispatch.tensor.store.)

I mostly have a nit about

linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32>

what does "%transpose_sub in %out_sub" mean?

Would something like

linalg.tiled_yield %transpose_sub as %out_sub

be more readable? Essentially saying that %transpose_sub replaces what was %out_sub.

Also,

 tiled_loop.yield %sub_sum in %out_

was this a typo, or does tiled_loop.yield signify something else?

Side note: this does seem to fit well with the `TilingInterface` RFC for tiling operations that don’t fit the Linalg structured-op definition. There, too, the tiled implementation only returns the tile, and the tensor.insert_slice becomes an implementation detail of the generated tiled code.

Happened to chat about this with @mehdi_amini. I will list some things that I realized based on that discussion (he might add more to this):

  • The reduction example above seems to be missing how the yielded tiles are combined. This should be made explicit, maybe by an extra region that takes two tiles as arguments and returns the combined tile (similar to how reductions are modeled in scf.parallel; see the sketch after this list).

  • One more thing that is a bit of a weirdness is that in

linalg.tiled_yield %transpose_sub in %out_sub 

you don’t actually need the value of %out_sub (i.e. the value in %out_ at [...][...][...]). Is it being used only to represent the tile of %out_ that is being overwritten? Does that mean that the operand to the right of in should always be defined by a tensor.extract_slice? In other words, the real purpose of %out_sub is to propagate the use of %out_ and the [..][..][..]. Being more explicit about that might make things less confusing.
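For reference, a minimal sketch of how scf.parallel makes the combining explicit with a reduction region (scalar accumulator; <compute partial sum> is a placeholder):

%sum = scf.parallel (%i) = (%c0) to (%size) step (%c10) init (%init) -> f32 {
  %partial = <compute partial sum for this tile> : f32
  // The region states how two partial results are combined.
  scf.reduce(%partial) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %0 = addf %lhs, %rhs : f32
    scf.reduce.return %0 : f32
  }
}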

Yes: I understood a few aspects thanks to Mahesh!

First it is good to keep in mind that linalg.tiled_loop isn’t really a “structured op” in the sense that the ins and outs aren’t connected in any way to the iteration space, which is exclusively driven by the (%i) = (%c0) to (%size_0) step (%c10) expression.

So it isn’t even clear what semantics the ins and outs really carry in the first place; it seems more like a convention than anything at the tensor level. It may prove more useful at the memref level (can you also provide a memref example?).

For example it should be valid to write the transpose tiled_loop without an “outs”:

%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ =  %in: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel")  {
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c10][%c1, %c1]
  %out_sub = linalg.init_tensor [10, 10] : tensor<10x10xf32>
  %transpose_sub = linalg.generic {
      indexing_maps =  [#id, #tr],
      iterator_types =  ["parallel", "parallel"]}
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>)  {
    ^bb0(%in_elem: f32,  %out_elem: f32):
      linalg.yield  %in_elem : f32
  } -> tensor<10x10xf32>
  linalg.tiled_yield %transpose_sub in /* use range, see below */: tensor<10x10xf32>
}

Unless the outs is used to compute the shape of the results of the tiled_loop?
That would allow a tiled_loop to only partially produce the output; the other values would be read from the outs (which is really not well named for the tensor level, as it shows here again).

Then, as Mahesh mentioned, the linalg.tiled_yield takes the tile to yield as its first argument, but because the linalg.tiled_loop isn’t structured, it needs to also carry the range for the tile.
I feel that a more accurate model of what we want to express here would be to actually carry the intent with a type, instead of “faking” it and hoping to recover it from the extract_slice:

...
  %tile_range = tensor.tile_range [%j, %i][%c20, %c10][%c1, %c1] : !tensor.tile_range
  %out_sub = tensor.extract_slice %out(%tile_range) // only if %out_sub is used as input to the computation!
  ...
  linalg.tiled_yield %transpose_sub at %tile_range : tensor<10x10xf32>
}

The alternative is to express the coordinates in the tiled_yield explicitly, but that is what @ntv found hard to manage in the original proposal because of the amount of variadic operands.
Using a proper SSA value for the range makes the linalg.tiled_yield very regular: operands come in pairs. A variadic list of pairs seems quite manageable here (actually the proposal here also has a variadic list of pairs, so I guess we’re on the same page in terms of complexity of the yield…).
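For instance, with several results the pairing would stay regular; a sketch with hypothetical names, extending the at-form above:

linalg.tiled_yield %a_sub at %a_range : tensor<10x10xf32>,
                   %b_sub at %b_range : tensor<10x10xf32>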

Yes, it is a typo. Thank you for noticing, it was supposed to be linalg.tiled_yield %sub_sum in %out_.

The types of the results of a tiled loop should be the same as the types of the tensors in outs. So, it is not valid to write the transpose without the outs.

Exactly! If we develop this idea and make tiled loop even more generic, we would need a DSL to compute subsets.

%result = very.generic_loop (%i) = (%c0) to (%c100) step (%c10) {
    input(%in: tensor<100xf32>, %ivs: !dsl.ivs<1>) {
      %0 = dsl.tile_range [%i][%c10][%c1] : !dsl.tile_range
      very.yield %0 : !dsl.tile_range
    }
    output(%out: tensor<100xf32>, %ivs: !dsl.ivs<1>) {
      %0 = dsl.tile_range [%i][%c10][%c1] : !dsl.tile_range
      very.yield %0 : !dsl.tile_range
    }
    computation(%in_: tensor<10xf32>, %out_: tensor<10xf32>) {
      %result = <some_computation> : tensor<10xf32>
      very.yield %result : tensor<10xf32>
    }
} -> tensor<100xf32>

I think the proposal here is a step in the right direction, in the same way that init_tensor was a step in the right direction (although I still dislike the fact that we pass a tensor value for the output even if only a shape would be needed). It allows us to model reductions and slightly clean up the modelling we currently have without leaping directly into the domain of fully generic index/tile expressions.

Just to be clear, I like the proposal with a dsl to express tiling/indexing but that is a big leap from the current way linalg works.

It’s really unclear to me why another op is necessary here. It seems to me that there is a fundamental need to describe unordered reduction-type operations, and linalg does this quite nicely. The missing piece seems to be a way of explicitly describing a tiling of a tensor, e.g.:

%tiled = tensor.tile(4, 4, %in) : tensor<32x32xf32> -> tensor<8x8xtensor<4x4xf32>>

Then these tiled linalg operations could simply use the existing linalg operations to express reductions.
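A hypothetical sketch of what that might look like, assuming linalg.generic allowed nested-tensor element types and a tile-level combine (neither exists today; %acc and <tile_add> are placeholders):

%tiled = tensor.tile(4, 4, %in) : tensor<32x32xf32> -> tensor<8x8xtensor<4x4xf32>>
%sum = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> ()>],
    iterator_types = ["reduction", "reduction"]}
    ins(%tiled: tensor<8x8xtensor<4x4xf32>>)
    outs(%acc: tensor<tensor<4x4xf32>>) {
  ^bb0(%tile: tensor<4x4xf32>, %acc_tile: tensor<4x4xf32>):
    // Combine two tiles instead of two scalars.
    %0 = <tile_add> %tile, %acc_tile : tensor<4x4xf32>
    linalg.yield %0 : tensor<4x4xf32>
} -> tensor<tensor<4x4xf32>>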

When a flow is at the point where it wants to make the order of the reductions explicit, these could then be lowered into regular scf or affine loops.

~~That is strange syntax. The tile operation itself is returning the whole tensor, but the region is computing a tile. I would rather go with the tile_range operation. It would basically remove the OffsetStrideSizeInterface, and all of that would just be part of the type.~~

Edit: ignore the struck-through text above. I misread what was posted.

Just restating this from an earlier post so that it doesn’t get lost; I think this needs to be addressed.

A completely tangential comment, since the proposal here is to add a new op: I think folks adding new ops to linalg should really split the file LinalgOps.cpp. It currently takes 22+ seconds to compile this single file even on a fast processor, and I’m sure this is slowing down nearly everyone, including those not using linalg: any change to anything LinalgOps depends on means that rebuilding mlir-opt and all the static libs will hit this compile-time wall.


This representation, if I understand it right from the brief description, would make e.g. padding implicit. Also, how would you represent fusion at the tile level in this scenario?

The main issue is expressing the construction of a tensor value from subcomponents. That is difficult to express with scf loops, as there is no operation to laminate partial updates. This is the main addition (in my perspective) that tiled_loop brings.

This is also not linalg specific and could be seen as an evolution of scf.parallel at the tensor level.

The “DSL” isn’t very clear to me, but I’m interested in having a more thorough look at what I posted above about properly modeling the location in the output tensor with an SSA value, instead of tracking back to an extract_slice, which really looks like a hack to me.

In essence, if you want to model the location in the output tensor, you need a way to describe this location. For tiled_loop, it is not even a single location but a subspace of the index space. In the current modeling this is a rectangular tile, but in general it could be a more complex structure.

The question then is: how do you describe this tile? You could have an op similar to the tile op that @stephenneuendorffer described, but on indices rather than tensors. You could also model this as a function from one index space to another, similar to what affine maps do; this is what very.generic_loop does. Or you could have a hybrid: an op that has an index transformation expression and a description of a tile of the iteration space.

What this should look like is not clear (at least to me), and @pifon2a is looking into some ways to generalize tiling. While we are getting there, we can still improve the current model a bit, which is what this RFC does. I don’t think it is worse than what we have now, and tiled_loop is not a new operation either.

While we make these steps, we also need to make sure this composes with @MaheshRavishankar’s work on a tiling interface in another recent RFC.

Right, and what I proposed above seems like a proper way to describe this, doesn’t it?

So I don’t think you really addressed my question / concern directly: the current RFC modeling seems shaky to me, and I’m concerned about moving forward with such a solution. I proposed a more principled solution above and I didn’t hear a proper rebuttal of why we couldn’t do that instead.

Yes, it is one way to describe it. We are just not sure what the ideal way to describe a tile would be and we want to prototype a bit more to get a better understanding of which model we would like best and whether it covers all the other use cases we have, like gather and concat. We can start a separate RFC to discuss our current thinking and have this discussion there.

In the meantime, we also want to be able to tile reductions, because that is functionality we need now. This RFC proposes a minimal change to unblock reductions while we figure out the bigger picture. I think this is in line with how linalg has been evolving over time, doing incremental steps as we go. The point of the RFC is to make sure we are not breaking anyone’s pipeline with this change and to make people aware of it.

So, in the end, this is not a technical discussion about the approach (hence you won’t get a rebuttal; I agree with what you say) but a question of incremental change in the right direction vs. getting it perfect right away.