Understanding the vector abstraction in MLIR

Hello!

I’m doing some experiments to try to better understand the general vector abstraction in MLIR and I have some questions/comments. This directly relates to, and adds some points to, this previous post ([MLIR] Multidimensional vector abstraction). Let me share a small example to better illustrate my findings and goals. Imagine that we have the following scalar function:

  func @scalar_test(%in_out : memref<21xf32>) {
    affine.for %i = 0 to 21 {
      %ld = affine.load %in_out[%i] : memref<21xf32>
      %add = addf %ld, %ld : f32
      affine.store %add, %in_out[%i] : memref<21xf32>
    }
    return
  }

Then, we want to write a vector counterpart. Let’s assume that we target AVX2 but we want to write it in a generic way so that we can leverage some of the affine/loop/std optimizations, and then lower it to a hypothetical AVX2 dialect later on, or even leave the hw-specific lowering to LLVM. This is the first thing I wrote, which is not correct in MLIR:

  func @cast_test(%in_out : memref<21xf32>) {
    %c16 = constant 16 : index
    %c20 = constant 20 : index

    // Process 16 elements, 8 elements at a time: potentially lowered to YMM ops/regs.
    %vec8 = memref_cast %in_out : memref<21xf32> to memref<?xvector<8xf32>>
    affine.for %i = 0 to 2 {
      %ld8 = affine.load %vec8[%i] : memref<?xvector<8xf32>>
      %add8 = addf %ld8, %ld8 : vector<8xf32>
      affine.store %add8, %vec8[%i] : memref<?xvector<8xf32>>
    }

    // Process 4 elements: potentially lowered to XMM ops/regs.
    %vec4 = memref_cast %in_out : memref<21xf32> to memref<?xvector<4xf32>>
    %ld4 = affine.load %vec4[%c16] : memref<?xvector<4xf32>>
    %add4 = addf %ld4, %ld4 : vector<4xf32>
    affine.store %add4, %vec4[%c16] : memref<?xvector<4xf32>>

    // Process 1 element: Scalar ops/regs.
    %ld1 = affine.load %in_out[%c20] : memref<21xf32>
    %add1 = addf %ld1, %ld1 : f32
    affine.store %add1, %in_out[%c20] : memref<21xf32>
    return
  }

The code above is not allowed because vector is considered a memref element type and memref_cast cannot change the element type of a memref. This means that we cannot memref_cast a vector to a scalar or even a vector to another vector with a different length (e.g., memref_cast %in_out : memref<?xvector<8xf32>> to memref<?xvector<4xf32>>).

Then, I gave std.view a try and wrote something like this:

  func @view_test(%in_out : memref<84xi8>) {
    %c2 = constant 2 : index
    %c5 = constant 5 : index
    %c16 = constant 16 : index
    %c20 = constant 20 : index

    // Are the strides correct?
    %vec8 = view %in_out[][%c2] : memref<84xi8> to memref<?xvector<8xf32>>
    %vec4 = view %in_out[][%c5] : memref<84xi8> to memref<?xvector<4xf32>>
    %scalar = view %in_out[][] : memref<84xi8> to memref<21xf32>

    // Process 16 elements, 8 elements at a time: potentially lowered to YMM ops/regs.
    affine.for %i = 0 to 2 {
      %ld8 = affine.load %vec8[%i] : memref<?xvector<8xf32>>
      %add8 = addf %ld8, %ld8 : vector<8xf32>
      affine.store %add8, %vec8[%i] : memref<?xvector<8xf32>>
    }

    // Process 4 elements: potentially lowered to XMM ops/regs.
    %ld4 = affine.load %vec4[%c16] : memref<?xvector<4xf32>>
    %add4 = addf %ld4, %ld4 : vector<4xf32>
    affine.store %add4, %vec4[%c16] : memref<?xvector<4xf32>>

    // Process 1 element: Scalar ops/regs.
    %ld1 = affine.load %scalar[%c20] : memref<21xf32>
    %add1 = addf %ld1, %ld1 : f32
    affine.store %add1, %scalar[%c20] : memref<21xf32>

    return
  }

This seems to compile, which is great! However, I see a couple of drawbacks here:

  1. My function is no longer type safe. Using views requires a 1-D i8 memref, so we basically have to drop all the shape and element type information from the memref parameter, make it opaque and, therefore, rely on the caller to pass the expected buffer (see the sketch after this list).
  2. Vectorizing a single loop in a function impacts the function signature and all other uses of the vectorized memref in the function.
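
To make drawback 1 concrete, here is a hypothetical caller (names made up): since there is no std op to reinterpret a memref<21xf32> as raw bytes, the buffer has to be allocated as an opaque i8 buffer up front and the f32 shape never appears in the types:

  func @caller() {
    // The buffer must be created as opaque bytes; the 21xf32 shape is only implied.
    %buf = alloc() : memref<84xi8>
    call @view_test(%buf) : (memref<84xi8>) -> ()
    return
  }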

Using the vector dialect would be another option, but I think that would mean moving my code to another level of abstraction and probably not being able to apply affine optimizations on it, right?

My general feeling is that memref_cast is currently a bit too constrained and that there is no other simple option for memref castings that only “change the number of read/written elements” (scalar<->vector or vector<->vector). Views are really powerful, but I think it’s overkill to use them for these castings; they were introduced to address a different and more complex kind of problem.

I guess I can summarize the questions and design decisions I would like to better understand as follows:

  1. Why is vector a memref element type?
  2. Why can’t a memref_cast convert between: a) a scalar and a vector with the same “element” type; b) two vectors with different vector lengths and the same “element” type?
  3. What does it mean that an alloc or block argument (or any non-memory op on a memref type) has a vector type? Isn’t this unnecessarily adding/enforcing how data has to be read/written at a point where only allocation/shape/layout information should be needed?
  4. What is the best way to represent vector code suitable for the affine/std domain?

Thanks in advance!
Diego

If you don’t have a vector type as a memref element type: how do you load multiple elements into a vector?
You could have a special vector load which loads multiple elements from a scalar memref into a vector, but then it could be a gather, since you don’t have a guarantee on the contiguity of the memory?

I suspect that this is about the fact that this wouldn’t be correct in general? Again, the issue is that a memref does not necessarily have its logically consecutive elements contiguous in memory; I suspect this is the reason for most of the restrictions on memref.
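
For example (purely illustrative), a memref type with a non-identity layout map keeps logically consecutive elements apart in the underlying buffer, so a plain contiguous vector load of it would be wrong:

    // Illustrative layout map: element i lives at physical offset 2*i, so the 8
    // logical elements are not contiguous and a contiguous 8-wide load is invalid.
    memref<8xf32, affine_map<(d0) -> (d0 * 2)>>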

This is exactly why I had to create a memref_shape_cast op for my experiments in the gemm codegen article. I later generalized memref_shape_cast to cast from a memref of scalar elt type with any last dimension size (including dynamically sized) to a memref of vector elt type. Here is its doc comment:

The memref_shape_cast operation converts a memref with a non-vector element type to another memref with a vector element type, while keeping the underlying scalar element type unchanged. The last dimension size of the source memref is divided (floor division) by the vector size to obtain the corresponding dimension of the target memref type.

    %MV = memref_shape_cast %M : memref<64x16xf32> to memref<64x2xvector<8xf32>>
    %AV = memref_shape_cast %A : memref<?x?xf32> to memref<?x?xvector<8xf32>>
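
On the 21-element memref from the first post, the floor division would give, for instance (a sketch assuming the semantics above):

    %V8 = memref_shape_cast %in_out : memref<21xf32> to memref<2xvector<8xf32>>  // 21 floordiv 8 = 2
    %V4 = memref_shape_cast %in_out : memref<21xf32> to memref<5xvector<4xf32>>  // 21 floordiv 4 = 5

The leftover elements (the trailing 5 and 1 in this case) would still have to be accessed through the original scalar memref.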

I’m happy to contribute this upstream if there is consensus on where this would go. Having a separate op makes sense to me to start with. The code is already here. The LLVM lowering for it works all the way through execution.

When combined with full/partial tile separation, it allows one to vectorize with dynamically shaped memrefs and unknown trip counts, where the vector memref is used for the ‘then’ branch code and the scalar memref is used for the ‘else’ branch (which is not vectorized). Here’s an example showing the generated code.

    %0 = memref_shape_cast %arg2 : memref<?x?xf64> to memref<?x?xvector<4xf64>>
    %1 = memref_shape_cast %arg1 : memref<?x?xf64> to memref<?x?xvector<4xf64>>
    %2 = dim %arg2, 0 : memref<?x?xf64>
    %3 = dim %arg2, 1 : memref<?x?xf64>
    %4 = dim %arg0, 1 : memref<?x?xf64>
    affine.for %arg3 = 0 to #map18()[%3] {
      affine.for %arg4 = 0 to #map17()[%4] {
        affine.for %arg5 = 0 to #map16()[%2] {
          affine.for %arg6 = #map14(%arg3) to min #map15(%arg3)[%3] {
            affine.for %arg7 = #map12(%arg5) to min #map13(%arg5)[%2] {
              affine.if #set0(%arg6, %arg7)[%3, %2] {
                affine.for %arg8 = #map8(%arg4) to min #map9(%arg4)[%4] {
                  affine.for %arg9 = #map6(%arg6) to #map7(%arg6) step 4 {
                    %5 = affine.load %1[%arg8, %arg9 floordiv 4] : memref<?x?xvector<4xf64>>
                    affine.for %arg10 = #map4(%arg7) to #map5(%arg7) {
                      %6 = affine.load %arg0[%arg10, %arg8] : memref<?x?xf64>
                      %7 = splat %6 : vector<4xf64>
                      %8 = mulf %7, %5 : vector<4xf64>
                      %9 = affine.load %0[%arg10, %arg9 floordiv 4] : memref<?x?xvector<4xf64>>
                      %10 = addf %9, %8 : vector<4xf64>
                      affine.store %10, %0[%arg10, %arg9 floordiv 4] : memref<?x?xvector<4xf64>>
                    }
                  }
                } 
              } else {
                affine.for %arg8 = #map8(%arg4) to min #map9(%arg4)[%4] {
                  affine.for %arg9 = #map6(%arg6) to min #map11(%arg6)[%3] {
                    %5 = affine.load %arg1[%arg8, %arg9] : memref<?x?xf64>
                    affine.for %arg10 = #map4(%arg7) to min #map10(%arg7)[%2] {
                      %6 = affine.load %arg0[%arg10, %arg8] : memref<?x?xf64>
                      %7 = mulf %6, %5 : f64
                      %8 = affine.load %arg2[%arg10, %arg9] : memref<?x?xf64>
                      %9 = addf %8, %7 : f64
                      affine.store %9, %arg2[%arg10, %arg9] : memref<?x?xf64>
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
    return

It doesn’t deal with multi-dimensional vectors, but I think the conversion to the latter is usually only meaningful in conjunction with tiled data layouts.

That’s right, but those constraints can be checked in most cases (except when you have a dynamic stride along a memref dimension). If you want to consider only the scenarios where you have a logical dimension that has contiguity in the physical space, the op could restrict itself to that. OTOH, if one wants to be general, a vector casting op will need to define where the vector elements are coming from in the scalar memref (since the logical space could be permuted and its dimensions strided when mapped to the physical space). The lowering will then have to make use of gathers for such a general vector casting op.

I think one of the historical hiccups was that a general memref cast would try to mix layout + type change and that does not pass function boundaries. IIRC, to recover the edge information, one needs to keep a reference to the type before it was cast.

A new op for just changing the element type makes sense and has already been demonstrated as Uday mentioned.

Thanks for your replies.

Sorry, my question wasn’t clear. I wanted to specifically ask about the element qualifier. I wonder why the element type of a memref<8xf32> is f32 and in memref<2xvector<4xf32>> is vector<4xf32> instead of just f32. In the context of memory buffers, I think vector is closer to a shape modifier than to an element type. Or not even that! As I tried to suggest in question #3, I think that vector at this level of abstraction should be more like a virtual register that only makes sense for read/write and non-memory operations.

I personally think that it’s better to keep things a bit more generic at this level of abstraction, while being able to represent both scalar and vector loads/stores with affine/non-affine properties, including the non-stride-one vector counterparts.

Those are interesting observations that bring up more questions! Does vector imply consecutive/contiguous elements in memory? For all the dimensions? (I couldn’t find it in the doc). How could we represent a 2D vector load or a gather operation then? For example, it would be interesting to see how we could write a 1D vector version of this:

  func @gather(%in : memref<400x400xf32>) {
    affine.for %i = 0 to 400 {
      // Note step 4
      affine.for %j = 0 to 400 step 4 {
        %ld0 = affine.load %in[%i, %j] : memref<400x400xf32>
      }
    }
    return
  }

Or a 2D vector version of the following, assuming, for instance, that we target a vector memory unit that is able to read 2x4xf32 at once (i.e., two separate 1D vectors of 4-contiguous f32 elements):

  func @2d(%in : memref<400x400xf32>) {
    affine.for %i = 0 to 400 {
      affine.for %j = 0 to 400 {
        %ld = affine.load %in[%i, %j] : memref<400x400xf32>
      }
    }
    return
  }

The first thing that comes to mind to support these cases (please take this with a pinch of salt) is that we could relax the vector abstraction a bit at the affine/standard level, so that we can represent more generic vector operations while keeping affine/non-affine properties and without introducing too many vector-specific ops. Then we could lower that to a more explicit/constrained representation using the vector dialect or hw-specific dialects. Some quick ideas of what this could imply at the affine/std level:

  1. Vector type could be defined as a virtual register so it wouldn’t provide information about contiguity of elements in memory or any other layout information.
  2. (Therefore), vector type wouldn’t be allowed in a memref type.
  3. Memory operations could encode the number of elements to read/write per memref dimension in the op type. An n-D vector type could be used to read as many elements from each memref dimension as elements in each dimension of the op vector type.
  4. Memory operations could be extended to contain stride/index information to be able to model gather operations. This would decouple the virtual vector register abstraction from the contiguous/non-contiguous memory access pattern.
  5. Non-memory operations would remain unchanged.

A potential materialization of the previous points on the gather examples could be:

  func @vec_gather(%in : memref<400x400xf32>) {
    affine.for %i = 0 to 400 {
      // Note 'stride' and affine.load type.
      affine.for %j = 0 to 400 step 16 {
        %ld = affine.load %in[%i, %j] stride[4] : (memref<400x400xf32>) -> vector<4xf32>
      }
    }
    return
  }

Stride could be just a simple map. We could generalize this to be able to represent an arbitrary gather by allowing symbolic vector values as strides in the non-affine version:

    affine.for %i = 0 to 400 step 4 {
      %idxs = affine.load %indexes[%i] : vector<4xindex>
      %ld = load %in[] stride[%idxs] : (memref<400x400xf32>, vector<4xindex>) -> vector<4xf32>
    }

For the 2d example (note the 2D vector type):

  func @vec_2d(%in : memref<400x400xf32>) {
    affine.for %i = 0 to 400 step 2 {
      affine.for %j = 0 to 400 step 4 {
        %ld = affine.load %in[%i][%j] : (memref<400x400xf32>) -> vector<2x4xf32>
      }
    }
    return
  }

I’m pretty sure that I’m missing a few problems that this approach would introduce. However, if this direction sounds interesting enough, I could spend some time investigating it.

Thanks!

This sounds good to me if the ideas above don’t make sense. I guess what I’m still missing is why we need a separate operation to do these castings. I find it a bit confusing that memref_cast is actually used for shape casting while memref_*shape*_cast is used for element type casting (which also impacts the shape). Wouldn’t it make sense to just extend memref_cast to support vector type castings, given that the fact that vector is considered an element type is somewhat arguable?

As far as I know: yes. As a value, a vector is “dense” and intended to match vector registers and vector loads/stores (this should answer your previous question about “why the element type of a memref<8xf32> is f32 and in memref<2xvector<4xf32>> is vector<4xf32> instead of just f32”).

If I understand correctly you want

    affine.for %i = 0 to 400 step 2 {
      affine.for %j = 0 to 400 step 4 {
        %ld = affine.vector_load %in[%i][%j] : (memref<400x400xf32>) -> (vector<2x4xf32>)

Of course the vector_load does not exist, but I think this represents the semantics you’re after, so it is just a matter of creating the right op.
The closest you can use to model this with the ops in MLIR at the moment is, I think, std.dma_start (https://mlir.llvm.org/docs/Dialects/Standard/#dma_start-operation), which allows you to use a “vector memory unit” to load the data. However, it’ll produce a memref and you’d need to load from that memref to get the vector.

I don’t understand why the current vector layout in memory is a problem for you. You want a different kind of load to create your vector (like the dma_start), but this seems orthogonal to “what does a vector represent when used as the element type of a memref”.
I think it is still useful to be able to capture that the unit of logical addressing is a consecutive vector in memory for a memref (which guarantees simple loads/stores without gather/scatter). You don’t have to structure your memref this way if you don’t want to, though.

Oh looks like just what I wrote for you above, so we’re on the same page here :slight_smile:

Actually I forgot to look into the vector dialect, and this may be providing the right load already: 'vector' Dialect - MLIR

The lowering for this one perhaps isn’t the desired one - AFAIR, the vector transfer read lowers to a loop loading the elements one by one into the vector. Creating a memref of a vector elt type, however, is natural for the purpose of transparently reusing the std-to-llvm load/store lowering to get vector elt types.

The memref_cast was originally just intended to cast one or more static dimensions to dynamic ones or vice versa - it was just to hide/unhide a memref’s static dimensions for the purposes of escaping uses where a shape erasure was desired – it is not the C-style cast or similar general type casting you are thinking of. An extension in that direction would make the existing canonicalization patterns on it messy (take a look at shape folding, for example) - i.e., you’d have to perform more checks on the memrefs there to see if the memref casting is compatible with those canonicalizations. Reg. the memref_shape_cast I mentioned, better names could actually be memref_vectorize or memref_vectorize_cast, since it’s not really a general cast. But I believe having separate ops for the very different functionality these provide would make all the transformation infrastructure around them simpler. Otherwise, in theory, at an extreme, one could bundle all of these memref-to-memref conversion operations (view, subview, memref_cast, memref_shape_cast) into a single memref_cast!
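
For reference, the kind of conversion memref_cast was designed for just hides or recovers a static dimension without touching the element type:

    %dyn = memref_cast %in_out : memref<21xf32> to memref<?xf32>   // erase the static size
    %sta = memref_cast %dyn : memref<?xf32> to memref<21xf32>      // assert it back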

I think we should separate the semantics of an operation from its lowering on a particular target as it is implemented today. I would rather ask in terms of modeling: if you have a target with a 2-D vector memory unit (as @dcaballe was asking above), does vector.transfer_read model what you want, and can you lower it to the right set of intrinsics for this target?

Could you fix this snippet by adding more to the trailing type list of affine.load? I think you intend to say that affine.load should be generalized to provide not just the memref elt types but vectors of those along multiple dimensions including with strides?

Actually, it doesn’t - it doesn’t support strides or the indexing @dcaballe was asking for - but just vectors along a specified order for the dimensions. It could be extended though by including scaling factors in what is meant to be a permutation map.

Right, it’s missing strides, I stopped at the 2D example loading slices of 2x4xf32 which does not require strides I believe.

Besides the strides part, there is in general a design space to explore here based on what @dcaballe is suggesting. Instead of extending affine.load/store to handle the kind of thing vector.transfer_read/write want to model, one could have affine.vload/vstore with maps to carry both permutation and striding info - the info saying where, and with what stride, the vector dimensions are coming from. We need to see how the duplication with vector.transfer_read/write could be avoided, or whether these should lower to vector.transfer_read/write. We’d want utilities like store-to-load forwarding and dependence information to work with these, and so it’s ideal to implement them as load/store ops with a load op / store op interface while keeping them separate ops. That way, in the printed form, the vector elt type and the vector map info would only appear for affine.vload/vstore.
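
As a purely hypothetical sketch (op name and attribute invented here), the strided example from earlier in the thread could then be written along these lines:

    // Hypothetical affine.vload: the trailing map would say which memref dimension
    // each vector dimension comes from and with what stride (here: dim 1, stride 4).
    %v = affine.vload %in[%i, %j] { vector_map = affine_map<(d0, d1) -> (d1 * 4)> }
        : memref<400x400xf32>, vector<4xf32>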

I don’t see, though, why vector types should be disallowed in memrefs. They would just work in conjunction, and they do provide more information, let’s say at the allocation/def site of such vector memrefs, as well as the guarantee that a single element of such a memref is actually contiguous in the physical space.

Coming a bit late to the party, here are some extra comments that have not been answered before.

Philosophically, there is a lot of similarity between this and the vector.transfer_read/write operations (as Mehdi pointed out). I didn’t go as far as requiring that memref<vector<...>> be disallowed though.

What would you suggest this “virtual register” would lower to?
I believe there are some caveats here, see this part of the discussion in the Vector dialect doc.

This would be a welcome addition / extension. I can definitely see this fold into the vector.transfer ops. Note that the vector.transfer abstraction also wants to use padding for the edge case.

Fully agreed, using an n-D vector type as a value is quite orthogonal to loading/storing from memory.
The rubber hits the road when one wants to index with a non-static index. That is where putting stuff in memory is necessary (the alternative of using vector.insertelement and the LLVM equivalent is painful at best; referring back to the deeper dive in the vector dialect).

Right, when it was first created there was no vector dialect so the lowering was just a naive scalar copy + cast. Now the 1-D case (in LLVM) lowers to masked load/stores and I am working on the n-D case. The next step will be to retire the scalar copies and rework the permutation map to emit vector.transpose and vector.broadcast ops that will lower to the llvm.matrix intrinsics.

Adding striding/gather/scatter semantics to the mix will be very useful.

+1 on this: complexity of usage really matters IMO and each op can be viewed as a projection on the solution space. It significantly simplifies the problem to deal with special cases. After all, the only thing we are doing is injecting static behavior to make the problem more tractable and the generated code less dynamic.

Thanks for the comments, very clarifying! Let me recap a few things and limit the scope of the discussion a little :). My main goal is to represent affine loop nests for inputs that are already in vector form (think OpenCL, ISPC, etc.) and be able to apply affine optimizations on them (loop interchange, unrolling, fusion, etc.). These vector operations may include any flavor of vector memory operation (contiguous, strided, gather/scatter, etc.). Some of these memory operations may be affine, some others may not.

Discussion points:

  1. We need to cast from scalar memrefs to vector memrefs and from/to vector memrefs with a different number of elements.
  2. We need to represent vector memory operations in a way that affine analyses can reason about what these memory operations are doing so that affine optimizations can be applied.
  3. Let’s leave the memref vector type discussion for later (I think it will come up again when we make progress on striding/gather/scatter for #2).
  4. Let’s leave 2D vectorization for another day :slight_smile:

Reg #1, introducing a new memref casting op sounds good to me if the casting is needed. Thanks for clarifying, @bondhugula!

Reg #2…,

Yes! That’s what I was trying to describe. The examples above should make more sense now. I had looked at vector.transfer_read/write but I thought we would need an affine counterpart to separate the vector reads/writes that are affine from those that are not (similar to affine.load/store vs std.load/store). If we can reuse/extend vector.transfer_read/write to represent affine and non-affine vector memory operations and teach affine analyses and optimizations how to deal with them, I’m totally fine with that! However, I think it could be too much overloading for a single operation, especially if we take the generic gather/scatter into account. That’s why I suggested having affine and non-affine flavors.

I’ll come back to the more philosophical questions once these two points are a little clearer.

@dcaballe All of this makes sense and sounds great; I’ll be quite interested in this line of work.

I’m playing with the vector.transfer_read/write and I’ll try to come back with some examples. Answering some of the previous questions in the meantime:

Yep. Note that there were two reasons to suggest disallowing vector types in memrefs: 1) the more philosophical one, and 2) to deal with the problem of vector loads/stores without having to introduce new memref cast ops. Introducing a new cast op seems promising, so we can talk about #1 later over some examples.

Thanks for the link! Very interesting discussion! I fully agree that trying to lower something so complex as a multi-dimensional vector to a “canonical form” in LLVM that you can later fold back to its original semantics is challenging, if not unfeasible in some cases. Something similar happens with complex (and not so complex) vector idioms that do not have a native representation in LLVM. Luckily, the matrix type should make things easier for the n-D vector problem.

Note that the “virtual register” suggestion is not too different from what we have right now, or from what we have in LLVM. It’s more about decoupling vector registers a bit more from memory (which it seems we all agree on, based on the comment below). Again, something philosophical that we can discuss later.

I think we all agree on this! I would even be more specific: an n-D vector type as a value is quite orthogonal to how its data is loaded/stored from memory, which may include contiguous, strided or gather/scatter vector memory ops, scalar loads/stores + insertion/extraction ops, multiple contiguous loads + shuffle/shifting ops, etc.

I gave vector.transfer_read/write a try in a few examples and I think it’s doing what I’m looking for. Interestingly, these ops don’t require a vector memref as input, so even the memref vector casting doesn’t seem necessary. The width of the vector load/store is taken from the result type (I’m sorry, now I feel I was inadvertently trying to reinvent the wheel!). This is what my initial example looks like now:

  #identity_map = affine_map<(d0) -> (d0)>

  func @vec_transfer_test(%in_out : memref<21xf32>) {
    %cf0 = constant 0.0 : f32
    %c16 = constant 16 : index
    %c20 = constant 20 : index

    // Process 16 elements, 8 elements at a time: potentially lowered to YMM ops/regs.
    affine.for %i = 0 to 16 step 8 {
      %ld8 = vector.transfer_read %in_out[%i], %cf0
        { permutation_map = #identity_map }
        : memref<21xf32>, vector<8xf32>
      %add8 = addf %ld8, %ld8 : vector<8xf32>
      vector.transfer_write %add8, %in_out[%i]
        { permutation_map = #identity_map }
        : vector<8xf32>, memref<21xf32>
    }

    // Process 4 elements: potentially lowered to XMM ops/regs.
    %ld4 = vector.transfer_read %in_out[%c16], %cf0
      { permutation_map = #identity_map }
      : memref<21xf32>, vector<4xf32>
    %add4 = addf %ld4, %ld4 : vector<4xf32>
    vector.transfer_write %add4, %in_out[%c16]
      { permutation_map = #identity_map }
      : vector<4xf32>, memref<21xf32>

    // Process 1 element: Scalar ops/regs.
    %ld1 = affine.load %in_out[%c20] : memref<21xf32>
    %add1 = addf %ld1, %ld1 : f32
    affine.store %add1, %in_out[%c20] : memref<21xf32>
    return
  }

As @bondhugula pointed out, the lowering to LLVM could be improved. For my example, vector transfer ops are lowered to masked loads/stores, where disabled lanes are populated with the padding value %cf0. However, after some LLVM optimizations, masked loads/stores are turned into unmasked loads/stores:

  define void @vec_transfer_test(float* nocapture readnone %0, float* %1, i64 %2, i64 %3, i64 %4) local_unnamed_addr #0 !dbg !3 {
    %6 = bitcast float* %1 to <8 x float>*, !dbg !7
    %unmaskedload1 = load <8 x float>, <8 x float>* %6, align 1, !dbg !9
    %7 = fadd <8 x float> %unmaskedload1, %unmaskedload1, !dbg !10
    store <8 x float> %7, <8 x float>* %6, align 1, !dbg !11
    %8 = getelementptr float, float* %1, i64 8, !dbg !12
    %9 = bitcast float* %8 to <8 x float>*, !dbg !7
    %unmaskedload2 = load <8 x float>, <8 x float>* %9, align 1, !dbg !9
    %10 = fadd <8 x float> %unmaskedload2, %unmaskedload2, !dbg !10
    store <8 x float> %10, <8 x float>* %9, align 1, !dbg !11
    %11 = getelementptr float, float* %1, i64 16, !dbg !13
    %12 = bitcast float* %11 to <4 x float>*, !dbg !14
    %unmaskedload = load <4 x float>, <4 x float>* %12, align 1, !dbg !15
    %13 = fadd <4 x float> %unmaskedload, %unmaskedload, !dbg !16
    store <4 x float> %13, <4 x float>* %12, align 1, !dbg !17
    %14 = getelementptr float, float* %1, i64 20, !dbg !18
    %15 = load float, float* %14, align 4, !dbg !19
    %16 = fadd float %15, %15, !dbg !20
    store float %16, float* %14, align 4, !dbg !21
    ret void, !dbg !22
  }

If this optimization doesn’t happen for more complex scenarios, I guess we could make padding optional (I saw some examples without padding in the documentation but they don’t seem to be working right now) and lower transfers without padding directly to unmasked vector loads/stores.

Thoughts?

I’m now looking at what is needed to enable something like affine fusion in the presence of these vector transfer ops.

Hi Diego,

It is great that this seems close enough to a useful abstraction for your use case.

From a pure LLVM lowering for 1-D vector CPUs I was thinking of adding an in_bounds ArrayAttribute that would specify which dimensions are statically known to be in bounds and just emit unmasked load/stores along those.

Such an attribute can either be set declaratively e.g. in the case of higher-level lowering to vectors (e.g. Linalg) or can be discovered/canonicalized with passes such as the if-hoisting / loop unswitching that Uday added recently.
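
On the earlier vec_transfer_test example, this could look something like the following (just a sketch of the proposed attribute, not existing syntax):

    // Hypothetical in_bounds attribute: the vector dimension is statically known to
    // stay within the memref bounds, so an unmasked 8-wide load can be emitted.
    %ld8 = vector.transfer_read %in_out[%i], %cf0
      { permutation_map = #identity_map, in_bounds = [true] }
      : memref<21xf32>, vector<8xf32>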

Do you see a use for such an attribute on your end?
We prob. also need a bunch of extra canonicalizations.

Lastly, note that vector_transfer plays nicely with other vector operations for the purpose of unroll-and-jam following SSA use-def chains.

Lurking behind all this is the question I keep coming back to: if we want to not unroll too much and use loops, indexing into vectors becomes dynamic (i.e. dependent on the IV variable) and we have to go back to some memory form. On a retargetability side, things to consider (even if you prefer to keep them under the rug for now :slight_smile: ):

  • n-D vectors for HW where going to scalar is totally prohibitive and padding with neutral + doing useless work is much better.
  • n-D vectors for GPUs where the masked / predicated abstraction works really well and is a quick way to get good perf.

All this cycles back to the “deep dive” lowering to LLVM and there are implications on memory alignment in the n-D case. As more people poke at it changes may be needed.

As far as composing with affine is concerned, I imagine a version that embeds affine maps into the op itself is probably the preferred path to play nicely with existing implementations. In this case we should try to make the affine semantics additive and make sure the core of the op can be reused everywhere.
To some extent, note that this op could probably already be considered affine if the indexings were verified to be exactly affine_apply/dim/sym/constant (bonus points for ensuring an at-most-length-1 chain of affine_apply and keeping that property by construction), but I speculate the implementations of transformations may not like such inference and would prefer to have the op statically constructed with explicit affine map attributes.