LLVM Discussion Forums

[Vector] Vector distribution (large vector to small vector)

I have just started experimenting with vector distribution. Based on code review comments this needs some discussion, so I’m starting this thread so that we can discuss the design of this kind of transformation.

Disclaimer: This is highly experimental and many things are likely to change. The idea is to start experimenting with basic transformations in order to find problems early and iterate until we reach something we are happy with.

As @nicolasvasilache explained in the ODM on vectors, there are benefits to representing the program with large vectors (much larger than what the target supports). This expresses the dependencies as SSA values, which makes the analysis simpler and allows us to later decide what should be demoted to memory and what should stay in registers.

One of the challenges with this is that we need to break up those large vectors incrementally during codegen to eventually map to the native size the HW supports. The distribution could be done in many ways: across different threads, serialized in a loop, or unrolled.

To break up the vector, the current direction I’m experimenting with is to use transient operations called extract_map/insert_map (the names are still a work in progress; the ops will most likely have to use an affine_map to generalize the transformation to N-D vectors) and to propagate those operations through the SSA chain until we reach the memory access operations.

For instance if we have:

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>

and we want to distribute it to a serial loop such as:

  scf.for %arg5 = %c0 to %c32 step %c1 {
    %idx = affine.apply #map0(%arg5)
    %a = vector.transfer_read %in1[%idx], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%idx], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%idx]: vector<8xf32>, memref<?xf32>
  }

To avoid having to do the rewrite all at once, we use vector.extract_map/insert_map to do the conversion incrementally:

  scf.for %arg5 = %c0 to %c32 step %c1 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %ins = vector.insert_map %ext, %arg5, 32 : vector<8xf32> to vector<256xf32>
    vector.transfer_write %ins, %out[%c0]: vector<256xf32>, memref<?xf32>
  }

Then we propagate: we merge the vector.insert_map with the vector.transfer_write, match elementwise operations with vector.extract_map, and eventually fold vector.extract_map into vector.transfer_read.
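For instance, merging the insert_map with the transfer_write in the loop above would produce a single narrower write (a sketch; the index arithmetic relating %arg5 to the write offset, i.e. the affine.apply from the target loop, is elided as %idx):

  // Before folding:
  %ins = vector.insert_map %ext, %arg5, 32 : vector<8xf32> to vector<256xf32>
  vector.transfer_write %ins, %out[%c0]: vector<256xf32>, memref<?xf32>

  // After folding, only the extracted chunk is written, at the symbolic offset:
  vector.transfer_write %ext, %out[%idx]: vector<8xf32>, memref<?xf32>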

Obviously we need some analysis to make sure the overall transformation is correct, and this will have to be done at some point. Currently I’m working on creating the basic patterns that will combine to produce the transformations we want.

I hope this helps to explain the rationale behind these first patterns. Feel free to join the discussion if you have ideas on alternative solutions.

FYI @joker-eph

I’m not super comfortable with the current extract_map / insert_map because the IR semantics isn’t clear to me at these stages. It seems like extract_map / insert_map delimits some sort of region of a data flow graph that would be “mapped” onto the vector elements.

As I mentioned in the review, I don’t quite get why we don’t model this with regions, for example. It seems that mapping a computation over a vector fits more naturally in MLIR this way.

I am unclear what “stages” you are referring to. These ops are intermediate abstractions that are necessary to build simpler and more composable abstractions when performing vector transformations. This is similar to how other vector transformations and lowerings are implemented (see e.g. how vector.extract_slices / vector.extract_strided_slices / vector.tuple_get and their insert equivalents interact/canonicalize/fold to implement n-D vector unrolling and propagate to vector.transfer operations).

The names of the ops can indeed be improved: they betray the fact that we are considering using affine_map as part of the op semantics. But we are keeping the introduction of an affine_map attribute (and its associated complexity) for when we absolutely need it. Please feel free to propose better names.

This is incorrect. This is a similar abstraction to what I described above, with the exception that vector.extract_slices / vector.extract_strided_slices / vector.tuple_get all take static offsets. These new ops allow using dynamic offsets and canonicalize / transform quite differently from the existing ones.
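For instance (a sketch of the contrast, using the closely related vector.extract_strided_slice for the static case):

  // Static offsets, fixed at transformation time:
  %s = vector.extract_strided_slice %v {offsets = [8], sizes = [8], strides = [1]}
    : vector<256xf32> to vector<8xf32>
  // Dynamic (symbolic) offset, resolved only at the memory operations:
  %d = vector.extract_map %v[%id : 32] : vector<256xf32> to vector<8xf32>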

You’re probably referring to parallel execution semantics, which is orthogonal to these ops; they are exercised in a sequential setting. When mapping to parallel hardware becomes involved, it will make sense to consider regions.

A computation is not mapped onto a vector. What you seem to be referring to is what “Linalg on vectors” would be (and yes, regions are involved), but we are quite far from that yet.

I see very little connection between the names and terms being used and what the op appears to be doing. Instead of insert_map / extract_map, did you mean insert_chunk/slice or extract_chunk/slice, or something like get_chunk / set_chunk? There are no maps being handled, nor is there really any mapping.

Distribution is also not apt here: there’s no loop distribution nor a strong connection to partitioning/distribution. The terms that I can think of are chunking, tiling, or devectorization. You are really making the ops “fine-grained” here.

Besides this, I think the op’s doc description needs to be fixed for clarity as well; I posted some comments after the revision had been committed.

I have a few questions before I can fully comment on the new operations. As stated before, I also don’t think you should call this distribution, as that has a specific meaning in the restructuring-compiler world.

In your first step, it would have helped if you had clarified the map, which you omitted.

scf.for %arg5 = %c0 to %c32 step %c1 {
   %idx = affine.apply #map0(%arg5)
   ....

Given the context of breaking up the super vector into chunks of 8, I assume this really would be the following, right?

scf.for %idx = %c0 to %c256 step %c8 {
   ....

Since this is a relatively simple transformation, I am not sure what you mean by “rewrite all at once”. In your proposal, I find the following unclear, for example:

scf.for %arg5 = %c0 to %c32 step %c1 {
  %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>

Why read 256 elements in a loop now? Is that just an intermediate step?

Also, is the following correct? Where does the 64 suddenly come from?

acc = addf %a, %b: vector<64xf32>

I’m referring to the fact that these seem to be used as “intermediate steps” in the lowering, but in a way that isn’t obviously semantically correct to me.

I can’t understand the semantics (and so the correctness) of going from:

%a = vector.transfer_read %A[%c0]: memref<32xf32>, vector<32xf32>
%b = vector.transfer_read %B[%c0]: memref<32xf32>, vector<32xf32>
%c = addf %a, %b: vector<32xf32>
vector.transfer_write %c, %C[%c0]: memref<32xf32>, vector<32xf32>

to:

%a = vector.transfer_read %A[%c0]: memref<32xf32>, vector<32xf32>
%b = vector.transfer_read %B[%c0]: memref<32xf32>, vector<32xf32>
%ea = vector.extract_map %a[%id : 32] : vector<32xf32> to vector<1xf32>
%eb = vector.extract_map %b[%id : 32] : vector<32xf32> to vector<1xf32>
%ec = addf %ea, %eb : vector<1xf32>
%c = vector.insert_map %ec, %id, 32 : vector<1xf32> to vector<32xf32>
vector.transfer_write %c, %C[%c0]: memref<32xf32>, vector<32xf32>

This IR, without more context, reads imperatively, and so we go from executing a single addf on a vector<32xf32> to executing a single addf on a vector<1xf32>.
(this is the doc for the operation itself: https://mlir.llvm.org/docs/Dialects/Vector/#vectorextract_map-mlirvectorextractmapop )

Since you’re saying that my description of a subgraph implicitly mapping a computation onto the vector is incorrect, it is really unclear to me what the content of the vector produced by vector.insert_map is.
It seems to create 32 elements “out of thin air” from an SSA value containing a single element.

Ah ok I get the issue now, thanks!

@ThomasRaoux can we make the doc and the test use a loop that iterates over the multiplicity?

The mechanisms that we are proposing here are building blocks.
The aspects related to parallelism semantics and data-ownership + mapping are out of scope for now and will indeed most likely involve ops with regions.

I think this may be trying to put too much in a single explanation.

I’ll try to break it down, while explaining the bigger picture, as there are multiple things happening here:

  1. The introduction of new ops to extract/insert m-D vectors out of n-D vectors. The particularity of these ops is that:
    a. they want to extract data that starts at symbolic offsets within the n-D vector.
    b. they will want to specify a distribution scheme (block or cyclic for now).
    c. in the future, they will want to specify a symbolic multiplicity but this will require some type changes / extension (think scalable vectors [and forget about it at once because we are not discussing this now]).

  2. Canonicalization and folding patterns that know how to propagate through other vector ops all the way to the source vector.transfer_read and vector.transfer_write ops (see the sketch after this list). It is necessary to propagate all the way to the memory address calculation because this is the place where symbolic offsets resolve for n-D vectors. On the way to memory ops there are pointwise vector ops but also different flavors of insert/extract, permutation, reshape, contraction/structured ops, etc. Small composable IR makes it tractable to progressively propagate and lower through these ops without having to resort to complex C++ logic to keep track of things.

  3. Transformations that make use of points 1. and 2. above, among which:
    a. adding a loop to reduce vector size. This trades off register pressure and IR size for ILP. As an interesting (internal) use case, there are some usages involving very large native vector sizes that want LLVM emulation; without going back to loops we see timeouts on hundreds of thousands of lines of LLVM IR.
    b. in the GPU / SPIR-V case: connecting non-cooperative operations with cooperative operations, v4 loads, and more generally starting to use the vector abstraction to map to GPU threads. This is the part that requires region-based modeling, as @joker-eph raised, but also probably some sort of data-ownership model: this very much involves “distribution” (but not loop fission, as per @aartbik’s comment :wink: ).
    c. lowering to scalable vectors (future).
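As a sketch of the propagation in point 2 (1-D case, mirroring the examples in this thread), an extract_map fed by a wide transfer_read folds into a single narrower read at the symbolic offset:

  %a = vector.transfer_read %in[%c0], %cf0 : memref<?xf32>, vector<256xf32>
  %e = vector.extract_map %a[%id : 32] : vector<256xf32> to vector<8xf32>
  // ... folds into:
  %e = vector.transfer_read %in[%id], %cf0 : memref<?xf32>, vector<8xf32>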

In addition to the breakdown above, there are interactions with how other transformations/progressive lowerings are written, and opportunities for unification with other existing foldings + new vector interfaces.

For now, we are focusing on points 1.a. and 1.b in the 1-D vector case for 1-1 and 1-N multiplicity, which is what these operations are about. Writing the execution unit test is a bit tricky because 3.a. is a bit more difficult to explain in a standalone setting, but hopefully people interested in the details can see the gradient direction.

Now that the general vision is laid out, please propose better names for vector.extract_map / vector.insert_map, and note that the block/cyclic specification may or may not be written using affine_maps :slight_smile: So far I like vector.insert/extract_chunk and would propose vector.insert/extract_block to be reminiscent of the block/cyclic mapping we wish to make available.

Thanks!

I’m happy to talk about the name, but I’d like to see a complete piece of IR for which I can understand the semantics.
At the moment we have a pass in-tree that still performs a transformation that I can’t see as semantics-preserving; it goes directly from:

func @distribute_vector_add(%id : index, %A: vector<32xf32>, %B: vector<32xf32>) -> vector<32xf32> {
  %0 = addf %A, %B : vector<32xf32>
  return %0: vector<32xf32>
}

into:

  func @distribute_vector_add(%arg0: index, %arg1: vector<32xf32>, %arg2: vector<32xf32>) -> vector<32xf32> {
    %0 = vector.extract_map %arg1[%arg0 : 32] : vector<32xf32> to vector<1xf32>
    %1 = vector.extract_map %arg2[%arg0 : 32] : vector<32xf32> to vector<1xf32>
    %2 = addf %0, %1 : vector<1xf32>
    %3 = vector.insert_map %2, %arg0, 32 : vector<1xf32> to vector<32xf32>
    return %3 : vector<32xf32>
  }

(based on a “distribution-multiplicity=32” option).

We have a function whose interface does not change (it takes vectors of 32xf32 and returns a full vector<32xf32>), but the body, which was performing a single 32xf32 add, is now performing only a single 1xf32 add to produce the entire vector.

Agreed, this is incorrect; we discussed it offline last week with @ThomasRaoux, and it is fixed in https://reviews.llvm.org/D89291. I will review tomorrow.

FYI, at the moment this revision apparently does not change the example I mentioned.

I think two problems are causing confusion.

  • The unit test doesn’t preserve the semantics, as the test pattern does an arbitrary transformation (code below) that isn’t meant to be semantically correct but just simulates part of the transformation that would be done to break up the vector into smaller parts. I don’t think this is really needed, and we could start with the IR already transformed. This way the ID wouldn’t come out of nowhere and we would just be testing the propagation patterns.
      // For each addf, rewrite it as extract_map + addf + insert_map and
      // redirect all other uses of the original result to the insert_map.
      func.walk([&](AddFOp op) {
        OpBuilder builder(op);
        Optional<mlir::vector::DistributeOps> ops = distributPointwiseVectorOp(
            builder, op.getOperation(), func.getArgument(0), multiplicity);
        if (ops.hasValue()) {
          SmallPtrSet<Operation *, 1> extractOp({ops->extract});
          op.getResult().replaceAllUsesExcept(ops->insert.getResult(),
                                              extractOp);
        }
      });
  • The semantics of insert_map are not well defined. The idea of using extract/insert is to be able to propagate the lowering from large vectors to small vectors incrementally (@nicolasvasilache explained the benefits of this approach). To be able to do incremental lowering, we rely on the fact that in general executing instructions several times doesn’t change the result as long as there aren’t any side effects. I think the problem is that right now we start by breaking up the arithmetic instruction, and as @joker-eph pointed out, this means we end up with an intermediate large transfer_write which isn’t correct until we fold it with the insert_map. I think to solve this we should start from the transfer_write and propagate up.

Based on that, the transformation stages would look as below.
Original code:

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>

Then create the loop and break up only the transfer_write:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    vector.transfer_write %ext, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }

Propagate the extract_map:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %exta = vector.extract_map %a[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %extb = vector.extract_map %b[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %acc = addf %exta, %extb: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }

Fold the extract_map into the transfer_read:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }

This way the semantics are preserved at every stage of the transformation and we don’t need an insert_map anymore.
It wouldn’t work for instructions with side effects that return a value (like atomics); those would have to be handled differently.

The first stage would require analysis to make sure we don’t have synchronization problems, but right now we are really trying to build the infrastructure, starting from the assumption that the first transformation is done. As Nicolas mentioned, the plan is to make it work with N-D vectors and support different distribution schemes, and that’s what I’m planning to concentrate on once we agree that we have a solid basis.

If this makes sense to you I’ll send a patch to make the lowering behave as described.

You are right, I was representing the multiplication by 8 with an affine.apply, but this is obviously better.

Hopefully the explanation above clarifies this part. The idea is that it is still correct to load those in the loop, and the canonicalization should break it up so that each iteration only loads what it needs.

This is a typo, I’m editing my post.

It is common for transformations to go through intermediate state with invalid IR.

The issue here is that we are trying to make the integration test minimal in order to test exactly the canonicalization patterns. To do this we expose IR in a transient state that should not be visible.

I recommend going the way Thomas has proposed but keeping the insert operation.
The testing pass gets closer to step 3.a., which we wanted to keep for when we have a better understanding of the different cases: (a) n-D to m-D, (b) 1-K multiplicity, (c) mixed block / cyclic.
But now that Pandora’s box has been cracked open we may as well just go for it.

Initial IR

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32> // + some attribute or other mechanism to trigger the test pass 
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>

Internal step 1

Take the program slice around the target addf, splice it into a loop, and rewrite the add as add + extract + insert (ignore aliasing issues for now, as we are in a test pass and should control the testing environment):

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %val = vector.insert_map %ext, %acc[%arg5 : 32] : vector<8xf32> to vector<256xf32>
    vector.transfer_write %val, %out[%c0]: vector<256xf32>, memref<?xf32>
  }


[Bunch of internal states from propagating the canonicalization patterns]

Final state

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }

insert op

Regarding the steps that Thomas proposed, this is the same except that the insert op is not dropped. This has a few benefits:

  1. the insert operation is as well defined as the extract operation. This is similar to all other insert/extract pairs in both MLIR and LLVM IR.
  2. we can target any op for distribution and not just the transfer_write. This has similar properties to unrolling.
  3. I still want to unify the underlying mechanisms for canonicalization, vector unrolling, and this transformation, especially when we start seeing higher-D ops such as transpose, contract, etc. Convergence of these pieces of infra is of high importance on my list.

Besides, the following form:

scf.for %arg5 = %c0 to %c256 step %c8 {
  %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %acc = addf %a, %b: vector<256xf32>
  %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  vector.transfer_write %ext, %out[%arg5]: vector<8xf32>, memref<?xf32>
}

is just internal step #2 after propagating the insert_map. Skipping the insert op will actually be much harder in the end because of the W-shaped propagations that will occur in real IR (i.e. program slices that involve multiple sink transfer_writes for which the extract_maps must agree). Thomas is already experiencing this type of behavior with M-shaped propagations: vector.contract propagating to multiple vector.transfer_read ops that have to agree.
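For instance (a hypothetical sketch of such a slice; one producer feeding two sink writes whose extract_maps must use the same offset, with %out0/%out1 made up for illustration):

  %acc = addf %a, %b : vector<256xf32>
  %e0 = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  vector.transfer_write %e0, %out0[%arg5] : vector<8xf32>, memref<?xf32>
  %e1 = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  vector.transfer_write %e1, %out1[%arg5] : vector<8xf32>, memref<?xf32>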

You have to break assumptions while transforming the IR, of course, even if just for the time of remapping operands; however, this would be the first time I have seen the IR itself being designed with loose semantics across transformations, or such state materialized in the IR this way.

This makes me a bit uneasy; I would rather find representations that are self-contained instead (and in particular, MLIR’s expressiveness is intended to be able to capture this!).

And this goes beyond the semantics of vector.extract_map and vector.insert_map themselves: they are in the data flow, but they actually have a spooky effect on the whole region (and more?).
For example, starting with your first step:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %val = vector.insert_map %ext, %acc[%arg5 : 32] : vector<8xf32> to vector<256xf32>
    vector.transfer_write %val, %out[%c0]: vector<256xf32>, memref<?xf32>
  }

Assuming I don’t know much about extract_map and insert_map, and I want to handle them very conservatively/opaquely (assuming any possible side effects), I still see an IR where each iteration overwrites the memref location written by the previous iteration: I should be able to keep only the last iteration’s write to the memref here.

I’m wondering if we can have a more “self-consistent” IR using a structured op, like starting with:

    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %val = vector.map %acc[%arg5 : 32] : vector<256xf32> to %mapped_acc : vector<8xf32> {
       yield %mapped_acc : vector<8xf32>
    }
    vector.transfer_write %val, %out[%c0]: vector<256xf32>, memref<?xf32>

where the vector.map could replace the extract_map/insert_map pair, and you can then incrementally sink the operands of the vector.map into the region until you only have transfer_read operations, at which point it can be made an SIMT kernel operation.
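For illustration, the fully sunk form might look something like this (a hypothetical sketch extrapolating the vector.map syntax above; neither the op nor its syntax exists today, and the index arithmetic producing %idx from %id is elided):

    %val = vector.map [%id : 32] -> vector<8xf32> {
      // each mapped instance reads only its own chunk
      %a = vector.transfer_read %in1[%idx], %cf0 : memref<?xf32>, vector<8xf32>
      %b = vector.transfer_read %in2[%idx], %cf0 : memref<?xf32>, vector<8xf32>
      %sum = addf %a, %b : vector<8xf32>
      yield %sum : vector<8xf32>
    }
    vector.transfer_write %val, %out[%c0] : vector<256xf32>, memref<?xf32>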

I have the same concern, perhaps to a greater extent. While it is completely fine and natural for IR to be invalid in the middle of a pass/utility/transformation (it has to be valid only at the time the utility/pass/transformation completes, which is also when it may hit the verifier), the situation we have here is very different. A well-defined transformation or lowering utility shouldn’t really be generating invalid or semantically incorrect IR (w.r.t. the original), and the fact that FileCheck/verification is being done after such an intermediate semantics-violating transformation step just sounds wrong to me. Those intermediate steps would be expected to be completely internal and private state of the transformation utility. I can understand the need to break things into smaller pieces to aid debugging, but this sounds completely overdone and goes in the wrong direction on other things. Instead, you’d keep such mutations internal and pick the smallest piece that takes you from your input to valid and semantically equivalent IR; that would sort of be your lower bound on the lifetime of the transformation to check input and output on. Also, I have myself never seen such semantics-breaking mutations being isolated this way anywhere in MLIR, which is also why I think this is in the wrong direction.

I really think these patches went in without proper discussion and review and should just be reverted. To start with, I think the right/fixed op documentation for insert/extract_map should be posted so that we are on the same page on the semantics. As I pointed out in comments after the revision was accepted, the current description really does not describe the semantics well and has to be rephrased.

This is incorrect: the ops’ semantics resemble LLVM’s extractelement / insertelement. In the limit of 1-D + multiplicity 1, the ops are equivalent. There is no implicit “behavior across transformations”: there is a closed, self-contained testing pass whose internal state is transient.

Why would you want to open up the internals of a testing pass?

That is certainly a valid proposition and does resemble preliminary ideas about a vector.generic.
Still, we are not there yet but if you want to write a proposal and contribute to the vector dialect, this would be very welcome :slight_smile:

Alternatively, for the purpose of entertaining the thought that all intermediate internal steps of a test pass must be valid and correctly executing IR, we could also make the first internal step of the test pass be:

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
%res = scf.for %arg5 = %c0 to %c256 step %c8 iter_args(%carried_acc = %acc) -> (vector<256xf32>) {
  %ext = vector.extract_map %carried_acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  %val = vector.insert_map %ext, %carried_acc[%arg5 : 32] : vector<8xf32> to vector<256xf32>
  scf.yield %val : vector<256xf32>
}
vector.transfer_write %res, %out[%c0]: vector<256xf32>, memref<?xf32>

Then the test pass could iteratively pull producers / consumers in and have only valid and correctly executing intermediate IR steps. This way, you could stop the test pass at any step and have correctly executing IR. Still, this would impose a requirement on the intermediate IR validity of a test pass that we do not have for core passes.

The way the test has been written indeed exposes internal state of a test pass that should be opaque; this is WIP that should be addressed as part of https://reviews.llvm.org/D89291.

As part of https://reviews.llvm.org/D89291 the doc should also be updated to not use the internal state of a test pass to describe what the op does.

I don’t understand your analogy: isn’t vector.extract closer to extractelement?
Similarly, insertelement takes a vector as input and inserts an element into that existing vector, like vector.insert.
It really isn’t clear what vector.insert_map is inserting into by itself.

I don’t really get it now: can you show the before/after of a transformation where the IR has extract_map?

If the intent is that these operations cannot exist in the IR in between passes, I would rather avoid having them as part of the vector dialect spec and documented “as if” they had well-defined semantics. They would really just be private markers for the internal state of a particular transformation. (In the extreme we could even have their verifier unconditionally fail, but that’d be a drag on testing/development.)

Based on @nicolasvasilache’s feedback, I updated the description and semantics of the insert_map/extract_map ops in https://reviews.llvm.org/D89563.

@joker-eph I think this should address the semantics problem with the insert_map operation. insert_map now takes the original vector as an operand, so it has semantics similar to the insertelement op: it creates a vector based on the original vector and the new, smaller one containing only a range of the values. Since we know the range of IDs, combining insert_map and vector.transfer_write into a vector.transfer_write storing a smaller vector is correct.

The original vector operation is removed only once the insert_map has been cleaned up.

So as Nicolas pointed out the first step is now:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %val = vector.insert_map %ext, %acc[%arg5 : 32] : vector<8xf32> to vector<256xf32>
    vector.transfer_write %val, %out[%c0]: vector<256xf32>, memref<?xf32>
  }

And step 2 is:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %exta = vector.extract_map %a[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %extb = vector.extract_map %b[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = addf %exta, %extb: vector<8xf32>
    %val = vector.insert_map %ext, %acc[%arg5 : 32] : vector<8xf32> to vector<256xf32>
    vector.transfer_write %val, %out[%c0]: vector<256xf32>, memref<?xf32>
  }
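After folding the extract_maps into the transfer_reads, the insert_map into the transfer_write, and removing the now-dead original addf, the expected final form is the small-vector loop shown earlier in the thread (a sketch):

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }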

Thanks: it is much more understandable to me!
So now the only difference with vector.extract/vector.insert is that the offset is an SSA value instead of an attribute?
If so, can you explain a bit how the name _map reflects that?
Or is it that the name should be updated for the new form?
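For reference, a sketch of the contrast being asked about, using the 1-D forms from the examples in this thread:

  // Static position given as an attribute (extracts a scalar here):
  %e = vector.extract %v[3] : vector<32xf32>
  // Symbolic position given as an SSA value, plus a multiplicity:
  %s = vector.extract_map %v[%id : 32] : vector<32xf32> to vector<1xf32>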