[Vector] Vector distribution (large vector to small vector)

I think two problems are causing confusion.

  • The unit test doesn’t preserve the semantics, since the test pattern does an arbitrary transformation (code below) that isn’t meant to be semantically correct but only simulates part of the transformation that would be done to break the vector up into smaller parts. I don’t think this is really needed and we could start with the IR already transformed. This way the ID wouldn’t come out of nowhere and we would just be testing the propagation patterns.
      func.walk([&](AddFOp op) {
        OpBuilder builder(op);
        // Splits the pointwise op: builds an extract_map, a clone of the op
        // on the smaller vector type, and an insert_map back to the large
        // vector type.
        Optional<mlir::vector::DistributeOps> ops = distributPointwiseVectorOp(
            builder, op.getOperation(), func.getArgument(0), multiplicity);
        if (ops.hasValue()) {
          // Redirect every use of the original result to the insert_map,
          // except the extract_map that feeds the distributed op itself.
          SmallPtrSet<Operation *, 1> extractOp({ops->extract});
          op.getResult().replaceAllUsesExcept(ops->insert.getResult(),
                                              extractOp);
        }
      });
  • The semantics of insert_map are not well defined. The idea of using extract/insert is to be able to propagate the lowering of the large vectors to small vectors incrementally (@nicolasvasilache explained the benefits of this approach). To be able to do incremental lowering we rely on the fact that, in general, executing instructions several times doesn’t change the result as long as there are no side effects. I think the problem is that right now we start by breaking up the arithmetic instruction, and as @mehdi_amini pointed out this means we end up with an intermediate large transfer_write that isn’t correct until we fold it with the insert_map (see the sketch right after this list). I think to solve this we should start from the transfer_write and propagate up.
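
To make the problem concrete, here is a rough sketch of the intermediate IR we get when the arithmetic instruction is broken up first (an assumed reconstruction, not actual pass output, and the insert_map syntax here is only illustrative since its semantics are the unclear part):

  %exta = vector.extract_map %a[%id : 32] : vector<256xf32> to vector<8xf32>
  %extb = vector.extract_map %b[%id : 32] : vector<256xf32> to vector<8xf32>
  %add = addf %exta, %extb : vector<8xf32>
  %acc = vector.insert_map %add[%id : 32] : vector<8xf32> to vector<256xf32>
  // This write still covers all 256 elements even though only the 8 elements
  // for this id are valid, so the IR is incorrect until the write is folded
  // with the insert_map.
  vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>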

Based on that, the transformation stages would look like the ones below.
Original code:

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>

Then create the loop and break up only the transfer_write:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    vector.transfer_write %ext, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }

Propagate the extract_map:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %exta = vector.extract_map %a[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %extb = vector.extract_map %b[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %acc = addf %exta, %extb: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }
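
A minimal sketch of what the propagation pattern could look like for a side-effect-free elementwise op, moving the extract_map above the addf. The ExtractMapOp accessor names (vector(), id(), multiplicity()) and builder signature are assumptions for illustration, not the actual upstream API:

  // Rewrites extract_map(addf(a, b)) into addf(extract_map(a),
  // extract_map(b)), pushing the extract closer to the transfer_reads.
  struct PropagateExtractMap : public OpRewritePattern<vector::ExtractMapOp> {
    using OpRewritePattern<vector::ExtractMapOp>::OpRewritePattern;

    LogicalResult matchAndRewrite(vector::ExtractMapOp extract,
                                  PatternRewriter &rewriter) const override {
      auto addf = extract.vector().getDefiningOp<AddFOp>();
      if (!addf)
        return failure();
      Location loc = extract.getLoc();
      // Extract the same slice from each operand of the large addf.
      Value lhs = rewriter.create<vector::ExtractMapOp>(
          loc, extract.getType(), addf.lhs(), extract.id(),
          extract.multiplicity());
      Value rhs = rewriter.create<vector::ExtractMapOp>(
          loc, extract.getType(), addf.rhs(), extract.id(),
          extract.multiplicity());
      // The small addf replaces the extract; the large addf becomes dead
      // once all of its uses are rewritten.
      rewriter.replaceOpWithNewOp<AddFOp>(extract, lhs, rhs);
      return success();
    }
  };

Because the addf has no side effects, duplicating the extract per operand is safe, which is exactly the property the incremental lowering relies on.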

Fold the extract_map into the transfer_read:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }
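
Viewed in isolation (keeping the convention from the examples above, where %arg5 is already the element offset), the fold on the read side is the local rewrite:

  // Before:
  %a   = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %ext = vector.extract_map %a[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  // After: the read starts at the id's offset and only loads 8 elements.
  %ext = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>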

This way the semantics are preserved at every stage of the transformation and we don’t need an insert_map anymore.
It wouldn’t work for instructions with side effects that return a value (like atomics); those would have to be handled differently.

The first stage would require analysis to make sure we don’t have synchronization problems, but right now we are really trying to build the infrastructure, so we start from the assumption that the first transformation is done. As Nicolas mentioned, the plan is to make it work with N-D vectors and to support different distribution schemes, and that’s what I’m planning to concentrate on once we agree that we have a solid basis.

If this makes sense to you I’ll send a patch to make the lowering behave as described.

You are right, I was representing the multiplication by 8 with an affine.apply, but this is obviously better.
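
For reference, this is an assumed reconstruction of what my earlier version looked like, with the loop running over ids and an affine.apply materializing the multiplication by 8:

  scf.for %id = %c0 to %c32 step %c1 {
    %off = affine.apply affine_map<(d0) -> (d0 * 8)>(%id)
    %a = vector.transfer_read %in1[%off], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%off], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%off]: vector<8xf32>, memref<?xf32>
  }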

Hopefully the explanation above clarifies this part. The idea is that it is still correct to load those in the loop, and the canonicalization should break it up so that each iteration only loads what it needs.

This is a typo, I’m editing my post.