[Vector] Vector distribution (large vector to small vector)

I think two problems are causing confusion.

  • The unit test doesn’t preserve the semantics, since the test pattern does an arbitrary transformation (code below) that isn’t meant to be semantically correct but only simulates part of the transformation that would be done to break the vector up into smaller parts. I don’t think this is really needed and we could start with the IR already transformed. This way the ID wouldn’t come out of nowhere and we would just be testing the propagation patterns.
      func.walk([&](AddFOp op) {
        OpBuilder builder(op);
        // Splits the pointwise op: builds an extract_map, a clone of the op
        // on the smaller vector type, and an insert_map back to the large
        // vector type.
        Optional<mlir::vector::DistributeOps> ops = distributPointwiseVectorOp(
            builder, op.getOperation(), func.getArgument(0), multiplicity);
        if (ops.hasValue()) {
          // Redirect every use of the original result to the insert_map,
          // except the extract_map that feeds the distributed op itself.
          SmallPtrSet<Operation *, 1> extractOp({ops->extract});
          op.getResult().replaceAllUsesExcept(ops->insert.getResult(),
                                              extractOp);
        }
      });
  • The semantics of insert_map are not well defined. The idea of using extract/insert is to be able to propagate the lowering of the large vectors to small vectors incrementally (@nicolasvasilache explained the benefits of this approach). To be able to do incremental lowering we rely on the fact that, in general, executing instructions several times doesn’t change the result as long as there are no side effects. I think the problem is that right now we start by breaking up the arithmetic instruction, and as @mehdi_amini pointed out this means we end up with an intermediate large transfer_write that isn’t correct until we fold it with the insert_map (see the sketch right after this list). I think to solve this we should start from the transfer_write and propagate up.
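
To make the problem concrete, here is a rough sketch of the intermediate IR we get when the arithmetic instruction is broken up first (an assumed reconstruction, not actual pass output, and the insert_map syntax here is only illustrative since its semantics are the unclear part):

  %exta = vector.extract_map %a[%id : 32] : vector<256xf32> to vector<8xf32>
  %extb = vector.extract_map %b[%id : 32] : vector<256xf32> to vector<8xf32>
  %add = addf %exta, %extb : vector<8xf32>
  %acc = vector.insert_map %add[%id : 32] : vector<8xf32> to vector<256xf32>
  // This write still covers all 256 elements even though only the 8 elements
  // for this id are valid, so the IR is incorrect until the write is folded
  // with the insert_map.
  vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>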

Based on that, the transformation stages would look like the ones below.
Original code:

%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>

Then create the loop and break up only the transfer_write:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %acc = addf %a, %b: vector<256xf32>
    %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    vector.transfer_write %ext, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }

Propagate the extract_map:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
    %exta = vector.extract_map %a[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %extb = vector.extract_map %b[%arg5 : 32] : vector<256xf32> to vector<8xf32>
    %acc = addf %exta, %extb: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }
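
A minimal sketch of what the propagation pattern could look like for a side-effect-free elementwise op, moving the extract_map above the addf. The ExtractMapOp accessor names (vector(), id(), multiplicity()) and builder signature are assumptions for illustration, not the actual upstream API:

  // Rewrites extract_map(addf(a, b)) into addf(extract_map(a),
  // extract_map(b)), pushing the extract closer to the transfer_reads.
  struct PropagateExtractMap : public OpRewritePattern<vector::ExtractMapOp> {
    using OpRewritePattern<vector::ExtractMapOp>::OpRewritePattern;

    LogicalResult matchAndRewrite(vector::ExtractMapOp extract,
                                  PatternRewriter &rewriter) const override {
      auto addf = extract.vector().getDefiningOp<AddFOp>();
      if (!addf)
        return failure();
      Location loc = extract.getLoc();
      // Extract the same slice from each operand of the large addf.
      Value lhs = rewriter.create<vector::ExtractMapOp>(
          loc, extract.getType(), addf.lhs(), extract.id(),
          extract.multiplicity());
      Value rhs = rewriter.create<vector::ExtractMapOp>(
          loc, extract.getType(), addf.rhs(), extract.id(),
          extract.multiplicity());
      // The small addf replaces the extract; the large addf becomes dead
      // once all of its uses are rewritten.
      rewriter.replaceOpWithNewOp<AddFOp>(extract, lhs, rhs);
      return success();
    }
  };

Because the addf has no side effects, duplicating the extract per operand is safe, which is exactly the property the incremental lowering relies on.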

Fold the extract_map into the transfer_read:

  scf.for %arg5 = %c0 to %c256 step %c8 {
    %a = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%arg5], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%arg5]: vector<8xf32>, memref<?xf32>
  }
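
Viewed in isolation (keeping the convention from the examples above, where %arg5 is already the element offset), the fold on the read side is the local rewrite:

  // Before:
  %a   = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %ext = vector.extract_map %a[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  // After: the read starts at the id's offset and only loads 8 elements.
  %ext = vector.transfer_read %in1[%arg5], %cf0: memref<?xf32>, vector<8xf32>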

This way the semantics are preserved at every stage of the transformation and we don’t need an insert_map anymore.
It wouldn’t work for instructions with side effects that return a value (like atomics); those would have to be handled differently.

The first stage would require analysis to make sure we don’t have synchronization problems, but right now we are really trying to build the infrastructure, so we start from the assumption that the first transformation is done. As Nicolas mentioned, the plan is to make it work with N-D vectors and to support different distribution schemes, and that’s what I’m planning to concentrate on once we agree that we have a solid basis.

If this makes sense to you I’ll send a patch to make the lowering behave as described.

You are right, I was representing the multiplication by 8 with an affine.apply, but this is obviously better.
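
For reference, this is an assumed reconstruction of what my earlier version looked like, with the loop running over ids and an affine.apply materializing the multiplication by 8:

  scf.for %id = %c0 to %c32 step %c1 {
    %off = affine.apply affine_map<(d0) -> (d0 * 8)>(%id)
    %a = vector.transfer_read %in1[%off], %cf0: memref<?xf32>, vector<8xf32>
    %b = vector.transfer_read %in2[%off], %cf0: memref<?xf32>, vector<8xf32>
    %acc = addf %a, %b: vector<8xf32>
    vector.transfer_write %acc, %out[%off]: vector<8xf32>, memref<?xf32>
  }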

Hopefully the explanation above clarifies this part. The idea is that it is still correct to load those in the loop, and the canonicalization should break it up so that each iteration only loads what it needs.

This is a typo, I’m editing my post.