[RFC] Adding reduction support (iter_args) to the Affine vectorizer

Hi all,

We are interested in adding support for reductions to the Affine vectorizer and improve the vectorization approach so that it can vectorize more scenarios. This RFC describes the basics of the problem, the initial use cases that we want to support and two proposals with the high-level implementation steps. We can elaborate further on any of these points, as needed.

We would love to hear what you think!

Use Case Scenarios & Challenges

We are targeting affine.for loop nests where the loop-carried dependence of the reduction is represented using iter_args. The following example shows a loop nest with a reduction on the inner loop:

func @main(%in: memref<16x64xf32>, %out: memref<64xf32>) { 
 %cst = constant 0.000000e+00 : f32 
 affine.for %arg2 = 0 to 64 { 
   %final_red = affine.for %arg4 = 0 to 16 iter_args(%red_iter = %cst) -> (f32) { 
     %5 = affine.load %in[%arg4, %arg2] : memref<16x64xf32> 
     %6 = addf %red_iter, %5 : f32 
     affine.yield %6 : f32 
   } 
   affine.store %final_red, %out[%arg2] : memref<64xf32> 
 } 
 return 
} 

There are (at least) two vectorization strategies that could be applied to the nest above: vectorizing the outer loop or vectorizing the inner loop. Vectorizing the outer loop is relatively simple since there are no dependences that prevent vectorization at that level. A vector iteration of the outer loop computes a full reduction of the inner loop per vector lane. No horizonal vector reduction is required. The resulting vector code would look like this for VF=4:

  func @main(%arg0: memref<16x64xf32>, %arg1: memref<64xf32>) { 
    %cst = constant dense<0.000000e+00> : vector<4xf32> 
    %cst_0 = constant 0.000000e+00 : f32 
    affine.for %arg2 = 0 to 64 step 4 { 
      %0 = affine.for %arg3 = 0 to 16 iter_args(%arg4 = %cst) -> (vector<4xf32>) { 
        %1 = vector.transfer_read %arg0[%arg3, %arg2], %cst_0 : memref<16x64xf32>, vector<4xf32> 
        %2 = addf %arg4, %1 : vector<4xf32> 
        affine.yield %2 : vector<4xf32> 
      } 
      vector.transfer_write %0, %arg1[%arg2] : vector<4xf32>, memref<64xf32> 
    } 
    return 
  } 

Vectorizing the inner loop is a little more involved. The vectorizer should detect that the iter_args loop carried dependence is a supported reduction (simple). The vectorized inner loop returns a vector with VF partial reductions that should be reduced to the final scalar value using a horizontal vector reduction operation. The resulting vector code would look like this for VF=4:

  func @main(%arg0: memref<16x64xf32>, %arg1: memref<64xf32>) { 
    %cst = constant 0.000000e+00 : f32 
    %cst_0 = constant dense<0.000000e+00> : vector<4xf32> 
    affine.for %arg2 = 0 to 64 { 
      %0 = affine.for %arg3 = 0 to 16 step 4 iter_args(%arg4 = %cst_0) -> (vector<4xf32>) { 
        %2 = vector.transfer_read %arg0[%arg3, %arg2], %cst : memref<16x64xf32>, vector<4xf32> 
        %3 = addf %arg4, %2 : vector<4xf32> 
        affine.yield %3 : vector<4xf32> 
      } 
      %1 = vector.reduction "add", %0 : vector<4xf32> into f32 
      affine.store %1, %arg1[%arg2] : memref<64xf32> 
    } 
    return 
  } 

However, dealing with iter_args and affine.yield op in the Affine vectorizer poses some challenges. The current vectorization approach only vectorizes chains of operations that start with a load op and end with a store op, which leads to two important limitations:

  1. Operations within the loop to be vectorized that are not part of a load-to-store chain are not vectorized.
  2. Only load and store ops start and end a chain of operations to be vectorized.

Consequently, this scheme makes the vectorization of iter_args and affine.yield op hard since they wouldn’t be part of any load-to-store chain.

Proposal 1: Extend existing vectorization approach to support iter_args and affine.yield

We could extend the exiting vectorization approach to support root and terminal operations other than affine loads and stores. affine.yield would be added as a new terminal operation. affine.for ops that return values would be added as a new root operation. Vectorization of iter_args would be special-cased.

This approach provides a very ad-hoc solution to the reduction problem and requires minimal changes in the vectorizer but it does not provide a generic solution to the aforementioned vectorization limitations. These limitations are not unique to supporting iter_args and affine.yield. We struggled with the same problem in internal use cases unrelated to reductions. For these reasons, I personally wouldn’t recommend this approach and I think it’s a bit hacky.

Proposal 2: Change to a “vectorize-all topological” vectorization approach

As part of the reduction effort, we could change the current vectorization approach to better accommodate reduction support and improve the vectorization capabilities. We propose to change the load-to-store chain vectorization approach to a more generic/traditional one that would vectorize all the instructions within the loops to be vectorized in topological order. Consequently, affine.yield and affine.for components (results/iter_args) would be considered for vectorization.

The implementation of the new approach would reuse most of the existing code since it would only change the order in which operations are vectorized, not how they are vectorized. A topological order ensures that the operands of an operation to be vectorized have been vectorized already (with some exceptions) which would also simplify the existing vectorization transformation and some of the internal data structures. This approach would also pave the way to introducing vectorization interfaces and other abstractions to make the vectorizer more reusable across dialects. The main downside of this approach is that it requires significant changes and much more work.

Execution

We suggest the following implementation steps for Proposal 1:

  1. Extend the vectorizer to support multiple root and terminal operations.
  2. Support the outer loop vectorization scenario: add basic support to vectorize iter_args and affine.yield op. Extend loop dependence check so that loops with iter_args are not considered for vectorization. Only loops enclosing other loops with iter_args should be considered for vectorization if the rest of the legality checks are met.
  3. Support the inner loop vectorization scenario: add support to vectorize loops with iter_args when the loop carried dependences are supported reductions. Generate the proper horizontal reduction operation required for those cases.

Similarly, the implementations steps for Proposal 2 would be:

  1. Change the vectorization approach so that it vectorizes all the operations in the target loops in topological sort. Changes in this step should be strictly limited to preserving existing functionality and implementing the algorithmic change. No other features should be introduced at this stage.
  2. Same as #2 in Proposal 1.
  3. Same as #3 in Proposal 1.

Please, let us know if any further clarification is needed!

Thanks,
Diego

@nicolasvasilache, @aartbik, @bondhugula, @ftynse

Thanks @dcaballe for the writeup and clear explanation of the issues you are currently facing.

I’d go directly for #2, the topologically sorted traversal and no root/terminal is the direction Linalg vectorization also has evolved towards so maybe there is an opportunity for code reuse.

Thanks @nicolasvasilache. Yes, I think at the bare minimum we should be able to reuse some common utilities, such as vectorizeOneOp and the like, once we align both approaches and their data structures a little more.