Outer product spilling

Hi all,
I am not sure whether this is known or easily sorted. I was trying to reproduce the tiling strategy implemented by a well-known Arm library, ACL, which uses an 8x12 inner tile (see the kernel implementation here).

I was trying to reproduce the same implementation (through GitHub - google/iree-llvm-sandbox). However, digging deeper, I observed heavy spilling in the generated assembly, which degrades performance quite significantly.

I was able to come up with a simple example written in Vector dialect:

func @outerproduct(%arg0: vector<8xf32>, %arg1: vector<12xf32>) -> vector<8x12xf32> {
  %c0 = constant 0 : index
  %c2048 = constant 2048 : index
  %c2 = constant 2 : index
  %2 = vector.outerproduct %arg0, %arg1 : vector<8xf32>, vector<12xf32>
  %3 = scf.for %arg5 = %c0 to %c2048 step %c2 iter_args(%arg6 = %2) -> (vector<8x12xf32>) {
     %4 = vector.outerproduct %arg0, %arg1, %arg6 : vector<8xf32>, vector<12xf32>
     scf.yield %4 : vector<8x12xf32>
  }
  return %3 : vector<8x12xf32>
}

If I compile this with

mlir-opt test_outer_product.mlir --convert-vector-to-scf --lower-affine  --convert-scf-to-std --convert-vector-to-llvm --convert-std-to-llvm  --reconcile-unrealized-casts| mlir-translate --mlir-to-llvmir | ~/llvm-project/build/bin/llc  -O3  -march=aarch64 -mcpu=neoverse-n1  -filetype=asm -o outer.s

I can observe quite a lot of spilling going on. I will post a small extract here:

ldr	s7, [sp, #1200]
	mov	v6.s[2], v7.s[0]
	mov	v16.s[2], v7.s[0]
	mov	v21.s[2], v4.s[0]
	ldr	s7, [sp, #1232]
	mov	v12.s[2], v7.s[0]
	mov	v3.s[2], v7.s[0]
	ldr	s7, [sp, #1208]
	mov	v6.s[3], v7.s[0]
	mov	v16.s[3], v7.s[0]
	mov	v21.s[3], v5.s[0]
	ldr	s7, [sp, #1240]
	mov	v12.s[3], v7.s[0]
	mov	v3.s[3], v7.s[0]
.Ltmp0:
	.loc	1 13 11 prologue_end            // <stdin>:13:11
	fmul	v11.4s, v21.4s, v0.s[0]
	fmul	v9.4s, v16.4s, v0.s[0]
	.loc	1 21 11                         // <stdin>:21:11
	fmul	v8.4s, v21.4s, v0.s[1]
	fmul	v31.4s, v16.4s, v0.s[1]
	.loc	1 29 11                         // <stdin>:29:11
	fmul	v29.4s, v21.4s, v0.s[2]
	fmul	v28.4s, v16.4s, v0.s[2]
	.loc	1 37 11                         // <stdin>:37:11
	fmul	v26.4s, v21.4s, v0.s[3]
	fmul	v25.4s, v16.4s, v0.s[3]
	.loc	1 45 11                         // <stdin>:45:11
	fmul	v23.4s, v21.4s, v1.s[0]
	fmul	v22.4s, v16.4s, v1.s[0]
	.loc	1 53 11                         // <stdin>:53:11
	fmul	v20.4s, v21.4s, v1.s[1]
	fmul	v19.4s, v16.4s, v1.s[1]
	.loc	1 61 11                         // <stdin>:61:11
	fmul	v18.4s, v21.4s, v1.s[2]
	fmul	v17.4s, v16.4s, v1.s[2]

This is the same behaviour I see in my more complex matrix multiply examples.

My questions are:
a) Is this a problem also for x86 machines?
b) Is this MLIR’s fault or LLVM’s fault?
c) Do you think this can be fixed (not necessarily easily)?

Thank you for any kind of support!

Additional info

I spent the day playing with this and gathered more information:
a) Power-of-two outer-product shapes don't spill: 8x4 and 8x8 work fine
b) Non-power-of-two outer-product shapes always spill: 8x5, 8x6, and 8x12 all spill
c) If I remove the loop in MLIR and leave only two outer products, it does not spill
d) I debugged the LLVM pipeline, and it seems that the pass aarch64-local-dynamic-tls-cleanup introduces the spilling. In particular, for an 8x5xf32 outer product we have:

  liveins: $x8, $q0, $q1, $s2, $s3, $s4, $s5, $s6
  %129:fpr32 = COPY $s6
  %128:fpr32 = COPY $s5
  %127:fpr32 = COPY $s4
  %126:fpr32 = COPY $s3
  %125:fpr32 = COPY $s2
  %124:fpr128 = COPY $q1
  %123:fpr128 = COPY $q0
  %122:gpr64 = COPY $x

while for 8x8xf32 we have:

  liveins: $x8, $q0, $q1, $q2, $q3
  %54:fpr128 = COPY $q3
  %53:fpr128 = COPY $q2
  %52:fpr128 = COPY $q1
  %51:fpr128 = COPY $q0

So it looks like the 5xf32 vector uses 5 single registers, while the 8xf32 vector uses 2 quad registers.

I will dig more into this topic, but the root cause is likely to be:
a) The LLVM vectorizer usually produces "normal"-sized vectors, mostly power-of-two; it would never produce a 5xf32 vector
b) With MLIR such vectors are likely, because we want specific tiling strategies

This is still preliminary, but I see two different solutions:
a) Make the LLVM backend support unusual vector lengths (5x, 6x, 12x)
b) Legalize on the MLIR side so that only power-of-two vectors reach LLVM

Maybe you guys are already aware of some legalizations? I wasn’t able to find much. If not, what would be the best approach?

Thanks again!

cc: @nicolasvasilache @chelini (thought you guys might be interested)

Thanks @giuseros for the detailed analysis.

It seems a bit surprising that this is not handled well with the proper shuffle instructions and canonicalization + folding at the LLVM level. Does it help to remove/unroll the loop?
Maybe the type of peephole optimization you need here does not play nicely with the loop yields.
@dcaballe may know more about this and whether it is worth it / feasible to fix in LLVM.

On the MLIR side, it should be possible to use the “vector unrolling patterns” for such cases to ensure that vector.contract / vector.outerproduct get broken down to vector<4xf32> and vector<8x4xf32>.
To use those, look for:

void populateVectorUnrollPatterns(RewritePatternSet &patterns,
                                  const UnrollVectorOptions &options);
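
To make the target shape concrete, here is a hand-written sketch (not actual pattern output, and the slicing ops are only illustrative) of roughly what the 8x12 outer product could look like once broken into 8x4 pieces:

// Hand-written sketch, assuming the accumulator is passed in explicitly.
func @outerproduct_unrolled(%a: vector<8xf32>, %b: vector<12xf32>,
                            %acc: vector<8x12xf32>) -> vector<8x12xf32> {
  // Slice the 12-wide operand and the accumulator into three 4-wide pieces.
  %b0 = vector.extract_strided_slice %b {offsets = [0], sizes = [4], strides = [1]} : vector<12xf32> to vector<4xf32>
  %b1 = vector.extract_strided_slice %b {offsets = [4], sizes = [4], strides = [1]} : vector<12xf32> to vector<4xf32>
  %b2 = vector.extract_strided_slice %b {offsets = [8], sizes = [4], strides = [1]} : vector<12xf32> to vector<4xf32>
  %acc0 = vector.extract_strided_slice %acc {offsets = [0, 0], sizes = [8, 4], strides = [1, 1]} : vector<8x12xf32> to vector<8x4xf32>
  %acc1 = vector.extract_strided_slice %acc {offsets = [0, 4], sizes = [8, 4], strides = [1, 1]} : vector<8x12xf32> to vector<8x4xf32>
  %acc2 = vector.extract_strided_slice %acc {offsets = [0, 8], sizes = [8, 4], strides = [1, 1]} : vector<8x12xf32> to vector<8x4xf32>
  // Three independent 8x4 outer products on power-of-two vectors.
  %r0 = vector.outerproduct %a, %b0, %acc0 : vector<8xf32>, vector<4xf32>
  %r1 = vector.outerproduct %a, %b1, %acc1 : vector<8xf32>, vector<4xf32>
  %r2 = vector.outerproduct %a, %b2, %acc2 : vector<8xf32>, vector<4xf32>
  // Reassemble the 8x12 result.
  %u0 = vector.insert_strided_slice %r0, %acc {offsets = [0, 0], strides = [1, 1]} : vector<8x4xf32> into vector<8x12xf32>
  %u1 = vector.insert_strided_slice %r1, %u0 {offsets = [0, 4], strides = [1, 1]} : vector<8x4xf32> into vector<8x12xf32>
  %u2 = vector.insert_strided_slice %r2, %u1 {offsets = [0, 8], strides = [1, 1]} : vector<8x4xf32> into vector<8x12xf32>
  return %u2 : vector<8x12xf32>
}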

I am not 100% sure about the current state of affairs regarding the composition of unrolling patterns with canonicalizations and foldings, especially in the presence of vector.transpose, vector.transfer with permutations, and vector.shape_cast, but it seems like a good time to make a focused push on supporting more cases.

vector.contract + elementwise were historically the first supported ops, so you may have to start from that to see the first pieces connect.

I can see how scf.for + iter_args could throw a wrench into the unrolling patterns so you may want to start without the loop and see whether you get what you want first.
@ThomasRaoux is the most up to date here.

Once the basic cases work, we should also think about how to bubble some more control to the user.

Hi @nicolasvasilache ,
Thanks for your answer! So I spent my day investigating the backend. I found out that the legalization during instruction selection basically only splits vectors in half each time (and widens if the vector length is odd). The files I looked at are:

llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.cpp
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
llvm/lib/CodeGen/TargetLoweringBase.cpp

I tried to hack my way into those files to make the codegen work (for my very specific case), but it wasn't that simple.

Tomorrow I will have a glance at the MLIR side, but I have a question:

I am not 100% sure about the current state of affairs regarding the composition of unrolling patterns with canonicalizations and foldings, especially in the presence of vector.transpose, vector.transfer with permutations, and vector.shape_cast, but it seems like a good time to make a focused push on supporting more cases.

What do you mean exactly by this? I interpret this as “it will be quite hard because the pattern might be quite convoluted to specify (given there are vector_transfers)”. Is this the case?

FYI, I don’t have any transpositions anymore (I am doing A’*B, and this made performance… worse :slight_smile: ).

I will let you know how I am getting on.

Thanks,
Giuseppe

cc’ing also @stevenvar, who's helping me a lot on the LLVM side of things.

I meant that it's been a while since I touched this and I haven't paged the improvements that @ThomasRaoux made back into my working memory. For the simple case of matmul, just hooking up the unroll and folding patterns, which iterate until the unrolled pieces are folded into the vector transfers, should be relatively easy. I think this should also work through vector.transpose, but I am not 100% sure right now.
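
Roughly, the folding I have in mind is that an extract_strided_slice of a big transfer_read collapses into a smaller, offset transfer_read (hand-written sketch; %A, %i, %j and %cst are placeholders, not values from the example above):

// A transfer_read of the full tile followed by a slice...
%big   = vector.transfer_read %A[%i, %j], %cst : memref<?x?xf32>, vector<8x12xf32>
%piece = vector.extract_strided_slice %big {offsets = [0, 4], sizes = [8, 4], strides = [1, 1]} : vector<8x12xf32> to vector<8x4xf32>
// ...is expected to fold into a smaller read at the shifted offset:
%j4    = affine.apply affine_map<(d0) -> (d0 + 4)>(%j)
%small = vector.transfer_read %A[%i, %j4], %cst : memref<?x?xf32>, vector<8x4xf32>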

For more complex cases that would involve vector.shape_cast, there will be missing patterns that we should implement when we see the need for them.

@gysit was reporting a similar perf loss on an x86 CPU recently; I haven't dug into his example yet, but I think he sees a perf loss also on powers of 2, which should be orthogonal. I will try to look tomorrow (I am traveling though), as well as land his CL in the sandbox.

The unroll pattern currently works on element-wise/contract and transfer ops; however, the target size still needs to divide the original size. In this example it could be used to break up the vector<8x12xf32> into a sequence of vector<1x4xf32>, which should avoid having LLVM DAG legalization kick in.
You would have to do the unrolling on vector.contract right now.
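
For reference, the rank-1 update from the original example can be written as a vector.contract with a unit reduction dimension (hand-written sketch), which is the form the unroll patterns currently handle:

#map_a = affine_map<(i, j, k) -> (i, k)>
#map_b = affine_map<(i, j, k) -> (k, j)>
#map_c = affine_map<(i, j, k) -> (i, j)>

// %a is the 8xf32 operand reshaped to 8x1, %b the 12xf32 operand reshaped to 1x12.
func @outerproduct_as_contract(%a: vector<8x1xf32>, %b: vector<1x12xf32>,
                               %acc: vector<8x12xf32>) -> vector<8x12xf32> {
  %0 = vector.contract {indexing_maps = [#map_a, #map_b, #map_c],
                        iterator_types = ["parallel", "parallel", "reduction"]}
       %a, %b, %acc : vector<8x1xf32>, vector<1x12xf32> into vector<8x12xf32>
  return %0 : vector<8x12xf32>
}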

That being said, this won't solve the problem with shapes like vector<5xf32>. We do have a plan to extend unrolling to support non-aligned cases, but I haven't gotten to it (mostly because I never found the need). This would allow breaking vector<5xf32> into vector<4xf32> + a scalar.

If instead you want to extend the vector in such a case, then I assume it requires something like padding.
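
For a vector<5xf32>, the two options would look roughly like this (sketch; %v is the 5-element vector and %zeros an all-zero vector<8xf32>, both placeholders):

// Option 1: split into an aligned 4-wide part plus a scalar tail.
%head = vector.extract_strided_slice %v {offsets = [0], sizes = [4], strides = [1]} : vector<5xf32> to vector<4xf32>
%tail = vector.extract %v[4] : vector<5xf32>
// Option 2: pad up to the next power of two and compute on vector<8xf32>.
%pad = vector.insert_strided_slice %v, %zeros {offsets = [0], strides = [1]} : vector<5xf32> into vector<8xf32>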

@gysit was reporting a similar perf loss on an x86 CPU recently; I haven't dug into his example yet, but I think he sees a perf loss also on powers of 2, which should be orthogonal. I will try to look tomorrow (I am traveling though), as well as land his CL in the sandbox.

True, I played a bit with transposing both A and B and did see a performance loss of about a factor of two. It may indeed be worth looking a bit deeper into the problem. Additionally, I guess packing is needed to make the layout changes really effective. Exposing the right flags in the Sandbox for this is probably another important step forward.

Using patterns in MLIR was way easier :slight_smile:

Thank you guys!

While performance is still very far from what I want, at least it no longer looks horrible (the spilling seems to have disappeared), and I am back to the problem of packing.

I would say that the issue of unrolling an 8x5 (8x4 + scalar, or 2x8x4 with padding) will come up once the auto-tuner starts playing a more central role.

Indeed, from this paper it looks like the "best" tiling strategy is an 8x5 for the architecture I am considering.

For now I am only trying to get as close as possible to the hand-written code, so I can stick with 12x8 (and apply 4x4 unrolling), but I think in the future the auto-tuner should be able to safely try 8x5 and evaluate whether it is better than 12x8 (without being "cheated" by the back-end).

Anyway, I will let you know how things pan out!

Thank you all!