# Micro-kernel analysis

We are working on the problem of generating the best microkernel to improve the performance of GEMM (and generically many linear algebra operations). We will share in this page what we found so far. We are mostly interested in Arm architectures, so any reference to assembly implies the Arm ISA.

# What is a micro-kernel?

In any loop-based computation the micro-kernel is the inner for loop in the computation.

In the case of GEMM, that we are using as an example, this is the loop-nest of the computation:

And the micro-kernel is represented by:

``````
for pr = 0, ..., kc -1 in steps of 1

Cc(ir:ir+mr -1, jr:jr+nr-1) += Ac(ir:ir+mr-1, pr) * Bc(pr, ir:ir+nr-1)

``````

Please note that `Ac` and `Bc` are laid out such that memory accesses are contiguous in memory.

# How to write a micro-kernel

Many of the BLAS libraries (ACL, OpenBLAS, etcâ€¦) tend to write the micro-kernel in hand-written assembly.

This has the advantage of squeezing the last drop of performance from the machine, but has different disadvantages:

• The library designer can be stuck in local minima

• It is very hard to generalize to new {micro/macro} architectures

The compiler, on the other hand, might find hard to generate the optimal micro-kernel, but it can list a series of optimizations that can be tentatively applied by an auto-tuner to improve the performance. We can call a sequence of those optimization a `strategy`.

The main condition for this to work is if the auto-tuner asks to apply a given strategy (e.g., `strategy=<Tile<3,3,3>, unroll_x_2, pipeline>`) and the compiler generates the best possible code.

Letâ€™s say that, given a strategy `S` and a program `P`, if we consider:

• `Lopt = S(P) // strategy S applied on program P by a library developer`
• `Copt= S(P) // strategy S applied on program P by the compiler`

We would like to have `Lopt==Copt` (one way of defining the `==` operator is through performance, i.e., `T(Lopt)==T(Copt)`).

We ( @stevenvar @allenaann and me) analyzed (and are still analyzing) those optimizations that can compose those strategies, and this is a summary of what we found out

## Overall results

So far, those have been the best results we were able to obtain:

Tile (mrxnr) MLIR transformations applied MLIR/Theoretical
8x8 Unrolling(x32) >72%
4x16 Unrolling(total) >82%
12x8 Unrolling(x2)+Pipelining+legalization+scheduling >81%
12x8 Unrolling(x2)+Pipelining+legalization+scheduling+prefetching >84%

Few words to clarify some of the above techniques:

• Total unrolling means unrolling the entire micro kernel. I.e., removing the kernel for-loop and computing `kc` `mr x nr` outer-products

• 12x8 is the tiling that ACL uses (which gives the best performance so far)

• Scheduling is manually applied at the MLIR level, almost before entering LLVM-land

• Legalizing means producing MLIR instructions that have a 1:1 correspondence in assembly. For instance, a `vector<12xf32>` is not legal in (Arm) assembly, and this means that the LLVM back-end has to split it up in chunks of `vector<4xf32>`. Apparently, LLVMâ€™s handling of non-power-of-two vectors is sub-optimal, and in general illegal operation in the LLVM world generate sub-optimal code.

## Assembly inspection

While it is hard to read the assembly for the total unrolled case, a skim through the assembly in the case of the 12x8 tiling strategy, shows that the micro-kernel assembly seems in a good shape:

• All registers are used (`v0-v31`)

• The instruction dependencies (`ldr -> fmla`) are minimized

• Loads for `vX` seem to be placed immediately after the fmla instructions that use `vX`

All in all, the generated assembly is quite nice, albeit still sub-optimal

## Prefetching

We wrote a simple auto-tuner to inject prefetch instructions in the MLIR micro-kernel. While in ACL and OpenBLAS prefetching seems to be quite effective we only got 3% speed up. We will let the tuner run more to see if we can identify better prefetching values

## LLVM compiler and limitations

We are using the LLVM compiler as a sort of black-box (mostly for lack of expertise).

The main issue we are facing is that there seems to be a dependence between the MLIR that we generate and the LLVM performance. In other words, we cannot assume that two functional equivalent MLIR programs are optimized in the same way by the LLVM compiler. Is this expected?

The clear example is about scheduling (@nicolasvasilache, you were right ): the position of independent operations in MLIR and the scheduling+RA of the LLVM compiler are connected.

This MLIR/LLVMIR dependency begs the question if scheduling should be really done in MLIR. Can we enhance the scheduling algorithm of LLVM with something more exotic (given that we are less time constrained)?

## Conclusion

The main take-away from all this is that there is a way of writing MLIR that produces (almost) optimal assembly.

This is good news for us, because we â€śonlyâ€ť need to provide and use the right set of transformations to accomplish the task.

Now it is the time to push for these transformations into the compiler. Some of those are available (pipelining), some need a bit of improvement (legalization), for some we need to understand how to split the responsibility between MLIR and LLVM (scheduling + prefetching).

4 Likes

Very cool, thanks for the analysis @giuseros

Yup â€¦

Note that there are some unresolved inflight discussion related to `vector.transpose`: https://groups.google.com/g/llvm-dev/c/P_CXf5hPro8 between what ISel / SelectionDAG lets you do in LLVM and where we can be (I measure 20-30% perf penalty in a pure transpose case).

Iâ€™d be interested to seeing (and reusing) your impl / tuning knobs for this

Yes please! Stars are quite aligned for all this to compose.

I am afraid for now we are stuck in â€śrandomly move stuff aroundâ€ť-land and expose a knob that we can turn to generate various versions and get the best. A little more control at the LLVM level would go a long way but may be very hard to add; please feel free to chime in on the `llvm-dev` thread above.

Due to email being hard and 1980â€™s stuff, I donâ€™t know if you can chime in unless you receive the email independently. If you donâ€™t have it I can +1 you in the thread.

Also, @dcaballe (who may also have good insights to share on all these topics)

@giuseros actually, I was just pointed to this and started digging: 2019 LLVM Developersâ€™ Meeting: V. Keles & D. Sanders â€śGenerating Optimized Code with GlobalISelâ€ť - YouTube

Given your domain, Iâ€™d recommend treating LLVM like a code generator - leave it register allocation and scheduling, but do not ask it to do legalization or mid-level optimizations. Iâ€™d recommend not even running the LLVM â€śoptâ€ť passes for your experiments. This will give you more control and predictability.

3 Likes

It may be worth trying a â€ślightâ€ť pipeline (`-O1`?) just do perform basic scalar optimization (SROA, cse, etc.), there is likely a bunch of things related to address computation and other naive thing emitted to LLVM IR that you want to handle before getting into the backend.

Hi Nicolas,

Note that there are some unresolved inflight discussion related to `vector.transpose` : https://groups.google.com/g/llvm-dev/c/P_CXf5hPro8 between what ISel / SelectionDAG lets you do in LLVM and where we can be (I measure 20-30% perf penalty in a pure transpose case).

Curious, because we are doing experiments by (manually) packing and transposing the input and we didnâ€™t see any performance regression (cc @allenaann ). But yes, I will attentively read the thread and the video you sent me(if you could +1 me in the thread, that would be great)

Iâ€™d be interested to seeing (and reusing) your impl / tuning knobs for this

I think we should find a common place to share these things (although all I am doing is very sketchy atm). I am wondering if GitHub - google/iree-llvm-sandbox is a bit more stable now to be used for this purpose.

I am afraid for now we are stuck in â€śrandomly move stuff aroundâ€ť-land and expose a knob that we can turn to generate various versions and get the best. A little more control at the LLVM level would go a long way but may be very hard to add; please feel free to chime in on the `llvm-dev` thread above.

The problem of adding knobs is that you can end up adding a lot of them. I mean if my kernel is (say) 33 lines of code, we get O(33!) reordering combinations (maybe not this bad because of the dependencies, but still). Same thing for my prefetching tuner (which generates a very large space).

If we could put a bit more logic in the LLVM it would be great: at that level there is already the knowledge of the registers and the machine model - I think - so the number of reasonable combinations should be less I spoke with different people, and it looks like there is an infra to customize the scheduling algorithm with something fancier (compiling time is less of an issue in this domain).

Hi @clattner , @mehdi_amini
The suggestion of turning off (or reducing) LLVM optimizations was quite unexpected. But yes, I agree. LLVM for us is a code generator, and as such we can offload to it only RA and scheduling (which is exactly what I wanted to hack, at least for the scheduling part). This thinking opens the door for interesting directions

Once the MLIR side looks fine, I will reiterate on this.

Thanks,
Giuseppe

Yes that is meant for this purpose and has evolved to a more stable point these days, it is still not integrated properly on our end so I will have to patch things manually and give you credit in the commit. But this was expected and I am willing to personally pay that manual integrate price for some time (I literally do not have a better solution atm but it will come, also cc @stellaraccident ).

True, OTOH for such an experiment we only need something reproducible than can cover a lot of cases while always generating correct code (i.e. def-use order preserving). So a heuristic that takes a single int value and just moves stuff around is what I was thinking about. This should be significantly simpler to control but would not be very predictable (i.e. given a fixed value, â€śsomethingâ€ť happens).

Been talking about this to a bunch of different folks but I lack the background and time to acquire it.
I would gladly adopt any solution that knowledgeable folks would give me; tis the week after all

Substantial developments have happened in matmul loop nest designs since the the Goto / BLIS style of loop nest (image pasted from BLIS loop nest in the initial message in this thread). I donâ€™t know if thatâ€™s of interest here and donâ€™t want to necessarily assume so (after all, the BLIS loop nest model is still good enough for a â€śmicro-kernel analysisâ€ť, just perhaps not for a whole-matmul analysis) but people have suggested that I mention my comments about this here, so here you go â€“ feel free to ignore!

1. Rigid loop nests like this donâ€™t scale across the whole spectrum of rectangular (M,K,N) problem shapes (e.g. if M is the largest dimension you probably donâ€™t want to start iterating over M only at the 3rd loop into the nest). That is why traditional implementations based on such loop nests have tended to ship a large number of loop nests (non optimial code size compromise) or rely on a JIT. In this more recent matmul library we have coalesced all the outer loops into one loop.
2. It used to be, until the early 2010s, that one could prove that cache-friendliness was achieved by having a loop hierarchy closely mirroring the memory system hierarchy like this, but on current CPUs, due to how prefetching has evolved, the overriding concern tends to be to have as simple memory access patterns as possible, and that can run at odds with traditional loop nest design (example: tiling the k dimension by kc in the outer loops may be seen as cache-friendly because of how it helps tiles fit in some level of cache, but it actually shortens the stretches of contiguous, simple memory accesses (from k down to just kc) so it doesnâ€™t necessarily help.
3. This may be something youâ€™re already doing, but itâ€™s unlike what the BLIS loop nest image literally says: you really want the packing (into Ac and Bc) to be entirely hoisted out of loops. So that itâ€™s O(N^2) , being taken out of the O(N^3) matmul work. Trying to minimize memory usage here by reusing a smaller bufferâ€™s bytes is the wrong compromise (but it wasnâ€™t so in the 2000s with the smaller RAM of machines then when the Goto paper came out, when this tradition was established).
1 Like

Hi @bjacob ,
Thanks for the great insights. So I agree that the BLIS loop nest might not be optimal in some cases, but this helps the cause I mean, even the faster libraries runs at 91% of the peak on our experiments.

Manually rewriting those routines with different packing strategies, fusing loops together, etcâ€¦ means a very large development time which could be reduced if we can automatically generate the operation (packing, kernel, etcâ€¦). So the BLIS setup could be seen just like the X0 point in our search space.

Orthogonal question: I thought Ruy was mainly supposed to work on small (ML-style) matrices. Do you find the same insights on bigger matrices as well?

Giuseppe

Thanks @giuseros for the response. Ruy is indeed aimed at mobile ML workloads. That does involve many small sizes, but not only, and often one or two of the dimensions are small and some other dimension is large (very rectangular shapes). So itâ€™s really aiming for graceful scaling across the (m,k,n) 3D shape space, in a small binary package. As any combination of the m, k or n dimensions might be small or large, we preferred having only one outer loop so that at least that loop will have a large number of iterations if any of the m, k, n dimensions is large at all. Scaling to large sizes (â€ścache friendlinessâ€ť) becomes a detail of that single loopâ€™s counter gets â€śdecodedâ€ť to a tile, e.g. following a space-filling curve through the m,k,n 3d tile space. (we actually avoid tiling on k, which makes it a 2d space, which helps make space-filling curves easier to implement while also ensuring that the tiles, being only on (m,n) dimensions, are mutually independent computations).