GPU Compute Basic Algorithms

It doesn’t always have to be micro-ops in descendent dialects, but progressive lowering plus transformations that break up an op into smaller ops does seem to be a reasonable way to get started. You could look at how vector.contract gets progressively lowered to either:

  1. vector.matrix_multiply -> llvm.matrix_multiply
  2. vector.outerproduct + vector.transpose -> insert/extract/fma
  3. directly to insert/extract/fma
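To make the difference between paths 2 and 3 concrete, here is a rough sketch in plain Python (not MLIR, and not the actual lowering patterns). It assumes the contraction is a plain 2-D matmul with an accumulator; all function names are made up for illustration. The same contraction is computed once as a chain of outer products on a transposed LHS and once as scalar extract/fma/insert steps, and both give the same result.

```python
def transpose(x):
    """Stand-in for vector.transpose on a 2-D value."""
    return [list(col) for col in zip(*x)]


def outerproduct(u, v, acc):
    """Stand-in for vector.outerproduct with an accumulator: acc + u (x) v."""
    return [[ui * vj + acc[i][j] for j, vj in enumerate(v)]
            for i, ui in enumerate(u)]


def contract_via_outerproduct(a, b, c):
    """Path 2 (hypothetical helper): one outer product per reduction index."""
    out = [row[:] for row in c]
    # Rows of transpose(a) are the columns of a, paired with the rows of b.
    for col_a, row_b in zip(transpose(a), b):
        out = outerproduct(col_a, row_b, out)
    return out


def contract_direct(a, b, c):
    """Path 3 (hypothetical helper): scalar extract / fma / insert only."""
    out = [row[:] for row in c]
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = out[i][j]                     # extract
            for r in range(len(b)):
                acc = a[i][r] * b[r][j] + acc   # fma
            out[i][j] = acc                     # insert
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(contract_via_outerproduct(a, b, c))
print(contract_direct(a, b, c))
```

The point of the intermediate form is that each outer product is still a whole-vector op, so a HW-specific dialect can pick it up before everything is scalarized.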

From any of these levels it would make sense to go to a HW-specific dialect, e.g. GPU.
However, there are also implications for chains of ops with sources / sinks to memory (e.g. load/store and vector.transfer); see e.g. @ThomasRaoux’s commits to IREE to see a bit of the gradient here.

There is also some WIP that adds a lowering to vector.reduce.

A lot more work is needed to get a meaningful set of useful primitives such as those described by @Lichtso.

For example, Linalg supports a primitive lowering of ops to library calls for semantically named ops. This needs to be extended (e.g. as discussed here) but it shows that, starting from high-level, semantically charged ops, we can mix transformations, codegen and library calls. This is also why I have been talking about ops whose semantics are captured by attributes: this allows building generic mechanisms to mix codegen + library calls.
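As a toy illustration of that mix (plain Python, not Linalg's actual mechanism), the sketch below describes an op purely by attributes: a semantic name plus index maps. A hypothetical registry maps known names to a "library call", and anything else falls back to a generic routine that only reads the attributes. Everything here (GenericOp, LIBRARY, lower, the restriction to two 2-D inputs with multiply-accumulate) is an assumption made for the example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class GenericOp:
    name: str           # semantic name, e.g. "matmul"
    input_maps: tuple   # loop indices used by each 2-D input
    output_map: tuple   # loop indices used by the 2-D output


def library_matmul(a, b):
    """Stand-in for an external library call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Hypothetical registry of semantically named ops with a library implementation.
LIBRARY = {"matmul": library_matmul}


def generic_codegen(op):
    """Build a loop nest from the attributes alone; the op name is never read."""
    def run(a, b):
        extent = {}
        for operand, imap in zip((a, b), op.input_maps):
            extent[imap[0]] = len(operand)
            extent[imap[1]] = len(operand[0])
        names = list(extent)
        (a0, a1), (b0, b1) = op.input_maps
        o0, o1 = op.output_map
        out = [[0] * extent[o1] for _ in range(extent[o0])]
        for point in product(*(range(extent[n]) for n in names)):
            idx = dict(zip(names, point))
            out[idx[o0]][idx[o1]] += a[idx[a0]][idx[a1]] * b[idx[b0]][idx[b1]]
        return out
    return run


def lower(op):
    """Prefer a library call when the name is known, else generate loops."""
    if op.name in LIBRARY:
        return "library_call", LIBRARY[op.name]
    return "codegen", generic_codegen(op)


matmul = GenericOp("matmul", (("i", "k"), ("k", "j")), ("i", "j"))
matmul_bt = GenericOp("my_matmul_bt", (("i", "k"), ("j", "k")), ("i", "j"))

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
for op in (matmul, matmul_bt):
    kind, fn = lower(op)
    print(op.name, kind, fn(a, b))
```

Note that generic_codegen handles my_matmul_bt without knowing anything about it beyond its index maps, which is the property I care about.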

I imagine similar mechanisms can be generalized and serve the same purpose. However, it is unclear at this point whether all/most of the ops discussed in the meeting have such attribute-based representations, but the less the compiler, analyses and transformations need to know about e.g. my_special_scan_and_conv_op, the better.