Some MLIR warts we've found in our convolution kernel generator

@Mogball - forking from the thread so this doesn’t clutter up the original discussion.
I’ll admit that “extensive infrastructure” could also have been worded as “various custom ops and a good amount of code that feels unnecessary”, but I’d been staring at a lot of that code all day yesterday.

So, here’s a non-exhaustive list of things my coworkers and I have complained about (cc @whchung in case I missed something or didn’t explain or motivate something well).

Indexing into vectors vs. everything else

A vector is indexed by any fixed-size integer type, which doesn’t include index. Most other operations, such as loads/stores to memrefs, applying affine maps, and so on, use index. And so, if we want to, for example, load from %A[f(i0, i1, ..., in)] and store to %v[g(i0, i1, ..., in)] (with %A a buffer and %v a vector), we’ll need to emit a whole bunch of casts for i0, i1, ..., in so we can use them in one or the other operation.

In practice, this means each part of our code generation needs to decide if it’s working on index or i32 (even though, in the end, we’re compiling with index == i32 anyway) and emit casts to handle the other type. While these casts have no effect on our final code, they make intermediate results hard to follow and lead to a lot of “oh, whoops, I forgot to cast that” during development.
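For illustration, here’s a minimal sketch of the cast churn (not our actual IR, and using current upstream op spellings):

// The induction variable %i is an index: memref indexing takes it directly...
%x = memref.load %A[%i] : memref<128xf32>
// ...but the vector element position must be a fixed-width integer, so we cast.
%pos = arith.index_cast %i : index to i32
%v2 = vector.insertelement %x, %v[%pos : i32] : vector<4xf32>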

Loading a vector of things from a memref

If we want to load, say, 4 floats from a memref %A : memref<...xNxf32>, we need our own op to lower to vector loads (since vector.transfer_read is coupled to masked-load semantics, which aren’t supported well on AMD systems). In addition, we can’t abstract over whether a load/store loop is using vectors or scalars as well as we’d like, because the indexing for loading vectors from a memref requires us to handle the vector strides ourselves.

The operation we’ve often wished we had is something like vectorize_cast, which would take a memref<...xT> to a memref<...xvector<VxT>>, adjusting the shape accordingly (and yes, that particular bit of pointer reinterpretation can cause unaligned loads and/or crashes on some architectures, but we know the buffers we’re dealing with are appropriately aligned per the calling convention).
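Concretely, the sort of thing we’d like to write would be (vectorize_cast is hypothetical and doesn’t exist; the miopen.vectorize_cast spelling here is purely illustrative):

// Hypothetical op: reinterpret groups of 4 contiguous f32s in the innermost
// dimension as vector<4xf32>, shrinking that dimension from 64 to 16.
// Only sound because our calling convention guarantees sufficient alignment.
%V = miopen.vectorize_cast %A : memref<16x64xf32> to memref<16x16xvector<4xf32>>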

Constructing (and querying) deeply nested attributes

As a result of where our algorithm came from (and of needing to handle unrolled loops efficiently), in addition to maintaining affine maps that describe how to translate between coordinate spaces, we also carry transformation attributes. For example, here’s the description of how the output tensor gets mapped to a matrix for multiplication in a particular backward-weight convolution (given the output and filter, derive the input), taken from partway through the generation process:

%4 = miopen.transform(%arg2) {
    extraPad = false, gemmMExtra = 0 : i32, gemmNExtra = 0 : i32,
    gridwise_gemm_argument_position = 0 : i32, 
    layout = 
      [{lower_layer_dimensions = [1 : i32], 
         lower_layer_names = ["go"], 
         transformation = "PassThrough",
         upper_layer_dimensions = [0 : i32], 
         upper_layer_names = ["gemmG"]},
      {lower_layer_dimensions = [0 : i32, 3 : i32, 4 : i32],
         lower_layer_names = ["no", "ho", "wo"],
         transformation = "Merge",
         upper_layer_dimensions = [1 : i32],
         upper_layer_names = ["gemmK"]},
      {lower_layer_dimensions = [2 : i32],
        lower_layer_names = ["ko"],
        transformation = "PassThrough", 
        upper_layer_dimensions = [2 : i32],
        upper_layer_names = ["gemmM"]}],
    lower_layer_bounds = [8 : i32, 1 : i32, 128 : i32, 28 : i32, 28 : i32],
    lower_layer_layout = ["no", "go", "ko", "ho", "wo"],
    lowest_layer = true, 
    upper_layer_bounds = [1 : i32, 6272 : i32, 128 : i32],
    upper_layer_layout = ["gemmG", "gemmK", "gemmM"]} 
: memref<8x1x128x28x28xf16> to memref<1x6272x128xf16, #map4>

While some of that information (the names) is more of a debug aid, a lot of it is needed to generate efficient code later in the pipeline: walking down a composed series of transformations to determine “if we increment gemmN by 1, which coordinates in the tensor will change, and by how much” yields better code than trying to use affine.for or the like.
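For example, assuming Merge treats its first listed lower dimension as the slowest-varying, the “Merge” entry in the attribute above corresponds to an index computation roughly like the affine map below (a sketch, not our actual IR). Walking it tells us that bumping gemmK by 1 normally only bumps wo, carries into ho every 28 steps, and carries into no every 784 steps:

// Maps (gemmG, gemmK, gemmM) back to (no, go, ko, ho, wo);
// 784 = 28 * 28 and 8 * 784 = 6272.
#gemmK_unmerge = affine_map<(gemmG, gemmK, gemmM) ->
    (gemmK floordiv 784, gemmG, gemmM, (gemmK floordiv 28) mod 28, gemmK mod 28)>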

Generating these attributes involves code like LowerMIOpenOps.h on the miopen-dialect branch of ROCmSoftwarePlatform/llvm-project-mlir on GitHub, which gives off “this really isn’t the right way to do things” vibes.

We’re currently investigating what can be done about this, which includes

  1. Can we generate the coordinate transform generation code automatically?
  2. Could we structure things such that we didn’t need these massive attributes? Which leads into:
  3. Could we redesign our kernel generator to take advantage of more of MLIR’s facilities (both because we’re now more familiar with the tool and because that infrastructure has grown over the last few years) so that we didn’t need this approach to generate efficient code?

Correct me if I’m wrong, but I’m fairly certain that all memref and vector ops accept Variadic<Index>:$indices for indexing. This includes the load and store ops of both dialects.

This sounds like something that should exist. We have memref.reinterpret_cast and vector.bitcast but the former doesn’t change the element type and the latter only works on vectors (to my knowledge).

On the other hand, vector.load can load from a scalar memref to a vector:

// 1-D vector load on a scalar memref.
%result = vector.load %base[%i, %j] : memref<100x100xf32>, vector<8xf32>

// 2-D vector load on a scalar memref.
%result = vector.load %memref[%i, %j] : memref<200x100xf32>, vector<4x8xf32>

Unfortunately, I’m not versed enough on the affine side to comment on whether you need to do what you’re doing (I’m sure there’s someone around who could), but I can say that you don’t have to rely on DictionaryAttr to store the data you need. You can define your own attributes with AttrDef:

def LayoutAttr : AttrDef<...> {
  let parameters = (ins
    ArrayRefParameter<"int", "">:$lower_layer_dimensions,
    ArrayRefParameter<"std::string", "">:$lower_layer_names,
    ArrayRefParameter<"int", "">:$upper_layer_dimensions,
    ArrayRefParameter<"std::string", "">:$upper_layer_names
  );

  let mnemonic = "layout";
  let assemblyFormat = "`<` struct(params) `>`";
}

Hopefully, this can at least make your code cleaner (and more efficient, because you don’t have layers of ArrayAttr inside DictionaryAttr).

vector.extractelement, vector.insertelement, and friends don’t take index.

That op’s several years newer than our codebase, so it’s reassuring that a replacement for our gpu.buffer_load already exists.

I wasn’t aware of this, and it sounds to me like I’ve got a refactor to do once I’ve got a hot minute.

Ah, gotcha. Seems like it should, though (and it’s probably one of the more antiquated ops). I’ll try changing it to accept Index and see if anything major breaks.


Do ping me if that goes through so I can backport the change and don’t have to wait for our next “catch up with tip” operation.


Some comments.

These are thin wrappers around the corresponding operations in the LLVM Language Reference Manual. Since LLVM used i32, we went for i32. Offhand, I don’t see particular challenges to using index in MLIR, but it is unclear to me that we should: there is no magic, and until LLVM changes, you will have casts to i32 somewhere. For the purpose of “not forgetting to cast”, this can be pretty easily fixed at the Builder level by automatically introducing the casts for you, but I am unclear that this should be done either.
You mention “even though, in the end, we’re compiling with index == i32 anyway”: that’s your particular use case, and I am unclear this would be a good change for targets on which index == i64 (i.e. potentially truncating casts would appear).

Still, it would be good to have more concrete examples of the IR you generate: it seems you rely heavily on operations that extract a dynamic value from a vector; either you have special HW for that or LLVM will go back to memory for you. I’d be interested in digging deeper and understanding whether there is something high-level in MLIR that has potentially bad effects all the way down in the backend, or if this is a mere convenience issue (at which point you can just cast in the builder and call it a day).

The issues with a vector_cast-like op on memref (alignment and footguns) have been discussed at length in the past: an n-D memref is not a pointer, and you cannot just cast that away at the high level. Designing a proper such op requires taking DataLayout into consideration. In the absence of that, you can “hack your own op” locally or indeed use the lower-level vector.load, which will do the right cast at the lower level, where pointers are a thing.

Maybe… it’s hard to tell from this pack of attributes. These days, more and more “structured and retargetable codegen” abstractions are getting connected end to end. They are not yet ready for GPU, but we are working towards that, with progress expected on the vector programming model for GPU in the next 2-3 months (note that there are already facilities to easily map vector-level concepts to load4, pointwise, and wmma ops, and they will get generalized soon). You can also just go the affine dialect route and build your own heuristics if they aren’t available yet.


Wrt vector, that same documentation states (emphasis mine)

The second operand is an index indicating the position from which to extract the element. The index may be a variable of any integer type.

and so it could be i64; therefore we should be able to extractelement with any integer type, which means we should be able to use index (which is just some integer type of as-yet-unknown width).

Also, in our case, vector is a set of registers that is logically an array. Our matrix multiplication intrinsics return such vectors on the LLVM level.

Re vectorize_cast, yeah, that makes some sense, and the lack of it can be handled through stuff like vector.load and adjusting the indexing we generate.

I’m not sure I entirely follow your last paragraph - can you provide some links?

The change is coming Soon™. I just need to add some safeguards (to ensure index is cast to i32 in LLVM and not i64). For the record, using i32 or any other sized integer type is allowed; I’ve simply loosened the constraint to allow Index as well.
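For concreteness, once that relaxation lands, something like this should verify with no explicit cast (a sketch assuming the position operand simply accepts index):

%e = vector.extractelement %v[%i : index] : vector<4xf32>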

Also, I’m not sure LLVM’s insert/extractelement are limited to i32. I see some examples of them using i64 (for example, vector::InsertOp often lowers to llvm.insertelement where the position is i64).

@Mogball The LLVM documentation says that any integer type will work, so I think you can just cast index to its underlying type, whether it’s i32 or i64.


Right, that’s what I thought. Thanks!

This was a typo - I meant SPIR-V and other lowerings of the ops.

My apologies, you’re right: I got stuck in some weird local minimum and missed this.
The goal is to follow LLVM here, so we should evolve as per your suggestion.
Please ignore the digression.


It’s no problem, these things happen. Thank you for being vigilant about the correctness of the codebase.
