
Memrefs and maps for tiling

This topic may have been covered before, but I can’t find answers. I am trying to generate code for a device where memory needs padding in the lowest dimensions. Memrefs and maps seem uniquely suited for this:

#paddedMap = affine_map<(d0, d1) -> (d0 floordiv 32, d1 floordiv 32, d0 mod 32, d1 mod 32)>
%1 = alloc() : memref<62x61xf32, #paddedMap>

Using the maps is perfect, as it lets us preserve the original data dimensions (i.e. 62x61, the data that matters) while the memref also carries the desired data layout and the memory size needed for allocation (i.e. 2x2x32x32).
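For concreteness, here is a hand-written sketch (not compiler output) of what the normalized equivalent of the alloc above would look like: both 62 and 61 round up to two tiles of 32, and an access %1[%i, %j] under #paddedMap expands accordingly:

%1 = alloc() : memref<2x2x32x32xf32>
%v = affine.load %1[%i floordiv 32, %j floordiv 32, %i mod 32, %j mod 32] : memref<2x2x32x32xf32>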

All is well until lowering to LLVM, where I hit this assert:

Assertion failed: (isStrided(type) && "Non-strided layout maps must have been normalized away"), function convertMemRefSignature, file /Users/alexe/MLIR/llvm-project/mlir/lib/Conversion/StandardToLLVM/StandardToLLVM.cpp, line 237.

The example is simple (and actually has no padding, as the dimensions match):

#pad = affine_map<(i) -> (i floordiv 4, i mod 4)>
func @matmul(%A: memref<16xf64, #pad>) {
  affine.for %arg3 = 0 to 16 {
    %a = affine.load %A[%arg3] : memref<16xf64, #pad>
    %p = mulf %a, %a : f64
    affine.store %p, %A[%arg3] : memref<16xf64, #pad>
  }
  return
}

when compiled with various options (I did not find a combination that works):

mlir-opt simple.mlir --simplify-affine-structures --lower-affine --convert-std-to-llvm

I have looked at some of the test code; it often seems to flatten N-dimensional memrefs to 1D in the maps, and I am not sure why that is needed.
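For example, I see strided-looking maps like the one below, which I assume the LLVM lowering accepts directly because the map is in strided form (matching the assertion message above):

#flat = affine_map<(d0, d1) -> (d0 * 4 + d1)>
%2 = alloc() : memref<4x4xf64, #flat>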

Is my error due to missing functionality, erroneous input, or an erroneous optimization sequence?

I think the expressivity of N-dimensional memrefs with maps to deal with layout is a very strong abstraction. I hope we can make it work for this case, which hardly seems like a corner case.

Thanks

Just before you lower to LLVM, please call normalizeMemRef (there is no pass for it). It will rewrite a memref’s access functions so that its layout map becomes the identity. It can then be handled by the std-to-LLVM lowering.

Update: I notice that -simplify-affine-structures already calls normalizeMemRef for all AllocOps. Here, however, your memref is a function argument. Had your memref been the result of an alloc, it would have worked. If you need to keep your function in that form (say it’s not inlined, etc.), normalizeMemRef will have to be reused in a new interprocedural / module pass that performs the necessary rewrites for the function argument case.
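For example, with your #pad map, the alloc version would be rewritten roughly as follows (a sketch of the expected normalization):

%A = alloc() : memref<16xf64, #pad>
%a = affine.load %A[%arg3] : memref<16xf64, #pad>

becomes

%A = alloc() : memref<4x4xf64>
%a = affine.load %A[%arg3 floordiv 4, %arg3 mod 4] : memref<4x4xf64>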

Thanks @bondhugula, using an alloc made it work. Will look into adding a call to normalizeMemRef if needed. Much appreciated.

Great. The case where you have such memrefs being passed to functions should be straightforward to extend to as well if needed - the replacement mechanics are the same. It’s just that the function argument will have to be replaced, and non-dereferencing uses (in call ops) could simply be substituted (normalizeMemRef currently bails out on such uses). Finally, this could be put into a separate -normalize-memrefs pass, which would have to be a pass on ModuleOp (because -simplify-affine-structures is a pass on FuncOps).

Hi,
I’m trying to simplify #map0 in the following kind of example with:

mlir-opt --simplify-affine-structures --allow-unregistered-dialect

However, it does not simplify %1. It seems this is because %1 is referenced by “test.test” (the bail-out is at mlir/lib/Transforms/Utils/Utils.cpp#L70-L73).

How can I simplify this? (Where should I modify the source code?)

Input:

#map0 = affine_map<(d0, d1) -> (d1, d0)>
module {
  func @test_simplification(%arg0: memref<10x5xf32>){
    %0 = alloc() : memref<10x5xf32, #map0>
    %1 = alloc() : memref<10x5xf32, #map0>
    "test.test"(%arg0, %1) : (memref<10x5xf32>, memref<10x5xf32, #map0>) -> ()
    dealloc %1 : memref<10x5xf32, #map0>
    dealloc %0 : memref<10x5xf32, #map0>
    return
  }
}

Output:

#map0 = affine_map<(d0, d1) -> (d1, d0)>
module {
  func @test_simplification(%arg0: memref<10x5xf32>) {
    %0 = alloc() : memref<5x10xf32>
    %1 = alloc() : memref<10x5xf32, #map0>
    "test.test"(%arg0, %1) : (memref<10x5xf32>, memref<10x5xf32, #map0>) -> ()
    dealloc %1 : memref<10x5xf32, #map0>
    dealloc %0 : memref<5x10xf32>
    return
  }
}

In this case, mlir::replaceAllMemRefUsesWith will bail out since it can’t safely replace the memref’s use on “test.test” (it’s a non-dereferencing use). The two functions in the source code you should be checking out are normalizeMemRefs and replaceAllMemRefUsesWith.
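A minimal sketch of the distinction:

// Dereferencing use: the op indexes into the memref through its layout
// map, so the access and the type can be rewritten together.
%v = affine.load %1[%i, %j] : memref<10x5xf32, #map0>
// Non-dereferencing use: the memref is passed around as an opaque SSA
// value; the utility cannot know how "test.test" will index into it.
"test.test"(%arg0, %1) : (memref<10x5xf32>, memref<10x5xf32, #map0>) -> ()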


Thanks for your comments! I understand now why %1 was not simplified: it is simplified only when it is used by AffineReadOpInterface, AffineWriteOpInterface, AffineDmaStartOp, or AffineDmaWaitOp.

I’m trying to understand more about dereferencing uses, but I’m still not clear.
I think affine.load is one example of a dereferencing use. In this case, can the memref be replaced because it is not used (not referenced) beyond such ops? If possible, could you tell me more about dereferencing uses?

@bondhugula: Our use case is that we have library calls implementing high-level ops such as CONV, LLSM, … which require the data as represented by the map. By definition, these “external” functions will know how to deal with this data. In the simplest terms, we really only need to pass the pointers to the data, plus a separate library-specific descriptor that defines the size of the data. Note that the test functions will both load some data (read) and store some other data (write).

Thanks for the collective inputs provided on these forums, much appreciated.

Would you be able to suggest an approach where we can “register” these library calls (“test.test” in this example) so that they tolerate the maps and do not prevent the lowering of the maps, in the same way as when the “test.test” call is commented out?

Hi @AlexEichenberger @imaihal, someone on my team is actually working on the interprocedural version (a module pass) of memref normalization. It will handle function argument rewriting, call args, and return signature conversion - so it will be comprehensive.

Over here, it looks like all you need is to replace the memref SSA value on your op in spite of it being a non-dereferencing use. This should be pretty straightforward if you need a temporary fix - by patching normalizeMemRef and replaceAllMemRefUsesWith (RAMUW). (Just do a regular use replacement (setOperand) for its use on your “test.test”.)
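With such a patch, the expected result on your example would be along these lines (a sketch; the SSA value and its type on the op are simply swapped for the normalized ones):

%1 = alloc() : memref<5x10xf32>
"test.test"(%arg0, %1) : (memref<10x5xf32>, memref<5x10xf32>) -> ()
dealloc %1 : memref<5x10xf32>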

@AlexEichenberger @imaihal please see here: https://reviews.llvm.org/D84490

@bondhugula

I checked the patch and it implements the advertised functionality well, as shown with the small example below:

#map0 = affine_map<(d0, d1) -> (d0, d1 floordiv 32, d1 mod 32)>
module {
  func @test(%in : memref<5x10xf32>, %out : memref<5x10xf32, #map0>) {
      affine.for %i = 0 to 5 {
          affine.for %j = 0 to 10 {
              %v = affine.load %in[%i, %j] :  memref<5x10xf32>
              affine.store %v, %out[%i, %j] : memref<5x10xf32, #map0>
          }
      }
      return
  }
  func @test_simplification() {
    %0 = alloc() : memref<5x10xf32>
    %1 = alloc() : memref<5x10xf32, #map0>
    //"test.test"(%0, %1) : (memref<5x10xf32>, memref<5x10xf32, #map0>) -> ()
    call @test(%0, %1) : (memref<5x10xf32>, memref<5x10xf32, #map0>) -> ()
    dealloc %1 : memref<5x10xf32, #map0>
    dealloc %0 : memref<5x10xf32>
    return
  }
}

transformed into

module {
  func @test(%arg0: memref<5x10xf32>, %arg1: memref<5x1x32xf32>) {
    affine.for %arg2 = 0 to 5 {
      affine.for %arg3 = 0 to 10 {
        %0 = affine.load %arg0[%arg2, %arg3] : memref<5x10xf32>
        affine.store %0, %arg1[%arg2, %arg3 floordiv 32, %arg3 mod 32] : memref<5x1x32xf32>
      }
    }
    return
  }
  func @test_simplification() {
    %0 = alloc() : memref<5x10xf32>
    %1 = alloc() : memref<5x1x32xf32>
    call @test(%0, %1) : (memref<5x10xf32>, memref<5x1x32xf32>) -> ()
    dealloc %1 : memref<5x1x32xf32>
    dealloc %0 : memref<5x10xf32>
    return
  }
}

where all the maps are eliminated from the load/store and alloc/dealloc.

However, a pattern that we see often is lowering to external implementations not expressed in MLIR (think cuDNN calls or the like). In the above example, if we comment out the call @test lines and uncomment the line with “test.test”, then the pass performs none of the simplifications, as shown below.

func @test_simplification() {
  %0 = alloc() : memref<5x10xf32>
  %1 = alloc() : memref<5x10xf32, #map1>
  "test.test"(%0, %1) : (memref<5x10xf32>, memref<5x10xf32, #map1>) -> ()
  dealloc %1 : memref<5x10xf32, #map1>
  dealloc %0 : memref<5x10xf32>
  return
}

Is there a way to extend the approach to force the normalization through dialect operations?

In our case, these maps were introduced specifically to satisfy the layout expected by the “test” dialect. So we can take full responsibility that accesses within “test.test” will be fine. Happy with either a declarative approach or with a flag for a given dialect.

Sure - I think we should discuss a clean way to support this in a subsequent patch. The mechanics to achieve this are really trivial - not more than a couple of lines. I assume the “test.test” operation you are using is in reality a registered dialect operation. Traits or effects could be one way to model this cleanly.

@abhishek.varma

@bondhugula, you got it: we have a dialect that relies on a custom data layout for its operations, and we would like to continue representing memrefs using the original “logical” indices while hiding the actual projection of the data in a map. That way we preserve the original dimensions of the arrays, which we need at times, while being able to allocate and reference data within MLIR using the projected dimensions.

Operations that have a corresponding operation in that dialect will use that dialect; their functionality is implemented outside of MLIR. Operations that have no corresponding operation in that dialect will be implemented natively in MLIR, and having the memref maps will be very useful to access the data produced by the dialect operations.

This should enable dialects for many custom accelerators that rely on memory with custom data layouts.

Happy to help

Sure - that was exactly the objective behind having layout maps in memref type from the beginning. D84490 is committed now. There is another revision in the pipeline that completes this normalization by handling ReturnOps as well, which is non-trivial. Will be happy to review if you are able to submit one that adds the desired support.

Hi @AlexEichenberger, @imaihal, an update has been made to memref map layout normalization that deals with ReturnOps. Please see here: https://reviews.llvm.org/D85226

Hi, @bondhugula, @abhishek.varma,
Our test case including maps in dialect operations is now successfully normalized by your and @AlexEichenberger’s patch (https://reviews.llvm.org/D86236). Thanks for your help!

We have another requirement: normalizing an affine_map with a dynamic dimension, as in the code below. (I just changed the dimensions of the test code here: https://github.com/llvm/llvm-project/blob/master/mlir/test/Transforms/normalize-memrefs.mlir#L7-L17)

func @permute() {
  %c64 = constant 64 : index
  %A = alloc(%c64) : memref<?x256xf32, affine_map<(d0, d1) -> (d1, d0)>>
  affine.for %i = 0 to %c64 {
    affine.for %j = 0 to 256 {
      %1 = affine.load %A[%i, %j] : memref<?x256xf32, affine_map<(d0, d1) -> (d1, d0)>>
      "prevent.dce"(%1) : (f32) -> ()
    }
  }
  dealloc %A : memref<?x256xf32, affine_map<(d0, d1) -> (d1, d0)>>
  return
}

Currently this is not normalized, but we found you wrote it as a TODO in the comments (https://github.com/llvm/llvm-project/blob/master/mlir/lib/Transforms/Utils/Utils.cpp#L455-L457).
Do you plan to support it?

Hi @imaihal, this isn’t really on our immediate TODO list. Will be happy to help review it if someone takes it up.

Hi @bondhugula, do you think that handling a case where the dynamic dimension is trivially mapped would be an easier stepping stone? See d0 mapping to ? below:

 memref<?x256xf32, affine_map<(d0, d1) -> (d0, d1 floordiv 32, d1 mod 32)>>

I don’t think it’ll make a big difference, or any difference at all. It could be done in one shot for the general case, I think. An extra “symbol” column would be needed in the constraint system for each dynamic dim, and the upper bound obtained subsequently would be an affine function potentially involving symbols (as opposed to just a constant, as was the case for a static memref). It can then be used to construct the allocation for the new memref type. The access replacement logic remains unchanged, right?
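For your permute example, the expected result would be along these lines (a sketch under that extension; the dynamic dimension simply moves to the position the map sends it to):

%A = alloc(%c64) : memref<256x?xf32>
...
%1 = affine.load %A[%j, %i] : memref<256x?xf32>

and the trivially mapped case above would similarly normalize to memref<?x8x32xf32>.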