Recent internal and external discussions about vectorization prompted some deeper poking at the memref type. Historically, memref was one of the first types designed in MLIR and it is the go-to abstraction to allocate memory and interoperate with it.
Unfortunately, the definition of memref has implicit properties baked in that have been surprising to multiple people and need to be addressed. Apologies if this is trivial to some; I have personally been surprised by this.
MemRef Type Definition
The LangRef states:
A memref type is a reference to a region of memory (similar to a buffer pointer,
but more powerful). The buffer pointed to by a memref can be allocated,
aliased and deallocated. A memref can be used to read and write data from/to
the memory region which it references.
A memref is further described using the notion of “Index Space”:
A memref dimension list defines an index space within which the memref
can be indexed to access data.
// Allocates a memref with 2D index space:
// { (i, j) : 0 <= i < 16, 0 <= j < 32 }
%A = alloc() : memref<16x32xf32>
Index
Data is accessed through a memref type using a multidimensional index into
the multidimensional index space defined by the memref’s dimension list.
// Loads data from memref '%A' using a 2D index: (%i, %j)
%v = load %A[%i, %j] : memref<16x32xf32>
In particular, it is unclear how big the region of memory underlying a memref is, how a memref is indexed, or how many bytes of data are allocated by an alloc op. The doc for AllocaOp states:
The amount of memory allocated is specified by its memref and additional
operands.
Additionally, depending on whether the alloc op specifies a memref type with a layout map or not, the actual underlying allocation size can change. Let's ignore this issue here.
Data Layout Basics
MLIR sits at a level of abstraction above LLVM and aims to be more structured. In particular, no pointer type is exposed; instead, memory manipulations occur through the memref type. Consider the following MLIR snippet:
func @bar(%n : index) -> memref<?xf32> {
%m = alloc(%n) : memref<?xf32>
return %m : memref<?xf32>
}
The LLVM IR generated from this (at -O1) resembles:
define { float*, float*, i64, [1 x i64], [1 x i64] } @bar(i64 %0) local_unnamed_addr #0 !dbg !3 {
%2 = shl i64 %0, 2, !dbg !7
%3 = tail call i8* @malloc(i64 %2), !dbg !9
%4 = bitcast i8* %3 to float*, !dbg !10
...
ret { float*, float*, i64, [1 x i64], [1 x i64] } %9, !dbg !16
}
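At -O1 the size computation has already been folded into the shl i64 %0, 2 above (i.e. %n * 4 bytes). Before optimization, the lowering obtains the element size with the classic GEP-on-null idiom, roughly as follows (a sketch only; value names are illustrative):
%elem.gep = getelementptr float, float* null, i64 1   ; address of element 1 from a null base
%elem.size = ptrtoint float* %elem.gep to i64         ; yields sizeof(float) = 4, target-independently
%buffer.size = mul i64 %0, %elem.size                 ; %0 is the dynamic size %n
%mem = tail call i8* @malloc(i64 %buffer.size)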
The key information here is that the GEP instruction is used to compute the size of the type in memory in a target-independent fashion. In particular, with a static shape:
func @bar7() -> memref<7xf32> {
%m = alloc() : memref<7xf32>
return %m : memref<7xf32>
}
generates:
define { float*, float*, i64, [1 x i64], [1 x i64] } @bar7() local_unnamed_addr #0 !dbg !17 {
%1 = tail call dereferenceable_or_null(28) i8* @malloc(i64 28), !dbg !18
%2 = bitcast i8* %1 to float*, !dbg !20
...
ret { float*, float*, i64, [1 x i64], [1 x i64] } %7, !dbg !26
}
Similarly, memory indexing uses GEP:
func @foo(%A : memref<5xf32>) {
%f42 = constant -42.0 : f32
linalg.fill (%A, %f42) : memref<5xf32>, f32
%eA = memref_cast %A : memref<5xf32> to memref<*xf32>
call @print_memref_f32(%eA) : (memref<*xf32>) -> ()
return
}
results in:
define void @foo(float* %0, float* %1, i64 %2, i64 %3, i64 %4) local_unnamed_addr !dbg !27 {
br label %6, !dbg !28
6: ; preds = %5, %6
%7 = phi i64 [ 0, %5 ], [ %9, %6 ]
%8 = getelementptr float, float* %1, i64 %7, !dbg !30
store float -4.200000e+01, float* %8, align 4, !dbg !31
%9 = add nuw nsw i64 %7, 1, !dbg !32
%exitcond.not = icmp eq i64 %9, 5, !dbg !33
br i1 %exitcond.not, label %10, label %6, !dbg !28
...
Execution
Running the above foo example through the mlir-cpu-runner prints:
Unranked Memref base@ = 0x68a3faa2b60 rank = 1 offset = 0 sizes = [5] strides = [1] data =
[-42, -42, -42, -42, -42]
So far so good. Things go awry with types whose size and natural alignment force the data layout to introduce padding. Running the following MLIR:
func @foo5vec3f32() {
%f42 = constant dense<-42.0> : vector<3xf32>
%A = alloc() : memref<5xvector<3xf32>>
linalg.fill (%A, %f42) : memref<5xvector<3xf32>>, vector<3xf32>
affine.for %i = 0 to 5 {
%v = load %A[%i] : memref<5xvector<3xf32>>
vector.print %v : vector<3xf32>
}
return
}
prints the expected:
( -42, -42, -42 )
( -42, -42, -42 )
( -42, -42, -42 )
( -42, -42, -42 )
( -42, -42, -42 )
Problem, Alternatives and Tradeoffs
However, looking a little deeper with the following MLIR:
func @fooviewas5vec3f32() {
%c0 = constant 0 : index
%alloc = alloc() : memref<1000xi8>
%f42 = constant dense<-42.0> : vector<3xf32>
%A = view %alloc[%c0][] : memref<1000xi8> to memref<5xvector<3xf32>>
linalg.fill (%A, %f42) : memref<5xvector<3xf32>>, vector<3xf32>
%B = view %alloc[%c0][] : memref<1000xi8> to memref<15xf32>
%eB = memref_cast %B : memref<15xf32> to memref<*xf32>
call @print_memref_f32(%eB) : (memref<*xf32>) -> ()
return
}
reveals the padding introduced automatically by LLVM (on this target each vector<3xf32> occupies 16 bytes, i.e. 4 f32 slots, so every fourth f32 in the aliased view is padding):
Unranked Memref base@ = 0x145b7fac96a0 rank = 1 offset = 0 sizes = [15] strides = [1] data =
[-42, -42, -42, -1.21979e-12, -42, -42, -42, -1.21979e-12, -42, -42, -42, -1.21979e-12, -42, -42, -42]
The effect is also apparent when compiling the following to (-O1) LLVM IR:
func @bar7vec3f32() -> memref<7xvector<3xf32>> {
%m = alloc() : memref<7xvector<3xf32>>
return %m : memref<7xvector<3xf32>>
}
which generates:
define { <3 x float>*, <3 x float>*, i64, [1 x i64], [1 x i64] } @bar7vec3f32() local_unnamed_addr #0 !dbg !27 {
%1 = call dereferenceable_or_null(128) i8* @malloc(i64 128), !dbg !28
i.e. the buffer backing a memref<7xvector<3xf32>> is 128 bytes on my machine, not the 84 bytes (7 x 3 x 4) a packed layout would give.
In practice, all sorts of transformations, vector masking and shuffling take place under the hood, triggered by the LLVM data layout, and they are completely opaque to MLIR. One immediate consequence is that any MLIR cast operation that changes the underlying memref element type is at high risk of being fundamentally broken.
One could argue that vectors whose size is not a power of 2 are not relevant, but in reality:
- there are interesting performance regimes where non-power-of-2 vector sizes that still divide the memref size can be beneficial;
- the indexing model for memref<...xT> essentially does not support casts unless the type T happens to be naturally aligned according to the data layout. This is a HW-dependent property that is currently not exposed to MLIR.
Alternative 1: Implement / Expose Data Layout to MLIR
This seems like a straightforward answer, but it actually only displaces the problem to MLIR. In the absence of a pointer abstraction, knowing the size of T statically in MLIR does not change the fact that reads and writes through T* will result in padding in memory.
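For instance, even if MLIR computed that a vector<3xf32> packs into 12 bytes, addresses produced through GEP would still follow the data layout. A minimal sketch, assuming a typical x86-64 data layout (value names are illustrative):
; GEP strides by the data-layout allocation size of the element, not by its
; packed size: here <3 x float> occupies 16 bytes, so this advances %base by
; 16 * %i bytes rather than 12 * %i.
%addr = getelementptr <3 x float>, <3 x float>* %base, i64 %i
%v = load <3 x float>, <3 x float>* %addr, align 16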
Alternative 2: Don’t use GEP
The fundamental issue is that GEP and bitcast do not commute:
- GEP is currently used for determining alloc sizes and load/store addresses. This seems natural for indexing by element rather than by address, and seems in line with the “Index Space” definition.
- bitcast is necessary to implement an element type change in LLVM.
An intrusive alternative would be to modify all alloc / load / store ops to ignore the LLVM data layout. This can be achieved by computing the type size in MLIR and always using inttoptr and ptrtoint + bitcast to i8* to compute the address.
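A rough sketch of what such data-layout-ignoring addressing could look like, assuming MLIR decided that the packed size of vector<3xf32> is 12 bytes (value names are illustrative):
; Hypothetical packed addressing of element %i: the byte offset uses the
; MLIR-computed packed size (12 bytes) instead of the 16-byte data-layout size.
%offset = mul i64 %i, 12
%base.i8 = bitcast <3 x float>* %base to i8*
%addr.i8 = getelementptr i8, i8* %base.i8, i64 %offset
%addr = bitcast i8* %addr.i8 to <3 x float>*
%v = load <3 x float>, <3 x float>* %addr, align 4    ; may be unaligned w.r.t. the natural alignment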
This may have far-reaching consequences: I do not deeply understand the potential implications on aliasing and LLVM transformations. In particular, this forces “C struct packing” everywhere, which will result in misalignments. Also, SPIR-V / Vulkan disallows generic bitcast (see Additional Questions).
Alternative 3: Don’t allow casts that change the memref type.
This is essentially what we have in MLIR today. Certain ops perform a limited, opaque manipulation with bitcast under the hood to guarantee safety (e.g. vector.type_cast (whose doc needs to be revamped) and std.view). This is unsatisfactory, as it seems useful to be able to rely on a packed layout in memory more generally.
Alternative 4: Only allow casts between a subset of types
This requires Alternative 1 to be consistent with LLVM. Additionally, it is unclear that this restriction would result in concretely useful ops. TBD.
Alternative 5: Add a “packed” attribute to memref<… x T>
When the attribute is absent, we'd just use the current alloc + indexing based on GEP + the (opaque) LLVM DataLayout.
When the attribute is present, we'd perform our own byte-level, potentially unaligned, addressing and skip GEP.
The advantage here is that there is precedent in C/C++ for struct packing (e.g. __attribute__((packed))).
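For reference, LLVM itself has a related notion at the struct level: a literal packed struct <{...}> drops padding and alignment, while the non-packed equivalent follows the data layout. A small illustration (sizes assume a typical x86-64 data layout):
%padded = type { <3 x float> }             ; allocation size 16 bytes (the vector is padded)
%packed = type <{ float, float, float }>   ; allocation size 12 bytes, alignment 1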
A bitcast-like operation would require packing; the packed attribute could be folded away once information from the DataLayout is available. This may happen late (in the LLVM dialect), depending on whether or not we want to keep the data layout opaque to MLIR.
The usual complexity + YAGNI vs opacity tradeoffs would apply.
Additional Questions
- What should we do at sub-byte granularity (e.g. i6, which some quantization scheme may want)?
- What are the implications for SPIR-V, which has limitations on bitcasts and goes through special op semantics to allow limited and controlled forms of bitcast?
- What are the most important aspects to consider regarding padding, packing and alignment, and what should MLIR do?
- Should we have a pointer type? Should we just always go through view on memref<...xi8>?
- Should we consider padding at the level of a full memref row? For instance, memref<?x?xf32, alignment : 128> generally does not result in 128-byte alignment of each row, except for rows r whose starting offset r * dim#1 happens to be a multiple of 32 f32 elements (128 bytes).
- Some hardware does not have byte addressability or an LLVM backend; what is relevant there?
- Other aspects to consider?
Thanks for reading!
@ftynse, @mehdi_amini, @aartbik, @dcaballe, @bondhugula, @ThomasRaoux, @antiagainst, @MaheshRavishankar, @_sean_silva