MemRef type and data layout

Recent internal and external discussions about vectorization prompted some deeper poking at the memref type. Historically, memref was one of the first types designed in MLIR and it is the go-to abstraction for allocating memory and interoperating with it.

Unfortunately, the definition of memref has implicit properties baked in that have been surprising to multiple people and need to be addressed. Apologies if this is trivial to some; I have personally been surprised by this.

MemRef Type Definition

The LangRef states:

A memref type is a reference to a region of memory (similar to a buffer pointer, 
but more powerful). The buffer pointed to by a memref can be allocated, 
aliased and deallocated. A memref can be used to read and write data from/to 
the memory region which it references. 

A memref is further described using the notion of “Index Space”:

A memref dimension list defines an index space within which the memref 
can be indexed to access data.

// Allocates a memref with 2D index space:
//   { (i, j) : 0 <= i < 16, 0 <= j < 32 }
%A = alloc() : memref<16x32xf32>
Index
Data is accessed through a memref type using a multidimensional index into
the multidimensional index space defined by the memref’s dimension list.

// Loads data from memref '%A' using a 2D index: (%i, %j)
%v = load %A[%i, %j] : memref<16x32xf32>

In particular, it is unclear how big the region of memory underlying a memref is, how a memref is indexed or how many bytes of data are allocated by an alloc op. The doc for AllocaOp states:

The amount of memory allocated is specified by its memref and additional 
operands.

Additionally, depending on whether the alloc op specifies a memref type with a layout map or not, the underlying actual allocation size can change. Let’s ignore this issue here.

Data Layout Basics

MLIR sits at a level of abstraction above LLVM and aims at being more structured. In particular, no pointer type is exposed and instead, memory manipulations occur through the memref. Consider the following MLIR snippet:

func @bar(%n : index) -> memref<?xf32> {
  %m = alloc(%n) : memref<?xf32>
  return %m : memref<?xf32>
}

The LLVM IR generated from this (at -O1) resembles:

define { float*, float*, i64, [1 x i64], [1 x i64] } @bar(i64 %0) local_unnamed_addr #0 !dbg !3 {
  %2 = shl i64 %0, 2, !dbg !7
  %3 = tail call i8* @malloc(i64 %2), !dbg !9
  %4 = bitcast i8* %3 to float*, !dbg !10
  ...
  ret { float*, float*, i64, [1 x i64], [1 x i64] } %9, !dbg !16
}

The key information here is that the GEP instruction is used to compute the size of the type in memory in a target-independent fashion (at -O1 that GEP has already been constant-folded into the shl ... 2 above). In particular:

func @bar7() -> memref<7xf32> {
  %m = alloc() : memref<7xf32>
  return %m : memref<7xf32>
}

generates:

define { float*, float*, i64, [1 x i64], [1 x i64] } @bar7() local_unnamed_addr #0 !dbg !17 {
  %1 = tail call dereferenceable_or_null(28) i8* @malloc(i64 28), !dbg !18
  %2 = bitcast i8* %1 to float*, !dbg !20
  ... 
  ret { float*, float*, i64, [1 x i64], [1 x i64] } %7, !dbg !26
}

Similarly, memory indexing uses GEP:


func @foo(%A : memref<5xf32>) {
  %f42 = constant -42.0 : f32
  linalg.fill (%A, %f42) : memref<5xf32>, f32
  %eA = memref_cast %A : memref<5xf32> to memref<*xf32>
  call @print_memref_f32(%eA) : (memref<*xf32>) -> ()
  return
}

results in:

define void @foo(float* %0, float* %1, i64 %2, i64 %3, i64 %4) local_unnamed_addr !dbg !27 {
  br label %6, !dbg !28

6:                                                ; preds = %5, %6
  %7 = phi i64 [ 0, %5 ], [ %9, %6 ]
  %8 = getelementptr float, float* %1, i64 %7, !dbg !30
  store float -4.200000e+01, float* %8, align 4, !dbg !31
  %9 = add nuw nsw i64 %7, 1, !dbg !32
  %exitcond.not = icmp eq i64 %9, 5, !dbg !33
  br i1 %exitcond.not, label %10, label %6, !dbg !28
...

Execution

Running the above foo example through the mlir-cpu-runner prints:

Unranked Memref base@ = 0x68a3faa2b60 rank = 1 offset = 0 sizes = [5] strides = [1] data = 
[-42,  -42,  -42,  -42,  -42]

So far so good. Things go awry with types whose data layout and natural alignment are subject to padding. Running the following MLIR:

func @foo5vec3f32() {
  %f42 = constant dense<-42.0> : vector<3xf32>
  %A = alloc() : memref<5xvector<3xf32>>
  linalg.fill (%A, %f42) : memref<5xvector<3xf32>>, vector<3xf32>
  affine.for %i = 0 to 5 {
    %v = load %A[%i] : memref<5xvector<3xf32>>
    vector.print %v : vector<3xf32>
  }
  return
}

prints the expected:

( -42, -42, -42 )
( -42, -42, -42 )
( -42, -42, -42 )
( -42, -42, -42 )
( -42, -42, -42 )

Problem, Alternatives and Tradeoffs

However, looking a little deeper with the following MLIR:

func @fooviewas5vec3f32() {
  %c0 = constant 0 : index
  %alloc = alloc() : memref<1000xi8>
  %f42 = constant dense<-42.0> : vector<3xf32>
  %A = view %alloc[%c0][] : memref<1000xi8> to memref<5xvector<3xf32>>
  linalg.fill (%A, %f42) : memref<5xvector<3xf32>>, vector<3xf32>

  %B = view %alloc[%c0][] : memref<1000xi8> to memref<15xf32>
  %eB = memref_cast %B : memref<15xf32> to memref<*xf32>
  call @print_memref_f32(%eB) : (memref<*xf32>) -> ()

  return
}

reveals the padding introduced automatically by LLVM:

Unranked Memref base@ = 0x145b7fac96a0 rank = 1 offset = 0 sizes = [15] strides = [1] data = 
[-42,  -42,  -42,  -1.21979e-12,  -42,  -42,  -42,  -1.21979e-12,  -42,  -42,  -42,  -1.21979e-12,  -42,  -42,  -42]

This is apparent when compiling the following to (-O1) LLVM IR:

func @bar7vec3f32() -> memref<7xvector<3xf32>> {
  %m = alloc() : memref<7xvector<3xf32>>
  return %m : memref<7xvector<3xf32>>
}

which generates:

define { <3 x float>*, <3 x float>*, i64, [1 x i64], [1 x i64] } @bar7vec3f32() local_unnamed_addr #0 !dbg !27 {
  %1 = call dereferenceable_or_null(128) i8* @malloc(i64 128), !dbg !28

i.e. the allocation for 7 vector<3xf32> elements ends up being 128 bytes on my machine, quite a bit more than the 84 bytes (7 x 12) a packed layout would require.
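
As a side note, here is a minimal C sketch of the same per-element effect (the struct names and the 16-byte figure are my assumptions about a typical x86-64 data layout; this only illustrates per-element padding, not the exact malloc'd size above):

#include <stdio.h>
#include <stdalign.h>

/* A 3-float element, once padded out to a 16-byte alignment (as LLVM commonly
   does for <3 x float>), wastes 4 bytes per element. */
typedef struct { float v[3]; } vec3_packed;                /* 12 bytes */
typedef struct { alignas(16) float v[3]; } vec3_padded;    /* 16 bytes */

int main(void) {
  printf("packed element:    %zu bytes\n", sizeof(vec3_packed));     /* 12 */
  printf("padded element:    %zu bytes\n", sizeof(vec3_padded));     /* 16 */
  printf("7 padded elements: %zu bytes\n", sizeof(vec3_padded[7]));  /* 112, vs 84 packed */
  return 0;
}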

In practice, all sorts of transformations, vector masking and shuffling take place under the hood, triggered by the LLVM data layout, and these are completely opaque to MLIR. One immediate consequence is that any MLIR cast operation that changes the underlying memref element type is at high risk of being fundamentally broken.

One could argue that vectors of non-power-of-2 sizes are not relevant, but in reality:

  1. there are interesting performance regimes where it can be beneficial to have non-power-of-2 sizes that still divide the memref size
  2. the indexing model for memref<...xT> essentially does not support casts unless the type T happens to be naturally aligned according to the data layout. This is a HW-dependent property that is currently not exposed to MLIR.

Alternative 1: Implement / Expose Data Layout to MLIR

This seems like a straightforward answer, but it actually only displaces the problem to MLIR. In the absence of a pointer abstraction, knowing the size of T statically in MLIR does not change the fact that reads and writes through T* will result in padding in memory.

Alternative 2: Don’t use GEP

The fundamental issue is that GEP and bitcast do not commute:

  • GEP is currently used for determining alloc sizes and load/store addresses. This seems natural for an indexing by the element rather than by the address, and seems in line with the “Index Space” definition.
  • bitcast is necessary to implement an element type change in LLVM.

An intrusive alternative would be to modify all alloc / load / store lowerings to ignore the LLVM data layout.
This can be achieved by computing the type size in MLIR and always using inttoptr and ptrtoint + bitcast to i8* to compute the address.

This may have far-reaching consequences, as I do not deeply understand the potential implications on aliasing and LLVM transformations. In particular this forces “C struct packing” everywhere, which will result in misalignments. Also, SPIR-V / Vulkan disallows generic bitcast (see Additional Questions).
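
For concreteness, a rough C sketch of what such byte-level addressing amounts to for a vector<3xf32> element (the helper name and the hard-coded 12-byte element size are illustrative assumptions; the actual lowering would use ptrtoint / inttoptr in LLVM IR):

#include <string.h>

/* Index by a packed element size computed in MLIR (12 bytes for 3 x f32)
   instead of the target's padded size; memcpy stands in for a potentially
   unaligned load through an i8* view of the buffer. */
static void load_vec3f32_packed(const void *base, long i, float out[3]) {
  const char *bytes = (const char *)base;   /* the "bitcast to i8*" step */
  memcpy(out, bytes + i * 12, 3 * sizeof(float));
}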

Alternative 3: Don’t allow casts that change the memref type.

This is essentially what we have in MLIR today. Certain ops perform a limited opaque manipulation with bitcast under the hood to guarantee safety (e.g. vector.type_cast (whose doc needs to be revamped) and std.view). This is unsatisfactory, as it seems useful to be able to generally rely on a packed layout in memory.

Alternative 4: Only allow casts between a subset of types

This requires #1 to be consistent with LLVM. Additionally, it is unclear that this restriction will result in concretely useful ops. TBD.

Alternative 5: Add a “packed” attribute to memref<… x T>

When the attribute is absent we’d just use the current alloc + indexing based on GEP + (opaque) LLVM DataLayout.
When the attribute is present we’d perform our own byte-level, potentially unaligned, addressing and skip GEP.
The advantage here is that there is precedent in C/C++ for struct packing (e.g. __attribute__((packed))).
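
For reference, the C precedent mentioned above, using the GCC/Clang spelling of the attribute:

#include <stdio.h>

struct natural { char c; int i; };                          /* typically 8 bytes: 3 bytes of padding after c */
struct __attribute__((packed)) packed { char c; int i; };   /* 5 bytes: i may end up unaligned */

int main(void) {
  printf("%zu %zu\n", sizeof(struct natural), sizeof(struct packed));  /* e.g. "8 5" */
  return 0;
}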

A bitcast-like operation would require packing; the packing attribute could be folded away once we have information from the DataLayout. This may happen late (LLVM dialect) depending on whether or not we want to keep it opaque to MLIR.

The usual complexity + YAGNI vs opacity tradeoffs would apply.

Additional Questions

What should we do at the sub-byte granularity (e.g. i6 because some quantization scheme may want that)?
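
For illustration, a strawman C sketch of what densely packed sub-byte elements would imply at the byte level (this is just one possible packing convention, not something MLIR specifies):

/* Load the i-th 6-bit element from a densely packed buffer, packed LSB-first
   with no padding. A 6-bit element can straddle a byte boundary, so two bytes
   are read (a real implementation would guard the read at the buffer tail). */
static unsigned load_i6(const unsigned char *buf, unsigned long i) {
  unsigned long bit = 6 * i;
  unsigned word = buf[bit / 8] | ((unsigned)buf[bit / 8 + 1] << 8);
  return (word >> (bit % 8)) & 0x3Fu;
}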

What are implications on SPIR-V which has limitations on bitcasts and also goes through special op semantics to allow limited and controlled forms of bitcast?

What are the most important aspects when considering padding, packing and alignment and what should MLIR do?

Should we have a pointer type? Should we just always go through view memref<...i8>?

Should we consider padding at the level of a full memref row? For instance, memref<?x?xf32, alignment : 128> generally does not result in 128-byte alignment of every row, only for rows whose index happens to satisfy a favorable divisibility condition with dim #1.
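
A quick worked example of that point, assuming rows of 37 f32s stored contiguously from a 128-byte-aligned base (the numbers are purely illustrative):

#include <stdio.h>

/* With a row stride of 37 * 4 = 148 bytes, a row starts on a 128-byte boundary
   only when its index is a multiple of 32: 148 * r % 128 == 0 iff 37 * r is a
   multiple of 32, and 37 is coprime with 32. */
int main(void) {
  for (int r = 0; r < 64; ++r)
    if ((r * 37 * 4) % 128 == 0)
      printf("row %d starts 128-byte aligned\n", r);  /* prints r = 0 and r = 32 */
  return 0;
}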

Some hardware does not have byte addressability or an LLVM backend, what is relevant there?

Other aspects to consider?

Thanks for reading!

@ftynse, @mehdi_amini, @aartbik, @dcaballe, @bondhugula, @ThomasRaoux, @antiagainst, @MaheshRavishankar, @_sean_silva


Great post. A comment and suggestions below.

This may be the case of most MLIR dialects today, but it does not have to be the case in general. Lower level abstractions and alternative memory abstractions can be designed and even made compatible with memrefs if needed.

Now, what about an Alternative 5, similar to Alternative 1 and resurrecting old discussions: a lower-level buffer abstraction backing up any memref type. The alloc op would have an optional buffer attribute pointing to a buffer value. A buffer would have ?xi8 type and no implicit padding. We would need an additional op querying for a memref’s buffer and for its properties. When the buffer attribute is missing, a canonical one is provided when querying for its properties (the canonical way is to be defined, probably target-specific). Some buffer sizes may be restricted on targets where i8 addressing/sizing is not possible/desirable; this may be implemented when lowering the buffer or at verification time via a target hook. A memref would always be associated with an underlying buffer; bitcasting the memref would not change the buffer; size constraints (buffer size >= the size implied by the memref w/ padding) would have to be verified. This would help solve two problems we have been struggling with from the beginning:

  • memref aliasing, by attaching alias set attributes to buffers rather than memrefs, which reduces bloat via one extra level of indirection;
  • provide complete support for layout maps, with verifiers and lowering code for all situations, through those buffers.

This is certainly far from complete and missing significant aspects of the question, but I think it helps solve the “bitcast and getelementptr do not commute” thing while providing extra power to memrefs in the future.

Hey Nicolas!

Thanks for bringing this up! Great analysis! I’m afraid I cannot be very helpful in this topic but this problem is also blocking our vectorization efforts so, please, let me know if I can help somehow.

Alternative 3: Don’t allow casts that change the memref type.

vector.type_cast and std.view are very powerful but they also introduce strong constraints. The latter, for example, requires the source memref to be 1D and i8. I think this would have important implications all over the place, as we discussed in [1]. Any function using std.view on input memrefs would require them to be 1D and i8, we would lose the original shape information in the function, etc. That sounds concerning to me.

[1] Understanding the vector abstraction in MLIR - #2 by mehdi_amini

In this regard, I have a naïve question that has been around for a while about what a vector is actually modeling when it is used in a memref. It was briefly discussed in [1] but I think it’s somehow also relevant here. Some questions that come to mind: Is %a = alloc() : memref<8xf32> actually different from %b = alloc() : memref<2xvector<4xf32>>? Does it mean that we are not allowed to read elements [2, 5] from %b with a single packed vector<4xf32> load, or read the 8 elements in %b with a single packed vector<8xf32> load? Someone may answer that a vector memref is describing contiguous elements in memory but, don’t we have memref strides to model that?

I wonder if we could simplify this problem by separating memory representation aspects (memrefs) from how elements are read or written from/to memory (scalar/vector load/store ops, gather/scatter ops, …). This would imply that vector wouldn’t be allowed as a memref element type and memory load/store operations would encode the scalar/vector information related to the memory operation. We actually have operations already that take in a “non-vector” memref and perform a vector load/store on them, where the vector information is encoded in the operation and not in the memref:

    %0 = alloc() : memref<8xf32>
    %1 = vector.transfer_read %0[%c0], %cst : memref<8xf32>, vector<4xf32>
    %2 = affine.vector_load %0[%c0] : memref<8xf32>, vector<4xf32>

There would probably be many other implications that I’m missing but maybe it’s another alternative to consider.

As far as I understand: it could very well be possible on a platform that the minimum alignment for vectors is higher than 16B; for example, if it were 32B, then the second alloc would have 16B of padding between the first and the second vector in memory.
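
A small C sketch of that hypothetical platform (the 32B alignment requirement is the assumption here, not a property of any particular target):

#include <stdio.h>
#include <stdalign.h>

/* If vector<4xf32> had to be 32-byte aligned, storing two of them as memref
   elements would take 64 bytes (16 bytes of padding after each vector),
   whereas memref<8xf32> is always 8 * 4 = 32 contiguous bytes. */
typedef struct { alignas(32) float v[4]; } vec4_align32;

int main(void) {
  printf("%zu\n", sizeof(vec4_align32[2]));  /* 64 */
  printf("%zu\n", sizeof(float[8]));         /* 32 */
  return 0;
}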


Thanks for bringing this up Nicolas! I recently ran into this with a miscompile due to a memref<3xi1>. This question is also quite timely, as @jurahul will have to deal with this when lowering std.global_memref to LLVM.

Probably upstream passes will have to legalize tensor<i6> to tensor<i8> at the tensor level if they want to run on hardware that doesn’t have native i6 support. E.g. std.addi %lhs, %rhs : tensor<i6> is pretty easy to legalize to tensor<i8>. But once we have expanded memrefs and the addi : i6 is lost somewhere inside loops, it’s probably too late. (especially because legalizing i6 to i8 across the whole program will create lots of opportunities to optimize the legalized code, and doing that at the tensor level seems much easier)

Similarly, passes creating memrefs with vector element type probably need some awareness of what the code they are creating implies on the target (would love to hear more about what passes create memrefs with vector element type!).


For non-byte addressable hardware, perhaps I can just sketch what one (simplified) example of such hardware looks like and see if that stimulates any ideas.

One example that is fairly natural from a hardware implementor’s perspective (I know of multiple production chips that are like this) is that each “vector lane” is basically its own little processor (or processing element, “PE”) with its own scalar register file and scalar memory. When you do something like

%0 = load %localmem[%c3] : memref<5xvector<256xf32>, "my fastest local memory">
%1 = addf %0, %0 : vector<256xf32>

really what happens at the hardware level is that each PE does a scalar memory load (at scalar index %c3) from its local scalar memory into its local scalar registers. Then, each PE does a local scalar addf inside its local register file. As such, at the hardware level, there is really no notion of a “vector being in memory in a way that corresponds to a contiguous byte buffer”. Really what you have is 256 separate scalar memories that can only be indexed by the PE that they reside in, or by some external agent via an asynchronous memory-to-memory dma_start-like command.

Notice that nowhere is %c3 * %vector_size_in_bytes ever a useful quantity.

So ideally we can keep a layering here where we can talk about the abstract elementwise access to a mutable, indexable buffer with reference semantics that doesn’t correspond to a particular linearization to scalars (or worse, bits/bytes!). And then at some point we can make a concrete decision when lowering about what is possible to lower in a “well behaved” way.

Btw, LLVM doesn’t handle this type of architecture well. Basically what we have to do is to hack up the GEP lowering to somehow bypass the “index * element_size_in_bytes” calculation for such vector loads and just keep “index”. For N>1 backends that I know of which do this, the correct LLVM IR for MLIR to emit is still to emit a GEP “as if there was a contiguous byte buffer” :frowning: and make sure that you (or LLVM’s optimizations) don’t actually create any code that actually tries to operate as if there was a contiguous byte buffer.

Edit: Added Alternative #5 above for a packed attribute to memref, which may end up being the least intrusive.

First of all, thanks for raising this subject!

I think this is a great opportunity to adapt memref to the evolution of MLIR that happened after it was created, and also to answer some recurrent feature requests about memref, namely what properties a type should satisfy to be suitable as a memref element (often formulated as a need for memref-of-memref, or memref-of-custom-type).

One comment is that allocation uses GEP because it’s a common LLVM equivalent of sizeof - see http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt - and that your example with -O1 actually doesn’t have a GEP because it was constant-folded.

I would really like if we considered Alternative 1 in more detail. This is something that we need anyway, let’s try to make most use of it. In one of the recent threads on memref-of-memref, I suggested to introduce a type interface MemRefElement that at the very least provides the allocation size of the type. We can further build on this to also define alignment and padding properties of the type, which seem to be the main source of problems in your examples. Arguably, MLIR currently uses LLVM data layouts wrong: the data layout could be added after some transformations that it should have affected.

Building further on this, memref itself can have alignment and padding rules, namely for alignment of n-dimensional memrefs in order to ensure that the requirements of the element type are respected. Incidentally, this will enable memref-of-memref that folks often request.

FWIW, I think that giving a clear, albeit complex, definition of the layout and how it affects the allocation size will make memref design cleaner. We already have a pattern for the addressing scheme, which isn’t trivial either, utilized in load/store lowering. Having another formula would just make the memref lowering more consistent, and the relation of the indexing scheme to the underlying memory storage is evident.

Can you please elaborate on what is the problem you are referring to here:

The type interface need not be limited to type size, but can contain any information necessary for constructing the representation of the type in memory.

I think that given a sufficiently clear definition of the layout, we can either update the indexing scheme to be less surprising, or at least provide a better explanation of why it works the way it does. E.g., I know about padding in C structures, so I’m not surprised to see garbage when I look at a structure byte-by-byte. There is no reason why this shouldn’t apply to MLIR. So having an equivalent of reinterpret_cast or bitcast that doesn’t guarantee contiguity is not all that surprising to me.

(Fun C++ trivia: it is possible to construct types such that (reinterpret_cast<intptr_t>(ptr1) != reinterpret_cast<intptr_t>(ptr2) && ptr1 == ptr2) holds for two pointers.)

Another aspect to consider: in the current memref descriptor, the offset is represented as a number of elements from the base. This may lead to weird situations where one wants to slice a big buffer into parts in the absence of alignment requirements – a memref may end up needing a fractional offset. For example, a part with 3 i8 immediately followed by a part with f32s would need an offset of 3/4 from their common base pointer.
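
A tiny C sketch of that situation (buffer size and names chosen for illustration only):

/* Slicing one buffer into a 3 x i8 part immediately followed by f32 elements,
   sharing a single base pointer, with offsets counted in elements as in the
   current descriptor. */
void slice_example(void) {
  static char buffer[16];
  char  *i8_part  = buffer;                 /* i8 elements at bytes 0..2 */
  float *f32_part = (float *)(buffer + 3);  /* byte 3 == 3/4 of an f32 element: fractional, and misaligned */
  (void)i8_part;
  (void)f32_part;
}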

I agree with @ftynse in general here.

One challenge here is that this isn’t a static “absolute” property of the type, but a property of the type “in a particular environment” (or target, modeled by the DataLayout structure in LLVM).
No reason we can’t design something flexible enough here though.

Thanks Nicolas for starting this! This is really helpful. Some thoughts from Vulkan/SPIR-V specifically and GPU generally.

Regarding pointers, for Vulkan/SPIR-V, pointer usage is restricted and layered: different usages of pointers will require different extensions/capabilities. The most basic Vulkan-flavored SPIR-V adopts the logical addressing mode. It has pointer types, but the operations that can work with the pointer type are very limited; namely, variables/function parameters for declaring memory objects, access chains (like LLVM GEP) for walking through the pointee type structure, load/store, atomic ops, memory copy ops. That’s basically it. One cannot store pointers inside variables (that will require the VariablePointers capability from the SPV_KHR_variable_pointers extension). One also cannot cast pointers to/from integers (which requires other capabilities, e.g., PhysicalStorageBufferAddresses from the SPV_KHR_physical_storage_buffer extension) or do pointer arithmetic. This layered support gated by extensions/capabilities is there to satisfy the diverse landscape of GPUs out there, but it does mean additional complication. The logical addressing mode means that at the SPIR-V level, all “addressing” is considered element offset/indexing, not concrete byte addressing in the memory buffer. The place where the concrete data layout kicks in is as decorations on the buffer type. Decorations in SPIR-V can be roughly thought of as LLVM metadata, but semantics-impacting. The decorations on the buffer type provide detailed information regarding how each field should be laid out, and they should ABI-match the GPU API side providing the memory buffers. The layout decorations are consumed by GPU driver compilers that translate SPIR-V into hardware ISA. So essentially in SPIR-V land, data layout consistency is scattered across multiple parties: across kernel and runtime, across SPIR-V generator and GPU driver compiler. There is a trend to open up more and more flexible pointer usage, and the example extensions listed above are getting better and better adoption. So even in the SPIR-V representation itself we’ll likely see more and more of the issues listed here for LLVM become directly applicable to SPIR-V too.

Now, given that we are rethinking memref and data types and trying to address the issue in a more principled way, I’d also like to use the opportunity to clear my head regarding the path forward for certain exotic “buffers” that are related. On GPU, there is another interesting storage type to consider: textures/images. They have a completely opaque underlying data layout (to be exact, they can be in a mode with a linear data layout like a normal buffer for transferring data in/out, but when they are really used for computation, they are typically in some sort of driver compressed/optimized format like z-order or so). This storage type is typically used for image sampling and can be of great importance to ML vision models because of convolutions, pooling, etc. They can also have great cache support, so they are possibly useful for storing normal buffers too. The way one works with these types is via special intrinsics to read from / write to them. In SPIR-V, that’s like OpImageRead/OpImageWrite and others. On CUDA I think similar things exist. So speaking of textures/images, I think they work well with the restricted memref representation with special load/store/cast ops, if we were to model them with memref.

I agree with @ftynse that this might be a great opportunity to define the boundaries of memref, which will make things clear, like whether we can “overload” it more to support GPU textures/images (and cases like sparse tensors, which also have exotic data layouts). memref is quite load-bearing. It’s already vertically integrating a great portion of the whole stack (it’s basically the only existing buffer abstraction other than those in specific targets). To me, introducing data layout to it makes it more overloaded and integrates it even further, maybe also horizontally this time, by covering all the different storage types, including normal buffers and exotic “buffers”, with one unified layout. The downside I can see is that now every pattern/pass/etc. needs to consider all the cases supported by memref, even those they don’t care about. It can be fragile and error-prone. I’m wondering whether this is leaving the multi-level power untapped and whether it makes sense to break it down a bit like what @albertcohen suggests. Maybe leave memref as “logical” and have special ops for working with it, and introduce other types that concern data layout? Then we can probably have different types for normal buffers, sparse tensors, GPU textures/images, and others. memref (or whatever it is called) can be a level above them. To me, that has a clear separation of concerns, and supporting one kind of data-layout-aware buffer does not require all of them to take action.

Just my two cents. I can certainly be way off here. :slight_smile:

I understand it may be the most practical definition of memref of vector type today, but I don’t see it as future-proof. If there are target-specific requirements that prevent storing memref<2xvector<4xf32>> as contiguous 32B, then this type should be rejected for those target-specific reasons. The vectorizer or whatever pass that crafted this type should not have done it in the first place, and this can be easily verified as soon as target information is available.

Preventing leakage of implementation details into higher level abstractions pushes for making memref of vector bitcasting size- and layout-invariant.


Indeed. I was thinking along the lines of adding a layout attribute to the module, and having the related interface methods require an instance of this attribute as argument in order to compute the size/layout/padding. Another challenge here is to make sure this scales to the open type system. Having a map of “type → layout info” pairs works only for specific instances of a type. We would also need some way of specifying generalization rules such as “integers take the layout properties of the first concrete integer type for which they are defined in the attribute”.

I personally wouldn’t try pushing memref onto, e.g., texture memory, precisely because of the different addressing mode. (As a comment on the top part of @antiagainst’s post, CUDA indeed has texture memory as a special class that is accessed through special C++ class templates with overloaded operators and/or custom functions, which support floating point indices). Having a load and store that support “normal” memrefs and a texture_load that supports “texture” memrefs, without being able to “reinterpret” one into the other, doesn’t sound very enticing. I would suggest considering the tradeoff between the functionality one can get “for free” (or at least for cheap) by reusing the memref type vs. the additional complexity of the new cases existing code would need to handle. For texture memory, I don’t think there is much that can or should be reused: loads/stores are different, allocation is different, analyses don’t necessarily apply because of the different addressing mode, which IMO justifies a distinct type. There are other ways of sharing the code, or common parts of the functionality. If we want to reuse the nD addressing scheme, we can factor it out into a type interface + op interface, for example.

The consideration of the tradeoff above also applies to introducing a layout. We will have to change the lowerings of allocation/load/store to account for that. Since we disallow internal aliasing in the spec, i.e. two distinct tuples of indices should not point to the same element, analyses need not be modified. This sounds on-the-edge to me, but given the omnipresence of memrefs, I would be on the side of modifying the existing type as opposed to introducing a new one.

I am concerned that this will only push the problem one level above. With “logical” memrefs, we won’t be able to reason about, e.g., allocation or actual access patterns (stride-1/contiguous vs. small-stride vs. unknown stride, for example, is important for vectorization decisions). So we will have to lower them to a lower level abstraction, or lift components of this abstraction to “logical” memrefs, which will likely bring us back to where we are now.

I don’t think that we in general can avoid having this information at the high level. Normally, I would expect the frontend to provide the layout information that the passes should use. Injecting the layout information in the middle of the transformation process comes with the danger that some previous passes already made assumptions about the layout, which could have been inconsistent with what is injected. In LLVM, it’s clang that sets up the data layout field when creating the module…

That being said, I see the current memrefs essentially as templates for address computation analyzable at compile time. Today, for strided memrefs, the pattern is *(alignedPtr + offset + \sum_i index_i * stride_i). This does not necessarily match how the data is stored in memory, so we are reconsidering the pattern to be something like *reinterpret_cast<Ty*>(reinterpret_cast<char *>(alignedPtr) + offset * sizeof(Ty) + \sum_i index_i * align-corrected(stride_i) * sizeof(Ty)). As long as the template is well specified, I would be okay with keeping it inside the memref + load/store.
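
Spelled out in C for clarity (the function and parameter names are mine and purely illustrative of the two templates above):

#include <stddef.h>

/* Current template: offset and strides counted in elements of the type. */
static float *addr_in_elements(float *alignedPtr, ptrdiff_t offset,
                               const ptrdiff_t *index, const ptrdiff_t *stride,
                               int rank) {
  ptrdiff_t linear = offset;
  for (int i = 0; i < rank; ++i)
    linear += index[i] * stride[i];
  return alignedPtr + linear;
}

/* Byte-based template: strides corrected for the element's alignment/padding,
   applied through a char* so that inter-element padding can be expressed. */
static float *addr_in_bytes(float *alignedPtr, ptrdiff_t offset,
                            const ptrdiff_t *index,
                            const ptrdiff_t *alignCorrectedStride, int rank) {
  char *p = (char *)alignedPtr + offset * (ptrdiff_t)sizeof(float);
  for (int i = 0; i < rank; ++i)
    p += index[i] * alignCorrectedStride[i] * (ptrdiff_t)sizeof(float);
  return (float *)p;
}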

Good points, thanks! I see the problem now. Indeed, if a memref is non-vector, I don’t think the vectorizer can just automatically turn it into a vector memref, given the aforementioned implications. Actually, alignment implications also bring up another problem: regardless of the target’s alignment constraints, we should be able to represent unaligned vector loads/stores in MLIR. If we focus on the Standard dialect, the current memref model and the use of vector memrefs, I can’t think of an easy way to represent them. I’m probably missing something. Maybe you could educate me in this regard.

Perhaps this is trivial but let me elaborate a bit more on the alignment problem with an example using LLVM. Let’s say we have the following scalar code and we want to vectorize it:

void foo(float *restrict a, float *restrict b, int N) {
#pragma clang loop vectorize(enable) vectorize_width(8) interleave(disable)
  for (int i = 0; i < N; i++)
    a[i] = b[i] + b[i+7];
}

After vectorization, the memory accesses on b look like this:

  // b[i]
  %0 = getelementptr inbounds float, float* %b, i64 %index
  %1 = bitcast float* %0 to <8 x float>*
  %wide.load = load <8 x float>, <8 x float>* %1, align 4, !tbaa !2

  // b[i+7]  
  %2 = or i64 %index, 7
  %3 = getelementptr inbounds float, float* %b, i64 %2
  %4 = bitcast float* %3 to <8 x float>*
  %wide.load15 = load <8 x float>, <8 x float>* %4, align 4, !tbaa !2

The address computation happens on the original scalar type (float), then the address is bitcast to a vector address (<8 x float>*) which is finally loaded. Note that this happens regardless of the alignment of &b[i] and &b[i+7] and the alignment constraints of the target. The target will lower each load appropriately based on the ISA (single unaligned vector load, multiple aligned vector loads + combining instructions, scalar loads + combining instructions, etc.). Is there a simple way to represent this example with the current memref model and the Std dialect?

My impression is that whereas memory allocation, address computation and memory operations are well decoupled in LLVM, the higher level of abstraction in MLIR (e.g., the Std dialect) keeps these three concepts too tightly coupled. This is great for other reasons, but currently it doesn’t allow us to describe the address computation in terms of the original memref scalar element type and then perform the load/store using a vector address (bitcast). As I mentioned before, we can currently use affine.vector_load/store and vector.transfer_read/write ops to model exactly this “scalar address computation + bitcast to vector address” and overcome this limitation in the Affine and Vector dialects, respectively. Not sure if that’s the approach that we want to generalize, but I think we should also address this problem as part of the new memref model.

I don’t quite get your point here: it seems to me that preventing memref<2xvector<4xf32>> from being created is also leaking implementation details into the layer above somehow.

We would need to define the notion of scalar type on memrefs that can “see through” other types in order to support this in the address computation scheme. We may end up needing this anyway if we want memref-of-memref, but it introduces extra complexity so I am not ready to push one way or another.

That being said, I think we can represent this today if we avoid jumping from C to vectorized LLVM.

scf.for %i = ... {
  %b0 = load %b[%i] : memref<?xf32>
  %b1 = load %b[%i + 7] : memref<?xf32>
  ...
}

gets vectorized by adding casts and views with offsets

scf.for %i= ... step 8 {
  %b0view = subview %b[0][][] : memref<?xf32>
  %b0viewvector = vector.type_cast %b0view : memref<?xf32> to memref<?xvector<8xf32>>
  %b0vector = load %b0viewvector[%i / 8]
  %b1view = subview %b[7][][] : memref<?xf32, offset: 7, strides: [?]>
  %b1viewvector = vector.type_cast %b1view : memref<?xf32, offset: 7, strides: [?]> to memref<?xvector<8xf32>>
  %v1vector = load %b1viewvector[%i / 8]
}

Note that the “+7” moved from the load to the view. This may or may not pass the verifier right now, but we should be able to relax this. The trick to work around vector alignment issues is to take the subview with offset every time.

Then we need a strong canonicalizer to simplify all this away. (I’m not sure if LICM considers subview, maybe it should not).

The thinking was about exposing the problem and getting a solution for Diego without further delay. In the fullness of time, exposing DataLayout to MLIR is indeed the best way forward. I happen to know you have started looking deeper at this so that we can have an MLIR DataLayout and generate the LLVM DataLayout, so big +1 from me :slight_smile:.

There should not be such blanket limitations, there is no conceptual problem in having memref<T>, the only problem is whether a bitcast is legal or not. As explained previously, a bitcast is fine IFF the layout and addressing scheme do not introduce padding. If the layout would introduce padding, it is still possible to make bitcast + addressing work by going back to byte-level addressing.

The cases I see are:

  1. if the natural representation of T on the HW does not require padding, all is good with the existing addressing scheme.
  2. if the natural representation of T on the HW requires padding, and memref is marked as packed, and the HW allows loading at the needed granularity / alignment then all is good with a packed addressing scheme.
  3. if the natural representation of T on the HW requires padding, and memref is not marked as packed, and the HW allows loading at the needed granularity / alignment then bitcast scalar → vector is invalid but we can still reason about bitcast vector → scalar. This is what we currently have in MLIR.
  4. if the HW has more difficult loading granularity / alignment then I expect this will be captured by the padding needs of T and reduce to one of the previous cases

All these will be available from exposing DataLayout to MLIR; for now we do not even have the information needed to determine whether we need the byte-level indexing scheme and whether bitcast is legal. Case 3 is a conservatively correct approach that can be inefficient, same for case 2 (with an extra “packed” attribute).

MLIR does not currently have enough information to handle case 1 or case 4 in a generic fashion.

Thanks for the example, Alex! I’m glad we have a way to represent unaligned vector memory ops. I’m concerned, though, that we have to go into subviews to model something as simple as an arbitrary vector load or store. As you mentioned, it could make reasoning about memory accesses more difficult since the address computation spans multiple ops.

Regarding the vector.type_cast op, according to the doc, “Performs a conversion from a memref with scalar element to a memref with a single vector element, copying the shape of the memref to the vector”, so I think we cannot use it to convert a scalar memref to a vector memref with the arbitrary vector shape that we want. Also, @nicolasvasilache and @bondhugula discussed [1] other issues related to converting scalar memrefs to vector memrefs so it doesn’t seem that simple.

Thanks, Nicolas! I think there are 3 problems we have to address in the Standard dialect related to the Vectorizer and the Affine/Vector front:

  1. Padding/alignment issues related to the allocation of vector memrefs (described in your first post).
  2. Representation of vector memory ops on memrefs allocated using scalar element types (unaligned vector load example).
  3. Representation of unaligned vector memory ops (unaligned vector load example).

IIUC, having an MLIR DataLayout would solve #1 but not #2 and #3. Also, if we had #2 and #3, we would minimize #1 scenarios since we wouldn’t have to necessarily use vector memrefs to perform vector memory ops, as we have to do in the Standard dialect today.

[Sorry if I’m repeating myself below :slight_smile: ]

Current semantics of affine.vector_load/store ops [2] (also vector transfer ops) nicely address #2 and #3 by:

%1 = affine.vector_load %0[%i0 + 3, %i1 + 7] : memref<100x100xf32>, vector<8xf32>
  • Allowing scalar memrefs as inputs (%0 is memref<100x100xf32>).
  • Performing address computation based on the element type of the input memref (f32).
  • Bitcasting the computed address to the second type provided (vector<8xf32>) before performing the load/store.

IMO, this approach is the high-level equivalent of what LLVM does and it provides the level of representation that we need to solve #2 and #3. It would be great to know about cons/problems that you may see if we extended this approach to loads/stores in the Standard dialect. Some examples:

// Aligned/unaligned vector loads/stores can be performed on scalar memrefs.
// They both have a simple representation. Vector loads/stores from the vectorizer
// can be lowered to this representation.
scf.for %i= ... step 8 {
  %idx = addi %i, %c7 : index
  %v0 = load %b[%i] : memref<8000xf32>, vector<8xf32>
  %v1 = load %b[%idx] : memref<8000xf32>, vector<8xf32>
}
// Vector loads/stores on vector memref can also be represented.
// Address computation is based on the memref vector type (`vector<8xf32>`),
// taking into account HW constraints as it happens today.
scf.for %i= ... step 1 {
  %v0 = load %b[%i] : memref<?xvector<8xf32>>, vector<8xf32>
  ...
}
// No major changes for scalar loads/stores. We could avoid printing the second
// op type when it's the same as the memref element type.
scf.for %i= ... step 1{
  %b = load %b[%i] : memref<?xf32>, f32
  ...
}

Thoughts?
Thanks,

Diego

[1] Steps towards generalizing vectorization in Affine - #10 by bondhugula
[2] 'affine' Dialect - MLIR

I don’t see an issue with allowing vector loads from a scalar memref. I assume that these loads would be aligned on the alignment of the scalar type without more information, right?

I would restrict these to the fastest varying dimension though and forbid something like: load %buf[0][0] : memref<4x3xf32>, vector<12xf32>

IIUC this would not be adding any assumption other than that a sequence of elements in memory has the same layout as a vector of the same number of elements.

Is this enough for what you need right now?

Sorry for the delay, EoY crunch + other priorities compete on cycles :stuck_out_tongue:

Let’s go with it then; this is what already happens under the hood in the vector.transfer → llvm.load conversion, so lifting it one dialect up and making it available to others is fine.

The fastest varying dimension restriction that @mehdi_amini proposes is also what the vector.transfer lowering does today, so let’s just keep that for now. The element type of the buffer and the vector should be the same.

You just need to decide what the semantics at the boundary are: load %buf[%c0] : memref<?xf32>, vector<4xf32>. At this level of abstraction I wouldn’t worry about reproducing something like the masked attribute in vector.transfer, just let it read/write out of bounds and put the burden on the higher levels to use it properly.

Thanks @dcaballe !

My concerns with vector loads from scalar memrefs are with the semantics of this operation when we go beyond the trivial case. What are the semantics of a partially out-of-bounds read, for example?

// Statically analyzable case, can see the problem.
std.load %memref[%c0] : memref<7xf32>, vector<8xf32>
// Can be out-of-bounds in a different direction with %i < 0,
// still partially analyzable at compile time,
std.load %memref[%i] : memref<7xf32>, vector<8xf32>
// Can't really tell, but we may be reading out-of-bounds.
std.load %memref[%i] : memref<?xf32>, vector<8xf32>

What happens when the read is not technically out of bounds, but the data layout makes it weird:

// Is the last element of vector guaranteed to be %memref[%c1, %c0]?
// Does it depend on data layout?
// Is it undefined/disallowed?
std.load %memref[%c0, %c0] : memref<2x7xf32>, vector<8xf32>

Can we read nD vectors, and how does that interact with the data layout?

// What do we actually load if the rows are padded to 4 elements
// to ensure each row is aligned?
std.load %memref[%c0, %c0] : memref<3x3xf32>, vector<3x3xf32>

How do we deal with non-trivial memref layouts?

// Does the result contain %memref[0], %memref[1], %memref[2], %memref[3]
// or does it contain garbage at positions 1 and 3 (zero based)?
// If the former, this isn't truly a vector load anymore.
// Is it even legal?
std.load %memref[%c0] : memref<4xf32, offset: 0, strides: [2]>
                        vector<4xf32>
// Which dimension do we read along?
std.load %memref[%c0, %c0] : memref<?x?xf32, (d0,d1)->(d1,d0)>
                             vector<4xf32>
// I don't even have an idea what to do here.
std.load %memref[%c0] : memref<?xf32, (d0) -> (d0 floordiv 32, d0 mod 32)>,
                        vector<4xf32>

I would be favorable to finding the minimal case that unblocks progress for you, e.g., only supporting memrefs with default layout and only 1D vectors with even a single element read out-of-bounds leading to UB (assuming out-of-bounds scalar read is also UB), and extending it to more cases later if necessary.

I would also caution against defining the semantics of memref in terms of underlying LLVM instructions. LLVM IR is not the only target (SPIR-V is my go-to example, but there are others) we have.

Yes, this is what vector.transfer lowering to LLVM does, and it should be enough to break the weird layering of affine.vector_load lowering to vector.transfer_read. All the rest is undefined / does not lower for now.