[RFC] MemRefType affine maps - list vs single item

This topic arose during the discussion of D106877 ([mlir] Allow composed affine maps in memref.subview).

The MemRefType API allows a list of affine maps. But various places assume that a MemRefType contains a single affine map (or none, in the default case). Sometimes this assumption is not even checked: the first/last item in the list is simply used without any check or assertion. This makes the code base fragile.

The original patch proposed adding a getComposedAffineMap method to MemRefType, which composes the list of affine maps into a single map. The method can be used in places where a single affine map is expected, preserving the current MemRefType behavior.
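A minimal sketch of what such a helper could look like, using MLIR's C++ API (the name comes from the patch; the composition order is an assumption that would need to match the documented list semantics):

// Hypothetical helper: fold MemRefType's layout-map list into a single map.
// Assumes the maps apply in list order, each consuming the previous result.
AffineMap getComposedAffineMap(MemRefType type) {
  ArrayRef<AffineMap> maps = type.getAffineMaps();
  if (maps.empty()) // default case: identity layout
    return AffineMap::getMultiDimIdentityMap(type.getRank(), type.getContext());
  AffineMap composed = maps.front();
  for (AffineMap map : maps.drop_front())
    composed = map.compose(composed); // apply `composed` first, then `map`
  return composed;
}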

Another proposal was to remove the list and store only a single affine map in MemRefType, since the current code base never actually uses the list capability.

Personally, I'm not very happy with the second proposal, since our internal project uses the list capability, and the proposed patch just fixes some places upstream to handle such MemRefTypes.

Thus, I'd like to hear the community's and architects' opinions on this question.

Commenting here based on experience in IREE. I think having a list of affine maps that composes to arbitrary striding is very hard to support and reason about in a compilation stack. IMO, allowing the composed map to describe accesses where the least significant dimension is not the fastest varying just adds complexity to analysis passes that are trying to optimize the data accesses in the generated code: you not only need to look at the shape, you also need to look at the strides. It also turns some trivial ops that should be no-ops into non-no-ops. For example, the following reshape is a no-op (effectively it just changes metadata):

%1 = reshape %0 : memref<?x?x?xf32> to memref<?x?xf32>

whereas this is not a no-op:

%1 = reshape %0 :
    memref<?x?x?xf32, affine_map<(d0, d1, d2)[s0, s1, s2, s3] -> (s0 + d0*s1 + d1*s2 + d2*s3)>> to
    memref<?x?xf32, affine_map<(d0, d1)[s0, s1, s2] -> (s0 + d0*s1 + d1*s2)>>

It depends dynamically on the values of the strides. IMO this conflates data layout transformations with metadata-only changes that just change the shape (without having to move the underlying data). A more canonical representation is to have the data movement ops represented explicitly.

The only use case of affine maps in MemRefType that IREE relies on is accessing an N-D slice of a memref. There you only need the offset and extent of each dimension, and there is a deterministic ordering between the shape dimensions and the order in which the data is laid out, i.e. the innermost dimension is also the fastest varying (and so on).
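Concretely, for that restricted form the address computation is just a row-major sum over offset-shifted indices; a minimal 2-D sketch (names are illustrative):

// Linear index of element (d0, d1) of a 2-D slice taken at offsets
// (off0, off1) from a parent buffer whose innermost size is parentDim1.
// The innermost index d1 is the fastest varying, matching the shape order.
int64_t sliceIndex(int64_t d0, int64_t d1, int64_t off0, int64_t off1,
                   int64_t parentDim1) {
  return (off0 + d0) * parentDim1 + (off1 + d1);
}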

A side effect of having an affine composition map is that an operation does not have all the information needed to be lowered to, say, the LLVM/SPIR-V dialects, since the values of the strides (s1, s2, etc.) are available where the value is created and not at the op use site. So either

  • you need to carry this information during lowering using an additional data structure like the Memref descriptor, as is done in the conversion to LLVM dialect, or
  • you need to walk the use-def chains to get the values of the strides.

For the LLVM backend, using a memref descriptor might be OK for most cases since the LLVM (proper) passes like SROA are able to break up the memref descriptor before actual code-generation. For backends like SPIR-V it is unreasonable to expect all driver compilers to be able to handle such complex transformations (most driver compilers are JIT compilers, and they need to be cognizant of compilation time).
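For reference, the descriptor carried by the conversion to the LLVM dialect is essentially the following struct (this mirrors the documented strided memref descriptor, written here as a C++ template for a rank-N memref):

// Strided memref descriptor: the stride/offset values live here, at runtime,
// rather than in the type.
template <typename T, int N>
struct MemRefDescriptor {
  T *allocated;       // pointer to the allocated block
  T *aligned;         // aligned pointer used for element accesses
  int64_t offset;     // linear offset into the aligned buffer
  int64_t sizes[N];   // size of each dimension
  int64_t strides[N]; // stride of each dimension, in elements
};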

The counter-argument to the above is that if you don't want to support arbitrary striding, the specific compilation stack errors out. That's fair, and that's what is done in IREE. The question for me is: given that this feature exists in MLIR and is not used in core, is it paying for itself? Some simplifying assumptions could reduce the complexity of, say, the LLVM lowering if such a feature did not need to be supported (and my view here is that this is not a feature but a problematic representation).

I'm not sure what this op actually means: without more info I can't tell which dimension is collapsed here.
Even if you provide more info about how to reshape the domain, where is the no-op guarantee coming from here?

To drive the tradeoff, I think we need more input on the use cases.
Linalg was built with other solutions (separate ops in general) instead of adding index-space transformations to the type itself, and it somehow makes sense to me intuitively: I'd rather compose ops than increase the type system's complexity; in my experience that leads to a more scalable design.
That said, you mention that your out-of-tree project relies on these lists; can you provide more details on what it would look like for you to move to ops to model what you're using these lists for right now?

(also ping @bondhugula on this topic)

I think you are mixing up two things here: the issue of "single affine map vs list of affine maps" and "type system vs explicit ops". On the first one, one can always compose multiple affine maps into a single affine map, so a single affine map should be sufficient as a canonical form. I might be forgetting something here, but the decision to use multiple affine maps was motivated by the thinking that it was beneficial to keep these maps separate for "some time" in the pipeline, if and as needed, since composition would perhaps lead to a loss of information. @vinograd47 for your use case, wouldn't it work if you just always kept a single composed affine map? The code base already has the necessary utilities to do that. I think this is the key part to understand, since upstream really doesn't have a use case yet for a list of affine maps.

Regarding having logical reshapes and data layout transformations reflected in the type system vs. having separate ops: this is a typical trade-off. Having it in the type system helps you look things up locally/readily instead of having to track provenance, and it allows information to cross function boundaries (in this case in higher/mid-level dialects) when memrefs are passed around, without having to track their origin. As affine maps were already available in the type system, I think ops carrying side data weren't needed in the first place (you still need explicit ops to change the type for an index-space transformation). I'm not sure layout maps and layout-map composition were even explored before meta-data was added to ops representing such index-space transformations; I feel the maps were overlooked. For example, I pointed out a scenario where additional attributes were unnecessarily being stored in the op for information that the type can already carry:

This only leads to redundancy, inconsistency, and the need to look up the op definition for information that is available locally in the Type of a Value.

You can always recover the strides/offset stuff if your index space transformations are restricted to the forms of interest that these new ops model in the first place.

I don't think the representation is conflating anything here; I think you are conflating the two! :slight_smile: If the semantics of an op are to just create a "view" or an alias, there is no data movement or data layout transformation happening at all, regardless of which affine map is chosen for the result memref (you can change the strides, offset, or sizes, for example). However, if a transformation changes the Type of a memref being used (by changing its layout map), you have effectively picked a new data layout for that memref. The index subscripts for any memref dereferencing remain unaffected, but its layout in memory has changed. So you don't need a new op to pick a new layout for a memref.

We decompose the layout into a virtual index permutation and a strided form:

Instead of a single map

#map = affine_map<(n, c, h, w) -> (n * 64 + c + h * 16 + w * 2)>

we store a composition of two maps:

#map1 = affine_map<(n, c, h, w) -> (n, h, w, c)>
#map2 = affine_map<(n, h, w, c) -> (n * 64 + h * 16 + w * 2 + c)>

This might lead to a blow-up of the opset size:

linalg.conv_2d_nchw
linalg.conv_2d_input_nhwc_filter_ohwi
linalg.conv_2d_input_nchw_filter_hwcf
linalg.conv_2d_input_nhwc_filter_hwcf
// layout is determined from the operation type

// instead of
#NHWC = affine_map<...>
#OHWI = affine_map<...>
some.conv_2d input(%0 : memref<..., #NHWC>, %1 : memref<..., #OHWI>)
// layout is determined from the operand types

Also, the type system allows implementing extra validation of the IR:

%0 = linalg.conv_2d_nhwc
%1 = linalg.pool_nchw(%0) // is this a bug, or is it an intended implicit meta-data cast from NHWC to NCHW?

// vs

%0 = some.conv_2d : memref<#NHWC>
%1 = some.pool inputs(%0 : memref<#NCHW>) // ERROR: type mismatch for %0 operand

Using separate ops for meta-data description (like memref.transpose) makes such checks non local, as @bondhugula mentioned.


There seems to be a misconception here: op semantics is orthogonal to and composes with layout changes.
Let’s simplify the discussion and just think of some theoretical 1d batched conv. on tensors: some named conv1d_nh_filter_h which implements the following comprehension:

O(n, h) += I(n, h + kh) * F(kh)

In this case, if I : tensor<8x16xT> and F : tensor<5xT>, then verification requires O : tensor<8x12xT>. You can choose to logically swap the indexing expressions in any of the 4 possible combinations and give them different names; you will get 4 different ops.
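Written out as plain loops for the example sizes (taking T as float for concreteness), the comprehension is just:

// O(n, h) += I(n, h + kh) * F(kh); with I: 8x16 and F: 5, the valid output
// extent is h in [0, 12), hence O: 8x12.
void conv1d_nh_filter_h(const float I[8][16], const float F[5], float O[8][12]) {
  for (int n = 0; n < 8; ++n)
    for (int h = 0; h < 12; ++h)
      for (int kh = 0; kh < 5; ++kh)
        O[n][h] += I[n][h + kh] * F[kh];
}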

And yet we have not talked about layout in memory or data type representation: tensors are just bags of values (at least for now). When writing in buffer form with layout, this is perfectly valid:

conv1d_nh_filter_h ins(%A: memref<8x16xT, affine_map<(d0, d1)[] -> (10^100 * d0 + 54312 * d1 + 42)>>, ...)

I am not sure what this wants to convey: if you return SSA values, you are in tensor land and there is no layout (at least for now). In any case, linalg ops do not prescribe layout, they compose with it.

Every check is local, the union of all your checks is non local (and you are missing a transpose here).
memref.transpose is not a “meta-data description”, it is an op that produces a new memref with a different layout on the same backing buffer.

I’m strongly +1 on this. (Forgot to comment on this earlier.) This was also the original goal and this preserves the power while not penalizing common cases. getComposedLayoutMap should be used by API users who just want the “sum total” layout map.

Yeah, I should have just used linalg.collapse_shape like this:

%1 = linalg.collapse_shape %0 [[0, 1], [2]] : memref<?x?x?xf32> into memref<?x?xf32>

This is dynamically guaranteed to be a no-op, and to achieve that it disallows (or at least should disallow) affine maps on the memref type.

I don't follow. A reshape changes the type of the memref. It can change the type without changing the layout (i.e. physically the values remain in the same locations before and after the reshape). If I am conflating the two things, it is because the type conflates two things. IMHO, using strides to decouple the physical and "logical" layouts of a buffer leads to non-trivial coupling between the two, especially in the dynamic case. If the layout is described using dynamic values, all bets are off in terms of analyzing (and hence potentially optimizing) memory access patterns. At the very least it will need some level of cross-op analysis instead of just having that information available in the type.

Great, and to restore balance, I will add a strong -1 to this: I have not yet seen any legitimate use case of lists of affine maps as of today.

Quite the contrary, I have seen it result in confusion (as I explain above) as well as incorrect code (as pointed out in D106877 ([mlir] Allow composed affine maps in memref.subview)) and normalization code that may or may not be correct.

In the latter, I can’t tell offhand whether there is an iterative procedure that strips the front layout map here or whether this is just dropping the rest. Whichever the answer, I don’t see anything testing compositions of layout maps in the revision that introduced the normalization functionality. As a reminder, the only layouts that lower to LLVM today without normalization are strided layouts in the form of a single affine map.

I have also seen this premature complexity in the context of multi-result affine.apply ops and it was quite a pain to clean up.

Since (a) entropy only increases, (b) I have yet to see any meaningful usage of compositions of affine map layouts that is better than SSA use-def chains, (c) it results in confusion, incorrect code and defensive asserts/failures in 100% of the core cases, I need some solid evidence to buy this abstraction.

@mehdi_amini Let me add more here to make the discussion more concrete while connecting to D106877 from the original post where this originated:

While I was pointing to reinterpret_cast, I now realize that memref.subview itself is another op (and perhaps the one that introduced the pattern) storing data that's already carried, or could be carried, in the type (even with a single affine map): the three trailing attributes below (from the op def of memref.subview) don't need to be attributes. If they are stored in the type (as part of the layout map), you just need three accessors on the op, and you even get an accurate dump/print of the memref IR Type when you use dump() (including the strided form, if desired, for all uses of the Value)!

let arguments = (ins
    AnyMemRef:$source,
    Variadic<Index>:$offsets,
    Variadic<Index>:$sizes,
    Variadic<Index>:$strides,
    I64ArrayAttr:$static_offsets,
    I64ArrayAttr:$static_sizes,
    I64ArrayAttr:$static_strides
  );

Storing them in the op would just mean:

  1. redundancy,
  2. potential inconsistency risk (between the type and the op metadata),
  3. extra memory footprint throughout the lifetime of the op (FWIW),
  4. more importantly, the need to look at the op definition or track the origin of the op to recover this information, which would be complex across functions, or even intra-function with "isolated-from-above" ops or ops that take explicit arguments for their regions like gpu.launch, and
  5. print/dump of ops not showing full information on the memref uses locally.

I think both reinterpret_cast and subview should be rid of these attributes, and this information should be stored in the type, which btw doesn't need any extensions to do that. Only the truly dynamic information would then be stored as operands, just like for the shape symbols.
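A sketch of what those accessors could boil down to, using the existing getStridesAndOffset utility (the wrapper name is hypothetical, and it succeeds only for layouts already in strided form):

// Hypothetical accessor replacing subview's static_offsets/static_sizes/
// static_strides attributes: read everything back from the result type.
LogicalResult getStaticLayout(MemRefType resultType,
                              SmallVectorImpl<int64_t> &sizes,
                              SmallVectorImpl<int64_t> &strides,
                              int64_t &offset) {
  // Strides and offset come from the strided-form layout map.
  if (failed(getStridesAndOffset(resultType, strides, offset)))
    return failure(); // layout map is not in strided form
  // Sizes are simply the shape; dynamic dims remain sentinel values.
  sizes.assign(resultType.getShape().begin(), resultType.getShape().end());
  return success();
}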

We're off topic, but I think you mean memref.collapse_shape. I just read the doc for this op because I was a bit puzzled by your claim of a "guaranteed no-op": actually that isn't exactly how it is specified, even though the doc mentions that the current lowering can fail when it isn't a no-op (the input memref can be the result of a view, for example, and so have arbitrary strides, which makes no dimension contiguous and requires an alloc/copy). I don't think we guarantee that memref<?x?x?xf32> means the data is contiguous across dims right now, do we?

It's only redundant if it is stored in the type as well, and I think that may be where the crux of the disconnect lies.

I believe we've been using func @f(%in : memref<?x?x?xf32>) to imply that the absence of a layout means an "unknown" layout, so that we can call such functions with a subview. It all works as long as the runtime representation is limited to strided layouts right now.
But it means that in the IR the layout does not need to be explicit in the type, since it can be erased like that: so you can just as well consider that the operation stores the information and it stays erased from the type.

This is a nit, but Attributes are owned by the context; they are zero cost for the op.

This gets into my earlier point about the expectations around something like memref<?x?x?xf32>, I think.

@nicolasvasilache @bondhugula @mehdi_amini

Sorry for interrupting your discussion, but the original question was about MemRefType, not about upstream dialects.

Per the spec, MemRefType allows storing a list of affine maps as a decomposed form of some complex affine map. Please correct me if I misinterpret the purpose of the list.

But in many places the upstream code ignores the fact that the list can hold more than one element and just takes the first/last element without any checks. This introduces issues for third-party projects that try to utilize the list.

My original proposal was to introduce a getComposedAffineMap method that composes the list of affine maps into a single one, and to use this method instead of getAffineMaps().front()/getAffineMaps().back(). This shouldn't affect the upstream code, since it uses a single affine map, while correctly handling the cases with a list of affine maps.

The other proposal was to replace the list with a single affine map to avoid such issues and confusion.

We're also veering deeper off-topic :slight_smile:

Still, let me unpack a bit here: memref<?x?x?xf32> is interpreted as the "canonical contiguous layout".
I.e., assuming your memref is called %A and your ?s are "named" as in memref<mxnxpxf32>, then you have:

  • &(%A[i][j][k + 1]) == &(%A[i][j][k]) + 1
  • &(%A[i][j + 1][k]) == &(%A[i][j][k]) + p
  • &(%A[i + 1][j][k]) == &(%A[i][j][k]) + n * p

This interpretation is the only one that can encode this contiguity because there is no good way to identify sizes and strides in the type only; we’d need something that resembles dependent types:

memref<mxnxpxf32, offset : 0 strides : [n * p, p, 1]>

and whatever the equivalent linearized affine map would be. This would be particularly useful for reshape ops and determining unambiguously when a copy is needed or not in the dynamic case (today it can only be done in the static case).

This “empty” layout is also interpreted as equivalent to the identity layout:

memref<?x?x?xf32, affine_map<(i, j, k) -> (i, j, k)>>

which can be elided to “empty”.

The empty + identity AffineMap interpretation is consistent across “affine dialect” uses and “all other dialect” uses. It also happens to be the only thing we have for contiguous views.

I'll stop here to avoid feeding the off-topic instincts; please start another thread if you want deeper discussion: there are deeper tradeoffs here, and a fine line between hiding/exposing details to a particular level of codegen, and between multidimensional indexing and degrading to flattened 1-D codegen.

Agreed, thanks for steering us back on the right path :slight_smile:

For the particular case of the subview op, the semantics supported is that of strided memref. The SubViewOp doc currently says:

The “subview” operation converts a memref type to another memref type which represents a reduced-size view of the original memref as specified by the operation’s offsets, sizes and strides arguments.

We should improve the docs to avoid such confusions in the future.
A verification failure in the case you are contemplating is what makes the most sense to me.

Back to the original question, the underlying problem I see is that 100% of the code I know of either asserts, fails, or ignores anything beyond the front affine map. Since (a) this does not seem to be supported anywhere, (b) it is a source of confusion to people who have not become accustomed to just ignoring it, and (c) there is no strong evidence for it, I'd love to take a rare chance at reducing the entropy of the system rather than continuing to increase it in that direction.

Now I tend to agree with you. Actually, I've just learned from affine-map-composition that I had misinterpreted the order of items in the affine map list :slight_smile:

Using a single affine map should make the code simpler and less confusing.

Small off-topic question: why is the strided syntactic sugar supported in the ASM parser but not in the printer? Would it make sense to print strided MemRefTypes as memref<42x16xf32, offset: 33, strides: [1, 64]> instead of the generic affine map form?

Because of the back and forth and the pushback in https://groups.google.com/a/tensorflow.org/g/mlir/c/MaL8m2nXuio/m/a_v07o9yBwAJ.

But @ftynse is writing about a nicer way, I’ll let him finish :slight_smile:

tl;dr: let’s make the layout an arbitrary Attribute, with an optional interface to convert to AffineMap.

I would advocate for going in the opposite direction and making the layout more extensible than it currently is. The current memref design dates back to the very early days and, as far as I know, is elaborated more on paper than it is in the code. It was a clever design that bridged the ML (affine) and CFG flavors of MLIR, back when they existed, and it was also intended to address some of the known scalability issues of affine modeling. However, affine modeling does not play as important a role in MLIR as originally anticipated. MLIR has long outgrown its initial vision of a polyhedral-first compiler infrastructure.

Memref design predates a lot of modern MLIR extensibility. We’ve relaxed CFG to support regions. We’ve relaxed SSA to support use-before-def to express graphs. We’ve added support for custom attributes and types, and added polymorphism through interfaces. But we have not revised memref.

I would propose to make the memref layout an arbitrary Attribute, similarly to the memory space. This attribute can be an AffineMapAttr, an array thereof if a composition is desired, a DenseIntElementsAttr or a custom class for the strided form, an “NCHW” string, etc. Out-of-tree projects can use custom attributes if necessary. Individual analyses or passes can rely on the usual casting to select the kinds of layouts they support. Op verifiers can reject layouts they don’t know about. Regular dialect/type conversion can be used to convert between memrefs with different layouts.

We also have sufficient mechanisms to "view" some layout kinds as other kinds. For example, we can have an AffineLikeLayoutAttrInterface that views a layout attribute as a single affine map when possible. Passes that need specifically the affine form can be expressed in terms of this interface. A hypothetical StridedLayoutAttr can implement it.
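To make this concrete, a pass-side dispatch under such a design might look like the following sketch; getLayout(), AffineLikeLayoutAttrInterface, and StridedLayoutAttr are the proposed names from this post, not existing APIs:

// Illustrative only: every name below follows the proposal above.
void processMemRef(MemRefType type) {
  Attribute layout = type.getLayout(); // proposed accessor, akin to memory space
  if (auto affineLike = layout.dyn_cast<AffineLikeLayoutAttrInterface>()) {
    // Passes that need the affine form go through the interface, whether the
    // attribute is an AffineMapAttr, an array of maps, or a StridedLayoutAttr.
    AffineMap map = affineLike.getAffineMap();
    // ... affine-based analysis ...
  } else {
    // Unknown layout kind: analyses stay conservative; verifiers may reject.
  }
}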

This will likely help with several pain points at the very core of the MLIR ecosystem: the weird duality between AffineMap and AffineMapAttr, the fact that AffineMapAttr is built-in rather than part of Affine, the complexity of inferring back the strided form of a memref given affine maps, and the question of which subsets of affine map layouts can be lowered to which other dialects (LLVM and SPIR-V differ in this). It might also help us better connect to tensors with layouts.

Certainly, there was an argument that most, potentially all, reasonable layouts can be represented as an affine map. This, in my opinion, does not preclude us from having higher-level or more restricted forms. Similarly, all MLIR SCF constructs can be represented as a block-based CFG with something like llvm::LoopInfo slapped on top, but it is significantly easier to reason about and manipulate first-class loop operations than it is to do the same for a CFG with extra metadata. This is one of the main reasons for MLIR's success; let's not discard it!


Yes, but the memref type itself has sizes, strides, and an offset, reflected by the shape and the layout map. If the defining op also carries those as attributes (in the case of static shapes), you'll have to reconcile the two or imply that the latter overrides (overwrites) the former.

The pointers here do consume storage. That's not zero cost when compared to the allocation size of the Operation. In fact, the Attributes being uniqued and owned by the context is of little importance here (these are all int64 attributes, and sharing across ops is really irrelevant since a pointer is typically the same size as an i64). However, the additional size on the op due to the three I64ArrayAttrs isn't zero or anywhere close to it (the alloc op otherwise has zero operands for statically shaped memrefs). More importantly, it's unnecessary, since the information can already be stored in the type for all statically shaped memrefs.

I didn't understand which scheme you are referring to. As you know, you have a dim op to get hold of the SSA values bound to the ?s, and there is an extract_symbol op (or one is perhaps missing) to get the remaining SSA symbols corresponding to the layout map's symbolic identifiers: these would be the strides and offset if the affine map is in stride + offset form, or just arbitrary symbols otherwise. (I know the LLVM lowering doesn't support the arbitrary case.) In fact, you don't need to maintain an order among the symbolic operands that trail the size operands corresponding to the ? marks in the shape. You can pull all of those symbols from the memref descriptor without looking at the op def if you carry the information in the type.

I've just realized one more issue (at least from my point of view) with the current representation of strided memrefs (even with a single affine map). The strided form stores both strides and an offset. While the strides can be considered part of the type definition, the offset looks like a "runtime"-only part.

Let's take a simple 1-D example:

%0 = memref.alloc() : memref<10xf32>
%1 = memref.subview %0[5] [5] [1] : memref<10xf32> to memref<5xf32, affine_map<(d0) -> (d0 + 5)>>
%2 = memref.alloc() : memref<5xf32>
memref.copy %1, %2 : memref<5xf32, affine_map<(d0) -> (d0 + 5)>> to memref<5xf32>

If we imagine a simple C++ implementation of this IR, a 1-D memref can be represented as an ArrayRef<float>. From that point of view, the %1 and %2 values have the same type (ArrayRef<float>), and memref.subview can be represented as ArrayRef<float>::slice. But from the MLIR point of view, they have different types due to the offset part of %1's type.
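For illustration, the analogy spelled out with llvm::ArrayRef (a minimal sketch):

#include "llvm/ADT/ArrayRef.h"

void sliceExample() {
  float storage[10] = {};                          // %0 = memref.alloc() : memref<10xf32>
  llvm::ArrayRef<float> buffer(storage);
  llvm::ArrayRef<float> view = buffer.slice(5, 5); // %1 = memref.subview %0[5] [5] [1]
  // `view` has the same C++ type as a freshly allocated 5-element buffer (%2):
  // the offset 5 was folded into the data pointer, not into the type.
}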

Why do we need to attach the offset to the type itself? Why not just store it in the producing operation? When memref.subview is lowered to C/LLVM, the operation can be replaced by creating a new pointer using address arithmetic.