Changing operation attribute storage

Various patches are going in to improve IR efficiency, which is super exciting. I just saw this one which is another nice step forward. At this point, my favorite workload is bottlenecked on three things (listed in order of “hotness”):

  1. Non-empty StringAttr::get - there are some minor optimizations that are probably possible here, but this really needs a concurrent hash table to make major gains. These are for string attributes on operations, so they inherently need to be StringAttrs.

  2. Non-empty DictionaryAttr::get - This is predominantly for the attribute dictionary at the top level of an operation. For example, the std.constant op has a “value” attribute that stores the value of the constant. This is represented with an extra level of DictionaryAttr that holds a single entry {"value"=<whatever>}, and that DictionaryAttr needs to be allocated and uniqued.

  3. OperandRange is very hot as are getOperand calls because of how Operation is laid out. There has been some discussion about rearranging this so that Use records are at a constant offset from “Operation::this” and successors/regions are at variable offsets.

I’d like to talk about the second one here: it is really wasteful in both time and memory to create and store the extra level of DictionaryAttr, and it is also inefficient to query this thing “by string” when these attributes have a fixed arity. Has anyone thought about ways of flattening this into an Operation and making it indexed by integer instead of looked up by string? We could keep the DictionaryAttr around for non-defined attributes like dialect attributes.
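
To make the contrast concrete, here is a minimal standalone sketch (simplified stand-in types rather than the real MLIR classes; kValueIndex is a hypothetical ODS-assigned constant):

```cpp
#include <array>
#include <string>
#include <utility>
#include <vector>

// Stand-ins: the real mlir::Attribute is a uniqued pointer.
using Attribute = const void *;

// Today (conceptually): every op carries a uniqued dictionary, even for a
// single entry like {"value" = <whatever>}.
using DictionaryAttr = std::vector<std::pair<std::string, Attribute>>;

Attribute lookupByString(const DictionaryAttr &dict, const std::string &name) {
  for (const auto &entry : dict) // a string comparison per entry
    if (entry.first == name)
      return entry.second;
  return nullptr;
}

// Flattened: defined attributes live at fixed integer slots on the op, so
// the generated accessor is a single indexed load.
struct FlattenedConstantOp {
  static constexpr unsigned kValueIndex = 0; // hypothetical ODS-assigned slot
  std::array<Attribute, 1> definedAttrs;     // inline, nothing to unique
  Attribute getValueAttr() const { return definedAttrs[kValueIndex]; }
};
```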

-Chris

I like this optimization! We should look into this indeed 🙂

Some challenges I foresee will be around Operation::getAttrs() and Operation::getAttrDictionary(): these offer uniform access to the attributes, and to preserve them as they are we’d need to make them “expensive” (potentially materializing a dictionary on the fly).
These APIs can be useful: I’ve been relying on them for cheaply propagating attributes from one operation to another when doing transformations.

I agree we need something that provides uniform access to the declared and non-declared attributes, but I don’t think we need DictionaryAttr to do it. We can provide “a new thing” that handles the out-of-line case. It doesn’t need to be part of the Attribute hierarchy.

In fact, with that, the “out of line” attributes wouldn’t need to be a DictionaryAttr either.

I think there are two main open questions here:

  • How does the generic API work?
    Not just getAttrs/getAttrDictionary, but also getAttr.

  • How do we handle optional attributes?
    Numeric indexing gets complicated in the presence of optional attributes; how do we handle that here?

I think one step could be just inlining a NamedAttrList in Operation instead of uniquing, and then seeing how much of a benefit we get before separating the attribute types. After that we could evaluate how much additional benefit we get from splitting (vs. the cost).

– River

Mehdi and I discussed this topic briefly, I still think it would be a great thing to do, for both performance and memory usage reasons. Here’s a braindump of some thoughts in case someone is interested in picking this up:

a) this should be a big compile time win: the YourOp::getFooAttr() accessors will be much faster, because you can use an integer index into an array stored on the Operation rather than doing Identifier matches on a thing in an attr dictionary. This saves a lot of indirections. It should also be a memory win, because we’re not creating/uniquing/immortalizing the attr dicts needlessly: for example, DimOp stores an integer index attribute, but we currently also have an AttrDict with one element for each of these.

b) Given that AbstractOperation already has attrNames with a list of known attributes, I think the approach here is to allocate attrNames.size() Attribute values inline in the operation, using the existing TrailingObjects structure. Optional attributes would be null Attributes.

c) Given this, we’d need an OperationAttrSet Operation::getAttrs() method, replacing the one that returns ArrayRef<NamedAttribute>. OperationAttrSet would just wrap an Operation* and provide iterators that skip over missing optional attributes and synthesize the NamedAttributes by combining the inline attributes with their names in the AbstractOperation.
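
A rough standalone sketch of that shape (simplified stand-ins, not the real Operation/AbstractOperation):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

struct AbstractOperation {
  std::vector<std::string> attrNames; // declared attribute names, in order
};

struct Operation {
  const AbstractOperation *abstractOp;
  std::vector<Attribute> inlineAttrs; // one slot per declared attribute
};

// Wraps an Operation* and iterates the present declared attributes,
// synthesizing NamedAttributes from the inline values plus the names
// stored once in the AbstractOperation.
class OperationAttrSet {
  Operation *op;

public:
  explicit OperationAttrSet(Operation *op) : op(op) {}

  class iterator {
    Operation *op;
    size_t index;
    // Absent optional attributes are stored as null slots; skip them.
    void skipAbsent() {
      while (index < op->inlineAttrs.size() && !op->inlineAttrs[index])
        ++index;
    }

  public:
    iterator(Operation *op, size_t index) : op(op), index(index) {
      skipAbsent();
    }
    NamedAttribute operator*() const {
      return {op->abstractOp->attrNames[index], op->inlineAttrs[index]};
    }
    iterator &operator++() {
      ++index;
      skipAbsent();
      return *this;
    }
    bool operator!=(const iterator &rhs) const { return index != rhs.index; }
  };

  iterator begin() { return iterator(op, 0); }
  iterator end() { return iterator(op, op->inlineAttrs.size()); }
};
```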

d) OperationAttrSet can have an explicit “toVector()” method that returns a SmallVector<NamedAttribute> for cases that really, really need an ArrayRef<NamedAttribute> explicitly. This will help phase the new design in, and we can wean the compiler off it over time.

e) We also need the ability to store “random extra attributes” on the operation. This can be done by either maintaining the existing attrs DictionaryAttr on the operation (but renaming it to “extraAttrs” or something) or changing that too to a bespoke data structure. The former approach is easier (so I’d start by staging that in) but the latter is more efficient in the normal case, because these don’t need to be uniqued.

f) The Operation::getAttrDictionary method needs to stick around for a while due to source compatibility and to allow phasing things in, but it should lazily create the dict on demand. This will make it less efficient than it currently is, and we want to move the codebase off it and onto OperationAttrSet directly. It isn’t widely used right now - only 28 uses in the MLIR tree. Some of them can probably be moved to using getAttrs() ahead of the bigger push.

Anyway, I think such a change would be a great step forward for the compiler.

-Chris

So I’ve messed around with this and here are my findings so far:

Uses of getAttrDictionary() are quite rare, and when they do arise, they are most often trivially refactored to use some iterator range (within and outside the MLIR codebase). They are also very important to remove, because materializing an op’s attributes as a dictionary would wipe out any memory gains from getting rid of it in the first place.

The most egregious example was in IR/SymbolTable.cpp, in the function that finds and replaces all nested symbol uses. It was a little tricky to refactor, and it’s the most significant one: a DictionaryAttr might end up getting materialized for every op in the IR.

I think the easiest start would be to ditch DictionaryAttr and just stick some list structure inside Operation. We get a quick performance boost and generally don’t use too much more memory (except in degenerate cases: e.g. 1000 ops with the same attributes). In some cases, we use less.

The challenge with splitting the attributes is: how do you determine whether an identifier belongs to an intrinsic attribute or an extra attribute? Inline attributes can be stored without their identifier (Attribute instead of NamedAttribute) to save memory, so will the lookup have to traverse the intrinsic attribute names stored inside AbstractOperation at the same time? So much for avoiding indirection. Do we deprecate getAttr(StringRef/Identifier) and replace it with fine-grained methods like getIntrinsicAttr? Do we try to store the inline attributes in some sorted order?

Do we sort by identifier (pointer value) or string value? String lookups are slower but they occur much less commonly than identifier lookups (at least maybe until most of the identifier lookups are replaced with indices).

Yeah I agree with you. I think that refactoring SymbolTable would be good for other reasons as well. I hope it could be split out as an incremental precursor patch.

Right, the generic getAttr(StringRef) method is going to be more complicated.

I haven’t dug into the tradeoffs here, but one approach would be to add a function pointer to AbstractOperation, something like: ssize_t (*getAttributeIndex)(StringRef) which returns a non-negative index for inline attributes, or -1 for unknown/out of line ones. This function could be synthesized by ODS to use a simple comparison in the case where there are a small number of attributes (e.g. ConstantOp) and could go all the way up to perfect hashing + StringRef compare (with the corresponding entry in AbstractOperation::attributeNames) for a theoretical op with thousands of attributes. StringRef compares are reasonably fast because they carry the length which helps filter out many of the strcmps in the negative comparison case.
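
To make that concrete, here is roughly the kind of function ODS could synthesize for a hypothetical op with two declared attributes (a standalone sketch; ptrdiff_t and std::string_view stand in for ssize_t and StringRef):

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical ODS-synthesized lookup for an op with two declared
// attributes, "value" (index 0) and "predicate" (index 1). Returns -1 for
// unknown / out-of-line names. The length check built into string_view
// (like StringRef) comparison filters most negatives before any strcmp.
ptrdiff_t getAttributeIndex(std::string_view name) {
  if (name == "value")
    return 0;
  if (name == "predicate")
    return 1;
  return -1;
}
```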

Completely unrelatedly to this topic, I really wish we didn’t use Identifier for keys of DictionaryAttr or in general in MLIR. It is redundant with StringAttr, various things need to be converted back and forth between the two of them, and doing so causes the same string data to be uniqued twice. Introduction of Identifier predated building out the attribute system.

-Chris

So I have concrete results now: inlining intrinsic attributes in the Operation makes ODS attribute getters/setters really fast. However, everything else gets slower. StringRef and Identifier lookups are slower because they have to check two lists, and iteration over ranges of attributes is slower because the range is now an AttributeRange, which hides the complexity of dealing with two separate lists (one of which may contain null values) instead of an ArrayRef<NamedAttribute>.

@clattner I’m not sure what your particular case is. If 99% of your attribute lookups are through ODS getters, then this change is a sure win (assuming the performance increase is worth the implementation complexity). However, the cases I have been using as benchmarks are very heavy on generic attribute lookups (lots of unregistered ops, etc.; lookupSymbolIn is quite hot). I see a performance decrease of 20-30%.

Not that it was a waste of time: I found a handful of things along the way that will improve MLIR’s performance (as well as bugs…).

@River707 had a “compromise” idea, where an ODS attribute that knows n non-optional/default-valued attributes sort before it and m after it can search just the subrange attrs[n : #attrs - m).
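
Roughly like this, as a standalone sketch (simplified types; in a real generated getter, n and m would be baked-in constants and “foo” is a placeholder attribute name):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

// ODS knows that n non-optional attributes sort before "foo" and m sort
// after it, so "foo" can only live in attrs[n : attrs.size() - m).
Attribute getFooAttr(const std::vector<NamedAttribute> &sortedAttrs,
                     size_t n, size_t m) {
  auto first = sortedAttrs.begin() + n;
  auto last = sortedAttrs.end() - m;
  auto it = std::lower_bound(first, last, std::string_view("foo"),
                             [](const NamedAttribute &a, std::string_view name) {
                               return a.first < name;
                             });
  if (it != last && it->first == "foo")
    return it->second;
  return nullptr; // absent optional attribute
}
```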

Would it possibly help to avoid fracturing-out the intrinsic attributes? I.e., either we store all attributes in the inlined/vector storage, or we store all attributes in a dictionary. Only registered ops would have the inlined storage allocated, and it would gracefully overflow (SmallVector-style) to a dictionary if you add too many additional attributes.

We would maintain the invariant that if the inlined storage is being used, then the intrinsic attributes always come first and are in the right order, allowing the getter optimization with a relatively cheap boolean check.
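
As a sketch of that invariant (standalone, simplified types; kValueIndex is a hypothetical ODS-assigned constant):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

struct Operation {
  // Either ALL attributes live inline (intrinsic ones first, in ODS order),
  // or we've overflowed and ALL of them live in the dictionary.
  bool usingInlineStorage;
  std::vector<Attribute> inlineAttrs;     // valid iff usingInlineStorage
  std::vector<NamedAttribute> dictionary; // stand-in for DictionaryAttr
};

// Hypothetical ODS-generated getter for an attribute at a known index.
Attribute getValueAttr(const Operation &op) {
  constexpr size_t kValueIndex = 0; // ODS-assigned position
  if (op.usingInlineStorage)        // the relatively cheap boolean check
    return op.inlineAttrs[kValueIndex];
  for (const auto &entry : op.dictionary) // slow path after overflow
    if (entry.first == "value")
      return entry.second;
  return nullptr;
}
```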

I kinda got tunnel-visioned and forgot this approach.

I didn’t mention this in my response (which was directed specifically at inlining intrinsic attributes), but just putting a vector of attributes inside the op directly does lead to a performance improvement (except in one benchmark, where it reduces performance, which I am currently investigating). Consequently, storing the elements of that list inline in the op’s allocation might improve performance further.

However, two problems remain.

First, if the intrinsic attributes come first, then there is still the question of how to determine whether a string lookup is looking for an intrinsic attribute. In my testing, I have found that most of the time spent in getAttr, especially for large programs, is looking for an attribute that doesn’t exist (e.g. lookupSymbolIn). Intrinsic attributes can be stored in sorted order (guaranteed by ODS) which reduces the burden to two binary lookups. When I inlined the intrinsic attributes, I stored only the Attribute value, which still resulted in an overall memory usage increase because of the number of optional/default-valued attributes just being stored as null. If the whole NamedAttribute were stored, then I would say that optional/default-valued attributes should be excluded, because not only would this double the amount of “wasted” space, but it would also complicate iteration over the attributes (which was another slowdown cause).

Second problem: how much “extra” space do we allocate inline for extrinsic attributes? Clearly enough to store optional/default-valued ones that aren’t stored with the mandatory attributes, at least. Any extra space that isn’t used is a significant amount of wasted memory. Operands, for example, are allocated exactly the amount of space needed for the number given at the time an operation is created. If the number increases, then all the operands are pushed into dynamic storage and the original space is unused. However, the number of operands does not increase as frequently as the number of attributes. Configuring the wrong amount of extra space could result in significant extra memory usage. Do we make this a user-configurable option? Per op or for all of MLIR?

I can explore inlining the whole list (no separation of intrinsic/extrinsic attributes) and see if potentially wasting memory is worth the performance gain.

Also, attributes must still be printed in sorted order, and default-valued attributes need to be elided if their value is the same as the default value (most likely, the default values will be stored in AbstractOperation). Having the intrinsic attributes provided in sorted order also reduces the need to sort them at printing time. I don’t think this is too much of a burden on whoever still writes Ops in pure C++, especially if a constexpr sort function were provided.

There is still one aspect of this that makes me a tad bit queasy, and that is the fact that the iteration order of attributes on registered vs. unregistered operations is going to be different. I could see that resulting in… interesting behavioral differences (not that it’s a deal breaker for me if the performance wins are worth it).

This is not just during attribute lookup, but also operation creation. You’ll have to compute which attributes are intrinsic vs. not, and what order they need to be in. There are likely some things we could do here to help, e.g. Identifier can know if it has a dialect prefix or not. If the identifier has a prefix, we know it would have to be an external attribute; otherwise it would have to be intrinsic. That would remove the double lookup for identifiers at least. For strings we could also check for a prefix (i.e. “do you have a ‘.’?”), but it isn’t immediately clear to me if that is faster than checking both lists in all cases.
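
Sketching the dispatch idea (standalone, simplified types; the linear scans are placeholders for whatever lookup each list actually uses):

```cpp
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

// Placeholder lookup; real storage would be indexed or sorted.
static Attribute linearFind(const std::vector<NamedAttribute> &list,
                            std::string_view name) {
  for (const auto &entry : list)
    if (entry.first == name)
      return entry.second;
  return nullptr;
}

// Dialect (external) attribute names carry a "dialect." prefix and intrinsic
// names don't, so the presence of a '.' picks which list to search and
// avoids checking both.
Attribute getAttr(const std::vector<NamedAttribute> &intrinsicAttrs,
                  const std::vector<NamedAttribute> &externalAttrs,
                  std::string_view name) {
  bool hasDialectPrefix = name.find('.') != std::string_view::npos;
  return linearFind(hasDialectPrefix ? externalAttrs : intrinsicAttrs, name);
}
```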

I don’t think we should store an empty entry for optional attributes (regardless of the approach), given that it just wastes space. Or if we do, it should be driven by data that proves the access performance increase is worth all of the wasted space.

I would always allocate (at a maximum, we could always do less if there are a huge number of attributes) just enough space for the attributes provided. Overallocating inline storage is a waste of space that won’t go away until the operation is destructed.

IMO we can order the attributes internally however we want; users shouldn’t be indexing by integer directly themselves anyway. The only important aspect is having some deterministic behavior.

Avoiding the fracture could be interesting (though we don’t really save as much memory in that case), but it still remains to be seen if the complexity is worth it for this in general (given the other issues involved). One aspect of this that I’m a big -1 on though is having different storage behavior for “inline” vs not (unless I’m misunderstanding your point, which is certainly possible!). That would create different iteration behavior depending on when the operation was created, i.e. getAttrs on a fresh operation will return a different result from a pre-existing operation?

– River

Just a quick update on my side - I care about this a lot, and plan to poke at your patch this weekend. I’m also looking at River’s Operand layout change.

This might just need to be something that we solve with perfect hashing or some other precomputed acceleration structure.

Good point. Do we have strong guarantees already about the iteration order? Maybe we could have a helper “getSortedAttributes()” or something that always materializes a DictionaryAttr. It seems excessive to always guarantee an iteration order for all types of iteration. I put this in the same category as code that iterates over blocks in a region expecting them to be in a topological order or something.

IME perfect hashing isn’t going to entirely fix this, given that a majority of the time (in these situations) is spent computing the hash itself. When interfaces used a hash map (they don’t anymore), a majority of the lookup time was spent hashing the pointer TypeID, not even in the lookup itself.

The iteration order right now is guaranteed. We don’t claim that the attribute dictionary of an operation is like a DenseMap, or some other un-ordered container. There is a very established sorted order.

– River

That isn’t to say that every case needs this (many usages are just building a new operation, or looking for certain attributes), but there are several usages that do rely on a deterministic ordering. My main requirement here is that we can show a verifiable benefit that warrants all of the complexity being discussed here. If the benefit is marginal, and we end up bloating AbstractOperation/Operation/etc. and breaking established invariants, then it’s just not worth it IMO.

– River

In my implementation of separate inline/out-of-line attributes, this wasn’t too much of a problem since both lists were stored sorted:

  • if the incoming attributes aren’t sorted, sort them (they need to be stored sorted anyway)
  • walk the list and the (sorted) name list at the same time

Not exactly an O(1) check but it works out since the whole list is getting scanned at once. But it gets worse when performing normal lookups.
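
Concretely, the parallel walk looks something like this (standalone sketch, simplified types):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

// One pass over the (sorted) incoming attributes and the (sorted) intrinsic
// name list from AbstractOperation, splitting the attributes between the
// two storage lists.
void classifyAttrs(const std::vector<NamedAttribute> &sortedIncoming,
                   const std::vector<std::string> &sortedIntrinsicNames,
                   std::vector<NamedAttribute> &inlineAttrs,
                   std::vector<NamedAttribute> &outOfLineAttrs) {
  size_t n = 0; // cursor into sortedIntrinsicNames
  for (const NamedAttribute &attr : sortedIncoming) {
    // Skip intrinsic names (e.g. absent optionals) that sort before attr.
    while (n < sortedIntrinsicNames.size() &&
           sortedIntrinsicNames[n] < attr.first)
      ++n;
    if (n < sortedIntrinsicNames.size() &&
        sortedIntrinsicNames[n] == attr.first)
      inlineAttrs.push_back(attr);
    else
      outOfLineAttrs.push_back(attr);
  }
}
```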

My motivation for doing so was that default-valued attributes constructed inside ODS getters could be stored back into the op (FYI, this wasn’t a noticeable performance gain). Storing them in AbstractOperation is a solid alternative.

I’m fairly certain the only thing that depends on iteration order is printing operations’ attributes in sorted order.

I don’t have any patches up right now, but I can throw them up today. They start to get messy FYI :')

-1 on materializing a DictionaryAttr, unless this happened very rarely. Materializing a DictionaryAttr duplicates every attribute into an immortal list. If both inline and out-of-line attributes are stored sorted (although separately), you’d just need a sorted_iterator that does a sorted walk over both lists.
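
For illustration, that sorted walk could be as simple as this (standalone sketch, simplified types):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

// Visits two individually sorted attribute lists in one globally sorted
// pass, without materializing a merged dictionary.
template <typename Callback>
void walkSorted(const std::vector<NamedAttribute> &inlineAttrs,
                const std::vector<NamedAttribute> &outOfLineAttrs,
                Callback visit) {
  size_t i = 0, j = 0;
  while (i < inlineAttrs.size() && j < outOfLineAttrs.size()) {
    if (inlineAttrs[i].first <= outOfLineAttrs[j].first)
      visit(inlineAttrs[i++]);
    else
      visit(outOfLineAttrs[j++]);
  }
  for (; i < inlineAttrs.size(); ++i)
    visit(inlineAttrs[i]);
  for (; j < outOfLineAttrs.size(); ++j)
    visit(outOfLineAttrs[j]);
}
```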

Any time your processing has a user visible effect (e.g. creating IR, emitting diagnostics, etc.) the iteration order has to be deterministic. This most certainly happens outside of the printer (e.g. during verification).

– River

Just for context, this is still really important to me, but I can’t see past the operand and other Operation layout changes; I’ll deep-dive into this as soon as those settle out.

Yeah np. I’m exploring other options right now and I have made the following optimizations that do result in a measurable speedup:

  • std::vector<NamedAttribute> (specifically std::vector and not SmallVector or any other vector-like) inside Operation
  • Identifier lookups on attribute lists larger than 16 elements should do a string binary search (there are operations with 300k+ attributes); see the sketch at the end of this post
  • Op::verify is calling OpAdaptor::verify which uses string lookups - I changed this to regenerate the verifier inside the op class with identifier lookups
  • Intrinsic attribute lookups can search a known subrange of the sorted attribute list (with help from ODS)

Plus some other minor changes. I’ll throw this patch up as well.
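
For the second bullet above, here’s a standalone sketch of the size-based dispatch (simplified types; in MLIR the small-list scan compares uniqued Identifiers, which is a pointer compare rather than the string compare shown here):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using Attribute = const void *;
using NamedAttribute = std::pair<std::string, Attribute>;

Attribute lookup(const std::vector<NamedAttribute> &sortedAttrs,
                 std::string_view name) {
  constexpr size_t kLinearScanLimit = 16; // crossover point, per the bullet above
  if (sortedAttrs.size() <= kLinearScanLimit) {
    // Small lists: a plain linear scan wins.
    for (const auto &entry : sortedAttrs)
      if (entry.first == name)
        return entry.second;
    return nullptr;
  }
  // Large lists: binary search by string, relying on the sorted invariant.
  auto it = std::lower_bound(
      sortedAttrs.begin(), sortedAttrs.end(), name,
      [](const NamedAttribute &a, std::string_view n) { return a.first < n; });
  return (it != sortedAttrs.end() && it->first == name) ? it->second : nullptr;
}
```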