Improving the efficiency of Operation::getOperand

Diving into the third item in this list of things that can be done to improve compiler performance, I’d like to discuss improving the efficiency of the getOperand calls.

For background, Operation is currently defined like this:

class alignas(8) Operation final
    : public llvm::ilist_node_with_parent<Operation, Block>,
      private llvm::TrailingObjects<Operation, BlockOperand, Region,
                                    detail::OperandStorage>

The operands of an operation are stored as a consecutive array of OpOperand values, which are allocated as part of the aptly named OperandStorage object. The consequence of this is that very common calls like op->getOperand(0) or op.lhs() have to load the # of block operands and the # of regions, and do a bunch of math to get at the OpOperand and the Value in question. It may also surprise you that something like Value lhs = thing.lhs(), rhs = thing.rhs(); does all of this arithmetic twice: it can’t be CSE’d because it is large and out-of-line.
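
To make that cost concrete, here is a rough sketch of the arithmetic involved (illustrative only, not the actual implementation, and ignoring the alignment handling that llvm::TrailingObjects does internally):

  OpOperand &getOpOperandSketch(Operation *op, unsigned index) {
    // Trailing objects are laid out as <BlockOperand..., Region...,
    // OperandStorage>, so reaching the operand storage means walking past
    // the first two arrays.
    char *trailing = reinterpret_cast<char *>(op + 1);
    trailing += op->getNumSuccessors() * sizeof(BlockOperand); // load #succs
    trailing += op->getNumRegions() * sizeof(Region);          // load #regions
    auto &storage = *reinterpret_cast<detail::OperandStorage *>(trailing);
    // Only now can we index the OpOperand array, which itself branches on
    // whether the storage is inline or dynamically allocated.
    return storage.getOperands()[index];
  }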

It is inherent to the design of Operation that something will need this arithmetic, but BlockOperands and Regions are much less common, and much less commonly accessed, than operands, so I’d like to see operands come first.

The challenge with this is that detail::OperandStorage itself has trailing objects:

/// This class handles the management of operation operands. Operands are
/// stored either in a trailing array, or a dynamically resizable vector.
class OperandStorage final
    : private llvm::TrailingObjects<OperandStorage, OpOperand> {
...
  /// The current representation of the storage. This is either an inline
  /// TrailingOperandStorage, or a pointer to a dynamically allocated
  /// TrailingOperandStorage.
  union {
    TrailingOperandStorage inlineStorage;
    llvm::PointerIntPair<TrailingOperandStorage *, 1, bool,
                         StoragePointerLikeTypeTraits>
        dynamicStorage;
  };

I think that can be addressed by refactoring this to flatten the operand list into Operation itself, something akin to:

class alignas(8) Operation final
    : public llvm::ilist_node_with_parent<Operation, Block>,
      private llvm::TrailingObjects<Operation, OpOperand, BlockOperand, Region>

And then give Operation a pointer to OpOperand for the out-of-line case, and have separate “number of inline operands” vs. “number of operands” integers. Does this seem like a sensible approach?
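
As a rough sketch of what I mean (hypothetical member names, not a worked-out patch):

  class alignas(8) Operation final
      : public llvm::ilist_node_with_parent<Operation, Block>,
        private llvm::TrailingObjects<Operation, OpOperand, BlockOperand, Region> {
    /// Non-null once the operand list has outgrown the inline allocation and
    /// moved to the heap.
    OpOperand *outOfLineOperands = nullptr;
    /// Number of OpOperand slots allocated inline with the operation; needed
    /// to locate the BlockOperand/Region arrays that follow them.
    unsigned numInlineOperands;
    /// Number of operands the operation currently has.
    unsigned numOperands;

  public:
    MutableArrayRef<OpOperand> getOpOperands() {
      OpOperand *base = outOfLineOperands ? outOfLineOperands
                                          : getTrailingObjects<OpOperand>();
      return {base, numOperands};
    }
  };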

-Chris


+1 Chris!

This seems like a big efficiency win.

Every time I’ve attempted this, the result has been a negligible change in performance across all of the benchmarks I have, but an increase in the size of every operation (not ideal). I can upload my patch for others to benchmark with, and if it has a worthwhile performance improvement I’d be on board. If the performance isn’t a decent improvement though, I’d rather keep Operation smaller.

– River


Why would it increase the size of Operation?

Right now we don’t track the original number of operands, because we don’t have to (the operand storage is at the end of the trailing objects, so there is no need for additional bookkeeping). If we move the operands earlier in the trailing objects list, there will need to be some kind of additional field, e.g. the original number of inline operand slots, and adding that will increase the size of Operation. Right now the only cost of the operand storage on the actual Operation class is 1 bit (“do you have an operand storage?”).
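
To make the size trade-off concrete (hypothetical field names, not the exact fields in the patch):

  // Today: the only per-Operation cost of operand storage is a single bit;
  // everything else lives in the trailing OperandStorage, which is only
  // allocated when the operation actually has operands.
  bool hasOperandStorage : 1;

  // After reordering: something like this has to be tracked as well, since
  // the trailing objects that now come after the operands can only be found
  // if we remember how many operand slots were allocated at creation time.
  unsigned numAllocatedOperands;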

D111695 [mlir] Move the Operation OperandStorage to the first trailing object is the revision that I’ve been playing around with.

– River

Thank you so much for posting the patch River - I’m sorry for the delay getting back to this, I got sucked into a rabbit hole of speeding up other unrelated things. :slight_smile:

I experimented with the patch and it does show a noticeable speedup, particularly with two changes that I mentioned on your Phabricator review:

  1. Changing the bit-stealing from numOperands to capacity, since numOperands is very hot and this eliminates some masking (sketched just after this list).

  2. Changing Operation::getOperands() to be inline now that it is trivial.
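
The first tweak amounts to this kind of change in the operand storage header (sketched from the discussion; the exact declarations may differ):

  // Before: the discriminator bit is stolen from the hot numOperands field,
  // so every getNumOperands() pays for a mask.
  unsigned numOperands : 31;
  unsigned isStorageDynamic : 1;
  unsigned capacity;

  // After: numOperands is a plain, mask-free load; the colder capacity field
  // carries the bit instead.
  unsigned numOperands;
  unsigned capacity : 31;
  unsigned isStorageDynamic : 1;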

With both of these tiny tweaks, your patch shows a 9% speedup on a complex macro benchmark (details in the Phabricator review). I think this would be a great thing to do.

-Chris

Thanks for benchmarking Chris! 9% seems rather significant, so I think that provides some of the necessary justification for increasing the size of Operation a tiny bit. I’ll look at rebasing the patch based on your suggestions early next week.

– River

Thank you for working on this River, I appreciate it!

FWIW, there is still opportunity to squeeze things a bit. Operation currently stores each of numResults, numSuccs, and numRegions as 32/31-bit values. I think it would be reasonable to limit some of those to 16/15 bits. For example, if you limit numRegions to 15 bits, then you’d free up an unsigned short’s worth of space next to it.

Given that space, you could put either numOperands in that 16-bit hole (if you’re willing to limit operations to 65536 operands, which I’m not sure we are), or you could put capacity/isStorageDynamic from OperandStorage into that hole (if an operation has more than 32768 operands, it seems completely fine to allocate them out of line rather than inline with the operation). The advantage of moving numOperands is that it could then be accessed without having to check hasOperandStorage, which would shave off some more cycles. Either way, this shrinks things by 32 bits, which would save a word on 32-bit systems.

To save a word on 64-bit systems, we’d have to shrink something else, e.g. #results/#operands.
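
Roughly the kind of layout I’m imagining (purely illustrative field names and widths, assuming the other counts stay where they are today):

  unsigned numResults;        // unchanged
  unsigned numSuccs;          // unchanged
  unsigned numRegions : 15;   // was 31 bits
  unsigned hasOperandStorage : 1;
  // The freed 16 bits could hold either a 16-bit numOperands, or the
  // capacity/isStorageDynamic pair lifted out of OperandStorage (operations
  // needing more than 32768 inline slots simply allocate out of line):
  unsigned operandCapacity : 15;
  unsigned isStorageDynamic : 1;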

-Chris

We have some operations with >700k operands and results, so some out of line storage would be needed.

Uhm, whoa. Can you share what’s going on there? Are there ways to factor them into multiple ops?

-Chris

It seems that if we go down this path, packing storage even more (say, into 7 bits for each) would be worthwhile. The extra cycles for the checks are going to be irrelevant on any modern architecture compared to the storage savings if we can fit larger designs into cache. Maybe you guys are considering other things, but I was thinking about something like this: SQLite4: Variable-Length Integers

True variable length encoding is extremely expensive to decode, and makes the normal path more bloated (in terms of code size etc).

That said, your point is good. A different way to get really extreme packing is to shrink these fields to 8 bits (which handles the vast majority of cases) and have a single “isLargeOperation” bit that is set when any of the fields is too large. This bit would be calculated when allocating the operation, and if isLargeOperation is true, an optional LargeOperationInfo chunk would be added to the operation, similar to how OperandStorage is currently conditional on a bit. That LargeOperationInfo struct would hold all the fields as full unsigneds.

This would then cause all the accessors to be:

  unsigned getNumRegions() {
    if (!isLargeOperation)
      return smallNumRegions;  // 8 bit field with a cheap zext load.
    return getTrailingObjects<detail::LargeOperationInfo>()->numRegions;  // 32-bit load
  }

The only one that can’t be shrunk like this is #operands, since it is actually dynamic, and you don’t know at allocation time whether it will stay small.
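
A minimal sketch of the companion pieces (hypothetical names, mirroring how OperandStorage is gated on a bit today): the out-of-line struct and the allocation-time decision.

  namespace detail {
  /// Full-width counts, only allocated for the rare "large" operation whose
  /// fields overflow the 8-bit inline versions.
  struct LargeOperationInfo {
    unsigned numResults;
    unsigned numSuccs;
    unsigned numRegions;
  };
  } // namespace detail

  // Computed once when the Operation is allocated and never changed; the
  // operand count is deliberately excluded because it can grow later.
  bool isLargeOperation =
      numResults > 255 || numSuccs > 255 || numRegions > 255;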

-Chris

This can happen in large-scale TensorFlow models. You end up with a function with that many parameters, and so at each call site that many operands…

I quite like this idea, especially given that it would negate the increased size from optimizing operand lookups (and be a net win for zero-operand operations). I’ll experiment with this as a follow-on.

– River

I’ve never quite seen one that big, but I have seen some hefty ones (where even basic transformations decompose them into more usual sizes). Not to go down a rabbit hole, but have we considered that maybe we got something wrong about the programming model if this is the effect? :slight_smile: (not that it makes a difference to there needing to be a way to handle unusual numbers of operands vs. arbitrary, small limits)

Frequently this does not really show up at the programming level, I think: the user writes an SPMD computation initially. It is the lowering process that maps this to a cluster of O(1000s) of devices and expresses it in a way that is just convenient for the runtime.
We may be able to avoid such expansions by pushing the “SPMD” view down to the runtime (i.e. if the runtime were “native SPMD”).

Correct, this is when we have a model distributed over a large cluster of devices, so the parameters could be small while the operand counts are large. SPMD is also a more restrictive form of parallelism than what is supported and used by some of the models, so while pushing it down is useful (and there is some WIP in that regard), it won’t cover everything.

But the proposal being discussed with isLargeOperation sounds like it handles this.

Cool, thanks for working on this River! Please let me know if you’d like me to measure anything; I have a pretty good setup for performance, though I don’t have a great way of measuring memory use.

-Chris

Thanks. I’ve got a few good memory size benchmarks, but the performance ones I have access to are quite lacking.

– River

Yes, this was along the lines of what I was thinking. I don’t think we need to do exactly what SQLite proposes, but just make use of the statistics of the real programs we have. Probably 99.99% of operations have < 127 operands and are statically allocated. For moderately sized data sets, performance is almost always hugely influenced by fitting more data in cache.