Improving the efficiency of Operation::getOperand

Diving into the third item in this list of things that can be done to improve compiler performance, I’d like to discuss improving the efficiency of the getOperand calls.

For background, Operation is currently defined like this:

class alignas(8) Operation final
    : public llvm::ilist_node_with_parent<Operation, Block>,
      private llvm::TrailingObjects<Operation, BlockOperand, Region,
                                    detail::OperandStorage>

The operands of an operation are stored as a consecutive array of OpOperand values, which are allocated as part of the aptly named OperandStorage object. The consequence of this is that very common calls like op->getOperand(0) or op.lhs() have to load the # of block operands and the # of regions, and do a bunch of math to get at the OpOperand and the Value in question. It may also surprise you that something like Value lhs = thing.lhs(), rhs = thing.rhs(); does all of this arithmetic twice: it can’t be CSE’d because it is large and out-of-line.
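
To make that cost concrete, here is a rough sketch of the arithmetic involved (illustrative only, not the actual implementation, and ignoring the alignment handling that llvm::TrailingObjects does internally):

  OpOperand &getOpOperandSketch(Operation *op, unsigned index) {
    // Trailing objects are laid out as <BlockOperand..., Region...,
    // OperandStorage>, so reaching the operand storage means walking past
    // the first two arrays.
    char *trailing = reinterpret_cast<char *>(op + 1);
    trailing += op->getNumSuccessors() * sizeof(BlockOperand); // load #succs
    trailing += op->getNumRegions() * sizeof(Region);          // load #regions
    auto &storage = *reinterpret_cast<detail::OperandStorage *>(trailing);
    // Only now can we index the OpOperand array, which itself branches on
    // whether the storage is inline or dynamically allocated.
    return storage.getOperands()[index];
  }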

It is inherent to the design of Operation that something will need this arithmetic, but BlockOperands and Regions are much less common, and much less commonly accessed, than operands, so I’d like to see operands come first.

The challenge with this is that detail::OperandStorage itself has trailing objects:

/// This class handles the management of operation operands. Operands are
/// stored either in a trailing array, or a dynamically resizable vector.
class OperandStorage final
    : private llvm::TrailingObjects<OperandStorage, OpOperand> {
...
  /// The current representation of the storage. This is either an inline
  /// TrailingOperandStorage, or a pointer to a dynamically allocated
  /// TrailingOperandStorage.
  union {
    TrailingOperandStorage inlineStorage;
    llvm::PointerIntPair<TrailingOperandStorage *, 1, bool,
                         StoragePointerLikeTypeTraits>
        dynamicStorage;
  };

I think that can be addressed by refactoring this to flatten the operand list into Operation itself, something akin to:

class alignas(8) Operation final
    : public llvm::ilist_node_with_parent<Operation, Block>,
      private llvm::TrailingObjects<Operation, OpOperand, BlockOperand, Region>

And then give Operation a pointer to OpOperand for the out-of-line case, and have separate “number of inline operands” vs. “number of operands” integers. Does this seem like a sensible approach?
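
As a rough sketch of what I mean (hypothetical member names, not a worked-out patch):

  class alignas(8) Operation final
      : public llvm::ilist_node_with_parent<Operation, Block>,
        private llvm::TrailingObjects<Operation, OpOperand, BlockOperand, Region> {
    /// Non-null once the operand list has outgrown the inline allocation and
    /// moved to the heap.
    OpOperand *outOfLineOperands = nullptr;
    /// Number of OpOperand slots allocated inline with the operation; needed
    /// to locate the BlockOperand/Region arrays that follow them.
    unsigned numInlineOperands;
    /// Number of operands the operation currently has.
    unsigned numOperands;

  public:
    MutableArrayRef<OpOperand> getOpOperands() {
      OpOperand *base = outOfLineOperands ? outOfLineOperands
                                          : getTrailingObjects<OpOperand>();
      return {base, numOperands};
    }
  };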

-Chris


+1 Chris!

This seems like a big efficiency win.

Every time I’ve attempted this, the result has been a negligible change in performance across all of the benchmarks I have, but an increase in the size of every operation (not ideal). I can upload my patch for others to benchmark with, and if it has a worthwhile performance improvement I’d be on board. If the performance isn’t a decent improvement though, I’d rather keep Operation smaller.

– River


Why would it increase the size of Operation?

Right now we don’t track the original number of operands, because we don’t have to (the operand storage is at the end of the trailing objects, so there is no need for additional bookkeeping). If we move the operands earlier in the trailing objects list, there will need to be some kind of additional field, e.g. the original number of inline operand slots, and adding that will increase the size of Operation. Right now the only cost of the operand storage on the actual Operation class is 1 bit (“do you have an operand storage?”).
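
To make the size trade-off concrete (hypothetical field names, not the exact fields in the patch):

  // Today: the only per-Operation cost of operand storage is a single bit;
  // everything else lives in the trailing OperandStorage, which is only
  // allocated when the operation actually has operands.
  bool hasOperandStorage : 1;

  // After reordering: something like this has to be tracked as well, since
  // the trailing objects that now come after the operands can only be found
  // if we remember how many operand slots were allocated at creation time.
  unsigned numAllocatedOperands;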

D111695 [mlir] Move the Operation OperandStorage to the first trailing object is the revision that I’ve been playing around with.

– River

Thank you so much for posting the patch River - I’m sorry for the delay getting back to this, I got sucked into a rabbit hole of speeding up other unrelated things. :slight_smile:

I experimented with the patch and it does show a noticeable speedup, particularly with two changes that I mentioned on your Phabricator review:

  1. Changing the bit-stealing from numOperands to capacity, since numOperands is very hot and this eliminates some masking (sketched just after this list).

  2. Changing Operation::getOperands() to be inline now that it is trivial.
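
The first tweak amounts to this kind of change in the operand storage header (sketched from the discussion; the exact declarations may differ):

  // Before: the discriminator bit is stolen from the hot numOperands field,
  // so every getNumOperands() pays for a mask.
  unsigned numOperands : 31;
  unsigned isStorageDynamic : 1;
  unsigned capacity;

  // After: numOperands is a plain, mask-free load; the colder capacity field
  // carries the bit instead.
  unsigned numOperands;
  unsigned capacity : 31;
  unsigned isStorageDynamic : 1;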

With both of these tiny tweaks, your patch shows a 9% speedup on a complex macro benchmark (details in the Phabricator review). I think this would be a great thing to do.

-Chris

Thanks for benchmarking Chris! 9% seems rather significant, so I think that provides some of the necessary justification for increasing the size of Operation a tiny bit. I’ll look at rebasing the patch based on your suggestions early next week.

– River

Thank you for working on this River, I appreciate it!

FWIW, there is still opportunity to squeeze things a bit. Operation currently stores each of numResults, numSuccs, and numRegions as 32/31-bit values. I think it would be reasonable to limit some of those to 16/15 bits. For example, if you limit numRegions to 15 bits, then you’d free up an unsigned short’s worth of space next to it.

Given that space, you could put either numOperands in that 16-bit hole (if you’re willing to limit operations to 65536 operands, which I’m not sure we are), or you could put capacity/isStorageDynamic from OperandStorage into that hole (if an operation has more than 32768 operands, it seems completely fine to allocate them out of line rather than inline with the operation). The advantage of moving numOperands is that it could then be accessed without having to check hasOperandStorage, which would shave off some more cycles. Either way, this shrinks things by 32 bits, which would save a word on 32-bit systems.

To save a word on 64-bit systems, we’d have to shrink something else, e.g. #results/#operands.
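
Roughly the kind of layout I’m imagining (purely illustrative field names and widths, assuming the other counts stay where they are today):

  unsigned numResults;        // unchanged
  unsigned numSuccs;          // unchanged
  unsigned numRegions : 15;   // was 31 bits
  unsigned hasOperandStorage : 1;
  // The freed 16 bits could hold either a 16-bit numOperands, or the
  // capacity/isStorageDynamic pair lifted out of OperandStorage (operations
  // needing more than 32768 inline slots simply allocate out of line):
  unsigned operandCapacity : 15;
  unsigned isStorageDynamic : 1;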

-Chris

We have some operations with >700k operands and results, so some out of line storage would be needed.

Uhm, whoa. Can you share what’s going on there? Are there ways to factor them into multiple ops?

-Chris

It seems that if we go down this path, packing storage even more (say, into 7 bits for each) would be worthwhile. The extra cycles for the checks are going to be irrelevant on any modern architecture compared to the storage savings if we can fit larger designs into cache. Maybe you guys are considering other things, but I was thinking about something like this: SQLite4: Variable-Length Integers

True variable length encoding is extremely expensive to decode, and makes the normal path more bloated (in terms of code size etc).

That said, your point is good. A different way to get really extreme packing is to shrink these fields to 8 bits (which handles the vast majority of cases) and have a single “isLargeOperation” bit that is set when any of the fields is too large. This bit would be calculated when allocating the operation, and if isLargeOperation is true, an optional LargeOperationInfo chunk would be added to the operation, similar to how OperandStorage is currently conditional on a bit. That LargeOperationInfo struct would hold all the fields as full unsigneds.

This would then cause all the accessors to be:

  unsigned getNumRegions() {
    if (!isLargeOperation)
      return smallNumRegions;  // 8 bit field with a cheap zext load.
    return getTrailingObjects<detail::LargeOperationInfo>()->numRegions;  // 32-bit load
  }

The only one that can’t be shrunk like this is #operands, since it is actually dynamic, and you don’t know at allocation time whether it will stay small.
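
A minimal sketch of the companion pieces (hypothetical names, mirroring how OperandStorage is gated on a bit today): the out-of-line struct and the allocation-time decision.

  namespace detail {
  /// Full-width counts, only allocated for the rare "large" operation whose
  /// fields overflow the 8-bit inline versions.
  struct LargeOperationInfo {
    unsigned numResults;
    unsigned numSuccs;
    unsigned numRegions;
  };
  } // namespace detail

  // Computed once when the Operation is allocated and never changed; the
  // operand count is deliberately excluded because it can grow later.
  bool isLargeOperation =
      numResults > 255 || numSuccs > 255 || numRegions > 255;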

-Chris

This can happen in large-scale TensorFlow models. You end up with a function with that many parameters, and so at each call site that many operands…

I quite like this idea, especially given that it would negate the increased size from optimizing operand lookups (and be a net win for zero-operand operations). I’ll experiment with this as a follow-on.

– River

I’ve never quite seen one that big, but I have seen some hefty ones (where even basic transformations decompose them into more usual sizes). Not to go down a rabbit hole, but have we considered that maybe we got something wrong about the programming model if this is the effect? :slight_smile: (not that it makes a difference to there needing to be a way to handle unusual numbers of operands vs. arbitrary, small limits)

Frequently this does not really show up at the programming level, I think: the user writes an SPMD computation initially. It is the lowering process that maps this to a cluster of O(1000s) of devices and expresses it in a way that is just convenient for the runtime.
We may be able to avoid such expansions by pushing the “SPMD” view down to the runtime (i.e. if the runtime were “native SPMD”).

Correct, this is when we have a model distributed over a large cluster of devices, so the parameters could be small while the operand counts are large. SPMD is also a more restrictive form of parallelism than what is supported and used by some of the models, so while pushing it down is useful (and there is some WIP in that regard), it won’t cover everything.

But the proposal being discussed with isLargeOperation sounds like it handles this.

Cool, thanks for working on this River! Please let me know if you’d like me to measure anything; I have a pretty good setup for performance, though I don’t have a great way of measuring memory use.

-Chris

Thanks. I’ve got a few good memory size benchmarks, but the performance ones I have access to are quite lacking.

– River

Yes, this was along the lines of what I was thinking. I don’t think we need to do exactly what SQLite proposes, but just make use of the statistics of the real programs we have. Probably 99.99% of operations have < 127 operands and are statically allocated. For moderately sized data sets, performance is almost always hugely influenced by fitting more data in cache.