[RFC] Introduce alloca_scope op

Introduction

This proposal describes a new op, alloca_scope, to be added to the memref dialect.

Motivation

Alloca operations are easy to misuse, especially when relying on them during rewriting/conversion passes. Consider a simple example of two independent dialects: one defines an op that wants to allocate on the stack, and another defines a construct that corresponds to some form of looping:

dialect1.looping_op {
  %x = dialect2.stack_allocating_op
}

Since the dialects might not know about each other, they are going to define their lowerings to std/scf/etc. independently:

scf.for … {
   %x_temp = std.alloca …
   … // do some domain-specific work using %x_temp buffer 
   … // and store the result into %result
   %x = %result
}

Later on, the scf ops and std.alloca are going to be lowered to LLVM using a combination of llvm.alloca and unstructured control flow.

At this point the use of %x_temp is bound either to be optimized by LLVM (for example via mem2reg) or, in the worst case, to perform an independent stack allocation on each iteration of the loop. While the LLVM optimizations are likely to succeed, they are not guaranteed to, and this opens the door to surprising issues with unexpected stack usage.
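To make the failure mode concrete, here is a hand-written sketch (block names, the %size value, and the exact syntax are assumed for illustration, not taken from an actual lowering) of what the loop above can look like once lowered to the LLVM dialect with unstructured control flow:

```mlir
^loop_header:
  %cond = llvm.icmp "slt" %iv, %ub : i64
  llvm.cond_br %cond, ^loop_body, ^loop_exit
^loop_body:
  // A fresh stack allocation happens on every iteration; nothing
  // releases it before branching back to the header.
  %x_temp = llvm.alloca %size x f32 : (i64) -> !llvm.ptr<f32>
  ... // domain-specific work using %x_temp
  llvm.br ^loop_header
```

Unless LLVM manages to optimize the alloca away, the stack grows by %size on every trip around the loop.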

Proposal

We propose a new operation, alloca_scope, that defines a finer-grained allocation scope for alloca-allocated memory:

alloca_scope {
   %x_temp = alloca …
   ...
}


Here the lifetime of %x_temp is bound to the narrow region of the alloca_scope. Moreover, one can also return values out of the alloca_scope with an accompanying alloca_scope.return op (that behaves similarly to scf.yield):

%result = alloca_scope {
   %x_temp = alloca …
   …
   alloca_scope.return %myvalue
}


Under the hood, the alloca_scope is going to be lowered to a combination of llvm.intr.stacksave and llvm.intr.stackrestore that are invoked automatically as control flow enters and leaves the body of the alloca_scope.
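Schematically, and with approximate syntax (the element type and %size are assumed for illustration), the lowering would produce:

```mlir
// Entering the scope saves the current stack pointer.
%sp = llvm.intr.stacksave : !llvm.ptr<i8>
%x_temp = llvm.alloca %size x f32 : (i64) -> !llvm.ptr<f32>
... // body of the scope
// Every exit from the scope restores it, releasing all allocas made inside.
llvm.intr.stackrestore %sp : !llvm.ptr<i8>
```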

The key value of the new op is to allow deterministic, guaranteed stack usage through an explicit region in the code that is finer-grained than the function-level scope of the AutomaticAllocationScope trait. alloca_scope can be inserted at arbitrary locations and doesn’t require non-trivial transformations such as outlining.

Which dialect

Before the memref dialect is split out, alloca_scope can temporarily reside in the std dialect and later be moved to memref together with the rest of the memory-related operations.

Implementation

An implementation of the op is available here.


Thanks for the proposal and the good write up!

It seems that this is a no-op if there is no allocation in the region, right? If so then it seems that we’re missing an opportunity in the structure here: we could just fold the “alloca” aspect into the op itself:

%result = alloca_scope %size as %x_temp {
   ... // %x_temp has size %size and is available within this region
}

Have you considered such alternative and the tradeoffs?

@mehdi_amini, the issue with not using an explicit alloca and folding it into another op this way is that everything (passes, utilities) will now have to worry about a block argument Value that is being allocated via this magic op, instead of a simple getDefiningOp<AllocLikeOp>(..) or a value defined by an op with an Allocate memory effect on its result. I’m a -1 on loading the semantics this way instead of keeping it factored into existing patterns. I know that there are other ops outside of AllocLikeOps that allocate as part of their semantics (e.g., a version of the copy op recently proposed, I think), but those can be handled via the Allocate memory effect and they are still defining ops. This one is a new pattern and is different because things using it see it as a block argument. Separately, I think the Allocate memory effect can’t be attached to block arguments but only to results. This part can be addressed though.

@shabalin This looks good to me, thanks. Please make sure to attach the AutomaticAllocationScope trait to this op, and your lowering to LLVM using llvm.intr.stacksave etc. should look at this trait as opposed to this magic op name. That way, other ops that attach this trait will get the same lowering treatment.
https://mlir.llvm.org/docs/Traits/#automaticallocationscope

Right, there is a trade-off.
My reservation with the current proposal is that the canonicalization isn’t obvious to me in the various forms of IR we can get.
Think about nesting an `alloca_scope` within another one and the rules for merging them.

Also, are the scopes created by this op intended to exclusively map to dynamically allocated memory? I’d expect any static allocation to be hoisted into the entry block of the function, for instance.

Another difficult aspect: if there aren’t any alloca in the region, can we fold this op away?
What if there is a call op in this region and after inlining later we end up with an alloca?
Does it mean that the inliner should always add this op after inlining?
What about the way this op gets in the way of code motion?

As you see I have many questions :)


That wouldn’t work with the dialect conversion. It can match either specific ops or any Operation * with a programmable condition (e.g., having a trait). Once an op is matched, further patterns are not considered. So if this conversion matched any op with the trait, all such ops would be converted only to stacksave/stackrestore.

I think we might opt for a different approach: add a pattern (or a separate pass because this is not specific to the lowering to LLVM) that matches ops with the trait and adds an explicit scope around them.

Hoisting allocas to the function entry block is something LLVM does and MLIR does not, at least not systematically. I am not sure it is feasible, let alone desirable, in MLIR because there is an unbounded number of function-like or stack-scoping operations; even if we write the hoisting for std.func, llvm.func, and gpu.func via an interface, somebody will ignore it for their function op. I have also found it quite annoying to keep track of a separate “alloca insertion point” when emitting LLVM IR, so I welcome a different mechanism.

On the flip side, the proposed op sounds like a good place where the hoisting can happen because, unlike custom functions, it has an explicit stack scope semantics.

These are good questions that, I suppose, can be answered by clarifying how this op relates to other ops that are also allocation scopes (potentially implicitly, like func). The inliner is implemented on interfaces, it shouldn’t need to know about this op and AFAIK it doesn’t do anything special to allocations.

Hey Denys,

Instead of:

Have you considered something like this:

dynamic_alloca { %x_temp in
   ...
   store %0 -> %x_temp
   ...
}

Basically, instead of making the “op with the region” be a scope that holds unstructured dynamic allocas, you could make the “op with the region” itself be the allocation, with the SSA value being a BB argument in its body (like %i is in an affine.for).

This seems to more directly model what happens in languages with (e.g.) dynamically sized values on the stack, since those values have stack discipline. It also seems like this would be easier to analyze and transform than having to find the allocas in the alloca_scope.

-Chris


Hi @mehdi_amini, @clattner,

Thanks a lot for the comments!

This is indeed a valid alternative and it would address the main example I listed earlier. Many of today’s uses of alloca would work just as well with the dynamic_alloca solution.

Another aspect to think about is that such a design would restrict the use of temporary stack allocation purely to intermediate results within the op, and would not allow returning stack-allocated memory (since that would be equivalent to use-after-free), even for an extremely brief duration between ops.

With the current alloca_scope sketch, ops can use allocas naively, and the user of the dialect is then required to insert alloca_scope appropriately to manage stack memory. For example:

scf.for ... {
   %source = mydialect.op_that_returns_stack_allocated_memory
   %sink = mydialect.op_that_consumes_stack_allocated_memory %source
}

Here, with the dynamic_alloca-like design, you could not lower the first op safely. With alloca_scope you can lower it naively and then wrap the whole body to prevent accidental leakage of stack memory.
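As a sketch of that wrapping applied to the example above:

```mlir
scf.for ... {
  alloca_scope {
    %source = mydialect.op_that_returns_stack_allocated_memory
    %sink = mydialect.op_that_consumes_stack_allocated_memory %source
    // Any stack memory allocated while lowering the ops above is
    // reclaimed when the scope exits, before the next iteration.
  }
}
```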

My understanding is that some of the existing lowerings of the memref family of ops already follow such a pattern (since many of the memref ops use stack-allocated memory under the hood). In fact, the original reason for looking into this was internal users complaining that some memref ops can use a surprising amount of stack space today in MLIR.

Those are good points. To eliminate an alloca_scope one has to make sure that none of the ops within it use alloca (even indirectly). One could add an interface to mark ops that do in fact use stack-allocated memory.

Another thing that comes to mind is that in case of nested alloca_scope-s you can fold them into a single one:

// stacksave
alloca_scope {
  %v1 = alloca ...
  // stacksave
  alloca_scope {
     %v2 = alloca ...
     ...
  }
  // stackrestore
}
// stackrestore

Here the last stackrestore always wins so there is no reason to do it twice.
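After the fold, the example above would collapse to a single scope (sketch):

```mlir
// stacksave
alloca_scope {
  %v1 = alloca ...
  %v2 = alloca ...
  ...
}
// stackrestore
```

Note that this extends the lifetime of %v2 to the end of the outer scope, which is conservative but safe.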

The main benefit of alloca_scope is that it works even for dynamically sized allocations. If the size of a stack allocation is not statically known, it’s not easy to hoist it out of the loop (for example, if the size of the allocation depends on results derived from the loop iteration index).

It would be interesting to discuss more what would be the rules for hoisting, since AFAIK right now we rely on LLVM to do alloca-related optimizations. My intuition is that cases that are hoistable now should still be hoistable with alloca_scope and the rest are likely to be leaking the memory in the absence of alloca_scope.

This could be an interesting thing to investigate as well. Defensively wrapping inlined functions could be a nice experiment to battle-test the op and see how well it scales to more broad usage. It would also force us to figure out good folding/canonicalization rules to make it work well.

As briefly mentioned earlier, my hunch is that it only changes things for dynamically-sized allocations which are not working well at the moment anyway. But the interaction definitely needs to be discussed further.


I haven’t really thought about this, but I wonder if static allocas should be rethought. Instead of being operations floating around in the entry block like in LLVM, maybe they should be BB args in the entry block? This would be closer to modeling how they actually work and would make it easier to do things with them than scanning the entry block.


This sounds fairly dangerous, as it could overflow the stack (even if we stacksave/stackrestore; sizes derived from loop induction variables can get pretty big). Do those users experience significant problems if they just use alloc for such allocations? In a similar vein, in your example of mydialect.op_that_returns_stack_allocated_memory, why does op_that_returns_stack_allocated_memory have such a strong opinion about returning stack-allocated memory? (Typically in upstream we always lower to alloc and then optimize to allocas in some circumstances.)

@mehdi_amini Before we get to the questions, note that what you propose has different representational power than the original post, IIUC - it forces a longer lifetime than desired in several cases compared with the proposal in the OP and is thus less powerful. Just making sure we are aware we are comparing different things.

Can you give an example? It isn’t obvious to me right now: the lifetime dies with the scope you create, why would it get longer?

It’s longer than what you can achieve with the separate scope + alloca’s inside (OP’s proposal) in several scenarios.

Can you provide an example of such scenarios? Just some pseudo-IR or alternatively elaborate a bit on why would the lifetime be extended in one case and not the other?
As I wrote above it isn’t obvious to me right now unfortunately.

alloca_scope {
   ...
   affine.for ... {
     ... = alloca()    // Assume that you don't want these to be thrown away until you hit the end of the alloca_scope that exists above.
     affine.for ...  {
        ...
     }
     ... = alloca
     affine.for ... {
         ...
     }
     ...
  }
}

Thanks!

// Assume that you don’t want these to be thrown away until you hit the end of the alloca_scope that exists above.

Isn’t it the exact opposite of what you were saying before though? The lifetime in your example is longer than strictly needed for these allocas, compared to forcing to tie an alloca to a scope:

   ...
   affine.for ... {
     alloca_scope %... {
       affine.for ...  {
          ...
       }
     }
     alloca_scope %... { // note how it forces you to resolve the ambiguity in your example: the lifetime of the first alloca stops before this one starts. But you could nest this one in the previous alloca_scope as well.
       affine.for ... {
           ...
       }
     }
     ...
   }

I think you’ve read it the other way. If you were to use the semantics you proposed, you’d have all those things allocated at the outer alloca_scope - much earlier than needed, thus forcing a longer lifetime. Please don’t change the example and move the alloca_scope inside - that’s not what the user desired here. (On a separate note, in fact, you can’t even allocate a variable/compile-time-unknown number of buffers with the “combined” scope + allocation approach - you could imagine for now that those fors were ifs if you wanted to get rid of the unknown-number-of-buffers part, and the loss of additional flexibility is still apparent with the “combined” approach of the scope op itself providing the allocation.) I’m not saying one approach is better than the other for IR transformation, but they are different in what they can represent.

OK I see what you mean now: the lifetime extension is that you can alloca inside a region and “escape” the allocation to the parent somehow?

alloca_scope {
  %alloc = scf.if ... {
     %true_alloc = alloc(%size1) // lifetime is the alloca_scope
     yield %true_alloc  // escape outside of the regions in scf.if
  } else {
     %false_alloc = alloc(%size2) // lifetime is the alloca_scope
     yield %false_alloc // escape outside of the regions in scf.if
  }
  ... // use %alloc
}

That’s right - the lifetime start (as well as size) of allocations happening inside nested ops with regions (pretty common with if/else and for/while yielding from the last iteration to the op’s result) can’t be captured in the same way with the combined scope + allocate approach.

Loops are one of the reasons we need scoping, regardless of users having opinions about using the stack. Something like

scf.for %i ... {
  // arguably, the allocation here has static size
  memref_cast ... : memref<?x?xf32> to memref<*xf32>
}

or

scf.for %i ... {
  call @foo() : () -> memref<*xf32>
}

can already overflow the stack for loops that iterate long enough, because unranked memrefs are lowered to stack allocations - no need to depend on the induction variable. And at the LLVM dialect level there is no alloc that we could use and hope to see optimized into an alloca.
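Under the proposed op, such loops could be made safe by wrapping the body (sketch):

```mlir
scf.for %i ... {
  alloca_scope {
    // The stack allocation backing the unranked memref descriptor is
    // now released at the end of each iteration.
    memref_cast ... : memref<?x?xf32> to memref<*xf32>
  }
}
```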


Hi all,

Thanks a lot for the discussion. To summarise here are some key points that were brought up here:

  • @mehdi_amini and @clattner suggested an alternative dynamic_alloca op.

  • @mehdi_amini asked a number of clarifying questions about how things are going to fit into existing infrastructure such as canonicalization, inlining, and code motion.

  • @clattner wondered whether we should rethink static stack allocations by putting them into the first basic block.

  • @_sean_silva highlighted the dangers of dynamic stack allocations with respect to stack space exhaustion.

  • @bondhugula and @mehdi_amini discussed the interaction of allocation lifetimes and nested scoping between the original proposal and dynamic_alloca variant.

  • @ftynse highlighted that many of the concerns raised earlier for stack exhaustion don’t even require dynamic allocation and can easily happen today while using just the memref dialect with looping constructs.

Since there haven’t been any alternative proposals that address stack space exhaustion while using memref ops, I suggest we proceed with merging the new op from the original proposal and address the rest of the concerns (such as interaction with canonicalization, inlining, code motion, etc.) as smaller follow-up contributions.

I’d like the semantics of the new op to be defined before the op gets added. In particular, all the interactions you mentioned are critical to me to have defined. The implementation can be done as a follow-up, but there shouldn’t be a need to revisit unclear aspects of the op semantics for every new patch adding a canonicalization or updating a transformation.

@mehdi_amini Here is a brief summary of all the feature interactions we’ve discussed earlier:

  • Canonicalization:

    • alloca_scope without any stack allocating op within it can be eliminated.
    • ops that expect to use stack allocation in their lowering have a marker interface to express that (for example memref casts).
    • whenever alloca_scope is nested directly one inside the other, the inner one can be eliminated (as shown earlier).
  • Inlining:

    • inlining wraps the body of the inlined method into alloca_scope.
    • inlining works together with canonicalization to eliminate redundant alloca_scope-s when they are not necessary.
  • Code motion:

    • it’s legal to move alloca operations from an inner alloca_scope to the beginning of any of the outer ones, or to the beginning of the function (which can be seen as having an implicit alloca_scope wrapped around its body), provided this does not break dominance in the case of dynamically sized stack allocations.
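As a sketch of the inlining point above (the @callee name is hypothetical):

```mlir
// Before inlining:
scf.for ... {
  call @callee() : () -> ()
}

// After inlining, the callee body is wrapped defensively, so any allocas
// it contains stay scoped to a single iteration:
scf.for ... {
  alloca_scope {
    ... // inlined body of @callee
  }
}
```

Canonicalization can then remove the wrapper whenever the inlined body turns out not to allocate on the stack.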

Do you think anything else is missing?

Now this gets tricky, because it seems like it makes the behavior of operations dependent on some lowering instead of the operation definition itself.
For example, memref.cast is a good example: will we create a new trait IsMemAlloca? Would we tag memref.cast with this trait?

Which makes me think: the way we define it today I believe is with the AutomaticAllocationScope trait. Any operation can be defined with this scope already.
From this perspective, isn’t alloca_scope just a “passthrough” operation with the AutomaticAllocationScope trait?

If we define it this way, it seems to imply that we can rephrase

whenever alloca_scope is nested directly one inside the other, the inner one can be eliminated

into:

whenever alloca_scope is nested directly inside an operation with AutomaticAllocationScope trait, it can be eliminated.

What’s the status on this?

I have run into the stack overflow again when working with loopy code in an out-of-tree project. Even the basic version would work for me.

To try being more constructive, I don’t think we need a separate “memalloc” trait, but can rely on side effect modeling. It should be okay to remove an alloca scope that only contains side effect-free operations; further extensions can look into using the “alloc” effect and have a “stack” resource, but I don’t really need those.

Thanks for the suggestion @ftynse, I think side-effect-based analysis could be a decent solution here. alloca_scope regions that contain no side-effecting operations should be safe to unwrap. Of course, with finer-grained effect tracking we could be even more precise and model operations with specific effects (such as alloc on a stack resource).
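As a sketch of that side-effect-based unwrapping (the ops shown are illustrative):

```mlir
// The scope contains only side effect-free ops, so it cannot allocate
// on the stack and the wrapper can be removed:
alloca_scope {
  %sum = addi %a, %b : index
  ...
}

// becomes simply:
%sum = addi %a, %b : index
...
```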

I see that this was committed, a bit quickly, but some of my questions were not answered I believe.

For example:

You didn’t provide the AutomaticAllocationScope on the op you committed, why?

I didn’t get an answer here either: is this operation just a “passthrough + AutomaticAllocationScope”? If so please document it as such in the first place!

I think the op should indeed have the AutomaticAllocationScope trait, but should not be defined in terms of the trait. It provides exactly the feature that the trait does not: the possibility to match and rewrite the scope operation.

@mehdi_amini Thanks for bringing up the omission, indeed AutomaticAllocationScope should have been added to the op. I’ve amended it in D104227.

I mean that the operation is literally doing nothing but providing a region with this trait, hence why I asked above to document it this way.
(thanks for the update)