[RFC] Representing pipelined loops

Introduction

Following today’s discussion, it seems like we need to commit to a representation of pipelined loops, which can be targeted from higher-level dialects and used to generate an FSMD representation.

I propose we evolve staticlogic.pipeline to fit our needs. If that ultimately aligns with the needs of software pipelining as outlined in Better modelling of pipeline loops, we can talk about upstreaming this. For now, it seems prudent to incubate the representation in CIRCT.

So, what are our needs? To drill in a bit, I think there are a few things:

  • Represent the induction variable of an scf.for or the condition variable of scf.while and how it updates
  • Represent information computed by scheduling in the IR, like initiation interval and start times of operations
  • Enable the generation of an FSMD

What else?

Calyx representation

For me, it helps to think bottom up, starting from what is required to enable the generation of an FSMD. I’m going to assume we are talking in terms of Calyx, which represents the datapath with wires and groups, and the FSM with a control language that enables different groups.

It seems like there are two major needs: FSMD for the looping constructs and FSMD for the stages of the pipeline, which access the looping constructs. I’m trying to focus on what the control language should look like. In Calyx pseudo-IR, I’m picturing something like:

while %loop_construct.condition with @LoopGroup
  parallel
    enable @LoopGroup
    if %input.ready with @InputGroup
      enable @Stage1Group
    if %stage1.done with @Stage1Group
      enable @Stage2Group
    if %stage2.done with @Stage2Group
      enable @OutputGroup

Does that make sense? I’m curious if there is a good idiom for representing a pipeline in Calyx.

To support such an FSM, we’d need datapath components for:

  • Computing the next values of looping constructs, done conditions, etc.
  • Computing each stage of the pipeline

Pipeline stages

From scheduling, we can get an initiation interval, and the start times of different operations. Operations with the same start time are in the same stage.

We have already discussed a couple of options for capturing the computation stages: either with a sequence of blocks (staticlogic.pipeline), or with a sequence of region-carrying ops (scf.pipeline_for). I think either option can make sense.

One thing I’m considering is how to pass values between stages (i.e. where to put pipeline registers). In the current staticlogic.pipeline, this is implicit in the values flowing between each block. In scf.pipeline_for, this is made explicit in the values yielded and returned by each stage. An in-between approach could be to define a new operation that is a block terminator, which passes values to the next block:

staticlogic.pipeline
^bb0:
  %0 = std.addi %1, %2 : i32
  staticlogic.yield %1 : i32
^bb1(%arg0: i32): // %arg0 is passed from %1
  ...

Would that help analysis and lowering to FSMD?

Loop conditions and variables

Another thing I’m interested in is the interaction between looping constructs like induction variables and the stages. I’m imagining a component that computes the loop variable updates and done conditions, and feeds the loop variables into the first stage. The value of the loop variable is carried down through the stages, so each stage can access the appropriate value of the induction variable at the right time.

This is something I liked about the scf.pipeline_for representation: it passes the induction variable to each stage as a block argument, ahead of the input values:

%0 = scf.pipeline_stage {
  ^bb0(%iv : index, %arg1 : f32):

Whether or not we use a wrapping op like scf.pipeline_stage, or use a flat list of blocks inside staticlogic.pipeline directly (possibly with the proposed staticlogic.yield), it seems like having the current induction variable passed as a basic block argument makes sense.

Besides how to pass them around, I’m also thinking about how to represent the computation of loop variables. The scf.pipeline_for proposal was specific to for-loops, and captures the lower bound, upper bound, and step in the op itself. I’m also thinking about scf.while (which scf.for can be converted to, and which may be needed for certain frontend use cases). I think capturing the loop variable computation in the op, like scf.pipeline_for, is a good thing, but I’m looking for a little more generality, like scf.while.
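For reference, here is roughly what upstream scf.while looks like: a “before” region computes the continuation condition and forwards values via scf.condition, and an “after” region holds the body (a minimal sketch with illustrative names):

%res = scf.while (%iv = %c0) : (index) -> index {
  %cond = cmpi slt, %iv, %c10 : index
  scf.condition(%cond) %iv : index
} do {
^bb0(%iv2: index):
  %next = addi %iv2, %c1 : index
  scf.yield %next : index
}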

Perhaps we can add this onto staticlogic.pipeline, similar to how scf.while does it. We could have a region to compute the next value of the condition, and a region to compute any other “looping variables”, like the induction variable in a for-loop, and present them to the pipeline stages. I’m hoping this would capture the information needed to generate the appropriate Calyx component(s).

Draft proposal

Given the above, I’m thinking of how we can update staticlogic.pipeline.

Here’s a draft. Let’s say we have this loop, with scheduling information in parentheses:

scf.for %i = %c0 to %c10 step %c1 { (II = 1)
  %0 = memref.load %memory1[%i]     (startTime = 0)
  %1 = std.addi %0, %c1             (startTime = 1)
  memref.store %1, %memory2[%i]     (startTime = 2)
}

Here’s one point in the design space:

staticlogic.pipeline {II = 1} {
  condition(%done = %c0) {            // %c0 is an operand, the initial value. %done is a block argument.
    %0 = std.icmp lt %done, %c10
    staticlogic.yield %0
  }
  update(%i = %c0) {                  // %c0 is an operand, the initial value. %i is a block argument.
    %0 = std.addi %i, %c1
    staticlogic.yield %0              // each value yielded here corresponds to a block argument in the stages
  }
  stages() {                          // any inputs could be accepted as operands to stages
  ^bb0(%i0):                          // %i0 is yielded by the update block, and any other block args come from the operands to stages (to keep staticlogic.pipeline IsolatedFromAbove)
    %0 = memref.load %memory1[%i0]
    staticlogic.yield %i0, %0
  ^bb1(%i1, %arg0):                   // the arguments are now fed through the previous yield
    %1 = std.addi %arg0, %c1
    staticlogic.yield %i1, %1
  ^bb2(%i2, %arg1):
    memref.store %arg1, %memory2[%i2]
    staticlogic.return                // pipeline outputs could be returned here
  }
}

Does this representation capture the ingredients for a FSMD? What’s missing?

Other thoughts

Some random other thoughts I’ve had that aren’t addressed by the above draft:

  • How to pass around loop-carried values? I think this can be captured similarly to scf.for, etc., using iter_args on staticlogic.pipeline, alongside condition, update, and stages. The values from the staticlogic.return could be made available as iter_args.
  • How to represent nested loops? I think it should be possible to cram more complex logic into the condition and update regions, to update multiple loop induction variables. Control flow could be represented at this level using std.select to implement mux-like behavior, or we could rely on CFG operations like cond_br.
  • Does it make sense to explicitly yield induction variables from one stage to the next? Is there a meaningful computation we’d want to do on them, in order to yield something to the next stage other than the value passed in? That could enable a degree of flexibility. If not, we could make the passing of the first block argument implicit, as is done in the scf.pipeline_for proposal.

Thank you for the write-up; I think it summarizes the discussions and different proposals up to now nicely.

I voiced my support for a region-based approach during the meeting, but I’ll repeat it here for reference:
While the block-based approach of staticlogic.pipeline seems possibly simpler to implement, analyse, and optimize across stages, it puts quite a severe restriction on the input program, namely that it must be fully lowered to comb - this will eliminate any block-based control flow in favor of multiplexers, which is what allows staticlogic.pipeline to “hijack” block semantics to represent pipeline stages.
As I see it, this representation is too specific and wouldn’t work as a (potentially upstreamable) general pipeline representation, seeing as we disallow block-based control flow within the different stages of the loop. With a region-based approach, we may place control flow within the regions, or we can have region bodies fully lowered to comb. This makes it applicable to pipelining lowered comb logic, as well as pipelining before comb lowering. In this way we are not painting ourselves into a corner where we are forced to perform pipelining after comb lowering.

For managing the iterations of the pipeline itself, I don’t see a good argument for specializing this to for-loops; for-loops are syntactic sugar for while-loops, so a controller FSM managing a for-loop will be essentially identical to one managing a while-loop.
After lowering a for-loop to a while-loop, the incrementation of the induction variable is no longer anything special - it is just part of the datapath. In your example, I’d say this would equate to inlining the update region into the first stage of the pipeline. Therefore, if going with a while-based approach, there is no need for an update op.

If basing the pipeline representation on while-loops, the induction variable itself is also no longer anything special - it is just a loop-carried value similar to any other iter_args. This means that the loop variable will always be made a stage argument for lowered for-loops, but never for lowered while-loops (since a while-loop doesn’t have one).

Here is what I was thinking for a while-loop, region-based approach, using your provided example. There are only iter_args and no “special” induction variables; the incrementation of the for-loop induction variable is inlined into the first stage, and the staticlogic.condition op simply references the loop-carried values. Furthermore, a dedicated terminator operation for staticlogic.stages is required to forward loop-carried values to the next iteration.


staticlogic.pipeline(/*iterargs=*/%done = %c0) {II = 1} {
  staticlogic.condition {
    %0 = std.icmp lt %done, %c10
    staticlogic.yield %0
  }
  staticlogic.stages {
    %0:2 = staticlogic.stage {
      %i = std.addi %done, %c1
      %l = memref.load %memory1[%i]
      staticlogic.yield %i, %l
    }
    %1:2 = staticlogic.stage {
      %2 = std.addi %0#1, %c1
      staticlogic.yield %0#0, %2
    }
    staticlogic.stage {
      memref.store %1#1, %memory2[%1#0]
    }
    staticlogic.return %0#0 // Terminator op matching iterargs type
  }
}

In this case, the return value depends on a variable yielded from stage 1, meaning that an II of 1 is feasible. We could imagine that if it depended on values from later stages, this would then increase the achievable II.

As a relevant reference, and since it came up in the August 25th meeting, here is how Calyx (and by extension the Calyx dialect) can encode pipelines. We’ve also internally discussed how to represent the pipeline stages at a higher level using a new control operator, but at the end of the day it will compile down to a similar encoding.

As mentioned in the link above, Calyx does not place any restriction on the complexity of the stage—it can be an arbitrary control graph and the Calyx lowering passes will handle it just fine. I think this is more in line with @mortbopet’s explanation/reasoning on why the iter_args stuff is not represented in the pipeline representation.

Regarding the use of basic blocks, I have two points.

If we need it, there is a std.select operation we can use to implement mux-like behavior without having to lower all the way to Comb.
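For example, an if/else that only selects between two values could stay at this level as something like the following (a minimal sketch with illustrative names):

%then = addi %x, %c1 : i32
%else = subi %x, %c1 : i32
%r = select %cond, %then, %else : i32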

More importantly, this is about structured control flow. The SCF and Affine dialects require each op to have a single body block, terminated by scf.yield or affine.yield. At this level of abstraction, there is no CFG of basic blocks, and branching is represented by scf.if or affine.if. In fact, for software compilation, SCF or Affine will lower into a CFG before lowering to LLVM IR. To me, the discussion about basic blocks within stages is kind of moot: we are choosing to start at a higher level of abstraction (SCF or Affine) and generate our pipeline from there. This is really the beauty of MLIR for HLS: we get to define how “high level” the input IR really is.
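For instance, a value-producing branch at this level is a structured op with a single-block region per branch rather than a CFG (a sketch with illustrative names):

%r = scf.if %cond -> (i32) {
  scf.yield %a : i32
} else {
  scf.yield %b : i32
}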

All that being said, I do agree that this use of basic blocks is somewhat out of the norm, and might not be acceptable for upstream MLIR. Especially if we agree stages should yield outputs through a terminator, wrapping each stage in an op with a single block makes sense. It also seems more natural for any transformations we’d like to do at this level of abstraction with RewritePatterns. And we could support a CFG within each stage, if there were a need. I propose we add an op like staticlogic.pipeline.stage for this.

This is a great insight, thanks for helping me see this. I think replacing what I called the update region above with iter_args and a terminator that can yield results from the stages makes sense.

Another alternative I was just looking into is how scf.while handles this. It combines what I called condition and update above into one region. That region’s terminator has one operand to indicate if the loop should continue, and forwards the rest of its operands to the body.

We could do something similar to scf.while, but that is only really useful for things like induction variables that we would always compute in the first stage. I think we will want general iter_args in order to support reduction loops coming from the frontend, which may not yield until the final stage. If we have general iter_args, then it seems to me like we should use them for induction variables, etc. as well, rather than doing something like scf.while. I’ll adopt what you proposed above.

One question is whether we want to separate loop-carried values from the outputs of the operation. For SCF iter_args, they are one and the same, and the values yielded by the final iteration become the results of the operation. I think in our case it may make sense to make a distinction between internal loop-carried values and outputs. Perhaps staticlogic.return could have two variadic lists of operands: one for outputs and one for iter_args.

Thanks for the pointer, that’s just what I was looking for. I’ll check it out.

I’m away from keyboard for a bit, but I’ll post again with an updated proposal (i.e. hypothetical IR) taking the above into account.

The scf.execute_region operation allows for control-flow in SCF-level MLIR, e.g.:

func @for_ex(%arg0 : index, %arg1: index, %arg2: index) -> f32 {
  %s0 = constant 0.0 : f32
  %result:2 = scf.for %i0 = %arg0 to %arg1 step %arg2 iter_args(%iarg0 = %s0, %iarg1 = %s0) -> (f32, f32) {
    %sn = scf.execute_region -> f32 {
      %0 = std.cmpf oeq, %iarg0, %iarg1 : f32
      cond_br %0, ^bb0, ^bb1
    ^bb0:
      scf.yield %iarg0 : f32
    ^bb1:
      scf.yield %iarg1 : f32
    }
    scf.yield %sn, %sn : f32, f32
  }
  return %result#1 : f32
}

or if lowered to a while loop:

module  {
  builtin.func @for_iter_args(%arg0: index, %arg1: index, %arg2: index) -> f32 {
    %cst = constant 0.000000e+00 : f32
    %0:3 = scf.while (%arg3 = %arg0, %arg4 = %cst, %arg5 = %cst) : (index, f32, f32) -> (index, f32, f32) {
      %1 = cmpi slt, %arg3, %arg1 : index
      scf.condition(%1) %arg3, %arg4, %arg5 : index, f32, f32
    } do {
    ^bb0(%arg3: index, %arg4: f32, %arg5: f32):  // no predecessors
      %1:2 = scf.execute_region -> (f32, f32) {
        %3 = scf.execute_region -> f32 {
          %4 = cmpf oeq, %arg4, %arg5 : f32
          cond_br %4, ^bb1(%arg4 : f32), ^bb1(%arg5 : f32)
        ^bb1(%5: f32):  // 2 preds: ^bb0, ^bb0
          scf.yield %5 : f32
        }
        scf.yield %3, %3 : f32, f32
      }
      %2 = addi %arg3, %arg2 : index
      scf.yield %2, %1#0, %1#1 : index, f32, f32
    }
    return %0#2 : f32
  }
}

I’m not aware of pipelining transformations that handle loops with control-flow. At least for modulo schedulers, the usual approach is to if-convert the loop’s body to a CDFG, and schedule from there. So it might not be necessary to allow control-flow in the pipeline stages, as it would be lowered to predicated data-flow before the stages are formed.

One aspect currently missing in the proposal is how to handle multi-cycle operations, as we touched upon in the ODM. It’s obvious that an operation being inside a pipeline stage means that it is started in that stage, but we need some notion of when a result is available, e.g. for transformations or verifiers. This could be done by encoding the op’s latency, or using a future-like operation in the stage in which the result is available. Also, would we allow empty stages, e.g. if no op starts in a certain time step?

In general, if such a representation of a pipelined datapath should be transformed/verified after scheduling/formation of the stages, I believe we need to encode the same information as currently modeled in the circt::scheduling::Problems, such as the latency (cf. previous paragraph), operator blocking times, resource constraints, etc.

I’m not aware of pipelining transformations that handle loops with control-flow. At least for modulo schedulers, the usual approach is to if-convert the loop’s body to a CDFG, and schedule from there.

Just trying to state the obvious here—this restriction only applies in the static scheduling case, right? One reasonable question to ask here is whether we want to commit to tying the pipeline representation to only be implementable using a statically-timed pipeline. The other option is to have a generic representation of pipelines that can be either scheduled statically or dynamically (latency-insensitive).

The Calyx philosophy from day one has been that static scheduling is just an optimization on dynamic scheduling and everything should always have a compilation path through dynamic compilation. The example in Calyx I linked above already works with both dynamic and static scheduling (albeit doesn’t take full advantage of all static timing information yet).

In the same way, the utility of a pipeline operator that can encode arbitrary control flow is that it’s more homogeneous with the rest of the language—there are no restrictions on how control operators (in Calyx parlance) may be composed with the pipeline operator. As long as they can be lowered, the pipeline can be constructed. Arbitrary restrictions on language constructs are a famous problem in HLS tools today, so I would recommend considering the effect on upstream dialects when thinking about a restricted pipeline operator.

This is an important point. If we want to do any transformations on this IR, we will need to know all the same information that is used by scheduling, in order to write robust verifiers that ensure the transformations are valid. Another way of putting this is we are talking about capturing the relevant data structures from circt::scheduling::Problem in the IR itself.

To that end, I’ve been thinking about the issues like multi-cycle operations and stages with no operations. If we stick to the representation that a sequence of “stage” operations mark the start times, I think there are decent answers. To encode the latency of operations, we could specify per-result attributes on the stage terminator operation, which label the stages at which each result will be available. The verifier for the pipeline operation could check that values are accessed in the appropriate stage. For stages that don’t start any operations, we could have a “stage” operation with an empty body block.
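As a rough sketch of that encoding (hypothetical syntax, anticipating the revised proposal below), a stage would annotate its terminator with per-result latencies, and an empty stage would contain only the terminator:

%s:2 = staticlogic.pipeline.stage {
  %p = mulf %a, %b : f32                          // e.g. a 3-cycle multiplier started in this stage
  staticlogic.yield %p {latency = 3}, %a {latency = 1}
}
staticlogic.pipeline.stage {                      // no operation starts in this time step
  staticlogic.yield
}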

The idea around future-like operations is interesting. I’m picturing something like the HIR delay operation, which takes an SSA value and encodes a latency after which it may be accessed. This is an interesting alternative, but I wonder if it will be amenable to the kind of transformations we want to do. It seems like answering questions like “is this access valid?” or “at which stage is this value available?” would require more IR traversal than the representation described in the previous paragraph.
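To make the alternative concrete, a future-like encoding might look roughly like this (purely hypothetical op name and syntax, for illustration only):

%p = mulf %a, %b : f32                             // multi-cycle multiply started in this stage
%pf = staticlogic.future %p {latency = 3} : f32    // hypothetical: %pf may only be read 3 stages later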

My gut feeling is to start with the “sequence of stage” representation, with attributes to encode latency. Does that sound reasonable? Is there a use case that would make the future-like representation preferable?

Taking the above discussions into account, here is my revised proposal. Note that the staticlogic.pipeline.stage operation could support arbitrary control flow à la scf.execute_region, if so desired:

staticlogic.pipeline {II = 1} {
  iter_args(%i = %c0)
  condition(%done = %c0) {
    %0 = std.icmp lt %done, %c10
    staticlogic.yield %0
  }
  stages() { // any inputs to the pipeline could be passed to stages()
    %s0:3 = staticlogic.pipeline.stage {
      %0 = memref.load %memory1[%i]
      %1 = std.addi %i, %c1
      staticlogic.yield %i {latency=1}, %0 {latency=1}, %1 {latency=1}
    }
    %s1:2 = staticlogic.pipeline.stage {
      %1 = std.addi %s0#1, %c1
      staticlogic.yield %s0#0 {latency=1}, %1 {latency=1}
    }
    staticlogic.pipeline.stage {
      memref.store %s1#1, %memory2[%s1#0]
      staticlogic.yield
    }
    staticlogic.return results(), iter_args(%s0#2) // any outputs from the pipeline could be passed to results()
  }
}

It’s unclear to me how this is fundamentally different from using blocks for feedforward designs. We could represent this as something like (simplified slightly):

^bb0:
      %0 = memref.load %memory1[%i]
      %1 = std.addi %i, %c1
      staticlogic.reg %i, %0, %1
^bb1(%a, %b, %c):
      %1 = std.addi %b, %c1
      staticlogic.reg %a, %1

This would avoid replicating a lot of machinery that already exists for basic blocks.
It’s also unclear to me what you’re trying to represent with “latency”. Perhaps you could provide some more examples? How do values get passed forward from one pipeline stage to another? How are recurrences described?

Here are a few C-like examples that might be interesting to discuss concretely:

// value in x needs to get passed back to a previous stage.  What happens if the add requires multiple pipeline stages?
x = 0; for(i) {  x += a[i] }   

// reuse values read from a.  store for multiple cycles.
for(i) {  x = a[i]; if(i >=3) {b[i-3] = x + a[i-3];} }

// a is 3 different registers implemented as a rotating register file.  equivalent to a[i/3] += b[i];
for(i) { a0=a1; a1= a2; a2 = a3; a3 = a0+b[i] }

// histogram.  raw and war dependencies through memory a[i].
for(i) { a[b[i]] ++;}

The only difference is that a region-holding op could support control flow with branches within a single stage. I’m still doubtful that anything will actually want to target that at this level of abstraction, but it sounded like there was some interest in at least keeping the door open for such a thing. Wrapping each pipeline stage in an operation would at least support that kind of generality without too much overhead. Hopefully the useful block machinery in MLIR can still be made to work with some helper methods (splicing operations, etc.)

The intent was that staticlogic.yield would represent pipeline registers to pass values forward to the next stage.

This is what the iter_args are intended to capture. The staticlogic.return can indicate that some values flow back and are accessible through iter_args.

In general, I need to add a better prose exposition (which should ultimately go into a rationale doc). But I also think fleshing out more examples is the best way to iterate on the design at this point. Let me compile those actual examples to SCF and I’ll then propose how I’d convert the SCF to a pipeline.

After this morning’s meeting, I think it’s worth restating the goal of this proposal, since what I’ve proposed so far is sort of missing the mark. We all seemed to agree that a compilation path from loops could use a “high level pipeline” representation, which captures data dependence but abstracts away the details of RTL control signals. After this level, we could do scheduling and lower into a representation that makes cycle-by-cycle control explicit. This proposal is really about the former, and we are hoping to target Calyx for the latter.

First off, is that a fair summary?

If so, I’ll take a look at what could go into this representation, and how we can get useful information from MLIR tools. I’ve been looking into what MLIR can support, and in the Affine dialect we already have tools for analyzing memory access dependences. I can try running the above examples through those tools, see what information we can get, and start proposing how to capture that in the IR with a “high level pipeline” operation.


I tested the above examples using Polygeist to compile C into Affine dialect ops. Then, I wrote a little debug analysis using checkMemrefAccessDependence. Below are the actual C code, Affine dialect IR, results from the check, and a little commentary on each example.

This is just one of many tools available upstream, but I think this is getting at what we want. Feedback welcome!

Test 1

C

int test1(int a[10]) {
  int x = 0;
  for (int i = 0; i < 10; ++i) {
    x += a[i];
  }
  return x;
}

Affine

builtin.func @test1(%arg0: memref<?xi32>) -> i32 {
  %c0_i32 = constant 0 : i32
  %0:2 = affine.for %arg1 = 0 to 10 iter_args(%arg2 = %c0_i32, %arg3 = %c0_i32) -> (i32, i32) {
    %1 = affine.load %arg0[%arg1] : memref<?xi32>
    %2 = addi %arg2, %1 : i32
    affine.yield %2, %2 : i32, i32
  }
  return %0#1 : i32
}

checkMemrefAccessDependence Result

No memref access dependence.

Commentary

Rather than creating a single element memref for the x variable, the loop carried dependence is expressed with iter_args. Note that Polygeist appears to create two iter_args, one for the return value and one for the loop carried dependence, even though they are actually the same in this case.

Test 2

C

void test2(int a[10], int b[10]) {
  for (int i = 0; i < 10; ++i) {
    int x = a[i];
    if (i >= 3) {
      b[i - 3] = x + a[i - 3];
    }
  }
}

Affine

builtin.func @test2(%arg0: memref<?xi32>, %arg1: memref<?xi32>) {
  affine.for %arg2 = 0 to 10 {
    %0 = affine.load %arg0[%arg2] : memref<?xi32>
    affine.if #set(%arg2) {
      %1 = affine.load %arg0[%arg2 - 3] : memref<?xi32>
      %2 = addi %0, %1 : i32
      affine.store %2, %arg1[%arg2 - 3] : memref<?xi32>
    }
  }
  return
}

checkMemrefAccessDependence Result

RAR dependence at loop depth 1:
 loop: 0
 dependence: [3,3]
 src: %0 = affine.load %arg0[%arg2] : memref<?xi32>
 dst: %1 = affine.load %arg0[%arg2 - 3] : memref<?xi32>

Commentary

The tool detects a read-after-read dependence on the a array with a distance of 3.

Test 3

C

void test3(int b[10]) {
  int a0, a1, a2, a3;
  for (int i = 0; i < 10; ++i) {
    a0 = a1;
    a1 = a2;
    a2 = a3;
    a3 = a0 + b[i];
  }
}

Affine

builtin.func @test3(%arg0: memref<?xi32>) {
  %0 = memref.alloca() : memref<1xi32>
  %1 = memref.alloca() : memref<1xi32>
  %2 = memref.alloca() : memref<1xi32>
  affine.for %arg1 = 0 to 10 {
    %3 = affine.load %2[0] : memref<1xi32>
    %4 = affine.load %1[0] : memref<1xi32>
    affine.store %4, %2[0] : memref<1xi32>
    %5 = affine.load %0[0] : memref<1xi32>
    affine.store %5, %1[0] : memref<1xi32>
    %6 = affine.load %arg0[%arg1] : memref<?xi32>
    %7 = addi %3, %6 : i32
    affine.store %7, %0[0] : memref<1xi32>
  }
  return
}

checkMemrefAccessDependence Result

RAW dependence at loop depth 1:
 loop : 0
 dependence: [1,9]
 src: %3 = affine.load %2[0] : memref<1xi32>
 dst: affine.store %4, %2[0] : memref<1xi32>
RAW dependence at loop depth 1:
 loop : 0
 dependence: [1,9]
 src: %4 = affine.load %1[0] : memref<1xi32>
 dst: affine.store %5, %1[0] : memref<1xi32>
RAW dependence at loop depth 1:
 loop : 0
 dependence: [1,9]
 src: %5 = affine.load %0[0] : memref<1xi32>
 dst: affine.store %7, %0[0] : memref<1xi32>

Commentary

The tool detects read-after-write dependences on the single element memrefs representing the rotating register file. The dependence distance ranges from 1 to 9.

Test 4

C

void test4(int a[10], int b[10]) {
  for (int i = 0; i < 10; ++i) {
    a[b[i]]++;
  }
}

Affine

builtin.func @test4(%arg0: memref<?xi32>, %arg1: memref<?xi32>) {
  %c1_i32 = constant 1 : i32
  affine.for %arg2 = 0 to 10 {
    %0 = affine.load %arg1[%arg2] : memref<?xi32>
    %1 = index_cast %0 : i32 to index
    %2 = memref.load %arg0[%1] : memref<?xi32>
    %3 = addi %2, %c1_i32 : i32
    memref.store %3, %arg0[%1] : memref<?xi32>
  }
  return
}

checkMemrefAccessDependence Result

No memref access dependence.

Commentary

It looks like reading the index out of the b array causes Polygeist to insert an index_cast operation, which leads to memref.load and memref.store being used, rather than affine.load and affine.store. This stumps the tool, since it only works with Affine loads and stores. I couldn’t figure out how to convince Polygeist that this C code wanted to store index types in the b array, but I will try handwriting some Affine dialect code to see if this example can be understood by the tool.


Is this last case a lack of being conservative in checkMemrefAccessDependence? (@bondhugula)

Cool! So if we scheduled these into our various pipeline proposals, how would they look?

This seems like an interesting canonicalization I hadn’t thought about… maybe a pass could generically optimize any ReturnLikeOp with redundant args?

I’m not sure you can do anything generically in a simple way: you’d need to know about the enclosing op as well. For example, for a func operation, you’d need to update the signature and find all the call-sites that need to be updated as well.

That’s the question! I wanted to share the results first, but I’ll come back with some ideas.

Indeed. This code is actually irregular (non-affine). The subscript for %arg0 is being loaded from memory (effectively %arg0[%arg1[%arg2]]). The affine dependence analysis pass won’t define/represent/report any dependences between affine ops and non-affine ones that might be reading/writing memref operands.

For this “high level” pipeline operation, I’ve been wondering if we want to capture the exact dependence distances shown above, or if it is enough to know whether or not there is a dependence. If there is a dependence, the operations can’t be scheduled into the same coarse-grained stage, but I’m not sure we need the exact distances at this level of abstraction (if they are useful, they could be captured in an attribute).

During our meeting, @jopperm mentioned that perhaps we could give every operation a dummy latency of 1 and run the scheduling. I have been toying with this, and I think it makes sense at this level. This would allow the dataflow through SSA values to capture which coarse-grained stage operations should be scheduled into. We could insert a dependence into the problem for a couple of scenarios:

  • memory access dependences, to ensure RAW or WAW happens in an order such that writes are visible
  • dependences from every operation enclosed within an if to the condition of the if, so the condition is computed before the operations that are predicated by it

I think this makes sense, but I’m curious what others think. I’m trying out this approach, and I’ll post what the four running examples might look like using this coarse-grained scheduling approach and the pipeline operation I previously proposed.