[RFC] Representing pipelined loops

The only difference is that a region-holding op could support control flow with branches within a single stage. I still doubt whether anything will actually want to target that at this level of abstraction, but it sounded like there was some interest in at least keeping the door open for such a thing. Wrapping each pipeline stage in an operation would at least support that kind of generality without too much overhead. Hopefully the useful block machinery in MLIR can still be made to work with some helper methods (splicing operations, etc.).

The intent was that staticlogic.yield would represent the pipeline registers that pass values forward to the next stage.

This was what the iter_args are intended to capture. The staticlogic.return can indicate that some values flow back and are accessible through iter_args.

In general, I need to add a better prose exposition (which should ultimately go into a rationale doc). But I also think fleshing out more examples is the best way to iterate on the design at this point. Let me compile those actual examples to SCF and I’ll then propose how I’d convert the SCF to a pipeline.

After this morning’s meeting, I think it’s worth restating the goal of this proposal, since what I’ve proposed so far is sort of missing the mark. We all seemed to agree that a compilation path from loops could use a “high level pipeline” representation, which captures data dependence but abstracts away the details of RTL control signals. After this level, we could do scheduling and lower into a representation that makes cycle-by-cycle control explicit. This proposal is really about the former, and we are hoping to target Calyx for the latter.

First off, is that a fair summary?

If so, I’ll take a look at what could go into this representation, and how we can get useful information from MLIR tools. I’ve been looking into what MLIR can support, and in the Affine dialect we already have tools for analyzing memory access dependences. I can try running the above examples through those tools, see what information we can get, and start proposing how to capture that in the IR with a “high level pipeline” operation.


I tested the above examples using Polygeist to compile C into Affine dialect ops. Then, I wrote a little debug analysis using checkMemrefAccessDependence. Below are the actual C code, Affine dialect IR, results from the check, and a little commentary on each example.

This is just one of many tools available upstream, but I think this is getting at what we want. Feedback welcome!
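
As a point of reference, the debug analysis doesn’t need to be anything fancy: it can simply collect the affine loads and stores in each function and query every pair. Below is a rough sketch of what such a check might look like, written against roughly the MLIR affine-analysis API of this era; the exact headers, constraint types, and parameter lists have shifted between MLIR versions, so treat the signatures as approximate rather than authoritative.

#include "llvm/ADT/SmallVector.h"
#include "mlir/Analysis/AffineAnalysis.h"     // checkMemrefAccessDependence
#include "mlir/Analysis/AffineStructures.h"   // FlatAffineConstraints
#include "mlir/Dialect/Affine/IR/AffineOps.h" // AffineLoadOp, AffineStoreOp
#include "mlir/IR/Operation.h"

using namespace mlir;

// Walk `func`, collect every affine load/store, and check each pair for a
// dependence at loop depth 1, allowing read-after-read results as well.
static void reportDependences(Operation *func) {
  SmallVector<Operation *, 8> accesses;
  func->walk([&](Operation *op) {
    if (isa<AffineLoadOp>(op) || isa<AffineStoreOp>(op))
      accesses.push_back(op);
  });
  for (Operation *srcOp : accesses) {
    for (Operation *dstOp : accesses) {
      if (srcOp == dstOp)
        continue;
      MemRefAccess src(srcOp), dst(dstOp);
      FlatAffineConstraints constraints;
      SmallVector<DependenceComponent, 2> components;
      DependenceResult result = checkMemrefAccessDependence(
          src, dst, /*loopDepth=*/1, &constraints, &components,
          /*allowRAR=*/true);
      if (hasDependence(result)) {
        // Print the access pair and the [lb, ub] distance range recorded in
        // `components`, which is where the output shown below comes from.
      }
    }
  }
}

The real analysis upstream is careful about which of the two accesses comes first in program order and about checking multiple loop depths; the sketch above glosses over that and only illustrates the shape of the query.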

Test 1

C

int test1(int a[10]) {
  int x = 0;
  for (int i = 0; i < 10; ++i) {
    x += a[i];
  }
  return x;
}

Affine

builtin.func @test1(%arg0: memref<?xi32>) -> i32 {
  %c0_i32 = constant 0 : i32
  %0:2 = affine.for %arg1 = 0 to 10 iter_args(%arg2 = %c0_i32, %arg3 = %c0_i32) -> (i32, i32) {
    %1 = affine.load %arg0[%arg1] : memref<?xi32>
    %2 = addi %arg2, %1 : i32
    affine.yield %2, %2 : i32, i32
  }
  return %0#1 : i32
}

checkMemrefAccessDependence Result

No memref access dependence.

Commentary

Rather than creating a single element memref for the x variable, the loop carried dependence is expressed with iter_args. Note that Polygeist appears to create two iter_args, one for the return value and one for the loop carried dependence, even though they are actually the same in this case.

Test 2

C

void test2(int a[10], int b[10]) {
  for (int i = 0; i < 10; ++i) {
    int x = a[i];
    if (i >= 3) {
      b[i - 3] = x + a[i - 3];
    }
  }
}

Affine

builtin.func @test2(%arg0: memref<?xi32>, %arg1: memref<?xi32>) {
  affine.for %arg2 = 0 to 10 {
    %0 = affine.load %arg0[%arg2] : memref<?xi32>
    affine.if #set(%arg2) {
      %1 = affine.load %arg0[%arg2 - 3] : memref<?xi32>
      %2 = addi %0, %1 : i32
      affine.store %2, %arg1[%arg2 - 3] : memref<?xi32>
    }
  }
  return
}

checkMemrefAccessDependence Result

RAR dependence at loop depth 1:
 loop: 0
 dependence: [3,3]
 src: %0 = affine.load %arg0[%arg2] : memref<?xi32>
 dst: %1 = affine.load %arg0[%arg2 - 3] : memref<?xi32>

Commentary

The tool detects a read-after-read dependence on the a array with a distance of 3.

Test 3

C

void test3(int b[10]) {
  int a0, a1, a2, a3;
  for (int i = 0; i < 10; ++i) {
    a0 = a1;
    a1 = a2;
    a2 = a3;
    a3 = a0 + b[i];
  }
}

Affine

builtin.func @test3(%arg0: memref<?xi32>) {
  %0 = memref.alloca() : memref<1xi32>
  %1 = memref.alloca() : memref<1xi32>
  %2 = memref.alloca() : memref<1xi32>
  affine.for %arg1 = 0 to 10 {
    %3 = affine.load %2[0] : memref<1xi32>
    %4 = affine.load %1[0] : memref<1xi32>
    affine.store %4, %2[0] : memref<1xi32>
    %5 = affine.load %0[0] : memref<1xi32>
    affine.store %5, %1[0] : memref<1xi32>
    %6 = affine.load %arg0[%arg1] : memref<?xi32>
    %7 = addi %3, %6 : i32
    affine.store %7, %0[0] : memref<1xi32>
  }
  return
}

checkMemrefAccessDependence Result

RAW dependence at loop depth 1:
 loop : 0
 dependence: [1,9]
 src: %3 = affine.load %2[0] : memref<1xi32>
 dst: affine.store %4, %2[0] : memref<1xi32>
RAW dependence at loop depth 1:
 loop : 0
 dependence: [1,9]
 src: %4 = affine.load %1[0] : memref<1xi32>
 dst: affine.store %5, %1[0] : memref<1xi32>
RAW dependence at loop depth 1:
 loop : 0
 dependence: [1,9]
 src: %5 = affine.load %0[0] : memref<1xi32>
 dst: affine.store %7, %0[0] : memref<1xi32>

Commentary

The tool detects read-after-write dependences on the single element memrefs representing the rotating register file. The dependence distance ranges from 1 to 9.

Test 4

C

void test4(int a[10], int b[10]) {
  for (int i = 0; i < 10; ++i) {
    a[b[i]]++;
  }
}

Affine

builtin.func @test4(%arg0: memref<?xi32>, %arg1: memref<?xi32>) {
  %c1_i32 = constant 1 : i32
  affine.for %arg2 = 0 to 10 {
    %0 = affine.load %arg1[%arg2] : memref<?xi32>
    %1 = index_cast %0 : i32 to index
    %2 = memref.load %arg0[%1] : memref<?xi32>
    %3 = addi %2, %c1_i32 : i32
    memref.store %3, %arg0[%1] : memref<?xi32>
  }
  return
}

checkMemrefAccessDependence Result

No memref access dependence.

Commentary

It looks like reading the index out of the b array causes Polygeist to insert an index_cast operation, which leads to memref.load and memref.store being used, rather than affine.load and affine.store. This stumps the tool, since it only works with Affine loads and stores. I couldn’t figure out how to convince Polygeist that this C code wanted to store index types in the b array, but I will try handwriting some Affine dialect code to see if this example can be understood by the tool.


Is this last case an instance of checkMemrefAccessDependence not being conservative? (@bondhugula)

Cool! So if we scheduled these into our various pipeline proposals, how would they look?

This seems like an interesting canonicalization I hadn’t thought about… maybe a pass could generically optimize any ReturnLikeOp with redundant args?

I’m not sure you can do anything generically in a simple way: you’d need to know about the enclosing op as well. For example, for a func operation, you’d need to update the signature and find all the call sites that need to be updated as well.

That’s the question! I wanted to share the results first, but I’ll come back with some ideas.

Indeed. This code is actually irregular (non-affine). The subscript for %arg0 is being loaded from memory (effectively %arg0[%arg1[%arg2]]). The affine dependence analysis pass won’t define/represent/report any dependences between affine ops and non-affine ones that might be reading/writing memref operands.

For this “high level” pipeline operation, I’ve been wondering if we want to capture the exact dependence distances shown above, or if it is enough to know whether or not there is a dependence. If there is a dependence, the operations can’t be scheduled into the same coarse-grained stage, but I’m not sure if we need the exact distances at this level of abstraction (if it is useful, it could be captured in an attribute).

During our meeting, @jopperm mentioned that perhaps we could give every operation a dummy latency of 1 and run the scheduling. I have been toying with this, and I think it makes sense at this level. This would allow the dataflow through SSA values to capture which coarse-grained stage each operation should be scheduled into. We could insert a dependence into the problem for a couple of scenarios:

  • memory access dependences, to ensure RAW or WAW happens in an order such that writes are visible
  • dependences from every operation enclosed within an if to the condition of the if, so the condition is computed before the operations that are predicated by it

I think this makes sense, but I’m curious what others think. I’m trying this approach, and I’ll post what the four running examples might look like using that coarse grained scheduling approach and the pipeline operation I previously proposed.
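
To make that concrete, here is a self-contained sketch (plain C++ over a toy op model, not CIRCT’s actual scheduling infrastructure) of what scheduling with dummy unit latencies boils down to: every operation lands in the earliest stage after all of its predecessors, where the predecessors are its SSA operands plus the auxiliary dependences listed above.

#include <algorithm>
#include <cstddef>
#include <vector>

struct OpNode {
  // Indices of the operations this one depends on: SSA operands, plus the
  // auxiliary memory-ordering and if-condition dependences described above.
  std::vector<std::size_t> deps;
};

// Assign a stage index to every operation. Assumes `ops` is topologically
// ordered (defs before uses), which holds for the straight-line loop bodies
// in the running examples.
std::vector<unsigned> assignStages(const std::vector<OpNode> &ops) {
  std::vector<unsigned> stage(ops.size(), 0);
  for (std::size_t i = 0; i < ops.size(); ++i)
    for (std::size_t d : ops[i].deps)
      stage[i] = std::max(stage[i], stage[d] + 1); // every op has latency 1
  return stage;
}

For Test 4’s loop body, for instance, the chain load-index → load-value → addi → store comes out as stages 0, 1, 2, 3, while the induction-variable increment has no intra-iteration dependences and lands in stage 0 alongside the first load.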

Here’s my revised proposal, with examples below. I’ll try to give a prose specification, followed by some questions I’ve had, and finally the examples we’ve been looking at. I appreciate any feedback.

Specification

staticlogic.pipeline

  • This is the top-level operation representing a high-level pipeline.
  • It is no longer isolated from above, but could be if that is helpful.
  • It has two regions: condition and stages.
  • It has iter_args similar to the SCF dialect for representing loop-carried values. When the pipeline starts execution, the registers indicated as iter_args by staticlogic.return are initialized to the initial values specified in the iter_args section here.

condition

  • This single-block region dictates when the pipeline terminates.
  • It has a staticlogic.yield terminator, and the pipeline continues executing while the yielded value is true.
  • It may access SSA values dominating the pipeline, as well as the iter_args.

stages

  • This region wraps staticlogic.pipeline_stage operations.
  • It has a staticlogic.return terminator, which can both return results from the pipeline and yield iter_args.
    • The results section accepts a variadic list of values which become the pipeline’s return values. These must be results of the last stage, and their types must match the pipeline return types.
    • The iter_args section accepts a variadic list of values which become the next iteration’s iter_args. These may be the results of any stage, and their types must match the pipeline iter_args types.

staticlogic.pipeline_stage

  • This operation has a single-block region which dictates the operations that may occur concurrently.
  • It may have an optional when predicate: the stage executes its body when the predicate is true, and pushes a bubble through the pipeline otherwise.
  • It has a staticlogic.yield terminator, which passes the concurrently computed values forward to the next stage.
  • Any stage may access iter_args.
  • Other than iter_args, stages may only access SSA values dominating the pipeline or SSA values computed by the previous stage. This ensures the stages capture the coarse-grained schedule of the pipeline.

Discussion

The iter_args and condition region were previously discussed, but the intent is to capture both for and while loops. All of the examples are simple for loops, so they have the same condition and induction variable update, but more complex loops should be possible.

The second example (with an if) brings up how to encode predicated execution. The body of an if will become one or more stages. I’ve chosen to add an optional predicate to “enable” any stages that arise from the body of an if, and populate the predicate with the condition of the if. The intent is that when a predicate is specified and is true, the stage executes as normal. But when the predicate is false, the stage still executes, but outputs meaningless results that bubble through the pipeline. Other options could be to embed multiple stages within an scf.if, or to inline scf.if into multiple stages. In general, does this kind of predication make sense at this level?

Another point we have discussed is whether the stages should be represented as simple blocks, or an op holding simple blocks. In this case, I’m choosing to represent this with a staticlogic.pipeline_stage op. In the current proposal, we are using simple blocks, so this creates a level of indirection, where SSA values must be yielded up. However, this opens the door to a few things:

  • predicated execution as described in the previous paragraph
  • the potential to store detailed dependence information or other attributes on each stage
  • the potential to support embedding a CFG a-la scf.execute_region

Finally, I want to make sure I’m capturing the right thing with the stages. We discussed how we want this “high-level pipeline” to capture data dependences while leaving the details of cycle-by-cycle control flexible. By using the scheduling infrastructure as described in the previous post, I think we can get just that. We end up with a flat sequence of stages, where the operations in a given stage can execute concurrently, but each stage executes sequentially one after another. If values computed in an earlier stage get used by later stages, I’m currently having them forwarded along through each intermediate stage, even if they aren’t used. My intent is to represent the pipelined nature of execution: each stage can only use values computed by the previous stage. Is this capturing what we want at this level of abstraction?

Examples

Below are the same four examples, with a little commentary about their peculiarities mixed in.

Test 1

This example sum-reduces an array. Hence, it makes use of iter_args not just for the loop induction variable, but for the actual result being computed. The first stage loads the next value, the second stage sums it, and the return specifies the sum is both the result and the iter_arg value for the next iteration.

func @test1(%arg0: memref<?xi32>) -> i32 {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c10 = constant 10 : index
  %c0_i32 = constant 0 : i32
  %0 = staticlogic.pipeline iter_args(%i = %c0, %arg_iter = %c0_i32) -> (i32) {
    condition {
      %done = cmpi lt, %i, %c10 : index
      staticlogic.yield %done
    }
    stages {
      %2:2 = staticlogic.pipeline_stage {
        %i_next = addi %i, %c1 : index
        %1 = memref.load %arg0[%i] : memref<?xi32>
        staticlogic.yield %i_next, %1
      }
      %4 = staticlogic.pipeline_stage {
        %3 = addi %2#1, %arg_iter : i32
        staticlogic.yield %3
      }
      staticlogic.return results(%4), iter_args(%2#0, %4)
    }
  }
  return %0 : i32
}

Test 2

This example demonstrates predicated execution. The first stage is eagerly loading values and computing the predicate. The following stages push a bubble through the pipeline the first 3 times they execute, after which the predicate becomes true and actual results are passed through.

func @test2(%arg0: memref<?xi32>, %arg1: memref<?xi32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c3 = constant 3 : index
  %c10 = constant 10 : index
  staticlogic.pipeline iter_args(%i = %c0) {
    condition {
      %done = cmpi lt, %i, %c10 : index
      staticlogic.yield %done
    }
    stages {
      %2:4 = staticlogic.pipeline_stage {
        %i_next = addi %i, %c1 : index
        %0 = memref.load %arg0[%i] : memref<?xi32>
        %1 = cmpi sge, %i, %c3 : index
        staticlogic.yield %i, %i_next, %0, %1
      }
      %4:3 = staticlogic.pipeline_stage when %2#3 {
        %3 = subi %2#0, %c3 : index
        staticlogic.yield %2#2, %2#3, %3
      }
      %6:4 = staticlogic.pipeline_stage when %4#1 {
        %5 = memref.load %arg0[%4#2] : memref<?xi32>
        staticlogic.yield %4#0, %4#1, %4#2, %5
      }
      %8:3 = staticlogic.pipeline_stage when %6#1 {
        %7 = addi %6#0, %6#3 : i32
        staticlogic.yield %6#1, %6#2, %7
      }
      staticlogic.pipeline_stage when %8#0 {
        memref.store %8#2, %arg1[%8#1] : memref<?xi32>
        staticlogic.yield
      }
      staticlogic.return results(), iter_args(%2#1)
    }
  }
  return
}

Test 3

This example represents the rotating register file using three memrefs. The first stage performs all loads, the second stage rotates the registers and computes the addition, and the last stage stores the result of the computation to the final register.

func @test3(%arg0: memref<?xi32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c10 = constant 10 : index
  %0 = memref.alloca() : memref<1xi32>
  %1 = memref.alloca() : memref<1xi32>
  %2 = memref.alloca() : memref<1xi32>
  staticlogic.pipeline iter_args(%i = %c0) {
    condition {
      %done = cmpi lt, %i, %c10 : index
      staticlogic.yield %done
    }
    stages {
      %7:5 = staticlogic.pipeline_stage {
        %i_next = addi %i, %c1 : index
        %3 = memref.load %2[%c0] : memref<1xi32>
        %4 = memref.load %1[%c0] : memref<1xi32>
        %5 = memref.load %0[%c0] : memref<1xi32>
        %6 = memref.load %arg0[%i] : memref<?xi32>
        staticlogic.yield %i_next, %3, %4, %5, %6
      }
      %9:1 = staticlogic.pipeline_stage {
        memref.store %7#2, %2[%c0] : memref<1xi32>
        memref.store %7#3, %1[%c0] : memref<1xi32>
        %8 = addi %7#1, %7#4 : i32
        staticlogic.yield %8
      }
      staticlogic.pipeline_stage {
        memref.store %9#0, %0[%c0] : memref<1xi32>
        staticlogic.yield
      }
      staticlogic.return results(), iter_args(%7#0)
    }
  }
  return
}

Test 4

This example demonstrates indirection. The first stage loads the index, the second stage uses the index to load the value to increment, the third stage performs the increment, and the final stage stores the value back. Note how the index loaded in the first stage has to be passed through the third stage, even though it isn’t used, in order for the fourth stage to use it.

func @test4(%arg0: memref<?xi32>, %arg1: memref<?xi32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c10 = constant 10 : index
  %c1_i32 = constant 1 : i32
  staticlogic.pipeline iter_args(%i = %c0) {
    condition {
      %done = cmpi lt, %i, %c10 : index
      staticlogic.yield %done
    }
    stages {
      %1:2 = staticlogic.pipeline_stage {
        %i_next = addi %i, %c1 : index
        %0 = memref.load %arg1[%i] : memref<?xi32>
        staticlogic.yield %i_next, %0
      }
      %3:2 = staticlogic.pipeline_stage {
        %2 = memref.load %arg0[%1#1] : memref<?xi32>
        staticlogic.yield %1#1, %2
      }
      %5:2 = staticlogic.pipeline_stage {
        %4 = addi %3#1, %c1_i32 : i32
        staticlogic.yield %3#0, %4
      }
      staticlogic.pipeline_stage {
        memref.store %5#1, %arg0[%5#0] : memref<?xi32>
        staticlogic.yield
      }
      staticlogic.return results(), iter_args(%1#0)
    }
  }
  return
}

I think the proposed representation captures the nature of a pipeline really well!

How would you proceed with the lowering once the stages are formed? I‘m a bit concerned that scheduling with 1-latencies to form the stages is prone to bundling unrelated operations with potentially very different runtime characteristics together, such as the simple addi and memory loads in your examples. Similarly problematic would be e.g. a chain of simple combinatorial ops, all in their own stages. A naive 1:1 mapping of “high-level” to Calyx stages could result in inefficient/imbalanced pipeline implementations, I think.

My second concern is that encoding the dependences based on the order of the stages makes it inconvenient to (re-)construct a cycle-by-cycle scheduling problem later in the flow. For that, we’d ideally still have the original SSA graph and the additional memory and control dependencies in the IR, i.e. basically an old-school CDFG. However, if we added something like this, we could also reconsider scheduling with the actual latencies (in conjunction with the operator library proposal) to form the stages in the first place… Sorry for meandering between ideas here!

Good question, and I do agree these are valid concerns… let me try to address them.

I think at this level, we may be able to do some transformations to avoid these pitfalls while still maintaining the data dependences we want to capture. For example, we could split off memory operations to their own stage, or condense stages containing only combinational operations into one. The goal of any IR should be to enable transformations, and I think the proposed IR can enable such transformations.
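
As a toy illustration of the second transformation (this operates on a simplified stage model rather than the actual IR; Stage and OpKind are made up for the example, and a real pass would also have to rewrite the yielded values), adjacent stages that contain only combinational operations can be folded into one:

#include <vector>

enum class OpKind { Combinational, MemoryAccess };

struct Stage {
  std::vector<OpKind> ops;
  bool isCombinationalOnly() const {
    for (OpKind k : ops)
      if (k != OpKind::Combinational)
        return false;
    return true;
  }
};

// Condense runs of combinational-only stages into a single stage, so a chain
// of cheap ops does not occupy one pipeline stage each.
std::vector<Stage> mergeCombinationalStages(const std::vector<Stage> &stages) {
  std::vector<Stage> result;
  for (const Stage &s : stages) {
    if (!result.empty() && s.isCombinationalOnly() &&
        result.back().isCombinationalOnly()) {
      result.back().ops.insert(result.back().ops.end(), s.ops.begin(),
                               s.ops.end());
    } else {
      result.push_back(s);
    }
  }
  return result;
}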

But this raises the question: how would we decide when to perform these transformations? Will we use heuristics? Should we just solve the full cycle-by-cycle scheduling problem at this point? I think the answer relates to your other question…

I agree it is inconvenient, but I don’t think it’s really too bad. We can still trace the SSA def-use chains; we just need to know that the terminator of a stage implicitly represents a set of pipeline registers, and we will need to trace through each stage’s return values to find where those registers get used. I haven’t prototyped it, but I’m fairly confident the proposed IR is still relatively straightforward to traverse in order to reconstruct the original def-use chains for scheduling. It didn’t show up in my proposal above, but we can also capture memory dependencies as attributes in the IR to avoid recomputing them.
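
To illustrate what that traversal could look like, here is a minimal sketch using core MLIR APIs. The stage op itself is still hypothetical, so the structure assumed here (a single-block stage body whose terminator’s i-th operand becomes the stage op’s i-th result) is the proposal above, not existing code.

#include <cassert>

#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"

// Find the operations that consume the pipeline register implied by operand
// `i` of a stage's terminator. That operand is the value being registered,
// and it reappears as result `i` of the stage op, so the register's readers
// are simply the users of that result.
static llvm::SmallVector<mlir::Operation *>
usersOfStageRegister(mlir::Operation *stage, unsigned i) {
  mlir::Operation &yield = stage->getRegion(0).front().back();
  assert(i < yield.getNumOperands() && "no such pipeline register");
  (void)yield; // only used for the sanity check above
  llvm::SmallVector<mlir::Operation *> users;
  for (mlir::Operation *user : stage->getResult(i).getUsers())
    users.push_back(user);
  return users;
}

Walking all stages and chasing results this way recovers the original def-use chains; the extra step compared to plain SSA is just mapping a stage result back to the terminator operand that feeds it.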

The big question I’m wondering at this point is whether we find value in doing this “coarse grained” scheduling first, forming a “high-level pipeline”, applying some transformations, and then proceeding with lowering using the full cycle-by-cycle scheduling. In other words, a flow like this:

Affine/SCF loops → “Coarse Grained” Scheduling → “High Level Pipeline” Creation and Transformation → Cycle-by-Cycle Scheduling → Calyx program

Or, as you mentioned:

This might look more like:

Affine/SCF loops → CDFG with Standard branches and blocks → Cycle-by-Cycle Scheduling → “High Level Pipeline” Creation and Transformation → Calyx Program

Finally, if we are doing that, do we even need to materialize a “high-level pipeline” in the IR? To rephrase that question as @stephenneuendorffer puts it: what transformations would we want to do at this point before creating the Calyx program? If there aren’t any, perhaps we don’t need it. One thing having an intermediate representation would make explicit is where to put the pipeline registers, which was one of the TODOs on @mortbopet’s draft PR for going to Calyx. Are there other benefits?

Personally, I am starting to lean towards the second approach, i.e. scheduling once, with actual latencies. I do see value in capturing the scheduled operations in the IR as a “high-level pipeline”, as proposed above. This will hopefully simplify any transformations on the pipeline after scheduling, and make the lowering to Calyx straightforward. What do others think?

EDIT: to make the above paragraph more concrete, here’s how I’d propose implementing it:

  • Update the existing StandardToStaticLogic pass to incorporate cycle-by-cycle scheduling information
  • Enhance the staticlogic.pipeline operation to capture the things from the proposal above that it is currently missing (iter_args, condition block, terminators to represent pipeline registers, etc.)
  • Consume the new staticlogic.pipeline operation in the (draft) pass that lowers to Calyx

A small question wrt. this:

We’d obviously want an II of 1 here. However, I’m wondering about the semantics of using staticlogic.return with results from different pipeline stages. Do we intend that this:

  1. Infers a register for %2#0 through the 2nd stage of the pipeline or
  2. stage 1 stalls while stage 2 computes (implies II = 2 => bad)

If not case 1, should it even be allowed to staticlogic.return values from anywhere but the last stage? And if so, does it make sense to have the staticlogic.return operator? If we require the programmer to be explicit about which values are forwarded through stages (i.e. the opposite of case 1), we could require that the scf.yield of the final stage follows the type signature of results and iter_args.

My intent was for each yield to infer a register. For feedforward values, this is how they are passed from one stage to the next.

Regarding the return, I was hoping to capture how the registers get “wired up”. For one, how the registers from the final pipeline stage connect to the operation’s results. Also, how the loop-carried values are passed to iter_args.

I can see how what I’ve proposed here makes this confusing. I guess in my head, I was intending for an II of 1, but the proposal doesn’t capture the need to forward %2#0 through the second stage of the pipeline. Like some of the other examples that forward along values even if they aren’t used, I think I should update this to explicitly yield that value from the second stage:

stages {
  %2:3 = staticlogic.pipeline_stage {
    %i_next = addi %i, %c1 : index
    %1 = memref.load %arg0[%i] : memref<?xi32>
    staticlogic.yield %i_next, %1, %arg_iter
  }
  %4:2 = staticlogic.pipeline_stage {
    %3 = addi %2#1, %2#2 : i32
    staticlogic.yield %2#0, %3
  }
  staticlogic.return results(%4#1), iter_args(%4#0, %4#1)
}

I think this is more explicit, represents II=1, and is generally in line with the other examples. (I would of course need to update all of the other examples’ induction variables, since they follow the same pattern.) Does this make more sense?

I think that makes sense. It would be easy enough to verify, and if we do the above, the extra return operation isn’t really necessary. We just need to define that e.g. the results are yielded first, followed by the iter_args.

Thinking back to this, I’m also wondering if we need to move down to a CDFG using the Standard dialect, or if it would be possible to schedule at the SCF level directly. Looking at how SCF lowers to Standard, the only real difference is whether we want to capture the control flow of the loops/ifs/etc. with branches and blocks, or in the semantics of a higher-level operation. The latter seems preferable to me, since we know exactly what values are induction variables, loop-carried values, etc., and we can create auxiliary dependences in the scheduling problem as needed. @jopperm do you think it could be possible/valuable to schedule at this higher level of abstraction?

Actually, in my mental model, this represents II=2. I remember why I didn’t do this initially. To support iter_args in this pipeline operation, I think we should allow any stage to yield/register the value to pass to the next “iteration”. But we should also enforce that the results are only computed by the final stage. The way I chose to represent that was with the return operation. In addition to terminating the block and accepting the pipeline’s result values, it could point to any intermediate results and pass them to the next “iteration”.

This is what my previous comment was trying to get at with registers being explicit, and return being about “wiring-up”. Any stage’s pipeline register(s) could be “connected” to the iter_args, and the final stage’s pipeline register(s) become the outputs of the pipeline.

I think the original description didn’t go into enough detail regarding the semantics. If I update it as below, would the representation make sense for II=1 in this case?

  • It has a staticlogic.return terminator, which can both return results from the pipeline and yield iter_args.
    • The results section accepts a variadic list of values which become the pipeline’s return values. These must be results of the last stage, and their types must match the pipeline return types.
    • The iter_args section accepts a variadic list of values which become the next iteration’s iter_args. These may be the results of any stage, and their types must match the pipeline iter_args types.

In my mind, the initiation interval is equal to the number of stages it takes to compute the first available iter_arg. In all of these examples, we can compute the for-loop index update in the first stage, so we can achieve II=1.

To handle iter_args that take more stages to compute, like the example here, we may need to forward iter_args through stages to where they are used. This is enforced by “the first stage may access iter_args, but later stages may only access SSA values dominating the pipeline or SSA values computed by the previous stage”.

I had a copy-paste typo in the original proposal, but I updated it and added the above text. To get back to your original question:

I think there is a third option, which is what I was going for in this representation: Explicitly add a register for %arg_iter through the 1st stage of the pipeline. This was obscured by my copy-pasta but has been updated. Does this make sense as an alternative?

@mortbopet and I discussed this offline, and I said I’d ponder this some more and come back to it. Here’s what I’ve found: my previous notion of passing the iter_args through registers still doesn’t achieve II=1 for the Test1 example (at least in my mental model of execution). After some whiteboarding, I’ve revised my stance.

In order for a pipeline to have II=1, a stage using an iter_arg will need to access the previously computed value every “iteration”. This may not be the first stage, as this example shows. The second stage accesses the running sum, and forcing it to access that through a register at the end of the first stage will make the II=2. In short, a stage should be able to read from its own pipeline registers to receive iter_args. I’ve updated the description above to note that any stage may access iter_args.

EDIT: And I’ve updated the Test1 example to show the second stage directly accessing %arg_iter.

In situations where II=1, I think we can also guarantee that every stage accessing an iter_arg is also yielding that arg, but for II>1, this may not hold. For II>1, we may need to use the when clause to push bubbles through.

After talking to Morten, we are getting to the point that we need to just define the operation and start playing with it. Both to resolve the questions about how and when to do scheduling, as well as to better understand what the lowering to Calyx actually wants. With that in mind, I’m going to take this to GitHub now, but please feel free to continue the discussion here. I’ll post the draft PR when it goes up.

Yes, I think it’s possible and sound to schedule at the SCF level, by adding the auxiliary control and memory dependences as you describe. So no need to materialise a CDFG just for constructing the scheduling problem.

:+1: Thanks, looking forward to that! I’m out of town this week, but should be able to contribute more actively next week.

I opened up a draft PR here for review: [StaticLogic] Add new PipelineWhile operation. by mikeurbach · Pull Request #1759 · llvm/circt · GitHub