Should `rtl.add i1, i1` return i1 or i2?

Hardware has the capability to do non-modulo arithmetic – to return however many bits are necessary to contain the full result. This is very different from most software, which uses fixed-width machine primitives to avoid the overhead of arbitrary-precision arithmetic.

It appears that the RTL dialect’s arithmetic operators are modulo operations. I think they should produce the full-width result, and we can truncate it if needed.

What say y’all?

Hello,
I am new here, so I apologize if my response falls completely out of scope.

I can see several use cases:

  • mixed-precision results: need the full width (i1 + i1 -> i2)
  • standard-precision results without carry: i1 + i1 -> i1 (modulo)
  • standard-precision results with carry: i1 + i1 -> {i1, i1} or i2

Having rtl.add* operations return the full result would be very useful.
I think that returning i2 allows implementing all three cases easily (with a truncation for the second one); see the sketch below.
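
To make that concrete, here is a minimal Python model of the bit-level semantics (the `add_full` name is hypothetical; this is an illustration, not CIRCT code):

```python
# Bit-level model of a hypothetical full-width add: i1 + i1 -> i2.
# Pure-Python illustration, not CIRCT code.

def add_full(a: int, b: int, width: int) -> int:
    """Non-modulo add: `width`-bit operands, (width + 1)-bit result."""
    mask = (1 << (width + 1)) - 1
    return (a + b) & mask

a, b = 1, 1                 # both i1 operands set
full = add_full(a, b, 1)    # case 1: keep the full i2 result -> 0b10
modulo = full & 0b1         # case 2: truncate to i1 -> 0 (modulo add)
carry = (full >> 1) & 0b1   # case 3: the top bit is the carry -> 1
print(full, modulo, carry)  # 2 0 1
```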

When widening the result, a case to consider is signed vs. unsigned semantics:
each operand should be assumed to be sign-extended for signed operations and zero-extended for unsigned operations. This extension choice matters when considering widened results.

i4 + i4 -> i5
unsigned: 0b(0)1111 + 0b(0)0001 -> 0b10000 (15 + 1 = 16)
signed: 0b(1)1111 + 0b(0)0001 -> 0b00000 (-1 + 1 = 0, with a virtual carry out of the top bit)

Another solution would be to force explicit widening with constructs like “zext … to” and “sext … to”, keeping rtl.add with uniform I/O widths.
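
For what it’s worth, here is a quick Python check of the i4 example above – it is also effectively what the explicit zext/sext style would spell out in the IR (illustrative model only):

```python
# Checking the i4 + i4 -> i5 example: the extension choice changes the result.

def zext(x: int, frm: int, to: int) -> int:
    """Zero-extend an frm-bit value to `to` bits (high bits stay zero)."""
    return x & ((1 << frm) - 1)

def sext(x: int, frm: int, to: int) -> int:
    """Sign-extend an frm-bit value to `to` bits."""
    if (x >> (frm - 1)) & 1:
        return x | (((1 << (to - frm)) - 1) << frm)
    return x & ((1 << frm) - 1)

mask5 = (1 << 5) - 1
# unsigned: zext both operands, add at i5 -> prints 0b10000 (15 + 1 = 16)
print(bin((zext(0b1111, 4, 5) + zext(0b0001, 4, 5)) & mask5))
# signed: sext both operands, add at i5 -> prints 0b0, i.e. all five result
# bits are zero (-1 + 1 = 0; the carry out of bit 4 is dropped)
print(bin((sext(0b1111, 4, 5) + sext(0b0001, 4, 5)) & mask5))
```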

Regards,

Talking to process a bit:

This is what the FIRRTL dialect does for add, but it also allows implicit operand extensions. It does the same for its other operations too, which isn’t always very predictable.

The problem with the FIRRTL design is that there is *too much* flexibility in the cross product between operand and result types, and this makes writing peephole optimizations and other transformations more complicated. Instead of handling sign and zero extension once in the logic for a zext/sext node, you have to think about it everywhere. This has empirically introduced bugs that only got caught by fuzzing. Because of that, we’re currently restricting inputs to be the same type. This forces the operands to be manually sign/zero extended, which helps define away a lot of bugs and also makes analyses and transformations easier to write.

As for result types, I can see the argument both ways: these operations only have a single result type, and if we keep the restriction that the operands are all homogeneous, then the zext/sext issue remains fixed. The way the RTL dialect currently handles widening additions and multiplications is like LLVM (which also has arbitrary bitwidths): you implement a widening 64x64=128 multiplication by expressing it as a 128x128=128 multiplication. The code generator then matches the extensions + mul back into the widening multiplication.
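
A small Python model of that pattern, just to make it concrete (the real thing is IR, of course; this only checks the arithmetic):

```python
# LLVM-style widening multiply: zext i64 -> i128 on both operands, then a
# 128x128=128 multiply. The full 64x64 product fits in 128 bits, so the
# "modulo 2^128" multiply never actually wraps.
import random

MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def mul_widening_as_full(a64: int, b64: int) -> int:
    a128 = a64 & MASK64             # zext i64 -> i128 (unsigned case)
    b128 = b64 & MASK64             # zext i64 -> i128
    return (a128 * b128) & MASK128  # mul at i128 – no information lost

a, b = random.getrandbits(64), random.getrandbits(64)
assert mul_widening_as_full(a, b) == a * b
```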

Looking at this fresh, and assuming we don’t change the operands to allow implicit extensions, I can see it both ways:

  1. On the one hand, the worst of the “bad thing” is avoided in terms of bugs and annoyance working with the IR. Reassociation would still be a pain, but it is addressable. I agree with you that this would make the cost model for hardware easier to reason about in some cases.

  2. On the other hand, I don’t see the problem with the LLVM-style approach: general hardware passes will be doing width minimization anyway, which will “pull truncations up” and “push extensions down” the IR to reduce widths (see the sketch below). Keeping symmetric types also works better with the MLIR infra in the absence of computed result types for ops.
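
For instance, “pulling a truncation up” through an add is sound because a modulo add’s low n bits depend only on the low n bits of its operands – a quick Python check of that identity (illustrative, not an actual pass):

```python
# Width-minimization rewrite: trunc_n(a + b) == trunc_n(trunc_n(a) + trunc_n(b)),
# so an add that only feeds a truncation can itself be narrowed to n bits.
import random

def trunc(x: int, n: int) -> int:
    return x & ((1 << n) - 1)

a, b, n = random.getrandbits(64), random.getrandbits(64), 16
assert trunc(a + b, n) == trunc(trunc(a, n) + trunc(b, n), n)
```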

What do you see as the major benefit here? Is it just “this is what HW and Verilog do”, or is there a more fundamental problem with our current approach?

-Chris

That’s part of it – Verilog actually has a reasonable approach here (IMO), and this feels rather intuitive for hardware. My other concern is that if we end up emitting a 128x128=128 operation to Verilog, synthesizers won’t see the optimization opportunity and will end up generating a bunch of unneeded gates which won’t be optimized away for one reason or another. I make a habit of never overestimating synthesizers, or EDA tools in general.

If, however, the “general hardware passes” you’re referring to are CIRCT code, I’m OK with the LLVM style, but we still have a similar issue: if you want to emit a 64x64=128 operation in Verilog, you have to either (1) implement the “general hardware passes” in the Verilog code generator (an ugly solution IMO), (2) have an optimization pass which creates an operation representing a 64x64=128 multiply that can be used by multiple code generators, or (3) implement and optimize the multiplier itself at the bit level, which I don’t think is a good idea in the next few years. Unless I’m missing something…?

And I hadn’t thought about the operands, but all your arguments make sense and sound like they have been learned the hard way.

Right, I agree that we should not overestimate downstream EDA tools :-).

That said, your arguments all apply to the generated Verilog – they aren’t arguments about the design of the IR we implement transformations against.

You’re right that this puts additional complexity into the Verilog emitter, which carries the burden of generating reasonable Verilog that the tools will eat. I’m optimistic that I can solve this with a combination of “generally useful” prepasses that clean up the RTL-level representation (e.g. minimizing widths, merging if/#ifdef blocks, etc.) and some targeted heuristics in the Verilog emitter itself. I’d like to explore this before making these sorts of changes.

That said, if that fails, then there is still another path: we can introduce “SV” versions of all of the operators, and lower from rtl.add to sv.add, and do these sorts of things on the sv dialect instead of rtl.

This gets back to my main question at the top which is “what is best for analysis and transformation”?

-Chris

Fair points all.

I can see that, but it does speak to the abstraction level we need to design the IR for. We need to figure out how far down that particular rabbit hole we go and set up the dialects appropriately. Should we have a dialect that describes exactly the (digital-level) hardware we want? This would imply that it is a terminal dialect, and that the SV dialect is purely for ease of codegen and should do its best not to pervert the design – like a human-readable netlist. That’d be a pretty extreme amount of control and distrust, but we may eventually want that level depending on how much we end up trusting synthesizers. We can theoretically dig all the way to Taiwan! (Just not Portland – that wouldn’t be a hole so much as a tunnel.)

How much control we want/need and how soon – it seems to me – will depend partially on how good SystemVerilog synthesis engines end up being, but we should certainly explore this via measurement.

This, however, is nearly completely off topic.

It seems to me that this boils down to an opt-in vs. opt-out policy. If we keep modulo arithmetic, we require designers to “opt in” to wider results by extending the operands appropriately. If we drop modulo, we require designers who don’t want the extra bits to “opt out” via truncation. One argument in favor of opt-out is that designers are more likely to reason about the size of the results they actually need and truncate appropriately instead of wasting area. One argument against is that it requires they do so, or else cascading arithmetic operations could end up exploding the bitwidths (see the sketch below).
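
To put a number on the explosion concern, here is a rough Python sketch (the one-bit-per-add growth rule is my assumption about how chained non-modulo adds would type):

```python
# Under "opt-out" (non-modulo) adds, a chain of adds grows one bit per step
# unless something truncates, even when a tight bound would be much smaller.
import math

n_values, w = 17, 8                   # accumulate 17 i8 values
chained = w + (n_values - 1)          # one extra bit per binary add -> i24
tight = math.ceil(math.log2(n_values * ((1 << w) - 1) + 1))  # i13 suffices
print(chained, tight)                 # 24 13
```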

The reason I’m pushing on this is that the ability to customize bitwidths is an important advantage of accelerators. If we expose this to higher levels as the default entry point to hardware, instead of as an optimization, it’ll encourage the upstream languages and dialects to reason about bitwidths – something that isn’t as important in the software world. If we just accept modulo arithmetic as the default, we cede that advantage by trying to look like software.

That said, I realize that adapting modulo arithmetic to non-modulo via truncation operators is a PITA (from personal experience having done it) and would add friction in lowering to RTL (and possibly prevent us from sharing optimization code with modulo arithmetic dialects).

As described in the rationale doc, the SV dialect is really about optimizing for humans, not optimization and analysis, so I’d set that aside.

The purpose of the RTL dialect is to make it easy to transform; this is why regularity and predictability are valuable. One aspect of this (in support of your point) is that you want a simple cost model so you can reason about “what is better and worse” in the model, and things that diverge from how they are actually implemented undermine that. However, every model is an approximation of the truth; you have to decide where to draw the line and how to balance. This is the art/design aspect that is best informed by experience.

I would rather we move deliberately and only add complexity when we are convinced it is warranted, and can’t be solved in other ways.

This has nothing to do with designers. Designers are not writing the RTL dialect; it is an intermediate representation used for analysis and transformation. Frontends should have their own domain abstractions (e.g. FIRRTL is just one example; there can and should be others), just like imperative languages have their own high-level ASTs/IRs and lower to LLVM IR.

I also care a lot about accelerators. However, you’re confusing the use case with the purpose of the IR.

Unrelated technical question about the concept. Given that rtl.add is variadic (not binary), what is the result width of the operation?

-Chris

Sorry, I meant IR designers. (Maybe IR developers is a better term?) But I’m assuming/hoping that what we expose in the IR will bubble up and affect future language design. Even if it doesn’t, it could influence lowerings to do some sort of bitwidth analysis (if possible) to constrain the bitwidths to what they need. I understand that this is a very optimistic hope, maybe to the point of being unreasonable.

Absolutely. Given that you have far more experience in IR development, I’ll defer to you on this question. To be clear, I’m not strongly in favor either way – I recognize there are trade-offs, but I tend to take hyperbolic positions to flush out the reasoning behind a decision.

Since we agree that all the operands should be the same width (N), I’m pretty sure that the output width would be N + <numOperands> - 1, right? Just like cascading binary adds? Or is there some subtlety I’m missing?

Actually, I’m pretty sure one can save bits by allowing varying operand widths and computing the output width based on those varying widths – e.g. i32 + i2 + i2 could produce i33 rather than i34.
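
A quick Python sketch of the two width rules (these are assumptions about how such an op might be typed, not anything the RTL dialect defines today):

```python
# Result-width bounds for a hypothetical variadic non-modulo unsigned add.

def width_homogeneous(n_bits: int, n_operands: int) -> int:
    # Cascading-binary-adds rule from above: one extra bit per add.
    return n_bits + n_operands - 1

def width_tight(operand_widths) -> int:
    # Tight bound: enough bits to hold the sum of each operand's max value.
    max_sum = sum((1 << w) - 1 for w in operand_widths)
    return max_sum.bit_length()

print(width_homogeneous(32, 3))   # 34 for three same-width i32 operands
print(width_tight([32, 2, 2]))    # 33: i32 + i2 + i2 fits in i33
print(width_tight([32, 32, 32]))  # 34: tight and cascading agree here
```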

Sure. Just to explain where I’m coming from: I expect the RTL dialect to host a wide range of interesting analyses and transformations over time, so optimizing it to be easy to analyze and transform is the priority. Making it easy to generate is less important; frontends are going to have to do all sorts of lowering in any case.

Yeah, this was my point. I’m mostly saying that this opens a bit of a Pandora’s box – precise specification is the ideal thing if we’re trying to achieve our goal of matching hardware closely, but that gets complicated quickly.

Particularly given the lack of implicit extension on the inputs, it isn’t clear to me that the result type will be a tight bound anyway.

-Chris