Comb MuxOp 2:1 vs N:1

Is there any particular reason why comb.mux is a 2:1 mux rather than an N:1 mux? I get that I can build an N:1 mux out of 2:1s, however; it must be easier to reason about (analyze) an N:1 mux than a tree of 2:1 muxes, right?

No particular reason that I’m aware of. Generalizing it makes sense to me. The only thing we’d have to decide is whether a 2-bit selector can have three operands to choose from (similarly, whether a 3 bit selector can have less than 8), or whether it requires 4. I’d lean towards requiring four: if a bit pattern is impossible, then you can feed in x or z.

I’m leaning towards not requiring 2^(selection bit) inputs for two consistency: the ArrayGetOp doesn’t require the array to be a power of two, so we are throwing away a fraction of a selection bit. (I get why that’s radically different.) It also makes the IR slightly more difficult to read. I don’t feel strongly about it though.

How about we start with it as a required power of two (verifier check) and then relax it if it is a problem in practice?

I neglected another reason: on FPGAs one can pack in an N:1 mux into an ALM wherein N is generation dependent (Stratix 10 FPGAs can do 5:1). It’s not particularly difficult to detect that sort of thing, but I am concerned that if we naively output the case statement with the x/zs, that the braindead synthesis tools will not detect and pack the muxes properly. (To the extent where I don’t think we should rely on them to do that.) So we’d have to detect when x or z drives the inputs and just not print them. I’d say that could be an optimization pass, but if there’s not way to represent that in the IR it’d have to be logic in ExportVerilog.

Actually, there shouldn’t be any difference between an ArrayCreateOp followed by an ArrayGetOp and an N:1 MuxOp begging the question, “do we need a mux op?”

How does a 2+ bit mux get verilog printed?

There are any number ways. Two of the most popular are:

module Mux #(
    parameter  int unsigned	DATA_WIDTH      = 32
) (
    input           [4:0][DATA_WIDTH-1:0]    data_in     ,
    input           [4:0]                  select_in   ,
    output logic    [DATA_WIDTH-1:0]                    data_out_mux  
)

Option 1:

    always_comb begin : flat_mux
        unique case ( select_in )
            default  : data_out_mux = data_in[0];
            3'b001   : data_out_mux = data_in[1];
            3'b010   : data_out_mux = data_in[2];
            3'b011   : data_out_mux = data_in[3];
            3'b100   : data_out_mux = data_in[4];
        endcase
    end

Option 2:

data_out_mux = data_in[select_in];

These have slightly different semantics vis-a-vis the default.

But wait, comb.mux is an expression not an array lookup. It is an operation like %x = comb.mux %selector, %a, %b, %c, %d.

I agree we can lower this to always_comb + case (your option #1). That seems like a general way to handle this.

-Chris

ArrayGetOp is an expression as well: %3 = hw.array_get %arr[%selector] : !hw.array<4xi1>. so your example mux op is functionally equivalent to:

%arr = hw.array_create %a, %b, %c, %d : !hw.array<4xi8>
%x = hw.array_get %arr[%selector] : !hw.array<4xi8>

Sure, I agree it can be equivalent in functionality, but that won’t synthesize into the right thing. You want a case as far as I know.

Here is my suggestion, more fleshed out:

  1. generalize to any power of two. Start by requiring power of 2 operands, using “x” for unknowns. We’ll have to handle this anyway, because you may have an x for intermediate values.
  2. Add a helper function that lowers comb.mux with more than one bit into a “case” statement, it can pattern match on x operands in various places to make nice patterns, including trailing ones.
  3. Lower in both PrettyVerilog and in the ExportVerilog routines, using that shared helper function. We need this because prettyify verilog is optional, but will almost always be used. PrettifyVerilog is the places that should sink other expressions into the case.

Does that seem like a reasonable approach?

-Chris

Alternatively, we could have ExportVerilog lower into the array_create+array_get thing, since that is simpler, and avoids having to have the shared helper function.

Actually, I don’t have a preference. array_get actually works better since the port type is an array.

Define “the right thing”. A case block? I have no reason to believe a case block would result in different hardware than an array subscript, modulo the default behavior. In fact, something like #2 is likely preferable since (I assume) indexing past the end of the array is undefined behavior and allows the synthesizer freedom to optimize that condition.

I’m thinking we should either get rid of MuxOp (since it’s technically redundant) or lower it to an array_create+array_get. Then let PrettyVerilog optionally lower the array_create+array_get pattern to a SV trinary expression or a case block. (Both of which I think would have to be added to the SV dialect.) I think this approach is more consistent with the idea of having a minimal set up operations then worry about prettying them up for verilog output later on.

Getting rid of MuxOp also avoids having to decide on the number of inputs. OTOH, a mux is a pretty basic hardware structure which just happens to be what is used (in practice) by verilog synthesizers. If we generalize MuxOps, there’s an argument to get rid of array_get and replace it with an array_explode followed by a mux. I may have fallen down the minimalist IR rabbit hole and we should keep both for practicality.

Since that functionality already exists via array_get, I’ve advised my user to use that. Neither of us thought to use array_get until this discussion today. So generalization of MuxOp is no longer as important (or useful) as it was last week.

But it isn’t just the end of the array, the array formulation is strictly less powerful than a multi-bit mux with x’s. For example, if you have a 4-bit mux, it is entirely possible to have x’s on element 2, 3 and 13. It isn’t just dense 0…12 or something. Synthesis tools handle don’t cares with switch statements, not array indexes.

I’ll expand on my interest in this. I have a long standing axe to grind with casex/z formation. I’ve deferred doing this because mapping from a “pile of combinational logic” into a case statement is too big of a logical jump to fit into the pipeline in a natural place. The combinational logic (usually involving muxes) makes sense to turn into a combinational logic element.

Your multi-bit mux is a very nice solution to this problem. We could pattern match various idioms into a multibit mux, then lower them to a casex at verilog emission time. This allows synthesis tools to generate efficient logic based on various x’s in the patterns (which, again, don’t have to be at the end) while allowing SSA-based optimizations at the comb level to be powerful.

It seems like a nice sweet spot.

I’m thinking we should either get rid of MuxOp (since it’s technically redundant)

What single op is it redundant with? Mux is a pretty fundamental operation.

-Chris

You mean on the inputs or the select? I assume the select since that’s the interesting bit (or x as the case may be).

I think this is a bit much for a multiplexer. In terms of actual hardware, there’s no such thing as don’t cares, cases, pattern matching, etc. – there are only “simple” muxes like the one array_get supports. casex systemverilog blocks allow synthesizers to optimize and generate hardware-level muxes. I agree that the sort of pattern matching op you’re proposing would be a powerful op to have modeled. We could lower it to the verilog dialect’s casez or do the optimization and lowering to simple muxes ourselves. Either way, I’m not comfortable with adding pattern matching to an operation called MuxOp in the hw dialect.

No single op – an array_create + array_get. I’m pretty sure the mux op as-is is equivalent. I tend to agree that we should keep mux as you’re correct that it is pretty fundamental.

I looked up how we (Intel microprocessor design) write verilog to produce good mux implementations for Synopsys Design Compiler. The study done suggests that it is perfectly fine to use an array lookup instead of a “unique casez” or two other more complex looping constructs.

Fantastic, thanks @stevenmburns!

We discussed this today in the design meeting, here are notes / video from the discussion.

I added some rationale to the docs describing our design choice.

I also filed an issue for improving the verilog emission in the complicated case.

Thanks!

-Chris

Another public confirmation of this is the RTL used in University of Washington’s BaseJump_STL (which you should look at if you haven’t.)
Here is a link to the mux RTL:
https://github.com/bespoke-silicon-group/basejump_stl/blob/master/bsg_misc/bsg_mux.v