Representing device data

Hey all-

This is – for now – a purely academic question: does it make sense to represent device data in MLIR, and if so, how would one do it? I’m defining ‘device data’ as information about the physical structure of a device: LUTs, memories, DSPs, PCIe MACs, DRAM controllers, their location(s) on the chip, routing resources, timing information, etc. – the stuff contained in the vendors’ device databases.

We could use this information to floorplan. For instance with ESI: to communicate from the PCIe interface to the HBM controller on the opposite side of the chip, you’ll probably need a pretty deep pipeline. Knowing the physical locations is the first step to estimating the number of stages necessary.

I don’t think it makes sense to encode this information in MLIR IR, but I may just not be creative enough. I also don’t see any specific advantages to doing this, though I’m not familiar enough with the graph algorithms present in MLIR to say. Maybe there’s some advantage to having a way to encode physical locations in MLIR IR for placement optimizations, but encoding the device data itself…?

Personally, I think this makes a lot of sense, and we’ve been doing exactly this for some of our next-gen devices with hundreds of vector/VLIW cores. MLIR provides a nice structure for mixing the programming of processors, DMA engines, other IP cores, and programmable logic. At some level the ‘program’ ends up describing aspects of the device (like how many tiles there are) and aspects of the program (like how those tiles are programmed). I don’t know if it’s the easiest way to handle your particular example (using floorplanning information to pipeline long paths), but I think it’s certainly possible. You could, for instance, start with a design without floorplanning information and build up a hierarchical structure to represent different regions on a device. One thing you run into quickly is that values don’t escape regions, so accessing components in those regions becomes tricky without explicitly pushing all the signals through the boundary of each region, which is cumbersome.
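
A rough sketch of what I mean (every dialect and op name below is invented purely for illustration), showing how a value defined inside one tile’s region can’t be referenced from a sibling tile without being threaded through the boundary explicitly:

```mlir
// Hypothetical "device"/"tile" dialects, invented for illustration only.
device.array @fabric {
  device.tile @tile_0_0 {
    // The program for this tile; %acc is defined inside this region...
    %acc = "tile.core"() { program = "mac_kernel" } : () -> i32
    // ...so for a neighboring tile to see it, it has to be pushed out
    // through the region boundary as an explicit result/port.
    device.yield %acc : i32
  }
  device.tile @tile_0_1 {
    // %acc from @tile_0_0 cannot be named here directly; it would have to
    // arrive as a block argument / port wired up by the parent op.
  }
}
```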

Yep. I suspect MLIR could be used to model anything hierarchical or linear in nature (since the basic data structure is a hierarchy with lists at the leaves). That’s just not true of physical spaces, so you’d be fighting the basic data structure – you mention one of the symptoms.

There are actually two separate but related issues here, as I see it:

  • Hierarchy: it’s not clear what property (location, type of widget, routing of some sort) one should build the hierarchy upon. This will affect the “value escape” issue you mention. It’s almost certainly application-dependent.
  • Multidimensionality: forcing a 2D structure into a hierarchy makes it more difficult to reason about locality – two physically adjacent structures may be very distant in terms of hierarchy edges. (This is actually a problem for any tree-based data structure, leading to the – now outdated – notion of data structure threads: links between the leaves of the tree.)

It does absolutely make sense to me to link your design hierarchy into some sort of more appropriate device data structure at map/place/route time. It intuitively makes a ton of sense to me to store routing information in the design hierarchy edges, though I’m not sure how much that would buy you in terms of timing estimation.

You doubtless have much more experience with device databases than I do, so I’m speaking more from intuition than experience.

I’m starting an engagement with an internal customer (at Microsoft) who is focused very much on performance. In order to meet their performance requirements, the tooling (ESI and/or some other dialect) needs some relatively detailed information about the physical structure of the FPGA they are targeting. Essentially, they know more-or-less where they want to place and (more importantly) route their widgets but are having trouble getting Quartus to do what they want because, frankly, Quartus isn’t very good at its job. So they end up having to tell Quartus exactly what to do. Given the size of their design (the number of widgets they want to place) and the irregularity of FPGA fabrics, this approach obviously doesn’t scale, so they’re looking for tooling assistance. This problem is more tractable than the usual PnR problem since their design is very regular – Quartus just can’t see the regularity because the design is a flattened netlist by the time it gets to the placement and routing steps. Plus, this is a case where you really want to choose where to put the pipeline registers after you do placement. (As I’ve argued for several years, the basic design flow – RTL → mapped netlist → placement → routing – is fundamentally broken these days.)

So I’m starting to think seriously about this topic. @stephenneuendorffer Can you share any more details about Xilinx’s approach? I’m looking to store location data about the M20Ks, DSPs, I/O columns, and some routing information (mostly about fast vs slow wires so we can conserve the fast wires).

My initial (totally uninformed) instinct is that MLIR would be difficult and unnatural to use to store the device data, but that one could attach placement and routing information to MLIR ops via attributes. I think a placement/routing/device dialect could be useful for defining custom attributes and modeling device-specific hardware (e.g. DSPs) as ops. The logic could then be lowered to RTL in the usual way (being very careful about names) and a separate TCL script could be generated instructing Quartus where to place/route the entities in the RTL (as identified by the careful names).
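
To make that concrete, here’s a rough sketch of the kind of thing I’m imagining. Every dialect, op, and attribute name below is made up, and the location string and instance path are purely illustrative; only set_location_assignment is a real Quartus TCL command:

```mlir
// Hypothetical "physical" dialect op modeling a Stratix 10 DSP slice, carrying
// a placement attribute (all names invented for illustration).
%sum = physical.dsp_mac %a, %b, %acc
         { loc = #physical.loc<"DSP_X34_Y61_N0"> }
         : (i27, i27, i64) -> i64

// After lowering to RTL with stable names, a separate emitter would walk the
// attributes and generate Quartus TCL along these lines (the instance path
// here is made up):
//   set_location_assignment DSP_X34_Y61_N0 -to "esi_top|mac_inst_17"
```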

Does this approach seem reasonable? Thoughts?

I fear that any advice I have will be woefully incomplete, but I’ll try.

In our case we are focusing on device information, but not on the programmable-logic part. I think if you understand the regularity of the design, then you can probably generate blocks of the design and do a coarse-grained placement of them. The thing I’m not sure about with the Quartus tools is how they represent placement constraints. In the Xilinx tools, at a coarse level, you can floorplan and have regions of logic assigned to a particular block of the device. I could see doing an automatic assignment like this and then letting the tools figure things out from there. Alternatively, you can preplace specific instances at particular locations and then let the tools run from there. Usually anchoring DSPs and BRAMs gives most of the benefit, and the place-and-route tools can work effectively from there. In this case, I agree that it probably makes sense to lower from some high-level dialect down to the sv dialect with annotations, and then to generate the placement constraints based on the annotations.

I assume you’ve seen Martin Langhammer’s papers on their super high-density place and route? I guess that’s not working in your case. There’s also a paper at the FPGA conference in a few weeks about AutoBridge, which generates large-device placement for Xilinx HLS designs (but I suspect that is at a coarser granularity than what you are doing).

I have not. I assume you’re referring to “High Density 8-Bit Multiplier Systolic Arrays for FPGA” and “High Density and Performance Multiplication for FPGA” specifically? Which other ones (if any) do you think are particularly relevant? He’s had a good many over the years…

I look forward to this paper. I suspect there are big opportunities for scheduling post-placement or floorplanning/constraining based on scheduling.

https://dl.acm.org/doi/10.1145/3289602.3293927


I think there are two sides of this problem:

  1. representing device specific abstractions, e.g. standard cells.
  2. fixing the “flat netlist is input to P&R” problem.

For #1, I think the right answer is to build device specific dialects that integrate with the RTL level of abstraction. The ICE40 technology library is basically an MLIR dialect specification in PDF form. :slight_smile:
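
For instance (purely hypothetical, since no such dialect exists today, though SB_LUT4, SB_DFF, and the 16-bit LUT_INIT parameter are real iCE40 primitives), a device dialect could give each primitive in the library its own op:

```mlir
// Hypothetical ice40 dialect: one op per primitive in the technology library.
// 0x6996 is the truth table of a 4-input XOR.
%o = ice40.sb_lut4 %i0, %i1, %i2, %i3 { lut_init = 0x6996 : i16 } : i1
%q = ice40.sb_dff %clk, %o : i1
```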

Building out support for such things would be very useful and interesting I think.

For #2, I’m not sure what can be done here without pushing into novel P&R and other tools. I’m personally interested in this problem over time, but not until we get the other more basic issues sorted out.

-Chris

Nice! I’ll probably need something like that in the next month or two since we’re manually instantiating some device primitives. Question: should we have one dialect for all primitives or a set of dialects per device and/or vendor? If we have one dialect, the number of ops could be huge. I don’t know if that’s a problem for MLIR or not. If it is, a reasonable compromise is vendor-specific dialects. I’ll just need a handful of Stratix10 primitives I think.

For the primitive location data (which I don’t think is applicable to ASICs), I don’t think MLIR is appropriate; another data structure is required – one that efficiently supports k-nearest-neighbor and other spatial queries.

Same here – I’d like to completely replace all EDA tools. I think that’s a few years off at best though.

One thing we can do now (and I’ll likely need to for this new engagement) is what I was talking about above: the ability to place a specific RTL entity into a specific primitive instance on an FPGA. In Quartus, this is accomplished with a TCL (ugh) command and is frequently done by hand for high-performance designs. This is a PITA, so even a basic tool to assist in this would be very helpful.

On top of this, we can start playing around with placement algorithms and post-“critical”-placement pipelining to help out the tools. By “critical”, I mean placing DSPs and RAMs since once you pin them you’re typically in a much better state WRT timing stability. Quartus doesn’t do a half-bad job if you tell it exactly where to put most of the design :laughing:. Again, this is something I’ll likely need to do for this new engagement given their very, very high performance targets.

I’d recommend vendor-specific and perhaps even device-specific dialects, e.g. an ice40 dialect. This is equivalent to the machine instructions in the X86/ARM/RISCV backends, etc. They are all different dialects in the MLIR sense. MLIR’s ability to mix and match dialects makes this really powerful.

Yeah, this is nuts. The right way to do this is the equivalent of “builtins” in C. In C you can write target-independent code, but then if you want to access some crazy target-specific instruction (e.g. sum of absolute differences) you use a __builtin_foo to do it. This turns into a target-specific intrinsic in LLVM and gets instruction-selected to something predictable. You handle portability in your C code by using #ifdefs or whatever.

We need the same ability in things like Chisel, which then get lowered to the RTL dialect + other stuff mixed in, then encoded into Verilog in a way the tools will interpret correctly (even synthesizing TCL if necessary, shudder). This gives us a single source of truth for the design.
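
In IR terms I picture something like the following (every name here is invented; it’s just to illustrate the layering): portable code uses a generic op, and only the spots that need a guaranteed primitive drop down to the device-specific one, exactly like __builtin_foo:

```mlir
// Portable form: a generic simple-dual-port memory in some target-independent
// dialect (hypothetical names throughout).
%rd_generic = memlib.simple_dual_port %wclk, %waddr, %wdata, %rclk, %raddr
                { depth = 512 : i32 } : i40

// "Builtin" form: the same storage requested as an explicit device primitive,
// which lowers to a known M20K instantiation on Stratix 10 and is rejected
// (or emulated) on other targets.
%rd_m20k = stratix10.m20k %wclk, %waddr, %wdata, %rclk, %raddr
             { depth = 512 : i32 } : i40
```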

-Chris

There are really at least 3 different usage models here (at least in the FPGA world):

  1. Something analogous to a builtin, like Chris is thinking about. In this case, there is a small set of primitive elements that are intended to be replicated many times. The elements are given relative placement constraints but can otherwise be placed anywhere in the device. Usually this is done to ensure particular packing or to leverage specific routing resources between the relatively placed elements.
  2. Adding explicit placement to a design, in order to improve PAR results. In this case you have a netlist, but are trying to ‘help’ the placer: typically a few primitives are locked in critical locations on the device and place and route fills in the rest of the design. This is typically used for I/Os and sometimes DSPs or BRAMs in particularly congested designs. Usually these constraints are not reusable in another design.
  3. Adding floorplanning to a design. In this case, parts of the design are assigned to particular portions of the device. Placement has freedom to place primitive elements anywhere in the region. This is again very device- and design-specific, but is generally more scalable. (A rough sketch of how each of these might look as IR attributes follows this list.)
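
Roughly, as IR attributes (the op and attribute names are hypothetical and the site names are just examples), the three models might look like this:

```mlir
// 1. Relative placement inside a replicated element (RLOC-style):
rtl.instance "mac0" @MacCell { place.rloc = "X0Y0" }
rtl.instance "mac1" @MacCell { place.rloc = "X0Y1" }

// 2. Explicit placement of a single critical primitive (LOC-style):
rtl.instance "buf0" @FifoRam { place.loc = "RAMB36_X3Y17" }

// 3. Floorplanning: assign a whole subtree of the design to a device region:
rtl.instance "pe_col0" @PEColumn { place.region = "SLICE_X0Y0:SLICE_X55Y119" }
```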

I suspect that John is building systolic arrays of some sort, which can be addressed at some level in any of these ways. If the elements of the systolic array are small, then I could see explicitly placing them (#2). If they are bigger (maybe a dozen primitives, or including BRAM or DSP elements), I could see using relative placement constraints (#1) on a few critical primitives in each element of the systolic array. If the elements are even bigger, then I would probably move to floorplanning them (#3), maybe in small groups.

Generally there’s been a move away from having IP with relative placement constraints, for a few reasons: 1) it’s not very portable, and 2) devices aren’t actually that regular, so relative placement constraints can greatly limit placement options.

In the ASIC world it’s also common to have hard macros with fixed placement and routing. This is unusual in the FPGA world.

I don’t know what the state of affairs is in the ASIC or Xilinx worlds, but in Quartus country, TCL and/or device primitives are necessary for a number of common things. Memory inference – even with RTL annotations – often doesn’t result in what you want, in part because Quartus has to respect the SystemVerilog it’s given (rather than the intent), and it’s really easy to write SystemVerilog with all sorts of unintended side effects that hurt efficient synthesis. So we never rely on memory inference and always instantiate the M20Ks manually. CDC specifications are another common (infamously easy to screw up) component which are hard (impossible?) to specify completely in RTL, so we often have to write incredibly complex TCL scripts to find all the CDCs based on a set of heuristics and apply the appropriate constraints. Our experience with SV annotations isn’t great. Yeah, it’s that crazy.

I’ll have to speak in generalities since I’m under deep NDA, but there are various parts of any high-performance design which – depending on the granularity Stephen mentions – would benefit from one or more of these use cases. Also – as a colleague of mine is fond of saying – there are no slow parts of an FPGA design. Even circuitry which is running at a much slower clock speed and is not performance-critical affects the high-speed, performance-critical parts, since they are competing for resources. So optimizing the “slow” part of the design can have a significant effect on the high-speed part!

This is a great argument for automating this using higher-level knowledge of the design and the specific device. You could imagine some sort of iterative approach like what Plunify’s InTime does, but driving placement.

Yeah, this is because you can’t arbitrarily move things around due to the fabric’s irregularity.