[npcomp] CIRCT integration

I am curious if anyone has been eyeing an integration between NPCOMP and CIRCT. In last week’s presentation, the part about “target interfaces” made me think of CIRCT. I would love to be able to use CIRCT as a backend to NPCOMP and do high-level synthesis on numerical Python programs.

Is anyone else interested in this? I know the presentation mentioned IREE as a target backend, but whether we are talking about IREE, CIRCT, or something else, it seems like the bottom line is we need motivating examples to drive the definition of “target interfaces”. I’m happy to help out with this work.

Question from a CIRCT noob: What kinds of APIs exist on the CIRCT side that you would like to plug together with npcomp?

Right now CIRCT has an mlir-opt-like tool that collects all the dialects and passes, and you can use it to lower through the dialects towards a hardware description in SystemVerilog. There are also a couple of simulator tools that can simulate execution of IR in a couple of the dialects.

At the highest level, CIRCT supports the Affine and Standard dialects. In terms of APIs, I guess I’m imagining an interface where CIRCT can receive IR in the supported dialects and then delegate to other CIRCT components to invoke specific passes, simulators, etc.
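To make that concrete, here’s a minimal sketch of the kind of Affine/Standard IR I mean (illustrative only; op spellings like `constant`/`mulf` have changed across MLIR versions): a dot product written as an affine loop.

```mlir
// Illustrative only: a dot product at the Affine/Standard level,
// roughly the abstraction CIRCT can accept today.
func @dot(%a: memref<16xf32>, %b: memref<16xf32>) -> f32 {
  %zero = constant 0.0 : f32
  // One affine loop carrying the running sum as an iter_arg.
  %sum = affine.for %i = 0 to 16 iter_args(%acc = %zero) -> (f32) {
    %x = affine.load %a[%i] : memref<16xf32>
    %y = affine.load %b[%i] : memref<16xf32>
    %prod = mulf %x, %y : f32
    %next = addf %acc, %prod : f32
    affine.yield %next : f32
  }
  return %sum : f32
}
```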

Thanks. Affine and Standard are pretty low level compared to what we envision as a “target backend” for npcomp, which will accept high-level tensor operations. We can lower through linalg to get there though (the target interface may end up being mostly linalg named ops, in which case lowering to affine/std is mostly trivial).
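For a rough picture of what that hand-off could look like (a sketch with hypothetical sizes, not a settled interface), the target interface might be a named linalg op on buffers, which mlir-opt’s existing `-convert-linalg-to-affine-loops` pass can lower into affine loops like the example above:

```mlir
// Hypothetical hand-off IR: one named linalg op on memrefs.
// -convert-linalg-to-affine-loops rewrites this into nested affine.for loops.
func @matmul(%A: memref<64x64xf32>, %B: memref<64x64xf32>,
             %C: memref<64x64xf32>) {
  linalg.matmul ins(%A, %B : memref<64x64xf32>, memref<64x64xf32>)
               outs(%C : memref<64x64xf32>)
  return
}
```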

Can you explain more about what CIRCT would lower to? From what I saw in the LLVM dev meeting CIRCT poster, it seems like the end result will be Verilog or some other HDL. Suppose the input to npcomp is an ML model that does a sequence of convolution ops of various sizes. What would you like the final output from the compiler to be?

I’ve actually been eyeing TCP as a good integration point. As you say, we can lower through linalg to get towards affine/std. That could happen as some pre-processing on the CIRCT side if the target interface is more at the high-level tensor layer of abstraction. Starting at the TCP level might be interesting for new CIRCT passes down the road.

I’m imagining CIRCT will lower to HDL as output (currently we’re emitting SystemVerilog). In your example, I’m imagining we would emit a circuit whose top-level module has the same interface as the ML model. If we know the convolution sizes at compile time, we can bake that knowledge into the circuit. If there is dynamism, we would need to construct a more generic circuit to accommodate it. I’m personally interested in whole-program compilation, so if there are high-level language constructs to deal with beyond the ML kernels, I’d like to figure out how to map those into hardware.
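To sketch what “same interface as the model” might mean in IR terms (module and port names here are hypothetical, and CIRCT’s RTL-level dialects are still in flux), a fully static model could come out as a single hardware module whose ports mirror the model’s inputs and outputs:

```mlir
// Hypothetical output sketch in CIRCT's RTL-level dialect.
// Ports mirror the model: a clock, a streaming input, and a result.
hw.module @conv_model(%clk: i1, %pixel_in: i8) -> (pixel_out: i8) {
  // ... convolution datapath, unrolled/pipelined for the sizes
  //     known at compile time, would go here ...
  hw.output %pixel_in : i8  // placeholder passthrough
}
```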

I’m sorry if this is hand-wavey… it’s still pretty hypothetical and I don’t want to speak for the rest of the CIRCT folks. Is this at least starting to answer your question?

Thanks. Yes, that is starting to clarify it.

One problem I foresee (even in the fully static case) is that if you are lowering each conv operation to its own HDL module, then you will have a very area-inefficient realization of the model. For example, distinct convolution ops should probably share the same arithmetic units in time (or at least, there should be some calculated time-space tradeoff there; not just all spatial). Additionally, for any model with sizeable parameter buffers, there is an issue of where those are stored and how they are fetched which has to be factored in.
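To put a rough, hypothetical number on the area point: fully unrolling even one 3×3 convolution layer with 8 input channels and 16 output channels needs

$$3 \times 3 \times 8 \times 16 = 1152$$

multipliers per output pixel, before any other layer gets its own silicon. Time-multiplexing a shared MAC array across layers trades that area back for latency and throughput.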

Overall, I think generating specialized HDL for a particular model is of fairly limited applicability. There are probably some ultra-low-power tiny ML applications where it might make sense (for example, a learned rate control circuit in a video encoder IP), but I don’t think it makes much sense for most ML application scenarios. Am I missing something?

You are definitely bringing up valid points. With HLS, there are certainly trade-offs between area, power, latency, throughput, etc. that need to be considered. I think MLIR provides a really good framework for exploring this optimization space. Regarding parameter buffers, I know there has been work in the FPGA space on storing parameters in on-chip memories, which can provide high throughput.

I guess the application scenarios I’m imagining are related to inference workloads, either on FPGAs or with custom ASICs. There are many other directions this could go, but I think there is a real need here that we can start chipping away at. The vendors are touting FPGAs as a way to accelerate ML workloads, but we don’t have good, standardized open source tooling for putting our models onto such machines. CIRCT is trying to develop such tools, which is why I thought it might make an interesting backend to NPCOMP.

Here’s another example that is more concrete. Steve sent me this project from Xilinx a while back: End-to-End Flow — FINN documentation. Its end-to-end flow diagram shows an example “starting from a trained PyTorch/Brevitas network and going all the way to a running FPGA accelerator”. This is the kind of flow I’m hoping to enable by integrating CIRCT as a backend for NPCOMP. Looking at the FINN diagram, I’m imagining the top part of the flow would be NPCOMP, the target interface would sit just before the “Convert to HLS Layers” step, and the bottom half of the flow would be CIRCT.

Again, CIRCT is still really early on, and I don’t want to speak for the other parties, but this is what I’d personally like to see.

Ah yes, I had forgotten about the FPGA use case. That’s a good example where going to RTL makes sense, since one can pipeline spatially to handle a stream of requests without leaving silicon idle.

In Brainwave, they basically built a specialized soft core that would run the workloads, probably in large part because they didn’t have tools like CIRCT to nicely program the FPGAs :wink: