LLVM Discussion Forums

MLIR Quantization Roadmap?

Hi,

I’m interested in the quantization infrastructure in MLIR, but I’m not sure about the current direction of development. I found three modules related to quantization: QuantOps, Quantizer, and FxpMathOps. I started from one of the TODOs in the Quantizer (https://reviews.llvm.org/D73556) and I’m ready to do my bit in this direction.

Here are some points I found; I’d be very grateful if you could tell me more about the future quantization infrastructure:

  • As I understand it, the Quantizer doesn’t support per-axis quantization yet. Per-axis quantization exists only in QuantOps, and the TFLite MLIR module uses it.
  • FxpMathOps is only a testing dialect and won’t be used in the future (taken from https://groups.google.com/forum/#!topic/iree-discuss/Y8jI4QPnVFY).
  • The Quantizer uses QuantizedType from QuantOps, but implements its own quantization algorithm.

@stellaraccident who will know best

On my end, I plan to experiment with a simple example of codegen for a quantized kernel before the end of March. The idea would be to have Linalg plus quantized ops and types in the region and see how far I can push this all the way down to an executable. This is very preliminary and experimental thinking, but if you are interested in the codegen aspects, it may be relevant.

I implemented the TFLite MLIR quantization module and I can speak from my point of view:

  • QuantOps provides a very good abstraction for modeling quantization information in MLIR, so any improvement to it (like per-axis or non-uniform quantization) is welcome (a rough sketch of this modeling follows the list).
  • The Quantizer and whatever is implemented in the MLIR TFLite converter are just two different ways to resolve the constraints for both accuracy and target capability. The two approaches may or may not converge. You can add new ones as MLIR passes, so different users can have different options.
  • At a high level, an end-to-end quantization workflow should include three stages: 1. specifying high-level, semantics-related constraints for accuracy (certain tensors have known ranges or bit-widths, from QAT or post-training); 2. quantizing according to low-level, target-related constraints; 3. model code generation after both sets of constraints are resolved.
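
To make the modeling concrete, here is a minimal sketch (the function name and all numeric values are illustrative, not taken from any real model): `quant.stats` carries stage-1 range information on a value, and once parameters are resolved, a QuantizedType on the tensor element carries the storage type, expressed type, scale, and zero point.

```mlir
func @observe(%arg0: tensor<4xf32>) -> tensor<4xf32> {
  // Stage 1: attach observed min/max statistics to a value.
  %0 = "quant.stats"(%arg0) {
    layerStats = dense<[-1.0, 1.0]> : tensor<2xf32>
  } : (tensor<4xf32>) -> tensor<4xf32>
  return %0 : tensor<4xf32>
}

// Stage 2 result: a resolved per-layer uniform quantized tensor type,
// i8 storage expressing f32 with scale 0.0039 and zero point -128:
//   tensor<4x!quant.uniform<i8:f32, 0.0039:-128>>
```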

I’m trying my best to define the interfaces between the stages so that different methodologies can be shared at different levels. I will share it once it is settled.

Hi Vooblin - welcome.

For some historical context, most of this was written in the very early days of MLIR (the QuantizedType was literally the first dialect-specific custom type), and it turned out to be quite early to implement something generic like this in MLIR: since the process of quantization (as formulated today; I’m not convinced this is the only way to formulate it) is largely an analysis/transformation done on frontend ops and types, it was hard to build complete tooling in the core. Since then, things like extensible traits and op interfaces have matured, which would make it more amenable to extend the generic support in the core.

Given that, liufengdb@ and I ended up collaborating to keep the QuantizedType type system in core while he implemented most of the rest of the tooling in TensorFlow, interoperating with the TFLite dialect. I also did some experiments, specifically to prove to myself where we could get with future codegen infra in core, and as you have noted, that is still there in the form of FxpMathOps, the Quantizer, and some common utilities for solving the constraints.

What I didn’t have the tools to handle at the time were the tie-ins to the source dialects. I had implemented a POC flow on top of XLA HLO, but since MLIR lacks any frontend ops itself, I ended up leaving the top-level bits that drew it all together in a private repo, waiting for the infra to mature a bit before coming back to it.

However, fast-forwarding to now, I believe we have the infra we need in core to do this right (or at least we will: this is why I’m specifically interested in seeing the Development of high-level Tensor Compute Primitives thread get traction). Even without that, we can get somewhere with some example ops if needed.

For me, these next few months are all about picking up and finishing the things we started in the early days. Right now, I’m doing a couple-week sprint to thread dynamic shapes through, and after that I plan to come back, take a fresh look, and write an RFC advocating a way forward.

At a high level, I was probably going to advocate for:

  • Dropping FxpMathOps
  • Keeping the common utilities that the Quantizer uses but removing the high-level algorithm (which, if we need such a global algorithm, could be implemented much more cleanly these days).
  • Introducing some new IR constructs that we wished for a year ago, aimed at solving the problem generically for arbitrary frontend dialects.

Regarding that last point, Feng and I had been discussing a new quant_region op which could be used to wrap arbitrary source-dialect high-precision ops and carry the conversion information that is typically frontend specific. This would solve a lot of the boilerplate in both FxpMathOps and in how TFLite had to couple itself to constructs in its dialect. I hadn’t finished it and was working on it in my repo. Here is the draft. See specifically quantized_op_validation.mlir and CompressOps.td (in my project, we reason about quantization as a specific form of compression, hence the name).

With this, source frameworks could provide a pass to outline supported ops into quant_region ops and then we could largely write generic passes on top of that. I provided one simple example for XLA HLO in the XlaOutlineQuantizable.cpp pass.
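
To give a flavor of the shape of it (the exact op name, attributes, and syntax are still in flux, so treat this purely as an illustration, not the draft’s actual spec), an outlined op might look roughly like:

```mlir
// A frontend pass outlines a supported high-precision op into the wrapper;
// generic quantization passes then only need to understand the wrapper,
// not the frontend dialect itself. Names and attributes here are illustrative.
%0 = "quant.region"(%lhs, %rhs) ({
^bb0(%a: tensor<4xf32>, %b: tensor<4xf32>):
  %sum = "xla_hlo.add"(%a, %b) : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
  "quant.return"(%sum) : (tensor<4xf32>) -> ()
}) {logical_kernel = "generic.add"} : (tensor<4xf32>, tensor<4xf32>) -> tensor<4xf32>
```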

Regarding your point about per-axis: the type system supports this, and TFLite has proven the path pretty well. I generally prefer to get the infra working in terms of per-layer first, since per-axis has odd dependencies on the tensor layout and codegen regime that need to be solved for carefully (TFLite has no concept of layout or codegen and gets to ignore these). That is why you don’t see it in the current samples in the MLIR repo. It is fully possible; it is just a preference to solve that once the simpler cases are established.
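
For reference, both forms are already expressible in the QuantizedType system; a tiny illustration using type aliases (values made up):

```mlir
// Per-layer: a single scale:zero_point pair for the whole tensor.
!per_layer = type !quant.uniform<i8:f32, 0.0039:-128>
// Per-axis: the extra ":0" names the quantized dimension, and the braces
// carry one scale:zero_point pair per slice along that axis.
!per_axis = type !quant.uniform<i8:f32:0, {0.0039:-128, 0.0078:-128}>
```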

Finally, as Feng notes above, TFLite uses a fairly local algorithm for resolving quantization parameters, whereas the work derived from mine uses a global algorithm. We’ll need to resolve this, and there may end up being elements of both: the computations that the TFLite quantizer works on today are not very semantically complicated, and I expect more complexity with more advanced inputs (but that is just an intuition, albeit backed by fiddling with some examples).

HTH with context. Does any of that sound unreasonable? Or do you have any interest in collaborating on the path forward?

- Stella

Thanks so much for such complete replies.

I also investigated the MLIR TFLite quantization part, but the MLIR core part is more interesting to me now. I generally understand the plans for the next few months, and I’m going to concentrate fully on this direction for at least the next 1-2 months. I’ll be glad to collaborate, and I hope you can suggest some specific tasks.


Hi Vooblin, I apologize for the delay; I got pretty severely randomized this week. First, I’ll get your starter patch committed this weekend (thanks!). Second, Feng has been drafting the document he references above. Let’s check with him on Monday to see where he is on publishing it, as I think it can anchor the work and help us scope some independent work items. If it seems like that will take him a while longer, maybe we can parcel out the short-term work in a more ad hoc manner to get started. In any case, I’d like to hear from him.

Hi! I am working together with @Vooblin and will also likely be contributing here. For now I am looking at the code, mostly, until there is an update on tasks here :)

Hi Feng - what do you think about sharing the doc you’re working on?

I was looking at adding sparse elements support to UniformQuantizedPerAxisValueConverter::convert. It seems relatively easy to handle cases where the zero point is 0, but for cases where it isn’t, I see four options:

  1. Emit a DenseElementsAttr.

  2. Don’t support them at all for now.

  3. Extend SparseElementsAttr to support a non-zero value for omitted elements (but this would need to be done very carefully).

  4. Create a new attribute subclass storing a SparseElementsAttr and a default value.

I think 4 is the best, but also the most work. The simplest is of course 2 :).

There is a potential variant of this one where we emit a normal zero-based SparseElementsAttr followed by a (broadcasted) AddOp, which would require additional lowering support to identify/fuse it appropriately.

I mention it because, while it might seem counterintuitive, such a size-optimizing transformation is part of a general set of patterns that we intend to exploit at some point, and my personal opinion is that teaching the infra to reason about such transformations, which explicitly balance constant size against compute, will be pretty important down the road: i.e., if you are codegening kernels versus just relying on black boxes, such simple “decompression” ops (in this case, add is such an op) immediately following a constant can often be folded into the load instructions of the kernel at no or negligible cost on modern hardware.
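
As a rough sketch of that variant (the shapes, types, and values are made up; only the constant-plus-splat-add structure matters):

```mlir
func @decompress_weights() -> tensor<2x3xi32> {
  // Store the weights zero-based and sparse; only the two nonzero entries
  // are materialized in the constant.
  %w = constant sparse<[[0, 0], [1, 2]], [5, -3]> : tensor<2x3xi32>
  // Re-apply the nonzero zero point with an explicit splat (conceptually
  // broadcast) add, which a later lowering could fold into the consuming
  // kernel's loads.
  %zp = constant dense<128> : tensor<2x3xi32>
  %0 = addi %w, %zp : tensor<2x3xi32>
  return %0 : tensor<2x3xi32>
}
```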

This simple case may call for something else; however, I would caution that we are so early in terms of real support for sparse in these tools that I would bias towards solutions that don’t try to extend the type system yet. This is part of why I referred earlier to per-axis being something I like to design for but implement separately.

I’d say that if you get support from others about extending the type system for #3/4, then I’m open to it (probably calls for an explicit RFC in a new thread). Otherwise, I would probably do #1 to unblock the baseline support and acknowledge that follow-on work would be required to make it size optimal in the affine case.

I just wanted to give an update on this. Feng has a doc that he’s been iterating on with some fleshed out IR constructs (based on the aforementioned “quantized region” op). He’s trying to make sure that the concept holds together conceptually with all of the moving pieces before turning it into an RFC. Since we’ve got some collaborators here, I again encouraged him to share the doc with you all for discussion/context possibly before turning it into a full RFC.

I’ll be on vacation for the next week, so you won’t be hearing from me further on this until I return.

Thanks Stella for the intro, and sorry for the late response (I was busy with the new MLIR TFLite converter launch). I just created a new topic and shared a proposal for implementing quantization in MLIR: [RFC] A Proposal for Implementing Quantization Transformations in MLIR. Let’s move the discussion to that thread.

Can you look at https://reviews.llvm.org/D74705 when you have time?

Sorry for the delay - this came in when I was out and I missed the notification. Looking now.