[RFC] TOSA Dialect in MLIR

The TOSA (Tensor Operator Set Architecture) Dialect

Overview

This RFC describes the MLIR implementation of the TOSA operator set. TOSA is described on the site: https://developer.mlplatform.org/w/tosa/ .

The full TOSA operator set, the supported data types, and pseudocode describing the implementation and usage of every TOSA operator are described in a formal specification document at https://git.mlplatform.org/tosa/specification.git/ . Discussion and questions on the TOSA specification itself can be posted at https://discuss.mlplatform.org/c/Discussion-of-the-TOSA-specification/5 .

The TOSA dialect implements the following features:

  • Fidelity to the TOSA specification
  • Import/export between TOSA IR and serialized flatbuffer form
  • A dialect-level abstraction that sits above Linalg/Affine in core MLIR, yet is capable of supporting an entirely independent codegen backend.

The TOSA IR submission to the MLIR repository is intended to be supported by the following specification and infrastructure pieces on mlplatform.org:

  • A serialization translator to convert TOSA MLIR IR to/from FlatBuffers form
  • A reference model that executes a TOSA network and generates output that can be compared for bit-level accuracy.
  • A conformance test suite that generates a bit-accurate output from a TOSA network for integer and quantized types. The test suite includes tools to compare output against that generated by another implementation – either a different framework, or a particular hardware path.

A separate RFC for the TensorFlow repository describes additional pieces:

  • Dialect conversion implementations from TensorFlow and TensorFlow Lite to TOSA.
  • Legalization test suites that validate, on a per-operator basis, each implemented legalization from a TensorFlow or TensorFlow Lite operator to one or more TOSA operators.

There is also a proof-of-concept conversion from ONNX-MLIR to TOSA.

The reference model and serialization library are also intended to be released and maintained on the TOSA mlplatform.org site, along with the specification document.

Motivations

The development of TOSA was driven by two needs:

  • From the top down, define a stable, standard compiler IR form for neural networks expressed in multiple high-level frameworks.
  • From the bottom up, define the functional and scaling information needed to support code generation for multiple potential hardware and software targets, which may support only quantized integer datatypes, or floating point as well.

TOSA is designed to be a tensor-level minimal operator set. Operators in TOSA were chosen on the basis of a set of basic principles described in Section 1.3 of the TOSA specification. They can be summarized as:

  • Can a lowering from an operator in a high-level framework be performed using a combination of existing TOSA ops? If so, avoid adding an operator for the compound form.
  • Can the operator be implemented for the supported datatypes and various hardware targets? For example, softmax cannot be computed directly in quantized integer arithmetic and requires a lookup-table operation. Therefore, TOSA does not have a native softmax operator, and instead generates appropriate decompositions for floating point and quantized integer (see the sketch below).
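
For illustration, a floating-point softmax might decompose into TOSA ops along the following lines. This is a hand-written sketch of one plausible decomposition, not a verbatim dump of any legalization pass; the %logits name and the 1x8 shape are invented for the example:

%exp = "tosa.exp"(%logits) : (tensor<1x8xf32>) -> tensor<1x8xf32>
%sum = "tosa.reduce_sum"(%exp) {axis = 1 : i64} : (tensor<1x8xf32>) -> tensor<1x1xf32>
%rcp = "tosa.reciprocal"(%sum) : (tensor<1x1xf32>) -> tensor<1x1xf32>
%out = "tosa.mul"(%exp, %rcp) {shift = 0 : i32} : (tensor<1x8xf32>, tensor<1x1xf32>) -> tensor<1x8xf32>

The quantized-integer decomposition would instead be built around the table-based tosa.table op.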

The TOSA infrastructure serves as a platform to express and validate the TOSA output of multiple high-level frameworks. The MLIR infrastructure enables this by permitting conversions from higher level dialects.

Generation of Numerical Information for Operators

TOSA expresses the numerical behavior of quantized datatypes as a quantized multiplier within a TOSA op. A rescale operator handles cases involving scaling to/from a quantized range, or between multiple quantized ranges. The definition and functionality of each TOSA op determines how the quantized multiplier is constructed from the input and output tensors; this is described in more detail in the TOSA specification document. For example, consider the following input from the TensorFlow Lite dialect:

%2 = "tfl.conv_2d"(%arg0, %0, %1) {dilation_h_factor = 1 : i32, dilation_w_factor = 1 : i32, fused_activation_function = "NONE", padding = "SAME", stride_h = 1 : i32, stride_w = 1 : i32} : (tensor<1x4x4x4x!quant.uniform<i8:f32, 0.015230963937938213:1>>, tensor<8x3x3x4x!quant.uniform<i8<-127:127>:f32:0, {0.01563686691224575,0.015513741411268711,0.015759991481900215,0.01563686691224575,0.015513742342591286,0.01563686691224575,0.01563686691224575,0.015513742342591286}>>, tensor<8x!quant.uniform<i32:f32:0, {2.381645463174209E-4,2.3628922645002604E-4,2.4003986618481576E-4,2.381645463174209E-4,2.3628924100194126E-4,2.381645463174209E-4,2.381645463174209E-4,2.3628924100194126E-4}>>) -> tensor<1x4x4x8x!quant.uniform<i8:f32, 0.078431375324726104>>

This is expressed with TOSA as:

%2 = "tosa.conv2d"(%arg0, %0, %1) {data_format = "NHWC", dilations = [1, 1, 1, 1], padding = [1, 1, 1, 1], quantization_info = {input_zp = 1 : i32, weight_zp = 0 : i32}, strides = [1, 1, 1, 1]} : (tensor<1x4x4x4x!quant.uniform<i8:f32, 0.015230963937938213:1>>, tensor<8x3x3x4x!quant.uniform<i8<-127:127>:f32:0, {0.01563686691224575,0.015513741411268711,0.015759991481900215,0.01563686691224575,0.015513742342591286,0.01563686691224575,0.01563686691224575,0.015513742342591286}>>, tensor<8x!quant.uniform<i32:f32:0, {2.381645463174209E-4,2.3628922645002604E-4,2.4003986618481576E-4,2.381645463174209E-4,2.3628924100194126E-4,2.381645463174209E-4,2.381645463174209E-4,2.3628924100194126E-4}>>) -> tensor<1x4x4x8xi32>
%3 = "tosa.rescale"(%2) {double_round = true, input_zp = 0 : i32, multiplier = [1669387395 : i32, 1656242552 : i32, 1682532139 : i32, 1669387395 : i32, 1656242652 : i32, 1669387395 : i32, 1669387395 : i32, 1656242652 : i32], output_zp = 0 : i32, per_channel = true, scale32 = true, shift = [39 : i32, 39 : i32, 39 : i32, 39 : i32, 39 : i32, 39 : i32, 39 : i32, 39 : i32]} : (tensor<1x4x4x8xi32>) -> tensor<1x4x4x8x!quant.uniform<i8:f32, 0.078431375324726104>>

In the above example, the input and weight zero points are inputs to tosa.conv2d. The output is scaled by the tosa.rescale op using quantized multiplier-and-shift arithmetic; the multiplier and shift values are computed from the quantization parameters of the input, output, and weight tensors of the conv2d op.
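
As a sanity check on the numbers above (my own arithmetic, not part of the RFC): for output channel 0, input_scale × weight_scale = 0.015230963937938213 × 0.01563686691224575 ≈ 2.381645463174209e-4, which is exactly the bias scale shown in the conv2d input. Dividing by the output scale 0.078431375324726104 gives an effective scale of ≈ 3.0366e-3, and with shift = 39 the quantized multiplier is round(3.0366e-3 × 2^39) ≈ 1669387395, matching multiplier[0] in the tosa.rescale op.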

When the same lowering is performed for the fp32 datatype, no tosa.rescale is generated, since floating-point domain operations do not require an explicit rescale afterwards. The TOSA legalization passes examine the datatypes and generate the correct operator sequence.
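
For comparison, a hypothetical fp32 counterpart of the example above would reduce to a single op (types simplified by hand here, with no quantization_info attribute and no trailing tosa.rescale):

%2 = "tosa.conv2d"(%arg0, %0, %1) {data_format = "NHWC", dilations = [1, 1, 1, 1], padding = [1, 1, 1, 1], strides = [1, 1, 1, 1]} : (tensor<1x4x4x4xf32>, tensor<8x3x3x4xf32>, tensor<8xf32>) -> tensor<1x4x4x8xf32>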

The benefit of expressing quantization and scaling information in this manner is that tensors no longer need to carry quantization information. The operator sequence expresses the necessary scaling operations either explicitly, through the rescale op, or implicitly, through each operator's pseudocode in the specification.

This enables code generation for different hardware targets to obtain all functional and numerical handling information from the operator sequence in the IR. Backend codegen can then express these steps as a sequence of assembly-language instructions, or invoke custom hardware, in which case multiple operators, e.g. conv2d and rescale, may optionally be fused.

In this process, the datatype no longer needs to carry the quantization details, enabling subsequent MLIR tensor-to-buffer allocation passes to ignore quantized type details. The TOSA implementation itself does not scrub quantization type details from the tensors, but TOSA makes the quantization information present within the tensors redundant, allowing higher-level decisions to be made about how far into the lowering process to retain such information.

Directory Setup

TOSA header files will be in include/mlir/Dialect/Tosa. Source files will be in lib/Dialect/Tosa.

A separate RFC for the TensorFlow repo describes the TF/TFLite -> TOSA legalization pieces, which require TOSA to be present within the MLIR repo.

At an appropriate time, the NPComp project would like to leverage TOSA for legalizing PyTorch quantized:: namespace operations.

MLIR Dependencies

TOSA depends on the StandardOps and QuantOps dialects. The former is used to express basic IR ops like ‘return’, while QuantOps is used to express the quantized integer types defined by the TOSA spec.
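
A minimal sketch showing both dependencies (the op and types here are chosen arbitrarily for illustration):

func @main(%arg0: tensor<1x8x!quant.uniform<i8:f32, 0.015:-1>>) -> tensor<1x8x!quant.uniform<i8:f32, 0.015:-1>> {
  // 'return' comes from StandardOps; the element type is a QuantOps uniform quantized type.
  %0 = "tosa.identity"(%arg0) : (tensor<1x8x!quant.uniform<i8:f32, 0.015:-1>>) -> tensor<1x8x!quant.uniform<i8:f32, 0.015:-1>>
  return %0 : tensor<1x8x!quant.uniform<i8:f32, 0.015:-1>>
}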

No current MLIR dialect depends on TOSA. It is intended for the TOSA IR to remain clean, expressing the design goals and not carrying additional meta-information related to any particular framework or backend target. To explore codegen from TOSA, a set of proof-of-concept op lowerings from TOSA to the Linalg dialect was successfully exercised.


Wow, this is really exciting - we’d love to make use of this in IREE! Something we’ve really been missing in the MLIR ecosystem is a dialect that helps bridge frontends (TF, etc) to backends (linalg, etc) and this seems to strike the right balance to allow us to sit in the middle and still operate on an op set with reasonable granularity.

Is there a rendered spec anywhere? I’m only seeing links to the raw git repo?

Thanks for the very positive response, @benvanik !

While mostly complete, a couple of parts of the spec - particularly those related to the training profile - are still evolving, so we currently have it as buildable asciidoc sources. Hopefully you haven’t had any trouble building it locally, but please let us know otherwise.

Here is the rendered form, @benvanik :
tosa_spec.pdf (1.8 MB)

I found it somewhat difficult to build. asciidoc isn’t something that I usually have installed, and the makefile seems to assume that something called asciidoc-pdf is additionally installed. It took me a while to find that the ruby-asciidoc-pdf package was needed on my Debian-based distro.


Thank you for the feedback! We’ll add documentation describing the dependencies and steps more clearly.

Thanks for this proposal, and congrats on the work on TOSA! This is a very impressive piece of work and I’m super happy to see that you had success in using MLIR in this context 🙂

Without judging the merit of the TOSA spec by itself (as you mentioned, it can be discussed on the dedicated forum: https://discuss.mlplatform.org/c/Discussion-of-the-TOSA-specification/5 ), the elements you’re mentioning here are very appealing to me: it seems like having TOSA in-tree would provide us a stable anchor input point from which we can exercise end-to-end pipelines using Linalg and Affine (and Vector, and…).
The existence of the validation suite (do you have a benchmark suite as well?) can fill a gap in our current in-tree development (we’re relying too much on our own downstream projects at the moment).

This seems to me like a clean layering that would fit well without being intrusive in the project; I’m very positive on this so far!

Do you already have code available in public? Maybe a fork of the LLVM monorepo with TOSA integrated?

Hi @sjarus, I am personally +1 on the inclusion/development of TOSA as a dialect within MLIR, and I think doing so could be timely, especially as more of us in the community are working to connect up from the code generation layers to various ML frontends. One of the things that makes such work difficult is the lack of usable “vocabulary” ops in-tree that can be sources and sinks for various frontend transformation tasks.

Aside from generally being +1 on such “reduction op sets” being usable in the ecosystem, I am +1 on TOSA specifically because you all have a good start with respect to composable primitives for quantized arithmetic that both meet the needs of the present generation of ML-inference-based quantization workloads and look to be composed reasonably for the kinds of transformations and lowerings that we want to do with computations in this form. I specifically like the balance you have struck between leveraging tensor-level quantization parameters (via the QuantOps dialect) at the high level and transforming them away into finer-grained primitives that carry their necessary parameters on the ops (and do not unnecessarily mix the numeric domains in the same ops). Such a strategy was what I was going for in the early days when I built the QuantOps facility, and it is good to see that it does indeed seem to work out. While quantization is a still-evolving area, having a well-defined, transformation-oriented vocabulary like this meets it well where it is for dominant uses today. Also, it sounds like you have plans for additional profiles beyond these more inference-oriented ops, and it would be great to collaborate on such things eventually.

So, again, +1 from me on this going into MLIR as a dialect, and the layering sounds reasonable/unlikely to add maintenance burdens to the project at large (so long as it stays maintained). Practically, we’d like to see this as an input dialect in IREE, and I would like to see what it will take to represent PyTorch’s quantized ops in TOSA.

(I also left a note over on the mlplatform side asking about the process for discussing the specification)

@sjarus does TOSA support dynamic shapes?

I saw mention in “1.4.5 Broadcasting” that broadcasting e.g. tensor<1x8xf32> with tensor<8x1xf32> will result in a tensor<8x8xf32>. However, in the case of dynamic shapes, this is a fairly complex situation to handle, since there are multiple different dynamic behaviors.

For example, add(tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> is very difficult to vectorize without doing multiversioning.

Yeah, I was looking at this part too. Based on my experience with HLO, it is important to have vocabulary ops that allow you to move between implicit and explicit broadcast forms (lowering “down” to this level often starts with numpy-style implicit broadcasting, while lowering “out” of this level needs explicitness). I left a note on the mlplatform/spec side regarding the right way to discuss such things.

Thanks for the positive feedback!

Could you please explain what you mean by benchmark suites here? Does this refer to full-network legalization tests? To keep this thread itself dialect-focused, I’ve set up a thread on the TOSA discourse for the general topic of the test infrastructure:

We plan to have the code out very soon - in a matter of days, ideally. The response to the RFC certainly motivates us further! The implementation does run together with the legalizations in the TF repo - this setup was live-demoed - but I’ve since been refactoring to the latest TF+MLIR; there was quite a bit of movement under the hood in MLIR master in late Aug/Sept that added to this effort.

The reference model and test infrastructure are also in the process of being published and will be described on the TOSA mlplatform site; we’ll post links here when ready.

Yes, I meant performance testing: collecting some models with a set of inputs and the ability to run them and measure the performance. We’ve been doing “micro-benchmarks” in the context of the Vector dialect codegen, but we will also need a more comprehensive integration suite. This will be very useful (necessary?) at some point as we’re growing the codegen capabilities.

Dynamic shapes were among the topics for discussion when first sharing TOSA with @stellaraccident recently, but we ran out of time then. It sounds like something that involves an e2e infrastructural solution, with the dialect-specific question being whether there are known restrictions on passthrough of ?x?x… shapes, e.g. for backend multi-versioning support in downstream codegen.

That’s a good question - it seems to ask whether there are constraints in the dialect design or legalizations that expect shapes to be predefined when they may not necessarily be, and whether there are cases where legalization would be hard to accomplish without shape information. Assuming we’re thinking along the right lines here, would you have specific examples we can work with?

We’ve run our frontend without --tf-input-shapes for frozen networks, which yields TOSA content with uninferred shapes. However, this was more oversight than deliberate testing - the reference model we use to validate TOSA content takes actual network input (jpg/wav etc. content), so we specified precise input shapes for compile-time shape inference here.

A simple tf.Add is enough to exercise this, with inputs that dynamically have the following shapes (not known at compile time); see the sketch after the list:

  • 1x7 + 7x1
  • 1x1 + 1x7
  • 7x7 + 1x7
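
In IR terms, all three cases share a single static signature, something like the following hypothetical sketch:

func @dyn_add(%a: tensor<?x?xf32>, %b: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %0 = "tosa.add"(%a, %b) : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}

Which operand broadcasts, and along which dimension, is only discoverable at runtime - hence the multiversioning concern.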

The test infrastructure is capable of accepting BYO full network tests (e.g. a protobuf and input content in numpy format). Over time we’ve added dozens of tests from an internal set of conditioned & frozen full networks in this manner. Once set up, it drives the reference model (this was demoed), together with reference input files.

As part of the test infra and ref model release on mlplatform, we will set up documentation describing how new networks can be added in this manner.

What kind of hardware is the test infra currently going to cover? Since mlplatform seems to be mainly a Linaro project, I guess the test infra is mainly Arm-oriented.

The test infrastructure targets the TOSA operators themselves and lowerings to TOSA operators, not a particular hardware implementation.

Some more high-level information at: https://discuss.mlplatform.org/t/test-infrastructure-for-tosa/52/2?u=jsmolens

This looks like phenomenal work. I agree that it would be great to land this in mainline MLIR, even if more polish and iteration will happen over time. Great job!

-Chris


Thank you @clattner for the encouragement!

I have just posted the RFC for the TF/TFLite legalizations in its own forum: https://groups.google.com/u/2/a/tensorflow.org/g/mlir/c/T28d6uCgXJY

The code release process is on track, both for the MLIR TOSA dialect and the separate legalization passes.

The reference model and test infrastructure were just released: https://git.mlplatform.org/tosa/reference_model.git/about/ . Their discussion threads are:
https://discuss.mlplatform.org/t/test-infrastructure-for-tosa/52
https://discuss.mlplatform.org/t/tosa-reference-model/55

Well done, and I know from personal experience that this represents a very significant amount of work on your part.

Thanks for the detailed RFC! I have several questions:

  1. How do this proposal’s motivations tie to those of the tensor compute dialect discussed in “Development of high-level Tensor Compute Primitives dialect(s) and transformations”? The first motivating bullet, a “stable and standard compiler IR form of neural networks expressed using multiple high-level frameworks”, is also shared by the ‘tensor compute dialect’. Is the design for quantization and conversion from/to FlatBuffers the key distinguishing point? Does TOSA aspire to be the standard tensor compute dialect on tensor types?

  2. How does all this compare to / distinguish itself from Linalg’s evolving operator set and support on tensor types? As indicated on that forum, Linalg appears to aspire to become that tensor compute dialect and is being developed in-tree.

  3. Is TOSA’s tensor operator set influenced by those of user-facing programming models/IRs (TF, ONNX, etc.), or rather by what is convenient for in-dialect transformation and lowering? Besides the conversions (into and out of TOSA) and the numerical information generation mentioned, are there any other in-dialect transformations that this representation would be suitable for? A related question on this is further below.

  4. I can’t find the spec readily at any of the links in the original post. Could you please instead link to a checkpointed PDF in the OP? I’m using the one @_sean_silva linked for now.

  5. Just to dive into the spec of one of the operators, Conv2D, to elaborate on Q3: TOSA’s Conv2D specification appears to be along the lines of TensorFlow’s Conv2D. How, then, would the convolutions corresponding to back-prop filter and back-prop input work? In TF, these are separate ops: tf.Conv2DBackpropFilter and tf.Conv2DBackpropInput. However, a clearly more “compiler-friendly” / “transformation-friendly” op specification would encode all three on a single “convolution” op - e.g. xla_hlo/mhlo.convolution, which encodes all three forms by using appropriate dimension permutation info on input, output, and filter (as part of attributes). If you lower further (e.g. to ops on buffers), the op design would be at least as unified. (Linalg doesn’t support these other convolution forms yet, I believe, but there’d be no reason to encode them as different things AFAICS.) At perhaps all levels right below the entry-level dialect, there’s really no need to keep such forms as separate ops. This is a larger question I have on the operator set choice. Which way would TOSA go?
