Development of high-level Tensor Compute Primitives dialect(s) and transformations

Hi ff7250,

Thanks for chiming in!
It seems that you have short-term needs to build an end-to-end flow, but I am not sure what you’re proposing concretely as an action for the MLIR/LLVM codebase? We can’t directly depend on the XLA codebase from MLIR/LLVM.
XLA has many effective optimizations and we’d like to leverage the XLA experience to drive the work in MLIR. I think the nGraph folks also have significant experience in building this layer, and we’re looking forward to collaborating on this.

There is a large amount of work to migrate TensorFlow from the current XLA paths to include more and more MLIR-based components; see for instance this thread on Google Groups. The TensorFlow-specific and XLA-specific aspects are not directly relevant to this forum though: MLIR/LLVM is entirely independent.


I’d like to leave it open to the people interested in actually building this to define the actual scope.
My take is that we really want a compiler IR here, designed with transformations/optimizations in mind. I am not confident enough to say how orthogonal (or how similar) the optimizations on the usual set of ops in the tensor domain (like in nGraph/HLO) are compared to what you would want to achieve with image-processing primitives from OpenVX/OpenCV.

G-API also has PlaidML as a new backend. So I think some CV-related operations could pass through MLIR with the PlaidML MLIR refactoring. Is that correct, @flaub?

https://github.com/opencv/opencv/pull/15869

Just to stir the thread up a bit… :cat:
In OpenCV we have 4 op groups:

https://docs.opencv.org/master/da/dd3/group__gapi__math.html

  • Math operations
  • Pixelwise operations
  • Operations on matrices
  • Image and channel composition functions

Currently in the new PlaidML G-API backend (/cc @flaub) we have just 5 basic ops covered:

To complete the ops overview, we also have non-core ops:

  • Image filters (Sobel, dilate, erode, etc…)
  • Color space conversions (BGR2Gray, YUV2BGR, etc…)

https://docs.opencv.org/master/d2/d00/group__gapi__imgproc.html

@bhack how about control flow ops? are there any? :slight_smile:

Not in the current master version.

Could the math, bit, and control flow ops be a start for the common set? I somewhat echo @g.ramalingam’s idea of having “scalar” ops (plus control flow ops), with others built on top of them.

Can you clarify what you mean by “having “scalar” ops (+ control flow ops) and then others may be built against them”? Maybe with an example?

This is instead the automatic generated list of ONNX ops:

Folks at IBM are working on an open source project to ingest ONNX into MLIR. We are very interested in finding commonality with other efforts to reduce implementation cost. It is being actively developed, with the goal of eventually being merged under either the LLVM, MLIR, or ONNX umbrella.


Sure. For example, given scalar math.multiply, math.add, and a loop op, matrix multiplication can be implemented as a function, though matrix multiplication might still be a primitive op since it is common enough. Scalar and control flow ops may be a starting point for this primitive set. Does that make sense?
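To make the idea concrete, here is a rough Python sketch (not MLIR) of what “matmul built from scalar primitives plus loops” could look like. The `scalar_add`/`scalar_multiply` helpers are hypothetical stand-ins for primitive ops like math.add and math.multiply:

```python
# Hypothetical sketch: a composite op (matrix multiplication) expressed
# purely in terms of scalar "primitive" ops plus loop control flow.

def scalar_add(a, b):       # stand-in for a primitive math.add op
    return a + b

def scalar_multiply(a, b):  # stand-in for a primitive math.multiply op
    return a * b

def matmul(A, B):
    """Matrix multiply using only scalar ops and loops."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc = scalar_add(acc, scalar_multiply(A[i][p], B[p][j]))
            C[i][j] = acc
    return C
```

A compiler could of course still treat matmul as a primitive and pattern-match or lower it specially; the point is only that the scalar + control flow set is expressive enough to define it.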

btw, is there any online meeting for this discussion now? If yes, can the meeting notes be shared so that folks in different time zones can also follow the status? (I’m personally out of the US right now. :))

@kezhang: I’m not sure I follow; I would need an example to understand what you mean here by having “math.add” and how it differs from the current std.add.
There hasn’t been any meeting yet.

There is definitely a lot of energy and excitement about this topic. Do we have any thoughts about how to organize the discussion/proposals in a more organized fashion?

Personally, before we go very wide in terms of large sets of domain specific ops, I would like to see us close some of the representation gaps that are common to representations of one or more frontend op sets.

In my experience in MLIR, specific frontend ops are “cheap” and committing to them can be easily deferred (and held in frontend projects) in favor of obsessing more about the common structural ops and types. Things in this category don’t necessarily need to be reduced down to one canonical form, but it can be helpful to have good representations for a small number of them (which then lets us have robust transformations in/out and algorithms).

As a concrete example, in our lowerings from TensorFlow, we ended up with multiple different representations for control flow: two functional/region based and one cfg (in std). Having robust algorithms and transformations for each has been helpful for us (but has been somewhat hindered by the fact that we evolved this over time and only CFG was in MLIR core at the outset).

Similar issues exist for region based fusions (mentioned up thread), high level concepts for broadcasting (which, if it is possible to standardize on representations would keep us from having to invent the facility over and over again), composites/reductions, indexing ops, etc.
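As an illustration of why broadcasting is the kind of facility worth standardizing once rather than reinventing per frontend: even just the shape-inference rule (NumPy-style trailing-dimension alignment) is something every frontend otherwise re-implements. A minimal Python sketch, with a hypothetical `broadcast_shape` helper:

```python
# Hypothetical sketch of NumPy-style broadcast shape inference:
# align shapes from the trailing dimension; a dimension of 1
# stretches to match the corresponding dimension of the other operand.
from itertools import zip_longest

def broadcast_shape(a, b):
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or x == 1 or y == 1:
            result.append(max(x, y))
        else:
            raise ValueError(f"incompatible dimensions {x} and {y}")
    return tuple(reversed(result))
```

Whether such a rule is implicit in an op’s semantics or made explicit via a dedicated broadcast op is exactly the kind of representational decision being discussed here.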

In general, I’d like more of a catalog of such forms to pick from and would find this more valuable to define now versus a full set of DNN/CV/etc op sets.

I’m not saying that specific/wide sets of ops isn’t valuable but I think some robust definitions of forms that underlie sets of them might be a good place to start.

Anyway, back to the “how do we start” question? Can we identify next steps for meetings, proposals, background, priors, etc?

I think it would also be helpful to start with a gap analysis for the frontend op sets that people think are interesting to support and lower level facilities that already exist in MLIR (e.g. loop dialect, standard ops, etc). Ideally we find that we have starts to most of what we need and can identify specific concepts that still need to be developed.

@stellaraccident About the gap analysis: even though TensorFlow doesn’t officially support ONNX[1][2], do you think the ONNX op set could be a fairly common subset baseline for all the frontends?
We could also consider that ONNX is quite vendor-neutral now, as it is under the Linux Foundation umbrella.

OpenCV partially supports ONNX, even if it is not compliant with any specific ONNX version. As for the already-mentioned G-API, we could have some common ground with TensorFlow’s tf.image[3] namespace ops.

Since OpenCV has had nGraph and a “stub” PlaidML (@jbruestle) injected into its repository, there could be some common interest in the DNN/CV domain for these two projects to cover OpenCV ops, as these Intel projects are both involved and have already presented at MLIR open design meetings.

[1][feature] ONNX Support · Issue #12888 · tensorflow/tensorflow · GitHub
[2]GitHub - onnx/onnx-tensorflow: Tensorflow Backend for ONNX
[3]Module: tf.image  |  TensorFlow v2.9.1

@bhack I think we may be talking about two different levels of abstraction, and from that perspective, no I don’t think that ONNX represents a common subset factored at a level that I think is the most important to focus on for this initiative. You asked a direct question, so I’m giving a direct opinion, but I do think there is much more nuance than can be conveyed in such a way.

Like the TensorFlow op set (and numpy, scipy, others), it is the union of many fundamental and composite operations from various domains, and it implies a number of structural conventions that, while popular in present day python-derived numeric computing environments, are by no means universal.

This is a potentially good normalization point if you are trying to build some flavors of present day dominant frameworks, but it isn’t quite the same as saying that the job is to build tooling and representations that underlie and produce meaningful structural simplifications of such frameworks in a way that aids the construction of various compiler functions (i.e. the tool versus toolkit distinction).

I think it is important to focus first on the simplifications in MLIR-core, while organizing things like ONNX, et al to layer on top. We may still arrive at a very wide/comprehensive opset like ONNX in core, but the way we get there is important to me (and I am one voice here, speaking my opinion). In my/TF experience, many useful properties emerge by focusing on the decomposition of components in the toolkit while analyzing/implementing the high level mappings to it.

From my side of the work, we’ve had success with some elements of XLA/HLO and LinAlg as structural simplifications that can be used as concentration points; however the distinction is important: LinAlg was co-designed with MLIR to be such a thing whereas HLO was imported into the ecosystem. When we started the project, we were aware that while HLO got some things right for something like MLIR, it also has a lot of unrelated baggage and assumes some things that were biases of previous representations/goals which it would be good to revisit. We could have chosen to just baseline on it by importing it en-masse but did not – which has given us the design space to explore the structural components.

I’d also point to other threads where the nGraph folks are looking to upgrade the core abstractions (like Loops) which should underpin their ops. We’re also having similar discussions on the shape side as we are struggling to come to terms with how much (and where) to bake broadcasting assumptions into the core abstractions.

That kind of exercise is what I was referring to when I said “gap analysis”. I don’t think we should just be importing wide opsets en-masse without having gone through the process to distill the assumptions they’ve made and map those to core abstractions. As my examples at the beginning with a sample MLP showed (and has been shown in the various proposals for handling dynamic shapes in things like HLO), it doesn’t take very many ops (even things you might take for granted like “add” and “matmul”, “conv”, “einsum”) to interrogate the opinions and levels of abstraction involved – and arrive at useful intermediate design points.

We don’t necessarily need to reduce everything down to non-overlapping primitives, nor do we need to globally select one intermediate abstraction that underlies everything in an area (many things have a small N of reasonable approaches), but if we don’t provide intermediate simplifications as part of the design process, we’re missing an opportunity.

History shows that once an opset reaches a certain span, it grows indefinitely based on the starting core abstractions and it is very hard to change its level of abstraction. I personally want to see MLIR-core be a place where we have a well implemented catalog of such simplification points (and supporting analyses /transformations) more than I want to see it embed the breadth of something like ONNX.

As I’ve said before: I’m super supportive of such a breadth-first thing based on ONNX existing and being done well. As just one voice, I could even be convinced that such a thing should literally be part of the LLVM project, but I just want to see the MLIR-core be more toolkit than tool and am worried that we are skipping steps if we jump straight to the breadth of ONNX (or TensorFlow, or HLO, etc).

Honestly, I didn’t want to put ONNX in straight as an abstraction. I meant only that if we want to do a top-down check on what we need for a useful intermediate design, ONNX could be one of the starting points. I don’t know whether it is too opinionated for a top-down check, but at least it is already a multi-stakeholder governed effort.

many useful properties emerge by focusing on the decomposition of components in the toolkit while analyzing/implementing the high level mappings to it.

I think this could be OK, but what do you suggest? That the Intel nGraph team goes through the same process with its own check, the ONNX team at IBM likewise, and so on? Will these parallel processes generate downstream requests not covered by the main XLA/HLO process?

I think the real issue here probably has a bit of a political factor. What is really common at the top level? And what level is not too “high”, so that every vendor can still “sell” its own framework/stack?

I think downstream is easier, because at the low level we have so many person-years of work on assembly-optimized libraries implementing kernels/ops, and all this work will still be reused on the hardware vendor side until the “common effort” produces a sufficiently optimized codegen stack.

But at top level it is a little bit less clear to me.

So the way I view this is that one needs to think about the stack holistically, and often contribute to multiple levels of abstraction. I.e. one needs to constantly:

  1. put the top-down UX driven hat and make the effort of going down to some relatively low-level of abstraction and see how things connect.
  2. switch to the bottom-up perf-driven hat and raise abstractions that are known to play well on the hardware to some level of IR that is good for 1.
  3. iterate, iterate, iterate, … then iterate some more until enough intuition is gained about the whole stack for multiple different enough hardware.

IR is the (hopefully stable) glue in between that makes the problem tractable and avoids the traditional pitfalls of:
a. I need this op and wish the performance into existence; the compiler folks will figure it out, or I will use a library.
b. I have this HW and ISA that gives me 10ZFl/W; the compiler folks and the librarians will figure out the user story.

In other words, there are both UX/PL/Compiler codesign and a Compiler/HW codesign opportunities.
IMO, IR design driven by retargetable and generic transformations is key: transformations first, everything else second.

Without this, the alternatives are not pretty, and history has shown some systems/frameworks evolve this way: (1) pure python overhead around fast library calls, or (2) N^2 behavior where ops need to know about each other.

The compiler promise is all about making this (N frontend) x (M hardware) problem linear and tractable.
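A hypothetical back-of-the-envelope for the linearization claim (the frontend/backend counts below are made up purely for illustration):

```python
# Without a shared IR, every frontend needs a bespoke lowering to every
# hardware target: N * M pairings. With a common IR in the middle, each
# side only targets the IR: N + M paths.
N_frontends = 5  # e.g. TensorFlow, ONNX, OpenCV G-API, nGraph, ... (illustrative)
M_backends = 4   # e.g. CPU, GPU, accelerator, DSP (illustrative)

pairwise = N_frontends * M_backends  # bespoke frontend-to-backend paths
via_ir = N_frontends + M_backends    # paths through the common IR
```

Even at these small counts the pairwise cost dominates, and the gap widens as either side grows.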

We could call this approach middle-out :wink: :wink: (or not…).

@nicolasvasilache I agree with you… “attention is all you need” :wink: We could see 3. as navigating scale/space (or context, if you want), as the stairway to (holistic) heaven :slight_smile: .

I agree with you here, but you have not replied to me about the top of the middle-out approach… what is the ceiling/roof of this house? Is it really healthy to avoid looking for a common representation at a higher level just because all the frontends would then look too similar?
Honestly, is the way we define the top “layer” a marketing limit or a technical one?

I am definitely not advocating for blindfolds here. I think as usual this is a naming problem (@mehdi_amini and @stellaraccident yes I see the irony here): “Tensor Compute Primitives dialect(s) and transformations” seem to mean different things to different people.

I think MLIR does not aim at being opinionated on frontends or on backends but really on compiler infrastructure and transformations. In other words the question core MLIR people are probably most interested in (at least speaking for myself here) is whether we have blindspots: are there abstractions that are (1) key to express user-intent + transform to well optimized code, (2) that need fundamental new design/dialect than what is already in development and (3) that we are missing by mixing the existing current set of abstractions (+ their immediate evolutions).

Having surfaced examples of such omissions, we can collectively look for solutions and debate/prototype how they would compose with other things (the really tricky bit IMO) and influence the future design and evolution of MLIR dialects.

At least this is my (current) understanding of the objectives of this proposed workgroup on “Tensor Compute Primitives dialect(s) and transformations”, so please take this with a pound of salt :slight_smile: