Starting MLIR component similar to llvm-canon!

Hello,

I was looking at the open projects page and it seems the project for creating a tool like llvm-canon is still open. I just wanted to check if this is still the case.

Currently, I am working on a new lib to combine multiple spv modules into one [1]. The main goal is to merge multiple modules and remove duplicate code so that the size of the combined module is less than the sum of input modules. I think adding cononicalization tool(s) will enable us to implement an even stronger deduplication logic. I mean such canonicalization pass might help detect semantically equivalent code that we wouldn’t be able to identify otherwise (@antiagainst please correct me if you think that doesn’t make sense :smile: ).

Therefore, I am interested to work on such tool as well. @mehdi_amini, @jpienaar, it would be great to have some more info from you! I think these links: [llvm-dev] llvm-canon and MIR-Canon: Improving Code Diff Through Canonical Transformation should be a good start to draft an initial idea of what can be done on the MLIR level as well. Please let me know if you have further info.

Best,
Kareem

[1] https://reviews.llvm.org/D90022

Edit: I missed this part of the MLIR docs: Operation Canonicalization - MLIR. I will have a look there as well. Maybe what I am interested in is already implemented.

I was literally looking at a diff this morning and wishing I had this tool already :wink:

Just quick response to say it is open and would be very useful. And would gladly iterate on proposal of such a tool.

– Jacques

Awesome, thanks for the reply!

Then, I am interested to start on this project regardless of its applicability to our current use-case (the SPIR-V module combiner). I will go through the resources I linked above but it would also great to get more input from you if any. Once I create a clearer idea of the details, I will start a proposal.

Hello,

This is a very initial draft for a proposal Please have a look below instead. It summarizes what is already done in mir-canon and llvm-canon and draws a high-level picture of what needs to be done in the MLIR context. It would be awesome if you can have a look and provide any comments that can guide further refining this.

Kareem

Sounds good, I left some comments. I think it would make it easier if you could copy and paste the document in here too, so that discussion could happen on the forum rather than in the doc/capturing it for future.

Thanks,

Jacques

Draft: MLIR Canonicalization Tool

Abstract

Comparing MLIR modules is an important and frequent activity for understanding the effects of different passes on particular input IR. It can also be beneficial in compilation workflows to automatically detect (semantically) identical or similar MLIR operations. For the former use-case, comparing the same input module across two or more stages/passes can help the programmer better understand and debug the involved passes. For the latter, one possible workflow is one where we aim to deduplicate code merged from many input modules as much as possible (e.g. SPIR-V module combiner). In both cases, the semantic differences between the compared modules/operations should stand out while the noise resulting from differences in operation ordering, naming, etc. is minimized.

To that end, defining some canonical form for MLIR modules and converting input modules to such canonical form would help bring similar or identical modules closer to each other. This in turn, can be leveraged by existing diff tools to provide a more meaningful comparison. We can also explore how to utilize canonicalized output to augment other compilation workflows.

Therefore, the aim of this project is to:

  • Define a suitable canonical form for MLIR modules.
  • Implement a tool/library that transforms an input module to its equivalent canonical form.

Previous Work

Canonicalizing IR modules is not a new idea even with the LLVM umbrella. In particular, such tooling has been implemented already for Machine IR (mir-canon) and LLVM IR (llvm-canon). Following is a brief summary on how these 2 tools work.

mir-canon

mir-canon implements a few stages through which the input module is processed in the following order:

  1. Canonical copy propagation: in this stage, copy instructions are folded for all such copies where the source and destination are virtual registers of the same register class.
  2. Canonical instruction rescheduling: in this stage, instructions defining values are brought as close as possible to their first use within the basic block. If an instruction uses more than one value defined within the BB, then such def instructions are ordered lexicographically (the instruction strings are compared against each other).
  3. Rename virtual registers: use a consistent renaming scheme to produce “canonical” names for virtual registers.

It should be noted that mir-canon limits canonicalization to the basic-block boundary. In other words, it canonicalizes one BB at a time and re-orders instructions only within such border.

llvm-canon

llvm-canon is a function pass the processes an input module in the following order:

  1. Rename function arguments using a unified renaming scheme.
  2. Rename basic blocks using a hash code computed for each block. The hash code is computed from the output instructions within the BB (output instructions are instructions that may have side effects or return instructions).
  3. Reorder output instructions by reducing the def-use distance.
  4. Rename output instructions according to a consistent renaming scheme.
  5. For commutative instructions, reorder their operands by name.
  6. For PHI instructions, reorder their values according to the names of the corresponding basic blocks.
  7. Shorten instruction names to produce an easier to grasp canonical module.

mlir-canon

We would like to develop a new pass similar to the two passes described above that works on the MLIR level. For that, we need to somehow define some canonical form for different MLIR constructs. MLIR is able to express lower level constructs similar to those expressable by LLVM IR. Therefore, the above tools provide a good start on canonicalization techniques that work well in practice. We need, however, to adapt such techniques for the MLIR context and/or come up with other techniques if needed. In particular, we need to find canonicalization techniques for loop nests, tensors, and regions. Coming up with a suitable canonical form might require some iterative experimentation/prototyping in order to have a more concrete idea about the effects of different techniques.

Generic vs. Dialect-Specific Canonicalization

One of the first questions worth exploring is:

  • should we define a generic canonical form for the entire IR or,
  • should we offer the flexibility for each dialect to implement its canonical form?

One of the main motivations of MLIR is to provide a framework able to express programs at multiple levels of abstraction. However, that comes with its own set of challenges. In our context here, it might be difficult or even impossible to find a satisfying canonical form that accommodates the vastly different dialects within MLIR. On the other hand, only providing a framework for canonicalization and leaving the choice of the actual canonical form to each dialect might provide a sub-optimal experience. For example, if a single module contains more than dialect some of which don’t implement canonicalization, then we would still end up with a diff polluted by textual differences. However, this can be overcome by implementing a default canonicalization behavior if a dialect doesn’t opt-in to a custom form of its own. Such default form would be based around the properties of ops, e.g., whether an op has side-effects. This enables us to take advantage of the modelling of ops as far as possible to enable some reasonable canonical form.

Finding a good reference point for a canonical form

Regardless of our answer to the previous question, or even to answer it, we would need to find a good canonical form. From my point of view, this can only be decided by a manual experimentation and search process. To properly understand the effects of different passes, we need at least the following:

  • For each dialect, we need to run enough number of representative modules through different pipeline setups and inspect the evolution of those modules across different passes.
  • From that inspection, we can gain more understanding of how different syntactic constructs evolve and come up with possible mechanisms that bring semantically equivalent ops to the same form.

Doing this with enough test data should provide more insight into whether a generic canonical form is enough or whether we need to provide the infrastructure for dialects to define their canonicalization. My intuition is that that latter case will be our course of action.

Thanks, I just did. Happy to handle any further comments here and will expand on the above post overtime as well in the coming days.

This is really exciting, thank you for driving this!

1 Like

Sounds good. Did you have any more thoughts here or played with a prototype?

Sorry for the late reply. Not yet. I got quite busy with my day job, other MLIR contributions, and personal life. I will make sometime next week and try to push this forward. Apologies for the delay.

No problem at all, thanks for pursuing this.

I’m trying to understand the state of this project. I see that the project remains on the open projects page. But I also see many ops implement their own canonicalizers, as was proposed by Bob. For example, the cf dialect specifies when br ops are canonicalized (example).

I don’t recall code related to this landing, but it’s been some time so could be mistaken. Canonicalization here means more wrt output/operand ordering/name/location rather than changing the op as in the linked canonicalizer. E.g., here it would be to reorder two operands of a commutative op such that the closer def is first so that a textual diff has less/no non-load bearing differences vs changing a conditional branch to unconditional one. Another example is ordering inside a “graph” region where there is freedom to interchange without having any impact on semantics but the output could be vastly different (I don’t even recall if we had those when we discussed this last) but for which there is no canonical form. Now, the canonicalization patterns of today could be useful here, but they could also be buggy and exactly what we want to find.