Draft: MLIR Canonicalization Tool
Abstract
Comparing MLIR modules is an important and frequent activity for understanding the effects of different passes on particular input IR. It can also be beneficial in compilation workflows to automatically detect (semantically) identical or similar MLIR operations. For the former use-case, comparing the same input module across two or more stages/passes can help the programmer better understand and debug the involved passes. For the latter, one possible workflow is one where we aim to deduplicate code merged from many input modules as much as possible (e.g. SPIR-V module combiner). In both cases, the semantic differences between the compared modules/operations should stand out while the noise resulting from differences in operation ordering, naming, etc. is minimized.
To that end, defining some canonical form for MLIR modules and converting input modules to such canonical form would help bring similar or identical modules closer to each other. This in turn, can be leveraged by existing diff tools to provide a more meaningful comparison. We can also explore how to utilize canonicalized output to augment other compilation workflows.
Therefore, the aim of this project is to:
- Define a suitable canonical form for MLIR modules.
- Implement a tool/library that transforms an input module to its equivalent canonical form.
Previous Work
Canonicalizing IR modules is not a new idea, even within the LLVM umbrella. In particular, such tooling has already been implemented for Machine IR (mir-canon) and LLVM IR (llvm-canon). The following is a brief summary of how these two tools work.
mir-canon
mir-canon implements a few stages through which the input module is processed in the following order:
- Canonical copy propagation: in this stage, copy instructions are folded for all such copies where the source and destination are virtual registers of the same register class.
- Canonical instruction rescheduling: in this stage, instructions defining values are brought as close as possible to their first use within the basic block. If an instruction uses more than one value defined within the BB, then such def instructions are ordered lexicographically (the instruction strings are compared against each other).
- Rename virtual registers: use a consistent renaming scheme to produce “canonical” names for virtual registers.
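The renaming stage can be illustrated with a small sketch. The following is a toy Python model, not mir-canon's actual code (mir-canon is C++, and its real naming scheme differs); the `%vrN` names are an illustrative assumption. Virtual registers are renumbered in order of first appearance, so two blocks that differ only in register numbering converge to the same text.

```python
import re

def rename_vregs(instructions):
    """Rewrite %N register names to canonical %vr0, %vr1, ... in
    order of first appearance (hypothetical naming scheme)."""
    mapping = {}
    out = []
    for inst in instructions:
        def repl(m):
            old = m.group(0)
            if old not in mapping:
                mapping[old] = f"%vr{len(mapping)}"
            return mapping[old]
        out.append(re.sub(r"%\d+", repl, inst))
    return out

# Two equivalent blocks that differ only in register numbers
# canonicalize to the same text:
a = ["%7 = ADD %3, %4", "RET %7"]
b = ["%1 = ADD %9, %2", "RET %1"]
assert rename_vregs(a) == rename_vregs(b)
```

With such a scheme in place, a textual diff of two renamed blocks no longer reports register-numbering noise.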
It should be noted that mir-canon limits canonicalization to the basic-block boundary. In other words, it canonicalizes one BB at a time and re-orders instructions only within that boundary.
llvm-canon
llvm-canon is a function pass that processes an input module in the following order:
- Rename function arguments using a unified renaming scheme.
- Rename basic blocks using a hash code computed for each block. The hash code is computed from the output instructions within the BB (an output instruction is one that may have side effects, or a return instruction).
- Reorder output instructions by reducing the def-use distance.
- Rename output instructions according to a consistent renaming scheme.
- For commutative instructions, reorder their operands by name.
- For PHI instructions, reorder their values according to the names of the corresponding basic blocks.
- Shorten instruction names to produce an easier-to-grasp canonical module.
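Two of these steps can be sketched compactly. The snippet below is a simplified Python model, not llvm-canon's actual code: the `COMMUTATIVE` set, the SHA-1 choice, and the `bb_` prefix are illustrative assumptions standing in for llvm-canon's real hashing and naming.

```python
import hashlib

# Opcodes treated as commutative in this toy model.
COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

def canon_operands(opcode, operands):
    """Sort the operands of commutative instructions by name, so that
    add %b, %a and add %a, %b end up textually identical."""
    return sorted(operands) if opcode in COMMUTATIVE else operands

def block_name(output_insts):
    """Derive a stable block label from the block's output
    (side-effecting/return) instructions via a truncated hash."""
    digest = hashlib.sha1("\n".join(output_insts).encode()).hexdigest()
    return "bb_" + digest[:8]

assert canon_operands("add", ["%b", "%a"]) == ["%a", "%b"]
```

Because the block name depends only on the block's output instructions, blocks that compute the same observable effects receive the same label regardless of their original labels.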
mlir-canon
We would like to develop a new pass, similar to the two passes described above, that works at the MLIR level. For that, we need to define a canonical form for the different MLIR constructs. MLIR is able to express lower-level constructs similar to those expressible in LLVM IR. Therefore, the above tools provide a good starting point with canonicalization techniques that work well in practice. We need, however, to adapt such techniques to the MLIR context and/or come up with other techniques as needed. In particular, we need to find canonicalization techniques for loop nests, tensors, and regions. Coming up with a suitable canonical form might require some iterative experimentation/prototyping in order to form a more concrete idea about the effects of different techniques.
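As a rough illustration of the kind of normalization we are after, the sketch below renames SSA values in the textual form of an MLIR module in order of first appearance. A real pass would of course operate on the in-memory IR rather than on text, and the regex only approximates MLIR's value-name syntax; the `%vN` names are an assumption.

```python
import re

def normalize_ssa_names(mlir_text):
    """Rename SSA values in MLIR assembly to %v0, %v1, ... in order
    of first appearance, so modules that differ only in value naming
    diff cleanly."""
    mapping = {}
    def repl(m):
        old = m.group(0)
        if old not in mapping:
            mapping[old] = f"%v{len(mapping)}"
        return mapping[old]
    # Approximation of MLIR SSA value names: % followed by an identifier.
    return re.sub(r"%[A-Za-z0-9_$.]+", repl, mlir_text)

# Two ops that differ only in value names normalize to the same text:
a = "%t = arith.addf %x, %y : f32"
b = "%sum = arith.addf %lhs, %rhs : f32"
assert normalize_ssa_names(a) == normalize_ssa_names(b)
```

Even this naive text-level normalization removes one major source of diff noise; the open question is which additional transformations (reordering, region handling, etc.) the real pass needs.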
Generic vs. Dialect-Specific Canonicalization
One of the first questions worth exploring is:
- should we define a generic canonical form for the entire IR or,
- should we offer the flexibility for each dialect to implement its canonical form?
One of the main motivations of MLIR is to provide a framework able to express programs at multiple levels of abstraction. However, that comes with its own set of challenges. In our context here, it might be difficult or even impossible to find a satisfying canonical form that accommodates the vastly different dialects within MLIR. On the other hand, only providing a framework for canonicalization and leaving the choice of the actual canonical form to each dialect might provide a sub-optimal experience. For example, if a single module contains more than one dialect, some of which don’t implement canonicalization, then we would still end up with a diff polluted by textual differences. However, this can be overcome by implementing a default canonicalization behavior for any dialect that doesn’t opt in to a custom form of its own. Such a default form would be based on the properties of ops, e.g., whether an op has side effects. This enables us to take advantage of the modelling of ops as far as possible to arrive at a reasonable canonical form.
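Such an opt-in scheme with a generic fallback could be sketched as follows. The API below is entirely hypothetical (ops are modeled as plain dictionaries, not real MLIR operations); it only illustrates dispatching on the dialect prefix of an op's name and falling back to a default driven by op properties such as side effects.

```python
class ModuleCanonicalizer:
    def __init__(self):
        self._dialect_canonicalizers = {}

    def register(self, dialect, fn):
        """Dialects opt in by registering their own canonicalizer."""
        self._dialect_canonicalizers[dialect] = fn

    def canonicalize_op(self, op):
        # MLIR op names are prefixed with their dialect, e.g. "arith.addi".
        dialect = op["name"].split(".", 1)[0]
        fn = self._dialect_canonicalizers.get(dialect, self._generic)
        return fn(op)

    @staticmethod
    def _generic(op):
        # Default behavior: sort operands of side-effect-free commutative
        # ops; leave side-effecting ops untouched.
        if op.get("commutative") and not op.get("has_side_effects"):
            op = dict(op, operands=sorted(op["operands"]))
        return op

c = ModuleCanonicalizer()
op = {"name": "arith.addi", "operands": ["%b", "%a"],
      "commutative": True, "has_side_effects": False}
assert c.canonicalize_op(op)["operands"] == ["%a", "%b"]
```

The design choice here is that the generic fallback is always safe (it only reorders where op properties guarantee equivalence), while registered dialect canonicalizers may apply stronger, dialect-aware rewrites.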
Finding a good reference point for a canonical form
Regardless of our answer to the previous question, or even to answer it, we would need to find a good canonical form. From my point of view, this can only be decided by a manual experimentation and search process. To properly understand the effects of different passes, we need at least the following:
- For each dialect, we need to run a sufficient number of representative modules through different pipeline setups and inspect the evolution of those modules across the different passes.
- From that inspection, we can gain more understanding of how different syntactic constructs evolve and come up with possible mechanisms that bring semantically equivalent ops to the same form.
Doing this with enough test data should provide more insight into whether a generic canonical form is enough or whether we need to provide the infrastructure for dialects to define their own canonicalization. My intuition is that the latter will be our course of action.