See the previous published edition.
Welcome to the tenth issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!
Highlights
- The Circuit IR Compilers and Tools (CIRCT) project has been proposed as an LLVM incubator project. This project aims to apply MLIR and the LLVM development methodology to the domain of hardware design tools. The use of MLIR for HW design is a topic that was discussed at a recent open meeting on LLHD (slides - additional slides - recording). And if you’re wondering about LLHD vs CIRCT: they have already converged!
- The mlir-npcomp project is also aiming at becoming an incubator project.
Many RFCs in-flight on Discourse:
- MLIR Test Case Reducer Tool
- [RFC] Starting in-tree development of python bindings
- [RFC] Rebooting C APIs for core IR
- [RFC] First-Party Support for LLVM Types
- [RFC] New dialect for modelling asynchronous execution at a higher-level
- [RFC] Async/Await dialect targeting LLVM coroutines
MLIR Core
Infrastructure
- Types in the custom assembly format may now be resolved from Attribute types.
- Interfaces may now be defined on Attributes and Types, in a similar fashion to OpInterfaces.
- isa<> now supports variadic arguments.
- A new NoRegionArguments trait has been added to verify that all regions attached to an Op have no arguments (implicit capture only).
- Interfaces can now specify the C++ namespace that they should be generated in.
- RFC and initial commits for upstream Python bindings
- RFC for upstream C-API reboot
- Initial structure for an MLIR TestReducer has started to land.
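As a rough illustration of the variadic isa<> item above: a variadic check returns true when the value is an instance of any of the listed types. The Python analogue below is only a sketch of that semantics, not the C++ implementation; the `isa` helper and the toy `AddOp`, `MulOp` and `ReturnOp` classes are hypothetical names introduced for the example.

```python
class Operation:
    """Toy stand-in for a base operation class (hypothetical)."""


class AddOp(Operation):
    pass


class MulOp(Operation):
    pass


class ReturnOp(Operation):
    pass


def isa(value, *types):
    """Return True if `value` is an instance of at least one of `types`.

    Mirrors the idea of a variadic isa<A, B, ...>(value): one call replaces
    a chain of single-type checks joined with "or".
    """
    if not types:
        raise TypeError("isa() needs at least one type to test against")
    return any(isinstance(value, t) for t in types)


# One variadic call instead of isa(op, AddOp) or isa(op, MulOp).
is_arith = isa(AddOp(), AddOp, MulOp)
is_arith_ret = isa(ReturnOp(), AddOp, MulOp)
```

Here `is_arith` is true and `is_arith_ret` is false, since `ReturnOp` is neither of the listed types.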
Table-driven Infrastructure
Shape Dialect
- Support landed for basic shape function reification and constraint generation for generating kernels as part of the TF kernel generator project (details to come on this project).
- Discussion is ongoing about separating, within the dialect, the operations used to specify “shape transfer functions” from those used to compute shapes, in order to avoid confusion and simplify descriptions.
- Canonicalizations for shape compute operations related to rank, size and conversions.
Optimizations and Code Generation
- BufferPlacement has been extended to support nested region control flow.
- The General Basis Reduction algorithm (GBR) has been implemented to perform exact integer emptiness checks for FlatAffineConstraints. It includes an implementation of the Simplex from “Simplify: A Theorem Prover for Program Checking” by D. Detlefs, G. Nelson, J. B. Saxe.
- Linalg contraction vectorization vectorizes multiple contraction patterns (batched, degenerate, …)
- Linalg contraction vectorization uses vector.transfer + vector.contract
- Forwarding patterns are added between linalg.copy and vector.transfer ops
- A utility function was added to hoist scf.for values through scf.yield and iterArgs.
- Linalg hoisting of redundant transfers makes use of scf.for + scf.yield.
- Linalg custom ops are retired in favor of ODS-gen’d ones (batch matmul, matmul, matvec, doc)
- End-to-end programmable codegen strategy for Linalg.matmul through vector.contract prototype in progress (in IREE and XLA) for CPU and SPIRV backends with encouraging early results. More details to come in the next few weeks.
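To see why the integer emptiness check mentioned above needs to be exact, consider a constraint set that contains rational points but no integer points. The sketch below uses plain brute-force enumeration over a bounding box, which is not GBR and not the MLIR implementation; the constraint encoding and the function names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product


def satisfies(constraints, point):
    """Check a point against constraints of the form sum(c_i * x_i) + const >= 0."""
    return all(
        sum(c * x for c, x in zip(coeffs, point)) + const >= 0
        for coeffs, const in constraints
    )


def is_integer_empty(constraints, bounds):
    """Brute force: True if no integer point inside the box `bounds` satisfies
    the constraints. `bounds` is a list of inclusive (lo, hi) pairs, one per
    variable, so the answer only covers that box."""
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    return not any(satisfies(constraints, p) for p in product(*ranges))


# 1 <= 3x - 3y <= 2, written as two ">= 0" inequalities.
constraints = [
    ((3, -3), -1),  # 3x - 3y - 1 >= 0
    ((-3, 3), 2),   # -3x + 3y + 2 >= 0
]
box = [(0, 3), (0, 3)]

has_rational_point = satisfies(constraints, (Fraction(1, 2), 0))
integer_empty = is_integer_empty(constraints, box)
```

The point (1/2, 0) satisfies both inequalities, yet 3x - 3y is always a multiple of 3 for integer x and y, so no integer point lies in the set. A purely rational feasibility test would report the set as non-empty; an exact integer check reports it as empty.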
CPU codegen
- Added a mechanism to set the fp reductions reassociate flag programmatically
- XLA:CPU is the first client of the pass that lowers to the vector dialect using MLIR
- Other lowering paths and passes will be modified as well
- Reduction/DOT operations can run 8-20x faster this way
- Generalized vector operations; in BLAS1 terms we now have:
- DOT as special case of vector.contract
- AXPY as special case of vector.outerproduct
- Expressing AXPY as an outer product allowed unifying the MatMat and MatVec lowerings into a single method, which will simplify the heuristics we need for XLA:CPU
- Multiple canonicalization and folding patterns added for Extract/Insert/Transpose/VectorTransfer and combinations of those
- Vector unrolling now exposed as a composable pattern, backed by an OpInterface.
- 2-D vector.contract -> vector.outerproduct lowering with support for all transposition cases.
- Systematic bottom-up performance evaluation on vector-based contractions has started for AVX512 and AVX2. Encouraging results wrt peak in favorable cases for vector.contract -> vector.outerproduct. More details to come in the next few weeks.
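The arithmetic behind the DOT, AXPY and outer-product items above can be sketched in plain Python. This is only an illustration of the math, not the MLIR lowering: the function names are hypothetical, and matrices are assumed to be non-empty lists of rows. A matrix product is accumulated as a sequence of outer products, one per reduction index; when the right-hand side is a vector, each outer product degenerates to an AXPY, so the same loop covers both MatMat and MatVec.

```python
def dot(x, y):
    """DOT: a contraction over the single index shared by x and y."""
    return sum(a * b for a, b in zip(x, y))


def outerproduct(lhs, rhs, acc):
    """Return acc + lhs (outer) rhs.

    If `rhs` is a scalar, this degenerates to AXPY: acc[i] + lhs[i] * rhs.
    """
    if isinstance(rhs, (int, float)):
        return [a * rhs + c for a, c in zip(lhs, acc)]
    return [
        [a * b + c for b, c in zip(rhs, acc_row)]
        for a, acc_row in zip(lhs, acc)
    ]


def contract_via_outerproducts(A, B):
    """Compute A @ B, where B is a matrix (list of rows) or a vector (list of
    scalars), as one accumulated outer product per reduction index k."""
    m, k = len(A), len(A[0])
    is_matvec = not isinstance(B[0], list)
    acc = [0] * m if is_matvec else [[0] * len(B[0]) for _ in range(m)]
    for kk in range(k):
        column = [A[i][kk] for i in range(m)]  # k-th column of A
        acc = outerproduct(column, B[kk], acc)  # rank-1 update (or AXPY)
    return acc


A = [[1, 2], [3, 4]]
matmat = contract_via_outerproducts(A, [[5, 6], [7, 8]])
matvec = contract_via_outerproducts(A, [5, 6])
```

With these inputs, `matmat` is `[[19, 22], [43, 50]]` and `matvec` is `[17, 39]`, matching the row-by-column definition computed with `dot`.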
SPIR-V
- SPIR-V to LLVM conversion (GSoC project) is continuing to make good progress: last month George converted most math operations and scalar/vector types; this month the goal will be to scale to memory operations, composite types, and common control flow cases. During the last two weeks, George landed support for converting constant op, more bitfield ops, (runtime) array / pointer types, and a few other op conversions and cleanups. There is also a nice SPIR-V to LLVM conversion manual doc under review.
- Support for other SPIR-V use cases also sees new progress. Hazem defined spv.MatrixTimesMatrix op and Kareem introduced spv.CopyMemory.
- SCF to SPIR-V conversions are now put in the proper directory, and support for IfOp and ForOp yield values has been added.
Other
In the Ecosystem
IREE : An Experimental MLIR Execution Environment
- Significant build upgrades and debugging, resulting in a newly complete Android getting started guide, and successful execution for CPU (LLVMAOT) and GPU (Mali) for all but one test. More is needed, but this represents a major milestone.
- Community interest in enabling a Hexagon backend.
mlir-npcomp: Prototype for compiling numpy programs
- Achieved milestone of e2e execution of an extracted tensor function from Python AST->MLIR Basicpy+Numpy Dialects->TCF->LinAlg->LLVM, running through the in-tree, minimal npcomprt (minimal reference runtime). Note that aside from some basic control flow and scalar/conditionals, this doesn’t do much else right now.
TensorFlow
- Progress on the plan for using the MLIR code generation for XLA GPU:
- Tuple is gone from LHLO. SortOp exercised the tuple-less dialect.
- The XLA/GPU backend (ThunkSchedule and Thunk) is being refactored to not depend on XLA HLO data structures explicitly, in preparation for adopting LHLO.
Recent Talks
- 2020-07-09: MLIR Test Case Reducer Tool ; RFC Starting in-tree development of python bindings ; and RFC Rebooting C APIs for core IR
Recent Publications
In this paper, we describe a polyhedral approach to generate efficient CUDA kernels for matrix multiplication using inline assembly instructions for programming tensor cores on NVIDIA Volta GPUs. Furthermore, we build on this approach to generate fused kernels for computation sequences involving matrix multiplication and pointwise operations such as bias addition, ReLU activation etc. Experimental evaluation of these techniques show that automatically generated kernels can provide significantly better performance than manually tuned library implementations, with speedups ranging up to 2.55×, especially through kernel fusion, which reduces the overhead of data transfer through global memory
…
MLIR is an ongoing project which aims to unify the compiler infrastructure for machine learning by providing the ability to embed multiple IR dialects in it e.g. linear algebra dialect or an affine dialect, with a progressive lowering and transformation of IR dialects. Overall, we believe our work is complementary and could be integrated with many of these frameworks as a library for targeting tensor cores.
a Synthesizer of Fast Encrypted Routines, which uses the MLIR compiler framework to generate encrypted programs. The previous version of SyFER interfaced with Halide to perform cross-operation optimizations and rewrites on the GSW scheme, and the current version uses MLIR to vastly increase low-level optimizations, modularity at every level, and ease of use