
MLIR News, 11th edition (7/11/2020)

See the previous published edition.

Welcome to the eleventh issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

Highlights

Many RFCs in-flight on Discourse:

MLIR Core

Infrastructure

Table-driven Infrastructure

Shape Dialect

  • Support landed for basic shape function reification and constraint generation for generating kernels as part of the TF kernel generator project (details to come on this project).
  • Discussion is under way on separating, within the dialect, the operations used to specify “shape transfer functions” from those used to compute shapes, to avoid confusion and simplify the descriptions.
  • Canonicalizations for shape compute operations related to rank, size and conversions.

Optimizations and Code Generation

  • BufferPlacement has been extended to support nested region control flow.
  • The General Basis Reduction algorithm (GBR) has been implemented to perform exact integer emptiness checks for FlatAffineConstraints. It includes an implementation of the Simplex from “Simplify: A Theorem Prover for Program Checking” by D. Detlefs, G. Nelson, J. B. Saxe.
  • Linalg contraction vectorization vectorizes multiple contraction patterns (batched, degenerate, …)
  • Linalg contraction vectorization uses vector.transfer + vector.contract
  • Forwarding patterns are added between linalg.copy and vector.transfer ops
  • A utility function was added to hoist scf.for values through scf.yield and iterArgs.
  • Linalg hoisting of redundant transfers makes use of scf.for + scf.yield.
  • Linalg custom ops are retired in favor of ODS-gen’d ones (batch matmul, matmul, matvec, doc)
  • End-to-end programmable codegen strategy for Linalg.matmul through vector.contract prototype in progress (in IREE and XLA) for CPU and SPIRV backends with encouraging early results. More details to come in the next few weeks.
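
To illustrate what an "exact integer emptiness check" decides, here is a minimal sketch. It does not implement GBR or the Simplex; it uses brute-force enumeration over a bounded box, which is only practical for tiny examples. The constraint representation (a coefficient tuple plus a constant, meaning "expression >= 0") and the helper names are hypothetical and are not the FlatAffineConstraints API.

```python
from fractions import Fraction
from itertools import product

# Hypothetical representation: (coeffs, const) means
#   sum(coeffs[i] * x[i]) + const >= 0

def satisfies(point, constraints):
    """Return True if the point satisfies every inequality."""
    return all(
        sum(c * v for c, v in zip(coeffs, point)) + const >= 0
        for coeffs, const in constraints
    )

def has_integer_point(constraints, num_vars, lo, hi):
    """Brute-force search for an integer point in the box [lo, hi]^num_vars."""
    return any(
        satisfies(point, constraints)
        for point in product(range(lo, hi + 1), repeat=num_vars)
    )

# 1 <= 3x <= 2: contains the rational point x = 1/2 but no integer.
thin_slab = [((3,), -1), ((-3,), 2)]
rational_witness = (Fraction(1, 2),)

# 0 <= x <= 4, 0 <= y <= 4, and x + y == 3 written as two inequalities.
segment = [((1, 0), 0), ((-1, 0), 4), ((0, 1), 0), ((0, -1), 4),
           ((1, 1), -3), ((-1, -1), 3)]

print(satisfies(rational_witness, thin_slab))    # True
print(has_integer_point(thin_slab, 1, -10, 10))  # False
print(has_integer_point(segment, 2, -10, 10))    # True
```

The first set shows why a rational feasibility test alone is not enough: it is non-empty over the rationals yet contains no integer point, which is the case an exact integer check has to detect.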

CPU codegen

  • Added a mechanism to set the fp reductions reassociate flag programmatically
    • XLA:CPU is the first client, in the pass that lowers to the vector dialect using MLIR
    • Other lowering paths and passes will be modified as well
    • Reduction/DOT operations can run 8-20x faster this way
  • Generalized vector operations; in BLAS1 terms we now have:
    • DOT as special case of vector.contract
    • AXPY as special case of vector.outerproduct
  • The AXPY as outer-product allowed unifying the MatMat and MatVec lowerings into a single method, which will simplify the heuristics we need for XLA:CPU
  • Multiple canonicalization and folding patterns added for Extract/Insert/Transpose/VectorTransfer and combinations of those
  • Vector unrolling now exposed as a composable pattern, backed by an OpInterface.
  • 2-D vector.contract -> vector.outerproduct lowering with support for all transposition cases.
  • Systematic bottom-up performance evaluation on vector-based contractions has started for AVX512 and AVX2. Encouraging results relative to peak in favorable cases for vector.contract -> vector.outerproduct. More details to come in the next few weeks.
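
Two of the items above can be sketched in plain Python. The first part shows how a matrix-matrix product can be written as a sum of outer products and a matrix-vector product as a sum of AXPYs, so both share one loop over the reduction index. The second part shows why reassociating a floating-point reduction needs explicit permission: combining per-lane partial sums can give a different result than a strict left-to-right sum. All function names are hypothetical and the code is not how MLIR or XLA implements these lowerings.

```python
# Plain-Python sketch: vectors are lists, matrices are lists of rows.
# Every function name below is hypothetical, not an MLIR or XLA API.

def column(matrix, k):
    """Column k of a row-major matrix."""
    return [row[k] for row in matrix]

def outer_product_acc(a, b, acc):
    """Rank-1 update: acc[i][j] + a[i] * b[j]."""
    return [[acc[i][j] + a[i] * b[j] for j in range(len(b))]
            for i in range(len(a))]

def axpy(alpha, x, y):
    """AXPY: alpha * x + y, an outer product whose second operand is a scalar."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def matmul_by_outer_products(a, b):
    """C = sum over k of (column k of a) outer (row k of b)."""
    acc = [[0] * len(b[0]) for _ in range(len(a))]
    for k in range(len(b)):
        acc = outer_product_acc(column(a, k), b[k], acc)
    return acc

def matvec_by_axpy(a, x):
    """y = sum over k of x[k] * (column k of a); same loop, scalar operand."""
    acc = [0] * len(a)
    for k in range(len(x)):
        acc = axpy(x[k], column(a, k), acc)
    return acc

def sequential_sum(values):
    """Strict left-to-right floating-point reduction."""
    total = 0.0
    for v in values:
        total += v
    return total

def lanewise_sum(values, lanes):
    """Reassociated reduction: one partial sum per lane, combined at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
data = [1e16, 1.0, -1e16, 1.0]

print(matmul_by_outer_products(A, B))   # [[19, 22], [43, 50]]
print(matvec_by_axpy(A, [1, 1]))        # [3, 7]
print(sequential_sum(data))             # 1.0
print(lanewise_sum(data, lanes=2))      # 2.0
```

The two sums differ because 1e16 + 1.0 rounds back to 1e16 in double precision. A compiler that must preserve the sequential result cannot split the reduction into lanes, which is why a reassociation flag has to be set before reductions can be vectorized this way.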

SPIR-V

  • SPIR-V to LLVM conversion (GSoC project) is continuing to make good progress: last month George converted most math operations and scalar/vector types; this month the goal will be to scale to memory operations, composite types, and common control flow cases. During the last two weeks, George landed support for converting constant op, more bitfield ops, (runtime) array / pointer types, and a few other op conversions and cleanups. There is also a nice SPIR-V to LLVM conversion manual doc under review.
  • Support for other SPIR-V use cases also sees new progress. Hazem defined spv.MatrixTimesMatrix op and Kareem introduced spv.CopyMemory.
  • SCF to SPIR-V conversions have been moved to the proper directory, and support for IfOp and ForOp yield values has been added.

Other

In the Ecosystem

IREE : An Experimental MLIR Execution Environment

  • Significant build upgrades and debugging, resulting in a newly complete Android getting started guide, and successful execution for CPU (LLVMAOT) and GPU (Mali) for all but one test. More is needed, but this represents a major milestone.
  • Community interest in enabling a Hexagon backend.

mlir-npcomp: Prototype for compiling numpy programs

  • Achieved milestone of e2e execution of an extracted tensor function from Python AST->MLIR Basicpy+Numpy Dialects->TCF->LinAlg->LLVM, running through the in-tree, minimal npcomprt (minimal reference runtime). Note that aside from some basic control flow and scalar/conditionals, this doesn’t do much else right now.

TensorFlow

  • Progress on the plan for using the MLIR code generation for XLA GPU:
    • Tuple is gone from LHLO. SortOp exercised the tuple-less dialect.
    • XLA/GPU backend (ThunkSchedule and Thunk) is being refactored to not depend on XLA HLO data structures explicitly, in preparation for adopting LHLO.

Recent Talks

Recent Publications

In this paper, we describe a polyhedral approach to generate efficient CUDA kernels for matrix multiplication using inline assembly instructions for programming tensor cores on NVIDIA Volta GPUs. Furthermore, we build on this approach to generate fused kernels for computation sequences involving matrix multiplication and pointwise operations such as bias addition, ReLU activation etc. Experimental evaluation of these techniques show that automatically generated kernels can provide significantly better performance than manually tuned library implementations, with speedups ranging up to 2.55×, especially through kernel fusion, which reduces the overhead of data transfer through global memory.

MLIR is an ongoing project which aims to unify the compiler infrastructure for machine learning by providing the ability to embed multiple IR dialects in it e.g. linear algebra dialect or an affine dialect, with a progressive lowering and transformation of IR dialects. Overall, we believe our work is complementary and could be integrated with many of these frameworks as a library for targeting tensor cores.

a Synthesizer of Fast Encrypted Routines, which uses the MLIR compiler framework to generate encrypted programs. The previous version of SyFER interfaced with Halide to perform cross-operation optimizations and rewrites on the GSW scheme, and the current version uses MLIR to vastly increase low-level optimizations, modularity at every level, and ease of use

These newsletters are really great, thank you to everyone who contributed!