MLIR News, 41st edition (8/21 - 9/3/2021)

See the previous published edition
Welcome to the forty-first issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

MLIR Core

Infrastructure

  • The OpAsmInterface now allows defining a default dialect for the regions nested under an operation, which makes it possible to elide the dialect prefix in front of every operation in those regions. For example, a TensorFlow graph operation can define its body using bare operation names, without repeating the tensorflow prefix in front of each individual operation.
  • The Python bindings now allow building CFGs (see the sketch below).
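A minimal sketch of what CFG construction through the Python bindings can look like. This is illustrative only: the dialect and builder names below (func, cf, and their generated builders) follow more recent upstream bindings and may differ from the API available at the time of this newsletter, when the branch operations still lived in the std dialect.

```python
# Illustrative sketch: build a two-block CFG with the MLIR Python bindings.
# Dialect names (func, cf) and builder signatures follow recent upstream
# bindings and may differ from the 2021-era API.
from mlir.ir import (Context, FunctionType, InsertionPoint, IntegerType,
                     Location, Module)
from mlir.dialects import cf, func

with Context(), Location.unknown():
    module = Module.create()
    i32 = IntegerType.get_signless(32)
    with InsertionPoint(module.body):
        f = func.FuncOp("passthrough", FunctionType.get([i32], [i32]))
        entry = f.add_entry_block()
        # A second block with a single i32 block argument.
        exit_block = f.body.blocks.append(i32)
        with InsertionPoint(entry):
            cf.BranchOp([entry.arguments[0]], exit_block)  # entry -> exit_block
        with InsertionPoint(exit_block):
            func.ReturnOp([exit_block.arguments[0]])
    print(module)
```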

Codegen

  • Sparse compiler progress:
    • Implemented all sparse tensor to sparse tensor conversions. Following the proposal of [Chou], implementing the full conversion matrix is avoided by using COO as an intermediate format: for MLIR’s “one size fits all” implementation, every conversion can be implemented with just a fromCOO() and toCOO() method (see the illustrative sketch after this list). We can later provide specializations for particular cases if higher performance is needed.
    • Improved the iteration graph topological-sort heuristic (pushes unrelated loops into sparse loops for better asymptotic complexity).
  • OpenMP translation now supports loops with reductions.
  • Conversion to LLVM IR is now 10x faster for large constants.
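To make the conversion strategy in the sparse compiler item above concrete, here is a small illustrative Python sketch; it is not the MLIR sparse-tensor runtime, and the class and helper names are made up for the example. The point is that with a COO intermediate, N formats need only N toCOO()/fromCOO() pairs rather than N*N direct conversion routines.

```python
# Illustrative sketch of conversion through a COO intermediate; not the MLIR
# sparse compiler runtime. Each format implements only toCOO()/fromCOO().
from typing import List, Tuple

Coord = Tuple[int, ...]
COO = List[Tuple[Coord, float]]  # list of (coordinates, value) pairs


class CSR:
    """Compressed sparse row, standing in for one concrete sparse format."""

    def __init__(self, shape, pos, crd, vals):
        self.shape, self.pos, self.crd, self.vals = shape, pos, crd, vals

    def toCOO(self) -> COO:
        return [((i, self.crd[k]), self.vals[k])
                for i in range(self.shape[0])
                for k in range(self.pos[i], self.pos[i + 1])]

    @classmethod
    def fromCOO(cls, coo: COO, shape):
        pos = [0] * (shape[0] + 1)
        crd, vals = [], []
        for (i, j), v in sorted(coo):  # sort into row-major order
            pos[i + 1] += 1
            crd.append(j)
            vals.append(v)
        for i in range(shape[0]):      # prefix-sum row counts into offsets
            pos[i + 1] += pos[i]
        return cls(shape, pos, crd, vals)


def convert(src, dst_format, shape):
    """Any source/destination pair goes through the COO intermediate."""
    return dst_format.fromCOO(src.toCOO(), shape)
```

A CSC or blocked format would plug into the same convert() simply by providing its own two methods; specializing particular conversion pairs later, as the item above notes, does not change this interface.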

SPIR-V

  • Improvements to how image operands are represented in the SPIR-V dialect for graphics use cases.

In the Ecosystem

IREE: An Experimental MLIR Execution Environment

  • The Vulkan and VMVX backends now have a proof of concept for dynamic-shape code generation; support in the CUDA backend is in progress.
  • The LLVMCPU backend now has access to llvm::TargetTransformInfo to query device properties and use them during compilation.
  • Looking to enable de-tensoring by default in IREE’s compilation stack (PR). This addresses a major performance issue in IREE when lowering from MHLO: since everything, including values involved in control flow, is represented using tensors, and since IREE treats operations on tensors as compute operations, simple operations that merely compare two scalar values (represented as 0-D tensors in MHLO) would end up on the device, adding unnecessary round-tripping. De-tensoring such operations addresses this issue.
  • Early comparison of IREE’s matmul codegen for CUDA against cuBLAS:

TensorFlow / MLIR-HLO

Kernel codegen:

  • Bug fixes and improvements for JIT mode, including:
    • We fixed a bug in the CUDA glue code used by MLIR-generated kernels. It used to have a process-global cache for loaded modules, which worked well for AOT kernels that are registered globally at process start time. The upcoming jitted kernels’ lifetime is managed by TensorFlow’s resource system, so we had to change the cache to follow the same approach (see the sketch after this list).
    • We made the use of jitted kernels configurable via an environment flag.
    • We reduced the number of allocations required by the unranked calling convention.
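As a generic illustration of the lifetime issue in the first bullet above (this is not the actual TensorFlow kernel-generator code, and the names are invented for the sketch): a process-global cache of loaded modules is fine for AOT kernels registered at process start, but once kernels are jitted, the cache has to live and die with the resource that owns the kernel.

```python
# Generic sketch of the caching change described above; not TensorFlow code.
# All names here are invented for illustration.

_GLOBAL_MODULE_CACHE = {}  # old scheme: a process-global cache, lives forever


class JitKernelResource:
    """New scheme: the module cache is owned by the resource, so cached
    modules go away when the resource (and thus the jitted kernel) does."""

    def __init__(self, load_module):
        self._load_module = load_module  # callback that loads a GPU module
        self._modules = {}               # cache scoped to this resource

    def get_module(self, key, binary):
        if key not in self._modules:
            self._modules[key] = self._load_module(binary)
        return self._modules[key]

    def close(self):
        self._modules.clear()  # unloading is tied to the resource lifetime
```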

mlir-hs

  • Skeleton lowering from Dex to MLIR via mlir-hs (link)

Recent Talks

  • 2021-08-26: High Performance GPU Tensor Core Code Generation for Matmul Using MLIR; slides - recording (see also the research paper below)

Recent Publications

This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state-of-the-art in high-performance deep learning today is primarily driven by manually optimized highly tuned libraries. The approach to develop such libraries is often not modular or reusable to the same extent that compiler infrastructure like LLVM is. Manual optimization typically does not use a standard intermediate representation (IR), although the optimizations performed can be encoded as a sequence of transformation steps and customized passes on an IR. Hand tuning may also miss exploration of design points only reachable easily by automatic code generation. We believe that until the recent introduction of MLIR (Multi-level intermediate representation), IR infrastructure was not geared to tackle the problem of automatic generation of domain-specific libraries in an effective manner. In particular, it was hard to represent and transform compute abstractions at high, middle, and low levels using a single IR.
With suitable abstractions in MLIR, we build an experimental lowering pipeline that is able to automatically generate code for matrix-matrix multiplication on NVIDIA GPUs targeting its tensor cores. On a set of problem sizes we evaluated, initial performance results show that we are able to attain performance that is 95-119% and 80-160% of CuBLAS for FP32 and FP16 accumulate respectively on NVIDIA’s Ampere microarchitecture-based Geforce 3090 RTX. We believe that these results could be used as motivation for further research and development on automatic code and library generation using IR infrastructure for similar specialized accelerators.

We present a multi-level quantum-classical intermediate representation (IR) that enables an optimizing, retargetable, ahead-of-time compiler for available quantum programming languages. To demonstrate our architecture, we leverage our proposed IR to enable a compiler for version 3 of the OpenQASM quantum language specification. We support the entire gate-based OpenQASM 3 language and provide custom extensions for common quantum programming patterns and improved syntax. Our work builds upon the Multi-level Intermediate Representation (MLIR) framework and leverages its unique progressive lowering capabilities to map quantum language expressions to the LLVM machine-level IR. We provide both quantum and classical optimizations via the MLIR pattern rewriting sub-system and standard LLVM optimization passes, and demonstrate the programmability, compilation, and execution of our approach via standard benchmarks and test cases. In comparison to other standalone language and compiler efforts available today, our work results in compile times that are 1000x faster than standard Pythonic approaches, and 5-10x faster than comparative standalone quantum language compilers. Our compiler provides quantum resource optimizations via standard programming patterns that result in a 10x reduction in entangling operations, a common source of program noise. Ultimately, we see this work as a vehicle for rapid quantum compiler prototyping enabling language integration, optimizations, and interoperability with classical compilation approaches.

We introduce QSSA, a novel quantum IR based on static single assignment (SSA) that enables decades of research in compiler optimizations to be applied to quantum compilation. QSSA models quantum operations as being side-effect-free. The inputs and outputs of the operation are in one-to-one correspondence; qubits cannot be created or destroyed. As a result, our IR supports a static analysis pass that verifies no-cloning at compile-time. The quantum circuit is fully encoded within the def-use chain of the IR, allowing us to leverage existing optimization passes on SSA representations such as redundancy elimination and dead-code elimination.