MLIR News, 46th edition (10/30 - 11/12/2021)

See the previous published edition.
Welcome to the forty-sixth issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

Python Bindings

  • Numerous usability improvements: construction of affine expressions, moving operations between blocks and detaching them, optional operands in constructors.

Codegen

  • Padding and hoist padding were split out of the tiling pass into a separate pass.
  • Improve the buffer size computation for hoist padding to reduce the footprint of the hoisted buffers.
  • Use Fourier–Motzkin elimination to compute the size of the padding and of the hoisted buffers (they now share the same code).
  • Separate Comprehensive Bufferize from Linalg dialect + various other improvements.
  • LLVM type conversion now supports recursive types.
  • Rewrite vector.transpose lowering to a single unrolled vector.shuffle.
  • Add AVX2-specific lowering patterns and ongoing investigation of non-peak perf.
  • Various improvements to convolution lowering and vectorization.
  • Ongoing experiments to get to peak single thread CPU performance.
  • Sparse compiler progress:
    • Reduction “scalarization” now spans all for-/while-loops over all invariant dimensions; when vectorized, SIMD chains are formed.
    • Sparse tensor output now supports “injective” cases (those without reductions).
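The Fourier–Motzkin technique used for the padding and hoisted-buffer size computation can be illustrated with a small standalone sketch (plain Python, not the MLIR implementation; the `eliminate` helper and its input encoding are hypothetical): eliminating a variable from a system of inequalities leaves bounds on the remaining variables.

```python
from fractions import Fraction

def eliminate(constraints, j):
    """Fourier-Motzkin elimination of variable j from a system of
    inequalities of the form coeffs . x <= bound, given as
    (coeffs, bound) pairs. Returns an equivalent system over the
    remaining variables (column j is dropped)."""
    pos, neg, out = [], [], []
    for coeffs, bound in constraints:
        if coeffs[j] > 0:
            pos.append((coeffs, bound))
        elif coeffs[j] < 0:
            neg.append((coeffs, bound))
        else:
            # x_j does not appear: keep the constraint, minus column j.
            out.append((coeffs[:j] + coeffs[j + 1:], bound))
    # Pair every upper bound on x_j (positive coefficient) with every
    # lower bound (negative coefficient); their scaled sum no longer
    # mentions x_j.
    for pc, pb in pos:
        for nc, nb in neg:
            sp = Fraction(1, pc[j])
            sn = Fraction(1, -nc[j])
            combined = [a * sp + b * sn
                        for k, (a, b) in enumerate(zip(pc, nc)) if k != j]
            out.append((combined, pb * sp + nb * sn))
    return out
```

For example, eliminating x0 from {x0 <= x1, x0 >= 0, x1 <= 5} yields {x1 >= 0, x1 <= 5}, i.e. a constant bound of 5 on x1, which is the kind of answer a buffer size computation needs.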
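The vector.transpose rewrite mentioned above lowers a transpose to a single unrolled shuffle over the linearized elements. A minimal Python model of the index permutation involved (illustrative only; the function names are hypothetical and this is not MLIR code):

```python
def transpose_mask(rows, cols):
    """Shuffle mask that transposes a row-major rows x cols vector
    stored as a flat 1-D vector: output element i reads input mask[i]."""
    # Element (r, c) of the cols x rows result is input element (c, r).
    return [c * cols + r for r in range(cols) for c in range(rows)]

def shuffle(vec, mask):
    """One flat gather, analogous to a single unrolled vector.shuffle."""
    return [vec[i] for i in mask]
```

For a 2x3 input [0, 1, 2, 3, 4, 5], the mask [0, 3, 1, 4, 2, 5] produces the 3x2 transpose in one shuffle, with no intermediate extract/insert pairs.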


SPIR-V

  • The SPIR-V dialect was regenerated from the upstream spec to include new extensions and symbols.
  • More atomic ops were defined, and a pattern to convert shufflevector was added to the SPIR-V to LLVM conversion.

In the Ecosystem

IREE: An Experimental MLIR Execution Environment

  • Making progress towards using the upstream ComprehensiveBufferization in Linalg for dispatch-region code generation. (IREE currently uses a much simpler version of it, special-cased for IREE.)
  • The CPU backend is being evolved to perform better on x86. With (this) PR, IREE's x86 backend runs the transformer model in 30 ms (the TF+XLA baseline runs in 45 ms, as measured by us). An effort is underway to make the x86 backend mirror the sandbox as closely as possible; the sandbox is known to reach peak GEMM performance on a range of sizes.
  • CUDA backend:
    • Added support for tensor core code generation for the fp16 and tf32 types.
    • Basic tuning for tensor core performance was done, but it is still bound by copies to shared memory.

TensorFlow / MLIR-HLO

  • Code generation for softmax has landed. In some benchmarks the MLIR-based compiler is up to 6× faster than Eigen.
  • Vectorization for TensorFlow tensors of boolean type is broken: memrefs of i1 do not map directly to vectors of i1, because a memref of i1 is actually stored as a memref of byte-sized values. As a workaround to unblock experimentation, we added a pass that performs i1→i8 tensor type conversion early in the pipeline; however, this pass is too broad to use on actual workloads.
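For reference, the softmax being code-generated is the standard numerically stable formulation; a plain-Python version of the math (a sketch, not the generated code):

```python
import math

def softmax(xs):
    """Numerically stable softmax: shift by the max so exp() never
    overflows, then normalize the exponentials to sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The max-subtraction is what a compiler must preserve when fusing and vectorizing the reduction: without it, exp() overflows for large inputs even though the final ratios are well defined.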
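The i1 storage mismatch above comes down to memrefs storing one boolean per byte while i1 vectors model one element per bit. A tiny sketch of the kind of widening the workaround pass effects at the type level (illustrative Python with hypothetical names, not the actual pass):

```python
def widen_i1_to_i8(bits):
    """Model of the i1 -> i8 conversion: store each boolean as a full
    byte (0 or 1), matching how a memref of i1 is actually laid out."""
    return bytes(1 if b else 0 for b in bits)

def narrow_i8_to_i1(data):
    """Read the byte-per-element buffer back as booleans."""
    return [b != 0 for b in data]
```

Once every boolean tensor is i8, loads and stores line up with byte-addressed memory and vectorization proceeds as for any integer type; the cost is that the conversion applies to all i1 tensors, which is why the pass is too broad for real workloads.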

Recent Talks

LLVM Dev Meeting (videos will be on YouTube later):
