MLIR News, 21st edition (11/28/2020)

See the previous published edition.

Welcome to the twenty-first issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

MLIR Core

Infrastructure

  • Function.h and Module.h are in the process of being removed in favor of BuiltinOps.h
  • Side effect instances can now specify an Attribute containing additional effect parameters.
  • Side effect instances can now provide a SymbolRefAttr as the value being affected.

Optimizations and Code Generation

  • An RFC is open for discussion about adding dialects for modeling the Arm Neon and SVE instruction sets.
  • The prototype sparse compiler has been committed:
    • Some sanity check benchmarking shows “on par” performance for a couple of sparse kernels and matrices compared to the Eigen library.
  • A parallelization strategy was also added:
    • Provides control over which loops should be expressed with “scf.parallel” (inner/outer loops, dense/sparse loops); see the sketch after this list.
    • Some sanity check benchmarking using the current in-tree async lowering for parallel loops exhibits reasonable speedups over sequential sparse code.
  • Planned next: vectorization strategy, invariant code hoisting, storage type control
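
To illustrate what the strategy controls, here is a minimal, hand-written sketch of a loop expressed with “scf.parallel” (illustrative only: the function is made up and the std-dialect syntax is approximate):

```mlir
// Illustrative only: a dense vector scaling written as an scf.parallel loop.
// Whether a given loop becomes scf.parallel (vs. a sequential scf.for) is
// what the parallelization strategy controls.
func @scale(%A: memref<?xf32>, %B: memref<?xf32>, %a: f32) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %n = dim %A, %c0 : memref<?xf32>
  scf.parallel (%i) = (%c0) to (%n) step (%c1) {
    %0 = load %A[%i] : memref<?xf32>
    %1 = mulf %0, %a : f32
    store %1, %B[%i] : memref<?xf32>
    scf.yield
  }
  return
}
```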

SPIR-V

  • Various cleanups were introduced to improve consistency in the SPIR-V dialect: spv._* ops were renamed to spv.mlir.* ops to follow the general convention in MLIR.
  • The module combiner can now unique global variables, specialization constants, and functions.

Other

  • OpenMP: Added an operation for the OpenMP worksharing loop construct, omp.wsloop. A conversion pass from SCF parallel loops to OpenMP parallel + worksharing loops was also added; a rough sketch of the conversion is shown below. Patches for pretty-printing/parsing and for lowering to LLVM IR are in progress.
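
As a rough illustration only (not taken from the patches; the omp.wsloop textual syntax is approximate since the pretty-printer is still in progress, and “test.work” is a placeholder op), the conversion is expected to turn an scf.parallel loop into an OpenMP parallel region containing a worksharing loop along these lines:

```mlir
// Illustrative sketch; syntax approximate.
// Input: a parallel loop in the SCF dialect.
func @before(%lb: index, %ub: index, %step: index) {
  scf.parallel (%i) = (%lb) to (%ub) step (%step) {
    "test.work"(%i) : (index) -> ()
    scf.yield
  }
  return
}

// Expected shape after conversion: a parallel region that creates the team
// of threads, wrapping a worksharing loop that divides the iteration space
// among them.
func @after(%lb: index, %ub: index, %step: index) {
  omp.parallel {
    omp.wsloop (%i) : index = (%lb) to (%ub) step (%step) {
      "test.work"(%i) : (index) -> ()
      omp.yield
    }
    omp.terminator
  }
  return
}
```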

In the Ecosystem

CIRCT: Circuit IR Compilers and Tools, aka ‘MLIR for hardware’

  • A pass was added to flatten FIRRTL bundle types, making it simpler for other CIRCT components to interface with FIRRTL.
    • Notably, this opens up a path for some Handshake modules that previously failed to be emitted as SystemVerilog.

mlir-npcomp: Prototype for compiling numpy programs

TensorFlow / MLIR-HLO

Recent Talks

Hi,

You mentioned “Some sanity check benchmarking using the current in-tree async lowering for parallel loops exhibits reasonable speedups over sequential sparse code” in the Optimizations and Code Generation section. Could you please explain what “in-tree async lowering” is and how to do it?

Thank you so much!

You can see an example here: https://github.com/llvm/llvm-project/blob/master/mlir/test/mlir-cpu-runner/async.mlir

And here is a test with parallel loop lowering to the async primitives: https://github.com/llvm/llvm-project/blob/master/mlir/integration_test/Dialect/Async/CPU/test-async-parallel-for-2d.mlir
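
Roughly speaking, the “in-tree async lowering” rewrites scf.parallel loops into tasks expressed with the Async dialect (this is what the async-parallel-for pass exercised by the second test does); those tasks are then lowered further into calls to the async runtime that ships with mlir-cpu-runner. Here is a minimal, hand-written sketch of the primitives involved (syntax approximate, “test.chunk” is a placeholder op):

```mlir
// Rough sketch (syntax approximate): two chunks of work executed as async
// tasks on the runtime's thread pool, then awaited before returning.
func @two_tasks() {
  %t0 = async.execute {
    "test.chunk"() {half = "first"} : () -> ()
    async.yield
  }
  %t1 = async.execute {
    "test.chunk"() {half = "second"} : () -> ()
    async.yield
  }
  async.await %t0 : !async.token
  async.await %t1 : !async.token
  return
}
```

In the integration test this structure is generated automatically from the scf.parallel loop rather than written by hand.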

Hi Mehdi,

I looked at the example you sent me and tried https://github.com/llvm/llvm-project/blob/master/mlir/integration_test/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir. I ran “with-async” (i.e. the 1st mlir-opt command in the mlir code) and “no-async” (i.e. the 2nd mlir-opt command in the mlir code) separately: the “with-async” execution time is 0.245081, while the “no-async” execution time is 0.238621. So “with-async” shows no performance improvement in this case, and actually a tiny performance regression.

I also tried the same mlir code with all matrix sizes changed from 1024x1024 to 83334x83334: the “with-async” execution time is 4362.45, while the “no-async” execution time is 4940.47, so in that case async gives a small performance improvement.

I am running on an Intel Xeon machine with 2 sockets, 12 cores per socket, and 2 threads per core.

Is it possible for you to tell me which example your group used to get the “reasonable speedups” with the async lowering? Is it in the GitHub repo? I really want to learn this!

Thank you so much!

The runtime implementation that is currently in-tree isn’t intended to showcase any performance at the moment: it is a fairly naive thread pool right now.
I don’t know if @ezhulenev has an example that would still fit and show some speedup there.

@rqtian I disabled threading in https://reviews.llvm.org/D92368 because of problems with dynamic library unloading. With a thread pool, that test shows about a 3x speedup for 1024x1024 with 4 threads.

One option to fix it is to build an mlir-cpu-runner with a statically linked runtime and bind the Async API symbols at runtime (example from TFRT: https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/jit/async_runtime_api.cc#L61).

Or to figure out how to do proper dynamic library unloading that waits for all threads to stop before shutdown.

@rqtian I submitted D94346 ([mlir] AsyncRuntime: use LLVM ThreadPool to run async tasks), which brings parallel execution back to async; on my desktop I see execution times of 0.318219 vs 0.126553 in the microbench-linalg-async-parallel-for.mlir test.