Authors: Akash Kothari (UIUC), Abdul Rafae Noor (UIUC), Dounia Khaldi (Intel), Vikram Adve (UIUC), Yuanke Luo(Intel), Sudipta Sengupta (Amazon AWS), Milind Girkar (Intel), Charith Mendis (UIUC)
Diverse hardware vendors are developing new hardware support for (mostly dense) tensor computations, which have become increasingly important for machine learning applications. These include both ISA extensions on CPUs and GPUs (such as Intel AMX, Power MMA, NVIDIA’s tensor cores, AMD’s matrix cores, and Qualcomm’s HVX vector ISA) and dedicated accelerators for compute offload (such as NVIDIA’s NVDLA, Amazon’s Inferentia and Trainium, and numerous ML accelerators from smaller companies). While ML workloads are the primary motivation and likely to be the dominant use cases, other tensor-intensive application domains, such as image processing, scientific computing, quantum simulations, financial modeling, and others can benefit from this hardware support as well, via languages like C++, DPC++, Julia, Fortran, Halide, CUDA, OpenCL, and others.
LLVM can play a crucial role in making it easier for these vendors to create optimizing compiler back-ends for their emerging hardware (if the existing vector and matrix support in LLVM were generalized to support tensor operations). LLVM is already widely-used today by many of the vendors that develop these tensor architectures, e.g., to target CPUs and GPUs. LLVM is highly retargetable, by design. For the CPU targets, LLVM allows an integrated code generation framework for tensor operations with optimized intermixing of scalar, 1-D vector and 2-D matrix operations in the same code section (e.g., loop body). And LLVM has front-ends for a wide range of high-level languages, including essentially all the languages used widely for relevant application domains today.
No existing infrastructure we know of meets these needs. MLIR is likely the best option, and we believe it is entirely complementary to LLVM. MLIR provides strong support for high-level tensor operations in TOSA, relevant optimizations in Affine and Linalg, and lowering paths to accelerators, GPUs and (via the LLVM dialect) CPUs. Crucially, however, MLIR does not have a low-level code generation framework that is retargetable to diverse hardware: it relies on LLVM for this purpose. If LLVM could be extended with tensor operations and a corresponding retargetable tensor code generation framework, MLIR could leverage this as well. Moreover, there are enough vendors and also languages that rely heavily on LLVM (but don’t use MLIR) that it seems worthwhile to have a high-quality tensor code generation framework in both LLVM as well as in MLIR. Ideally, both systems would largely share the same code.
The broad goal of our project is to add a retargetable tensor code generation framework to LLVM. We are currently working on a prototype implementation with our collaborators at Amazon AWS, Intel, IBM and Qualcomm. This RFC focuses on the first stage: extending the LLVM IR with tensor operations which we refer to as TLX (Tensor LLVM eXtensions).
- A unified retargetable code generation and optimization framework for LLVM to target diverse tensor architectures with a common set of IR extensions, instead of using target-specific solutions.
- (Subject of this RFC.) A single set of target-agnostic tensor extensions in LLVM IR that higher-level tensor code generation frameworks such as XLA, Halide, TVM, MLIR, etc. can target, instead of lowering to target-specific intrinsics in LLVM, while retaining the optimizations in these high-level frameworks.
- A pathway for LLVM-based languages such as C/C++, DPC++, Fortran, Rust, Julia, etc. that do not have front ends for compiler systems like MLIR, TVM, XLA, etc. to target modern tensor architectures by lowering to our tensor extensions in LLVM.
- Target-independent optimizations (e.g. peephole optimizations and generic SSA-based optimizations) and also flexible code generation capabilities in LLVM that could involve mixing instructions operating on vector and rectangular registers, and involve developing cost models which could help reduce register spills and maximize usage of available hardware resources.
- Contribute our tensor extensions (this RFC) and retargetable code generation framework (as a followup) to the LLVM project for the community to experiment with and provide feedback.
Google doc with the full proposal is here.