
RFC: Enhancing Machine Retargetability in MLIR

MLIR supports the higher-level semantics of source-level abstractions by explicitly representing them in its IRs (dialects) at different levels, before these IRs are eventually translated to optimized low-level implementations targeting a variety of hardware platforms (e.g., groups of CPUs, GPUs, and TPUs). The description of the hardware platforms is currently supported within LLVM at the very late stage of CPU core-level code generation (instruction selection, scheduling, and register allocation). The existing description therefore includes only internal details of a single CPU core. At MLIR level, there is no systematic description of the hardware platform in terms of its memory hierarchy, network connections, and available accelerators (e.g., GPUs and TPUs). Such information, however, is critical to determining how to apply many of the source-level optimizations, e.g., tiling, fusion, and GPU/TPU kernel launches. While each individual optimization can be parameterized with various machine configurations, lacking a collective platform description interface reduces the reusability/retargetability of these optimizations.

Here we propose a machine description API, represented by a set of abstract interface classes, each of which includes a set of public virtual function declarations (with no implementation), to support generic queries about target machine platforms. A set of concrete implementations of the API, one for each type of hardware platform, can then provide access to hardware-level details in a generic fashion. For example, the generic query interface at the top level can be defined to include the following (a minimal C++ sketch follows the list).

  • typedef enum { THREAD_LEVEL, CORE_LEVEL, PROC_LEVEL, WORKGROUP_LEVEL, } ParallelizationLevel;
  • const ShortVector<const ComputeGroup&>& compute_groups_at( ParallelizationLevel ): returns the compute groups at the given parallelization level (an empty vector is returned if no such group exists at the given level). The ComputeGroup type will be defined separately and is responsible for answering queries specific to its internal settings.
  • const NetworkConnections& network_connections_at( ParallelizationLevel ): returns a description of the network connections between different compute groups at the given parallelization level.
  • const MemoryHierarchy& memory_hierarchy_at(ParallelizationLevel): returns a description of the memory hierarchy at the given parallelization level.
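
For concreteness, here is a minimal C++ sketch of this top-level interface. All names are illustrative and simply mirror the bullets above; ShortVector is approximated with std::vector, and the container holds pointers rather than references so the code is valid C++.

#include <vector>

namespace machine_desc {

// Parallelization levels at which the platform can be queried.
enum ParallelizationLevel { THREAD_LEVEL, CORE_LEVEL, PROC_LEVEL, WORKGROUP_LEVEL };

class ComputeGroup;       // a group of CPU/GPU/TPU cores (defined separately)
class NetworkConnections; // interconnect between compute groups
class MemoryHierarchy;    // caches/scratchpads/DRAM description

// Abstract top-level query interface; one concrete subclass per platform.
class TargetPlatform {
public:
  virtual ~TargetPlatform() = default;

  // Compute groups at the given level; empty if none exist at that level.
  virtual const std::vector<const ComputeGroup *> &
  compute_groups_at(ParallelizationLevel level) const = 0;

  // Network connections between compute groups at the given level.
  virtual const NetworkConnections &
  network_connections_at(ParallelizationLevel level) const = 0;

  // Memory hierarchy visible at the given level.
  virtual const MemoryHierarchy &
  memory_hierarchy_at(ParallelizationLevel level) const = 0;
};

} // namespace machine_desc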

The ComputeGroup, NetworkConnections, and MemoryHierarchy abstractions can in turn be defined in a similar fashion, each responsible for providing retargetability to MLIR dialect translations, parallelization/scheduling strategies (e.g., task partitioning and fusion), and cache-oriented optimizations (e.g., tiling, loop unroll & jam). A ComputeGroup can be a group of CPU/GPU/TPU cores, or a single such unit, defined via a ComputeUnit class that inherits from ComputeGroup as a special case. The interface of a ComputeUnit is fundamentally responsible for supporting retargetability of MLIR dialect translations and can be defined to include the following (a sketch follows the list).

  • typedef enum {CPU, GPU, TPU} ComputeUnitType;
  • ComputeUnitType get_type(): returns the type of the unit.
  • unsigned peak_MFLOPS(): returns the peak MFLOPS rate of the unit.
  • const MemoryHierarchy& memory_hierarchy(): returns a description of the memory hierarchy within the compute unit.
  • unsigned expected_MFLOPS(const ComputeKernel&): returns the expected MFLOPS when executing a given compute kernel (defined below).
  • const ShortVector<ComputeKernel*>& supported_kernels(): returns a list of intrinsic compute kernels/operations natively supported by this unit.
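
A corresponding sketch of ComputeUnit, again with illustrative names only: ComputeKernel is left opaque, and ComputeGroup is stubbed out just enough to show the inheritance relationship.

#include <vector>

namespace machine_desc {

// Kind of processor a compute unit represents.
enum ComputeUnitType { CPU, GPU, TPU };

class MemoryHierarchy; // as in the top-level sketch
class ComputeKernel;   // an intrinsic operation supported by a unit (opaque here)

// A ComputeGroup may contain nested groups or units; details elided here.
class ComputeGroup {
public:
  virtual ~ComputeGroup() = default;
};

// A ComputeUnit is the single-unit special case of a ComputeGroup.
class ComputeUnit : public ComputeGroup {
public:
  virtual ComputeUnitType get_type() const = 0;

  // Peak floating-point throughput of the unit, in MFLOPS.
  virtual unsigned peak_MFLOPS() const = 0;

  // Memory hierarchy internal to the unit (registers, local caches, ...).
  virtual const MemoryHierarchy &memory_hierarchy() const = 0;

  // Expected throughput when executing the given kernel on this unit.
  virtual unsigned expected_MFLOPS(const ComputeKernel &kernel) const = 0;

  // Intrinsic kernels/operations natively supported by this unit.
  virtual const std::vector<ComputeKernel *> &supported_kernels() const = 0;
};

} // namespace machine_desc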

Here MFLOPS is used as the metric describing the performance (computation speed) of the unit, since floating-point operations are considered the dominant factor in tensor workloads (e.g., TensorFlow applications). Other metrics (e.g., MIOPS, control-flow overhead) can be added as needed in future extensions. A new type, ComputeKernel, is introduced to represent the intrinsic operations supported by a compute unit (e.g., a CPU, GPU, or TPU). The collection of ComputeKernel objects returned by each compute unit can be seen as another virtual interface describing the MLIR dialect supported by the respective processor. This new interface allows a more retargetable, cost-oriented translation process on top of the existing pattern-driven translation process in MLIR.

In summary, the overarching goal of this proposal is to provide a virtual interface that can represent the common features of a wide variety of machine architectures, today and in the future, while accommodating their differences in effective ways. The interface aims to enable more machine-retargetable dialect translation, parallelization, and memory-level optimizations. Many details are yet to be worked out; we currently have the skeleton of an initial design which we will try to implement. CPUs, GPUs, and TPUs are the main considerations for now, but the interface can later be extended to support other architectures, e.g., embedded systems and distributed systems.

We would like to solicit feedback from the MLIR community on the following aspects.

  • Opinions and any comments you may have regarding the general proposal. Let us know if you have interest in contributing to the effort or if you have specific ideas and suggestions on how to proceed with the proposed project.
  • What are your suggestions on integrating the machine platform descriptions into MLIR? What additional ideas or details would you like to suggest?

Hi Qing,

I’m really happy to see this get developed. I don’t know much about your current plans, but here are a couple of things to consider:

First, I think this probably could and should be built into the existing LLVM TargetMachine classes. For example, adding a new TargetMemoryHierarchy class and integrating it into the existing target description framework should be straightforward. If you don’t do this, we’ll end up evolving a parallel set of abstractions that will differ, and targets that want to implement both an LLVM backend and an MLIR lowering will have to implement two different things.

The LLVM code generator is modular: you can implement some of the interfaces without implementing others. For example, a target can provide syntax and encoding information for its assembly language but no code-generator information; that is sufficient to get an assembler and disassembler.

Second, when building out new hardware abstractions, I’d recommend starting by building the code that will consume the target description. Target descriptions always work in service of enabling retargetable algorithms in the compiler, and the two need to be co-designed.

In any case, I think this is great and very exciting to explore and start pushing!

-Chris

Hey,

I think we should also not construe “target” as too low-level: TFLite is conceptually just an “interesting” ISA backend, and the target concept can depend on the level being operated on (e.g., target(llvm(x86-avx-64), tflite(cpu-v6, gpu-proc-321), interconnect(a-b{12mbps},...)) could be an input description to compilation saying we will be targeting a combination of CPU(s) and GPU(s), with some parts natively code-generated and some handled via a runtime, some interconnect fabric, etc.).

The lower-level parts could be incorporated into LLVM as Chris suggested, while the higher-level ones that don’t fit there could be used by the different optimization, scheduling, and lowering passes. The target itself then has a nesting, and it is almost as if layers are removed as we move towards codegen :slightly_smiling_face:

I would also say that we should not have privileged processing units here; e.g., a CPU should be just as “custom” as a TPU or an IPU, and I’d even say runtimes are a target, as they expose a compute capability. I think considering it more generally adds value for the different parts, while encouraging reuse at the lower end.

– Jacques


I also think that we can look at this from a generic point of view: there is information that is relevant for the LLVM TargetMachine description, but there are also pieces of information that wouldn’t make sense there (like the TF->TFLite converter, for instance, or other esoteric compilers).
So if we abstract away “what properties of a ‘Target’ do we want to model”, or even ask what a ‘Target’ is in the first place, there may be a place for a generic concept of TTI (Target Transform Info) in MLIR (in a similar way to how we came up with operation/dialect interfaces). I’m interested in thinking about this layer and how it would fit in the infrastructure.

Thanks for the helpful discussions!

Perhaps I should start with one case study and go from there? I agree it would be good to start from an optimization that consumes the target description. Any recommendations on which optimization to select from MLIR that could really benefit from some refactoring to use a generic target description API so that it can be reused for multiple purposes? It seems dialect translation is of some interest here. Or would something like tiling or unroll and jam be better options as a starting point?

The LoopTiling pass (lib/Dialect/Affine/Transforms/LoopTiling.cpp) is one candidate where the tile sizes could be set based on cache size information (and innermost or other tile sizes based on vector width, etc.). The unroll-and-jam pass that exists as part of affine transforms isn’t really driven by a cost model or dependence information to even check validity - it’s just a test pass, but easy to complete given the infrastructure needed already exists. I’d recommend unroll-and-jam as that’d be an easier starting point for your purposes. There is also an affine vectorization pass that I plan to contribute (in my fork, but not yet upstream) which I think would have been a good case for using target transform info - it’ll be a few weeks before I submit it for review.
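
For illustration, here is a rough sketch of how a tiling pass might derive a tile size from a cache-size query. The CacheLevelInfo struct and the heuristic (a simple square-tile working-set estimate) are hypothetical and not MLIR’s actual cost model; in the proposal the cache information would come from the MemoryHierarchy abstraction.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical per-level cache description, e.g. obtained from
// TargetPlatform::memory_hierarchy_at(...).
struct CacheLevelInfo {
  std::uint64_t size_bytes; // capacity of this cache level
  unsigned line_bytes;      // cache line size
};

// Pick a square tile size T for a matrix-style loop nest so that roughly
// three T x T tiles (two inputs, one output) fit in the given cache level.
// This is a toy heuristic, purely to show how a target query could be used.
inline unsigned pickTileSize(const CacheLevelInfo &cache, unsigned elem_bytes,
                             unsigned default_tile = 32) {
  if (cache.size_bytes == 0 || elem_bytes == 0)
    return default_tile; // no target info: fall back to a fixed tile size
  double budget = static_cast<double>(cache.size_bytes) / (3.0 * elem_bytes);
  unsigned tile = static_cast<unsigned>(std::sqrt(budget));
  // Round down to a multiple of the elements per cache line, but keep >= 1.
  unsigned per_line = std::max(1u, cache.line_bytes / elem_bytes);
  tile = std::max(per_line, tile - tile % per_line);
  return std::max(1u, tile);
}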

Thanks for the suggestions. I will get started on implementing something concrete next week. My plan right now is to start with a top-level TargetPlatform description and then focus on a memory hierarchy abstraction, while using it to guide the loop tiling optimization at lib/Dialect/Affine/Transforms/LoopTiling.cpp. This will give me something concrete to work through. Once I have the code, I will post again regarding all the dependences and where the best place to put it might be (MLIR vs. LLVM).

Inspired by the Design Meeting today, I did some independent brainstorming on this. One concern I have is that an MLIR module may contain code with different targets, or even multiple copies of the same code optimized under different assumptions (e.g., case splitting). My thinking is to represent this concept using an operation that would implement a special Trait indicating that it represents a target:

module {
  tf.GPUtarget (Warps = 1024) {
    func @myfunc() {…}
  }
  x86.target (Cachesize = 4MB) {
    func @myfunc() {…}
  }
}

Targets could implement interfaces or traits that could drive transformations. For instance, in the design meeting this morning, a tensor transformation was discussed that depends on a target assumption that tensors are contiguously represented in memory. This could be exposed as a trait/interface of the tf.GPUtarget op above, which would then become a predicate for those transformations; the transformations would not match operations contained in an x86.target. Parameters could also be specified on the target (as shown above) to guide transformations. A toy sketch of this gating idea follows.
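
In this sketch, plain C++ stands in for MLIR ops and traits; TargetScope, TargetProps, and the contiguous_tensors property are all hypothetical, purely to show how a transformation could be gated on a property of the enclosing target op.

#include <optional>
#include <string>

// Toy model of a "target op": a named scope carrying queryable properties.
struct TargetProps {
  bool contiguous_tensors = false;       // hypothetical trait-like property
  std::optional<unsigned> cache_size_mb; // e.g., Cachesize on x86.target
  std::optional<unsigned> warps;         // e.g., Warps on tf.GPUtarget
};

struct TargetScope {
  std::string target_name; // e.g., "tf.GPUtarget" or "x86.target"
  TargetProps props;       // parameters attached to the target op
};

// A transformation only fires when the enclosing target guarantees the
// layout assumption it relies on (contiguous tensors, in this example).
inline bool canApplyContiguousTensorRewrite(const TargetScope &enclosing) {
  return enclosing.props.contiguous_tensors;
}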

Steve
