LLVM Discussion Forums

[RFC] First-Party Support for LLVM Types

First-Party Support for LLVM Types

Motivation

Currently, the LLVM dialect type in MLIR wraps instances of llvm::Type * and delegates printing, parsing and uniquing to LLVM itself. The advantages of this approach are as follows:

  • Exact correspondence with the LLVM IR type system; any change to the LLVM IR type system is reflected automatically.
  • Identical syntax if wrapped in !llvm<" ">
  • Trivial translation logic.

However, this approach has numerous drawbacks, in particular:

  • The modeled type system is incomplete. More specifically, it is currently impossible to roundtrip identified structs. This is due to llvm::parseType expecting to be called in a context where all identified structs have been already defined in a module.
  • The LLVM dialect has to own an LLVMContext (and an llvm::Module) which is not thread-safe, unlike MLIRContext. This in turn requires additional synchronization when working with the LLVM dialect, which has led to multiple issues when attempting to achieve parallel compilation (e.g. D78207, D78580).
  • Changes to LLVM IR type system may break the LLVM dialect in a hard-to-detect way because LLVM-driven parsing will almost always work, limiting the first advantage listed above.
  • The type system is poorly integrated with the isa-based logic and thus poorly reflected in the ODS definitions of LLVM dialect Ops, leaving them under-verified and thus prone to runtime assertions inside LLVM.

An alternative is to replicate the LLVM IR type system within MLIR, providing first-party modeling for all LLVM types as MLIR dialect types. This will address the drawbacks and may lead to only a limited loss of the current advantages.

Design Questions and Decisions

Closedness

It might make sense to reuse compatible standard types in the LLVM dialect to avoid unnecessary duplication. However, only few types fully match between LLVM IR and MLIR standard: MLIR integer types can have signed semantics, unlike those in LLVM IR; MLIR floating point types have the same semantics by different syntax; MLIR vector types can be multi-dimensional while LLVM IR vector types cannot, but can be scalable. The complexity of specifying and keeping in mind the details of supported types outweighs that of partial duplication.

The LLVM dialect is most often the “sink” dialect in the lowering pipeline that is intended for further conversion to LLVM IR. As such, allowing LLVM dialect container types (structs, arrays, pointers) to refer to non-LLVM dialect types has limited usefulness that does not justify the complexity of supporting such types. A specific example of this complexity would be the parsing support for recursive named structs referring to non-LLVM dialect types.

Therefore, the type system of the LLVM dialect can be closed, that is, all LLVM dialect types can only refer to other LLVM dialect types. This has an extra benefit of reducing the verbosity of the syntax.

Generic Form and Uniquing

MLIR allows operations and types to be printed in a generic form even if the dialect is registered. In this form, the textual representation of the type should uniquely identify it. In particular, this property should hold for identified structs which are uniqued by name in LLVM IR. Specifically, since MLIR does not have dedicated support for defining new types that are usable across the module (type aliases are merely shorthand names), each occurrence of a type should provide its complete definition, including for recursive types.

Nested Modules

MLIR supports arbitrarily nested modules, in particular an MLIR module may contain multiple LLVM modules, each of which uses identified structs. These structs are uniqued in the MLIR context. Therefore, identically-named structs in different nested modules (or, alternatively, when processing multiple modules sequentially in the same context) refer to the same type. Depending on the use case, it may or may not be desirable to share types across modules: desirable for LTO, undesirable for parallel compilation of multiple modules for different targets. Currently, MLIR does not provide a mechanism to have type definitions scoped or removed from the context. Hence the pragmatic decision of sharing types across nested modules so that they remain uniqued in the context. It is left under responsibility of the dialect user to ensure that structs are renamed if they should refer to different types. This can be revisited in the future (see Possible Extensions below).

Thread Safety

Type creation and manipulation should be thread safe, similarly to other MLIR types. In particular, manipulation of the body of an identified struct should not lead to race conditions or assertions inside MLIR if called from multiple threads.

Syntax

LLVM IR has a concise and well-structured syntax for types. However, this syntax is not readily representable in MLIR and would require reimplementing parts of the LLVM type parser in the dialect (in particular, to support using %-identifiers for type names and trailing * for pointers). Arguably, these changes would also lead to confusing syntax in the MLIR context. Supporting only a subset of the LLVM IR syntax would be also confusing due to inconsistency.

Therefore, compound parametric types should use the syntax native to MLIR, e.g. kind<..>, and avoid fortuitous correspondence with the LLVM IR syntax, e.g. struct<{i32}> may be confusing to determine whether a structure is packed or not. This will lead to more verbose types, which can be partially compensated by using type aliases extensively.

Simple Types

The following non-parametric in MLIR types are supported: void, half, float, double, fp128, x86_fp80, ppc_fp128, x86_mmx, token. They will be represented as separate mlir::LLVM::LLVMType subclasses, with names connected, capitalized and prefixed with LLVM to differentiate them from standard MLIR (e.g. mlir::LLVM::LLVMFloatType, mlir::LLVM::LLVMX86Fp80Type). The prefixing is similar to that of FuncOp/LLVMFuncOp/GPUFuncOp and is intended for disambiguation in case when both mlir and mlir::LLVM namespaces are imported. The pretty syntax for these types correspond to the LLVM IR syntax, prefixed with !llvm. as required for MLIR dialect-specific types. This will give us, e.g., !llvm.void, !llvm.half that are consistent with the current syntax.

Unsupported Types

label type corresponds to basic block labels in LLVM. We don’t need an equivalent in MLIR because Blocks are not Values and don’t have types. They are also treated specially in the operations that use them.

metadata type is used for LLVM metadata. MLIR has a significantly more structured attribute system with (mostly) typed values.

Parametric Types

If the body of a parametric type refers to another type, the leading !llvm. in the reference can be omitted for brevity. For example, !llvm.ptr<!llvm.i32> can be spelled as !llvm.ptr<i32>. This is possible thanks to the closedness of the type system.

Integer Types

Integer types are parametric in MLIR terminology. They will be represented as mlir::LLVM::LLVMIntegerType class parameterized by bitwidth (similar to standard integers). The syntax remains identical to the existing !llvm.i32.

Pointer Types

Pointer types are parametric types represented as mlir::LLVM::LLVMPointerType parameterized by the element type and the address space. The address space is an integer, but this choice may be reconsidered if MLIR implements named address spaces. Syntactically, this type can be represented as !llvm.ptr<float> and !llvm.ptr<float, 5> for non-default address spaces.

Vector Types

These can be represented as mlir::LLVM::LLVMFixedVectorType and mlir::LLVM::LLVMScalableVectorType (both derived from a common base class) parameterized by the element type, the number of elements (or the divisor of said number) and a boolean flag indicating whether a vector is scalable. Syntactically, the representation of this type can remain similar to LLVM IR: !llvm.vector<42 x float> for plain vectors and !llvm.vector<vscale x 42 x float> for scalable vectors.

Array Types

These are parametric types represented as mlir::LLVM::LLVMArrayType parameterized by the element type and the number of elements. Syntactically, the representation can be similar to that of a vector !llvm.array<42 x float>. This specific notation is preferred for consistency with other MLIR types, although !llvm.array[42 x float] is also possible (but not !llvm[42 x float]).

Function Types

These are parametric types represented as mlir::LLVM::LLVMFuncType parameterized by the return type, the list of argument types, and the boolean flag indicating whether the function is variadic. The syntax for this type can be relatively close to LLVM IR: !llvm.func<float (array<2 x float>, i32, ...)> corresponds to LLVM IR’s float (array<2 x float>, i32, ...).

Structure Types

Structure types are represented in a single dedicated class mlir::LLVM::LLVMStructType. For convenience, it is possible to provide further decomposition of this type into identified and literal structs, but this is left for future work. Internally, the struct type stores a (potentially empty) name, a (potentially empty) list of contained types and a bitmask indicating whether the struct is named, opaque, packed or uninitialized. Their uniquing key consists of a pointer-int pair, which corresponds to the StringRef of the name or ArrayRef<Type> of the contained types for named and literal structs, respectively, followed by the “opaque”, “named” and “packed” bits.

Literal Structs

Literal structures are uniqued according to the list of elements they contain, and can optionally be packed. The syntax for such structs is as follows.

llvm-literal-struct-type ::= `!llvm.struct<` `packed`? `(` llvm-type-list `)`  `>`
llvm-type-list ::= <potentially empty comma-separated list of llvm types w/o `!llvm`>

Literal structs cannot be recursive, but can contain other structs. Therefore, they must be constructed in a single step with the entire list of contained elements provided. This ensures thread safety. The API call can resemble the following.

static mlir::LLVM::LLVMStructType::getLiteral(MLIRContext *, TypeRange, bool isPacked);

Identified Structs

Identified structures are uniqued according to their name, which must be provided when the type is constructed, and optionally with the list of contained types (which may not be possible in case of recursive structs). The storage for the type is allocated immediately and the type is uniqued given the name, under lock in the context. If an identified struct with the given name already exists, it will be returned instead (Note: LLVM would auto-rename the type). If the list of contained types is not provided, the struct type is created in the “uninitialized” state, in which case one can only dump the type, check the presence of the body or set it. The body can be set only on the uninitialized type. This modification is performed under lock and may fail if the type is already initialized with a different body. This is necessary to make potentially concurrent modifications of the type safe. The creation API resembles the following.

static mlir::LLVM::LLVMStructType::getIdentified(MLIRContext *, StringRef);
static mlir::LLVM::LLVMStructType::getIdentified(MLIRContext *, StringRef, TypeRange,
                                                 bool isPacked);
static mlir::LLVM::LLVMStructType::getOpaque(MLIRContext *, StringRef);
bool mlir::LLVM::LLVMStructType::hasBody();
mlir::LogicalResult mlir::LLVM::LLVMStructType::trySetBody(TypeRange, bool isPacked);

In the textual representation, identified structs always fully list their members except for self-references to any surrounding identified struct. For self-references, only the name of the structure is provided. For example,

// Equivalent of struct A { struct A *a; }; in C.
!llvm.struct<"A", (ptr<struct<"A">>)>

// Equivalent of struct A { int32_t i; struct B *b};
// struct B { struct A *a; struct B *b }; in C.
// Note that the first occurrence of "B" has its members listed.
!llvm.struct<"A", (i32, struct<"B", (ptr<struct<"A">>, ptr<struct<"B">>)>)>

With this approach, all recursive types can be constructed following a single type definition and independently roundtripped in the generic form. The representation of the outermost type does not change if the order in which the types occur in the IR changes. The syntax for identified structs is as follows.

llvm-ident-struct-type ::= `!llvm.struct<` string-literal, `opaque` `>`
                         | `!llvm.struct<` string-literal, `packed`?
                            `(` llvm-type-or-ref-list  `)` `>`
llvm-type-or-ref-list ::= <potentially empty comma-separated list of llvm-type-or-ref>
llvm-type-or-ref ::= <any llvm type>
                   | `!llvm.struct<` string-literal >

The quoted string literal enables support for types with names not representable as identifiers, also supported by LLVM IR.

Examples of Structure Types

!llvm.struct<>                  # NOT allowed
!llvm.struct<()>                # empty, literal
!llvm.struct<(i32)>             # literal
!llvm.struct<(struct<(i32)>)>   # struct containing a struct
!llvm.struct<packed (i8, i32)>  # packed struct
!llvm.struct<"a">               # recursive reference, only allowed within another
                                # struct, NOT allowed at top level
!llvm.struct<"a", ptr<struct<"a">>>  # supported example of recursive reference
!llvm.struct<"a", ()>           # empty, named (necessary to differentiate from 
                                # recursive reference)
!llvm.struct<"a", opaque>       # opaque, named
!llvm.struct<"a", (i32)>        # named
!llvm.struct<"a", packed (i8, i32)>  # named, packed

Possible Extensions

Other Recursive Types

Other dialects requiring recursive types can consider the following model.

  • Potentially recursive types are split into the “identifier” (e.g., name) and “content” (e.g., list of members) part.
  • The “identifier” is the sufficient uniquing key.
  • The main storage has fixed size and is allocated the first type the “identifier” is encountered. Dynamically-sized content such as lists can be copied separately into the type storage held by the context, with the main storage keeping only a fixed number of pointers to that storage (this is already the case for types in upstream MLIR).
  • The “content” part of the type can be modified by updating the main storage without changing the “identifier” and thus preserving the pointer-comparability of types. It may be desirable to only allow setting the “content” once to avoid surprising concurrent modifications.
  • In the custom type syntax, the “identifier” part serves as a placeholder for self-references within a recursive type. Actual type objects are being created immediately and can be added to the “content” of the surrounding type. The parser keeps a list of the types in process of being constructed and sets up their “content” part when available, which is guaranteed to be by the end of the outermost type.

Region-Scoped Type Definitions

More precise control over the use of recursive or named types across multiple modules can be provided by extending core IR with region-scoped type definitions with forward-declaration. Each region contains, in addition to a list of blocks, a list of types that are only visible within this region. This can be implemented by creating a new allocation arena in the context when the region is created, and destroying it when the region is deleted. Type definitions local to the region are uniqued within its scope, rather than in the global context, and can reuse identifiers. Furthermore, local definitions can refer to not-yet-declared identifiers as long as they are defined later in the same scope. Having this as core syntax provides a sequence point (first block or end-of-region) to check if all forward references were defined. It also ensures the mechanism is supported in the generic syntax. Types from different regions can be merged explicitly by lifting them into the parent region. When parsing, types can be looked up in a stack of such regions to allow repeated names and shadowing.

This has non-trivial modifications and requires analyzing runtime and memory overhead, without clear use cases at the moment, so it is left for future work.

Auto-renaming

It seems possible to provide a createOrRename function for identified structs that would automatically change the identifier if the requested one already exists. Making this behavior default for get would be inconsistent with the remaining types in MLIR, and could interfere with parsing in the generic form. Consider parsing the following IR:

llvm.module {
  llvm.func(...) {
    %0 = "llvm.op"() : () -> !llvm.struct<"A", ()>
    // Here, we need to know it's the same A, otherwise we fail the basic IR verification.
    "llvm.op"(%0) : (!llvm.struct<"A", ()>) -> ()
  }
}
llvm.module {
  llvm.func(...) {
    // Different definition for "A", so we rename it to "A1" internally.
    %0 = "llvm.op"() : () -> !llvm.struct<"A", (i32)>
    // How do we know if this should be renamed or not? In the exact same 
    // case above, we looked up the name as-is, but here we won't find it.
    // If we rename it, it will become "A2", mismatch again. If we look
    // through the list of possible renames, we may find one that has the
    // same content, but that would mean identified structs are also
    // identified by content...
    "llvm.op"(%0) : (!llvm.struct<"A", (i32)>) -> ()
  }
}
1 Like

I think this is a great thing to have, thanks @ftynse! This will also help with type conversion when going to LLVM dialect, in my opinion.