First-Party Support for LLVM Types
Motivation
Currently, the LLVM dialect type in MLIR wraps instances of llvm::Type *
and delegates printing, parsing and uniquing to LLVM itself. The advantages of this approach are as follows:
- Exact correspondence with the LLVM IR type system; any change to the LLVM IR type system is reflected automatically.
- Identical syntax if wrapped in
!llvm<" ">
- Trivial translation logic.
However, this approach has numerous drawbacks, in particular:
- The modeled type system is incomplete. More specifically, it is currently impossible to roundtrip identified structs. This is due to
llvm::parseType
expecting to be called in a context where all identified structs have been already defined in a module. - The LLVM dialect has to own an
LLVMContext
(and anllvm::Module
) which is not thread-safe, unlikeMLIRContext.
This in turn requires additional synchronization when working with the LLVM dialect, which has led to multiple issues when attempting to achieve parallel compilation (e.g. D78207, D78580). - Changes to LLVM IR type system may break the LLVM dialect in a hard-to-detect way because LLVM-driven parsing will almost always work, limiting the first advantage listed above.
- The type system is poorly integrated with the
isa
-based logic and thus poorly reflected in the ODS definitions of LLVM dialect Ops, leaving them under-verified and thus prone to runtime assertions inside LLVM.
An alternative is to replicate the LLVM IR type system within MLIR, providing first-party modeling for all LLVM types as MLIR dialect types. This will address the drawbacks and may lead to only a limited loss of the current advantages.
Design Questions and Decisions
Closedness
It might make sense to reuse compatible standard types in the LLVM dialect to avoid unnecessary duplication. However, only few types fully match between LLVM IR and MLIR standard: MLIR integer types can have signed semantics, unlike those in LLVM IR; MLIR floating point types have the same semantics by different syntax; MLIR vector types can be multi-dimensional while LLVM IR vector types cannot, but can be scalable. The complexity of specifying and keeping in mind the details of supported types outweighs that of partial duplication.
The LLVM dialect is most often the “sink” dialect in the lowering pipeline that is intended for further conversion to LLVM IR. As such, allowing LLVM dialect container types (structs, arrays, pointers) to refer to non-LLVM dialect types has limited usefulness that does not justify the complexity of supporting such types. A specific example of this complexity would be the parsing support for recursive named structs referring to non-LLVM dialect types.
Therefore, the type system of the LLVM dialect can be closed, that is, all LLVM dialect types can only refer to other LLVM dialect types. This has an extra benefit of reducing the verbosity of the syntax.
Generic Form and Uniquing
MLIR allows operations and types to be printed in a generic form even if the dialect is registered. In this form, the textual representation of the type should uniquely identify it. In particular, this property should hold for identified structs which are uniqued by name in LLVM IR. Specifically, since MLIR does not have dedicated support for defining new types that are usable across the module (type aliases are merely shorthand names), each occurrence of a type should provide its complete definition, including for recursive types.
Nested Modules
MLIR supports arbitrarily nested modules, in particular an MLIR module may contain multiple LLVM modules, each of which uses identified structs. These structs are uniqued in the MLIR context. Therefore, identically-named structs in different nested modules (or, alternatively, when processing multiple modules sequentially in the same context) refer to the same type. Depending on the use case, it may or may not be desirable to share types across modules: desirable for LTO, undesirable for parallel compilation of multiple modules for different targets. Currently, MLIR does not provide a mechanism to have type definitions scoped or removed from the context. Hence the pragmatic decision of sharing types across nested modules so that they remain uniqued in the context. It is left under responsibility of the dialect user to ensure that structs are renamed if they should refer to different types. This can be revisited in the future (see Possible Extensions below).
Thread Safety
Type creation and manipulation should be thread safe, similarly to other MLIR types. In particular, manipulation of the body of an identified struct should not lead to race conditions or assertions inside MLIR if called from multiple threads.
Syntax
LLVM IR has a concise and well-structured syntax for types. However, this syntax is not readily representable in MLIR and would require reimplementing parts of the LLVM type parser in the dialect (in particular, to support using %-identifiers
for type names and trailing *
for pointers). Arguably, these changes would also lead to confusing syntax in the MLIR context. Supporting only a subset of the LLVM IR syntax would be also confusing due to inconsistency.
Therefore, compound parametric types should use the syntax native to MLIR, e.g. kind<..>
, and avoid fortuitous correspondence with the LLVM IR syntax, e.g. struct<{i32}>
may be confusing to determine whether a structure is packed or not. This will lead to more verbose types, which can be partially compensated by using type aliases extensively.
Simple Types
The following non-parametric in MLIR types are supported: void
, half
, float
, double
, fp128
, x86_fp80
, ppc_fp128
, x86_mmx
, token
. They will be represented as separate mlir::LLVM::LLVMType
subclasses, with names connected, capitalized and prefixed with LLVM
to differentiate them from standard MLIR (e.g. mlir::LLVM::LLVMFloatType
, mlir::LLVM::LLVMX86Fp80Type
). The prefixing is similar to that of FuncOp/LLVMFuncOp/GPUFuncOp
and is intended for disambiguation in case when both mlir
and mlir::LLVM
namespaces are imported. The pretty syntax for these types correspond to the LLVM IR syntax, prefixed with !llvm.
as required for MLIR dialect-specific types. This will give us, e.g., !llvm.void
, !llvm.half
that are consistent with the current syntax.
Unsupported Types
label
type corresponds to basic block labels in LLVM. We don’t need an equivalent in MLIR because Blocks are not Values and don’t have types. They are also treated specially in the operations that use them.
metadata
type is used for LLVM metadata. MLIR has a significantly more structured attribute system with (mostly) typed values.
Parametric Types
If the body of a parametric type refers to another type, the leading !llvm.
in the reference can be omitted for brevity. For example, !llvm.ptr<!llvm.i32>
can be spelled as !llvm.ptr<i32>
. This is possible thanks to the closedness of the type system.
Integer Types
Integer types are parametric in MLIR terminology. They will be represented as mlir::LLVM::LLVMIntegerType
class parameterized by bitwidth (similar to standard integers). The syntax remains identical to the existing !llvm.i32
.
Pointer Types
Pointer types are parametric types represented as mlir::LLVM::LLVMPointerType
parameterized by the element type and the address space. The address space is an integer, but this choice may be reconsidered if MLIR implements named address spaces. Syntactically, this type can be represented as !llvm.ptr<float>
and !llvm.ptr<float, 5>
for non-default address spaces.
Vector Types
These can be represented as mlir::LLVM::LLVMFixedVectorType
and mlir::LLVM::LLVMScalableVectorType
(both derived from a common base class) parameterized by the element type, the number of elements (or the divisor of said number) and a boolean flag indicating whether a vector is scalable. Syntactically, the representation of this type can remain similar to LLVM IR: !llvm.vector<42 x float>
for plain vectors and !llvm.vector<vscale x 42 x float>
for scalable vectors.
Array Types
These are parametric types represented as mlir::LLVM::LLVMArrayType
parameterized by the element type and the number of elements. Syntactically, the representation can be similar to that of a vector !llvm.array<42 x float>
. This specific notation is preferred for consistency with other MLIR types, although !llvm.array[42 x float]
is also possible (but not !llvm[42 x float]
).
Function Types
These are parametric types represented as mlir::LLVM::LLVMFuncType
parameterized by the return type, the list of argument types, and the boolean flag indicating whether the function is variadic. The syntax for this type can be relatively close to LLVM IR: !llvm.func<float (array<2 x float>, i32, ...)>
corresponds to LLVM IR’s float (array<2 x float>, i32, ...)
.
Structure Types
Structure types are represented in a single dedicated class mlir::LLVM::LLVMStructType
. For convenience, it is possible to provide further decomposition of this type into identified and literal structs, but this is left for future work. Internally, the struct type stores a (potentially empty) name, a (potentially empty) list of contained types and a bitmask indicating whether the struct is named, opaque, packed or uninitialized. Their uniquing key consists of a pointer-int pair, which corresponds to the StringRef
of the name or ArrayRef<Type>
of the contained types for named and literal structs, respectively, followed by the “opaque”, “named” and “packed” bits.
Literal Structs
Literal structures are uniqued according to the list of elements they contain, and can optionally be packed. The syntax for such structs is as follows.
llvm-literal-struct-type ::= `!llvm.struct<` `packed`? `(` llvm-type-list `)` `>`
llvm-type-list ::= <potentially empty comma-separated list of llvm types w/o `!llvm`>
Literal structs cannot be recursive, but can contain other structs. Therefore, they must be constructed in a single step with the entire list of contained elements provided. This ensures thread safety. The API call can resemble the following.
static mlir::LLVM::LLVMStructType::getLiteral(MLIRContext *, TypeRange, bool isPacked);
Identified Structs
Identified structures are uniqued according to their name, which must be provided when the type is constructed, and optionally with the list of contained types (which may not be possible in case of recursive structs). The storage for the type is allocated immediately and the type is uniqued given the name, under lock in the context. If an identified struct with the given name already exists, it will be returned instead (Note: LLVM would auto-rename the type). If the list of contained types is not provided, the struct type is created in the “uninitialized” state, in which case one can only dump the type, check the presence of the body or set it. The body can be set only on the uninitialized type. This modification is performed under lock and may fail if the type is already initialized with a different body. This is necessary to make potentially concurrent modifications of the type safe. The creation API resembles the following.
static mlir::LLVM::LLVMStructType::getIdentified(MLIRContext *, StringRef);
static mlir::LLVM::LLVMStructType::getIdentified(MLIRContext *, StringRef, TypeRange,
bool isPacked);
static mlir::LLVM::LLVMStructType::getOpaque(MLIRContext *, StringRef);
bool mlir::LLVM::LLVMStructType::hasBody();
mlir::LogicalResult mlir::LLVM::LLVMStructType::trySetBody(TypeRange, bool isPacked);
In the textual representation, identified structs always fully list their members except for self-references to any surrounding identified struct. For self-references, only the name of the structure is provided. For example,
// Equivalent of struct A { struct A *a; }; in C.
!llvm.struct<"A", (ptr<struct<"A">>)>
// Equivalent of struct A { int32_t i; struct B *b};
// struct B { struct A *a; struct B *b }; in C.
// Note that the first occurrence of "B" has its members listed.
!llvm.struct<"A", (i32, struct<"B", (ptr<struct<"A">>, ptr<struct<"B">>)>)>
With this approach, all recursive types can be constructed following a single type definition and independently roundtripped in the generic form. The representation of the outermost type does not change if the order in which the types occur in the IR changes. The syntax for identified structs is as follows.
llvm-ident-struct-type ::= `!llvm.struct<` string-literal, `opaque` `>`
| `!llvm.struct<` string-literal, `packed`?
`(` llvm-type-or-ref-list `)` `>`
llvm-type-or-ref-list ::= <potentially empty comma-separated list of llvm-type-or-ref>
llvm-type-or-ref ::= <any llvm type>
| `!llvm.struct<` string-literal >
The quoted string literal enables support for types with names not representable as identifiers, also supported by LLVM IR.
Examples of Structure Types
!llvm.struct<> # NOT allowed
!llvm.struct<()> # empty, literal
!llvm.struct<(i32)> # literal
!llvm.struct<(struct<(i32)>)> # struct containing a struct
!llvm.struct<packed (i8, i32)> # packed struct
!llvm.struct<"a"> # recursive reference, only allowed within another
# struct, NOT allowed at top level
!llvm.struct<"a", ptr<struct<"a">>> # supported example of recursive reference
!llvm.struct<"a", ()> # empty, named (necessary to differentiate from
# recursive reference)
!llvm.struct<"a", opaque> # opaque, named
!llvm.struct<"a", (i32)> # named
!llvm.struct<"a", packed (i8, i32)> # named, packed
Possible Extensions
Other Recursive Types
Other dialects requiring recursive types can consider the following model.
- Potentially recursive types are split into the “identifier” (e.g., name) and “content” (e.g., list of members) part.
- The “identifier” is the sufficient uniquing key.
- The main storage has fixed size and is allocated the first type the “identifier” is encountered. Dynamically-sized content such as lists can be copied separately into the type storage held by the context, with the main storage keeping only a fixed number of pointers to that storage (this is already the case for types in upstream MLIR).
- The “content” part of the type can be modified by updating the main storage without changing the “identifier” and thus preserving the pointer-comparability of types. It may be desirable to only allow setting the “content” once to avoid surprising concurrent modifications.
- In the custom type syntax, the “identifier” part serves as a placeholder for self-references within a recursive type. Actual type objects are being created immediately and can be added to the “content” of the surrounding type. The parser keeps a list of the types in process of being constructed and sets up their “content” part when available, which is guaranteed to be by the end of the outermost type.
Region-Scoped Type Definitions
More precise control over the use of recursive or named types across multiple modules can be provided by extending core IR with region-scoped type definitions with forward-declaration. Each region contains, in addition to a list of blocks, a list of types that are only visible within this region. This can be implemented by creating a new allocation arena in the context when the region is created, and destroying it when the region is deleted. Type definitions local to the region are uniqued within its scope, rather than in the global context, and can reuse identifiers. Furthermore, local definitions can refer to not-yet-declared identifiers as long as they are defined later in the same scope. Having this as core syntax provides a sequence point (first block or end-of-region) to check if all forward references were defined. It also ensures the mechanism is supported in the generic syntax. Types from different regions can be merged explicitly by lifting them into the parent region. When parsing, types can be looked up in a stack of such regions to allow repeated names and shadowing.
This has non-trivial modifications and requires analyzing runtime and memory overhead, without clear use cases at the moment, so it is left for future work.
Auto-renaming
It seems possible to provide a createOrRename
function for identified structs that would automatically change the identifier if the requested one already exists. Making this behavior default for get
would be inconsistent with the remaining types in MLIR, and could interfere with parsing in the generic form. Consider parsing the following IR:
llvm.module {
llvm.func(...) {
%0 = "llvm.op"() : () -> !llvm.struct<"A", ()>
// Here, we need to know it's the same A, otherwise we fail the basic IR verification.
"llvm.op"(%0) : (!llvm.struct<"A", ()>) -> ()
}
}
llvm.module {
llvm.func(...) {
// Different definition for "A", so we rename it to "A1" internally.
%0 = "llvm.op"() : () -> !llvm.struct<"A", (i32)>
// How do we know if this should be renamed or not? In the exact same
// case above, we looked up the name as-is, but here we won't find it.
// If we rename it, it will become "A2", mismatch again. If we look
// through the list of possible renames, we may find one that has the
// same content, but that would mean identified structs are also
// identified by content...
"llvm.op"(%0) : (!llvm.struct<"A", (i32)>) -> ()
}
}