[RFC] First-Party Support for LLVM Types

First-Party Support for LLVM Types

Motivation

Currently, the LLVM dialect type in MLIR wraps instances of llvm::Type * and delegates printing, parsing and uniquing to LLVM itself. The advantages of this approach are as follows:

  • Exact correspondence with the LLVM IR type system; any change to the LLVM IR type system is reflected automatically.
  • Identical syntax if wrapped in !llvm<" ">
  • Trivial translation logic.

However, this approach has numerous drawbacks, in particular:

  • The modeled type system is incomplete. More specifically, it is currently impossible to roundtrip identified structs. This is due to llvm::parseType expecting to be called in a context where all identified structs have been already defined in a module.
  • The LLVM dialect has to own an LLVMContext (and an llvm::Module) which is not thread-safe, unlike MLIRContext. This in turn requires additional synchronization when working with the LLVM dialect, which has led to multiple issues when attempting to achieve parallel compilation (e.g. D78207, D78580).
  • Changes to LLVM IR type system may break the LLVM dialect in a hard-to-detect way because LLVM-driven parsing will almost always work, limiting the first advantage listed above.
  • The type system is poorly integrated with the isa-based logic and thus poorly reflected in the ODS definitions of LLVM dialect Ops, leaving them under-verified and thus prone to runtime assertions inside LLVM.

An alternative is to replicate the LLVM IR type system within MLIR, providing first-party modeling for all LLVM types as MLIR dialect types. This will address the drawbacks and may lead to only a limited loss of the current advantages.

Design Questions and Decisions

Closedness

It might make sense to reuse compatible standard types in the LLVM dialect to avoid unnecessary duplication. However, only few types fully match between LLVM IR and MLIR standard: MLIR integer types can have signed semantics, unlike those in LLVM IR; MLIR floating point types have the same semantics by different syntax; MLIR vector types can be multi-dimensional while LLVM IR vector types cannot, but can be scalable. The complexity of specifying and keeping in mind the details of supported types outweighs that of partial duplication.

The LLVM dialect is most often the “sink” dialect in the lowering pipeline that is intended for further conversion to LLVM IR. As such, allowing LLVM dialect container types (structs, arrays, pointers) to refer to non-LLVM dialect types has limited usefulness that does not justify the complexity of supporting such types. A specific example of this complexity would be the parsing support for recursive named structs referring to non-LLVM dialect types.

Therefore, the type system of the LLVM dialect can be closed, that is, all LLVM dialect types can only refer to other LLVM dialect types. This has an extra benefit of reducing the verbosity of the syntax.

Generic Form and Uniquing

MLIR allows operations and types to be printed in a generic form even if the dialect is registered. In this form, the textual representation of the type should uniquely identify it. In particular, this property should hold for identified structs which are uniqued by name in LLVM IR. Specifically, since MLIR does not have dedicated support for defining new types that are usable across the module (type aliases are merely shorthand names), each occurrence of a type should provide its complete definition, including for recursive types.

Nested Modules

MLIR supports arbitrarily nested modules, in particular an MLIR module may contain multiple LLVM modules, each of which uses identified structs. These structs are uniqued in the MLIR context. Therefore, identically-named structs in different nested modules (or, alternatively, when processing multiple modules sequentially in the same context) refer to the same type. Depending on the use case, it may or may not be desirable to share types across modules: desirable for LTO, undesirable for parallel compilation of multiple modules for different targets. Currently, MLIR does not provide a mechanism to have type definitions scoped or removed from the context. Hence the pragmatic decision of sharing types across nested modules so that they remain uniqued in the context. It is left under responsibility of the dialect user to ensure that structs are renamed if they should refer to different types. This can be revisited in the future (see Possible Extensions below).

Thread Safety

Type creation and manipulation should be thread safe, similarly to other MLIR types. In particular, manipulation of the body of an identified struct should not lead to race conditions or assertions inside MLIR if called from multiple threads.

Syntax

LLVM IR has a concise and well-structured syntax for types. However, this syntax is not readily representable in MLIR and would require reimplementing parts of the LLVM type parser in the dialect (in particular, to support using %-identifiers for type names and trailing * for pointers). Arguably, these changes would also lead to confusing syntax in the MLIR context. Supporting only a subset of the LLVM IR syntax would be also confusing due to inconsistency.

Therefore, compound parametric types should use the syntax native to MLIR, e.g. kind<..>, and avoid fortuitous correspondence with the LLVM IR syntax, e.g. struct<{i32}> may be confusing to determine whether a structure is packed or not. This will lead to more verbose types, which can be partially compensated by using type aliases extensively.

Simple Types

The following non-parametric in MLIR types are supported: void, half, float, double, fp128, x86_fp80, ppc_fp128, x86_mmx, token. They will be represented as separate mlir::LLVM::LLVMType subclasses, with names connected, capitalized and prefixed with LLVM to differentiate them from standard MLIR (e.g. mlir::LLVM::LLVMFloatType, mlir::LLVM::LLVMX86Fp80Type). The prefixing is similar to that of FuncOp/LLVMFuncOp/GPUFuncOp and is intended for disambiguation in case when both mlir and mlir::LLVM namespaces are imported. The pretty syntax for these types correspond to the LLVM IR syntax, prefixed with !llvm. as required for MLIR dialect-specific types. This will give us, e.g., !llvm.void, !llvm.half that are consistent with the current syntax.

Unsupported Types

label type corresponds to basic block labels in LLVM. We don’t need an equivalent in MLIR because Blocks are not Values and don’t have types. They are also treated specially in the operations that use them.

metadata type is used for LLVM metadata. MLIR has a significantly more structured attribute system with (mostly) typed values.

Parametric Types

If the body of a parametric type refers to another type, the leading !llvm. in the reference can be omitted for brevity. For example, !llvm.ptr<!llvm.i32> can be spelled as !llvm.ptr<i32>. This is possible thanks to the closedness of the type system.

Integer Types

Integer types are parametric in MLIR terminology. They will be represented as mlir::LLVM::LLVMIntegerType class parameterized by bitwidth (similar to standard integers). The syntax remains identical to the existing !llvm.i32.

Pointer Types

Pointer types are parametric types represented as mlir::LLVM::LLVMPointerType parameterized by the element type and the address space. The address space is an integer, but this choice may be reconsidered if MLIR implements named address spaces. Syntactically, this type can be represented as !llvm.ptr<float> and !llvm.ptr<float, 5> for non-default address spaces.

Vector Types

These can be represented as mlir::LLVM::LLVMFixedVectorType and mlir::LLVM::LLVMScalableVectorType (both derived from a common base class) parameterized by the element type, the number of elements (or the divisor of said number) and a boolean flag indicating whether a vector is scalable. Syntactically, the representation of this type can remain similar to LLVM IR: !llvm.vector<42 x float> for plain vectors and !llvm.vector<vscale x 42 x float> for scalable vectors.

Array Types

These are parametric types represented as mlir::LLVM::LLVMArrayType parameterized by the element type and the number of elements. Syntactically, the representation can be similar to that of a vector !llvm.array<42 x float>. This specific notation is preferred for consistency with other MLIR types, although !llvm.array[42 x float] is also possible (but not !llvm[42 x float]).

Function Types

These are parametric types represented as mlir::LLVM::LLVMFuncType parameterized by the return type, the list of argument types, and the boolean flag indicating whether the function is variadic. The syntax for this type can be relatively close to LLVM IR: !llvm.func<float (array<2 x float>, i32, ...)> corresponds to LLVM IR’s float (array<2 x float>, i32, ...).

Structure Types

Structure types are represented in a single dedicated class mlir::LLVM::LLVMStructType. For convenience, it is possible to provide further decomposition of this type into identified and literal structs, but this is left for future work. Internally, the struct type stores a (potentially empty) name, a (potentially empty) list of contained types and a bitmask indicating whether the struct is named, opaque, packed or uninitialized. Their uniquing key consists of a pointer-int pair, which corresponds to the StringRef of the name or ArrayRef<Type> of the contained types for named and literal structs, respectively, followed by the “opaque”, “named” and “packed” bits.

Literal Structs

Literal structures are uniqued according to the list of elements they contain, and can optionally be packed. The syntax for such structs is as follows.

llvm-literal-struct-type ::= `!llvm.struct<` `packed`? `(` llvm-type-list `)`  `>`
llvm-type-list ::= <potentially empty comma-separated list of llvm types w/o `!llvm`>

Literal structs cannot be recursive, but can contain other structs. Therefore, they must be constructed in a single step with the entire list of contained elements provided. This ensures thread safety. The API call can resemble the following.

static mlir::LLVM::LLVMStructType::getLiteral(MLIRContext *, TypeRange, bool isPacked);

Identified Structs

Identified structures are uniqued according to their name, which must be provided when the type is constructed, and optionally with the list of contained types (which may not be possible in case of recursive structs). The storage for the type is allocated immediately and the type is uniqued given the name, under lock in the context. If an identified struct with the given name already exists, it will be returned instead (Note: LLVM would auto-rename the type). If the list of contained types is not provided, the struct type is created in the “uninitialized” state, in which case one can only dump the type, check the presence of the body or set it. The body can be set only on the uninitialized type. This modification is performed under lock and may fail if the type is already initialized with a different body. This is necessary to make potentially concurrent modifications of the type safe. The creation API resembles the following.

static mlir::LLVM::LLVMStructType::getIdentified(MLIRContext *, StringRef);
static mlir::LLVM::LLVMStructType::getIdentified(MLIRContext *, StringRef, TypeRange,
                                                 bool isPacked);
static mlir::LLVM::LLVMStructType::getOpaque(MLIRContext *, StringRef);
bool mlir::LLVM::LLVMStructType::hasBody();
mlir::LogicalResult mlir::LLVM::LLVMStructType::trySetBody(TypeRange, bool isPacked);

In the textual representation, identified structs always fully list their members except for self-references to any surrounding identified struct. For self-references, only the name of the structure is provided. For example,

// Equivalent of struct A { struct A *a; }; in C.
!llvm.struct<"A", (ptr<struct<"A">>)>

// Equivalent of struct A { int32_t i; struct B *b};
// struct B { struct A *a; struct B *b }; in C.
// Note that the first occurrence of "B" has its members listed.
!llvm.struct<"A", (i32, struct<"B", (ptr<struct<"A">>, ptr<struct<"B">>)>)>

With this approach, all recursive types can be constructed following a single type definition and independently roundtripped in the generic form. The representation of the outermost type does not change if the order in which the types occur in the IR changes. The syntax for identified structs is as follows.

llvm-ident-struct-type ::= `!llvm.struct<` string-literal, `opaque` `>`
                         | `!llvm.struct<` string-literal, `packed`?
                            `(` llvm-type-or-ref-list  `)` `>`
llvm-type-or-ref-list ::= <potentially empty comma-separated list of llvm-type-or-ref>
llvm-type-or-ref ::= <any llvm type>
                   | `!llvm.struct<` string-literal >

The quoted string literal enables support for types with names not representable as identifiers, also supported by LLVM IR.

Examples of Structure Types

!llvm.struct<>                  # NOT allowed
!llvm.struct<()>                # empty, literal
!llvm.struct<(i32)>             # literal
!llvm.struct<(struct<(i32)>)>   # struct containing a struct
!llvm.struct<packed (i8, i32)>  # packed struct
!llvm.struct<"a">               # recursive reference, only allowed within another
                                # struct, NOT allowed at top level
!llvm.struct<"a", ptr<struct<"a">>>  # supported example of recursive reference
!llvm.struct<"a", ()>           # empty, named (necessary to differentiate from 
                                # recursive reference)
!llvm.struct<"a", opaque>       # opaque, named
!llvm.struct<"a", (i32)>        # named
!llvm.struct<"a", packed (i8, i32)>  # named, packed

Possible Extensions

Other Recursive Types

Other dialects requiring recursive types can consider the following model.

  • Potentially recursive types are split into the “identifier” (e.g., name) and “content” (e.g., list of members) part.
  • The “identifier” is the sufficient uniquing key.
  • The main storage has fixed size and is allocated the first type the “identifier” is encountered. Dynamically-sized content such as lists can be copied separately into the type storage held by the context, with the main storage keeping only a fixed number of pointers to that storage (this is already the case for types in upstream MLIR).
  • The “content” part of the type can be modified by updating the main storage without changing the “identifier” and thus preserving the pointer-comparability of types. It may be desirable to only allow setting the “content” once to avoid surprising concurrent modifications.
  • In the custom type syntax, the “identifier” part serves as a placeholder for self-references within a recursive type. Actual type objects are being created immediately and can be added to the “content” of the surrounding type. The parser keeps a list of the types in process of being constructed and sets up their “content” part when available, which is guaranteed to be by the end of the outermost type.

Region-Scoped Type Definitions

More precise control over the use of recursive or named types across multiple modules can be provided by extending core IR with region-scoped type definitions with forward-declaration. Each region contains, in addition to a list of blocks, a list of types that are only visible within this region. This can be implemented by creating a new allocation arena in the context when the region is created, and destroying it when the region is deleted. Type definitions local to the region are uniqued within its scope, rather than in the global context, and can reuse identifiers. Furthermore, local definitions can refer to not-yet-declared identifiers as long as they are defined later in the same scope. Having this as core syntax provides a sequence point (first block or end-of-region) to check if all forward references were defined. It also ensures the mechanism is supported in the generic syntax. Types from different regions can be merged explicitly by lifting them into the parent region. When parsing, types can be looked up in a stack of such regions to allow repeated names and shadowing.

This has non-trivial modifications and requires analyzing runtime and memory overhead, without clear use cases at the moment, so it is left for future work.

Auto-renaming

It seems possible to provide a createOrRename function for identified structs that would automatically change the identifier if the requested one already exists. Making this behavior default for get would be inconsistent with the remaining types in MLIR, and could interfere with parsing in the generic form. Consider parsing the following IR:

llvm.module {
  llvm.func(...) {
    %0 = "llvm.op"() : () -> !llvm.struct<"A", ()>
    // Here, we need to know it's the same A, otherwise we fail the basic IR verification.
    "llvm.op"(%0) : (!llvm.struct<"A", ()>) -> ()
  }
}
llvm.module {
  llvm.func(...) {
    // Different definition for "A", so we rename it to "A1" internally.
    %0 = "llvm.op"() : () -> !llvm.struct<"A", (i32)>
    // How do we know if this should be renamed or not? In the exact same 
    // case above, we looked up the name as-is, but here we won't find it.
    // If we rename it, it will become "A2", mismatch again. If we look
    // through the list of possible renames, we may find one that has the
    // same content, but that would mean identified structs are also
    // identified by content...
    "llvm.op"(%0) : (!llvm.struct<"A", (i32)>) -> ()
  }
}
2 Likes

I think this is a great thing to have, thanks @ftynse! This will also help with type conversion when going to LLVM dialect, in my opinion.

@ftynse That was a good read for a beginner like myself, thanks!

I am interested to know how could a recursive type provide its complete definition without having some kind of syntactic way to give names to types (as far as I know)? I referring to something like LLVM IR’s identified types.

I am asking because I am interested to give modelling SPIR-V’s OpTypeForwardPointer a try. I think that there is no need to directly model that op directly in the spv dialect but to generate it from (mutually) recursive spv structs. (@antiagainst I appreciate your input on this as well :slight_smile:).

Edit: I think I missed some of the details in the post that might contain an answer to my question already. Will re-read it carefully.

By complete, I meant self-contained. We should be able to call parseType on a single string and get all mutually recursive types constructed at once, without any lookups in the context (like LLVM does for identified structs by looking them up in the module). Recursive types should have some unique identifier that can be used to reference them, string name is just one option. This identified should be the first part of the type we parse, and should be sufficient to unique the type. In the textual representation, every type will have to be fully unrolled, except for any of the types currently being parsed (to avoid infinitely long types). That’s how the RFC proposes to model identified structs.

1 Like

Very cool, thanks Alex for pushing forward this! Having first-party support for LLVM types can bring much better verification check and other good stuff. So this is great.

I’m particularly interested in the recursive struct representation here given we have similar needs in SPIR-V, as pointed out by @bob.belcher. (For Vulkan we won’t see this but for OpenCL it’s possible.) Your proposal looks good to me. For the “identifier” for the type, I’d vote for using the name. It’s what people are generally familiar with and lots of existing tools are written based on. But I understand in MLIR names are not a first-class entity in the representation so I’d happy to see what other folks are in favor of.

Also I like your idea of eliding the leading namespace inside a llvm.struct, gated on closeness. In SPIR-V at the moment we don’t do that but that sounds like a nice thing to have to declutter the IR a bit.

Thanks again for the very nice writeup! :slight_smile:

With regard to recursive types: in the case of co-recursive types,

!llvm.struct<"A", ptr<struct<"B">>
!llvm.struct<"B", ptr<struct<"A">>

both of the types would have to be printed in the asm output to make sense. If only A is used by code, I think only !llvm.struct<"A", ptr<struct<"B">> would be printed so the type would not be valid. Am I correct? If so, is there currently any way to ensure they both get printed? Would that be part of this RFC?

This is already answered in the RFC:

The recursive context is only within a single type, so we always print the full context for the first instance of that type. This does not apply to two different prints, i.e. your example would be something like:

!llvm.struct<"A", ptr<struct<"B", ptr<struct<"A">>>>>
!llvm.struct<"B", ptr<struct<"A", ptr<struct<"B">>>>>

Ah… didn’t catch that.

So if would be the responsibility of the type printer to print it and know that the definition has already been printed? That’s not ideal… I tend to write printers recursively using DialectAsmPrinter.printType(), but that wouldn’t work (I think) since the type printer would have to have some notion of context. Am I correct?

It would work in the LLVM Dialect case since it’s a closed type system.

That wouldn’t work if you call the top-level printType(). My prototype implementation does something like

printType(Type ty) {
  SmallVector<StringRef, 4> nameStack;
  printStructType(ty, nameStack);
}

printStructType(Type, SmallVectorImpl<StringRef> &stack)
  // if printing a named struct, check if it's name isn't already on stack
  // if it is, just print the name
  // otherwise, print the name and body, and push the struct on the stack
  // call printStructType for any nested types

This indeed leverages closedness of the LLVM dialect type system. This is a pragmatic decision that doesn’t require us to change the core infrastructure (e.g., but exposing the stack of identifiers to top-level printType). I am not aware of cases that would require recursive types in an open (at the dialect level) type system, and even cross-dialect type mixing is rare. When we do have a use case, we can consider evolving the infra. Region-scoped type definitions are one plausible solution IMO, it guarantees printing and provides a sequence point for resolving recursive types, but it would require changing to the core IR structure that aren’t justified by current use cases.

I’m designing a type system for a messaging system. Since it can (theoretically) transport any data, I would like it to be an open type system. Given that the actual implementation(s) will need some information about all the underlying types (i.e. how large it is), I’m hoping that any type will be able to support it by adding some additional information in some sort of standard fashion – I haven’t thought about this too much since for now it’s acceptable to have a closed type system. I just don’t want to do anything to preclude an open type system in the future.

More importantly, I think carrying around a separate context precludes (or makes significantly more complex) my generic tablegen-based type generation’s printer auto-generation support if I want it to support recursive type systems.

Couldn’t we simply add the context to DialectAsmPrinter? I’m not certain, but isn’t the same instance passed through the printType method regardless of the dialect? A stack of Type should be general enough, and printType could manage it.

Assuming this sort of thing can be done, my one other minor concern is bloat in type output due to redundancy.

I’m not familiar with the structure of any MLIR code, but isn’t this just a simple question of requiring the various type printers to request another type be printed after the one currently being printed? You’d probably need some additional accounting to make sure it hasn’t been printed within that scope. This would avoid having to keep some context with the printer.

I don’t think this requires region-scoped aliases or forward alias references (as you’ve discussed) since the types themselves would still be required to resolve references themselves. This would merely control the type printing.

I’m not pushing for this approach, just trying to understand the issues with it. Feel free to point me to some documentation/code on the subject – I may just be missing some context.

It could do this, and I put a lot of work into the internals of the AsmPrinter to make sure that this would be possible if we wanted to do it.

– River

Last call for bikeshedding and other propositions!

The prototype implementation is available here https://reviews.llvm.org/D84339#change-XY6q8TTRkceB. It has the following small differences from the initial proposal:

  • the keyword for vector types is vec rather than vector to avoid clashing with built-in type;
  • scalable vectors are represented as vec<? x 4 x f32> instead of vec<vscale x 4 x f32>, this is more common for MLIR and avoids awkward concatentation vec<vscalex4xf32>

Both changes make a lot of sense to me Alex, thanks. Random question, is LLVMDialectNewTypes a transitional thing?

Yes, I factored it out to test/lib in my private version to make it clear. The idea is to reach feature parity with the current LLVM IR dialect type system without touching it, and then implement the transition (not yet sure if in one shot or not).

Awesome, great approach, thanks!

These changes are about to land. I updated all upstream users to use the new syntax. For convenience, https://reviews.llvm.org/D85194 provides a simple tool that reads files written using the old type syntax and prints the same content using the new type syntax, including FileCheck comments but including regex and incomplete types (it still needs to be able to parse the type in its entirety even if it treats the rest as text).