LLVM Discussion Forums

Modeling LLVM IR Struct Types

Rationale

Currently, the LLVM dialect type in MLIR wraps instances of llvm::Type * and delegates printing, parsing and uniquing to LLVM itself. While this approach has benefits, in particular the exact correspondence of the type system, identical syntax and trivial translation logic, it has two major drawbacks.

  1. The type system is not complete. More specifically, it is currently impossible to roundtrip identified structs. This is due to llvm::parseType expecting to be called in a context where all identified structs have been already defined in a module.
  2. This approach requires LLVM dialect to own an LLVMContext (and an llvm::Module) which is not thread-safe, unlike MLIRContext. This in turn requires additional synchronization when working with the LLVM dialect, which has led to multiple issues when attempting to achieve parallel compilation (e.g. D78207, D78580).

An alternative is to replicate the LLVM IR type system within MLIR, providing first-class modeling for all LLVM types as MLIR dialect types. However, such modeling is challenging, if at all possible, with the current MLIR approach.

Note: I do not (yet) propose to change the representation of LLVM dialect types, but seek how to work around expressiveness limitations.

Working Example

LLVM IR supports identified structs in order to enable (mutually) recursive types. For example, one can specify a type that contains a pointer to another, not-yet-defined type.

%a = type { %b* }
%b = type { %a* }

This is currently impossible in MLIR: a type must be fully specified when constructed.

Furthermore, MLIR types are uniqued in context whereas identified structs in LLVM IR are said to be not uniqued. If an identified struct with the same identifier is created programmatically or parsed in a different module but in the same LLVMContext, it is renamed automatically.

Representing such types in MLIR requires addressing two issues:

  1. Type uniquing;
  2. Type parsing.

Possible Representations

LLVM supports identified struct by decoupling type creation from initialization: it is possible to create an identified struct, use it in constructors of another type, and then initialize the type by setting up its body.

Practically, the behavior of LLVM IR with respect to identified structs can be seen as ensuring a unique pair “name + module”. That is, in a given module, only one instance of an identified struct with the given name can exist. This is relatively straightforward to implement in MLIR by extending the type uniquing infrastructure to be key/value storage, where the value is mutable and can be used to store the body of an identified struct after the struct itself was constructed and uniqued. Calling get with the given name and module can return the same type with mutable body. It is equally possible to auto-rename upon construction if the name is already in use.

Module-based Uniquing

The challenging part in this scheme is referring to the module and parsing. If we follow LLVM, identified structs with the same name but different body can appear in different modules. Furthermore, we need to avoid uses of identified structs defined in another module. To achieve this, when parsing a type, we would need to identify to which module does the type belong. This sounds challenging in MLIR since there is no way to guarantee a type is only used within a module. Moreover, MLIR types are immortal in context whereas modules are not, although one can arguably make the modules immortal as well.

It is unclear how to refer to the “name + module” pair in the syntax. Giving modules a name only delays the problem of having duplicate names (imagine parsing the same named module twice in the same MLIRContext). In the custom syntax, we could follow LLVM by requiring identified structs to be listed upfront in a hypothetical llvm.module operation (as attributes) and implement local renaming, i.e. after parsing a struct identified, look up a uniqued version in a table and use it to get the type instead of the original name. This wouldn’t work if the llvm.module operation is presented in the generic syntax, with the dialect registered, because we cannot inject behavior into the generic parser.

Alternatively, MLIR provides type aliases, but they are declared in the top-level module and used by the parser only. Providing scoped aliasing capability, where an alias is only active within a top-level module or within a given region, could be a potential solution.

Delayed Resolution

An alternative representation would only use and unique the identifier of the struct. The mapping between identifiers and bodies could be stored in a hypothetical llvm.module operation, e.g. as an array-of-types attribute. Functions accessing or modifying the body of such type would take a module operation as argument and modify it, while types themselves would remain immutable. Parsing in this case would be trivial as only the name should be parsed. However, depending on the module, an identified struct "a" could have different bodies. The downside of this approach is that, from the type alone, it is impossible to determine that two identified structs used in different modules have different bodies since the type only contains the name.

Conclusion

It appears that it is currently impossible to achieve exactly the same behavior as LLVM does for identified structs (i.e., unique names with auto-renaming, modules with identical names can be parsed repeatedly without error if in different modules, types with the same name in different modules correspond to different Type instances) without extending MLIR.

I am vaguely inclined to work towards having region-scoped aliases + forward-referencing for types in the alias list and implement structs based on that, but I would like to gather opinions and alternative proposals before proceeding.

This would resemble the following in the textual IR

module {
  llvm.module {
    !a = !llvm.struct<"a", ptr<!b>>
    !b = !llvm.struct<"b", ptr<!a>>
  }
  // cannot use !a or !b outside of scope
  llvm.module {
    !a = !llvm.struct<"a", i32> // this is fine
  }
}

Ping @River707 @clattner

Are there further thoughts on this topic? Or a resolution?

I’ve run into this exact problem lately and would rather not resort to a delayed-resolution approach which works in my particular case (since there are globally unique type IDs). The only problem is that I don’t think is resolved by this particular hack is printing/parsing. I don’t see a way to guarantee that both types be output.

There were no reactions to this thread, as you can see…

River and I chatted about this offline and decided that, for the LLVM dialect, it makes sense to go for a minimal viable solution (likely based on fully printing the struct type every time, except for self-references, and postponing the decision on multi-module type behavior to better times). Do you have a case other than the LLVM dialect?

I’m confused: would you resort to it or would you not?

RE use case: Yes, my primary use case is not LLVM. I’m trying to model a strongly-typed messaging system (for both software and hardware). Said messages are flat memory (C-style structs), so I only run into issues when (buffer) pointers are present. Said pointers are not necessarily pointing to the type in which they are members, they could be indirectly referencing that type. Additionally, void* is not an option since the amount of space/wires to be allocated for the pointed to data is based on the size of the data type which is being pointed to. The aspect that makes the delay-resolution approach without namespaces is that each message has a globally unique, randomly generated id associated with it.

If anyone is curious, I’m shooting for a superset of the Capnp type system.

RE resort to it: if there was some way to ensure that all the needed type data was present in the various serializations, yes I think I would (assuming I don’t run into other issues).