MLIR for binary code analysis

Hi @mehdi_amini @ftynse, I am Ajay. I am currently learning MLIR; most of the work with MLIR is directed towards compiler infrastructure. I would like to apply the same MLIR concepts to binary code analysis.
In this regard, I have a few questions and would appreciate your suggestions.

  1. Sorry if I am wrong, but what I understood from the tutorials is that the Toy language is converted into an AST, and MLIR is emitted from the AST. When you say multi-level IR, are there levels within MLIR, like level-1, level-2, etc.?

  2. Once MLIR is emitted, what is it used for next? Is the current state of MLIR work to convert from MLIR to LLVM IR? If so, can we use that LLVM IR for recompilation (if I use MLIR for binary code analysis)?

  3. I have the disassembled instructions of an ELF binary (obtained from an RE tool like IDA Pro). If I feed these disassembled instructions in place of the Toy language, can I continue to use the remaining steps, i.e. Toy language/assembly instructions → AST → MLIR → LLVM IR?

Thanks in advance.

Hi,

The Toy language is an educational example; I would not use or extend it for anything even remotely real.

It is easier to see MLIR as an IR substrate or framework, not a single IR. One can define multiple dialects at different levels of abstraction, from language constructs to hardware instructions. The process of converting from one to another will likely happen in several steps, and after each step you’ll have some representation that can be called a level. The tutorial illustrates that by going to the SCF dialect (loops), then to the LLVM dialect through the Standard dialect. Each of those can be seen as a separate level, but nobody is obliged to use these specific dialects.
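To make the idea of "levels" a bit more concrete, here is a toy model in plain Python. It is not the MLIR API: the op names (`toy.mul`, `loops.for`, and so on) are made up for illustration, and real lowerings rarely map one op to exactly one op. It only shows that the representation after each conversion step is what one could call a level.

```python
# Toy model, not MLIR: ops are just strings, and every name is hypothetical.
# Each entry is (name of the level reached, op mapping used to get there).
LOWERINGS = [
    ("loops", {"toy.mul": "loops.for"}),
    ("standard", {"loops.for": "std.br"}),
    ("llvm", {"std.br": "llvm.br"}),
]

def lower_progressively(ops):
    """Return the IR after every step, so each intermediate level is visible."""
    levels = [("input", list(ops))]
    for level_name, mapping in LOWERINGS:
        # Ops without a mapping are left untouched at this step.
        ops = [mapping.get(op, op) for op in ops]
        levels.append((level_name, list(ops)))
    return levels

levels = lower_progressively(["toy.mul", "toy.print"])
for name, ops in levels:
    print(name, ops)
```

Note that `toy.print` survives unchanged through every step: different dialects can coexist in the same IR until something lowers them.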

MLIR is an internal representation for the compiler. The compiler can transform it for optimization or code generation purposes. Lowering progressively between different dialects is mostly code generation. You can also run generic (e.g. canonicalization or CSE) or dialect-specific (e.g. loop transformations on SCF or polyhedral transformations on Affine) optimizations.
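As a rough illustration of what a generic optimization such as CSE does, here is a minimal sketch in plain Python. It is not MLIR's implementation: it assumes a flat list of side-effect-free ops in SSA form, and the `demo.*` op names are hypothetical.

```python
def cse(ops):
    """Common subexpression elimination over (result, name, operands) tuples.

    Assumes every op is pure, so two ops with the same name and the same
    operands always compute the same value.
    """
    seen = {}     # (name, operands) -> result that already computes it
    replace = {}  # eliminated result -> surviving result
    out = []
    for result, name, operands in ops:
        # Rewrite uses of results that were eliminated earlier.
        operands = tuple(replace.get(o, o) for o in operands)
        key = (name, operands)
        if key in seen:
            replace[result] = seen[key]
        else:
            seen[key] = result
            out.append((result, name, operands))
    return out

ops = [
    ("%0", "demo.add", ("%a", "%b")),
    ("%1", "demo.add", ("%a", "%b")),  # same computation as %0
    ("%2", "demo.mul", ("%0", "%1")),
]
optimized = cse(ops)
```

The point is that the pass never looks at what `demo.add` means; it only needs to know the op is pure, which is why such transformations can be shared across dialects.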

Please don’t use the Toy language for anything other than learning about MLIR. It is exactly what its name says it is – a toy.

I don’t see why you would need an AST for assembly instructions, or how the lowering from tensor operations to loops can be useful for assembly instructions… I suppose what you actually want is to define a dialect that corresponds to your assembly instructions and reason about them using MLIR infrastructure. Going from that to LLVM IR sounds like raising, LLVM IR being more abstract than (most) assembly formats. You can certainly implement raising in MLIR, but you need to be certain about your goals before you do that.

I don’t see why you would need an AST for assembly instructions, or how the lowering from tensor operations to loops can be useful for assembly instructions

I am planning to define a dialect that corresponds to the disassembled instructions of x86-64 static ELF binaries. These instructions are obtained from disassembler tools such as objdump, Capstone, IDA Pro, etc. The disassembled output has the following structure:

  • registers
  • instructions
  • functions
  • data and segment sections, etc. of the static ELF binary.

I think generating an AST and then emitting MLIR may not make sense for the approach I asked about. The reason I asked is that, in the tutorials, the Toy language (or any other language) is first converted to an AST, and MLIR is then emitted based on that AST.

But in my case, I want to emit/generate MLIR in a reverse engineering setting, not a compiler infrastructure. What is your advice on generating MLIR that corresponds to the above assembly instruction semantics? Is it feasible while following the MLIR concept?

Traditionally, a single IR for binary code analysis and recompilation has been widely studied in the area of reverse engineering. Additionally, LLVM IR is widely used by many approaches (GitHub - lifting-bits/mcsema: Framework for lifting x86, amd64, aarch64, sparc32, and sparc64 program binaries to LLVM bitcode). However, I am interested in using a multi-level IR (MLIR) in place of a single IR for binary lifting and recompilation.
At this stage, I would like to know from you whether it is feasible to generate a multi-level IR from disassembled instructions while following the MLIR approach.

But you need to be certain about your goals before you do that.

My goal is roughly the following:

  1. Disassemble a legacy raw ELF binary so that I have the disassembly information of the binary (as assembly pseudocode).
  2. Lift/raise those instructions to a multi-level IR based on the MLIR concept.
  3. Reassemble/recompile the MLIR into a working binary. In connection with step 3: if I want to recompile, do I need to convert the emitted MLIR to LLVM IR, or is LLVM IR already represented as the last form/level of dialects in MLIR?
    Lastly, will we be able to read the content inside the MLIR (in something like a pseudo-assembly instruction form)?

Thanks

Let me give you one key piece of information that seems missing from your reasoning: MLIR does not have a list of predefined instructions, types or attributes (unlike LLVM). In a sense, MLIR is not an IR itself, it’s an IR constructor. The entire point of MLIR is its extensibility, that is, anybody can add operations and types with (almost) any semantics and they are as good as, e.g., “standard” operations. So the answer to the question “can I do X with MLIR” is always “yes, provided you define the operations/types you need to do X”. The harder question is what these operations/types look like, and how they interact with other operations/types that may exist in the MLIR ecosystem.
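A toy model of the "IR constructor" idea, in plain Python rather than the actual MLIR API: the core below knows no operations at all, and an op only becomes creatable once some dialect registers it. The `Context` and `Operation` classes and the `x86` dialect with its ops are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Operation:
    name: str  # "dialect.op", e.g. "x86.mov"
    operands: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)

class Context:
    """Holds the operation names that registered dialects have contributed."""

    def __init__(self):
        self.registered = set()

    def register_dialect(self, dialect, op_names):
        for op in op_names:
            self.registered.add(f"{dialect}.{op}")

    def create(self, name, operands=(), **attributes):
        # The core has no built-in ops; only registered ones can be created.
        if name not in self.registered:
            raise ValueError(f"unregistered operation: {name}")
        return Operation(name, list(operands), dict(attributes))

ctx = Context()
ctx.register_dialect("x86", ["mov", "add", "ret"])  # hypothetical dialect
mov = ctx.create("x86.mov", ["rax", "rbx"], address=0x401000)
```

Nothing here makes `x86.mov` less of a first-class op than one from any other dialect, which is the extensibility point made above; the hard part remains deciding what the ops and types should be.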

You don’t need a language or an AST to produce an IR, MLIR or otherwise. You can just create the appropriate components (operations, blocks, regions, types) using the relevant APIs, e.g. mlir::Builder. Those APIs are explained in the tutorial, and you can call them from any code you have, so there is no need for an AST.

Define a dialect that has the semantics you want and make sure it is expressive enough. Then write your binary analysis code and use mlir::Builder and the relevant APIs to construct the IR from your analysis results. Then you can write the lifting to the LLVM dialect using the pattern rewriting infrastructure. It may require some extensions to the LLVM dialect, which is not complete yet. At the LLVM dialect level, you can export to LLVM IR and compile it back to a binary using LLVM.
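The workflow above can be sketched with a small stand-in in plain Python. This is not the MLIR builder or pattern rewriting API; the `Op` class, the `x86.*` and `llvm.*` op names, and the single rewrite are all assumptions made for illustration, and the rewrite deliberately ignores the flags that a real x86 `add` also updates.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    operands: tuple

def build_from_disassembly(instructions):
    """Create ops directly from (mnemonic, operands) pairs; no AST involved."""
    return [Op(f"x86.{mnemonic}", tuple(operands))
            for mnemonic, operands in instructions]

def raise_add(op):
    """Pattern: 'x86.add dst, src' becomes 'llvm.add' with (result, lhs, rhs)."""
    if op.name != "x86.add":
        return None  # pattern does not match this op
    dst, src = op.operands
    return Op("llvm.add", (dst, dst, src))

def apply_patterns(ops, patterns):
    """Replace each op with the result of the first matching pattern, if any."""
    out = []
    for op in ops:
        for pattern in patterns:
            replacement = pattern(op)
            if replacement is not None:
                op = replacement
                break
        out.append(op)
    return out

# Hypothetical disassembler output: (mnemonic, operands).
disassembly = [("add", ("rax", "rbx")), ("ret", ())]
ops = build_from_disassembly(disassembly)
raised = apply_patterns(ops, [raise_add])
```

Ops that no pattern matches (here `x86.ret`) stay in the assembly-level dialect, so a partially raised program is still a valid mix of both levels while more patterns are being written.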

But MLIR is a compiler infrastructure, and so is LLVM. It is an unfortunately common misconception, perpetuated by some textbooks, that compilers are about lexing and parsing. They are much more than that.

MLIR is a compiler framework that lets you define the abstractions you need. By itself, it is not necessarily “multi-level”, it’s the set of abstractions you use that is. To me, at this point, MLIR is a proper name, the same way as LLVM used to stand for Low-Level Virtual Machine but that has not been true for quite some time.

What are the different levels of abstractions you would need? Assembly instruction level and LLVM dialect level? Something else?

MLIR is very flexible so it is definitely feasible. If by MLIR approach you mean expressing everything as MLIR operations and transforming them using the pattern rewriting infrastructure, then also yes, it sounds feasible. It will be interesting to see how we should adapt the infrastructure to support such use cases (it was mostly designed for lowering, not raising).

MLIR has the LLVM dialect, which gives you the same abstractions as LLVM IR, but which is not complete yet. And you must translate the LLVM dialect to LLVM IR if you want to use LLVM proper.

As long as you introduce operations and/or attributes for that purpose.