[RFC] Add a printf op

Rationale

Many, though not necessarily all, MLIR target platforms support some form of the printf() call in order to print values that occur within programs at runtime.

The ability to call printf() is generally helpful for debugging.

For many platforms (such as x86 and other similar CPU targets), calling printf() is a matter of calling the appropriate function in the C standard library, which will be linked against the code generated via MLIR. However, not all platforms use this mechanism. For example, AMD GPUs running code compiled for HIP use calls to a library of functions such as __ockl_printf_begin() and __ockl_printf_append_args() to print values that reside on GPUs, and Clang lowers calls to printf() to these functions before passing GPU kernels to LLVM.

Since the process for calling printf can vary between targets, but the operation itself is target-independent, I propose we add an op to MLIR that represents printf, which can be lowered to whichever sequence of operations or function calls is needed to print things out.

The operation

def Somewhere_PrintfOp : Somewhere_Op<"printf", [MemoryEffects<[MemWrite]>]>,
  Arguments<(ins StrAttr:$format,
                Variadic<AnyTypeOf<[AnyInteger, Index, AnyFloat]>>:$args)> {
  let summary = "Calls printf() to print runtime values in generated code";
  let description = [{
    `somewhere.printf` takes a literal format string `format` and an arbitrary number of scalar arguments that should be printed.

    The format string is a C-style printf string, subject to any restrictions
    imposed by one's target platform.
  }];
  let assemblyFormat = [{
    attr-dict ($args^ `:` type($args))?
  }];
}

This operation has the following restrictions:

  • It only takes scalars, as it’s unclear how to handle the printing of something like a vector or memref
  • The format string is a constant, as MLIR doesn’t have a string type at runtime
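Because the format string is a compile-time constant, a lowering or verifier could in principle compare it against the operand list. The following is only a minimal sketch of that idea in Python, not something the proposed op does: `count_conversions` and `arity_matches` are hypothetical names, and the pattern covers only common C conversions without `*` widths.

```python
import re

# Matches either a literal "%%" or one conversion specification:
# flags, width, precision, length modifier, then the conversion character.
# Group 1 is set only for real conversions, never for "%%".
_SPEC = re.compile(
    r"%(?:%|[-+ #0]*\d*(?:\.\d+)?(?:hh|h|ll|l|L|j|z|t)?([diouxXeEfFgGaAcsp]))"
)


def count_conversions(fmt):
    """Count the conversions in a printf-style format that consume an argument."""
    return sum(1 for m in _SPEC.finditer(fmt) if m.group(1) is not None)


def arity_matches(fmt, num_args):
    """True when the format consumes exactly `num_args` scalar arguments."""
    return count_conversions(fmt) == num_args
```

A check like this only counts arguments; whether a given conversion is actually supported is still up to the target platform, as the op description says.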

Existing work

D110448, “[MLIR][GPU] Define gpu.printf op and its lowerings”, defines the printf op under the GPU dialect and adds lowerings to the printf() function (listed as the “OpenCL” lowering, though it would apply to most CPUs as well) and to the sequence of function calls needed for the AMD HIP API.

Open questions

  1. What dialect should this op reside in?

(I’m personally thinking std might be the right spot, but adding things to std is dispreferred)

Printing (and logging) have enough target-, runtime-, and application-specific details that I’d personally like to see more design convergence before trying to add a new op / set of ops to the core dialects. If I had to pick a name, I’d lean towards some (new) dialect like strings, logging, os, or debug rather than std.

  • Output stream if relevant? stdout? stderr? append to file?
  • Formatting for types? Maybe through utility ops in a strings dialect?
  • Buffering behavior?
  • Are side effects important? Can print calls be safely removed in release builds?

Trying to lower from a loosely defined op to a specific runtime could leave a disconnect between programmer intent and compiler behavior.

This thread has some discussion, including links to a few implementations in tutorials and the like: Print in MLIR

FWIW, IREE also has a sample of printing that is at the very edge of the system, implemented through a custom module: iree/custom_ops.td at main · google/iree · GitHub. We have our own printing for specific things (including full VM bytecode tracing: Adding a VM bytecode disassembler and flags for tracing to stderr. by benvanik · Pull Request #7261 · google/iree · GitHub), but no native support for printing from user code (if someone wants it, they can use a custom module and wire that up for their own use case).

Just as another data point, the vector dialect has supported its own print method from the early days, enabling ops like

vector.print %v : vector<32x32xi32>
vector.print %f : f64

The objective of adding this operation was simply to enable FileCheck-based integration tests to verify the correctness of lowering to LLVM IR while running MLIR code ‘end-to-end’. The underlying implementation was deliberately kept very light-weight so that print support could be quickly added to new targets. The drawback of this decision is that vector printing does not really ‘scale’ since we fully unroll vectors into these elementary print ops in the support library.
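To make the scaling concern concrete, here is a rough Python model of fully unrolling a vector into elementary print steps. The step names (`open`, `elem`, `comma`, `close`, `newline`) and the function `unroll_print` are assumptions chosen for illustration; they are not the names used by the actual support library.

```python
def unroll_print(shape):
    """Return the flat list of elementary print steps for a vector of `shape`.

    An empty shape models a scalar such as `vector.print %f : f64`.
    """
    steps = []

    def emit(dims, index):
        if not dims:
            # Innermost level: one elementary print per element.
            steps.append(("elem", index))
            return
        steps.append(("open",))
        for i in range(dims[0]):
            if i:
                steps.append(("comma",))
            emit(dims[1:], index + (i,))
        steps.append(("close",))

    emit(tuple(shape), ())
    steps.append(("newline",))
    return steps
```

Under this model, `len(unroll_print((32, 32)))` is 2114 steps for a single `vector<32x32xi32>` print, against 2 steps for a scalar, which is why the approach suits small test vectors rather than large ones.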

My 2c is that MLIR makes it easy to add domain-specific printf ops. I don’t think there is a great need to “standardize” one. I agree with the concerns upthread about having to nail down stream models, etc.

The reason this thread was brought up, and the reason I think there is some value in the proposed print op, is that we have an in-tree GPU dialect and platform-specific dialects like ROCM (including the lowering from one to the other). The contribution is not so much the op itself as the convenience that comes from the lowering and mapping (the patch is about ROCM).

If this op can’t be generalized enough or if we can’t find it a home in a “generic” place, one possible approach would be to add this op to the rocm dialect, and then handle the lowering “in-dialect” instead of when converting from a higher-level abstraction.

Thanks @mehdi_amini for the context.

One of the reasons I initially proposed this op for gpu is that (as far as I know) GPUs (and their associated runtimes) generally have an interface for printing debug information within kernels that, to the programmer, looks like a variant or lowering of printf() - I’ve heard that CUDA does this, for example.

From where I’m standing (and I suspect my coworkers are too) getting this printf op (wherever it gets parked, even if it’s rocm) upstream is good because other folks working in the MLIR->GPU space might find it useful, it’s not too specific to our convolution generator work, and it reduces the number of patches to upstream we have to maintain.

Also, the GPU dialect does have some effectively vendor-specific ops in it, such as subgroup_mma_compute, so while I can understand taking a stand for a vendor-independent gpu, it might be too late.

Not really: we should aim to identify and fix such issues instead. To me, this isn’t a reason to open the GPU dialect to any vendor-specific op.

Can you look into how your op would lower on a CUDA platform? That could help justify adding it to the GPU dialect. Also, maybe renaming it to gpu.debug_print instead could help alleviate @scotttodd's concerns.

Adding a different opinion from others based on the same reasons…

I think a debugging operation is an extremely common desire, so having it exist is nice. MLIR making it easy to define specific ops does not mean that I should have to set up a new op/dialect for a use case that almost every piece of code has in common (debugging). I also don’t think that we should standardize it across the various backends, as they all have different concerns.

I do think that this still leaves the option of an un-standardized debug.log operation that can take any set of inputs, is marked as side-effecting, and has no other guarantees, and then users can define this some more for their backend.

Trying to be constructive here, my guess is that many of the proponents of this are going through LLVM IR lowering, and this provides an op that is easier to produce, analyze, and transform than llvm.call. I can’t imagine another case where this abstraction is helpful, because “printf” isn’t at all a generic thing someone would design into an IR (format characters, compatibility with only certain kinds of operations, etc).

If this is the primary use-case, it sounds like it should go into an “llvm++” sort of dialect, which is known to predictably lower to LLVM. There are probably lots of different similar things that would make sense to add to this sort of dialect, including a bunch of the stuff people are trying to squeeze into LLVM IR proper.

WDYT about something like this?

-Chris

Looks like there’s a transformation from calls to printf() to a syscall/intrinsic, going off the nearby Clang sources

And, since I went and looked, there’s printf support in Vulkan (including via SPIR-V) as well.

Ok, so, to summarize what I’m seeing:

  • The main GPU runtimes (CUDA/HIP, OpenCL, Vulkan) support debug output facilities that are specifically modeled around the C printf() API
  • Using those facilities requires platform and runtime specific lowering of the printf() call to lower-level intrinsics or library functions, and is, in general, not as straightforward as emitting an llvm.call @printf
  • I believe programmers using MLIR to develop GPU code shouldn’t have to work with these platform-specific details and should be able to “call printf()” just like in their C code

Therefore, while it might get shuffled around in the future as part of a more general “add debug output support” change, I believe there’s a case to be made for gpu.printf (since, much like gpu.thread_id, it’s an op that abstracts over facilities provided by most GPUs).

If your gpu.printf can abstract over / lower to CUDA/HIP, ROCm, OpenCL, and Vulkan, then yeah, that’s a strong case to me for providing this op in the GPU dialect.

For the benefit of everyone who may not have looked into the revision in detail, the real complexity is in abstracting the platform-specific lowering. If this were about a straight 1-1 mapping to llvm.call @rocm_printf ... I think there wouldn’t be as much interest in this. Instead consider:

gpu.module @test_module {
    gpu.func @printf_test(%arg0 : i32, %arg1 : f32) {
      gpu.printf {format = "Value int: %d and float: %f"} %arg0, %arg1 : i32, f32
      gpu.return
    }
}

(the op only accepts integers and floating point arguments)

When targeting the HIP runtime, the lowering looks like this:

  gpu.module @test_module {
    llvm.mlir.global internal constant @printfFormat_0("Value int: %d and float: %f\00")
    llvm.func @__ockl_printf_append_string_n(i64, !llvm.ptr<i8>, i64, i32) -> i64
    llvm.func @__ockl_printf_append_args(i64, i32, i64, i64, i64, i64, i64, i64, i64, i32) -> i64
    llvm.func @__ockl_printf_begin(i64) -> i64
    llvm.func @printf_test(%arg0: i32, %arg1: f32) {
      %0 = llvm.mlir.constant(0 : i64) : i64
      %1 = llvm.call @__ockl_printf_begin(%0) : (i64) -> i64
      %2 = llvm.mlir.addressof @printfFormat_0 : !llvm.ptr<array<28 x i8>>
      %3 = llvm.mlir.constant(0 : i32) : i32
      %4 = llvm.getelementptr %2[%3, %3] : (!llvm.ptr<array<28 x i8>>, i32, i32) -> !llvm.ptr<i8>
      %5 = llvm.mlir.constant(28 : i64) : i64
      %6 = llvm.mlir.constant(1 : i32) : i32
      %7 = llvm.mlir.constant(0 : i32) : i32
      %8 = llvm.call @__ockl_printf_append_string_n(%1, %4, %5, %7) : (i64, !llvm.ptr<i8>, i64, i32) -> i64
      %9 = llvm.mlir.constant(2 : i32) : i32
      %10 = llvm.zext %arg0 : i32 to i64
      %11 = llvm.fpext %arg1 : f32 to f64
      %12 = llvm.bitcast %11 : f64 to i64
      %13 = llvm.call @__ockl_printf_append_args(%8, %9, %10, %12, %0, %0, %0, %0, %0, %6) : (i64, i32, i64, i64, i64, i64, i64, i64, i64, i32) -> i64
      llvm.return
    }
  }

This involves at least 3 intrinsics chained together and some non-trivial type conversions, and this lowering will differ for each target runtime!
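The type conversions in the listing can be modeled outside the compiler. The Python sketch below is only an illustration of what the listing appears to do, not the code from the revision: the helper names are hypothetical, and reading the operand after the descriptor as an argument count and the trailing i32 as an "is last call" flag is inferred from the listing rather than from documentation. Splitting more than seven values across several calls is likewise an assumption.

```python
import struct

SLOTS_PER_CALL = 7  # the listing's __ockl_printf_append_args has seven i64 value operands
MASK64 = (1 << 64) - 1


def widen_int(value, bits=32):
    """Model llvm.zext: keep the low `bits` bits and zero-fill up to 64."""
    return value & ((1 << bits) - 1) & MASK64


def widen_f32(value):
    """Model llvm.fpext f32 -> f64 followed by llvm.bitcast f64 -> i64."""
    as_f32 = struct.unpack("<f", struct.pack("<f", value))[0]  # round to f32 first
    return struct.unpack("<Q", struct.pack("<d", as_f32))[0]


def format_global(fmt):
    """Return the NUL-terminated bytes of the format string and their length."""
    data = fmt.encode("utf-8") + b"\x00"
    return data, len(data)


def pack_append_args(widened):
    """Group widened values into (count, seven_slots, is_last) tuples, one per call.

    Assumptions: `count` is the number of live slots, unused slots are zero,
    and only the final call sets is_last to 1.
    """
    calls = []
    for start in range(0, len(widened), SLOTS_PER_CALL):
        chunk = list(widened[start:start + SLOTS_PER_CALL])
        slots = chunk + [0] * (SLOTS_PER_CALL - len(chunk))
        is_last = 1 if start + SLOTS_PER_CALL >= len(widened) else 0
        calls.append((len(chunk), slots, is_last))
    return calls
```

For the format string and the two arguments of `printf_test`, this model yields a 28-byte global and a single call with a count of 2, two live slots, five zero slots, and a final flag of 1, which matches the constants visible in the listing.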

From looking at the documentation for various platforms, I’m rather confident it can. I could probably attempt to write some of the other lowerings, but I don’t quite have the ability to make sure they work - my test environment is a compute-only AMD card.