NVPTX: Calling convention for aggregate arguments passed by value

Hi,

I’m thinking of changing clang to pass aggregate values directly (as opposed to via a byval pointer, as we do now) when we’re compiling CUDA for NVPTX.

It would not affect PTX output as byval arguments will be lowered into exactly the same arrays in .param space, but the function signatures in IR will look different, which may affect folks if they have to link together IR variants that use different calling conventions. The only potentially mismatching IR clang has to deal with during CUDA compilation is libdevice bitcode and, AFAICT, it does not use any byval arguments.
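For illustration, here’s a minimal sketch of the two IR signatures side by side (the PTX line in the comment is what I’d expect NVPTX to emit for both; the exact size and alignment depend on the aggregate):

```llvm
; Both declarations below should lower to the same PTX parameter, e.g.
;   .param .align 4 .b8 func_param_0[4]
%struct.S = type { i32 }

declare void @func_byval(%struct.S* byval(%struct.S) align 4)  ; current convention
declare void @func_direct(%struct.S)                           ; proposed convention
```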

The original commit that implemented the current NVPTX calling convention intended to match NVIDIA’s NVVM spec. However, the NVVM spec is not particularly relevant to clang these days, IMO. LLVM IR != NVVM IR and, generally speaking, they are not guaranteed to interoperate to begin with. They provide different intrinsics, and we don’t know how much NVIDIA’s fork of LLVM has diverged from upstream. Furthermore, AFAICT, the NVVM spec does not prohibit passing aggregates directly. In fact, it discusses how to handle direct values in clause 4 of the rules-and-restrictions list in the NVVM spec linked above.

Before I start messing with this, I want to check if it makes sense. Can anyone think why switching to direct passing of aggregates may not be a good idea? I believe it will be an improvement for CUDA compilation, but I may not be aware of other use cases that may be affected.

@jdoerfert – would this affect OpenMP’s compilation targeting NVPTX?

The motivation for this change is to allow SROA to eliminate local copies in more cases. Local copies that make it to the generated PTX present a substantial performance hit, as we end up with all threads on the GPU rushing to access their own chunk of very high-latency memory.

One case where SROA is unable to eliminate local copies is when it considers the alloca pointer to have escaped because it was passed as a byval argument to another function. Passing aggregates directly gives LLVM better information and happens to be easier to optimize. This situation keeps popping up surprisingly often. E.g., passing a lambda with captures triggers it pretty reliably if the lambda does not get inlined. It can be triggered by something as simple as passing an argument to another function.
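To make the lambda case concrete, here’s a minimal IR sketch of the pattern as I understand it (all names illustrative): the capture struct lives in an alloca, and passing its address byval to a non-inlined callee makes SROA keep the in-memory copy:

```llvm
%class.anon = type { i32 }   ; a lambda capturing one int by value

declare i32 @callee(%class.anon* byval(%class.anon) align 4)

define i32 @caller(i32 %x) {
  %cap = alloca %class.anon, align 4
  %f = getelementptr inbounds %class.anon, %class.anon* %cap, i32 0, i32 0
  store i32 %x, i32* %f, align 4
  ; the alloca's address is passed to the call, so SROA keeps the copy
  %r = call i32 @callee(%class.anon* byval(%class.anon) align 4 %cap)
  ret i32 %r
}
```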

–Artem

So, quick question to make sure I understand this properly.
The plan is to lower “byval arguments” pretty much exactly as we do now, except we drop the byval part, right?
I also don’t follow the SROA part. Why would byval capture the pointer? I mean, it is easier (or in fact trivial) to deduce no-capture for byval arguments than for non-byval arguments. See the IR arguments of the kernel in this example.
That said, we should be able to eliminate the local copy (alloca) if there are no writes to the byval memory that could be executed before the call.
This should be true assuming byval arguments don’t need to be stack allocations (looking at @rnk).
~ Johannes

The plan is to lower “byval arguments” pretty much exactly as we do now, except we drop the byval part, right?

define void @func(%struct.S* byval(%struct.S) %0)
-> 
define void @func(%struct.S %0)

I also don’t follow the SROA part. Why would byval capture the pointer?

we should be able to eliminate the local copy (alloca) if there are no writes to the byval memory

The issue is that SROA cannot get rid of that pointer (and the alloca), because it’s passed to another function, even if it’s not modified there.

I would prefer we do not break interoperability with nvcc as I hope we actually introduce a sane way to link nvcc (device) code with OpenMP offload code soon.

Right. I agree that is a problem. I don’t think removing byval is the solution though. We should instead optimize

%mem = alloca %T
; ... copy the aggregate at %orig into %mem ...
call void @foo(byval(%T) %mem)

to

call void @foo(byval(%T) %orig)

if we can show %orig is not modified between the copy into %mem and the call.
Assuming this reasoning is somewhat sound, it will improve all our targets and probably catches most of your cases already. At least it’s worth a try.

WDYT?

The problem is that %orig is not necessarily a pointer. If all we have is a value, we will need the alloca in order to have a byval pointer to pass as an argument.

OK. In your use case, is the callee available or external?

The entire thing seems to be a missing-optimization problem, given that byval simply means “make a copy on the call edge” (from the optimizer’s perspective, at least). If we see both the caller and the callee, we should be able to optimize byval at least as well as non-byval passing (given the absence of interposable linkage types).

OK. In your use case, is the callee available or external?

Typically it is available in the same TU, but in general it’s not necessarily the case. E.g. if we’re compiling with -fgpu-rdc.

we should be able to optimize byval at least as well as non-byval passing

The best optimization is the one where we don’t have to do it at all. :)

I still don’t quite understand how you think it should’ve been optimized. Let’s suppose we have this IR (Compiler Explorer):

%struct.S = type { i32 }

declare %struct.S @func_byval(%struct.S* noundef byval(%struct.S) align 4)
define void @call_byval(%struct.S %0) {
  %alloca = alloca %struct.S, align 4
  store %struct.S %0, %struct.S* %alloca, align 4
  %result = tail call %struct.S @func_byval(%struct.S* noundef nonnull byval(%struct.S) align 4 %alloca)
  ret void
}

declare %struct.S @func_direct(%struct.S)
define void @call_direct(%struct.S %0) {
  %result = tail call %struct.S @func_direct(%struct.S %0)
  ret void
}

How would call_byval be optimized?

As long as the callee has a byval pointer as an argument, I do not see how SROA (or anything else) can optimize away the alloca in call_byval, as the value would not have a pointer to it unless it’s stored somewhere.

Hopefully this context helps. Diverging from NVVM IR would create issues for projects like IREE that generate LLVM IR targeting NVIDIA GPUs through open-source LLVM, but use the libdevice that ships as part of CUDA to handle arithmetic functions. As you mentioned, libdevice currently does not have any byval usage, but that’s not set in stone. Diverging from NVVM IR will mean linker errors when linking libdevice in. At that point we would be forced to either use LLVM or use libNVVM from NVIDIA. Given that LLVM does not have a replacement for libdevice, we would be forced to use libNVVM.

So the situation is that you don’t have a pointer to pass on because the caller is passed the value by copy first? This seems very odd.

Maybe we should look at the “leftover byval and alloca” here as artifacts of our encoding rather than conceptual shortcomings. My argument from before would still apply, with the caveat that we cannot pass %0 into byval. However, since func_byval and func_direct accept the argument in the same way, we can use the same reasoning as described earlier to pass the value directly rather than making a local copy (= we still have access to the initializers of the local copy). The key point here is that PTX seems to make the argument memory explicit anyway, and pass-by-value as well as pass-by-copy end up looking the same on the call edge. That said, when would the byval vs. bycopy change make a difference wrt. interoperability?

Diverging from NVVM IR would create issues for projects like IREE that generate LLVM IR for target NVIDIA GPUs through LLVM in open-source

This proposal only affects clang-generated IR. I’m not familiar with IREE, but it sounds like it’s generating IR itself. If that’s the case it will not be affected. LLVM will continue to lower aggregate arguments regardless of whether they are passed directly or via byval pointer.

As for interfacing with libdevice, if/when it may start passing aggregate arguments, it should be easy enough to annotate them with an explicit calling convention attribute and generate compatible calls. We do need to keep libdevice working for CUDA compilation, so you can expect it to be usable by IREE, too.

So the situation is that you don’t have a pointer to pass on because the caller is passed the value by copy first? This seems very odd.

The example is contrived – it was just an easy way to have a value to pass without cluttering the IR with irrelevant details. In practice, most of the values will be temporaries constructed out of thin air with insertvalue. That does not change the situation in principle.
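For example (a sketch, names illustrative), a temporary built with insertvalue exists only as an SSA value, so the byval convention forces an alloca and a store just to obtain a pointer to pass:

```llvm
%struct.S = type { i32 }

declare void @foo(%struct.S* byval(%struct.S) align 4)

define void @caller(i32 %x) {
  %v = insertvalue %struct.S undef, i32 %x, 0  ; value exists only in SSA form
  %mem = alloca %struct.S, align 4             ; needed solely to get a pointer
  store %struct.S %v, %struct.S* %mem, align 4
  call void @foo(%struct.S* byval(%struct.S) align 4 %mem)
  ret void
}
```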

we can use the same reasoning as described earlier to pass the value directly rather than making a local copy

To pass the value directly, we’ll need to change the calling convention so that we always agree on how a given C++ function signature maps to IR.

when would the byval vs bycopy change make a difference then wrt. interoperability?

I can’t think of practical cases, but I do know that folks do unexpected things, and that’s exactly the reason I’ve made this post – to check if it may affect someone in a way I’m not aware of.

Hypothetically, someone could compile CUDA code to IR with clang-11 and release it as a bitcode library to link with later, e.g. a homegrown libdevice. Then they build something with clang at HEAD and attempt to link with that custom libdevice. If they happen to call a function there that has an aggregate parameter, they will have a problem: the old code would expect a byval pointer, but the call site would attempt to pass the value directly. Again, the example is contrived, as LLVM does not guarantee IR interoperability between different versions. NVIDIA’s actual libdevice just happens to work for now, but it may break at any point.
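A minimal sketch of that hypothetical mismatch (all names illustrative):

```llvm
%struct.S = type { i32 }

; The old clang-11-built bitcode library defines:
;   define void @lib_fn(%struct.S* byval(%struct.S) align 4 %s)
; while a newer clang emits calls against the direct-value signature:
declare void @lib_fn(%struct.S)

define void @caller() {
  %s = insertvalue %struct.S undef, i32 42, 0
  call void @lib_fn(%struct.S %s)  ; no longer matches the library's signature
  ret void
}
```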

As long as everything is compiled with clang of the same version, the change should not affect anyone.

I think I agree with you. I also think we will not lose nvcc compatibility through this. Thus, I don’t see a good reason not to change the NVPTX calling convention as proposed.


[Edit] Sorry it took me so long to come to this conclusion.

I do not expect the change to be invasive. In case someone runs into an issue it should be easy to add an escape hatch option to preserve the current calling convention.

Code review: ⚙ D118084 [CUDA, NVPTX] Pass byval aggregates directly