Hi,
I’m thinking of changing clang to pass aggregate values directly (as opposed to via a byval pointer, as it does now) when we’re compiling CUDA for NVPTX.
It would not affect PTX output, as byval arguments are lowered into exactly the same arrays in .param
space, but the function signatures in IR will look different, which may affect folks who have to link together IR variants that use different calling conventions. The only potentially mismatching IR clang has to deal with during CUDA compilation is the libdevice bitcode, and, AFAICT, it does not use any byval arguments.
The original commit that implemented the current NVPTX calling convention intended to match NVIDIA’s NVVM spec. However, the NVVM spec is not particularly relevant to clang these days, IMO. LLVM IR != NVVM IR and, generally speaking, they are not guaranteed to interoperate to begin with. They provide different intrinsics, and we have no idea how far NVIDIA’s fork of LLVM has diverged from upstream LLVM. Furthermore, AFAICT, the NVVM spec does not prohibit passing aggregates directly. In fact, it discusses how to handle direct values in clause 4 of the rules-and-restrictions list in the NVVM spec linked above.
Before I start messing with this, I want to check that it makes sense. Can anyone think of a reason why switching to direct passing of aggregates may not be a good idea? I believe it would be an improvement for CUDA compilation, but I may not be aware of other use cases that could be affected.
@jdoerfert – would this affect OpenMP’s compilation targeting NVPTX?
The motivation for this change is to allow SROA to eliminate local copies in more cases. Local copies that make it to the generated PTX present a substantial performance hit, as we end up with all threads on the GPU rushing to access their own chunk of very high-latency memory.
One case where SROA is unable to eliminate local copies is when it considers the alloca pointer to have escaped because it was passed as a byval argument to another function. Passing aggregates directly gives LLVM better information and happens to be easier to optimize. This situation pops up surprisingly often: e.g., passing a lambda with captures triggers it pretty reliably if the lambda does not get inlined. It can be triggered by something as simple as passing an argument to another function.
–Artem