Get inline asm for PTX mma instruction

What’s the proper way to use asm for PTX mma in LLVM IR?

I tried to add the mma instruction discussed here using inline asm in LLVM IR, as in the code below.

However, something is apparently wrong with the asm call, since the generated PTX looks like this:

// begin inline asm
mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 {%0,%1},{%2,%3},{%4},{%5,%6};
// end inline asm

It seems the registers are not substituted for %0, %1, and so on.

I am not sure whether I am returning the proper data type for this kind of instruction with multiple return values. I made it a vector of <2 x i16>. I could not find an explanation in the documentation.

define <2 x i16> @thefunction(<2 x i16> %0, <1 x i16> %1, <2 x i16> %2) {
BB:
  %a0 = extractelement <2 x i16> %0, i32 0
  %a1 = extractelement <2 x i16> %0, i32 1
  %b0 = extractelement <1 x i16> %1, i32 0
  %c0 = extractelement <2 x i16> %2, i32 0
  %c1 = extractelement <2 x i16> %2, i32 1
  %value = call <2 x i16> asm "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 {%0,%1},{%2,%3},{%4},{%5,%6};", "=r,=r,r,r,r,r,r"(i16 %a0, i16 %a1, i16 %b0, i16 %c0, i16 %c1)
  ret <2 x i16> %value
}
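
For comparison, here is a minimal sketch of how the same call might be written. It is not a verified fix and rests on a few assumptions: LLVM IR asm templates refer to operands as $0, $1, ... (the %0 form is the C-level GCC syntax, which would explain why it shows up unchanged in the PTX); an asm call with several "=" outputs returns a literal struct rather than a vector; and each mma operand register holds a packed pair of f16 values (.f16x2), modeled here as i32 to match the 32-bit "r" constraint. The function name and the flattened i32 parameters are made up for illustration.

```llvm
; Sketch only, under the assumptions stated above.
target triple = "nvptx64-nvidia-cuda"

; Hypothetical wrapper: every i32 argument carries two packed f16 values.
define { i32, i32 } @mma_m16n8k8_sketch(i32 %a0, i32 %a1, i32 %b0, i32 %c0, i32 %c1) {
entry:
  ; $0,$1 are the two outputs (D); $2..$6 are the inputs A (2), B (1), C (2).
  ; Two "=r" outputs make the call return a two-element struct.
  %d = call { i32, i32 } asm "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 {$0,$1}, {$2,$3}, {$4}, {$5,$6};", "=r,=r,r,r,r,r,r"(i32 %a0, i32 %a1, i32 %b0, i32 %c0, i32 %c1)
  ret { i32, i32 } %d
}
```

The individual result registers can then be read with extractvalue on the returned struct, for example `extractvalue { i32, i32 } %d, 0`. Running this through llc would presumably also need a target with mma support for this shape (something like -mcpu=sm_75 and a recent enough PTX version), which I have not checked here.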