Seg fault when returning a float array on osx/arm64

Hi peeps,

I’ve got a function which returns a float array, and this causes a problem on arm64 when llc is run on it with -O0, but works with -O1. Array sizes up to and including 8 elements are fine; anything larger is not. Vectors larger than 8 elements are fine too.

The code works correctly at both optimisation levels on x64. I’ve tried osx/arm64 and linux/arm64 and get the same behaviour, while osx/x64 and linux/x64 work correctly. Reducing the code down to the simplest failing example gives this nugget:

target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-macosx11.0.0"
; Function Attrs: noinline norecurse nounwind optnone ssp uwtable
define i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = call [9 x float] @returnArray()
  store i32 0, i32* %1, align 4
  ret i32 4
}


; Function Attrs: argmemonly nounwind
define private [9 x float] @returnArray() #0 {
  ret [9 x float] zeroinitializer
}

Here’s a comparison of the different optimisation levels:

jenkins@server failing % llc test.failing.ll -O1
jenkins@server failing % clang test.failing.s
jenkins@server failing % ./a.out 
jenkins@server failing % echo $?
4
jenkins@server failing % llc test.failing.ll -O0
jenkins@server failing % clang test.failing.s   
jenkins@server failing % ./a.out                
zsh: segmentation fault  ./a.out
jenkins@server failing % 

The llc build is a recent LLVM 13 build:

jenkins@server failing % llc --version
LLVM (http://llvm.org/):
  LLVM version 13.0.1
  Optimized build with assertions.
  Default target: arm64-apple-darwin20.3.0
  Host CPU: cyclone

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_32 - AArch64 (little endian ILP32)
    aarch64_be - AArch64 (big endian)
    arm        - ARM
    arm64      - ARM64 (little endian)
    arm64_32   - ARM64 (little endian ILP32)
    armeb      - ARM (big endian)
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    wasm32     - WebAssembly 32-bit
    wasm64     - WebAssembly 64-bit
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64
jenkins@server failing % 

So the question is, what am I doing wrong?

Any help appreciated!

Cesare

Just in case this helps, here’s the assembler for -O1:

jenkins@server failing % llc test.failing.ll -O1
jenkins@server failing % cat test.failing.s 
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_main                           ; -- Begin function main
	.p2align	2
_main:                                  ; @main
	.cfi_startproc
; %bb.0:
	sub	sp, sp, #64                     ; =64
	stp	x29, x30, [sp, #48]             ; 16-byte Folded Spill
	.cfi_def_cfa_offset 64
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	add	x8, sp, #8                      ; =8
	bl	l_returnArray
	ldp	x29, x30, [sp, #48]             ; 16-byte Folded Reload
	mov	w0, #4
	str	wzr, [sp, #44]
	add	sp, sp, #64                     ; =64
	ret
	.cfi_endproc
                                        ; -- End function
	.p2align	2                               ; -- Begin function returnArray
l_returnArray:                          ; @returnArray
	.cfi_startproc
; %bb.0:
	movi.2d	v0, #0000000000000000
	str	wzr, [x8, #32]
	stp	q0, q0, [x8]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols

and here’s -O0:

jenkins@server failing % cat test.failing.s     
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_main                           ; -- Begin function main
	.p2align	2
_main:                                  ; @main
	.cfi_startproc
; %bb.0:
	sub	sp, sp, #32                     ; =32
	stp	x29, x30, [sp, #16]             ; 16-byte Folded Spill
	.cfi_def_cfa_offset 32
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	bl	l_returnArray
	str	wzr, [sp, #12]
	mov	w0, #4
	ldp	x29, x30, [sp, #16]             ; 16-byte Folded Reload
	add	sp, sp, #32                     ; =32
	ret
	.cfi_endproc
                                        ; -- End function
	.p2align	2                               ; -- Begin function returnArray
l_returnArray:                          ; @returnArray
	.cfi_startproc
; %bb.0:
	mov	x9, x8
	mov	w8, wzr
	str	w8, [x9, #32]
	str	w8, [x9, #28]
	str	w8, [x9, #24]
	str	w8, [x9, #20]
	str	w8, [x9, #16]
	str	w8, [x9, #12]
	str	w8, [x9, #8]
	str	w8, [x9, #4]
	str	w8, [x9]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols
jenkins@server failing % 

I don’t know arm64, so I’m unsure what to expect or whether this is an obvious code-gen issue, but given how simple it is to recreate, I’m assuming it’s not.

Yep, I seem to hit a similar case with a [16 x float] as the return type. If opt is used, it just works.
The same IR is correct on x86/x64.
Not working:

 %17 = call [16 x float] @main.hack_matrix4_from_trs_f32([3 x float] %13, [4 x float] %16, [3 x float] %14, i8* %__.context_ptr)

Working:

%17 = call fastcc [16 x float] @main.hack_matrix4_from_trs_f32([3 x float] %13, [4 x float] %16, [3 x float] %14)

(no, the context pointer doesn’t matter here)

This seems to be a clang driver error, since when you build it with llc and then link the result with clang it doesn’t fail. So something is fishy…

OK, update: it’s not a driver error. llc defaults to -O2, which basically papers over this problem, but that’s not a real fix.

Yes, there’s something fishy going on. From what I’ve read online, the arm64 ABI has a hard limit on the size of a return value that fits in registers; anything larger is returned indirectly, with the caller passing the address of a buffer in x8. At the IR level the convention is a function argument marked ‘sret’, which is what drives the return-value logic through that x8 buffer.

If we look at the generated assembler:

jenkins@server failing % cat test.failing.s
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_main                           ; -- Begin function main
	.p2align	2
_main:                                  ; @main
	.cfi_startproc
; %bb.0:
	sub	sp, sp, #32                     ; =32
	stp	x29, x30, [sp, #16]             ; 16-byte Folded Spill
	.cfi_def_cfa_offset 32
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	bl	l_returnArray
	str	wzr, [sp, #12]
	mov	w0, #4
	ldp	x29, x30, [sp, #16]             ; 16-byte Folded Reload
	add	sp, sp, #32                     ; =32
	ret
	.cfi_endproc
                                        ; -- End function
	.p2align	2                               ; -- Begin function returnArray
l_returnArray:                          ; @returnArray
	.cfi_startproc
; %bb.0:
	mov	x9, x8
	mov	w8, wzr
	str	w8, [x9, #32]
	str	w8, [x9, #28]
	str	w8, [x9, #24]
	str	w8, [x9, #20]
	str	w8, [x9, #16]
	str	w8, [x9, #12]
	str	w8, [x9, #8]
	str	w8, [x9, #4]
	str	w8, [x9]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols

We see that the l_returnArray function is storing the correct values into the memory pointed to by x8 (my arm64 isn’t good, but I think that’s what’s happening). However, the caller never sets x8 to a valid location before the call, so x8 is likely junk. Compare the -O1 output above, where main does an add x8, sp, #8 before the bl.

If I generate LLVM IR for a C++ function that returns a large structure, I see similar logic in the generated assembler, but with a chunk of stack allocated in the caller and its address assigned to x8. The C++-generated IR doesn’t model the return value as an LLVM function return at all; it uses the above-mentioned sret argument instead.
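
For illustration, here’s roughly the shape of IR I mean (a hand-written sketch, not actual clang output; the struct and function names are made up):

%struct.Mat3 = type { [9 x float] }

; the caller allocates the buffer and passes its address via sret;
; on arm64 that address is what ends up in x8
define void @caller() {
  %tmp = alloca %struct.Mat3, align 4
  call void @returnMat3(%struct.Mat3* noalias sret(%struct.Mat3) align 4 %tmp)
  ret void
}

declare void @returnMat3(%struct.Mat3* noalias sret(%struct.Mat3) align 4)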

So, my current thinking is that either clang always does this (and knows the ABI limits, hence generates different IR for different platforms), or there is a transformation that can be applied to do it for you and generate the correct logic. I’ll experiment with that train of thought, but I’m speculating wildly at the moment and could be way off the mark.
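
In case it helps anyone following along, here’s a sketch of what the reproducer might look like after that kind of rewrite, with the aggregate return replaced by an sret argument (hand-written and untested, so treat it as a guess at the transformation rather than known-good output):

define i32 @main() {
  %1 = alloca i32, align 4
  ; buffer for the indirect return, allocated in the caller's frame
  %2 = alloca [9 x float], align 4
  call void @returnArray([9 x float]* sret([9 x float]) align 4 %2)
  store i32 0, i32* %1, align 4
  ret i32 4
}

define private void @returnArray([9 x float]* sret([9 x float]) align 4 %0) {
  ; write the result through the caller-provided buffer instead of returning it
  store [9 x float] zeroinitializer, [9 x float]* %0, align 4
  ret void
}

With the buffer allocated in main’s frame, the caller has something concrete to put in x8, which is exactly what the -O0 output above is missing.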