[Sparse Tensor] mlir-cpu-runner segmentation fault

I tried to compile and run sparse_matvec.mlir from the integration tests, but it crashes with a segmentation fault. Does anyone have an idea?

  1. I am using the newest repo (Clang, LLVM and MLIR).
  2. I didn’t change anything in the .mlir file.
  3. The path I gave in the screenshot is correct.

Welcome to the forum and thanks for your question @allenaann.

The command works for me (I obviously could not copy and paste from your screenshot, but I think we use the same flags)!

./build/bin/mlir-opt mlir/test/Integration/Dialect/SparseTensor/CPU/sparse_matvec.mlir  \
  --sparsification --sparse-tensor-conversion \
  --convert-vector-to-scf --convert-scf-to-std \
  --func-bufferize --tensor-constant-bufferize --tensor-bufferize \
  --std-bufferize --finalizing-bufferize  --lower-affine \
  --convert-vector-to-llvm --convert-memref-to-llvm --convert-std-to-llvm --reconcile-unrealized-casts | \
TENSOR0="./build/lib/data/wide.mtx" \
./build/bin/mlir-cpu-runner \
  -e entry -entry-point-result=void  -O3 \
  -shared-libs=./build/lib/libmlir_c_runner_utils.so

Output:

( 889, 1514, -21, -3431 )

Can you please use a debugger to get some more detail from the crash in your run? It looks like it is crashing somewhere in the sparse support library.

:joy: Thanks for your quick reply! I will first try to recompile the whole llvm-project and see what happens.

(A silly question: what debugger is suitable for this situation?)

As another data point, when you configure MLIR with

-DMLIR_INCLUDE_INTEGRATION_TESTS=ON

do you have a clean test run?

cmake --build . --target check-mlir-integration

yeah, I added the flag

You probably expect a very LLVM-like answer, but gdb will do :wink:


Here is the screenshot of the whole integration test run. All sparse-related tests failed :/

Alright :joy:

What is your host platform?

CPU: Kunpeng920 (AArch64)
env: Linux 4.18.0 CentOS 7.9.2009

LLVM should be runnable on every platform, right? Anyway, I will recompile everything first :)

CMAKE configuration:

cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_RTTI=ON -DLLVM_ENABLE_PROJECTS="libcxx;libcxxabi" -DLLVM_TARGETS_TO_BUILD="ARM;AArch64" -G "Unix Makefiles" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm;

make -j;

cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="lld" -DLLVM_TARGETS_TO_BUILD="ARM;AArch64" -G "Unix Makefiles" ../llvm;

make -j;

cmake -G "Unix Makefiles" ../llvm -DLLVM_ENABLE_PROJECTS=mlir -DLLVM_BUILD_EXAMPLES=ON -DLLVM_TARGETS_TO_BUILD="ARM;AArch64" -DCMAKE_BUILD_TYPE=Release -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++;

make -j

Can I ask you one more favor: build with -DCMAKE_BUILD_TYPE=Debug instead of Release, just to see if we get a friendlier runtime error? Also, would you mind checking whether the shared library you pass is really built for AArch64 (use the "file" utility for that)? If this does not give sufficient clues, I am afraid I will have to wait until Monday so I can ask around a bit.

Sure :grinning:

There is no more valuable information when I build with -DCMAKE_BUILD_TYPE=Debug.
The shared library is for AArch64.

I get the same error when I run with MLIRX:

Stack dump:
0.      Program arguments: /home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner -e entry -entry-point-result=void -shared-libs=/home/xiaofeng/install/mlirx/build/lib/libmlir_c_runner_utils.so
 #0 0x0000000000470980 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner+0x470980)
 #1 0x000000000046ea10 llvm::sys::RunSignalHandlers() (/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner+0x46ea10)
 #2 0x0000000000471794 SignalHandler(int) Signals.cpp:0:0
 #3 0x0000ffff9e96066c  0x66c sparsePointers8
 #4 0x0000ffff9e96066c
 #5 0x0000ffff9e96066c compileAndExecute((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**) JitRunner.cpp:0:0
 #6 0x0000ffff9e394278 compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig) JitRunner.cpp:0:0
 #7 0x0000ffff9e940028
 #8 0x0000ffff9e9401e4
 #9 0x0000ffff9e940368
#10 0x00000000006c9da8 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner+0x6c9da8)
#11 0x00000000006c82ac main (/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner+0x6c82ac)
#12 0x00000000006c707c __libc_start_main (/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner+0x6c707c)
#13 0x0000000000461b3c _start (/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner+0x461b3c)
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x470980]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x46ea10]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x471794]
[0xffff9e96066c]
/home/xiaofeng/install/mlirx/build/lib/libmlir_c_runner_utils.so(sparsePointers8+0x2c)[0xffff9e394278]
[0xffff9e940028]
[0xffff9e9401e4]
[0xffff9e940368]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x6c9da8]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x6c82ac]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x6c707c]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x461b3c]
/lib64/libc.so.6(__libc_start_main+0xf0)[0xffff9e401724]
/home/xiaofeng/install/mlirx/build/bin/mlir-cpu-runner[0x461984]
Segmentation fault

What seems most likely is simply that we don’t have any CI or maintainer for ARM64, so we may introduce portability bugs like this one that go unnoticed until someone tries it.

If you can run under gdb (in a debug build) and provide a stack trace, we may be able to do some guesswork. But it may end up needing someone with access to such a machine to reproduce and debug deeper.


Hi all, sorry it took me this long to catch this one. I’m not sure where I need to go in the code to fix it (I’ve tried and I can’t find it), but whatever is generating the caller code is misinterpreting the ABI (abi-aa/aapcs64.rst at main · ARM-software/abi-aa · GitHub). When the return struct has 9 or fewer fields, it expects the callee to return the resulting struct in registers, as rule C.12 says, but it ignores rule B.4, which says that if the result is more than 16 bytes long, it has to be returned through a buffer prepared by the caller. So the code generated for these calls (e.g. in dense-output.mlir):

  llvm.func @sparseValuesF64(!llvm.ptr<i8>) -> !llvm.struct<(ptr<f64>, ptr<f64>, i64, array<1 x i64>, array<1 x i64>)> attributes {sym_visibility = "private"}
  llvm.func @sparseIndices(!llvm.ptr<i8>, i64) -> !llvm.struct<(ptr<i64>, ptr<i64>, i64, array<1 x i64>, array<1 x i64>)> attributes {sym_visibility = "private"}
  llvm.func @sparsePointers(!llvm.ptr<i8>, i64) -> !llvm.struct<(ptr<i64>, ptr<i64>, i64, array<1 x i64>, array<1 x i64>)> attributes {sym_visibility = "private"}

expects those structures to be returned in registers, while the compiled functions in libmlir_c_runner_utils.so expect the caller to pass a buffer to return them in. If somebody points me at where this is resolved in the code, I’m happy to fix it, but it’s taking too long to find on my own. Beware though: this is just one issue, and I suspect there is at least one more. In any case, it should be easier to catch once this one is out of the way.

EDIT: I’ve found another problem with the ABI, although I don’t think this issue is triggering it. It might be worth taking a look at that part of the JITer, if someone can point me to it :slight_smile:

Cheers,
Javier


Thanks so much, Javier, for getting to the bottom of this with your careful analysis on AArch64! It is much appreciated. Just for my own understanding: is this a plain old bug in the LLVM codegen for returning structs under the AArch64 ABI, or am I missing a particular flag on the call and/or declarations? Once it has been lowered to LLVM IR (viz. the !llvm.struct<>), there is not much more I can do from the MLIR side, is there?