Loops not optimized under -Oz that should have been

I’m using the following environment:
rustc 1.52.0-nightly (1705a7d64 2021-03-18), which uses LLVM 12.0

When compiling for the thumbv8m.main-none-eabi target, some trivial code patterns are not optimized into a memset. Instead they compile to a loop that writes one byte per iteration. For large arrays this is frustratingly slow.

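pub fn memzero(slice: &mut [u8]) {
    // Three equivalent ways to zero the slice, shown together for comparison.
    // iter_mut() is used in the first loop so that `slice` is not moved.
    for i in slice.iter_mut() { *i = 0 } // Not optimized
    slice.iter_mut().for_each(|v| *v = 0); // Not optimized
    slice.fill(0); // Optimized into memset
}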

While investigating the reason, I found that LoopRotation is disabled under -Oz to avoid header duplication, but that prevents the memset optimization because of a dominator assumption.

Edit

I just found that Rust aeabi_memset/memcpy implementation for the target is basically the same… It is a different / separate issue though.

> I was investigating the reason and find out that LoopRotation is disabled in Oz to avoid header duplication, but that prevents Memset Optimization because of Dominator assumption.

Yes, this is a known, unfortunate side effect of -Oz.

It may be possible to convert this to a memset, even without loop rotation, but for further investigation, the LLVM IR would be helpful. Can you share the IR or a link to godbolt or something?

Here’s the link: Compiler Explorer

A quick inspection of what’s happening in LLVM:
In the LoopIdiomRecognize pass, the memset optimization tries to find a candidate store, but bb6 is not scanned because it does not dominate all the basic blocks in the loop. If I manually modify the LLVM source code to enable the LoopRotation pass, the loop is transformed into a single-basic-block loop, which enables the memset optimization.
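To make the two loop shapes concrete, here is a rough source-level sketch in Rust. It only illustrates the control flow; the function names are made up for this example, and the compiler is free to choose a different IR shape for either function, so this is not a reproduction of what rustc emits.

```rust
/// Unrotated (header-exiting) shape: the exit test runs first, and the
/// store sits in a second block that the exit path skips.
pub fn zero_unrotated(buf: &mut [u8]) {
    let mut i = 0;
    loop {
        if i == buf.len() {
            break; // header: the only exit from the loop
        }
        buf[i] = 0; // latch: the store
        i += 1;
    }
}

/// Rotated shape: the exit test is duplicated as a guard in front of the
/// loop, and the loop itself becomes one block that ends with the test.
pub fn zero_rotated(buf: &mut [u8]) {
    if buf.is_empty() {
        return; // guard: the duplicated header test
    }
    let mut i = 0;
    loop {
        buf[i] = 0; // the store now runs on every pass through the loop
        i += 1;
        if i == buf.len() {
            break;
        }
    }
}
```

The guard in the second function is the header duplication that -Oz tries to avoid. In exchange, the store ends up in the only block of the loop.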

Thanks, so basically loop-idiom does not convert something like the following to a memset (Compiler Explorer):

define void @test(i8* noalias nonnull align 1 %start, i8* %end) {
entry:
  br label %loop.header

loop.header:
  %ptr.iv = phi i8* [ %start, %entry ], [ %ptr.iv.next, %loop.latch ]
  %_12.i = icmp eq i8* %ptr.iv, %end
  br i1 %_12.i, label %exit, label %loop.latch

loop.latch:
  %ptr.iv.next = getelementptr inbounds i8, i8* %ptr.iv, i64 1
  store i8 1, i8* %ptr.iv, align 1
  br label %loop.header

exit:
  ret void
}

I don’t think it would be too hard to support unrotated loops. We should still be able to detect memsets even if we have the header-exiting loop form. When doing the transform, we’d need to duplicate the header, but that could be limited to cases where the rest of the loop is empty after forming the memset.
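As a rough illustration of the equivalence the transform would rely on, the Rust sketch below mirrors the header-exiting pointer loop from the IR and the single bulk write it could become. The function names are hypothetical, and this does not model what LoopIdiomRecognize does internally or the header duplication mentioned above. It only shows that the two forms write the same bytes.

```rust
use std::ptr;

/// Mirrors the header-exiting IR loop: compare the pointer against `end`
/// first, then store one byte and advance.
pub fn fill_loop(buf: &mut [u8], value: u8) {
    let range = buf.as_mut_ptr_range();
    let (mut p, end) = (range.start, range.end);
    while p != end {
        // SAFETY: `p` lies inside `buf` because it has not reached `end`.
        unsafe {
            p.write(value);
            p = p.add(1);
        }
    }
}

/// The same effect as one bulk write: the length is the pointer distance
/// `end - start`, which is the number of times the store would run.
pub fn fill_memset(buf: &mut [u8], value: u8) {
    let range = buf.as_mut_ptr_range();
    // SAFETY: both pointers come from the same slice and `end >= start`.
    let len = unsafe { range.end.offset_from(range.start) } as usize;
    // SAFETY: `start` is valid for `len` byte writes; a zero length is a no-op.
    unsafe { ptr::write_bytes(range.start, value, len) };
}
```

Because a zero-length write does nothing, the empty-range case needs no special handling in this sketch.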

Probably worth an issue: https://bugs.llvm.org

Thanks for filing the bug! For reference, the PR is 50964 – Trivial memset optimization not applied to loops under -Oz (LoopIdiomRecognize)
