The correct way to convert scf::ForOp's signature?

Hello,

Sorry for the long description, and thanks for your patience! I want to change the type of scf::ForOp's iterArgs. I guessed it would be easy, but I have been blocked here for a while. After some trial and error, I came up with a patch, but it still has bugs in some cases.

This is the input MLIR:

%12 = vector.transfer_read %8[%c0, %c0], %cst {..} : memref<1x4xf32, #map1>, vector<1x1xf32>
%13 = vector.transfer_read %8[%c0, %c1], %cst {..} : memref<1x4xf32, #map1>, vector<1x1xf32>

%16:2 = scf.for %arg3 = %c0 to %c1024 step %c2
                iter_args(%arg4 = %12, %arg5 = %13) -> 
                (vector<1x1xf32>, vector<1x1xf32>) {
  ...
  %32 = vector.contract {indexing_maps = [#map5, #map6, #map7], iterator_types = ["parallel", "parallel", "reduction"]} %21, %22, %arg4 : vector<1x1xf32>, vector<1x1xf32> into vector<1x1xf32>
  ...
  scf.yield %31, %32 : vector<1x1xf32>, vector<1x1xf32>
}
vector.transfer_write %16#1, %8[%c0, %c3] {..} : vector<1x1xf32>, memref<1x4xf32, #map1>
vector.transfer_write %16#0, %8[%c0, %c2] {..} : vector<1x1xf32>, memref<1x4xf32, #map1>

Through the ConvertVectorToGPUPass, the vector ops are converted to standard ops:

%12 = load %8[%c0, %c0] : memref<1x4xf32, #map1>
%13 = load %8[%c0, %c1] : memref<1x4xf32, #map1>
%16:2 = scf.for %arg3 = %c0 to %c1024 step %c2
                iter_args(%arg4 = %12, %arg5 = %13) ->
                (f32, f32) {
  ...
  %43 = mulf %27, %29 : f32
  %44 = addf %43, %arg4 : f32
  ...
  scf.yield %44, %46 : f32, f32
}
store %16#1, %8[%c0, %c3] {..} : f32, memref<1x4xf32, #map1>
store %16#0, %8[%c0, %c2] {..} : f32, memref<1x4xf32, #map1>

In order to convert vector.contract %21, %22, %arg4 successfully, I have to convert %arg4 to f32. (%21 and %22 become f32 naturally, so only %arg4 needs converting.)

This is my method, but it is buggy:

(1) I create a new scf::ForOp with f32 iterArgs, and erase the original one.

std::map<Operation *, Value> loopResultUserMap;

void convertForOpSignatureToF32(FuncOp funcOp, MLIRContext *ctx) {
  bool changed = true;
  while (changed) {
    changed = false;
    funcOp.walk([&](Operation *op) {
      auto loop = dyn_cast<scf::ForOp>(op);

      if (!loop || loop.getNumResults() == 0 ||
          loop.getResult(0).getType() == FloatType::getF32(ctx))
        return WalkResult::advance();

Create a new scf::ForOp with f32 iterArgs

      llvm::SmallVector<Type, 4> newResultTypes;
      for (unsigned i = 0; i < loop.getNumResults(); ++i)
        newResultTypes.push_back(FloatType::getF32(ctx));

      OpBuilder builder(loop);
      auto newloop = cloneWithNewResultTypes(loop, newResultTypes);
      builder.insert(newloop);

      unsigned i = 0;
      for (OpResult result : loop.getResults()) {
        for (auto &use : result.getUses())
          loopResultUserMap[use.getOwner()] = newloop->getResult(i);
        ++i;
      }

I found I can't directly use loop.getResult(i).replaceAllUsesWith(newloop->getResult(i)); otherwise I get a crash when I try to dump vector.contract later.

      loop.erase();
      changed = true;
      return WalkResult::interrupt();
    });
  }
}

(2) In the partial conversion of the vector dialect to the gpu/standard dialect, when converting vector::TransferReadOp to LoadOp:

  LogicalResult matchAndRewrite(...) {
    Value newOp = rewriter.create<LoadOp>(loc, op.memref(), op.indices());
    rewriter.replaceOp(op, newOp);

If the load is used by an scf::ForOp, change the corresponding BlockArgument type to f32 as well.

    for (auto &use : op.vector().getUses()) {
      if (scf::ForOp loop = dyn_cast<scf::ForOp>(use.getOwner())) {
        unsigned argNo = use.getOperandNumber() - 3;
        BlockArgument newArg =
            loop.getBody()->insertArgument(argNo + 1, newOp.getType());
        loop.getBody()->getArgument(argNo + 2).replaceAllUsesWith(newArg);
        loop.getBody()->eraseArgument(argNo + 2);
      }
    }
  }

(3) When converting vector::TransferWriteOp to StoreOp:

  LogicalResult matchAndRewrite(...) {
    Value operand0 = operands[0];
    if (loopResultUserMap.find(op.getOperation()) != loopResultUserMap.end()) {
      operand0 = loopResultUserMap[op.getOperation()];
    }

If the store uses the original scf::ForOp's result, change operand0 to use the new scf::ForOp's result instead.

    rewriter.create<StoreOp>(loc, operand0, operands[1], op.indices());
    rewriter.eraseOp(op);
  }

After these steps, I can convert the vector ops to standard ops, but I found that if the scf::ForOp's iterArg count is greater than 16 (<= 16 is fine), the compiler crashes.

I also tried SignatureConversion, but I still can't make it work.
Experiment 1:

class ForOpConversionHack final : public VectorToGPUPattern<scf::ForOp> {
public:
  LogicalResult matchAndRewrite(..) const override
  {
    TypeConverter::SignatureConversion signatureConverter(loop.getNumOperands());
    for (unsigned i = 3; i < loop.getNumOperands(); ++i)
      signatureConverter.addInputs(i, FloatType::getF32(ctx));

    rewriter.applySignatureConversion(&loop.getRegion(), signatureConverter);
    return success();
  }
};

  OwningRewritePatternList pattern;
  pattern.insert<ForOpConversionHack>(ctx);
  std::unique_ptr<ConversionTarget> target = std::make_unique<ConversionTarget>(*ctx);
  applyPartialConversion(func, *target, pattern);

Experiment 2:

  ConversionPatternRewriter rewriter(ctx);
  bool changed = true;
  while (changed) {
    changed = false;
    func.walk([&](Operation *op) {
      auto loop = dyn_cast<scf::ForOp>(op);

      if (!loop || loop.getNumResults() == 0 || loop.getResult(0).getType() == FloatType::getF32(ctx))
        return WalkResult::advance();

      TypeConverter::SignatureConversion signatureConverter(loop.getNumOperands());
      for (unsigned i = 3; i < loop.getNumOperands(); ++i)
        signatureConverter.addInputs(i, FloatType::getF32(ctx));

      rewriter.applySignatureConversion(&loop.getRegion(), signatureConverter);
      return WalkResult::interrupt();
    });
  }

Thanks in advance!
CY

Hi CY,

I think the problem you run into is that within applyPartialConversion you cannot modify operations outside the operation being matched and rewritten; otherwise the rewriter gets confused.
If you do this within applyPartialConversion, then the operands of the scf.for should already have been converted by the time it gets to the scf.for, so you really just need to update the types on the operation itself.

For that you can create a new scf.ForOp, remap the iteration arguments using TypeConverter::SignatureConversion, and then replace the old ForOp with the new one you created.

Something like this works for me:

class ForOpConversionHack
    : public VectorToGPUPattern<scf::ForOp> {
 public:
  using VectorToGPUPattern<scf::ForOp>::VectorToGPUPattern;

  LogicalResult matchAndRewrite(
      scf::ForOp op, ArrayRef<Value> operands,
      ConversionPatternRewriter &rewriter) const override {
    scf::ForOp::Adaptor adaptor(operands);
    TypeConverter::SignatureConversion signatureConverter(
        adaptor.initArgs().size() + 1);
    scf::ForOp newFor = rewriter.create<scf::ForOp>(
        op.getLoc(), adaptor.lowerBound(), adaptor.upperBound(), adaptor.step(),
        adaptor.initArgs());
    signatureConverter.addInputs(0, newFor.getInductionVar().getType());
    for (auto it : llvm::enumerate(newFor.getIterOperands()))
      signatureConverter.addInputs(it.index()+1, it.value().getType());
    rewriter.applySignatureConversion(&op.getLoopBody(), signatureConverter);
    rewriter.inlineRegionBefore(op.getLoopBody(), newFor.getLoopBody(),
                                newFor.getLoopBody().begin());
    // Remove the default block added during create<ForOp>
    newFor.getLoopBody().back().erase();
    rewriter.replaceOp(op, newFor.results());
    return success();
  }
};


@ThomasRaoux
It works perfectly for me!!! I am still trying to figure out why. I realized that I am still too new to hacking or implementing things in MLIR/IREE :disappointed_relieved:

I also got some verified performance numbers from our Chromebook GPU and an NV GPU; I will share them later.

==
(By the way, here is some glue code in VectorToGPUPass.cpp for those who are also interested.)

  patterns.insert<ForOpConversionHack, UnaryAndBinaryOpPattern<AddFOp>,
                  VectorTransferReadConversion, VectorTransferWriteConversion,
                  VectorTransferReadConversionScalar,
                  VectorTransferWriteConversionScalar, VectorContractScalar>(
      context, cooperativeMatrixAnalysis);

  std::unique_ptr<VectorToGPUConversionTarget> target =
      std::make_unique<VectorToGPUConversionTarget>(*context);
  target->addDynamicallyLegalOp<scf::ForOp>([&](scf::ForOp loop) -> bool {
    if (loop.getNumResults() == 0 ||
        loop.getResult(0).getType() == FloatType::getF32(context))
      return true;
    return false;
  });
  target->addDynamicallyLegalDialect<StandardOpsDialect>();
  target->addIllegalOp<scf::ParallelOp>();
  target->addLegalOp<scf::YieldOp>();
  // target->addLegalOp<scf::ForOp>(); <= remove this.
  target->addLegalDialect<gpu::GPUDialect>();
  if (failed(applyPartialConversion(funcOp, *target, patterns)))
    return signalPassFailure();

The benchmarking is still running, but I can see some good numbers on our platform.

All of the results are verified.

On a Chromebook gx6250 @ 455 MHz (peak fp32: ~55 GFLOPS)
Matrix size: 1024x1024x1024, tile size: 32x32x8, (Wait idle time: 100.819ms)
Matrix size: 1024x1024x1024, tile size: 32x32x4, (Wait idle time: 109.603ms)

The hand-tuned code takes 50 ms. I have checked the generated GPU assembly code; there are some differences, and I am trying to figure out why, because I think the MLIR+IREE generated code should be almost the same as the hand-tuned one. I think it should be able to match or even beat the hand-tuned code, since I started from 500 ms two weeks ago :wink:

On an NV gt1030
Matrix size: 1024x1024x1024, tile size: 32x32x8, Compute shader execution time: 9.37677ms, (Wait idle time: 9.404ms)
Matrix size: 1024x1024x1024, tile size: 32x32x4, Compute shader execution time: 10.7428ms, (Wait idle time: 10.769ms)

I will try to get a cuBLAS result on my PC (but it's low priority in my schedule).

== Detail Settings ==

Matmul CodeGen Strategy is:

    strategy
        .tile<linalg::MatmulOp>(
            linalg::LinalgTilingOptions()
                .setLoopType(linalg::LinalgTilingLoopType::ParallelLoops)
                .setTileSizes({tileM, tileN, tileK})
                .setInterchange({1, 0, 2})
                .setDistributionOptions(WGDistribute))
        .setHoistInvariantCode(true);
    strategy.promote<linalg::MatmulOp>(
        linalg::LinalgPromotionOptions()
            .setAllocationDeallocationFns(
                mlir::iree_compiler::allocateWorkgroupMemory,
                mlir::iree_compiler::deallocateWorkgroupMemory)
            .setCopyInOutFns(mlir::iree_compiler::copyToWorkgroupMemory,
                             mlir::iree_compiler::copyToWorkgroupMemory)
            // promote matrix B only because tile of B is shared in a workgroup.
            .setOperandsToPromote({1})
            .setUseFullTileBuffers({false, false}));
    strategy.tile<linalg::MatmulOp>(
        linalg::LinalgTilingOptions()
            .setLoopType(linalg::LinalgTilingLoopType::ParallelLoops)
            .setTileSizes({1, tileN, tileK})
            .setDistributionOptions(WIDistribute));

Distribution settings:

    linalg::LinalgLoopDistributionOptions WGDistribute;
    WGDistribute.distributionMethod = {
        linalg::DistributionMethod::CyclicNumProcsEqNumIters,
        linalg::DistributionMethod::CyclicNumProcsEqNumIters};
    WGDistribute.procInfo = getGpuProcIds<gpu::BlockIdOp, gpu::GridDimOp>;

    linalg::LinalgLoopDistributionOptions WIDistribute;
    WIDistribute.distributionMethod = {
        linalg::DistributionMethod::CyclicNumProcsEqNumIters,
        linalg::DistributionMethod::CyclicNumProcsEqNumIters};
    WIDistribute.procInfo = getGpuProcIds(tileM);

template <typename IdOp, typename NProcsOp>
static SmallVector<linalg::ProcInfo, 2> getGpuProcIds(
    OpBuilder &b, Location loc, ArrayRef<SubViewOp::Range> parallelLoopRanges) {
  if (parallelLoopRanges.size() != 2)
    llvm_unreachable("expected two parallel loops for matmul operation");
  Type indexType = b.getIndexType();
  SmallVector<linalg::ProcInfo, 2> procInfo(2);
  procInfo[0] = {b.create<IdOp>(loc, indexType, b.getStringAttr("y")),
                 b.create<NProcsOp>(loc, indexType, b.getStringAttr("y"))};
  procInfo[1] = {b.create<IdOp>(loc, indexType, b.getStringAttr("x")),
                 b.create<NProcsOp>(loc, indexType, b.getStringAttr("x"))};
  return procInfo;
}

template <typename IdOp, int N>
static SmallVector<linalg::ProcInfo, 2> getGpuProcIds(
    OpBuilder &b, Location loc, ArrayRef<SubViewOp::Range> parallelLoopRanges) {
  if (parallelLoopRanges.size() != 2)
    llvm_unreachable("expected two parallel loops for matmul operation");
  Type indexType = b.getIndexType();
  SmallVector<linalg::ProcInfo, 2> procInfo(2);
  procInfo[0] = {b.create<IdOp>(loc, indexType, b.getStringAttr("x")),
                 b.create<ConstantIndexOp>(loc, N)};
  procInfo[1] = {b.create<ConstantIndexOp>(loc, 0),
                 b.create<ConstantIndexOp>(loc, N)};
  return procInfo;
}

static linalg::ProcInfoCallBackFn getGpuProcIds(int tileN) {
  switch (tileN) {
    case 2:
      return getGpuProcIds<gpu::ThreadIdOp, 2>;
    case 4:
      return getGpuProcIds<gpu::ThreadIdOp, 4>;
    case 8:
      return getGpuProcIds<gpu::ThreadIdOp, 8>;
    case 16:
      return getGpuProcIds<gpu::ThreadIdOp, 16>;
    case 32:
      return getGpuProcIds<gpu::ThreadIdOp, 32>;
    case 64:
      return getGpuProcIds<gpu::ThreadIdOp, 64>;
    case 128:
      return getGpuProcIds<gpu::ThreadIdOp, 128>;
    default:
      llvm_unreachable("unsupported tileN size.");
  }
}

This looks great, thanks for sharing! I think you want to remove the .setInterchange({1, 0, 2}) in the first tiling. It was originally there due to a bug on my side; since getGpuProcIds maps the first dimension to BlockId.y, it shouldn't be there. That may improve perf slightly, although on my NV card it didn't make a significant difference.

Is the promotion copy using scalar f32 loads/stores? Would you want to do vec4 loads/stores instead? vec4 should be more efficient, although maybe the tile size is too small to have each WI do a vec4 load/store. I was working on making the copy more efficient: I added tileAndVectorizeLinalgCopy and VectorizeMemRefPass, but the latter probably doesn't handle your case yet.

By the way, I assume the hand-written kernel is not open source, but if it is I would be curious to see it :slight_smile:

@ThomasRaoux Thanks =)

.setInterchange({1, 0, 2})

I did a quick test; I still get around 100 ms with the 32x32x8 tile size on the gx6250.

Is the promotion copy using scalar f32 load/store?

I've checked the assembly code; it looks like our backend compiler can still generate bursting loads (but this is actually weird to me… sorry, the assembly code is quite complex to read).

I don't know if part of the gap comes from the index computation when loading shared B, because in our case we just use a simple add, while tileAndVectorizeLinalgCopy will generate:

// %58 = number of threads in a workgroup
scf.for %arg36 = %57 to %c128 step %58 {
  %448 = divi_signed %arg36, %c32 : index
  %449 = remi_signed %arg36, %c32 : index
  %450 = subview %46[%448, %449] [1, 1] [1, 1]  : memref<4x32xf32, #map1> to memref<1x1xf32, #map1>
  %451 = subview %10[%448, %449] [1, 1] [1, 1]  : memref<4x32xf32, #map2, 3> to memref<1x1xf32, #map3, 3>
  %452 = load %450[%c0, %c0] : memref<1x1xf32, #map1>
  store %452, %451[%c0, %c0] : memref<1x1xf32, #map3, 3>
}

Our hand-written kernel just uses a base pointer with a fixed stride:

float *ptr = B + offset;
float *localB = ..;
for (int i = 0; i < 4; ++i) {
  localB[local.x] = ptr[local.x];
  localB += 32;
  ptr += strideB * 4;
}

the hand written kernel is not open source

Yes :pleading_face: But the matrix configuration should almost match our hand-written code :wink: I am still checking the complex assembly code to figure out where the 50 ms gap comes from!

By the way, I also tested cuBLAS. This is my update:
(1) Installed CUDA 11 for the gt1030 on Ubuntu 18.04.
(2) Redid the bench-matmul test; I got:

Matrix size: 1024x1024x1024, tile size: 32x32x8, Compute shader execution time: 7.98106ms, (Wait idle time: 8.146ms)
pass

That is 2 ms faster than before. This is the driver version:

NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0  

To benchmark sgemm, I use this benchmark: GitHub - hma02/cublasgemm-benchmark: code for benchmarking GPU performance based on cublasSgemm and cublasHgemm

      cudaEventRecord(start, 0);
      stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
      cudaEventRecord(stop, 0);
      cudaEventSynchronize(stop);  // wait for the GEMM to finish before reading the timer
      cudaEventElapsedTime(&elapsed, start, stop);

The number I got for 1024x1024x1024 is 5.7 ms:

float32: size 1024 average: 0.0057687 s

But I think this also includes the data copy of matrices A and B to the GPU, and C to/from the GPU?

Its peak f32 is 1,127 GFLOPS, so the ideal execution time for 1024x1024x1024 is:

// 2 = fmul + fadd.
// 2*(1024^3) / (1000^3) = 2.14 GFLOP
(2*(1024^3) / (1000^3)) / 1127 = 1.9 ms
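
For reference, the same arithmetic as a tiny standalone C++ helper (hypothetical, just for sanity-checking the numbers; the gx6250 line uses the ~55 GFLOPS peak quoted above):

#include <cstdio>

// Roofline-style lower bound for an MxNxK f32 matmul:
// 2*M*N*K flops (fmul + fadd) divided by the peak throughput in GFLOPS.
static double idealMatmulMs(double m, double n, double k, double peakGflops) {
  double gflop = 2.0 * m * n * k / 1.0e9;
  return gflop / peakGflops * 1.0e3;  // milliseconds
}

int main() {
  std::printf("gt1030: %.2f ms\n", idealMatmulMs(1024, 1024, 1024, 1127));  // ~1.9 ms
  std::printf("gx6250: %.2f ms\n", idealMatmulMs(1024, 1024, 1024, 55));    // ~39 ms
  return 0;
}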

@ThomasRaoux

I copied some code from JitRunner.cpp and ModelBuilder.cpp in order to feed an MLIR file to bench-matmul directly for debugging:

static llvm::cl::opt<std::string> inputMlirFile(
    "input", llvm::cl::desc("Use input mlir file instead (for testing..)"));

static OwningModuleRef parseMLIRInput(StringRef inputFilename,
                                      MLIRContext *context) {
  // Set up the input file.
  std::string errorMessage;
  auto file = openInputFile(inputFilename, &errorMessage);
  if (!file) {
    llvm::errs() << errorMessage << "\n";
    return nullptr;
  }

  llvm::SourceMgr sourceMgr;
  sourceMgr.AddNewSourceBuffer(std::move(file), llvm::SMLoc());
  return OwningModuleRef(parseSourceFile(sourceMgr, context));
}

static MLIRContext globalCtx;
static OwningModuleRef createModuleFromFile(StringRef filename,
                                            MLIRContext &ctx) {
  ctx.getOrLoadDialect<AffineDialect>();
  ctx.getOrLoadDialect<gpu::GPUDialect>();
  ctx.getOrLoadDialect<LLVM::LLVMDialect>();
  ctx.getOrLoadDialect<linalg::LinalgDialect>();
  ctx.getOrLoadDialect<scf::SCFDialect>();
  ctx.getOrLoadDialect<spirv::SPIRVDialect>();
  ctx.getOrLoadDialect<StandardOpsDialect>();
  ctx.getOrLoadDialect<vector::VectorDialect>();
  return parseMLIRInput(filename, &ctx);
}
------------------------------------------------
  OwningModuleRef moduleRef;
  if (inputMlirFile.size()) {
    moduleRef = createModuleFromFile(inputMlirFile, globalCtx);
  }

  auto getModuleRef = [&]() -> OwningModuleRef & {
    if (moduleRef) return moduleRef;
    return modelBuilder.getModuleRef();
  };

And I modified the index computation manually:

      %45 = subview %arg0[%2, %arg3] [32, 4] [1, 1]  : memref<1024x1024xf32> to memref<32x4xf32, #map1>
      %46 = subview %arg1[%arg3, %3] [4, 32] [1, 1]  : memref<1024x1024xf32> to memref<4x32xf32, #map1>
      %47 = "gpu.thread_id"() {dimension = "x"} : () -> index
      %48 = "gpu.block_dim"() {dimension = "x"} : () -> index
      %49 = "gpu.thread_id"() {dimension = "y"} : () -> index
      %50 = "gpu.block_dim"() {dimension = "y"} : () -> index
      %51 = "gpu.thread_id"() {dimension = "z"} : () -> index
      %52 = "gpu.block_dim"() {dimension = "z"} : () -> index
      %53 = muli %51, %50 : index // z * y.dim
      %54 = addi %53, %49 : index // z * y.dim + y
      %55 = muli %52, %50 : index // z.dim * y.dim
      %56 = muli %54, %48 : index // (z * y.dim + y) * x.dim
      %57 = addi %56, %47 : index // (z * y.dim + y) * x.dim + x
      %58 = muli %55, %48 : index // z.dim * y.dim * x.dim
      //scf.for %arg36 = %57 to %c128 step %58 {
      // local group size = [32, 1, 1]
      // So each thread x just load one element of each row, and we have 4 rows.
      scf.for %arg36 = %c0 to %c4 step %c1 {
        // %448 = divi_signed %arg36, %c32 : index
        // %449 = remi_signed %arg36, %c32 : index
        // %450 = subview %46[%448, %449] [1, 1] [1, 1]  : memref<4x32xf32, #map1> to memref<1x1xf32, #map1>
        // %451 = subview %10[%448, %449] [1, 1] [1, 1]  : memref<4x32xf32, #map2, 3> to memref<1x1xf32, #map3, 3>
        %450 = subview %46[%arg36, %47] [1, 1] [1, 1]  : memref<4x32xf32, #map1> to memref<1x1xf32, #map1>
        %451 = subview %10[%arg36, %47] [1, 1] [1, 1]  : memref<4x32xf32, #map2, 3> to memref<1x1xf32, #map3, 3>
        %452 = load %450[%c0, %c0] : memref<1x1xf32, #map1>
        store %452, %451[%c0, %c0] : memref<1x1xf32, #map3, 3>
      }

The result is verified, and the performance for 1024x1024x1024 with the [32x32x4] tile size went from 110 ms to 90 ms.

The SPIR-V I got:

      ^bb2:  // pred: ^bb1
        spv.loop {
          spv.Branch ^bb1(%32 : i32)
        ^bb1(%2364: i32):  // 2 preds: ^bb0, ^bb2
          %2365 = spv.SLessThan %2364, %4 : i32
          spv.BranchConditional %2365, ^bb2, ^bb3
        ^bb2:  // pred: ^bb1
          %2366 = spv.IAdd %750, %2364 : i32
          %2367 = spv.IAdd %41, %44 : i32
          %2368 = spv.constant 0 : i32
          %2369 = spv.constant 0 : i32
          %2370 = spv.constant 1024 : i32
          %2371 = spv.IMul %2370, %2366 : i32
          %2372 = spv.IAdd %2369, %2371 : i32
          %2373 = spv.constant 1 : i32
          %2374 = spv.IMul %2373, %2367 : i32
          %2375 = spv.IAdd %2372, %2374 : i32
          %2376 = spv.AccessChain %arg1[%2368, %2375] : !spv.ptr<!spv.struct<!spv.array<1048576 x f32, stride=4> [0]>, StorageBuffer>, i32, i32
          %2377 = spv.Load "StorageBuffer" %2376 : f32
          %2378 = spv.constant 0 : i32
          %2379 = spv.constant 0 : i32
          %2380 = spv.constant 32 : i32
          %2381 = spv.IMul %2380, %2364 : i32
          %2382 = spv.IAdd %2379, %2381 : i32
          %2383 = spv.constant 1 : i32
          %2384 = spv.IMul %2383, %44 : i32
          %2385 = spv.IAdd %2382, %2384 : i32
          %2386 = spv.AccessChain %45[%2378, %2385] : !spv.ptr<!spv.struct<!spv.array<128 x f32, stride=4>>, Workgroup>, i32, i32
          spv.Store "Workgroup" %2386, %2377 : f32
          %2387 = spv.IAdd %2364, %1 : i32
          spv.Branch ^bb1(%2387 : i32)
        ^bb3:  // pred: ^bb1
          spv._merge
        }
        spv.ControlBarrier "Workgroup", "Workgroup", "AcquireRelease"

I will continue to dig the performance gap.

This looks like super useful debug infrastructure. It would be great to integrate it into ModelBuilder :slight_smile:

Does the loop get unrolled by the driver? I wonder if that could help the driver schedule the high-latency loads better.

Sounds like we need a strength-reduction optimization pass at the SPIR-V dialect level. It shouldn't be too hard to do.
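
For illustration, here is an untested sketch of the kind of identity folding such a pass could start with: rewriting the spv.IMul by constant 1 that shows up in the address computation above (the spv.IAdd-with-zero case would look the same). The header path and the accessors (spirv::IMulOp, spirv::ConstantOp::value()) are assumptions from memory, so treat it as pseudocode:

#include "mlir/Dialect/SPIRV/SPIRVOps.h"  // assumed header location
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Fold spv.IMul of %x with constant 1 (either operand order) into plain %x.
struct FoldIMulByOne : public OpRewritePattern<spirv::IMulOp> {
  using OpRewritePattern<spirv::IMulOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(spirv::IMulOp op,
                                PatternRewriter &rewriter) const override {
    // Try both operand orders: (1 * x) and (x * 1).
    for (unsigned i = 0; i < 2; ++i) {
      Value maybeOne = op.getOperation()->getOperand(i);
      Value other = op.getOperation()->getOperand(1 - i);
      auto cst = dyn_cast_or_null<spirv::ConstantOp>(maybeOne.getDefiningOp());
      if (!cst)
        continue;
      auto intAttr = cst.value().dyn_cast<IntegerAttr>();
      if (intAttr && intAttr.getValue() == 1) {
        // x * 1 == x: forward the non-constant operand and drop the IMul.
        rewriter.replaceOp(op, other);
        return success();
      }
    }
    return failure();
  }
};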

About the Nvidia results:

It looks like the copy is done separately and is not part of the timing:

    checkCuda(cudaMemcpy(d_A,h_A,max_m_k_n * max_m_k_n * sizeof(float),cudaMemcpyHostToDevice));
    checkCuda(cudaMemcpy(d_B,h_B,max_m_k_n * max_m_k_n * sizeof(float),cudaMemcpyHostToDevice));
    checkCuda(cudaMemcpy(d_C,h_C,max_m_k_n * max_m_k_n * sizeof(float),cudaMemcpyHostToDevice));

I would assume 32x32x8 is not the optimal tile size. Walking through the tile sizes might get results closer to cuBLAS.

Thanks for all the updates!

Ha… Thanks =)
Sorry, I am still unable to send the patch because the management team has not signed the CLA, so it is easier for me to share some common code here. I don't mind if someone adopts it and sends the patch.

I have done many experiments and checked the assembly code carefully; our backend compiler can combine consecutive loads/stores into a single one. But I suspect it probably does something wrong in its pointer analysis. Look at this example:

scf.for %arg8 = %c0 to %c2 step %c1 {
  %60 = subview %18[%arg8, %19] [1, 1] [1, 1]  : memref<2x4xf32, #map1> to memref<1x1xf32, #map1>
  %61 = subview %10[%arg8, %19] [1, 1] [1, 1]  : memref<2x4xf32, #map2, 3> to memref<1x1xf32, #map3, 3>
  %62 = load %60[%c0, %c0] : memref<1x1xf32, #map1>
  store %62, %61[%c0, %c0] : memref<1x1xf32, #map3, 3>
}

Each thread loads a single f32 and stores it to local memory.
Our backend compiler generates a memory barrier for every load here. I guess it has a problem handling phi values, which makes it take conservative decisions. I have been working with our backend compiler engineers to investigate it.

I think the 50 ms gap mostly (95%, I guess) comes from here.

Because the assembly code (hand-written vs. IREE-generated) almost matches…

You are right! cudaMemcpy is a blocking call, so it's 5.7 ms vs. 8 ms, cool!!

Agreed!! Or maybe I should use a different strategy, because the current strategy is good for PowerVR ;P (I'll leave that to the NV guys…)

No worries, sure I can add this sometime soon.

Interesting. Would manually unrolling and re-ordering the loads/stores allow the driver to merge them? Emitting vector loads/stores directly should definitely be one of the short-term optimizations I'm planning to do. I already added passes for it for the cooperative-matrix case, and it shouldn't be too hard to support this case as well, so if this is the main problem, that's good :slight_smile:

%60 = subview %18[%c0, %19] [1, 1] [1, 1]  : memref<2x4xf32, #map1> to memref<1x1xf32, #map1>
%61 = subview %10[%c0, %19] [1, 1] [1, 1]  : memref<2x4xf32, #map2, 3> to memref<1x1xf32, #map3, 3>
%62 = subview %18[%c1, %19] [1, 1] [1, 1]  : memref<2x4xf32, #map1> to memref<1x1xf32, #map1>
%63 = subview %10[%c1, %19] [1, 1] [1, 1]  : memref<2x4xf32, #map2, 3> to memref<1x1xf32, #map3, 3>
%64 = load %60[%c0, %c0] : memref<1x1xf32, #map1>
%65 = load %62[%c0, %c0] : memref<1x1xf32, #map1>
store %64, %61[%c0, %c0] : memref<1x1xf32, #map3, 3>
store %65, %63[%c0, %c0] : memref<1x1xf32, #map3, 3>