[RFC] Proposal of View Dialect for view-style operations

1. Motivation

View-style operations are popular in deep learning algorithms, especially in domain-specific layers such as bounding-box and anchor manipulation in detection (e.g., an offset2bbox function) and agent-environment interaction in reinforcement learning. The output of a view-style operation shares the same physical memory as the input; the output is called a view of that physical memory. Typical examples are NumPy-style basic indexing operations and deep learning operations such as reshape, squeeze, and unsqueeze. In fact, in-place operations can be regarded as a special form of view behavior, whose input and output operands not only share the same physical memory but also have the same memory layout.

1.1 Related Work

However, mainstream deep learning compilers, such as TVM, Halide, TorchScript, XLA, and the dialects in MLIR, offer limited support for view-style operations. These systems take rigid tensor descriptions as input, and translating high-level programs (such as an offset2bbox function) into such descriptions requires nontrivial effort because the tensor description must be in SSA form. In this proposal, we first show that the traditional static-single-assignment compilation technique for scalars cannot handle tensor-level view operations, due to their complex memory-layout and read-write relationships. We design an intermediate representation that captures the layout and read-write relationships between multiple views of the same physical memory. Then, we present an efficient tensor-level SSA resolution algorithm that avoids redundant memory accesses while preserving correctness. Finally, we present the ViewDialect based on the LLVM MLIR compiler infrastructure and perform a thorough evaluation.

1.2 Motivational Example

We use the example in the following listing. The input is a 2-dimensional tensor. Let us walk through the function body. Lines 3 and 4 are two indexing operations on tensor A; tensors B and C do not allocate memory. In line 5, tensor D occupies a new memory location. Then, line 6 adds the constant 2 to C in place. Because C shares memory with A, the mutation affects A and may further influence B. In this case, B and C have different memory layouts on tensor A and do not overlap, so the update in line 6 does not influence B. Line 7 produces a new memory location for tensor E, which equals A + D, where A has been partially updated by line 6. We can observe that tensor-level multiple assignments are more complex than the scalar-level case because of the complex memory layouts and read-write sequences. This motivates us to propose the tensor-level SSA ViewDialect.

import torch
def view_example(A):
    B = A[1,:]   # line 3: a view of A; no new memory is allocated
    C = A[2,:]   # line 4: another view of A
    D = B + C    # line 5: allocates new memory for D
    C += 2       # line 6: in-place write through the view C, mutating A
    E = A + D    # line 7: reads the (partially updated) A into new memory E
    return E

2. Rationale

Each view has a tuple (dims, strides, offset) that expresses its position relative to the physical memory: dims is a vector storing the dimension sizes of the view, strides is a vector storing, for each dimension, the step length in physical memory between adjacent elements of the view, and offset is a scalar giving the offset of the first element of the view within the physical memory. Naturally, a tensor can also be regarded as a view of itself. Given an N-dimensional contiguous tensor with shape (S1, S2, …, SN), the default tuple can be calculated: dims = (S1, …, SN), the i-th stride is the product S_{i+1} × … × S_N, and the offset is 0. For more details, please refer to NumPy indexing.
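
As a small illustrative sketch (the helper name default_view_tuple is ours, not part of the dialect), the default tuple of a contiguous row-major tensor can be computed as follows:

def default_view_tuple(shape):
    # Default view tuple of a contiguous, row-major tensor:
    # dims = shape, stride_i = S_{i+1} * ... * S_N, offset = 0.
    strides, acc = [], 1
    for s in reversed(shape):
        strides.append(acc)
        acc *= s
    return list(shape), strides[::-1], 0

print(default_view_tuple([3, 4, 5]))  # ([3, 4, 5], [20, 5, 1], 0)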

In fact, multiple view operations can be composed. For example, if A is a tensor with shape (3,4,5), then B = A[1][2] is equivalent to C = A[1] followed by B = C[2]. Thus, the view tuple of B can be obtained in two steps: we first compute the view tuple of C, and from it we compute the view tuple of B. Next, we show how the view tuple is used. We again take tensor A with shape (3,4,5) and compute B = A[:, 1:4:2, :]. Applying the rules above to this indexing operation, the resulting view tuple is ([3,2,5], [20,10,1], 5).
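
As a sanity check (a small NumPy sketch, not part of the proposal), the same tuple can be read back from NumPy itself; note that NumPy reports strides in bytes, so we divide by the element size:

import numpy as np

A = np.arange(3 * 4 * 5, dtype=np.float32).reshape(3, 4, 5)
B = A[:, 1:4:2, :]
dims = B.shape                                       # (3, 2, 5)
strides = tuple(s // A.itemsize for s in B.strides)  # (20, 10, 1)
offset = (B.__array_interface__['data'][0]
          - A.__array_interface__['data'][0]) // A.itemsize
print(dims, strides, offset)  # (3, 2, 5) (20, 10, 1) 5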

In summary, the view attribute has the following three features.

  • The memory layouts of the input and output may differ, even though they share the same physical memory. The key point is the view tuple.
  • A single physical memory may have multiple views, and view operations can be composed. A view of physical memory can also derive a new view of itself, as shown in the following Figure (a). Essentially, all of them are views of the physical memory, organized like a disjoint-set data structure as shown in Figure (b).
  • A mutation through a single view is reflected in the physical memory and thus influences the other views, as shown in Figure (c) and illustrated by the snippet below.
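
The following small PyTorch snippet (the 3x3 shape is our own choice, purely for illustration) shows the third feature: a write through one view is visible in the physical memory and through any overlapping view.

import torch

A = torch.arange(9).reshape(3, 3)
row = A[1, :]          # view of row 1 of A
col = A[:, 1]          # view of column 1 of A; overlaps `row` at A[1, 1]
row += 10              # in-place write through `row` mutates the physical memory
print(A[1, 1].item())  # 14: the write is visible in A ...
print(col[1].item())   # 14: ... and through the overlapping view `col`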

3. TensorSSA

3.1 IR for View

We use a ViewNode data structure to record the memory layout information as follows:

  struct ViewNode {
    Tensor *PhysicalMem;          // pointer to the physical memory
    std::vector<uint64_t> Shape;  // view tuple - Shape (dims)
    std::vector<uint64_t> Stride; // view tuple - Stride
    uint64_t Offset;              // view tuple - Offset
  };

Then, we focus on the read-write relationships between multiple views and the physical memory. Naturally, we categorize the operations into two types: reads (lines 3 and 4 in the motivational example) and writes (line 6 in the motivational example).

  • Read. We design an operator Access as follows: A_access = Access(A, view_tuple)
  • Write. We design an operator Assign as follows: A_assign = Assign(A, B, view_tuple)

To preserve the static single assignment property, A_access and A_assign each produce a new memory location.
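
As a rough, non-normative sketch of these semantics in PyTorch (the helper names access and assign are ours), both operators materialize a fresh tensor, so every SSA value is defined exactly once:

import torch

def access(src, shape, stride, offset):
    # Read the view described by (shape, stride, offset) into new memory.
    return torch.as_strided(src, shape, stride, offset).clone()

def assign(src, value, shape, stride, offset):
    # Return a new tensor equal to `src`, with the region described by the
    # view tuple overwritten by `value`; `src` itself is left untouched.
    out = src.clone()
    torch.as_strided(out, shape, stride, offset).copy_(value)
    return out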

3.2 Lowering Algorithm

We have presented the IR designed for view operations. In this section, we introduce the lowering algorithm. The input is an operation list that contains view operations as well as other operations. The output is an operation list without any view attribute. The algorithm contains two main steps: the first step is an SSA converter, and the second step consists of two optimization passes. A hand-worked sketch of the conversion applied to the motivational example follows the list below.

  • Step 1. We visit the operations in the input operation list one by one. Depending on whether an operation has the view attribute, we choose different methods.

    • View Read Operation. We create a ViewTree when we encounter the first view operation; its root node is the source physical memory. Each time we visit a view read operation, we first add a node and the related edge to the ViewTree, and then add an access operation to the output operation list.
    • View Write Operation. A view write operation updates the physical memory and thereby the values of the related views. When we encounter a view write operation, we first add a view node and an edge to the view disjoint-set. We must then account for the influence of the in-place mutation on other views: we first perform a pass-up assign process by adding an assign operation to the source tensor, and then a pass-down access process that broadcasts the influence to all views by adding the related access operations.
    • Operation without view attribute. We update the operands to their latest versions and then add the operation to the output operation list.
  • Step 2. After the lowering process, the output operation sequence is in static single assignment form. In this step, we propose two optimizations.

    • Access/Assign Removal. In the pass-down access process, we add access operations for all other views. Using liveness analysis and memory-layout overlap analysis, we can perform dead-code elimination and remove unnecessary access operations.
    • Access/Assign Merge. We can merge an Access or Assign operation with its following use operation according to use-def chain analysis. This inlines the computation within the same loop nest.
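
To make the two steps concrete, here is a hand-worked sketch of the converted motivational example, using the access/assign helpers above and assuming, for illustration only, that A is a 4x4 tensor. The refreshed view B1 is exactly the kind of access the removal pass can later eliminate, since B and C do not overlap and B1 is never used:

def view_example_ssa(A0):
    B0 = access(A0, (4,), (1,), 4)           # B = A[1, :]
    C0 = access(A0, (4,), (1,), 8)           # C = A[2, :]
    D0 = B0 + C0                             # D = B + C
    A1 = assign(A0, C0 + 2, (4,), (1,), 8)   # C += 2: pass-up assign to the source
    B1 = access(A1, (4,), (1,), 4)           # pass-down refresh of B (dead, removable)
    C1 = access(A1, (4,), (1,), 8)           # pass-down refresh of C
    E0 = A1 + D0                             # E = A + D uses the latest version of A
    return E0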

4. Implementation of ViewDialect

4.1 Operation Definition Specification

In ViewDialect, we implement five operations, which can be divided into three groups as follows:

  • ViewOnTensorOp and CopyOnTensorOp. They represent the view syntax in the MLIR front-end as the input dialect. The Operation Definition Specification (ODS) of these two operations is as follows:
def view_ViewOnTensorOp : View_Op<"view"> {
  let arguments = (ins
    AnyTensor: $source,
    I64ArrayAttr: $shape,
    I64ArrayAttr: $stride,
    I64Attr: $offset
  );

  let results = (outs
    AnyTensor: $ref
  );
}
def view_CopyOnTensorOp : View_Op<"copy"> {
  let arguments = (ins
    AnyTensor: $from,
    AnyTensor: $to
  );
}
  • AccessOnTensorOp and AssignOnTensorOp. They represent the SSA operations proposed in Section 3.1.
def view_AccessOnTensorOp : View_Op<"access",
    [NoSideEffect]> {
  let arguments = (ins
    AnyTensor: $source,
    I64ArrayAttr: $shape,
    I64ArrayAttr: $stride,
    I64Attr: $offset
  );
  let results = (outs
    AnyTensor: $access
  );
}
def view_AssignOnTensorOp : View_Op<"assign",
    [NoSideEffect]> {
  let arguments = (ins
    AnyTensor: $source,
    AnyTensor: $assignee,
    I64ArrayAttr: $shape,
    I64ArrayAttr: $stride,
    I64Attr: $offset
  );

  let results = (outs
    AnyTensor: $result
  );
}
  • LinkOp. It represents the edges in the view disjoint-set.
def View_LinkOp : View_Op<"link"> {
  let arguments = (ins
    AnyTensor: $source,
    AnyTensor: $ref,
    I64ArrayAttr: $shape,
    I64ArrayAttr: $stride,
    I64Attr: $offset
  );
}

4.2 Lowering Pipeline

In ViewDialect, we implement the lowering algorithm from Section 3.2. First, ViewOnTensorOp and CopyOnTensorOp are converted to AccessOnTensorOp and AssignOnTensorOp, while LinkOp records a snapshot of the disjoint-set. After this lowering, the output operations are in SSA form, and we can run the fusion pass on them in linalg. Next, linalg operations are lowered to the scf dialect, scf is lowered to the MLIR NVVM dialect, and finally NVVM is translated to PTX or a cubin.

5. Evaluation

In this proposal, we use Python 3.6, NVIDIA GeForce GTX 1660 (6 GB) GPUs, and CUDA 10.1 as the evaluation environment. We implement a simple Python CUDA runtime wrapper to run the PTX code generated by MLIR. We compare TensorSSA implemented in MLIR with PyTorch (v1.7.0) and TorchScript (shipped with PyTorch), using representative operators from recent deep learning models. We find that ViewDialect delivers a consistent speedup over these state-of-the-art systems: 11.23x over PyTorch and 5.26x over TorchScript. Overall, the performance benefits come from two main sources. First, ViewDialect can convert multiple view ops into one access op with our algorithm, so we do not generate unnecessary memory-assignment statements. Second, we can fuse view ops and element-wise compute ops into a single kernel, since there is no implicit assignment after TensorSSA.

Because I am a new user, I can only embed one media item in this post. More details can be found in our paper: Overleaf, Online LaTeX Editor


Hi, thank you for the writeup. It sounds like you have converted some non-trivial programs: would you happen to have some that we can look at? I mostly followed the description, but the details here matter and are best seen.

One note in looking at your op-defs: are you sure that this is a proper use of the tensor type? (I.e., it is a value type.) When I’ve done similar things in the past, I’ve needed some form of mutable tensor type and conversions to bridge the worlds. I think that is also what the torch dialect is doing to model lowerings from pytorch (which exhibit these characteristics).

Specifically, the copy op takes two tensors as input and has no results, which I assume means that one argument is being copied into another’s memory? This is not really legal on tensor types. However, possibly as an encapsulated intermediate form that can’t escape, maybe it can be workable. Is this just an intermediate state?

We used the view dialect to bridge torch code and the linalg dialect. The torch code is as follows:

class Normalize(torch.nn.Module):
    def forward(self, src, mean, scale):
        # RGB to BGR
        dup = src.clone()
        dup[:,:,0] = src[:,:,2]
        dup[:,:,2] = src[:,:,0]
        return (dup - mean) * scale

The shapes of src, mean, and scale are [800, 1333, 3], [3], and [3], respectively.

The view and copy operations in the function are then converted to access and assign operations using our algorithm, which avoids implicit data dependencies in the program:

module  {
  func @main_graph(%arg0: tensor<800x1333x3xf32>, %arg1: tensor<3xf32>, %arg2: tensor<3xf32>) -> tensor<800x1333x3xf32> {
    %0 = "view.access"(%arg0) {offset = 2 : i64, shape = [800, 1333], stride = [3999, 3]} : (tensor<800x1333x3xf32>) -> tensor<800x1333xf32>
    %1 = "view.assign"(%0, %arg0) {offset = 0 : i64, shape = [800, 1333], stride = [3999, 3]} : (tensor<800x1333xf32>, tensor<800x1333x3xf32>) -> tensor<800x1333x3xf32>
    %2 = "view.access"(%arg0) {offset = 0 : i64, shape = [800, 1333], stride = [3999, 3]} : (tensor<800x1333x3xf32>) -> tensor<800x1333xf32>
    %3 = "view.assign"(%2, %1) {offset = 2 : i64, shape = [800, 1333], stride = [3999, 3]} : (tensor<800x1333xf32>, tensor<800x1333x3xf32>) -> tensor<800x1333x3xf32>
    %4 = linalg.init_tensor [800, 1333, 3] : tensor<800x1333x3xf32>
    %5 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%3, %arg1 : tensor<800x1333x3xf32>, tensor<3xf32>) outs(%4 : tensor<800x1333x3xf32>) {
    ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):  // no predecessors
      %8 = subf %arg3, %arg4 : f32
      linalg.yield %8 : f32
    } -> tensor<800x1333x3xf32>
    %6 = linalg.init_tensor [800, 1333, 3] : tensor<800x1333x3xf32>
    %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%5, %arg2 : tensor<800x1333x3xf32>, tensor<3xf32>) outs(%6 : tensor<800x1333x3xf32>) {
    ^bb0(%arg3: f32, %arg4: f32, %arg5: f32):  // no predecessors
      %8 = mulf %arg3, %arg4 : f32
      linalg.yield %8 : f32
    } -> tensor<800x1333x3xf32>
    return %7 : tensor<800x1333x3xf32>
  }
}

The access and assign ops can then be lowered to the linalg dialect and fused into one linalg.generic op:

module  {
  func @main_graph(%arg0: tensor<800x1333x3xf32>, %arg1: tensor<3xf32>, %arg2: tensor<3xf32>) -> tensor<800x1333x3xf32> {
    %c2 = constant 2 : index
    %c3 = constant 3 : index
    %c800 = constant 800 : index
    %c0 = constant 0 : index
    %c1333 = constant 1333 : index
    %0 = linalg.init_tensor [800, 1333, 3] : tensor<800x1333x3xf32>
    %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (((d0 * 3999 + d1 * 3 + d2 - 2) floordiv 3999 + (((d0 * 3999 + d1 * 3 + d2 - 2) mod 3999) floordiv 3) floordiv 1333) mod 800, (((d0 * 3999 + d1 * 3 + d2 - 2) mod 3999) floordiv 3) mod 1333, 0)>, affine_map<(d0, d1, d2) -> (((d0 * 3999 + d1 * 3 + d2) floordiv 3999 + (((d0 * 3999 + d1 * 3 + d2) mod 3999) floordiv 3) floordiv 1333) mod 800, (((d0 * 3999 + d1 * 3 + d2) mod 3999) floordiv 3) mod 1333, 2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d2)>, affine_map<(d0, d1, d2) -> (d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0, %arg0, %arg0, %arg1, %arg2 : tensor<800x1333x3xf32>, tensor<800x1333x3xf32>, tensor<800x1333x3xf32>, tensor<3xf32>, tensor<3xf32>) outs(%0 : tensor<800x1333x3xf32>) {
    ^bb0(%arg3: f32, %arg4: f32, %arg5: f32, %arg6: f32, %arg7: f32, %arg8: f32):  // no predecessors
      %2 = linalg.index 0 : index
      %3 = linalg.index 1 : index
      %4 = linalg.index 2 : index
      %5 = affine.apply affine_map<(d0, d1, d2) -> (d0 * 3999 + d1 * 3 + d2)>(%2, %3, %4)
      %6 = remi_unsigned %5, %c3 : index
      %7 = cmpi eq, %6, %c0 : index
      %8 = affine.apply affine_map<(d0, d1, d2) -> ((d0 * 3999 + d1 * 3 + d2) floordiv 3999)>(%2, %3, %4)
      %9 = cmpi sge, %8, %c0 : index
      %10 = cmpi slt, %8, %c800 : index
      %11 = affine.apply affine_map<(d0, d1, d2) -> (((d0 * 3999 + d1 * 3 + d2) mod 3999) floordiv 3)>(%2, %3, %4)
      %12 = cmpi sge, %11, %c0 : index
      %13 = cmpi slt, %11, %c1333 : index
      %14 = and %7, %9 : i1
      %15 = and %14, %10 : i1
      %16 = and %15, %12 : i1
      %17 = and %16, %13 : i1
      %18 = select %17, %arg4, %arg5 : f32
      %19 = affine.apply affine_map<(d0, d1, d2) -> (d0 * 3999 + d1 * 3 + d2)>(%2, %3, %4)
      %20 = subi %19, %c2 : index
      %21 = remi_unsigned %20, %c3 : index
      %22 = cmpi eq, %21, %c0 : index
      %23 = affine.apply affine_map<(d0, d1, d2) -> ((d0 * 3999 + d1 * 3 + d2 - 2) floordiv 3999)>(%2, %3, %4)
      %24 = cmpi sge, %23, %c0 : index
      %25 = cmpi slt, %23, %c800 : index
      %26 = affine.apply affine_map<(d0, d1, d2) -> ((d0 * 3999 + d1 * 3 + d2 - ((d0 * 3999 + d1 * 3 + d2 - 2) floordiv 3999) * 3999 - 2) floordiv 3)>(%2, %3, %4)
      %27 = cmpi sge, %26, %c0 : index
      %28 = cmpi slt, %26, %c1333 : index
      %29 = and %22, %24 : i1
      %30 = and %29, %25 : i1
      %31 = and %30, %27 : i1
      %32 = and %31, %28 : i1
      %33 = select %32, %arg3, %18 : f32
      %34 = subf %33, %arg6 : f32
      %35 = mulf %34, %arg7 : f32
      linalg.yield %35 : f32
    } -> tensor<800x1333x3xf32>
    return %1 : tensor<800x1333x3xf32>
  }
}

You are right. So we need to design a dialect to convert the pytorch/tensorflow code that is not in SSA form into MLIR input that is in SSA form.

I only did a quick skim but a question: is this only intended for static shapes (where the dimension sizes are constant), or do you plan to extend to dynamic shapes? IMO, it isn’t easy to extend to the dynamic case as an afterthought, so if the goal is to support those as well, it’s important to factor them into the initial design. A design that accounts for dynamic shapes will, I feel, add significant value, and given where MLIR is, I would often take a step back on designs that have only considered static shapes.

Thank you for your question. In essence, the SSA converter and dynamic/static shapes are two orthogonal directions. We intend to support dynamic shapes, and we are working on this.

I think you primarily need a “mutable tensor” type and some conversions. In torch-mlir, they just call this !torch.tensor, which is what everything in the program starts as (since that matches the semantics of pytorch). They then have a !torch.vtensor (value tensor) that they lower into where possible. In all generality, these SSA formation algorithms are fallible, so it is important for correctness to maintain the distinction: in the case that you can’t cancel out the accesses, you are still left with a legal program, even if there are a handful of trips through memory at various boundaries (which may still be recoverable, but might need a different kind of algorithm/more analysis/etc).

I haven’t checked in recently on how much of the slice ops they’ve implemented, but I know they have similar things on their roadmap as what you’ve done here. And as Uday mentions, that work was set up with full generality to ranked dynamic shapes from the start.

@_sean_silva @cathyzhyi

Thanks for the detailed proposal and examples. Reading through the proposal once, and still digesting it, but to me it looks like this is trying to introduce read-modify-write semantics on tensor types, which would be a far-reaching change. The motivating example of NumPy makes me believe that the right modeling for this kind of code is not at the tensor level, but rather at the memref level. Even the data structure you describe as a struct is essentially what memref is represented as in the LLVM lowering. So adding “view”-like semantics on tensor types seems like a layering violation to me.
If you want to use Linalg at this level, you can still lower to Linalg operations with memref operands (though I suspect there is nothing out-of-the-box you can use for this).

Could you also describe a bit of the motivation for landing this in MLIR core (given the layering-violation concerns I highlighted above)? If you have your own dialect, or, as others have mentioned, use the torch dialect, then everything is encompassed within that dialect, and lowering to Linalg on tensors can then account for the view-like behavior that NumPy allows by using the transformation you describe. This way, when lowering to Linalg on tensors, you have a pure tensor SSA representation of your program?

It seems to me that most of what is called “TensorSSA with view” here is akin to what can be modeled right now in MLIR with “constant memref”?
The thing with memref is that it does not have any notion of ownership or lifetime, which may be abstracted away here?

I think you primarily need a “mutable tensor” type and some conversions.

Do you mean this problem can be solved by torch.tensor with in-place ops, via IsTrailingUnderscoreInplaceVariant? I found the code in ReduceOpVariants: class ReduceTrailingUnderscoreInplaceVariant, which seems to create a new CopyToValueTensorOp to represent something like view_copy. It’s good, but I’m not sure how these separate ops fuse into one kernel, like what we saw in the View Dialect.

In my view, the View Dialect will convert the DAG into:

Then we can fuse all these ops into one kernel.

I haven’t checked in recently on how much of the slice ops they’ve implemented, but I know they have similar things on their roadmap as what you’ve done here. And as Uday mentions, that work was set up with full generality to ranked dynamic shapes from the start.

It’s interesting to check what torch-mlir can already do with this problem, so I gave it a try.

However, it seems that the clone op is not implemented yet:

error: unsupported by backend lowering: `torch.operator` op
    %1 = torch.operator "aten.clone"(%0, %none) : (!torch.tensor<[800,1333,3],f32>, !torch.none) -> !torch.tensor

And it causes a segmentation fault in TorchToLinalg:

0.	Program arguments: ./build/bin/torch-mlir-opt call_torch.mlir --torch-verify-invariants-before-backend-lowering --convert-torch-to-linalg
 #0 0x00000000018f9b73 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (./build/bin/torch-mlir-opt+0x18f9b73)
 #1 0x00000000018f798e llvm::sys::RunSignalHandlers() (./build/bin/torch-mlir-opt+0x18f798e)
 #2 0x00000000018fa026 SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f9726a2d390 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x11390)
 #4 0x00000000018055e3 mlir::Type::getContext() const (./build/bin/torch-mlir-opt+0x18055e3)
 #5 0x000000000178d6b1 mlir::RankedTensorType::get(llvm::ArrayRef<long>, mlir::Type, mlir::Attribute) (./build/bin/torch-mlir-opt+0x178d6b1)
 #6 0x0000000000d82711 mlir::tensor::ExtractSliceOp::inferResultType(mlir::RankedTensorType, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>) (./build/bin/torch-mlir-opt+0xd82711)
 #7 0x0000000000d82ee5 mlir::tensor::ExtractSliceOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::RankedTensorType, mlir::Value, llvm::ArrayRef<mlir::OpFoldResult>, llvm::ArrayRef<mlir::OpFoldResult>, llvm::ArrayRef<mlir::OpFoldResult>, llvm::ArrayRef<mlir::NamedAttribute>) (./build/bin/torch-mlir-opt+0xd82ee5)
 #8 0x0000000000d83896 mlir::tensor::ExtractSliceOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::RankedTensorType, mlir::Value, mlir::ValueRange, mlir::ValueRange, mlir::ValueRange, llvm::ArrayRef<mlir::NamedAttribute>) (./build/bin/torch-mlir-opt+0xd83896)
 #9 0x00000000010d0da8 mlir::tensor::ExtractSliceOp mlir::OpBuilder::create<mlir::tensor::ExtractSliceOp, mlir::Value&, llvm::SmallVector<mlir::Value, 6u>&, llvm::SmallVector<mlir::Value, 6u>&, llvm::SmallVector<mlir::Value, 6u>&>(mlir::Location, mlir::Value&, llvm::SmallVector<mlir::Value, 6u>&, llvm::SmallVector<mlir::Value, 6u>&, llvm::SmallVector<mlir::Value, 6u>&) (./build/bin/torch-mlir-opt+0x10d0da8)
#10 0x00000000010d07e9 (anonymous namespace)::ConvertAtenSliceTensorOp::matchAndRewrite(mlir::torch::Torch::AtenSliceTensorOp, mlir::torch::Torch::AtenSliceTensorOpAdaptor, mlir::ConversionPatternRewriter&) const TorchToLinalg.cpp:0:0
#11 0x00000000010cfc53 mlir::OpConversionPattern<mlir::torch::Torch::AtenSliceTensorOp>::matchAndRewrite(mlir::Operation*, llvm::ArrayRef<mlir::Value>, mlir::ConversionPatternRewriter&) const (./build/bin/torch-mlir-opt+0x10cfc53)
#12 0x00000000014d1392 mlir::ConversionPattern::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&) const (./build/bin/torch-mlir-opt+0x14d1392)
#13 0x00000000016f4834 mlir::PatternApplicator::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&, llvm::function_ref<bool (mlir::Pattern const&)>, llvm::function_ref<void (mlir::Pattern const&)>, llvm::function_ref<mlir::LogicalResult (mlir::Pattern const&)>) (./build/bin/torch-mlir-opt+0x16f4834)

(the person who knows the most about where this is at and going is out until Monday - fyi)

torch-mlir’s way of solving this is as Stella mentioned here: [RFC] Proposal of View Dialect for view-style operations - #8 by stellaraccident. This is achieved by both ReduceOpVariants and MaximizeValueSemantics. However, the current implementation is not a full SSA formation algorithm, so it’s very limited. Basically, it can only handle basic blocks, and non-value tensors can only be passed to view-like ops that don’t modify the underlying memory. For the given example, the crash you are seeing is due to these limitations: the non-value tensor is not converted. You should be able to find a “can only handle these transitive user ops” error message in the debug trace during MaximizeValueSemantics. A more sophisticated analysis, like an SSA formation analysis, should be able to solve this one correctly.


Oh, I got it! Thank you!

Thank you for your questions and replies.

Thank you for your reply about the torch-mlir approach. Do you know torch-mlir’s plan for solving the view problem? In fact, the primary motivation of this proposal is to fill that gap.

We use the view disjoint-set to model the ownership and lifetime information.

Thank you for your suggestion to address the SSA form problem at the memref level. However, it seems that neither the TOSA dialect nor torch-mlir, as the input interfaces to MLIR, can express the view syntax correctly, so the problem cannot even be lowered to the memref level. Moreover, as the bridge from deep learning frameworks to MLIR, these views come from torch.view and torch.copy. We need a view dialect to act as a supplement to the TOSA dialect or torch-mlir to express view syntax.

Oh, your views aren’t read-only; I missed this when I skimmed through. Then I don’t understand why this isn’t just memref?

Since your views are akin to memref: aren’t you just implementing a memref->tensor transformation to target TOSA?
memref as-is may not exactly match what is needed here, but we may not be far off: at least it matches better than tensor (your view really can’t be using the tensor type: it is immutable).
I’d be interested in a clear description of the difference between memref and the type you need here.

Yeah, +1.

Also some notes on terminology: in torch-mlir, the tensor type is mutable/memref-like (matching the terminology of torch) and the vtensor type is a variant that has value semantics (not a torch concept but a stepping stone to legalize torch programs). A naked MLIR tensor or memref is not a direct match for representing a torch graph prior to some transformations: the torch-centric types model the semantics of torch, and we lower into MLIR types.

There may be a missing interface here for bridging the worlds (this also ties into the discussion of the ShapedType hierarchy becoming an interface).

My point is: I think you have roughly what you need at the torch-mlir level to represent this today, and if that is a main use case, solving it there, in the concepts of its types and ops, may be a better starting point (versus trying to directly generalize to new MLIR core abstractions). As Yi says, what is there now is just to get started, and I expect they would be open to collaboration to enhance it.