Hi all! Happy to be here! Excited to get to work with all of you!

An astute follower of the sparse compiler work may have noticed that there is a subtle difference between a non-annotated tensor and an all-dense annotated tensor.

```
#DenseMatrix = #sparse_tensor.encoding<{
  dimLevelType = [ "dense", "dense" ]
}>

tensor<100x100xf64>               // non-annotated
tensor<100x100xf64, #DenseMatrix> // all-dense annotated
```

Both tensors are assumed dense, but the former remains an n-dimensional memref that is passed to subsequent bufferization passes, whereas the latter is linearized into a one-dimensional memref that is further lowered into a storage scheme that is backed by the runtime support library.
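To make the distinction concrete, here is a small Python sketch (hypothetical helper names; the real storage scheme lives in the runtime support library) of the row-major linearization that the annotated case undergoes:

```python
def linearize(matrix):
    """Flatten a 2-D matrix row by row into the kind of
    one-dimensional buffer an all-dense annotated tensor uses."""
    return [v for row in matrix for v in row]

def element(buf, num_cols, i, j):
    """Recover element (i, j) from the linearized buffer."""
    return buf[i * num_cols + j]

buf = linearize([[1.0, 2.0], [3.0, 4.0]])
assert buf == [1.0, 2.0, 3.0, 4.0]
assert element(buf, 2, 1, 0) == 3.0
```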

Until now, the sparse compiler did not accept annotated types on the left-hand side of a linalg op. As a first step towards supporting sparse outputs, the revision D104074 relaxes this constraint to accept all-dense annotated outputs as well, as in the following example.

```
#trait_assign = {
  indexing_maps = [
    affine_map<(i,j) -> (i,j)>, // A
    affine_map<(i,j) -> (i,j)>  // X (out)
  ],
  iterator_types = ["parallel", "parallel"],
  doc = "X(i,j) = A(i,j)"
}

func @dense_output(%arga: tensor<?x?xf64, #SparseMatrix>,
                   %argx: tensor<?x?xf64, #DenseMatrix> {linalg.inplaceable = true})
    -> tensor<?x?xf64, #DenseMatrix> {
  %0 = linalg.generic #trait_assign
    ins(%arga: tensor<?x?xf64, #SparseMatrix>)
    outs(%argx: tensor<?x?xf64, #DenseMatrix>) {
    ^bb(%a: f64, %x: f64):
      linalg.yield %a : f64
  } -> tensor<?x?xf64, #DenseMatrix>
  return %0 : tensor<?x?xf64, #DenseMatrix>
}
```

This code will lower into an assignment to a one-dimensional array that is backed by the runtime support library. So, given an input sparse matrix

```
positions[1] : 0 2 4 5 7 9
indices[1]   : 0 3 1 4 2 0 3 1 4
values       : 1.0 1.4 2.0 2.5 3.0 4.1 4.0 5.2 5.0
```

the output will be stored as

```
[ 1, 0, 0, 1.4, 0, 0, 2, 0, 0, 2.5, 0, 0, 3, 0, 0, 4.1, 0, 0, 4, 0, 0, 5.2, 0, 0, 5 ]
```
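To see how that output comes about, here is a Python sketch (not the actual runtime library) of how the positions/indices/values arrays above expand into the linearized dense buffer:

```python
def csr_to_dense_linearized(positions, indices, values, num_rows, num_cols):
    """Expand a CSR-stored matrix into a 1-D row-major dense buffer."""
    out = [0.0] * (num_rows * num_cols)
    for i in range(num_rows):
        # positions[i]..positions[i+1] delimits the stored entries of row i
        for p in range(positions[i], positions[i + 1]):
            out[i * num_cols + indices[p]] = values[p]
    return out

positions = [0, 2, 4, 5, 7, 9]
indices   = [0, 3, 1, 4, 2, 0, 3, 1, 4]
values    = [1.0, 1.4, 2.0, 2.5, 3.0, 4.1, 4.0, 5.2, 5.0]

dense = csr_to_dense_linearized(positions, indices, values, 5, 5)
assert dense == [1.0, 0.0, 0.0, 1.4, 0.0,
                 0.0, 2.0, 0.0, 0.0, 2.5,
                 0.0, 0.0, 3.0, 0.0, 0.0,
                 4.1, 0.0, 0.0, 4.0, 0.0,
                 0.0, 5.2, 0.0, 0.0, 5.0]
```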

Note that this allows for a very simple change from default row-wise to column-wise storage of the two matrices by adding a dimOrdering to both the sparse and dense type annotations.

And now, as a next step towards supporting sparse outputs, a pending revision generalizes the MLIR sparse compiler to operations I called “simply dynamic” in my thesis [Bik96, Ch. 9] (in retrospect, probably somewhat poor grammar), which are tensor expressions with a sparse output tensor that changes its values but not its nonzero structure. As long as such output tensors appear in a conjunction for co-iteration, codegen is straightforward.

For instance, scaling a matrix as follows

```
#trait_scale = {
  indexing_maps = [
    affine_map<(i,j) -> (i,j)> // X (out)
  ],
  iterator_types = ["parallel", "parallel"],
  doc = "X(i,j) = X(i,j) * 2"
}

func @sparse_simply_dynamic1(%argx: tensor<32x16xf32, #DCSR>
                             {linalg.inplaceable = true}) -> tensor<32x16xf32, #DCSR> {
  %c = constant 2.0 : f32
  %0 = linalg.generic #trait_scale
    outs(%argx: tensor<32x16xf32, #DCSR>) {
    ^bb(%x: f32):
      %1 = mulf %x, %c : f32
      linalg.yield %1 : f32
  } -> tensor<32x16xf32, #DCSR>
  return %0 : tensor<32x16xf32, #DCSR>
}
```

trivially maps to sparse code, as follows.

```
%5 = memref.load %0[%c0] : memref<?xindex>
%6 = memref.load %0[%c1] : memref<?xindex>
scf.for %arg1 = %5 to %6 step %c1 {
  %8 = memref.load %2[%arg1] : memref<?xindex>
  %9 = addi %arg1, %c1 : index
  %10 = memref.load %2[%9] : memref<?xindex>
  scf.for %arg2 = %8 to %10 step %c1 {
    %11 = memref.load %4[%arg2] : memref<?xf32>
    %12 = mulf %11, %cst : f32
    memref.store %12, %4[%arg2] : memref<?xf32>
  }
}
```
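In other words (a Python sketch of the idea, not the generated code): since every stored value is updated and none are added or removed, the loops amount to an in-place update of the values array, with the pointers and indices arrays never touched:

```python
def scale_in_place(values, factor):
    """Scale every stored value of a sparse tensor in place;
    the nonzero structure (pointers/indices) is untouched."""
    for k in range(len(values)):
        values[k] *= factor

vals = [1.0, 2.0, 3.0]
scale_in_place(vals, 2.0)
assert vals == [2.0, 4.0, 6.0]
```

The generated code above walks the values per row via the pointers arrays, but the net effect is the same flat traversal.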

But the co-iteration can be arbitrarily complex, as shown below.

```
#trait_elt_wise_mult = {
  indexing_maps = [
    affine_map<(i,j) -> (i,j)>, // A
    affine_map<(i,j) -> (i,j)>  // X (out)
  ],
  iterator_types = ["parallel", "parallel"],
  doc = "X(i,j) = A(i,j) * X(i,j)"
}

func @sparse_simply_dynamic2(%arga: tensor<32x16xf32, #CSR>,
                             %argx: tensor<32x16xf32, #DCSR> {linalg.inplaceable = true})
    -> tensor<32x16xf32, #DCSR> {
  %0 = linalg.generic #trait_elt_wise_mult
    ins(%arga: tensor<32x16xf32, #CSR>)
    outs(%argx: tensor<32x16xf32, #DCSR>) {
    ^bb(%a: f32, %x: f32):
      %1 = mulf %x, %a : f32
      linalg.yield %1 : f32
  } -> tensor<32x16xf32, #DCSR>
  return %0 : tensor<32x16xf32, #DCSR>
}
```

which will result in more elaborate loops to implement the co-iteration (but still without “insertions” that change the nonzero structure).
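For illustration, here is a Python sketch of the per-row co-iteration (my own encoding of one CSR/DCSR row as sorted index and value lists, not the generated code). Since the kernel is a conjunction, entries of x with no matching entry in a simply become explicit zeros, so the nonzero structure of x never changes:

```python
def mul_row_in_place(x_idx, x_val, a_idx, a_val):
    """For one row: x[j] *= a[j] over the stored entries of x.
    Both index lists are sorted; walk them together."""
    ka, kx = 0, 0
    while kx < len(x_idx) and ka < len(a_idx):
        if a_idx[ka] < x_idx[kx]:
            ka += 1                  # A entry with no X entry: no-op
        elif a_idx[ka] > x_idx[kx]:
            x_val[kx] = 0.0          # A(i,j) is an implicit zero
            kx += 1
        else:
            x_val[kx] *= a_val[ka]   # both stored: multiply
            ka += 1
            kx += 1
    while kx < len(x_idx):           # remaining x entries hit zeros of A
        x_val[kx] = 0.0
        kx += 1

xi, xv = [0, 2, 5], [1.0, 2.0, 3.0]
mul_row_in_place(xi, xv, [2, 3, 5], [10.0, 20.0, 30.0])
assert xv == [0.0, 20.0, 90.0]
```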

This is very exciting to see! I am curious how structural changes will be handled in the next step.

In particular, how do you see indicating “missing” values in the `linalg.generic` block? For GraphBLAS use cases, we would prefer to treat missing values as distinct from zero values. That is to say, we don’t want the sparse tensor operations to automatically compress zeros out of the resulting tensor, but we do want to be able to indicate when a slot should be empty, possibly with a special `linalg.yield` variant?

Is this something that fits into your design vision?

Your question forces me to jump ahead a bit, since I just started the MLIR work for supporting sparse output tensors that change nonzero structure. But for the next step, I plan to use a very well-known sparse trick that goes by several different names (for example, phase scan [Gustavson71, Duff90], switch technique [Pissanetsky84], access pattern expansion [Bik96], workspaces [Kjolstad2019]).

The MT1 sparse compiler, for example, would take the following annotated FORTRAN source program.

```
C_SPARSE(ARRAY(A), SPARSE()(1,0))
C_SPARSE(ARRAY(B), SPARSE()(1,0))
      DO 2 J = 1, N
        DO 1 I = 1, M
          A(I,J) = A(I,J) + B(I,J)
  1     CONTINUE
  2   CONTINUE
```

and generate the following sparse code automatically. Here, the “switch” array and “expanded” array are assumed initialized to false and zero, respectively, only once, after which the arrays are only “scattered” and “gathered” for the nonzero elements, to keep the runtime complexity proportional to the number of nonzeros. In this case, the insert call is a smart library call that amortizes data movement required for insertions somehow. For the MLIR implementation, I plan to use something slightly different though.

```
L1:   DO J = <1>, <100>, 1
X12:    CALL SSCT__(VAL_A(<1>),IND_A(<1>),LOW_A(<J>),HGH_A(<J>),SAP_10(<1>),SWT_10(<1>))
L2:     DO I_ = LOW_B(<J>), HGH_B(<J>), 1
S14:      I = IND_B(I_)
C15:      IF (.NOT.SWT_10(<I>)) THEN
S16:        SWT_10(<I>) = .TRUE.
X17:        CALL SINS__(VAL_A(<1>),IND_A(<1>),LOW_A(<1>),HGH_A(<1>),J,100,700,LST_A,L_1,I)
          END IF
S3:       SAP_10(<I>) = SAP_10(<I>)+VAL_B(I_)
        ENDDO
X13:    CALL SGTH__(VAL_A(<1>),IND_A(<1>),LOW_A(<J>),HGH_A(<J>),SAP_10(<1>),SWT_10(<1>))
      ENDDO
```
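The gist of the technique can be sketched in Python for a single column of A(:,j) += B(:,j). The names are mine, and a plain sort stands in for the smart insert call; the point is that the workspace arrays are initialized only once and are scattered and gathered only for the nonzeros:

```python
def add_column_in_place(a_idx, a_val, b_idx, b_val, expanded, switch):
    """One column of A(:,j) += B(:,j) via a workspace.
    `expanded` is all zeros and `switch` all False on entry and exit."""
    # scatter the stored entries of A into the workspace
    for k in range(len(a_idx)):
        expanded[a_idx[k]] = a_val[k]
        switch[a_idx[k]] = True
    # accumulate B; a first touch of a new row index is an "insertion"
    for k in range(len(b_idx)):
        i = b_idx[k]
        if not switch[i]:
            switch[i] = True
            a_idx.append(i)   # stand-in for the smart insert call
        expanded[i] += b_val[k]
    # gather the result back and reset the workspace for the next column
    a_idx.sort()
    a_val[:] = [expanded[i] for i in a_idx]
    for i in a_idx:
        expanded[i] = 0.0
        switch[i] = False

a_idx, a_val = [1, 4], [10.0, 40.0]
add_column_in_place(a_idx, a_val, [1, 2], [1.0, 2.0],
                    [0.0] * 8, [False] * 8)
assert a_idx == [1, 2, 4] and a_val == [11.0, 2.0, 40.0]
```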

The revision D105306 generalizes the sparse compiler support from “just” multiplication and addition to include subtraction (and negation by viewing -y as 0-y). Other than the subtlety that the disjunction needs to split x-y into x-y, x, and -y (rather than just y for addition), the generalization is straightforward.

For example, “x(i) = a(i) - b(i)” with a and b sparse becomes the usual while-loop with three cases followed by two for-loops, with the subtlety of using -b(i) in the third if-branch and second for-loop.
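For illustration, a Python sketch of that loop structure (sorted index/value lists per vector; the names are mine, not from the revision):

```python
def sub(a_idx, a_val, b_idx, b_val):
    """x = a - b for two sparse vectors: a while-loop co-iterates
    while both have entries (three cases), then two for-loops
    finish off the tails."""
    x_idx, x_val = [], []
    ka, kb = 0, 0
    while ka < len(a_idx) and kb < len(b_idx):
        if a_idx[ka] == b_idx[kb]:      # overlap: a(i) - b(i)
            x_idx.append(a_idx[ka]); x_val.append(a_val[ka] - b_val[kb])
            ka += 1; kb += 1
        elif a_idx[ka] < b_idx[kb]:     # only a stored: a(i)
            x_idx.append(a_idx[ka]); x_val.append(a_val[ka])
            ka += 1
        else:                           # only b stored: -b(i), the subtlety
            x_idx.append(b_idx[kb]); x_val.append(-b_val[kb])
            kb += 1
    while ka < len(a_idx):              # remaining a(i)
        x_idx.append(a_idx[ka]); x_val.append(a_val[ka]); ka += 1
    while kb < len(b_idx):              # remaining -b(i)
        x_idx.append(b_idx[kb]); x_val.append(-b_val[kb]); kb += 1
    return x_idx, x_val

idx, val = sub([0, 2], [5.0, 3.0], [2, 7], [1.0, 4.0])
assert idx == [0, 2, 7] and val == [5.0, 2.0, -4.0]
```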

Next up is division, which is a very tricky operation, however, because there is no clear cancellation property for zeros under the semantics of this operation in either the floating-point or the integer case (e.g. you cannot simply assume 0/y is 0, since y may happen to be zero too, even when stored). The following table illustrates this issue (where the notation !x and x indicates an implicit zero and an explicit stored value, respectively).

```
x/y | !y  |  y  |
----+-----+-----+
 !x | 0/0 | 0/y |  FP:  0/0=NaN, c/0=Inf, 0/c=0 with c a true nonzero
  x | x/0 | x/y |  INT: x/0=exception, with x any value
```

So, this probably means that the full iteration space must be traversed.

Any insights in how to handle this better?

The revision D105731 makes a first step towards adding support for floating-point and integral divisions by only allowing x/c for cases where the compiler can prove that c is a true nonzero. This situation can conceptually be treated as an x*(1/c) conjunction as far as lattice construction is concerned. The codegen, of course, keeps the division to preserve precise semantics.

This implies that an operation like “x(i) = a(i) / 2” for sparse vector a and unsigned integer division maps to the following sparse code.

```
%0 = sparse_tensor.pointers %arg0, %c0
%1 = sparse_tensor.indices %arg0, %c0
%2 = sparse_tensor.values %arg0
%3 = memref.buffer_cast %arg1 : memref<32xi64>
%4 = memref.load %0[%c0] : memref<?xindex>
%5 = memref.load %0[%c1] : memref<?xindex>
scf.for %arg2 = %4 to %5 step %c1 {
  %7 = memref.load %1[%arg2] : memref<?xindex>
  %8 = memref.load %2[%arg2] : memref<?xi64>
  %9 = divi_unsigned %8, %c2_i64 : i64
  memref.store %9, %3[%7] : memref<32xi64>
}
```
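The generated loop corresponds to this Python sketch (toy names, for illustration): only the stored entries of the sparse vector are visited, each divided by the proven-nonzero constant and written into the dense output:

```python
def div_by_const(indices, values, c, out):
    """x(i) = a(i) / c for a sparse vector a: since c is a proven
    nonzero, only the stored entries need to be visited."""
    for k in range(len(indices)):
        out[indices[k]] = values[k] // c   # unsigned integer division

out = [0] * 8
div_by_const([1, 4, 6], [10, 7, 12], 2, out)
assert out == [0, 5, 0, 0, 3, 0, 6, 0]
```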

The revisions D105928 and D105851 add a few more operations, such as unary “zero preserving” functions (using the MT1 definition) like abs, ceil, and floor, and also binary bitwise operations.

Since sparse storage schemes differ substantially from their dense counterparts (even when generated by the compiler), initializing a sparse tensor for proper testing has been a bit of a pain point for some of our users. So far, the sparse tensor dialect supports the `new` operator, as in

`%a = sparse_tensor.new %fileName : !Filename to tensor<?x?xf64, #CSR>`

to materialize a sparse tensor from an external format (FROSTT or MatrixMarket format). And even though this `new` operator is slotted to become a “Swiss army knife” for initializing sparse tensors (e.g. using set-up code), so far the only supported implementation requires access to a file system.

But an upcoming revision will introduce a `cast` operator that is slotted to support all conversions between sparse and dense tensor types. When casting between two different sparse tensor types, only explicitly stored values are moved from one underlying sparse storage format to the other. When casting from an “unannotated” dense tensor type to a sparse tensor type, an explicit test for nonzero values is used. When casting to an “unannotated” dense tensor type, implicit zeroes in the sparse storage format are made explicit. Note that these casts may have non-trivial costs associated with them, since they may involve elaborate data structure transformations. Also, casts from sparse tensor types into dense tensor types may be infeasible in terms of storage requirements.

Despite the stated caveats, this operator will greatly simplify testing, since something like this will soon be possible:

```
// Initialize a "dense" tensor.
%0 = constant dense<[
  [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
  [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0],
  [0.0, 1.0, 1.0, 0.0, 0.0, 6.0, 0.0, 0.0],
  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 7.0, 1.0],
  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 8.0]
]> : tensor<8x8xf32>

// Cast dense tensor to sparse tensor.
%1 = sparse_tensor.cast %0 : tensor<8x8xf32> to tensor<8x8xf32, #CSR>
```

Could we call this something other than `cast`? Cast usually signifies a “metadata” change (at least to me), so the cost is not immediately obvious with the name `cast`. How about `convert`?

Although I am not opposed to a different name (and indeed, we seem to have a multitude of different `cast` operations in MLIR already), it does feel that this is truly a cast, since it changes the data from one type to another, which sometimes incurs a cost. But if others agree with you, `convert` seems a very acceptable compromise.

I tend to think “cheap” when I read `cast` in the IR, so I’m also in favor of a more explicit name.

On another angle, have you thought about some first-class support for these as constants / literals?

Okay, two votes in favor, so `sparse_tensor.convert` it is!

Very interesting idea! You mean something like

`%0 = constant sparse< what_to_put_here? > : tensor<8x8xf32, #CSR>`

with perhaps a coordinate scheme as contents? Definitely something to ponder over some more (in the example above that could even mean the cast could be done at compile-time if we had such first-class constants).

Yes that’s what I meant!

Revision D107205 introduced the new operation, but without an actual implementation. Revision D107681 introduces an actual lowering to LLVM IR for converting dense tensors to sparse tensors, which is the most useful for quickly setting up unit and integration tests. The implementation of the other conversions (sparse to dense and any sparse to any sparse) is still TBD.

Revision D107683 exploits the new feature by adding an elaborate integration test that combines all SparseTensor features to look “under the hood” of various sparse storage schemes. This is useful both to make sure the data structures are correct and to illustrate the contents for a sample matrix.

For example, the matrix

```
%t = constant dense<[
[ 1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0],
[ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[ 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[ 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0],
[ 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0],
[ 0.0, 7.0, 8.0, 0.0, 0.0, 0.0, 0.0, 9.0],
[ 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 11.0, 12.0],
[ 0.0, 13.0, 14.0, 0.0, 0.0, 0.0, 15.0, 16.0]
]> : tensor<8x8xf64>
```

becomes the following compact storage under DCSR:

```
// pointers(0) : ( 0, 7 )
// indices(0)  : ( 0, 2, 3, 4, 5, 6, 7 )
// pointers(1) : ( 0, 3, 4, 5, 6, 9, 12, 16 )
// indices(1)  : ( 0, 2, 7, 2, 3, 4, 1, 2, 7, 2, 6, 7, 1, 2, 6, 7 )
// values      : ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 )
```
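These five arrays can be reproduced with a small Python sketch (mine, for illustration only) that builds DCSR storage from a dense matrix:

```python
def to_dcsr(dense):
    """Build DCSR storage: both dimensions compressed, so empty
    rows (like row 1 of the sample matrix) disappear entirely."""
    ptr0, idx0, ptr1, idx1, vals = [0], [], [0], [], []
    for i, row in enumerate(dense):
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        if nz:                       # only rows with nonzeros are stored
            idx0.append(i)
            for j, v in nz:
                idx1.append(j)
                vals.append(v)
            ptr1.append(len(vals))
    ptr0.append(len(idx0))
    return ptr0, idx0, ptr1, idx1, vals

dense = [
    [1, 0, 2, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 5, 0, 0, 0, 0],
    [0, 0, 0, 0, 6, 0, 0, 0],
    [0, 7, 8, 0, 0, 0, 0, 9],
    [0, 0, 10, 0, 0, 0, 11, 12],
    [0, 13, 14, 0, 0, 0, 15, 16],
]
ptr0, idx0, ptr1, idx1, vals = to_dcsr(dense)
assert ptr0 == [0, 7]
assert idx0 == [0, 2, 3, 4, 5, 6, 7]
assert ptr1 == [0, 3, 4, 5, 6, 9, 12, 16]
assert idx1 == [0, 2, 7, 2, 3, 4, 1, 2, 7, 2, 6, 7, 1, 2, 6, 7]
assert vals == list(range(1, 17))
```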

Of course, our goal is that most users would never be interested in these details. That is what **sparse compilers** are for!

The first version of the testing methodology I was alluding to above is now available in the open source tree, as revision D107943. It exhaustively goes through all possible annotations of a single sparse tensor (actually a subset, just to keep runtime manageable) and builds, compiles, and runs the SpMM, testing for the correct outcome. As stated above, such tests are extremely useful for stress testing all dark corners of the sparse compiler as well as searching for the best possible annotation. The code given here should provide a convenient starting point for more serious (but also more costly) testing.

The revision D108721 wraps up a small series of revisions to implement sparse tensor to sparse tensor conversions fully (dense to sparse was already done, sparse to dense is still TBD). You can now simply say something like

```
%1 = sparse_tensor.convert %t
: tensor<?x?xf64, #CSC> to tensor<?x?xf64, #CSR>
```

to convert from CSC format to CSR format at run-time. Note that the current implementation avoids the requirement for a full *O(n^2)* conversion matrix of implementations (with *n* also pretty large for TACO-flavored sparse tensor formats) by simply using a coordinate scheme as intermediate, as proposed by [Chou]. That way, every format only needs to know how to map from and to such a coordinate scheme. With the “one size fits all” solution of MLIR, this basically consists of a single to and from method!

Note that obviously this does not always yield the most efficient solution, but over time we can improve individual conversions that need higher performance with specialized implementations (contributions welcome!).
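Sketched in Python (hypothetical helpers, not the MLIR runtime), the coordinate-scheme approach boils down to: every format knows how to enumerate itself as coordinates and how to rebuild itself from a coordinate list, shown here for CSR to CSC:

```python
def csr_to_coo(ptr, idx, val):
    """Enumerate a CSR matrix as (row, col, value) coordinates."""
    return [(i, idx[k], val[k])
            for i in range(len(ptr) - 1)
            for k in range(ptr[i], ptr[i + 1])]

def coo_to_csc(coo, num_cols):
    """Build CSC from coordinates by sorting column-major."""
    coo = sorted(coo, key=lambda t: (t[1], t[0]))
    ptr, idx, val = [0] * (num_cols + 1), [], []
    for i, j, v in coo:
        ptr[j + 1] += 1          # count entries per column...
        idx.append(i)
        val.append(v)
    for j in range(num_cols):
        ptr[j + 1] += ptr[j]     # ...then prefix-sum into pointers
    return ptr, idx, val

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR
coo = csr_to_coo([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
ptr, idx, val = coo_to_csc(coo, 3)
assert ptr == [0, 1, 2, 3]
assert idx == [0, 1, 0]
assert val == [1.0, 3.0, 2.0]
```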

Chou & Kjolstad have a paper on how to automatically generate code that converts between any two formats. Not sure if you’ve seen it.

Absolutely! I referred to [Chou] in my posting, but should have made it clickable like you did! Thanks for fixing this omission!

The sparse compiler is carefully generalizing from typical linalg ops such as matrix multiplication into more elaborate convolutions. Right now the sparse compiler can already deal with sparse filters! For example, given a typical edge detection kernel (see this fun tweet to see what it can do) such as

```
%filter = constant dense<[
[ 1, 0, -1 ],
[ 0, 0, 0 ],
[ -1, 0, 1 ]
]> : tensor<3x3xi32>
```

you can soon use the sparse compiler to exploit sparsity in the filter for convolutions as follows:

```
%sparse_filter = sparse_tensor.convert %filter
  : tensor<3x3xi32> to tensor<3x3xi32, #SparseKernel>
...
%0 = linalg.conv_2d
  ins (%input, %sparse_filter: tensor<8x8xi32>,
                               tensor<3x3xi32, #SparseKernel>)
  outs (%output: tensor<6x6xi32>) -> tensor<6x6xi32>
```
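To illustrate where the win comes from, here is a hedged Python sketch (my own toy encoding, not the generated code): the filter is stored as its nonzero taps only, so each output element touches 4 taps instead of 9 for the edge-detection kernel above:

```python
def conv2d_sparse_filter(inp, filt_coords):
    """2-D convolution where the filter is given as a list of
    (di, dj, weight) for its nonzero taps only."""
    n, m = len(inp), len(inp[0])
    fh = max(di for di, _, _ in filt_coords) + 1
    fw = max(dj for _, dj, _ in filt_coords) + 1
    out = [[0] * (m - fw + 1) for _ in range(n - fh + 1)]
    for i in range(n - fh + 1):
        for j in range(m - fw + 1):
            # only the stored taps are visited
            out[i][j] = sum(w * inp[i + di][j + dj]
                            for di, dj, w in filt_coords)
    return out

# nonzero taps of the 3x3 edge-detection filter above
edge = [(0, 0, 1), (0, 2, -1), (2, 0, -1), (2, 2, 1)]
inp = [[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]]
assert conv2d_sparse_filter(inp, edge) == [[0, 0], [0, 0]]  # constant input: no edges
```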

Of course, the bigger challenge will be to deal with sparse inputs (and outputs) as well!