Updates on MLIR based Dynamic Shape Compiler

Several months ago we started a discussion about an end-to-end compilation flow with full dynamic shape support, aiming to address the limitations of the current XLA on dynamic shape workloads. Here are the RFC and the related discussion from that time.

Here are some recent updates: the initial version of this dynamic shape compiler was released internally at Alibaba about two months ago, and has started to serve a few internal/external inference workloads on the GPU backend. Although there are still a lot of TODOs, the performance basically meets our expectations: it is close to the performance of XLA while producing only one compilation result for different shapes, and sometimes even exceeds it when XLA runs into trouble, for example with a While loop whose shapes differ across iterations.

Here is the performance result of an internal ASR inference model as an example (on this model XLA provides dramatic gains):

The pipeline is currently made up of about 30-40 passes in all from end to end, part of which (roughly <50%) is reused or inherited from the TF/MLIR community. The backbone passes are shown in this figure:

We also built a “runtime abstraction layer” that provides interfaces for different runtime scenarios: both execution within the TF runtime (like XLA) and as a standalone application.

Attached are the full logs of the ASR model after each pass. They might help a little with a rough understanding of these passes, although they are cluttered with a lot of other information. (Please cat the parts together into one tar.gz.)
asr.tar.gz.part_aa.txt (4 MB)
asr.tar.gz.part_ab.txt (4 MB)
asr.tar.gz.part_ac.txt (4 MB)
asr.tar.gz.part_ad.txt (2.9 MB)

We also found some interesting aspects regarding performance:

(1) We observed that “shape constraints” are very important for performance under dynamic shape semantics, i.e. knowing that dim_size_a is always equal to dim_size_b. Sometimes CSE can infer this information and sometimes it cannot, so it is very important to represent it explicitly in the IR; this benefits both the graph optimization passes (e.g. the fusion decisions) and the lower-level code generation. See more discussion on this thread.
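To illustrate the idea (a toy sketch, not the actual pass; all names here are made up), such dimension-equality constraints can be recorded in a union-find structure so that later passes can query them even when CSE cannot recover the equality:

```python
# Toy sketch of explicit shape-constraint tracking: record that two
# symbolic dimensions are always equal, and let later passes query it.
# Dimension keys are (value_name, dim_index) pairs; all illustrative.

class DimConstraints:
    def __init__(self):
        self.parent = {}

    def _find(self, d):
        self.parent.setdefault(d, d)
        while self.parent[d] != d:
            self.parent[d] = self.parent[self.parent[d]]  # path halving
            d = self.parent[d]
        return d

    def mark_equal(self, a, b):
        # Record the constraint dim_size_a == dim_size_b.
        self.parent[self._find(a)] = self._find(b)

    def is_equal(self, a, b):
        return self._find(a) == self._find(b)

constraints = DimConstraints()
constraints.mark_equal(("arg0", 0), ("arg1", 0))    # known from the model
constraints.mark_equal(("arg1", 0), ("matmul", 0))  # propagated through an op
# A fusion pass can now ask whether two dynamic dims are provably equal:
print(constraints.is_equal(("arg0", 0), ("matmul", 0)))  # True
```

Equality is only one kind of shape constraint, but even this simple form can justify fusion decisions that would otherwise be blocked by unknown dimensions.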

(2) Some basic optimizations that are straightforward in a static shape compiler require more work in a dynamic shape compiler, for example loop unrolling in XLA (vectorization) and the resolution of implicit broadcasts. One solution is to generate multiple versions of a kernel and select the proper one at runtime according to the runtime shapes.
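A minimal sketch of the multi-versioning idea (illustrative only; in the real compiler the versions come from codegen, not hand-written Python): compile a specialized and a generic variant, and pick one via a shape predicate at runtime:

```python
# Illustrative multi-versioned kernel dispatch: each version pairs a
# shape predicate with a kernel; the first matching version wins.

def kernel_unrolled(xs):
    # Specialized path: in a real compiler this would be the unrolled /
    # vectorized code, valid only when the length is a multiple of 4.
    return [x * 2.0 for x in xs]

def kernel_generic(xs):
    # Fallback path that handles any runtime shape.
    return [x * 2.0 for x in xs]

VERSIONS = [
    (lambda shape: shape[0] % 4 == 0, kernel_unrolled),
    (lambda shape: True, kernel_generic),  # always-applicable fallback
]

def dispatch(xs):
    shape = (len(xs),)  # the shape is only known at runtime
    for predicate, kernel in VERSIONS:
        if predicate(shape):
            return kernel(xs)

print(dispatch([1.0, 2.0, 3.0, 4.0]))  # takes the specialized version
```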

The op coverage is now roughly acceptable for inference workloads, while for training workloads more work is needed to cover the backward ops. Work on the CPU backend is also ongoing.

We are interested in pushing this code to the community and are now considering how. One headache is that the code merge will surely require a lot of work, both for us and for the reviewers. We are still investigating how to do this more efficiently, and I’m also opening this thread to collect suggestions. Please let me know what you think.


This is awesome! I’m very happy to see that you were able to drive this through.
I’m very interested in integrating this; if you’d like to submit some code to the TensorFlow repository I’ll happily help.

One of the questions I’d have about some of the passes is how much they can be abstracted to use interfaces and land upstream instead, but we can take this as we go.

Would you like to present again during one of the Thursday meetings?

This is a very exciting update, thank you.

Indeed :) You know I’m supportive in general, and I think there will be some shorter- and longer-term partitioning: breaking the code up into manageable chunks for review and testing, while also allowing for iteration as needed.

I agree with Mehdi that a great start would be presenting at an open design meeting. Then we could also discuss (perhaps in a smaller forum) how to reduce the headache of review & merge. The TF repo is a good initial step here, although there will probably be general parts that make more sense in core (and on the mlir-hlo side), and some may make sense there even initially, while it may be simpler for the initial commit and iteration to start with a single repository.


This looks really interesting! A couple of quick questions:

  1. What would it take to merge DHLO with MHLO and DLHLO with MLHLO?
  2. Is there any reason to use the Loops dialect instead of affine? (Btw, is loops the same as scf?) In the large number of vision and language translation models I have seen, I’m yet to come across a single TensorFlow op that can’t be expressed in the affine dialect. It’s also possible to freely mix *lhlo, scf/loops, and affine dialects since they all work on the same type system. So even if you are unable to express something in one of them, you should be able to freely mix (scf loops with affine loops, standard load/stores with affine load/stores), and the mix and conversions should just work out of the box. (For e.g. lhlo + affine + scf -> lhlo + scf.)

Hi ZhuKai,

This looks really interesting! I agree with Mehdi and Jacques that landing these in TensorFlow & MHLO is going to be useful.

In particular, @timshen, @jurahul and others are working on adding an LMHLO “backend” to XLA GPU; that could be a testing ground for some of the DLHLO based passes.

I have a couple of questions (happy to discuss at a later ODM presentation if email is too cumbersome).

  • What do you do for mismatched shapes (e.g. user is trying to add f32[5] with f32[3])?
  • What is your integration point with TensorFlow? Are you using auto-clustering?
  • I’m a little surprised that you’re fusing on DLHLO and not DHLO, I would have expected the latter to be simpler. Any reasons for this choice?
  • When you say “Now the op coverage is roughly acceptable for inference workloads”, do you mean that not all HLO operations are supported?
  • Have you compared the performance with a more recent version of TensorFlow and XLA on public benchmarks?

– Sanjoy

Well, first of all, congratulations to you and your team for reaching this milestone!

Thank you for sharing the intermediate IR. That was really helpful in quenching my curiosity. I still have loads of questions, and a presentation during a Thursday meeting would be great!

Here are some minor ones, to ensure I understand your approach right.

  • There are parts in the IR where you cast to specific shapes. Where do these shapes come from?
  • I assume if there is no xla_dhlo.device = "cpu" annotation, then the HLO operation runs on gpu?
  • Are the @dhlo_fusion_xla_lhlo_reshape_1_0 functions the operations that have been identified for fusion? So in essence, you outline ops that should be fused into their own functions?
  • I found some specializations, like @dhlo_fusion_xla_lhlo_reduce_2_0_bsize_128 for block sizes. Do you specialize for other reasons, and how is this driven?
  • Regarding DHLO, beyond d_broadcast_in_dim and d_reshape, any (D)HLO operations that were interesting to extend?

1. I’ve noticed that some of the ops in MHLO have taken dynamic shape semantics into consideration. I think DHLO should eventually be merged into MHLO, and in my current understanding this shouldn’t take too much effort; maybe some additions to the op definitions are needed.
2. Yes, the loop dialect was later renamed to scf. The reason we chose scf is that, for memory-intensive kernel codegen like XLA’s, I don’t think the additional features of the affine dialect help a lot. We want to explicitly control the codegen of the index calculations since they are important for performance, and with the plain definition of the scf dialect this seems more straightforward. Anyway, I’ll take a look at the recent code in the affine dialect and maybe my view will change :)
BTW, what’s the roadmap for scf and affine? Are they to be merged someday?

  • What do you do for mismatched shapes (e.g. user is trying to add f32[5] with f32[3] )?
    If the shape is unknown at compile time, we currently don’t do any check. I think we should also handle this via ‘shape constraints’, but currently we don’t do anything.
  • What is your integration point with TensorFlow? Are you using auto-clustering?
    Yes, the same as the mark-encapsulate-build passes, with a different supported-op list.
  • I’m a little surprised that you’re fusing on DLHLO and not DHLO, I would have expected the latter to be simpler. Any reasons for this choice?
    The same question arises as in XLA: when we fuse A with B, we must make sure that no cycles are formed.
    In MLIR, the HLO layer is currently a mix of the hlo dialect and the std dialect, where the latter is used to represent shape calculations. This makes it harder to tell whether A and B can be fused. For example, a std.DimOp (a shape edge) can also form a cycle, which is in fact not a problem. In the LHLO layer, such problems are much easier to deal with.
    However, as a future evolution we should be able to adopt the shape dialect to fully represent shape calculations and shape constraints at the tensor level. We might then be able to do fusion in the HLO layer.
  • When you say “Now the op coverage is roughly acceptable for inference workloads”, do you mean that not all HLO operations are supported?
    Here I mean TF op coverage. As for HLO ops, most of the HLO ops for codegen are already supported.
  • Have you compared the performance with a more recent version of TensorFlow and XLA on public benchmarks?
    I have not done that yet. Even compared with XLA on TF 1.15, there are still a lot of known performance TODOs, on both the kernel side and the host side.
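The cycle check mentioned in the fusion answer above can be sketched on a toy dependency graph as follows (hypothetical names, and the real pass additionally has to treat shape edges such as std.DimOp specially):

```python
# Toy version of the fusion legality check: fusing producer A into
# consumer B is unsafe if some other node lies on a path from A to B,
# because after fusion that indirect path would become a cycle.

def fusion_creates_cycle(graph, a, b):
    """graph maps each node to its list of successor nodes."""
    # DFS from A's successors other than B; if B is still reachable,
    # there is an indirect path A -> ... -> B through a third node.
    stack = [s for s in graph.get(a, []) if s != b]
    seen = set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

direct = {"A": ["B"]}                     # A feeds B directly: safe to fuse
indirect = {"A": ["B", "C"], "C": ["B"]}  # A -> C -> B: fusing A+B traps C
print(fusion_creates_cycle(direct, "A", "B"))    # False
print(fusion_creates_cycle(indirect, "A", "B"))  # True
```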

No, they are intended to be separate things. affine places additional restrictions on the form to make certain analyses and transformations easier. A conversion from affine to scf is always possible by design. So if your output can be represented in affine, I don’t see any reason to directly use scf. You only skip a conversion which is trivial in the whole scheme of things. With scf, you’d be stuck even if you want really simple optimizations around if/else or for: they’d be either severely limited or would require more code to do less powerful things.

  • There are parts in the IR where you cast to specific shapes. Where do these shapes come from?
    Can you provide a detailed example? In principle, if in the HLO layer a tensor is cast to a specific shape, we must be able to tell at compile time that this is legal.
  • I assume if there is no xla_dhlo.device = "cpu" annotation, then the HLO operation runs on gpu?
    Yes, gpu is the default, and it might be better to also mark gpu explicitly.
    In my understanding, there are no ‘official’ placement attributes in the MLIR TF/HLO dialects; please correct me if I’m out of date.
  • Are the @dhlo_fusion_xla_lhlo_reshape_1_0 functions the operations that have been identified for fusion? So in essence, you outline ops that should be fused into their own functions?
    Yes, this functions similarly to kFusion in XLA; after the kernel/launch codegen inside of it, the fusion function is inlined back.
  • I found some specializations, like @dhlo_fusion_xla_lhlo_reduce_2_0_bsize_128 for block sizes. Do specialize for other reasons and how is this driven?
    This is to generate multiple versions of a kernel and launch the proper one once the shape is known at runtime. A single codegen strategy may not be best for all shapes.
  • Regarding DHLO, beyond d_broadcast_in_dim and d_reshape, any (D)HLO operations that were interesting to extend?
    Not many: DSlice, DPad, DGather, DIota and a few others.

Thanks @bondhugula, that’s valuable information. I’ll look into the recent affine code for the details.

The ones I found involved arguments, like in:

  %443 = "xla_hlo.reshape"(%arg30) : (tensor<?x?xf32>) -> tensor<2304x384xf32>
  %444 = "xla_hlo.reshape"(%arg32) : (tensor<?x?xf32>) -> tensor<384x768xf32>
  %445 = "xla_hlo.reshape"(%arg34) : (tensor<?x?xf32>) -> tensor<384x768xf32>
  %446 = "xla_hlo.reshape"(%arg36) : (tensor<?x?xf32>) -> tensor<384x768xf32>
  %447 = "xla_hlo.reshape"(%arg38) : (tensor<?x?xf32>) -> tensor<768x384xf32>

At least for HLO, I am also not aware of existing placement attributes.

That comes from static shape inference in the TF dialect; those hlo nodes come from

  %117 = "tf.Reshape"(%arg30, %31) {T = f32, Tshape = i32, device = ""} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<2304x384xf32>
  %118 = "tf.Reshape"(%arg32, %25) {T = f32, Tshape = i32, device = ""} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<384x768xf32>

where %31, %25 are const nodes.
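As a toy illustration of that refinement step (a hypothetical helper, not the actual pass): when the shape operand of a reshape is a compile-time constant, the result type can be made fully static even though the input type stays dynamic:

```python
# Toy shape inference for a reshape whose target shape is constant:
# the output shape is just the constant, regardless of the dynamic input.
# None stands for an unknown ('?') dimension.

def infer_reshape_type(input_shape, target_shape_const):
    if target_shape_const is not None:
        return tuple(target_shape_const)
    # Without a constant target shape, the rank may be known but not the dims.
    return (None,) * len(input_shape)

# tensor<?x?xf32> reshaped with the constant shape [2304, 384]:
print(infer_reshape_type((None, None), (2304, 384)))  # (2304, 384)
```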

This is great work. Is it in a form where you can share the code? I would love to try it out in its current form to understand the different passes. Thanks.