TF<->MLIR conversion

Hello,

I am playing with the conversion between Keras, TF, and MLIR in order to see what the generated code looks like (and possibly optimize it). The code generated for RNNs seems particularly bloated, and while looking into the causes of this, I stumbled onto a conversion problem that might explain it.

For some reason, there seems to be no way to execute a stateful model step by step (sample by sample); I can only execute one batch at a time. Thus, in order to get step-by-step stateful execution, I have to set stateful=True in the RNN definition and then provide each sample as part of a single-step input batch.

This is a very inefficient execution mechanism, which creates a useless extra loop. Do you happen to know if there’s a way to natively execute batches step by step?
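
To make this concrete, here is a minimal sketch of the workaround I am describing (the layer sizes and shapes are just placeholders):

import numpy as np
import tensorflow as tf

# Stateful RNN: the layer keeps its hidden state across calls,
# but every call still consumes a whole batch.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, stateful=True, batch_input_shape=(1, 1, 4)),
    tf.keras.layers.Dense(2),
])

samples = np.random.rand(100, 4).astype(np.float32)

# Step-by-step execution: wrap every sample in a batch of size 1
# with sequence length 1; this is the useless extra loop mentioned above.
for t in range(samples.shape[0]):
    y = model(samples[t].reshape(1, 1, 4))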

Best,
Dumitru

How exactly did you generate this code (from Keras)?

As in the module/the TF graph or the code at the backend?

How are you executing it? I’d assume it’s hooked up in the MLIR optimization registry there, following the example of the layout optimizer, and enabled in code.

Step-by-step execution seems like a Keras-specific question. I’m assuming there may be some batch configuration option there, but I mostly start way below that layer.

I am converting Keras code to frozen graphs (.pb/.pbtxt), then converting those to MLIR. After working on the MLIR, I can go back to .pb, which I can execute from inside Python. I thought this would be a good way of understanding how the basic dialects (tf/tf_executor) work (and how Keras generates code) before moving on to IREE or TFRT.
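
For reference, the forward half of that flow looks roughly like this with TF 2.x APIs (the model here is just a placeholder; I then edit the resulting MLIR and translate it back):

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2)

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])

# Freeze: capture the model as a single GraphDef with weights as constants.
fn = tf.function(model).get_concrete_function(
    tf.TensorSpec([1, 4], tf.float32))
frozen = convert_variables_to_constants_v2(fn)
graph_def = frozen.graph.as_graph_def()

# Import the frozen GraphDef into MLIR (tf_executor + tf dialects).
mlir_text = tf.mlir.experimental.convert_graph_def(
    graph_def, pass_pipeline='tf-standard-pipeline')
print(mlir_text)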

Indeed, it seems like a Keras- and TF-specific question, but since MLIR is mainly used as a back end for Keras and TF, it is important for people working on MLIR (if only to clarify how to obtain use cases). For instance, my objective is not Keras-specific: it is to understand how Keras generates code, and most notably how the tf_executor code wraps tf operations. BTW, any insight here would be helpful.

Your best bet is going to be reading the Keras code, especially if you are trying to untangle the RNN fiasco. Since you are operating in graph mode, each TF function called from the Keras Python code emits some chunk of GraphDef nodes. There is no reference for the internals other than the code.

The RNN code is particularly twisty, as it has been the target of too many tactical optimizations and special cases over the years – often with the same pattern: someone notices it is too slow in an op-by-op executor and either adds a special case, a manual “fusion”, or otherwise adds to it (when what it really needs is subtraction, and to do less).

If the tools all the way through the chain are working right then, at least for some set of dominant cases, most of that complexity should fold away in some form. Based on my personal experience, I wouldn’t claim this is robust. When I am doing research in TF, I avoid the Keras RNNs wherever possible, preferring less surprising, functional abstractions (see the sketch below) – but that is just my perspective as a user of the tools.
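
To illustrate what I mean by a functional abstraction, here is a sketch (not a recommendation of any particular API) where the recurrence is written out directly and the state is an explicit loop variable rather than something hidden inside a stateful layer:

import tensorflow as tf

cell = tf.keras.layers.LSTMCell(8)

@tf.function
def run_sequence(xs):  # xs: [time, batch, features]
    # The state is threaded explicitly through the loop; nothing is
    # hidden inside the layer between steps.
    state = cell.get_initial_state(batch_size=1, dtype=tf.float32)
    ys = tf.TensorArray(tf.float32, size=tf.shape(xs)[0])
    for t in tf.range(tf.shape(xs)[0]):
        y, state = cell(xs[t], states=state)
        ys = ys.write(t, y)
    return ys.stack()

out = run_sequence(tf.zeros([20, 1, 4]))  # 20 steps, batch 1, 4 features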

I’m confused - aren’t you using mlir-cpu-runner to execute the MLIR?

No. I actually don’t know how to lower tensorflow+tf_executor specifications down to the dialects of the MLIR distribution (e.g. going through XLA). I would appreciate any help in doing this (e.g. a compilation script).

This is why I take the MLIR code back to frozen graph.
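
Concretely, the execution side of my flow looks roughly like this: I translate the edited MLIR back to a frozen GraphDef (via tf-mlir-translate --mlir-to-graphdef) and then run it from Python. The tensor names below are placeholders that depend on the exported graph:

import tensorflow as tf

# Load the frozen GraphDef produced from the edited MLIR.
graph_def = tf.compat.v1.GraphDef()
with open('model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Wrap the imported graph so it can be called like a function in TF 2.x.
fn = tf.compat.v1.wrap_function(
    lambda: tf.import_graph_def(graph_def, name=''), [])
run = fn.prune(feeds='x:0', fetches='Identity:0')  # placeholder names
y = run(tf.zeros([1, 4]))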

This remark was enlightening; it explains a lot. One problem, of course, is that the hierarchical dataflow model embodied in tf_executor seems unusual, and different from other established dataflow models (SDF, BDF, synchronous languages). But the main problem for me is that compilation produces neither “init” and “step” functions that would allow the model to be called from various contexts where the I/O methods are implemented, nor a way to link against externally-provided I/O functions (e.g. a function that reads one frame from a webcam at each computation cycle, as opposed to passing all frames as a “batch” or encoding the state outside the TF model). I assume this is why the “session.run” method must run all recursion steps without interacting with the environment…
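
To clarify, here is the kind of interface I am after, sketched in Python; read_frame() and consume() stand for hypothetical, externally-implemented I/O functions:

import tensorflow as tf

# Hypothetical environment I/O, implemented outside the model.
def read_frame():  # e.g. grab one frame from a webcam
    return tf.zeros([1, 4])

def consume(y):    # e.g. display or log the result
    print(y.numpy())

cell = tf.keras.layers.LSTMCell(8)

def init():
    return cell.get_initial_state(batch_size=1, dtype=tf.float32)

@tf.function
def step(x, state):
    return cell(x, states=state)

# The reactive loop I would like compilation to expose: one I/O
# interaction per computation step, with the state threaded explicitly.
state = init()
for _ in range(10):
    y, state = step(read_frame(), state)
    consume(y)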

Which brings up another issue: now that I know I cannot use the standard Python flow for execution, how can I convert TF MLIR code (which I know how to write) into code that uses the standard dialects? For instance, how would I take the following code:

func @myfun(%x: tensor<f32>) -> (tensor<f32>) {
  %c = "tf.Const"() {value = dense<3.000000e+00> : tensor<f32>} : () -> tensor<f32>
  %0 = "tf.AddV2"(%c, %x) {device = ""} : (tensor<f32>, tensor<f32>) -> tensor<f32>
  return %0 : tensor<f32>
}

And produce MLIR code that works on memref values? More precisely, are there good pipeline definitions for tf-opt that enable CPU and GPU code generation?
Any help here would be very useful.

@bondhugula @qaco