RNNs in IREE - how is I/O performed?

Hello,

I’m interested in understanding how RNNs are implemented in IREE.
Given that IREE targets embedded deployment, I assume (maybe wrongly)
that the RNN’s time recurrence is implemented as a loop that reads one
input sample at a time and updates the state.

In basic TensorFlow (the tf_executor dialect, and examples such as
the one in TensorFlow 2 Tutorial: Get Started in Deep Learning With tf.keras), it’s different. If you look at the MLIR for the TF model, you can see that the whole time history arrives as a single tensor. This is of course fine for training.

However, I assume an embedded implementation would need to receive
data one sample at a time and perform computations incrementally, rather
than wait until all the data has arrived for one LSTM run.

What I’d like to understand:

  • how this is implemented in IREE, ideally with an example.
  • if possible, how TF RNN models (e.g. LSTM) are converted to IREE.

I could only find one example in the IREE source tree: unidirectional_lstm.mlir. However, in this example the input seems to arrive as in basic TF (the whole time history at once, as a single tensor).

Best regards,
Dumitru

Because of the obfuscation that higher-level frameworks have added to the topic over the years, this is a somewhat thick subject, but it boils down to relatively simple math and structure. IREE aims to implement the actual mechanics and leave the manner of use to higher-level frameworks.

Most of my experience with LSTMs in TensorFlow has not actually been with the official tf.keras APIs: I’ve generally not been inspired by how well they extend to advanced cases. However, I think what you are looking for is the return_state constructor argument of tf.keras.layers.LSTM together with the initial_state argument passed when calling the layer. While this is more of a high-level API question, you typically find that whole-sequence processing is the default in such APIs, and that you can also work incrementally by stitching the states through from one call to the next. Again, this is not my favorite API for many reasons, including that in the production cases I’ve typically worked on, you have more explicit hooks for the zero-state initializers and the state0/state1 bits that become necessary to work in per-step mode. For advanced use cases there is a lot of state shuffling, and those ergonomics matter.
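
As a framework-free illustration of stitching states through, here is a minimal sketch in plain Python (no TensorFlow or IREE APIs involved). It assumes a toy single-layer tanh RNN with made-up weights rather than an LSTM (which carries two state tensors); W_X, W_H, B, cell and run are all hypothetical names. The initial_state argument and the returned final state of run loosely mirror the Keras initial_state / return_state pattern, and the check confirms that feeding one sample at a time while threading the state gives the same outputs as processing the whole time history at once.

```python
import math

# Hypothetical toy weights: 2 hidden units, 1-dimensional input.
W_X = [[0.5], [-0.3]]
W_H = [[0.1, 0.2], [0.0, 0.4]]
B = [0.05, -0.1]

def cell(x, h):
    # One time step of a simple tanh RNN: h' = tanh(W_X x + W_H h + B).
    return [
        math.tanh(sum(w * xi for w, xi in zip(W_X[i], x))
                  + sum(w * hj for w, hj in zip(W_H[i], h))
                  + B[i])
        for i in range(len(B))
    ]

def run(inputs, initial_state=None):
    # Process a (possibly partial) sequence; return per-step outputs and
    # the final state so the caller can feed it back in on the next call.
    h = [0.0] * len(B) if initial_state is None else initial_state
    outputs = []
    for x in inputs:
        h = cell(x, h)
        outputs.append(h)
    return outputs, h

sequence = [[1.0], [0.5], [-0.2], [0.8], [0.0]]

# Whole time history at once, as in the training-style graph.
full_out, full_state = run(sequence)

# One sample at a time, stitching the state through between calls.
state = None
streamed_out = []
for x in sequence:
    out, state = run([x], initial_state=state)
    streamed_out.extend(out)

print(streamed_out == full_out and state == full_state)  # → True
```

The same idea carries over to an LSTM: the state just becomes a pair (hidden and cell state) that is returned and passed back in.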

Since IREE supports variables, you have two choices. You can stitch the states through externally as described above, accepting and returning state maps from (say) your predict function. Or, as we sometimes do for online systems, you can create a stateful tf.Module that holds the states in variables and exposes the entry points reset() and predict(inputs), where inputs may be a partial sequence; the Module initializes the state variables and stores the updated states back into them internally.
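
The stateful-module option can be sketched in plain Python as an analogy only. The class name StreamingRNN and its weights are hypothetical; in an actual tf.Module the hidden state would live in tf.Variable members updated inside the exported functions, not in a Python attribute as here. The point is the interface: the caller never sees the state, only reset() and predict(inputs) with partial sequences.

```python
import math

class StreamingRNN:
    """Hypothetical stateful wrapper around a toy tanh RNN.

    The recurrent state lives inside the object, so callers only see
    reset() and predict(inputs), where inputs may be a partial sequence.
    """

    def __init__(self, w_x, w_h, b):
        self.w_x, self.w_h, self.b = w_x, w_h, b
        self.reset()

    def reset(self):
        # Zero-state initializer: call at the start of each new stream.
        self.h = [0.0] * len(self.b)

    def _step(self, x):
        # h' = tanh(W_X x + W_H h + B); the comprehension reads the old
        # state before the new one is stored.
        self.h = [
            math.tanh(sum(w * xi for w, xi in zip(self.w_x[i], x))
                      + sum(w * hj for w, hj in zip(self.w_h[i], self.h))
                      + self.b[i])
            for i in range(len(self.b))
        ]
        return self.h

    def predict(self, inputs):
        # Consume any number of samples, continuing from the stored state.
        return [self._step(x) for x in inputs]


W_X = [[0.5], [-0.3]]
W_H = [[0.1, 0.2], [0.0, 0.4]]
B = [0.05, -0.1]
sequence = [[1.0], [0.5], [-0.2], [0.8], [0.0]]

model = StreamingRNN(W_X, W_H, B)
chunk_a = model.predict(sequence[:2])   # first partial sequence
chunk_b = model.predict(sequence[2:])   # continues from the stored state
model.reset()
whole = model.predict(sequence)         # fresh stream, all samples at once
print(chunk_a + chunk_b == whole)       # → True
```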

It’s not really an embedded vs. non-embedded thing: many production systems, regardless of platform, need to operate incrementally on a stream of indeterminate length. These are generally among the more complicated implementations, and there are few simple examples (a lot depends on what you are trying to do). One of the folks on my team open sourced kws_streaming, with a number of worked end-to-end examples and a small framework for putting such things together. Very little of it is IREE specific: IREE supports (or seeks to support) the low-level primitives needed to implement any of those schemes. It does not take the approach of some pre-existing op-by-op systems that try to encapsulate all of this in one monolithic “LSTM” or “RNN” layer. That is an extreme pessimizer for a compiler, since these recurrent architectures leave a lot of room for the compiler to optimize if they are not sealed up as a black box.

Also, for many high-performance applications it is actually advantageous to feed multiple samples at a time, even for small cases, since you can usually operate internally “layer by layer” and make better use of memory and cache.
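
One way to read “layer by layer” is sketched below, again using a hypothetical toy tanh RNN as a stand-in. When a chunk of samples is available, the input projection W_X·x + B does not depend on the state, so it can be computed for the whole chunk in one pass (a single matrix multiply in a real implementation), leaving only the W_H·h term inside the sequential loop. This is a common restructuring offered as an illustration, not a description of what IREE’s compiler actually does; matvec, per_step and chunked are illustrative names.

```python
import math

# Hypothetical toy weights: 2 hidden units, 1-dimensional input.
W_X = [[0.5], [-0.3]]
W_H = [[0.1, 0.2], [0.0, 0.4]]
B = [0.05, -0.1]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def per_step(inputs, h):
    # Everything happens inside the time loop, one sample at a time.
    outs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(W_X, x), B)]
        h = [math.tanh(p + r) for p, r in zip(pre, matvec(W_H, h))]
        outs.append(h)
    return outs

def chunked(inputs, h):
    # Phase 1: input projection for every sample in the chunk at once
    # (no dependence on h, so this is one batched matmul in practice).
    pre_all = [[a + b for a, b in zip(matvec(W_X, x), B)] for x in inputs]
    # Phase 2: only the recurrent term stays in the sequential loop.
    outs = []
    for pre in pre_all:
        h = [math.tanh(p + r) for p, r in zip(pre, matvec(W_H, h))]
        outs.append(h)
    return outs

chunk = [[1.0], [0.5], [-0.2], [0.8]]
h0 = [0.0, 0.0]
ref = per_step(chunk, h0)
out = chunked(chunk, h0)
print(all(math.isclose(a, b) for u, v in zip(ref, out) for a, b in zip(u, v)))  # → True
```

Both functions compute the same values; the difference is that phase 1 of chunked touches all inputs in one sweep, which is where batching several samples pays off.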


In terms of concrete examples in IREE, you’ll find some in our TF integrations: in particular, the kws_streaming example Stella referenced and tests for various Keras layers, including LSTM.

Side note: the example you found is just a really early model I got working back before we had anything other than ad-hoc frontend integrations, which is why it’s one of the few checked in as a .mlir file. As Stella said, a lot of high-level APIs assume you’re operating on training batches, so that’s the form I had and exported. I think it was the second “model” we got running on IREE 😃 We still check in a few .mlir files for testing, since models defined in the frontend integrations require building those integrations, but we’ll be reworking our testing infrastructure next quarter, which should hopefully enable a better story there, such as artifacts and conformance test suites that don’t need to be checked in as source files 😃
