Due to the obfuscation that higher-level frameworks have layered on over the years, this is a somewhat thick subject, but it boils down to relatively simple math and structure. IREE aims to implement the mechanics and leave the patterns of use to higher-level frameworks.
Most of my experience with LSTMs in TensorFlow has actually not been with the official tf.keras APIs: I've generally not been impressed by how well they extend to advanced cases. However, I think what you are looking for is encoded in initial_state. While this is more of a high-level API question, whole-sequence processing is typically the default in such APIs, and you can also work incrementally by stitching states through from call to call. Again, this is not my favorite API for many reasons, including that in the production cases I've worked on, you want more explicit hooks for the zero-state initializers and the state0/state1 plumbing that becomes necessary in per-step mode. For advanced use cases there is a lot of state shuffling, and those ergonomics matter.
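To make the state-stitching idea concrete, here is a minimal NumPy sketch (not tf.keras itself; the `rnn_step` function is an illustrative stand-in for an LSTM cell, and all names are hypothetical). Running the whole sequence at once gives the same final state as running it in chunks, as long as each chunk's returned state is fed back in as the next chunk's initial state:

```python
import numpy as np

def rnn_step(x, h, W, U, b):
    # One recurrent step: new hidden state from input x and previous state h.
    # A real LSTM cell also carries a cell state; one tensor suffices here.
    return np.tanh(x @ W + h @ U + b)

def run_sequence(xs, h0, W, U, b):
    # Whole-sequence mode: fold the step function over every timestep,
    # starting from the given initial state.
    h = h0
    for x in xs:
        h = rnn_step(x, h, W, U, b)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
U = rng.normal(size=(8, 8))
b = np.zeros(8)
xs = rng.normal(size=(6, 4))   # a 6-step sequence of 4-dim inputs
h0 = np.zeros(8)               # zero-state initializer

# Incremental mode: process the stream in two chunks, stitching the
# returned state back in as the initial state of the next call.
h_mid = run_sequence(xs[:3], h0, W, U, b)
h_inc = run_sequence(xs[3:], h_mid, W, U, b)

# Matches processing the whole sequence in one call.
h_full = run_sequence(xs, h0, W, U, b)
assert np.allclose(h_full, h_inc)
```

The same equivalence is what lets a per-step production system produce identical results to a whole-sequence one; the caller just becomes responsible for carrying the state map between calls.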
Since IREE supports variables, you have two choices. You can stitch the states through externally as above, accepting and returning state maps from (say) your predict function. Or, as we sometimes do for online systems, you can create a tf.Module that is stateful: it holds the states in variables and exposes an entry point predict(inputs), where inputs may be a partial sequence and the states are managed internally to the Module by initializing and storing them in variables.
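A sketch of that second, stateful pattern, written as a plain Python class with NumPy rather than an actual tf.Module so it stays self-contained (in TensorFlow the members would be tf.Variables and predict would be a tf.function; all names here are illustrative):

```python
import numpy as np

class StatefulPredictor:
    # The recurrent state lives in member "variables" instead of being
    # passed in and out by the caller on every call.
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(input_dim, hidden_dim))
        self.U = rng.normal(size=(hidden_dim, hidden_dim))
        self.b = np.zeros(hidden_dim)
        # Zero-state initializer: the stored state starts at all zeros.
        self.h = np.zeros(hidden_dim)

    def reset(self):
        # Explicit hook to re-run the zero-state initializer between streams.
        self.h = np.zeros_like(self.h)

    def predict(self, inputs):
        # inputs may be a partial sequence; the state is read from and
        # written back to the internal variable, so successive calls
        # continue where the last one left off.
        for x in inputs:
            self.h = np.tanh(x @ self.W + self.h @ self.U + self.b)
        return self.h

rng = np.random.default_rng(1)
xs = rng.normal(size=(6, 4))

m = StatefulPredictor(input_dim=4, hidden_dim=8)
m.predict(xs[:3])           # first chunk of the stream
h_inc = m.predict(xs[3:])   # continues from the stored state

m.reset()
h_full = m.predict(xs)      # the same stream in one call
assert np.allclose(h_inc, h_full)
```

The caller's interface is just predict (plus a reset hook), and the state plumbing disappears into the module, which is often the nicer shape for online serving.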
It's not really an embedded-versus-not thing: many production systems, regardless of platform, need to operate incrementally on an indeterminate stream. In general, these are some of the more complicated implementations, and there are few simple examples (a lot depends on what you are trying to do). One of the folks on my team did open source kws_streaming with a number of worked end-to-end examples and a bit of a framework for putting such things together. Very little of it is IREE specific: IREE supports (or seeks to support) the low-level primitives needed to implement any of those schemes. It does not take the approach of some pre-existing op-by-op systems that attempt to encapsulate all of this into one monolithic "LSTM" or "RNN" layer (an extreme pessimization for a compiler, since these recurrent architectures actually leave a lot of room for the compiler to optimize when not sealed up as a black box).
Also, for a lot of high-performance applications it is actually advantageous to feed multiple samples at a time, even for small cases, since you can usually operate internally "layer by layer" and make better use of your memory and cache.
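A tiny NumPy sketch of why that helps (sizes and layer structure are illustrative): processing the whole batch layer by layer turns the work into one large matmul per layer, so each weight matrix is loaded once and reused across all samples, instead of being reloaded per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))
batch = rng.normal(size=(16, 4))   # 16 samples fed at once

# Layer-by-layer over the whole batch: one matmul per layer, with each
# weight matrix staying hot in cache across all 16 samples.
batched = np.maximum(batch @ W1, 0) @ W2

# Sample-by-sample: the same math, but both weight matrices are touched
# anew for every sample, wasting memory bandwidth.
per_sample = np.stack([np.maximum(x @ W1, 0) @ W2 for x in batch])

# Identical results; only the memory-access pattern differs.
assert np.allclose(batched, per_sample)
```

The numerical result is the same either way; the win is purely in utilization, which is exactly the kind of restructuring a compiler can do when the recurrence is not hidden inside a monolithic op.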