You can use a CNN:
- The input is then not 3 * w * h but (3 * number of images) * w * h, so you can just concatenate the images along the depth (channel) axis.
- The output is an image instead of a class, so either leave out the flattening step in between or add a reshape at the end.
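The two points above could be sketched roughly like this. This is only a minimal illustration, not a tuned architecture: `make_windows`, `build_model`, the window size `k`, and the layer sizes are all made-up names and values, and the sigmoid output assumes pixel values scaled to [0, 1]. Note that Keras defaults to channels-last, so the stacked input has shape (h, w, 3 * k) rather than (3 * k) * w * h.

```python
import numpy as np


def make_windows(frames, k):
    """Turn a frame sequence into (stacked input, next-frame target) pairs.

    frames: array of shape (T, h, w, 3); k: number of past frames per input.
    Returns x with shape (T - k, h, w, 3 * k) and y with shape (T - k, h, w, 3).
    """
    frames = np.asarray(frames, dtype="float32")
    if frames.shape[0] <= k:
        raise ValueError("need more than k frames")
    # Concatenate k consecutive frames along the channel (depth) axis.
    x = np.stack([
        np.concatenate(list(frames[i:i + k]), axis=-1)
        for i in range(len(frames) - k)
    ])
    # The target is the frame that comes right after each window.
    y = frames[k:]
    return x, y


def build_model(h, w, k):
    """Small fully convolutional net: no Flatten, so the output stays an image."""
    # Imported here so that make_windows only needs NumPy.
    from tensorflow import keras
    from tensorflow.keras import layers

    return keras.Sequential([
        keras.Input(shape=(h, w, 3 * k)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        # 3 output channels = one RGB image with the same width and height.
        layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
    ])


# Random data standing in for a real image series.
x, y = make_windows(np.random.rand(10, 32, 32, 3), k=4)
print(x.shape, y.shape)  # (6, 32, 32, 12) (6, 32, 32, 3)
```

From there, something like `build_model(32, 32, 4)` compiled with `optimizer="adam"` and `loss="mse"` and fitted on `x`, `y` would be a reasonable starting point. Because every layer uses `padding="same"`, the width and height are preserved and no reshape is needed.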
Have a look at Fully Convolutional Networks for Semantic Segmentation and Image-to-Image translation.
If you haven’t seen it already: Keras is pretty handy.
You might also be interested in the concept of optical flow, which estimates the per-pixel motion between consecutive frames.
solved How can I predict the next image from a series of images?