Our system generates videos conditioned on a state embedding derived from past interaction videos. We show generated videos for the same task with and without conditioning on the state embedding. Conditioning on the state embedding helps the system generate videos that are more consistent with the underlying system parameters.
We visualize the distribution of underlying system parameters by applying t-SNE to state embeddings extracted from interaction videos. Colors indicate the ground-truth system parameters.
For tasks with discrete modes (e.g., Open Box, Turn Faucet), the clusters are well-separated. In contrast, the distribution is more complex for tasks with continuous modes (e.g., Slide Brick, Pick Bar), which may help explain why learning to identify system parameters for these tasks is more challenging and tends to require more data.
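The t-SNE visualization described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the embeddings and mode labels here are synthetic stand-ins for the state embeddings and ground-truth system parameters, and the dimensions (60 videos, 128-dim embeddings, 3 discrete modes) are arbitrary assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-ins for state embeddings extracted from interaction
# videos: 60 videos with 128-dim embeddings, drawn from 3 simulated
# discrete modes (analogous to tasks like Open Box or Turn Faucet).
modes = rng.integers(0, 3, size=60)            # stand-in ground-truth parameters
centers = rng.normal(size=(3, 128))            # one cluster center per mode
embeddings = centers[modes] + 0.1 * rng.normal(size=(60, 128))

# Project the embeddings to 2-D for visualization. Note that perplexity
# must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points = tsne.fit_transform(embeddings)        # shape: (60, 2)
print(points.shape)
```

The 2-D `points` can then be scatter-plotted with one color per mode; for tasks with well-separated discrete modes, the clusters typically remain visibly distinct in the projection.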