Implicit State Estimation
via Video Replanning

anonymous authors
Code

Qualitative Results


To illustrate how our system adapts to test-time information, we present intermediate outputs of the framework, including the failed plan and interaction, the newly generated plans, the selected plan, and the final execution that improves task completion. Our system operates in three stages:
  • (a) It takes in the failed interaction, updates the state embedding, and generates future plans accordingly.
  • (b) It then selects the plan that deviates the most from the previously failed attempt.
  • (c) Finally, it executes the selected plan to successfully accomplish the task.

  • Task: Push Bar

    Failed Plan
    Failed Interaction
    New Plans
    Selected Plan
    Execution


    Task: Pick Bar

    Failed Plan
    Failed Interaction
    New Plans
    Selected Plan
    Execution


    Task: Turn Faucet

    Failed Plan
    Failed Interaction
    New Plans
    Selected Plan
    Execution


    Task: Open Box

    Failed Plan
    Failed Interaction
    New Plans
    Selected Plan
    Execution


    Video Generation with State Embedding


    Our system generates videos conditioned on the state embedding derived from past interaction video. We show the generated videos for the same task but with and without conditioning on the state embedding. The state embedding helps the system to generate videos that are more consistent with the underlying system parameters.


    Task: Push Bar

    Conditioned Video




    Plans without State Embedding




    Plans with State Embedding (Ours)


    Task: Open Box

    Conditioned Video




    Plans without State Embedding




    Plans with State Embedding (Ours)


    Task: Pick Bar

    Conditioned Video




    Plans without State Embedding




    Plans with State Embedding (Ours)


    Task: Turn Faucet

    Conditioned Video




    Plans without State Embedding




    Plans with State Embedding (Ours)


    t-SNE Visualization on System Parameters


    We visualize the distribution of underlying system parameters using t-SNE applied to state embeddings extracted from interaction videos. The colors indicate the ground-truth system parameters.

    For tasks with discrete modes (e.g., Open Box, Turn Faucet), the clusters are well-separated. In contrast, the distribution is more complex for tasks with continuous modes (e.g., Slide Brick, Pick Bar), which may help explain why learning to identify system parameters for these tasks is more challenging and tends to require more data.


    Open Box
    Turn Faucet
    Slide Brick
    Pick Bar