StageCraft: Execution Aware Mitigation of Distractor and Obstruction Failures in VLA Models

Anonymous Submission
StageCraft Overview

Abstract

Large-scale pretraining on text and image data, along with diverse robot demonstrations, has helped Vision-Language-Action models (VLAs) generalize to novel tasks, objects and scenes. However, these models are still susceptible to failure in the presence of execution-time impediments such as distractors and physical obstructions in the robot's workspace. Existing policy improvement methods finetune base VLAs to improve generalization, yet they still struggle in unseen distractor settings. To address this problem, we investigate whether internet-scale pretraining of large vision-language models (VLMs) can be leveraged to reason about these impediments and mitigate policy failures. To this end, we propose StageCraft, a training-free approach to improve pretrained VLA policy performance by manipulating the environment's initial state using VLM-based in-context reasoning. StageCraft takes policy rollout videos and success labels as input and leverages the VLM's reasoning ability to infer which objects in the initial state need to be manipulated to avoid anticipated execution failures. StageCraft is an extensible plug-and-play module that does not introduce additional constraints on the underlying policy and requires only a few policy rollouts to work. We evaluate the performance of state-of-the-art VLA models with StageCraft and show an absolute 40% performance improvement across three real-world task domains involving diverse distractors and obstructions. Our simulation experiments in RLBench empirically show that StageCraft tailors the extent of its intervention to the strength of the underlying policy and improves its performance with more in-context samples.

Main Video

Method

StageCraft analyzes past rollout successes and failures using a vision-language model to identify distractor objects. It then removes the minimal set of failure-inducing objects via primitive actions before executing the policy.

1. Observe policy behavior: StageCraft collects rollout episodes of a pretrained VLA policy under different distractor configurations and records their success or failure.

2. Reason about failure sources: A vision-language model analyzes these rollout examples to identify objects in the scene that are likely responsible for policy failures.

3. Prepare the environment before execution: The robot removes the minimal set of predicted distractors using primitive pick-and-place actions, after which the VLA policy is executed in the modified environment to improve task success.
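
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` record and the functions `suspect_objects` and `stage_and_run` are hypothetical names, and the VLM's in-context reasoning is replaced by a simple set-difference rule (flag objects that appear in failed rollouts but never in a successful one) purely so that the example runs without a model. The robot's pick-and-place primitive and the VLA policy are passed in as plain callables, and the example scene is illustrative.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set


@dataclass(frozen=True)
class Rollout:
    """One recorded policy episode (hypothetical record type)."""
    objects: FrozenSet[str]  # distractors/obstructors present at the start
    success: bool            # whether the policy completed the task


def suspect_objects(history: List[Rollout], scene: Set[str]) -> Set[str]:
    """Stand-in for the VLM reasoning step (an assumption, not the paper's method).

    Flags objects in the current scene that appeared in at least one failed
    rollout and in no successful rollout.
    """
    in_success = set().union(*(r.objects for r in history if r.success))
    in_failure = set().union(*(r.objects for r in history if not r.success))
    return (in_failure - in_success) & set(scene)


def stage_and_run(
    history: List[Rollout],
    scene: Set[str],
    remove: Callable[[str], None],
    run_policy: Callable[[Set[str]], bool],
) -> bool:
    """Remove suspected objects first, then execute the policy on what remains."""
    to_remove = suspect_objects(history, scene)
    for obj in sorted(to_remove):
        remove(obj)  # e.g. a primitive pick-and-place into a collector bin
    return run_policy(set(scene) - to_remove)


# Illustrative data only: object names and outcomes are made up for the demo.
history = [
    Rollout(frozenset({"bunny", "mustard bottle"}), success=True),
    Rollout(frozenset({"santa toy", "bunny"}), success=False),
]
scene = {"santa toy", "mustard bottle"}
removed: List[str] = []
result = stage_and_run(
    history,
    scene,
    remove=removed.append,
    # Toy policy: succeeds whenever the obstructing object is gone.
    run_policy=lambda remaining: "santa toy" not in remaining,
)
```

In this toy run only the object tied exclusively to failures is removed, while the object seen in a successful rollout is left in place. A real system would replace `suspect_objects` with a VLM query over the rollout videos and success labels.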

Method diagram

Real Robot Tasks

We evaluate StageCraft across three real-world robotic manipulation tasks.

Tasks
  • Stack Cups: Stack the left and right cups on top of the center cup.
  • Setup Plate: Take a plate from the rack and place a piece of bread on the plate.
  • Block in Bowl: Pick up the block and place it inside the bowl.

The underlying policy for each task is a vision-language-action model. StageCraft works irrespective of the model used, which we demonstrate by testing it with SmolVLA and Pi0.5. We test robustness to diverse visual distractors and physical obstructions across both seen and unseen object configurations.

Distractor Objects

We used a set of 8 distractor objects, as shown in the figure above, to conduct our real-world experiments. The gray collector bin is also a distractor, as it was not present in the robot's workspace during data collection.

Real Robot Results

SmolVLA

Stack Cups

Success

Failure

StageCraft

The first two videos show the VLA policy performing the task with distractors and obstructors in the scene. The robot completes the task when the bunny and the mustard bottle are present, but the Santa toy obstructs the robot and causes the task to fail. These rollouts are fed to StageCraft, which removes the obstructor because it is responsible for the policy failure.

Setup Plate

Success

Failure

StageCraft

Similarly, the first two videos show a success and a failure of the policy on the Setup Plate task. StageCraft uses a history of 15 rollouts to judge the effect of the distractors and obstructors in the scene, and removes the ones it identifies as failure-inducing before the policy runs, improving the policy's chance of success.

Block in Bowl

Success

Failure

StageCraft

Pi0.5

Stack Cups

Success

Failure

StageCraft

Setup Plate

Success

Failure

StageCraft

Block in Bowl

Success

Failure

StageCraft