R2RDreamer: 3D-aware Data Augmentation
for Spatially-generalized 2D Manipulation Policies

Anonymous Authors
CoRL 2026 Submission

Abstract

Spatial generalization is critical for imitation-learned manipulation policies, but collecting demonstrations across diverse object poses, robot configurations, and camera viewpoints is expensive. R2RDreamer is a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D observation-action co-editing while moving visual completion to 2D video space. It first edits incomplete object pointclouds and end-effector trajectories in a shared 3D frame, then projects the edited scene into masked image-space control videos with occlusion-aware reasoning. A dense-control image-to-video model then completes temporally coherent RGB observations, producing augmented RGB-action data for both compact 2D visuomotor policies and vision-language-action (VLA) style policies.

[Figure: R2RDreamer teaser showing the shift from 3D co-editing to 2D video completion.]
R2RDreamer keeps real-to-real 3D observation-action co-editing for geometric consistency, then performs visual repair in 2D video space.

Method

R2RDreamer keeps 3D where it is necessary for action consistency, but does not require the edited 3D scene to be a complete policy-ready observation. Incomplete geometry becomes a dense spatial control for video completion, shifting the visual-repair bottleneck from task-specific 3D reconstruction to scalable 2D video modeling.
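The co-editing idea can be illustrated with a minimal sketch. It assumes the edit is a single rigid transform (a yaw about the z-axis plus a translation) applied to both the object points and the end-effector waypoints in the shared frame; the source does not specify which edit operations are used, and the function names `rigid_transform_z`, `apply_rigid`, and `co_edit` are hypothetical. End-effector orientations are omitted for brevity.

```python
import math

def rigid_transform_z(yaw_rad, translation):
    """Build (R, t): a rotation about the z-axis followed by a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return R, list(translation)

def apply_rigid(R, t, p):
    """Apply p' = R p + t to one 3D point."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def co_edit(object_points, ee_positions, R, t):
    """Apply the same rigid edit to object points and end-effector waypoints.

    Assumption for illustration: both live in one shared 3D frame, so moving
    them together keeps the gripper-to-object geometry unchanged.
    """
    new_points = [apply_rigid(R, t, p) for p in object_points]
    new_traj = [apply_rigid(R, t, p) for p in ee_positions]
    return new_points, new_traj

if __name__ == "__main__":
    R, t = rigid_transform_z(math.pi / 6, (0.10, -0.05, 0.0))
    pts, traj = co_edit([[0.4, 0.0, 0.02]], [[0.4, 0.0, 0.15]], R, t)
    print(pts, traj)
```

Because the observation and the action undergo the same transform, the relative offset between a waypoint and the object is preserved, which is the consistency property the 3D stage is kept for.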

[Figure: R2RDreamer augmentation pipeline with 3D co-editing, occlusion-aware projection, and video completion training pairs.]
Demonstration augmentation and self-supervised completion data. R2RDreamer augments source demonstrations through 3D co-editing and occlusion-aware projection, while projection-consistent and random object-drop masks provide training pairs for video completion.
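As a rough illustration of occlusion-aware projection, the sketch below uses a standard pinhole camera with a per-pixel z-buffer, so that only the nearest point survives at each pixel. This is a generic stand-in rather than the method's confirmed procedure; the function name `project_with_zbuffer` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions, and points are assumed to be already expressed in the camera frame.

```python
def project_with_zbuffer(points, colors, fx, fy, cx, cy, width, height):
    """Project camera-frame 3D points to an image, keeping the nearest per pixel.

    Returns (image, mask): image[v][u] is an RGB tuple or None, and
    mask[v][u] is True where some point landed. Pixels left as None are the
    holes a completion model would have to fill.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (x, y, z), rgb in zip(points, colors):
        if z <= 0:  # behind the camera, not visible
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        inside = 0 <= u < width and 0 <= v < height
        if inside and z < depth[v][u]:  # nearer point occludes farther one
            depth[v][u] = z
            image[v][u] = rgb
    mask = [[px is not None for px in row] for row in image]
    return image, mask

if __name__ == "__main__":
    pts = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0)]  # two points on the same ray
    cols = [(255, 0, 0), (0, 0, 255)]
    img, m = project_with_zbuffer(pts, cols, 1.0, 1.0, 1.0, 1.0, 3, 3)
    print(img[1][1], m[1][1])
```

Running this per frame over the edited scene would give a sparse control video plus a validity mask; a real pipeline would likely also splat points over several pixels to reduce holes.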
[Figure: Projection-controlled video completion network with a dense control branch.]
Projection-controlled video completion model. The model conditions on a masked projected control video, optional reference image, and language instruction to complete temporally coherent RGB frames.
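One way to picture the self-supervised pairs built from random object-drop masks is sketched below: a real clip serves as the target, and the control input is the same clip with one randomly chosen object's pixels removed. The sketch assumes per-object, per-frame boolean masks are available; the function name `make_object_drop_pair` and the dictionary layout are hypothetical, and the actual mask generation is not specified in the source.

```python
import random

def make_object_drop_pair(frames, object_masks, rng=None):
    """Build one (control, target) training pair by dropping a random object.

    frames: list of frames, each a 2D list of pixel values.
    object_masks: dict mapping object name -> list of per-frame 2D bool masks
                  (True where the object is visible). Hypothetical layout.
    """
    rng = rng or random.Random()
    dropped = rng.choice(sorted(object_masks))
    control, valid = [], []
    for frame, mask in zip(frames, object_masks[dropped]):
        # Blank out the dropped object's pixels; keep everything else.
        control.append([[None if m else px for px, m in zip(frow, mrow)]
                        for frow, mrow in zip(frame, mask)])
        # valid is True where the control video still carries real pixels.
        valid.append([[not m for m in mrow] for mrow in mask])
    return {"dropped": dropped, "control": control,
            "valid": valid, "target": frames}

if __name__ == "__main__":
    clip = [[[1, 2], [3, 4]]]                      # one 2x2 frame
    masks = {"pot": [[[True, False], [False, False]]]}
    pair = make_object_drop_pair(clip, masks, random.Random(0))
    print(pair["dropped"], pair["control"][0])
```

The completion model would then be trained to recover `target` from `control`, which mirrors the holes it sees after projection at augmentation time.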

Qualitative Augmentation Cases

Each case compares the source demonstration, the edited 3D pointcloud control, and the completed augmented video.

Case 1: Pot-Food

Original: videos/case_01_original.mp4

3D Pointcloud Control: videos/case_01_pointcloud.mp4

Completed Augmentation: videos/case_01_augmented.mp4

The three videos expose the full augmentation path: the original real demonstration, the geometry-aware control produced after 3D co-editing and projection, and the completed RGB video used as policy training data.