SAM2Act: Integrating Visual Foundation Models with A Memory Architecture for Robotic Manipulation

Anonymous Submission



Abstract

Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalizing to complex environmental variations and in addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves strong performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems.

Real-world Results



Memory Tasks

Summary


SAM2Act is a multi-view robotic transformer that enhances feature representation by integrating multi-resolution upsampling with visual embeddings from large-scale foundation models. Built on the RVT-2 multi-view transformer, SAM2Act achieves strong multitask success and generalization. Building on this foundation, we introduce SAM2Act+, which incorporates a memory-based architecture inspired by SAM2. Using a memory bank and an attention mechanism, SAM2Act+ enables episodic recall to solve more complex, spatial-memory-dependent manipulation tasks.
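To make the episodic-recall idea concrete, here is a minimal PyTorch sketch of a rolling memory bank combined with cross-attention over stored features. This is an illustrative assumption of how such a loop could look, not the authors' released code: the module names, feature dimensions, FIFO memory size, and the placeholder linear memory encoder are all hypothetical.

# Minimal sketch (assumptions, not the released SAM2Act+ code) of a memory
# bank plus memory attention that conditions the current step on past steps.
from collections import deque

import torch
import torch.nn as nn


class MemoryAttention(nn.Module):
    """Cross-attend current tokens (queries) to stored memory tokens (keys/values)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current_tokens, memory_tokens):
        attended, _ = self.attn(current_tokens, memory_tokens, memory_tokens)
        return self.norm(current_tokens + attended)  # residual fusion


memory_bank = deque(maxlen=8)          # keep the 8 most recent encoded steps (hypothetical size)
memory_encoder = nn.Linear(256, 256)   # stand-in for a SAM2-style memory encoder
memory_attention = MemoryAttention()

for step in range(3):                           # simulated rollout
    obs_tokens = torch.randn(1, 64, 256)        # current-step features (B, N, D)
    if memory_bank:
        memory_tokens = torch.cat(list(memory_bank), dim=1)
        obs_tokens = memory_attention(obs_tokens, memory_tokens)
    # ...predict the next action from obs_tokens here...
    memory_bank.append(memory_encoder(obs_tokens).detach())

In SAM2Act+ itself, the stored entries correspond to encoded historical heatmaps and prior observations in the coarse branch, so the attention step lets the policy recall where it has already acted.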

Overview of SAM2Act and SAM2Act+


The SAM2Act architecture leverages the SAM2 image encoder to generate prompt-conditioned, multi-resolution embeddings, fine-tuned with LoRA for efficient adaptation to manipulation tasks. A multi-view transformer aligns spatial coordinates with language instructions, while a cascaded multi-resolution upsampling mechanism refines feature maps and generates accurate translation heatmaps. SAM2Act+ extends this architecture by incorporating memory-based components, including the Memory Encoder, Memory Attention, and Memory Bank, into the coarse branch. These components enable memory-driven reasoning by processing historical heatmaps and integrating prior observations, allowing the agent to predict actions based on stored contextual information. Observations are reconstructed into point clouds, rendered into three virtual images, and lifted into 3D translation points, enabling precise spatial reasoning across both architectures.
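As a rough illustration of the cascaded multi-resolution upsampling described above, the following PyTorch sketch fuses coarse multi-view features with higher-resolution encoder maps and emits a per-view translation heatmap. The shapes, channel widths, and layer choices here are assumptions for illustration, not the released implementation.

# Illustrative sketch (assumed shapes/layers) of a cascaded multi-resolution
# upsampling head that turns coarse features into a translation heatmap.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadedUpsampler(nn.Module):
    def __init__(self, dims=(256, 128, 64)):
        super().__init__()
        # one fusion stage per resolution level, coarse -> fine
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dims[i] + dims[i + 1], dims[i + 1], 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(len(dims) - 1)
        ])
        self.to_heatmap = nn.Conv2d(dims[-1], 1, kernel_size=1)

    def forward(self, coarse, skips):
        x = coarse
        for stage, skip in zip(self.stages, skips):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = stage(torch.cat([x, skip], dim=1))  # fuse with higher-res encoder features
        logits = self.to_heatmap(x)                  # (B, 1, H, W)
        b, _, h, w = logits.shape
        return logits.flatten(2).softmax(-1).view(b, 1, h, w)  # normalized translation heatmap


# Example: three virtual views rendered from the reconstructed point cloud.
coarse = torch.randn(3, 256, 28, 28)                 # coarse multi-view transformer features
skips = [torch.randn(3, 128, 56, 56),                # multi-resolution encoder feature maps
         torch.randn(3, 64, 112, 112)]
heatmaps = CascadedUpsampler()(coarse, skips)        # -> (3, 1, 112, 112)

The peak of each per-view heatmap can then be lifted back into 3D to obtain the predicted translation point, mirroring the point-cloud-to-virtual-view rendering described above.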

Experiments and Results


SAM2Act outperforms all other baselines and achieves the highest performance on the RLBench 18 tasks.




SAM2Act outperforms all other baselines and achieves state-of-the-art performance on The Colosseum.

Results on RLBench 18 tasks


RLBench 18 Tasks: Model Performance

Model                Avg. Success Rate (%)
SAM2Act              86.8
ARP+                 86.0
3D-LOTUS             83.1
3D Diffuser Actor    81.3
ACT3D                65.0
PolarNet             62.9

The COLOSSEUM: Average Decrease Across Perturbations

Model                Avg. Decrease (%)
SAM2Act              -4.3
3D Diffuser Actor    -15.6
MVP                  -16.3
R3M                  -49.9
ACT                  -61.8

Ablation Studies: Memory vs. Long-horizon Tasks


Results on The Colosseum



Results on MemoryBench

[Qualitative results on MemoryBench, browsable by task, method, and episode.]


In-distribution Real-world Results

[Side-by-side real-world rollouts comparing SAM2Act and RVT-2, browsable by task and episode.]

Out-of-distribution Real-world Results

[Side-by-side real-world rollouts comparing SAM2Act and RVT-2 in out-of-distribution settings, browsable by task and episode.]