MEM for Robots: Physical Intelligence Unveils 15-Minute Memory System for Gemma 3-4B VLAs
These articles are AI-generated summaries. Please check the original sources for full details.
Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks
Researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM) for robotic policies. The system lets Vision-Language-Action (VLA) models draw on up to 15 minutes of context, addressing the lack of memory in typical end-to-end policies.
Why This Matters
Current robotic policies typically operate on a single observation or a very short history, which leaves long-horizon tasks like kitchen cleaning either computationally intractable or prone to failure. By factorizing memory into a short-term video scale and a long-term language scale, MEM keeps inference under a 380 ms real-time threshold while letting robots adapt their manipulation strategies after recent failures.
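As a rough illustration of the two memory scales described above, the following Python sketch keeps a bounded window of recent frames alongside a running list of language summaries. Every name in it (`MultiScaleMemory`, `add_frame`, `add_summary`, `build_context`) is hypothetical, the frames are placeholder strings rather than images, and the summary is written by hand. The source does not describe MEM's actual interfaces, so this shows only the general shape of the idea.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MultiScaleMemory:
    """Toy two-scale memory: recent frames plus text summaries (hypothetical)."""

    max_frames: int = 16  # assumed size of the short-term window
    frames: deque = field(init=False)
    summaries: list = field(default_factory=list)

    def __post_init__(self):
        # A bounded deque silently drops the oldest frame once it is full.
        self.frames = deque(maxlen=self.max_frames)

    def add_frame(self, frame):
        """Store a new observation in short-term memory."""
        self.frames.append(frame)

    def add_summary(self, text):
        """Append a language summary of older events to long-term memory."""
        self.summaries.append(text)

    def build_context(self):
        """Return what a policy would condition on at this step."""
        return {
            "long_term_text": " ".join(self.summaries),
            "short_term_frames": list(self.frames),
        }


memory = MultiScaleMemory()
for step in range(20):
    memory.add_frame(f"frame_{step}")  # placeholder for a camera image
memory.add_summary("I placed three bowls.")

context = memory.build_context()
print(len(context["short_term_frames"]))  # 16
print(context["short_term_frames"][0])    # frame_4
print(context["long_term_text"])          # I placed three bowls.
```

The point of the split is that the frame window stays a fixed size no matter how long the task runs, while older events survive only as short text.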
Key Insights
- 62% success rate increase in refrigerator opening tasks with unknown hinge directions (MEM Research, 2026)
- Space-Time Separable Attention, which interleaves spatial attention with causal-temporal attention and reduces complexity from O(n^2K^2) to O(Kn^2 + nK^2), for n patch tokens per frame and K frames
- Gemma 3-4B used by the Physical Intelligence and Stanford researchers as the foundation for the π0.6 VLA backbone
- Language-based long-term memory that compresses 15 minutes of events into semantic summaries such as ‘I placed three bowls’
- Single NVIDIA H100 GPU implementation capable of processing 16 observation frames while staying under the 380 ms real-time barrier
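The saving from separating spatial and temporal attention can be checked with a back-of-the-envelope count of query-key pairs. The sketch below assumes n patch tokens per frame and K frames, with spatial attention running within each frame and temporal attention running across frames at each patch position. It ignores causal masking and constant factors. The value n = 256 is an illustrative assumption, while K = 16 matches the frame count cited above. The function names are hypothetical and the count is not taken from the paper.

```python
def joint_attention_pairs(n: int, k: int) -> int:
    """Pairs scored when all n*k tokens attend to one another."""
    return (n * k) ** 2


def separable_attention_pairs(n: int, k: int) -> int:
    """Pairs scored by one spatial pass per frame plus one temporal pass per patch."""
    spatial = k * n ** 2   # each of the k frames scores an n x n grid
    temporal = n * k ** 2  # each of the n patch positions scores a k x k grid
    return spatial + temporal


n_patches, n_frames = 256, 16  # n_patches is an assumed value
joint = joint_attention_pairs(n_patches, n_frames)
separable = separable_attention_pairs(n_patches, n_frames)
print(joint)                        # 16777216
print(separable)                    # 1114112
print(round(joint / separable, 1))  # 15.1
```

Under these assumptions the separable scheme scores roughly fifteen times fewer pairs, and the gap widens as either the number of frames or the number of patches grows.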
Practical Applications
- Use Case: π0.6 VLA performing ‘Recipe Setup’ by retrieving ingredients from multiple locations over 15 minutes. Pitfall: Memory-less VLAs failing tasks significantly more often due to short-term history constraints.
- Use Case: MEM-based robot adapting manipulation strategies in real-time to pick up chopsticks at variable heights. Pitfall: Single-observation models failing to resolve self-occlusions or adapt grasps during the execution phase.