MEM for Robots: Physical Intelligence Unveils 15-Minute Memory System for Gemma 3-4B VLAs


These articles are AI-generated summaries. Please check the original sources for full details.

Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

Researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM) for robotic policies. The system lets Vision-Language-Action (VLA) models draw on up to 15 minutes of context, addressing the lack of memory in conventional end-to-end policies.

Why This Matters

Current robotic policies typically condition on a single observation or a very short history, so long-horizon tasks such as kitchen cleaning are either computationally intractable or prone to failure. MEM factorizes memory into two scales, short-term video and long-term language. This keeps inference under a 380 ms real-time threshold while letting robots adapt their manipulation strategies after recent failures.
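The two-scale split can be pictured as a small data structure. The following is a minimal sketch, not MEM's implementation: the class `TwoScaleMemory`, its method names, and the choice to simply append event strings are all hypothetical, chosen only to show a bounded window of recent frames sitting next to an unbounded but compact text log.

```python
from collections import deque


class TwoScaleMemory:
    """Hypothetical container: recent frames plus a running text summary."""

    def __init__(self, max_frames=16):
        # Short-term scale: only the most recent frames are kept.
        self.frames = deque(maxlen=max_frames)
        # Long-term scale: short sentences describing past events.
        self.events = []

    def observe(self, frame):
        """Add a frame; the oldest one is dropped once the window is full."""
        self.frames.append(frame)

    def note(self, event):
        """Record a past event as text (assumed to come from the policy)."""
        self.events.append(event)

    def context(self):
        """Return what a policy would condition on at this step."""
        return {"video": list(self.frames), "language": " ".join(self.events)}


memory = TwoScaleMemory(max_frames=3)
for frame_id in range(5):          # integers stand in for camera frames
    memory.observe(frame_id)
memory.note("I placed three bowls.")
print(memory.context())
```

The point of the design is that the video window stays a fixed size no matter how long the task runs, while the text log grows slowly because each entry stands in for many frames.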

Key Insights

  • A 62% increase in success rate on refrigerator-opening tasks where the hinge direction is unknown (MEM Research, 2026)
  • Space-Time Separable Attention interleaves spatial attention with causal temporal attention, reducing complexity from O(n^2K^2) to O(Kn^2 + nK^2), where n is the number of tokens per frame and K is the number of frames
  • Gemma 3-4B serves as the foundation of the π0.6 VLA backbone used by the Physical Intelligence and Stanford researchers
  • Language-based long-term memory compresses up to 15 minutes of events into semantic summaries such as ‘I placed three bowls’
  • The implementation runs on a single NVIDIA H100 GPU and processes 16 observation frames while staying under the 380 ms real-time barrier
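A rough way to see where the attention savings come from is to count query-key pairs. The sketch below assumes the usual accounting for factorized attention: a spatial pass inside each frame, then a causal temporal pass for each token position. It is not MEM's implementation, the function names are made up for this example, and 256 tokens per frame is an assumed value; only the 16-frame window comes from the article.

```python
def full_attention_pairs(n: int, k: int) -> int:
    """Query-key pairs when all n*k tokens attend to each other."""
    return (n * k) ** 2


def separable_attention_pairs(n: int, k: int, causal: bool = True) -> int:
    """Pairs when attention is split into a spatial and a temporal pass."""
    # Spatial pass: within each of the k frames, n tokens attend to each other.
    spatial = k * n * n
    # Temporal pass: each of the n token positions attends across frames.
    # With a causal mask, frame t sees frames 0..t, giving k*(k+1)/2 pairs.
    temporal = n * (k * (k + 1) // 2 if causal else k * k)
    return spatial + temporal


n_tokens, n_frames = 256, 16  # 256 tokens per frame is an assumed value
full = full_attention_pairs(n_tokens, n_frames)
split = separable_attention_pairs(n_tokens, n_frames)
print(f"full: {full:,}  separable: {split:,}  ratio: {full / split:.1f}x")
```

Under these assumptions the spatial pass dominates the separable cost, so adding frames grows the total roughly linearly rather than quadratically.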

Practical Applications

  • Use Case: A π0.6 VLA performs ‘Recipe Setup’ by retrieving ingredients from multiple locations over 15 minutes. Pitfall: Memoryless VLAs fail such tasks significantly more often because they are limited to a short history.
  • Use Case: A MEM-based robot adapts its manipulation strategy in real time to pick up chopsticks at variable heights. Pitfall: Single-observation models fail to resolve self-occlusions or adapt their grasps during execution.
