Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval

These articles are AI-generated summaries. Please check the original sources for full details.

Meta AI has open-sourced Perception Encoder Audiovisual (PE-AV), a family of encoders for joint audio and video understanding, trained on 100 million audio-video pairs with text captions. This release extends Meta’s Perception Encoder (PE) stack, surpassing earlier models such as SigLIP2 and InternVideo2.

PE-AV addresses the challenge of building a single embedding space shared by audio, video, and text, rather than relying on a separate specialized model for each modality. Current multimodal models often generalize poorly and need extensive task-specific fine-tuning, which raises development costs and limits scalability.
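To make the idea of a shared embedding space concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. It does not use the PE-AV release: the embeddings are random stand-ins, and the names `l2_normalize`, `retrieve`, `video_embeddings`, and `text_query` are hypothetical, chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def retrieve(query, gallery, k=3):
    """Return indices of the k gallery rows closest to the query (cosine similarity)."""
    scores = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-scores)[:k], scores

rng = np.random.default_rng(0)
dim = 16

# Assumption: random vectors stand in for video embeddings from an encoder.
video_embeddings = rng.normal(size=(5, dim))

# Assumption: a text query that describes clip 2 lands near clip 2 in the shared space.
text_query = video_embeddings[2] + 0.05 * rng.normal(size=dim)

top_idx, scores = retrieve(text_query, video_embeddings, k=3)
print("best match:", top_idx[0])
```

Because all modalities live in one space, the same `retrieve` call would work for audio-to-video or text-to-audio lookups without a modality-specific model.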

Key Insights

  • 100M Audio-Video Pairs: PE-AV was pre-trained on 100 million audio-video pairs with text captions.
  • DAC VAE for Audio: The model uses a DAC VAE codec to convert raw waveforms into discrete audio tokens, enabling efficient processing.
  • SAM Audio Integration: PE-AV serves as the core perception engine for Meta’s SAM Audio model, enabling prompt-based audio separation and sound event localization.

Working Example

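# PE-AV is trained with a contrastive loss across ten modality pairs.
# Conceptual sketch only; the actual implementation is within the framework.
# A contrastive loss compares two modalities at a time, so the audio/video/text
# case is written as a sum over pairs:
# loss = (contrastive_loss(audio_embedding, video_embedding)
#         + contrastive_loss(audio_embedding, text_embedding)
#         + contrastive_loss(video_embedding, text_embedding))
# Training pulls the embeddings of matching items together
# and pushes the embeddings of non-matching items apart.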
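The conceptual comments above can be turned into a small runnable sketch. The summary only says "contrastive loss", so the symmetric InfoNCE form used here is an assumption made for illustration, not the confirmed PE-AV objective; the function names, the temperature value, and the random embeddings are likewise hypothetical. Only three modality pairs are shown rather than the ten mentioned in the article.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def pairwise_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches; row i of a matches row i of b."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    diag = np.arange(len(a))                 # matching pairs sit on the diagonal
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_ab + loss_ba)

def multimodal_loss(embeddings):
    """Average the pairwise loss over every unordered pair of modalities."""
    names = sorted(embeddings)
    pairs = [(m, n) for i, m in enumerate(names) for n in names[i + 1:]]
    total = sum(pairwise_contrastive_loss(embeddings[m], embeddings[n]) for m, n in pairs)
    return total / len(pairs)

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))              # assumption: shared content per item

def noisy():
    return base + 0.01 * rng.normal(size=base.shape)

aligned = {"audio": noisy(), "video": noisy(), "text": noisy()}
# Misalign the video rows so they no longer match the audio and text rows.
shuffled = dict(aligned, video=np.roll(aligned["video"], 1, axis=0))

aligned_loss = multimodal_loss(aligned)
shuffled_loss = multimodal_loss(shuffled)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The aligned batch yields a lower loss than the shuffled one, which is the behavior the comments describe: matching items are scored as close, mismatched items as far apart.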

Practical Applications

  • SAM Audio: Meta’s SAM Audio uses PE-AV embeddings to separate sound sources in complex audio mixtures.
  • Pitfall: Relying solely on unimodal models (e.g., audio-only or video-only) can lead to inaccurate or incomplete understanding of the scene, especially in noisy or ambiguous environments.
