Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval


Meta AI has open-sourced Perception Encoder Audiovisual (PE-AV), a new encoder family for joint audio and video understanding, trained on 100 million audio-video pairs with text captions. The release extends Meta’s Perception Encoder (PE) stack, which is reported to outperform earlier models such as SigLIP2 and InternVideo2.

PE-AV addresses the challenge of creating a unified embedding space for audio, video, and text, moving beyond specialized models for each modality. Current multimodal models often struggle with generalization and require extensive task-specific fine-tuning, leading to significant development costs and limited scalability.
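To make the idea of a unified embedding space concrete, here is a minimal retrieval sketch. It assumes that text and video clips have already been encoded into vectors of the same dimension; the random vectors, the `cosine` and `retrieve` helpers, and the 16-dimensional size are all hypothetical stand-ins for illustration, not PE-AV outputs or part of any released API.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery, k=3):
    """Return indices of the k gallery vectors most similar to the query."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    return ranked[:k]

# Stand-in embeddings: random vectors, NOT real encoder outputs.
rng = random.Random(0)
dim = 16
video_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

# Simulate a text query whose embedding lands close to clip 2.
text_query = [x + 0.05 * rng.gauss(0, 1) for x in video_embeddings[2]]

top_matches = retrieve(text_query, video_embeddings, k=3)
print(top_matches)  # clip 2 is ranked first
```

Because every modality lives in the same space, the same `retrieve` call would work for audio-to-video or text-to-audio queries without a separate model per direction.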

Key Insights

100M Audio-Video Pairs: PE-AV was pre-trained on a large-scale dataset of 100 million audio-video pairs with text captions.
DAC VAE for Audio: The model utilizes a DAC VAE codec to convert raw waveforms into discrete audio tokens, enabling efficient processing.
SAM Audio Integration: PE-AV serves as the core perception engine for Meta’s SAM Audio model, enabling prompt-based audio separation and sound event localization.

Working Example

# PE-AV is trained with a contrastive loss applied across ten modality pairs.
# Conceptual example only (the actual implementation is within the framework):
# loss = sum(contrastive_loss(emb[a], emb[b]) for (a, b) in modality_pairs)
# Each pairwise term pulls embeddings of the same clip closer together
# and pushes embeddings of different clips further apart.
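The conceptual loss above can be turned into a small runnable sketch. The source does not say which contrastive objective is used, so this example swaps in a symmetric InfoNCE loss as an assumption, and it sums over only the three pairs formed by audio, video, and text rather than the ten pairs mentioned above. The function names, the temperature of 0.07, and the random data are hypothetical.

```python
import itertools
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(row, index):
    """Log-probability of row[index] under a numerically stable softmax."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return row[index] - log_sum

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE (assumed objective): item i in a matches item i in b."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    columns = [list(col) for col in zip(*logits)]
    loss_ab = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    loss_ba = -sum(log_softmax_at(columns[i], i) for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)

def multi_pair_loss(embeddings, temperature=0.07):
    """Sum the pairwise loss over every unordered pair of modalities."""
    pairs = itertools.combinations(sorted(embeddings), 2)
    return sum(info_nce(embeddings[m], embeddings[n], temperature) for m, n in pairs)

# Toy data: random vectors, NOT real encoder outputs.
rng = random.Random(0)
names = ("audio", "video", "text")

def random_batch(size=8, dim=32):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(size)]

base = random_batch()
# Aligned batch: each modality is a lightly perturbed copy of the same clips.
aligned = {
    name: [[x + 0.1 * rng.gauss(0, 1) for x in row] for row in base]
    for name in names
}
# Unrelated batch: each modality is drawn independently.
unrelated = {name: random_batch() for name in names}

aligned_loss = multi_pair_loss(aligned)
unrelated_loss = multi_pair_loss(unrelated)
print(aligned_loss < unrelated_loss)
```

The comparison at the end illustrates the training signal: when the modalities of a clip agree, the summed loss is lower than when they are unrelated.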

Practical Applications

SAM Audio: Meta’s SAM Audio uses PE-AV embeddings to separate sound sources in complex audio mixtures.
Pitfall: Relying solely on unimodal models (e.g., audio-only or video-only) can lead to inaccurate or incomplete understanding of the scene, especially in noisy or ambiguous environments.

References:

https://www.marktechpost.com/2025/12/22/meta-ai-open-sourced-perception-encoder-audiovisual-pe-av-the-audiovisual-encoder-powering-sam-audio-and-large-scale-multimodal-retrieval/
