NVIDIA and University of Maryland Release Audio Flamingo Next (AF-Next)

NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model

NVIDIA and the University of Maryland have released Audio Flamingo Next (AF-Next), an open Large Audio-Language Model trained on 1 million hours of audio. The model reports state-of-the-art performance on the MMAU benchmark, with a sound accuracy score of 79.87. It is presented as the first internet-scale open model capable of robust reasoning over audio recordings up to 30 minutes long.

Why This Matters

The development of open audio models has traditionally lagged behind vision-language counterparts due to the difficulty of reasoning over diverse environmental sounds, music, and long-form speech. Standard transformer architectures often struggle with temporal grounding, leading to hallucinations when processing audio beyond short clips.

AF-Next addresses these limitations with Rotary Time Embeddings (RoTE) and Temporal Audio Chain-of-Thought reasoning. By anchoring intermediate reasoning steps to specific timestamps, the model can aggregate evidence precisely across context windows of up to 128K tokens, a capability previously limited to closed-source models such as Gemini 2.5 Pro.
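
To illustrate the idea behind time-based rotary embeddings, here is a minimal sketch. It assumes that RoTE rotates feature pairs by an angle proportional to each audio token's timestamp in seconds rather than its sequence index; the source does not give the exact formulation. The function names, the base of 10000, and the example vectors and timestamps are all illustrative assumptions.

```python
import math

def rotary_angles(position, dim, base=10000.0):
    """One rotation angle per feature pair, scaled by `position`."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

def rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    out = []
    for i, theta in enumerate(rotary_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors for two audio tokens.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# Assumption: the position fed to the rotation is a timestamp in seconds.
# An index-based variant would pass the token index here instead.
score_a = dot(rotate(q, 15.5), rotate(k, 12.0))    # tokens 3.5 s apart, early in the clip
score_b = dot(rotate(q, 103.5), rotate(k, 100.0))  # tokens 3.5 s apart, later in the clip

# The attention score depends only on the time gap, not on absolute position.
print(math.isclose(score_a, score_b, abs_tol=1e-9))  # True
```

Under this assumption, two events separated by the same number of seconds interact the same way wherever they fall in a long recording, which is one plausible reason a time-based encoding would help temporal grounding.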

Key Insights

AF-Next-Instruct scored 73.9 on LongAudioBench in 2026, well ahead of the closed-source Gemini 2.5 Pro, which scored 60.4.
Temporal Audio Chain-of-Thought anchors reasoning steps to specific timestamps to reduce hallucinations in long-form audio up to 30 minutes.
Hybrid sequence parallelism, combining Ulysses and Ring attention, allows the model to handle 128K context tokens across multi-node GPU clusters.
The training corpus includes 108 million samples and 1 million hours of audio, featuring a new dataset called AF-Think-Time for complex reasoning.
The architecture pairs an AF-Whisper encoder with a Qwen-2.5-7B backbone, mapping encoder features through a 2-layer MLP adaptor into the backbone's embedding space.
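
The hybrid sequence parallelism point can be made concrete with a small workload calculation. This is a sketch under stated assumptions, not AF-Next's actual configuration: the function name, the 8 x 2 GPU grid, the 32 attention heads, and reading "128K" as 131072 tokens are all hypothetical. It follows the usual pattern in which Ulysses swaps sequence shards for head shards within a group via all-to-all, while Ring attention passes key/value blocks between groups.

```python
def hybrid_sp_layout(seq_len, num_heads, ulysses_degree, ring_degree):
    """Per-GPU workload for a Ulysses x Ring sequence-parallel grid."""
    world = ulysses_degree * ring_degree
    if seq_len % world != 0:
        raise ValueError("seq_len must divide evenly across all GPUs")
    if num_heads % ulysses_degree != 0:
        raise ValueError("num_heads must divide evenly across the Ulysses group")
    return {
        "gpus": world,
        # Outside attention, every GPU holds an equal slice of the sequence.
        "tokens_per_gpu": seq_len // world,
        # Inside attention, the Ulysses all-to-all gathers the group's tokens
        # and splits the heads instead.
        "attn_tokens_per_gpu": seq_len // ring_degree,
        "attn_heads_per_gpu": num_heads // ulysses_degree,
        # Ring attention needs one K/V exchange per remote ring member.
        "ring_steps": ring_degree - 1,
    }

# Illustrative values only.
layout = hybrid_sp_layout(seq_len=131072, num_heads=32,
                          ulysses_degree=8, ring_degree=2)
print(layout["tokens_per_gpu"])      # 8192
print(layout["attn_heads_per_gpu"])  # 4
```

The Ulysses degree is capped by the number of attention heads, so the ring dimension is what lets the grid keep growing across nodes once the heads are exhausted.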

Practical Applications

Use Case: NVIDIA AF-Next-Think for multi-party conversation analysis and speaker identification in 30-minute recordings. Pitfall: Using sequence-based positional encoding instead of RoTE leads to temporal reasoning failure in long contexts.
Use Case: High-fidelity music captioning and instrument recognition, achieving 92.13 on Medley-Solos-DB. Pitfall: Relying on short-clip training data, which fails to capture the structural complexity of extended musical compositions.
Use Case: Real-time voice-to-voice interaction using the integrated streaming TTS module for low-latency response. Pitfall: High memory overhead during long-context inference without implementing hybrid sequence parallelism.
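
To give a rough sense of the memory pitfall in long-context inference, the sketch below estimates key/value cache size for a single 128K-token sequence. The layer count, key/value head count, head dimension, and 16-bit precision are illustrative values for a 7B-class decoder with grouped-query attention. They are assumptions, not confirmed AF-Next settings, and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed, illustrative configuration (not taken from the source).
total = kv_cache_bytes(seq_len=131072, num_layers=28, num_kv_heads=4, head_dim=128)
total_gib = total / 2**30

# Sharding the sequence evenly over 8 GPUs divides the cache accordingly.
per_gpu_gib = total_gib / 8

print(total_gib)    # 7.0
print(per_gpu_gib)  # 0.875
```

The cache grows linearly with sequence length, so the estimate scales directly with recording duration; activations and model weights add to this figure.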

References:

https://www.marktechpost.com/2026/04/14/nvidia-and-the-university-of-maryland-researchers-released-audio-flamingo-next-af-next-a-super-powerful-and-open-large-audio-language-model/
