Audio ML Papers

Week of October 05 - October 12, 2025

Subcategories: All (143) | Speech Synthesis (29) | Music Synthesis (10) | Ambient Synthesis (2) | Quality Assessment (5) | Enhancement (17) | ASR (16) | Other (64)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Satvik Dixit, Soham Deshmukh, Bhiksha Raj · arXiv
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from NLP and audio captioning, rely on surface simila...
#2 TOP PAPER (Score: 83)
Juncheng Wang, Chao Xu, Cheng Yu ... · EMNLP 2025
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating mo...
#3 TOP PAPER (Score: 83)
Samuel A. Verburg, Efren Fernandez-Grande, Peter Gerstoft · arXiv
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with ...
Monday, October 06, 2025
Satvik Dixit, Soham Deshmukh, Bhiksha Raj · arXiv
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from NLP and audio captioning, rely on surface simila...
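As context for the "surface similarity" critique, here is a minimal sketch (not from the paper) of how a purely lexical metric can mis-rank open-ended AQA answers: the toy unigram-precision score below stands in for BLEU-style overlap and is an illustrative assumption, not the evaluator the authors propose.

```python
# Toy surface-similarity metric: fraction of candidate tokens found in the reference.
def unigram_precision(candidate: str, reference: str) -> float:
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand) if cand else 0.0

reference     = "a dog is barking loudly in the background"
paraphrase    = "the sound of a canine yapping can be heard behind"  # correct answer, little word overlap
lexical_match = "a dog is sleeping quietly in the background"        # wrong answer, high word overlap

print(unigram_precision(paraphrase, reference))     # ~0.2: penalized despite being correct
print(unigram_precision(lexical_match, reference))  # ~0.75: rewarded despite being wrong
```

The mismatch between the two scores is exactly the failure mode that motivates moving beyond surface-level metrics for open-ended audio question answering.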
Samuel A. Verburg, Efren Fernandez-Grande, Peter Gerstoft · arXiv
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with ...
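To make the differentiable-physics idea concrete, the sketch below (a toy 1-D analogue, not the paper's method) simulates the wave equation with finite differences in PyTorch, treats the initial pressure field as a learnable parameter, and fits it by gradient descent to a few sparse "microphone" observations. Grid size, wave speed, sensor positions, and the optimizer are all illustrative assumptions.

```python
import torch

n, steps, c, dt, dx = 128, 60, 1.0, 0.5, 1.0   # grid points, time steps, wave speed, step sizes
coef = (c * dt / dx) ** 2                       # CFL coefficient (stable: 0.25)

def simulate(p0):
    """Leapfrog integration of the 1-D wave equation from initial field p0 (zero initial velocity)."""
    prev, cur = p0, p0.clone()
    for _ in range(steps):
        lap = torch.cat([cur.new_zeros(1),
                         cur[:-2] - 2 * cur[1:-1] + cur[2:],   # discrete Laplacian
                         cur.new_zeros(1)])
        prev, cur = cur, 2 * cur - prev + coef * lap
    return cur

# Synthetic ground truth: a Gaussian pulse observed at a handful of sensor positions.
x = torch.arange(n, dtype=torch.float32)
true_p0 = torch.exp(-0.02 * (x - 40) ** 2)
sensors = torch.tensor([10, 50, 90, 120])
observed = simulate(true_p0)[sensors]

# Recover the initial condition from the sparse observations via autodiff.
p0 = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([p0], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((simulate(p0)[sensors] - observed) ** 2)
    loss.backward()
    opt.step()
print(loss.item())
```

Because the simulator is written in an autodiff framework, gradients flow from the observation error back to the unknown initial conditions, which is the core mechanism the abstract describes (there applied to real sound fields in space).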
Juncheng Wang, Chao Xu, Cheng Yu ... · EMNLP 2025
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating mo...
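For readers unfamiliar with the tokenizer family the abstract refers to, here is a minimal sketch of residual vector quantization (RVQ): each stage quantizes the residual left by the previous stage, so one latent frame becomes one code index per codebook. The random codebooks and dimensions are illustrative assumptions; real audio tokenizers learn them end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 64, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(x, codebooks):
    """Return one code index per stage plus the running reconstruction."""
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        residual = x - recon                                         # what is still unexplained
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code in this codebook
        codes.append(idx)
        recon = recon + cb[idx]
    return codes, recon

frame = rng.normal(size=dim)      # stand-in for one latent audio frame
codes, recon = rvq_encode(frame, codebooks)
print(codes, float(np.linalg.norm(frame - recon)))
```

The "dilemma" the authors point to arises because an autoregressive LM must predict several such code streams per frame, and how the streams are interleaved trades off quality against sequence length.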
Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis · arXiv
We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art perfo...
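As a generic illustration of the conditioning mechanism named in the abstract, the sketch below shows cross-attention from audio-token states onto text-token states; the actual model pairs this with a Mamba state-space backbone, which is omitted here, and the dimensions and use of nn.MultiheadAttention are assumptions for the example.

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """Audio positions query the text sequence; the result is added back as a residual."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_states, text_states):
        attended, _ = self.cross_attn(query=audio_states, key=text_states, value=text_states)
        return self.norm(audio_states + attended)

block = TextConditionedBlock()
audio = torch.randn(2, 100, 256)   # (batch, audio tokens, model dim)
text = torch.randn(2, 20, 256)     # (batch, text tokens, model dim)
print(block(audio, text).shape)    # torch.Size([2, 100, 256])
```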
Wenhao Guan, Zhikang Niu, Ziyue Jiang ... · arXiv
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unif...