Audio ML Papers

Week of October 05 - October 12, 2025

Subcategories: All (14) | Speech Synthesis (5) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (0) | ASR (1) | Other (6)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Satvik Dixit, Soham Deshmukh, Bhiksha Raj · arXiv
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, mostly adapted from NLP and audio captioning, rely on surface simila...
#2 TOP PAPER (Score: 83)
Juncheng Wang, Chao Xu, Cheng Yu ... · EMNLP 2025
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating mo...
#3 TOP PAPER (Score: 83)
Samuel A. Verburg, Efren Fernandez-Grande, Peter Gerstoft · arXiv
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with ...
Wednesday, October 08, 2025
Peize He, Zichen Wen, Yubo Wang ... · arXiv
Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do no...
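A minimal sketch of the quadratic attention cost the abstract refers to; the shapes and dimensions are illustrative assumptions, not the paper's setup. The (N, N) score matrix is what makes long-form audio expensive.

```python
# Toy self-attention over N audio-frame embeddings. The (N, N) score matrix
# is the O(N^2) term that dominates cost as the clip gets longer.
import torch

def naive_self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (N, d) frame embeddings; returns the attended sequence."""
    d = x.shape[-1]
    scores = x @ x.T / d ** 0.5          # (N, N) -- quadratic in sequence length
    return torch.softmax(scores, dim=-1) @ x

for n in (1_000, 4_000):                 # 4x more frames -> 16x larger score matrix
    x = torch.randn(n, 64)
    print(n, naive_self_attention(x).shape)
```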
Peter Plantinga, Roozbeh Sattari, Karine Marcotte ... · SMASH 2025
The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease. We investigate the factors on which human experts base their judgments of the presence of disease in speech samples over five different sp...
Tuesday, October 07, 2025
Akshay Muppidi, Martin Radfar · 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, pp. 10881-10885
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the in...
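A rough sketch of the HRNet idea the abstract builds on: keep a high-resolution stream alive alongside a downsampled one and fuse the two, rather than only downsampling. The module below is a two-stream toy under assumed shapes, not EmoHRNet's architecture.

```python
# Toy two-resolution block: a full-resolution branch runs in parallel with a
# strided (half-resolution) branch, and the low-res context is upsampled and
# fused back, so the output keeps the input's full resolution.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        self.hi = nn.Conv2d(ch, ch, 3, padding=1)             # full resolution
        self.lo = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # half resolution
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        return self.hi(x) + self.up(self.lo(x))               # fuse streams

spec = torch.randn(1, 32, 128, 128)       # e.g. a mel-spectrogram feature map
print(TwoStreamFusion()(spec).shape)      # torch.Size([1, 32, 128, 128])
```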
Haoxun Li, Yu Liu, Yuqing Sun ... · arXiv
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We ...
Christian Marinoni, Riccardo Fosco Gramaccioni, Kazuki Shimada ... · IJCNN 2025
Although audio generation has been widely studied in recent years, video-aligned audio generation remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchro...
Mingxuan Wang, Satoshi Nakamura · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Machine Speech Chain, which simulates the human perception-production loop, has proven effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained wit...
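A toy, self-contained sketch of the speech-chain coupling described here: the ASR branch is scored against the reference text, while the text-to-semantic TTS stage is scored against the semantic tokens the ASR produced. All sizes and names are hypothetical placeholders, not TokenChain's API.

```python
import torch
import torch.nn.functional as F

VOCAB = 256  # assumed semantic-token vocabulary size

def chain_losses(asr_logits, ref_text_ids, t2s_logits, asr_token_ids):
    """asr_logits: (T, n_chars) text predictions from the ASR branch;
       t2s_logits: (S, VOCAB) semantic-token predictions from the TTS branch."""
    asr_loss = F.cross_entropy(asr_logits, ref_text_ids)    # ASR vs. reference text
    tts_loss = F.cross_entropy(t2s_logits, asr_token_ids)   # TTS vs. ASR's tokens
    return asr_loss + tts_loss

asr_logits = torch.randn(20, 50)            # 20 text positions, 50-symbol vocab
ref_text = torch.randint(0, 50, (20,))
t2s_logits = torch.randn(40, VOCAB)         # 40 semantic-token positions
asr_tokens = torch.randint(0, VOCAB, (40,))
print(chain_losses(asr_logits, ref_text, t2s_logits, asr_tokens))
```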
Tao Zhu, Yinfeng Yu, Liejun Wang ... · Proceedings of the 2025 ACM Multimedia Asia Conference (MMAsia '25)
Diffusion models have demonstrated remarkable performance in speech synthesis, but typically require multi-step sampling, resulting in low inference efficiency. Recent studies address this issue by distilling diffusion models into consistency models, enabling efficient one-step g...
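A minimal sketch of why consistency distillation buys one-step synthesis: the distilled model f(x_t, t) maps a noisy input at any noise level straight to the clean signal, so sampling is a single forward pass from pure noise. The tiny network and mel-like shapes below are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class TinyConsistencyNet(nn.Module):
    def __init__(self, dim: int = 80):        # e.g. mel-spectrogram channels
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # condition on the noise level t and predict the clean signal directly
        return self.net(torch.cat([x_t, t.expand(x_t.shape[0], 1)], dim=-1))

model = TinyConsistencyNet()
x_T = torch.randn(16, 80)                     # start from pure noise
mel = model(x_T, torch.tensor([[1.0]]))       # one-step generation
print(mel.shape)                              # torch.Size([16, 80])
```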
Huang-Cheng Chou, Chi-Chun Lee · arXiv
Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters who select emotions from predefined categories. However, disagreements amo...
Monday, October 06, 2025
Satvik Dixit, Soham Deshmukh, Bhiksha Raj · arXiv
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, mostly adapted from NLP and audio captioning, rely on surface simila...
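A small illustration of the surface-similarity failure mode the abstract points to: a semantically correct paraphrase shares few n-grams with the reference, so BLEU scores it poorly. The sentences are made up; the example assumes nltk is installed.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "a dog is barking loudly".split()
paraphrase = "the clip contains loud barking from a dog".split()  # also correct

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low
```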
Samuel A. Verburg, Efren Fernandez-Grande, Peter Gerstoft · arXiv
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with ...
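A toy 1D version of the general recipe the abstract describes, under my own assumptions rather than the paper's implementation: roll the wave equation forward with a differentiable finite-difference solver, then fit its initial condition to a handful of microphone-like observations by gradient descent.

```python
import torch

n, steps, c = 64, 50, 0.5                  # grid size, time steps, CFL number
u0 = torch.zeros(n, requires_grad=True)    # learnable initial pressure field

def simulate(u0: torch.Tensor) -> torch.Tensor:
    u_prev, u = u0, u0.clone()
    for _ in range(steps):                 # leapfrog update, periodic boundary
        lap = torch.roll(u, 1) - 2 * u + torch.roll(u, -1)
        u_prev, u = u, 2 * u - u_prev + c ** 2 * lap
    return u

obs_idx = torch.tensor([5, 20, 40, 60])    # sparse "microphone" positions
target = torch.sin(torch.linspace(0, 3.14, n))[obs_idx]  # fake measurements

opt = torch.optim.Adam([u0], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = ((simulate(u0)[obs_idx] - target) ** 2).mean()
    loss.backward()
    opt.step()
print(loss.item())                         # should shrink toward zero
```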
Juncheng Wang, Chao Xu, Cheng Yu ... · EMNLP 2025
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating mo...
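A bare-bones sketch of residual vector quantization, the tokenizer family at the center of this dilemma; illustrative only, not the paper's tokenizer. Each codebook quantizes the residual left by the previous one, which is why deeper RVQ levels add fidelity but also more token streams for the LM to model.

```python
import torch

def rvq_encode(x: torch.Tensor, codebooks: list) -> list:
    """x: (d,) latent vector; codebooks: list of (K, d) tensors. Returns ids."""
    ids, residual = [], x
    for cb in codebooks:
        dists = ((residual[None, :] - cb) ** 2).sum(-1)  # distance to each code
        k = int(dists.argmin())
        ids.append(k)
        residual = residual - cb[k]       # next level quantizes what is left
    return ids

torch.manual_seed(0)
codebooks = [torch.randn(16, 8) for _ in range(4)]   # 4 levels, 16 codes, dim 8
print(rvq_encode(torch.randn(8), codebooks))         # one token id per level
```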
Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis · arXiv
We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art perfo...
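A schematic sketch of the cross-attentive backbone idea: a causal sequence layer over audio tokens, with cross-attention injecting the text condition at each block. The GRU below is only a stand-in for a Mamba (state-space) layer, and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentiveBlock(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.seq = nn.GRU(d, d, batch_first=True)     # stand-in for a Mamba layer
        self.xattn = nn.MultiheadAttention(d, 4, batch_first=True)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        h, _ = self.seq(audio)                # causal modeling of audio tokens
        ctx, _ = self.xattn(h, text, text)    # attend to the text condition
        return h + ctx

audio = torch.randn(2, 100, 256)              # (batch, audio tokens, dim)
text = torch.randn(2, 20, 256)                # (batch, text tokens, dim)
print(CrossAttentiveBlock()(audio, text).shape)   # torch.Size([2, 100, 256])
```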
Wenhao Guan, Zhikang Niu, Ziyue Jiang ... · arXiv
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unif...
Sunday, October 05, 2025
Umberto Cappellazzo, Minsu Kim, Pingchuan Ma ... · arXiv
Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce i...
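A minimal sketch of the simplest token-compression baseline, not the paper's method: mean-pool every r adjacent tokens before the LLM, cutting sequence length, and hence attention cost, by a factor of r.

```python
import torch

def compress_tokens(tokens: torch.Tensor, r: int = 4) -> torch.Tensor:
    """tokens: (N, d) audio-visual tokens -> (ceil(N/r), d) pooled tokens."""
    n, d = tokens.shape
    pad = (-n) % r                        # zero-pad so N divides evenly by r
    if pad:
        tokens = torch.cat([tokens, tokens.new_zeros(pad, d)])
    return tokens.reshape(-1, r, d).mean(dim=1)

av_tokens = torch.randn(1000, 512)        # e.g. fused audio-visual tokens
print(compress_tokens(av_tokens).shape)   # torch.Size([250, 512])
```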