Audio ML Papers

Last 7 Days (October 07 - October 14, 2025)

Subcategories: All (20) | Speech Synthesis (6) | Music Synthesis (3) | Ambient Synthesis (0) | Quality Assessment (1) | Enhancement (1) | ASR (2) | Other (7)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 90)
Harshvardhan C. Takawale, Nirupam Roy, Phil Brown · arXiv
Accurate modeling of spatial acoustics is critical for immersive and intelligible audio in confined, resonant environments such as car cabins. Current tuning methods are manual, hardware-intensive, and static, failing to account for frequency-selective behaviors and dynamic chang...
#2 TOP PAPER (Score: 83)
Mingxuan Wang, Satoshi Nakamura · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained wit...
#3 TOP PAPER (Score: 83)
Akshay Muppidi, Martin Radfar · 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, pp. 10881-10885
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the in...
Friday, October 10, 2025
Yuxuan Jiang, Zehua Chen, Zeqian Ju ... · arXiv
Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, the generation performance of these models at scale is still compromised. In this study, we r...
Shulin He, Zhong-Qiu Wang · arXiv
Blind speech separation (BSS) aims to recover multiple speech sources from multi-channel, multi-speaker mixtures under unknown array geometry and room impulse responses. In an unsupervised setup where clean target speech is not available for model training, UNSSOR proposes a mixture...
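A schematic of the mixture-constraint idea this line of work builds on (notation mine, not necessarily the paper's): given $P$ microphone mixtures $y_p$ and $Q$ estimated sources $\hat{s}_q$, estimate relative filters $\hat{h}_{p,q}$ and require the filtered estimates to re-synthesize every mixture, so training needs no clean targets:

$$\hat{y}_p = \sum_{q=1}^{Q} \hat{h}_{p,q} * \hat{s}_q, \qquad \mathcal{L}_{\mathrm{mix}} = \sum_{p=1}^{P} \lVert y_p - \hat{y}_p \rVert_1$$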
Zhao Guo, Ziqian Ning, Guobin Ma ... · NCMMSC2025
Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which...
Zongcai Du, Guilin Deng, Xiaofeng Guo ... · arXiv
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies ...
Thursday, October 09, 2025
Guobin Ma, Jixun Yao, Ziqian Ning ... · arXiv
Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand streaming inference, creating a pressing need for models that are ...
Liyang Chen, Hongkai Chen, Yujun Cai ... · arXiv
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech ...
Wei Wang, Rong Cao, Yi Guo ... · arXiv
Flow-based generative models have greatly improved text-to-speech (TTS) synthesis quality, but inference speed remains limited by the iterative sampling process and multiple function evaluations (NFE). The recent MeanFlow model accelerates generation by modeling average velocity ...
Wednesday, October 08, 2025
Harshvardhan C. Takawale, Nirupam Roy, Phil Brown · arXiv
Accurate modeling of spatial acoustics is critical for immersive and intelligible audio in confined, resonant environments such as car cabins. Current tuning methods are manual, hardware-intensive, and static, failing to account for frequency-selective behaviors and dynamic chang...
Xutao Mao, Ke Li, Cameron Baird ... · arXiv
As advances in synthetic voice generation accelerate, an increasing variety of fake voice generators have emerged, producing audio that is often indistinguishable from real human speech. This evolution poses new and serious threats across sectors where audio recordings serve as c...
Sebastian Braun, Hannes Gamper, Dimitra Emmanouilidou · arXiv
Modern generative and multimodal models increasingly rely on compact latent representations that trade off semantic richness against high-fidelity reconstruction. We introduce SALAD-VAE, a continuous and highly compact semantic Audio Variational Autoencoder, which operates i...
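The tradeoff described here is the one a weighted VAE objective controls; in the generic $\beta$-VAE form (not necessarily this paper's exact loss):

$$\mathcal{L}(\theta,\phi) = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\Vert\,p(z)\big)$$

Raising $\beta$ pushes the latent toward the prior, buying compactness and often more semantic structure at the cost of reconstruction fidelity.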
Phuong Tuan Dat, Tran Huy Dat · 2025 IEEE International Conference on Advanced Video and Signal-Based Surveillance
Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer ar...
Peize He, Zichen Wen, Yubo Wang ... · arXiv
Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do no...
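The quadratic cost is easy to make concrete with a back-of-envelope sketch; the token rate, model width, and fp16 precision below are illustrative assumptions, not figures from the paper:

def attention_cost(seconds, tokens_per_second=50, d_model=1024):
    n = seconds * tokens_per_second      # sequence length N
    flops = 2 * n * n * d_model          # QK^T plus attention-weighted V, per layer (rough)
    score_bytes = n * n * 2              # fp16 N x N score matrix, per head
    return n, flops, score_bytes

for s in (30, 300, 3600):
    n, f, b = attention_cost(s)
    print(f"{s:>4}s audio -> N={n:,}, ~{f/1e12:.2f} TFLOPs/layer, scores ~{b/1e9:.2f} GB/head")

At 50 tokens/s, an hour of audio gives N = 180,000, so the attention score matrix alone is roughly 65 GB per head; this is exactly the regime that short-clip benchmarks never reach.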
Peter Plantinga, Roozbeh Sattari, Karine Marcotte ... · SMASH 2025
The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease. We investigate the factors on which human experts base their judgments of the presence of disease in speech samples over five different sp...
Rui Hu, Delai Qiu, Yining Wang ... · arXiv
Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve ...
Tuesday, October 07, 2025
Akshay Muppidi, Martin Radfar · 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, pp. 10881-10885
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the in...
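The defining HRNet trait being adapted here is that a high-resolution stream runs end to end while lower-resolution streams are repeatedly fused back in. A minimal sketch of one exchange step in PyTorch, with placeholder channel sizes and no claim to match the paper's EmoHRNet code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamExchange(nn.Module):
    # One HRNet-style fusion step: parallel high- and low-resolution
    # streams update independently, then swap information.
    def __init__(self, ch_hi=32, ch_lo=64):
        super().__init__()
        self.hi = nn.Conv2d(ch_hi, ch_hi, 3, padding=1)
        self.lo = nn.Conv2d(ch_lo, ch_lo, 3, padding=1)
        self.hi_to_lo = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)  # downsample path
        self.lo_to_hi = nn.Conv2d(ch_lo, ch_hi, 1)                       # 1x1 conv, then upsample

    def forward(self, x_hi, x_lo):
        h, l = F.relu(self.hi(x_hi)), F.relu(self.lo(x_lo))
        x_hi = h + F.interpolate(self.lo_to_hi(l), size=h.shape[-2:])
        x_lo = l + self.hi_to_lo(h)
        return x_hi, x_lo

# e.g. log-mel spectrogram features after a stem convolution:
# x_hi, x_lo = torch.randn(1, 32, 128, 400), torch.randn(1, 64, 64, 200)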
Haoxun Li, Yu Liu, Yuqing Sun ... · arXiv
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We ...
Christian Marinoni, Riccardo Fosco Gramaccioni, Kazuki Shimada ... · IJCNN 2025
Although audio generation has been widely studied over recent years, video-aligned audio generation remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchro...
Mingxuan Wang, Satoshi Nakamura · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained wit...
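Schematically, the speech chain couples two feedback directions over discrete tokens. A hedged sketch of one training step; the objects and method names are placeholders, not the paper's API, and backpropagating through discrete tokens in practice needs straight-through or RL-style estimators:

def speech_chain_losses(asr, text_to_semantic, texts, speech_tokens):
    # TTS -> ASR direction: synthesize semantic tokens from text,
    # then require the ASR to recover the original text.
    synth_tokens = text_to_semantic.generate(texts)
    loss_asr = asr.loss(inputs=synth_tokens, targets=texts)

    # ASR -> TTS direction: transcribe real token sequences,
    # then require the TTS to regenerate the tokens from the hypothesis.
    hyp_texts = asr.transcribe(speech_tokens)
    loss_tts = text_to_semantic.loss(inputs=hyp_texts, targets=speech_tokens)

    return loss_asr + loss_tts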
Tao Zhu, Yinfeng Yu, Liejun Wang ... · Proceedings of the 2025 ACM Multimedia Asia Conference (MMAsia '25)
Diffusion models have demonstrated remarkable performance in speech synthesis, but typically require multi-step sampling, resulting in low inference efficiency. Recent studies address this issue by distilling diffusion models into consistency models, enabling efficient one-step g...
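For context, the consistency-distillation objective this family of work builds on enforces that the student maps any two points on the same probability-flow ODE trajectory to the same output (the generic consistency-model form, not necessarily this paper's exact loss):

$$f_\theta(x_t, t) = f_\theta(x_{t'}, t'), \qquad \mathcal{L} = \mathbb{E}\Big[d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{x}_{t_n}, t_n)\big)\Big]$$

where $\hat{x}_{t_n}$ is one teacher ODE-solver step back from $x_{t_{n+1}}$ and $\theta^-$ is an EMA copy of $\theta$; one-step generation is then $\hat{x}_0 = f_\theta(x_T, T)$.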
Huang-Cheng Chou, Chi-Chun Lee · arXiv
Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters who select emotions from predefined categories. However, disagreements amo...