Audio ML Papers

Week of September 21 - September 28, 2025

Subcategories: All (92) | Speech Synthesis (18) | Music Synthesis (3) | Ambient Synthesis (2) | Quality Assessment (4) | Enhancement (10) | ASR (11) | Other (44)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Chang Li, Zehua Chen, Liyuan Wang ... · Accepted at NeurIPS 2025
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, but previous methods often suffer from sub-optimal upsampling quality due to their uninformative gene...
#2 TOP PAPER (Score: 86)
Sitong Cheng, Weizhen Bian, Xinsheng Wang ... · arXiv
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data...
#3 TOP PAPER (Score: 83)
Rostislav Makarov, Lea Schönherr, Timo Gerkmann · arXiv
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be suscep...
Thursday, September 25, 2025
Sitong Cheng, Weizhen Bian, Xinsheng Wang ... · arXiv
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data...
Rostislav Makarov, Lea Schönherr, Timo Gerkmann · arXiv
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be suscep...
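The attack recipe behind results like this is usually some form of projected gradient descent (PGD) on the input waveform. Below is a minimal, hypothetical sketch against a stand-in enhancement network; the paper's actual model, loss, and attack budget are not specified in the teaser.

```python
import torch

# Hedged sketch of a generic L-infinity PGD attack on a speech enhancement model.
# `enhancer` is a placeholder network, not the paper's model.
enhancer = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)  # stand-in model

def pgd_attack(x: torch.Tensor, target: torch.Tensor,
               eps: float = 0.005, alpha: float = 0.001, steps: int = 20):
    """Find a small perturbation so the *enhanced* output drifts toward `target`."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(enhancer(x + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # step toward the attacker's target
            delta.clamp_(-eps, eps)             # keep the perturbation small
            delta.grad.zero_()
    return (x + delta).detach()

x = torch.randn(1, 1, 16000)                    # 1 s of audio as a stand-in
adv = pgd_attack(x, target=torch.zeros_like(x)) # e.g., force the model to silence
```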
Haolin He, Xingjian Du, Renhe Sun ... · arXiv
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-tra...
Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong ... · arXiv
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fideli...
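For context, the kind of compounded degradation this abstract describes can be simulated in a few lines; the cutoffs, clip level, and noise level below are illustrative assumptions, not the paper's recipe.

```python
import numpy as np
from scipy.signal import butter, lfilter

# Toy compound-distortion chain: band-pass filtering, clipping, additive noise.
def degrade(x: np.ndarray, sr: int = 16000) -> np.ndarray:
    b, a = butter(4, [300 / (sr / 2), 3400 / (sr / 2)], btype="band")
    x = lfilter(b, a, x)                        # telephone-like band-pass
    x = np.clip(x, -0.1, 0.1)                   # hard clipping
    return x + 0.01 * np.random.randn(len(x))   # additive noise
```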
Wednesday, September 24, 2025
Junchuan Zhao, Wei Zeng, Tianle Lyu ... · arXiv
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extend...
The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran ... · arXiv
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generat...
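Masked generative decoding of the kind MAGE builds on typically follows a MaskGIT-style loop: start from all-masked tokens, predict everything, keep the most confident predictions, and re-mask the rest. A hedged sketch with a stand-in predictor follows; MAGE's actual network and codebook are not given in the teaser.

```python
import torch

VOCAB, MASK, T = 1024, 1024, 64  # toy codebook size, mask id, sequence length
predictor = torch.nn.Sequential(torch.nn.Embedding(VOCAB + 1, 256),
                                torch.nn.Linear(256, VOCAB))  # stand-in network

@torch.no_grad()
def iterative_decode(steps: int = 8) -> torch.Tensor:
    tokens = torch.full((1, T), MASK)
    for s in range(steps):
        logits = predictor(tokens)                   # (1, T, VOCAB)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(tokens != MASK, float("inf"))  # keep fixed tokens
        keep = conf.topk(int(T * (s + 1) / steps), dim=-1).indices
        chosen = torch.zeros_like(tokens, dtype=torch.bool).scatter(1, keep, True)
        tokens = torch.where(chosen & (tokens == MASK), pred, tokens)
    return tokens

codes = iterative_decode()  # fully unmasked token sequence after the last step
```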
Ismail Rasim Ulgen, Zongyang Du, Junchen Lu ... · arXiv
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measur...
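As a reminder of just how coarse WER is, here is the entire metric: a word-level edit distance normalized by reference length. Nothing about timing, pitch, or prosody can enter it.

```python
# Word error rate: Levenshtein distance over words, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 0.33, regardless of how it was spoken
```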
Jinyang Wu, Nana Hou, Zihan Pan ... · arXiv
The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle the region. This omission is critical: detection models trained ...
Pin-Jui Ku, He Huang, Jean-Marie Lemercier ... · arXiv
This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronge...
Stefano Ciapponi, Leonardo Mannini, Jarek Scanferla ... · arXiv
This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, out...
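The teaser does not spell out the "semi-learnable spectral feature extractor"; one plausible reading, sketched below purely as an assumption, is a mel-initialized filterbank whose weights stay trainable so the front-end can adapt to avian vocalizations.

```python
import torch
import librosa

# Hypothetical reading of a "semi-learnable" front-end: a filterbank initialized
# to mel triangles but left trainable alongside the rest of the network.
class LearnableFilterbank(torch.nn.Module):
    def __init__(self, sr=16000, n_fft=512, n_bands=40):
        super().__init__()
        mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)
        self.fb = torch.nn.Parameter(torch.tensor(mel, dtype=torch.float32))
        self.n_fft = n_fft

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 2,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True).abs()   # (freq, time)
        return torch.log1p(self.fb @ spec)             # (bands, time)

feats = LearnableFilterbank()(torch.randn(16000))      # 1 s of audio stand-in
```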
Hongzhao Chen, XiaoYang Wang, Jing Lan ... · arXiv
Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, f...
Yifan Yang, Bing Han, Hui Wang ... · arXiv
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably qu...
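A typical example of the acoustic metrics being criticized here is log-F0 standard deviation, sketched below with librosa's pyin tracker: it collapses an entire pitch contour into one number, which is exactly the "partial view" problem the abstract points at.

```python
import librosa
import numpy as np

# One common prosody-diversity proxy: standard deviation of log-F0 over voiced frames.
def log_f0_std(path: str) -> float:
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]
    # Identical values can come from very different contours (slow drift vs. lively
    # accents), which is why such summaries correlate weakly with perception.
    return float(np.std(np.log(f0))) if f0.size else 0.0
```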
Tuesday, September 23, 2025
Shaoshi Ling, Gang Liu, Guoli Ye ... · arXiv
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries di...
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This w...
Zhijun Liu, Dongya Jia, Xiaoqiang Wang ... · arXiv
Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising al...
Karen Rosero, Eunjung Yeo, David R. Mortensen ... · arXiv
We present ChiReSSD, a speech reconstruction framework that preserves children's speaker identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with partic...
Jiarui Hai, Helin Wang, Weizhe Guo ... · arXiv
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-...
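Free-text sound queries typically come down to embedding the query and the audio frames in a shared space and thresholding a similarity score. A toy sketch with stand-in encoders follows; the paper's actual model is not described in the teaser.

```python
import torch

# Toy text-queried detection: score each audio frame against a query embedding.
text_enc = torch.nn.EmbeddingBag(5000, 128)                     # stand-in text encoder
audio_enc = torch.nn.Conv1d(64, 128, kernel_size=3, padding=1)  # stand-in frame encoder

@torch.no_grad()
def detect(audio_feats: torch.Tensor, query_ids: torch.Tensor, thr: float = 0.5):
    """audio_feats: (64, frames) features; query_ids: (n_tokens,) token ids."""
    q = text_enc(query_ids.unsqueeze(0))             # (1, 128) query embedding
    a = audio_enc(audio_feats.unsqueeze(0))[0].T     # (frames, 128) frame embeddings
    scores = torch.cosine_similarity(a, q, dim=-1)   # per-frame relevance
    return scores > thr                              # frames where the query is active

active = detect(torch.randn(64, 100), torch.tensor([3, 17, 42]))
```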
Runyan Yang, Yuke Si, Yingying Gao ... · arXiv
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge...
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite rece...
Aditya Bhattacharjee, Marco Pasini, Emmanouil Benetos · Under review for International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, 2026
The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Fl...
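Sampling from a rectified-flow model like the one named here is just Euler integration of a learned velocity field from noise (t=0) to data (t=1). A minimal sketch with a stub network follows; the fingerprint dimensionality and architecture below are assumptions.

```python
import torch

# Minimal rectified-flow sampler: integrate dx/dt = v(x, t) with Euler steps.
# `velocity_net` and the 128-dim latent are stand-ins, not the paper's model.
velocity_net = torch.nn.Sequential(
    torch.nn.Linear(129, 256), torch.nn.SiLU(), torch.nn.Linear(256, 128)
)  # input: latent (128 dims) concatenated with t (1 dim)

@torch.no_grad()
def sample(n: int, dim: int = 128, steps: int = 50) -> torch.Tensor:
    x = torch.randn(n, dim)                       # start from Gaussian noise at t=0
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + velocity_net(torch.cat([x, t], dim=1)) / steps  # Euler step
    return x                                      # synthetic latent fingerprints at t=1

fingerprints = sample(n=4)
print(fingerprints.shape)  # torch.Size([4, 128])
```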
Seungyoun Shin, Dongha Ahn, Jiwoo Kim ... · arXiv
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into mon...
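The failure mode described here falls straight out of GRPO's advantage definition, sketched below with an assumed reward of negative CER: samples are scored only relative to their group's mean, so two readings with identical transcription accuracy receive identical advantage no matter how different their prosody is.

```python
import numpy as np

# GRPO's group-relative advantage (reward here assumed to be negative CER).
def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,) scalar rewards for G samples of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A monotone and an expressive reading with equal CER get the same advantage,
# so a transcription-only reward gives the policy no incentive to vary prosody.
print(group_relative_advantages(np.array([-0.02, -0.02, -0.10, -0.05])))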
Junyu Wang, Ziyang Ma, Zhengding Luo ... · arXiv
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, cau...
Monday, September 22, 2025
Chang Li, Zehua Chen, Liyuan Wang ... · Accepted at NeurIPS 2025
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, but previous methods often suffer from sub-optimal upsampling quality due to their uninformative gene...
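The non-generative baseline this line of work improves on is plain resampling, sketched below: it interpolates the waveform to the target rate but cannot recreate content above the original Nyquist frequency, which is the band diffusion and bridge models try to generate.

```python
import numpy as np
from scipy.signal import resample_poly

# Naive audio SR baseline: polyphase upsampling from 16 kHz to 48 kHz. The band
# above the original 8 kHz Nyquist stays empty; generative SR fills that gap.
lr_sr, hr_sr = 16000, 48000
g = np.gcd(lr_sr, hr_sr)
lr = np.random.randn(lr_sr)                         # 1 s stand-in LR signal
hr = resample_poly(lr, up=hr_sr // g, down=lr_sr // g)
print(hr.shape)  # (48000,)
```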
Viola Negroni, Davide Salvi, Alessandro Ilic Mezza ... · Accepted @ IEEE WIFS 2025
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfa...
Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel · arXiv
Spatial target speaker extraction isolates a desired speaker's voice in multi-speaker environments using spatial information, such as the direction of arrival (DoA). Although recent deep neural network (DNN)-based discriminative methods have shown significant performance improvem...
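The classical way to exploit a DoA cue is delay-and-sum beamforming, sketched below for a uniform linear array (the mic spacing and speed of sound are assumed values); DNN-based extractors like this paper's replace or augment such a fixed spatial filter with a learned one.

```python
import numpy as np

# Frequency-domain delay-and-sum toward a DoA on a uniform linear array.
# Assumptions: far-field source, mic spacing d = 0.05 m, speed of sound c = 343 m/s.
def delay_and_sum(x: np.ndarray, doa_deg: float, fs: int,
                  d: float = 0.05, c: float = 343.0) -> np.ndarray:
    """x: (mics, samples) multichannel signal. Returns a beamformed mono signal."""
    m, n = x.shape
    delays = np.arange(m) * d * np.sin(np.deg2rad(doa_deg)) / c  # per-mic delay (s)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    X = np.fft.rfft(x, axis=1)
    # Phase-shift each channel to time-align the target direction, then average.
    X *= np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(X.mean(axis=0), n=n)

fs = 16000
x = np.random.randn(4, fs)                # 4-mic, 1 s of noise as a stand-in
y = delay_and_sum(x, doa_deg=30.0, fs=fs)
```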
Qiushi Han, Yuan Liao, Youhao Si ... · arXiv
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-...
Mélisande Teng, Julien Boussard, David Rolnick ... · arXiv
Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great p...
Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba ... · arXiv
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely rema...