Audio Papers

Week of September 21 - September 28, 2025

Thursday, September 25, 2025
Sitong Cheng, Weizhen Bian, Xinsheng Wang ... · arXiv
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data...
Rostislav Makarov, Lea Schönherr, Timo Gerkmann · arXiv
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be suscep...
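The vulnerability class described here is adversarial perturbation of the model input. A minimal, hypothetical FGSM-style sketch of the idea, with a toy Conv1d standing in for a trained enhancer (this is a generic illustration, not the paper's method):

```python
import torch

# Toy stand-in for a trained enhancement model: any differentiable
# waveform-to-waveform network exposes the same attack surface.
enhancer = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)

x = torch.randn(1, 1, 16000, requires_grad=True)  # 1 s of "noisy" input audio
target = torch.zeros_like(x)                      # attacker-chosen output

# One FGSM-style step: nudge the input so the enhanced output
# moves toward the attacker's target instead of the clean speech.
loss = torch.nn.functional.mse_loss(enhancer(x), target)
loss.backward()
eps = 1e-3                                        # perturbation budget
x_adv = (x - eps * x.grad.sign()).detach()        # near-inaudible perturbation
```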
Haolin He, Xingjian Du, Renhe Sun ... · arXiv
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-tra...
Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong ... · arXiv
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fideli...
Wednesday, September 24, 2025
Junchuan Zhao, Wei Zeng, Tianle Lyu ... · arXiv
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extend...
The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran ... · arXiv
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generat...
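For readers unfamiliar with masked generative decoding, the core loop iteratively commits the most confident token predictions and re-predicts the rest. A generic MaskGIT-style sketch, with predict as a hypothetical stand-in for a trained masked-token predictor (not MAGE's actual design):

```python
import numpy as np

VOCAB, T, STEPS, MASK = 1024, 100, 8, -1
rng = np.random.default_rng(0)

def predict(tokens):
    """Random stand-in; a real model would condition on committed tokens."""
    logits = rng.standard_normal((T, VOCAB))
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    return probs / probs.sum(-1, keepdims=True)

tokens = np.full(T, MASK)                      # start fully masked
for step in range(1, STEPS + 1):
    probs = predict(tokens)
    best, conf = probs.argmax(-1), probs.max(-1)
    conf[tokens != MASK] = np.inf              # never re-mask committed tokens
    k = int(T * step / STEPS)                  # unmasking schedule: commit top-k
    commit = np.argsort(-conf)[:k]             # most confident positions first
    tokens[commit] = np.where(tokens[commit] == MASK,
                              best[commit], tokens[commit])
```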
Ismail Rasim Ulgen, Zongyang Du, Junchen Lu ... · arXiv
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measur...
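For context on why WER counts as coarse: it is word-level edit distance normalized by reference length, blind to prosody, timing, and everything acoustic. A minimal reference implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Identical words, arbitrarily different prosody: WER cannot tell them apart.
print(wer("the cat sat", "the cat sat"))  # 0.0
```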
Jinyang Wu, Nana Hou, Zihan Pan ... · arXiv
The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this region. The omission is critical: detection models trained ...
Pin-Jui Ku, He Huang, Jean-Marie Lemercier ... · arXiv
This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronge...
Stefano Ciapponi, Leonardo Mannini, Jarek Scanferla ... · arXiv
This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, out...
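A "semi-learnable" spectral front end typically pairs a fixed FFT with a small set of trainable spectral weights. A generic sketch of that idea; the shapes and random initialization are assumptions, not WrenNet's design:

```python
import torch

N_FFT, N_BANDS = 512, 32
# Trainable band weights over fixed FFT bins (random init for the sketch).
fb = torch.nn.Parameter(torch.rand(N_BANDS, N_FFT // 2 + 1))

def features(wave: torch.Tensor) -> torch.Tensor:
    """Fixed STFT magnitude followed by a learnable band projection."""
    spec = torch.stft(wave, N_FFT, window=torch.hann_window(N_FFT),
                      return_complex=True).abs()   # (freq_bins, frames)
    return torch.log1p(fb @ spec)                  # (N_BANDS, frames)

print(features(torch.randn(16000)).shape)          # torch.Size([32, ...])
```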
Hongzhao Chen, XiaoYang Wang, Jing Lan ... · arXiv
Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, f...
Yifan Yang, Bing Han, Hui Wang ... · arXiv
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably qu...
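To make "partial views of prosodic variation" concrete: a common proxy is the standard deviation of F0, which collapses an entire pitch contour into one number. A naive sketch with a hypothetical f0_track input:

```python
import numpy as np

def f0_std(f0_track: np.ndarray) -> float:
    """Naive prosody-diversity proxy: st. dev. of voiced F0 in Hz."""
    voiced = f0_track[f0_track > 0]          # ignore unvoiced frames (F0 == 0)
    return float(np.std(voiced))

rising = np.linspace(100, 270, 50)           # monotone glide
alternating = np.tile([100.0, 200.0], 25)    # rapid oscillation
# Both score ~50 Hz despite perceptually very different prosody.
print(f0_std(rising), f0_std(alternating))
```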
Tuesday, September 23, 2025
Shaoshi Ling, Gang Liu, Guoli Ye ... · arXiv
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries di...
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This w...
Zhijun Liu, Dongya Jia, Xiaoqiang Wang ... · arXiv
Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising al...
Karen Rosero, Eunjung Yeo, David R. Mortensen ... · arXiv
We present ChiReSSD, a speech reconstruction framework that preserves a child speaker's identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with partic...
Jiarui Hai, Helin Wang, Weizhe Guo ... · arXiv
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-...
Runyan Yang, Yuke Si, Yingying Gao ... · arXiv
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge...
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite rece...
Aditya Bhattacharjee, Marco Pasini, Emmanouil Benetos · Under review for International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, 2026
The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Fl...
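For orientation, a Rectified Flow generator is trained by regressing a velocity field along straight-line paths between noise and data. A generic training-step sketch; the latent dimension and toy network are assumptions, not the authors' code:

```python
import torch

D = 128                                    # assumed latent fingerprint dim
model = torch.nn.Sequential(               # toy velocity network v(x_t, t)
    torch.nn.Linear(D + 1, 256), torch.nn.ReLU(), torch.nn.Linear(256, D))

x1 = torch.randn(64, D)                    # batch of real latent fingerprints
x0 = torch.randn_like(x1)                  # noise samples
t = torch.rand(64, 1)                      # random interpolation times
xt = (1 - t) * x0 + t * x1                 # straight-line interpolant
v_target = x1 - x0                         # constant target velocity
v_pred = model(torch.cat([xt, t], dim=-1))
loss = torch.nn.functional.mse_loss(v_pred, v_target)
loss.backward()                            # one rectified-flow training step
```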
Seungyoun Shin, Dongha Ahn, Jiwoo Kim ... · arXiv
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into mon...
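Background for the collapse argument: GRPO needs no value network because it standardizes rewards within a group of samples from the same prompt, so whatever the reward favors is amplified relative to the group. A minimal sketch of the group-relative advantage, with hypothetical CER-based rewards:

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards (1 - CER) for 4 sampled utterances of one prompt.
# Monotone-but-clear speech (low CER) is pushed up; expressive takes with
# occasional transcription errors are pushed down, so prosody collapses.
rewards = np.array([0.98, 0.97, 0.80, 0.75])
print(group_advantages(rewards))
```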
Junyu Wang, Ziyang Ma, Zhengding Luo ... · arXiv
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, cau...
Monday, September 22, 2025
Chang Li, Zehua Chen, Liyuan Wang ... · Accepted at NeurIPS 2025
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models; however, previous methods often suffer from sub-optimal upsampling quality due to their uninformative gene...
Viola Negroni, Davide Salvi, Alessandro Ilic Mezza ... · Accepted @ IEEE WIFS 2025
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfa...
Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel · arXiv
Spatial target speaker extraction isolates a desired speaker's voice in multi-speaker environments using spatial information, such as the direction of arrival (DoA). Although recent deep neural network (DNN)-based discriminative methods have shown significant performance improvem...
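To ground the "spatial information" idea: the classical baseline is delay-and-sum beamforming, which time-aligns the microphones toward the DoA before summing. A minimal far-field sketch for a uniform linear array; the geometry and sampling rate are assumptions:

```python
import numpy as np

def delay_and_sum(mics, doa_deg, mic_dist=0.05, fs=16000, c=343.0):
    """Steer a uniform linear array toward doa_deg and average the channels.
    mics: (n_channels, n_samples) array of time-domain signals."""
    n_ch, n = mics.shape
    # Relative arrival delays of a far-field plane wave at each mic.
    delays = np.arange(n_ch) * mic_dist * np.cos(np.radians(doa_deg)) / c
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for ch in range(n_ch):
        spec = np.fft.rfft(mics[ch])                     # fractional delay
        spec *= np.exp(2j * np.pi * freqs * delays[ch])  # in frequency domain
        out += np.fft.irfft(spec, n)
    return out / n_ch

beamformed = delay_and_sum(np.random.randn(4, 16000), doa_deg=60)
```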
Qiushi Han, Yuan Liao, Youhao Si ... · 5 pages, 2 figures, conference
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-...
Mélisande Teng, Julien Boussard, David Rolnick ... · arXiv
Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great p...
Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba ... · arXiv
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely rema...
Sunday, September 21, 2025
Yan Rong, Chenxing Li, Dong Yu ... · arXiv
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training dat...
Junhyeok Lee, Helin Wang, Yaohan Guan ... · arXiv
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To furt...
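For context, combining multiple classifier-free guidances typically means adding several weighted (conditional minus unconditional) directions to one prediction. A generic sketch with stand-in model outputs, not MaskVCT's exact formulation:

```python
import numpy as np

def multi_cfg(pred_uncond, pred_conds, weights):
    """Combine several CFG directions: unconditional prediction plus a
    weighted sum of (conditional - unconditional) guidance terms."""
    out = pred_uncond.copy()
    for pred_c, w in zip(pred_conds, weights):
        out += w * (pred_c - pred_uncond)
    return out

pred_u = np.zeros(8)                            # stand-in model outputs
pred_speaker, pred_text = np.ones(8), -np.ones(8)
guided = multi_cfg(pred_u, [pred_speaker, pred_text], weights=[2.0, 1.5])
```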
Tianheng Zhu, Yinfeng Yu, Liejun Wang ... · Main paper (15 pages). Accepted for publication by ICONIP (International Conference on Neural Information Processing) 2025
Audio-driven talking head generation is crucial for applications in virtual reality, digital avatars, and film production. While NeRF-based methods enable high-fidelity reconstruction, they suffer from low rendering efficiency and suboptimal audio-visual synchronization. This wor...
Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa ... · Accepted on Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like Cros...
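The cross-correlation baseline alluded to above estimates a single constant lag, which is exactly why nonlinear clock drift defeats it. A minimal sketch:

```python
import numpy as np

def xcorr_lag(a: np.ndarray, b: np.ndarray) -> int:
    """Estimate the constant integer delay of b relative to a, in samples."""
    corr = np.correlate(b, a, mode="full")
    return int(np.argmax(corr)) - (len(a) - 1)

rng = np.random.default_rng(0)
a = rng.standard_normal(16000)          # 1 s reference channel at 16 kHz
b = np.roll(a, 800)                     # second channel, delayed 50 ms
print(xcorr_lag(a, b))                  # -> 800; one global lag, no drift model
```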
Dongheon Lee, Younghoo Kwon, Jung-Woo Choi · 26 pages, 13 figures, 8 tables, accepted in NeurIPS 2025
We propose DeepASA, a one-for-all model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA ...
Massa Baali, Sarthak Bisht, Francisco Teixeira ... · Accepted to EMNLP 2025 Findings
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions caus...
Zeyu Xie, Yaoyun Zhang, Xuenan Xu ... · arXiv
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largel...
Ruonan Zhang, Xiaoyang Hao, Yichen Han ... · arXiv
High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and semantic information within tokens, leading to ...
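Background on the tokens in question: most neural audio codecs discretize with residual vector quantization (RVQ), where each stage quantizes the previous stage's residual, so acoustic and semantic content end up interleaved across codebooks. A generic RVQ sketch with assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, STAGES = 64, 256, 4                       # dim, codebook size, n stages
codebooks = rng.standard_normal((STAGES, K, D))

def rvq_encode(z):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual, codes = z.copy(), []
    for cb in codebooks:
        idx = np.argmin(((residual[None] - cb) ** 2).sum(-1))  # nearest code
        codes.append(int(idx))
        residual -= cb[idx]                     # pass residual to next stage
    return codes, z - residual                  # token ids, reconstruction

codes, z_hat = rvq_encode(rng.standard_normal(D))
```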