Audio ML Papers

Week of September 28 - October 05, 2025

Subcategories: All (133) | Speech Synthesis (26) | Music Synthesis (8) | Ambient Synthesis (2) | Quality Assessment (5) | Enhancement (17) | ASR (14) | Other (61)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Yixuan Zhou, Guoyang Zeng, Xin Liu ... · arXiv
Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field tow...
#2 TOP PAPER (Score: 84)
Bence Mark Halpern, Thomas B. Tienkamp, Teja Rebernik ... · IEEE Selected Topics in Signal Processing
Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, ...
#3 TOP PAPER (Score: 83)
Koichi Saito, Julian Tanke, Christian Simon ... · arXiv
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To a...
Thursday, October 02, 2025
Xuyi Hu, Jian Li, Shaojie Zhang ... · arXiv
Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalize...
Jingyi Li, Zhiyuan Zhao, Yunfei Liu ... · arXiv
Neural audio codecs have recently emerged as powerful tools for high-quality and low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches either rely on a single quantizer that only processes s...
Koichi Saito, Julian Tanke, Christian Simon ... · arXiv
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To a...
Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen ... · arXiv
Spoofing-robust speaker verification (SASV) combines the tasks of speaker and spoof detection to authenticate speakers under adversarial settings. Many SASV systems rely on fusion of speaker and spoof cues at embedding, score or decision levels, based on independently trained sub...
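As context for the score-level fusion the abstract mentions, two common SASV baselines are a weighted sum of the subsystem scores and a cascade that gates speaker verification behind the countermeasure. The sketch below is generic, not the paper's method; the weights and thresholds are illustrative assumptions.

```python
def fuse_sasv(asv_score, cm_score, w=0.7, tau_asv=0.0, tau_cm=0.5):
    """Two common score-level SASV baselines (illustrative parameters).

    asv_score: speaker-verification score (higher = same speaker)
    cm_score:  countermeasure score (higher = bona fide, not spoofed)
    Returns (sum_score, cascade_accept):
      - sum_score: weighted sum of the two subsystem scores
      - cascade_accept: accept only if the countermeasure passes its
        threshold first, then the speaker verifier passes its own
    """
    sum_score = w * asv_score + (1.0 - w) * cm_score
    cascade_accept = (cm_score > tau_cm) and (asv_score > tau_asv)
    return sum_score, cascade_accept
```

The weighted sum keeps a single tunable operating point, while the cascade makes the two error types (spoof miss vs. speaker false accept) explicit; independently trained subsystems, as the abstract notes, make either fusion straightforward but leave the cues uncalibrated against each other.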
Ahmet Solak, Florian Grötschla, Luca A. Lanzendörfer ... · arXiv
While recent years have seen remarkable progress in music generation models, research on their biases across countries, languages, cultures, and musical genres remains underexplored. This gap is compounded by the lack of datasets and benchmarks that capture the global diversity o...
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard ... · arXiv
Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reduc...
Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis ... · arXiv
Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs...
Angelika Ando, Auguste Crabeil, Adrien Lesage ... · arXiv
Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first mo...
Wednesday, October 01, 2025
Bence Mark Halpern, Thomas B. Tienkamp, Teja Rebernik ... · IEEE Selected Topics in Signal Processing
Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, ...
Jiaqi Li, Yao Qian, Yuxuan Hu ... · arXiv
Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and to decouple semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent ...
Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan ... · arXiv
Pressure sensors are widely integrated into modern Heating, Ventilation and Air Conditioning (HVAC) systems. As they are sensitive to acoustic pressure, they can be a source of eavesdropping. This paper introduces HVAC-EAR, which reconstructs intelligible speech from low-resoluti...
Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie · arXiv
Acoustic-to-articulatory inversion has often been limited to a small part of the vocal tract because the data are generally EMA (ElectroMagnetic Articulography) data requiring sensors to be glued to easily accessible articulators. The presented acoustic-to-articulatory model focu...
Jiaye Tan, Haonan Luo, Linfeng Song ... · arXiv
Low-latency symbolic music generation is essential for real-time improvisation and human-AI co-creation. Existing transformer-based models, however, face a trade-off between inference speed and musical quality. Traditional acceleration techniques such as embedding pooling signifi...
Md. Abdur Rahman, Selvarajah Thuseethan, Kheng Cher Yeo ... · arXiv
Automated birdsong classification is essential for advancing ecological monitoring and biodiversity studies. Despite recent progress, existing methods often depend heavily on labeled data, use limited feature representations, and overlook temporal dynamics essential for accurate ...
Woongjib Choi, Sangmin Lee, Hyungseob Lim ... · arXiv
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel...
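For readers unfamiliar with the flow matching objective this abstract builds on, the standard conditional flow matching recipe pairs each data sample with a noise sample along a straight-line path and regresses the path's constant velocity. This is a generic sketch of that training-pair construction, not the paper's model; the network, conditioning, and spectral representation are omitted.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """One conditional flow-matching training example (linear/OT path).

    x1: a data sample (e.g., complex spectral coefficients flattened
        to a real vector)
    Returns (t, x_t, v_target) for regressing a velocity field v(x_t, t).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight-line path
    v_target = x1 - x0                   # constant velocity of that path
    return t, x_t, v_target
```

A network v_theta(x_t, t, conditioning) is trained with an MSE loss against v_target; sampling then integrates dx/dt = v_theta from t = 0 to t = 1, starting from noise.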
Yifei Cao, Changhao Jiang, Jiabao Zhuang ... · arXiv
Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and of...
Tuesday, September 30, 2025
Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso ... · arXiv
While the technologies empowering malicious audio deepfakes have dramatically evolved in recent years due to generative AI advances, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and resear...
Armin Gerami, Ramani Duraiswami · arXiv
We introduce a computationally efficient and tunable feedback delay network (FDN) architecture for real-time room impulse response (RIR) rendering that addresses the computational and latency challenges inherent in traditional convolution and Fourier transform based methods. Our ...
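As background on the FDN structure this abstract extends, a feedback delay network mixes the outputs of several delay lines through a lossy orthogonal matrix to produce a dense, exponentially decaying reverberant tail at low cost. The sketch below is a textbook minimal FDN, not the paper's tunable architecture; delay lengths and the feedback gain are illustrative.

```python
import numpy as np

def householder(n):
    """Orthogonal Householder reflection: I - (2/n) * ones."""
    return np.eye(n) - (2.0 / n) * np.ones((n, n))

def fdn_render(x, delays=(1031, 1327, 1523, 1753), g=0.93):
    """Run input x through a minimal N-line feedback delay network.

    delays: delay-line lengths in samples (mutually prime for dense echoes)
    g: per-line feedback gain (< 1 gives a decaying tail)
    """
    n = len(delays)
    A = householder(n) * g            # lossy orthogonal feedback matrix
    b = np.ones(n)                    # input gains
    c = np.ones(n) / n                # output gains
    bufs = [np.zeros(d) for d in delays]
    idx = [0] * n
    y = np.zeros(len(x))
    for t in range(len(x)):
        # read the current sample emerging from each delay line
        s = np.array([bufs[i][idx[i]] for i in range(n)])
        y[t] = c @ s
        # feedback mix plus new input, written back into the lines
        v = A @ s + b * x[t]
        for i in range(n):
            bufs[i][idx[i]] = v[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y
```

Because the feedback matrix is a scaled orthogonal matrix, the loop is unconditionally stable for g < 1, which is what makes FDNs attractive for real-time RIR rendering compared with long convolutions.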
Yueqian Lin, Zhengmian Hu, Qinsi Wang ... · arXiv
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into fi...
Eleonora Ristori, Luca Bindini, Paolo Frasconi · arXiv
Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather th...
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam · arXiv
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpreta...
Jyrki Alakuijala, Martin Bruse, Sami Boukortt ... · arXiv
This paper introduces Zimtohrli, a novel, full-reference audio similarity metric designed for efficient and perceptually accurate quality assessment. In an era dominated by computationally intensive deep learning models and proprietary legacy standards, there is a pressing need f...
Monday, September 29, 2025
Yixuan Zhou, Guoyang Zeng, Xin Liu ... · arXiv
Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field tow...
Tianrui Wang, Haoyu Wang, Meng Ge ... · arXiv
While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the c...
Jaekwon Im, Juhan Nam · arXiv
Versatile audio super-resolution (SR) aims to predict high-frequency components from low-resolution audio across diverse domains such as speech, music, and sound effects. Existing diffusion-based SR methods often fail to produce semantically aligned outputs and struggle with cons...
Xingchen Li, Hanke Xie, Ziqian Wang ... · arXiv
Generative universal speech enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. Diffusion- or flow-based generative models are capable of producing enhanced speech with high quality and fidelity. However, they ...
Chengyao Wang, Zhisheng Zhong, Bohao Peng ... · arXiv
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples ...