Best Audio ML Papers

#1 (Score: 83)
Rostislav Makarov, Lea Schönherr, Timo Gerkmann · arXiv
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be suscep...
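
For readers unfamiliar with the attack class, here is a minimal PGD-style sketch of how a small waveform perturbation could be optimized against a differentiable enhancement model. `enhancer` and `target` are hypothetical stand-ins, not the authors' setup.

```python
# Hedged sketch of an adversarial perturbation against a speech
# enhancement model: gradient steps on a bounded waveform delta that
# steer the model's output toward an attacker-chosen target.
import torch

def pgd_perturb(enhancer, noisy, target, eps=1e-3, alpha=2e-4, steps=20):
    delta = torch.zeros_like(noisy, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(enhancer(noisy + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # move output toward the target
            delta.clamp_(-eps, eps)              # keep the perturbation near-inaudible
            delta.grad.zero_()
    return (noisy + delta).detach()
```
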
#2 (Score: 86)
Sitong Cheng, Weizhen Bian, Xinsheng Wang ... · arXiv
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data...
#3 (Score: 83)
Haolin He, Xingjian Du, Renhe Sun ... · arXiv
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-tra...
#4 (Score: 80)
Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong ... · arXiv
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fideli...
#5 (Score: 83)
Ismail Rasim Ulgen, Zongyang Du, Junchen Lu ... · arXiv
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measur...
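
As context for the WER criticism, a minimal example with the jiwer library shows how WER compares transcripts alone, leaving prosody invisible:

```python
# WER counts word substitutions, deletions, and insertions against the
# reference, so two readings with very different prosody score identically.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# jiwer.wer returns (S + D + I) / N over the reference words.
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # substitutions only; prosody plays no role
```
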
#6 (Score: 82)
Pin-Jui Ku, He Huang, Jean-Marie Lemercier ... · arXiv
This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronge...
#7 (Score: 82)
Stefano Ciapponi, Leonardo Mannini, Jarek Scanferla ... · arXiv
This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, out...
#8 (Score: 81)
Yifan Yang, Bing Han, Hui Wang ... · arXiv
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably qu...
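
To illustrate the kind of partial acoustic metric the abstract critiques, here is a hedged sketch that reduces prosody to pitch statistics using librosa; the paper's proposed measure is not shown.

```python
# One common hand-crafted prosody statistic: F0 variability via pYIN.
# A high F0 std suggests pitch variety but says nothing about rhythm,
# emphasis, or perceived naturalness -- the "partial view" problem.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))  # stand-in audio
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
f0 = f0[voiced]  # keep voiced frames only
print(f"F0 mean={np.nanmean(f0):.1f} Hz, std={np.nanstd(f0):.1f} Hz")
```
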
#9 (Score: 83)
The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran ... · arXiv
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generat...
#10 (Score: 83)
Junchuan Zhao, Wei Zeng, Tianle Lyu ... · arXiv
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extend...
#11 (Score: 83)
Jinyang Wu, Nana Hou, Zihan Pan ... · arXiv
The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this critical region. This omission is critical: detection models trained ...
#12 (Score: 82)
Hongzhao Chen, XiaoYang Wang, Jing Lan ... · arXiv
Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, f...
#13 (Score: 83)
Shaoshi Ling, Gang Liu, Guoli Ye ... · arXiv
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries di...
#14 (Score: 83)
Karen Rosero, Eunjung Yeo, David R. Mortensen ... · arXiv
We present ChiReSSD, a speech reconstruction framework that preserves child speakers' identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with partic...
#15 (Score: 83)
Zhijun Liu, Dongya Jia, Xiaoqiang Wang ... · arXiv
Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising al...
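
A hedged sketch of the next-token diffusion loop the abstract describes, with `backbone` and `diffusion_head` as hypothetical stand-ins rather than the paper's code:

```python
# Autoregressive generation of continuous speech tokens: the backbone
# summarizes the prefix, and a small diffusion sampler refines each
# next token from noise, conditioned on that context.
import torch

def generate(backbone, diffusion_head, text_cond, n_tokens, dim):
    tokens = []
    for _ in range(n_tokens):
        h = backbone(text_cond, tokens)           # context from prefix + text
        z = torch.randn(1, dim)                   # each token starts from noise
        for t in reversed(range(diffusion_head.num_steps)):
            z = diffusion_head.denoise(z, t, h)   # iterative refinement per token
        tokens.append(z)
    return torch.cat(tokens, dim=0)               # continuous speech tokens
```
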
#16 (Score: 83)
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite rece...
#17 (Score: 83)
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This w...
#18 (Score: 80)
Junyu Wang, Ziyang Ma, Zhengding Luo ... · arXiv
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, cau...
#19 (Score: 82)
Aditya Bhattacharjee, Marco Pasini, Emmanouil Benetos · Under review for International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, 2026
The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Fl...
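
For intuition, a minimal sketch of rectified-flow sampling by Euler integration of a learned velocity field; `velocity_net` is a hypothetical trained model, not the paper's code.

```python
# Rectified-flow sampling: integrate dx/dt = v(x, t) from t=0 (noise)
# to t=1 to obtain synthetic latent fingerprints.
import torch

def sample_fingerprints(velocity_net, n, dim, steps=50):
    x = torch.randn(n, dim)                      # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + velocity_net(x, t) * dt          # one Euler step along the flow
    return x                                     # approximate latent fingerprints
```
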
#20 (Score: 83)
Jiarui Hai, Helin Wang, Weizhe Guo ... · arXiv
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-...
#21 (Score: 83)
Runyan Yang, Yuke Si, Yingying Gao ... · arXiv
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge...
#22 (Score: 81)
Seungyoun Shin, Dongha Ahn, Jiwoo Kim ... · arXiv
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into mon...
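
The group-relative advantage at the core of GRPO can be sketched in a few lines; the rewards below are placeholder negative CERs, which illustrates why a transcription-only reward cannot separate monotone from expressive prosody.

```python
# Each sampled utterance is scored relative to the mean of its own
# group; if monotone and expressive readings share the same CER,
# their advantages tie and nothing pushes prosody to vary.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_groups, group_size) -> normalized advantages."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

rewards = -torch.tensor([[0.10, 0.12, 0.11, 0.09]])  # placeholder negative CERs
print(group_relative_advantages(rewards))
```
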
#23 (Score: 82)
Mélisande Teng, Julien Boussard, David Rolnick ... · arXiv
Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great p...
#24 (Score: 82)
Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba ... · arXiv
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely rema...
#25 (Score: 82)
Qiushi Han, Yuan Liao, Youhao Si ... · 5 pages, 2 figures, conference
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-...
#26 (Score: 83)
Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel · arXiv
Spatial target speaker extraction isolates a desired speaker's voice in multi-speaker environments using spatial information, such as the direction of arrival (DoA). Although recent deep neural network (DNN)-based discriminative methods have shown significant performance improvem...
#27 (Score: 83)
Viola Negroni, Davide Salvi, Alessandro Ilic Mezza ... · Accepted @ IEEE WIFS 2025
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfa...
#28 (Score: 92)
Chang Li, Zehua Chen, Liyuan Wang ... · Accepted at NeurIPS 2025
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative gene...
#29 (Score: 82)
Dongheon Lee, Younghoo Kwon, Jung-Woo Choi · 26 pages, 13 figures, 8 tables, accepted in NeurIPS 2025
We propose DeepASA, a one-for-all model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA ...
#30 (Score: 83)
Junhyeok Lee, Helin Wang, Yaohan Guan ... · arXiv
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To furt...
#31 (Score: 77)
Zeyu Xie, Yaoyun Zhang, Xuenan Xu ... · arXiv
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largel...
#32 (Score: 81)
Massa Baali, Sarthak Bisht, Francisco Teixeira ... · Accepted to EMNLP 2025 Findings
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions caus...
#33 (Score: 77)
Ruonan Zhang, Xiaoyang Hao, Yichen Han ... · arXiv
High-fidelity neural audio codecs in text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and semantic information within tokens, leading to ...
#34 (Score: 83)
Yan Rong, Chenxing Li, Dong Yu ... · arXiv
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training dat...
#35 (Score: 82)
Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa ... · Accepted at the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like Cros...
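
The classical baseline the abstract alludes to, cross-correlation alignment (here a GCC-PHAT variant), can be sketched as follows; it recovers one global lag and thus cannot model nonlinear clock drift.

```python
# GCC-PHAT: whiten the cross-spectrum so only phase contributes, then
# take the lag of the cross-correlation peak as the channel offset.
import numpy as np

def gcc_phat(sig, ref, fs):
    """Return the time offset (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return lag / fs
```
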
#36 (Score: 83)
Tianheng Zhu, Yinfeng Yu, Liejun Wang ... · Main paper (15 pages). Accepted for publication by ICONIP (International Conference on Neural Information Processing) 2025
Audio-driven talking head generation is crucial for applications in virtual reality, digital avatars, and film production. While NeRF-based methods enable high-fidelity reconstruction, they suffer from low rendering efficiency and suboptimal audio-visual synchronization. This wor...
#37 (Score: 83)
Vishnu Raja, Adithya V Ganesan, Anand Syamkumar ... · Will appear in EMNLP 2025 Main Proceedings
State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling str...
#38 (Score: 82)
Maurício do V. M. da Costa, Eloi Moliner · accepted at IEEE International Symposium on the Internet of Sounds
This paper introduces MR-CQTdiff, a novel neural-network architecture for diffusion-based audio generation that leverages a multi-resolution Constant-Q Transform (CQT). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time-frequency re...
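
For context, computing a CQT with librosa shows the resolution trade-off the multi-resolution design addresses (illustrative, not the paper's implementation):

```python
# The CQT's geometrically spaced bins give fine frequency resolution at
# low frequencies and fine time resolution at high ones; architectures
# like the one described can process several resolutions in parallel.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
C = librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C1"),
                n_bins=84, bins_per_octave=12)   # 7 octaves
print(C.shape)  # (84 frequency bins, time frames)
```
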
#39 (Score: 82)
Tse-Yang Chen, Yuh-Jzer Joung · arXiv
Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware me...
#40 (Score: 82)
Luca Della Libera, Cem Subakan, Mirco Ravanelli · arXiv
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications....
#41 (Score: 82)
Qi Wang, Shituo Ma, Guoxin Yu ... · arXiv
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high c...
#42 (Score: 80)
Mohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan ... · Accepted to APSIPA-ASC 2025
In this work, we investigate multimodal foundation models (MFMs) for EmoFake detection (EFD) and hypothesize that they will outperform audio foundation models (AFMs). MFMs, due to their cross-modal pre-training, learn emotional patterns from multiple modalities, while AFMs rely o...
#43 (Score: 83)
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze · arXiv
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does ...
#44 (Score: 82)
Dohwan Kim, Jung-Woo Choi · arXiv
In speech enhancement, knowledge distillation (KD) compresses models by transferring a high-capacity teacher's knowledge to a compact student. However, conventional KD methods train the student to mimic the teacher's output entirely, which forces the student to imitate the region...
#45 (Score: 75)
Gang Yang, Yue Lei, Wenxin Tai ... · arXiv
Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their dependence on multi-step generation is computationally expensive and vulnerable to discretization errors. Recent advances in one-step generative modeling, particularly ...
#46 (Score: 82)
Younghoo Kwon, Jung-Woo Choi · arXiv
The spatial semantic segmentation task focuses on separating and classifying sound objects from multichannel signals. To achieve two different goals, conventional methods fine-tune a large classification model cascaded with the separation model and inject classified labels as sep...
#47 (Score: 83)
Ziqi Dai, Yiting Chen, Jiacheng Xu ... · arXiv
The pipeline for multi-participant audiobook production primarily consists of three stages: script analysis, character voice timbre selection, and speech synthesis. Among these, script analysis can be automated with high accuracy using NLP models, whereas character voice timbre s...
#48 (Score: 83)
Yongsheng Feng, Yuetonghui Xu, Jiehui Luo ... · arXiv
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and ...
#49 (Score: 83)
Qiaolin Wang, Xilin Jiang, Linyang He ... · arXiv
While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale ch...
#50 (Score: 81)
Pengcheng Li, Botao Zhao, Zuheng Kang ... · Accepted by the Findings of 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings 2025)
Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent adva...
#51 (Score: 83)
Kaspar Müller, Markus Buck, Simon Doclo ... · Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing
The steered response power (SRP) method is one of the most popular approaches for acoustic source localization with microphone arrays. It is often based on simplifying acoustic assumptions, such as an omnidirectional sound source in the far field of the microphone array(s), free ...
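
A baseline sketch of conventional SRP for a single microphone pair, under the far-field, free-field assumptions the abstract mentions (not the paper's generalized formulation):

```python
# Steer the pair over candidate inter-channel delays and pick the delay
# (i.e., direction) whose aligned cross-power is largest.
import numpy as np

def srp_two_mics(x1, x2, fs, max_delay_s=5e-4, n_candidates=181):
    n = len(x1)
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delays = np.linspace(-max_delay_s, max_delay_s, n_candidates)
    powers = []
    for tau in delays:
        steer = np.exp(-2j * np.pi * freqs * tau)   # align channel 2 to channel 1
        powers.append(np.abs(np.sum(X1 * np.conj(X2 * steer))))
    return delays[int(np.argmax(powers))]            # delay of the strongest direction
```
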
#52 (Score: 81)
Yiru Zhang, Hang Su, Lichun Fan ... · arXiv
Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advancement of Large Audio-Language Models (LALMs) has already brought some new insights to TS-ASR. How...
#53 (Score: 83)
Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi ... · arXiv
Deep learning-based Sound Event Localization and Detection (SELD) systems degrade significantly on real-world, long-tailed datasets. Standard regression losses bias learning toward frequent classes, causing rare events to be systematically under-recognized. To address this challe...
#54 (Score: 84)
Dhruuv Agarwal, Harry Zhang, Yang Yu ... · arXiv
Personalizing Automatic Speech Recognition (ASR) for dysarthric speech is crucial but challenging due to the cost of training and storing individual user adapters. We propose a hybrid meta-training method for a single model, excelling in zero-shot and few-shot on-the-fly personalization v...
#55 (Score: 82)
Ryan Collette, Ross Greenwood, Serena Nicoll · arXiv
While existing speech audio codecs designed for compression exploit limited forms of temporal redundancy and allow for multi-scale representations, they tend to represent all features of audio in the same way. In contrast, generative voice models designed for text-to-speech and v...
#56 (Score: 82)
Daniyal Kabir Dar, Qiben Yan, Li Xiao ... · arXiv
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-e...
#57 (Score: 82)
Simon Welker, Tal Peer, Timo Gerkmann · arXiv
The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudo...
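
For reference, the classical non-generative baseline for this task, Mel-filterbank inversion plus Griffin-Lim phase retrieval, is a few lines of librosa; the generative approach described targets the quality ceiling of exactly this kind of inversion.

```python
# Mel vocoding baseline: approximate inversion of the Mel filterbank
# followed by Griffin-Lim phase retrieval.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256, n_iter=32)
```
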
#58 (Score: 80)
Théo Charlot, Tarek Kunze, Maxime Poli ... · arXiv
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation mo...
#59 (Score: 77)
Xiaolei Xu, Chaoyue Niu, Guy J. Brown ... · arXiv
Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed due to the complexity and cost of overnight polysomnography. Acoustic-based screening provides a scalable alternative, yet performance is limited by...
#60 (Score: 83)
Kangdi Wang, Zhiyue Wu, Dinghao Zhou ... · arXiv
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representati...
#61 (Score: 81)
Yuanjian Chen, Yang Xiao, Jinjie Huang · arXiv
Multimodal acoustic event classification plays a key role in audio-visual systems. Although combining audio and visual signals improves recognition, it is still difficult to align them over time and to reduce the effect of noise across modalities. Existing methods often treat aud...
#62 (Score: 77)
Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan · arXiv
In this paper, we show that discrete optimal transport (DOT) is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated spe...
#63 (Score: 82)
Duojia Li, Shenghui Lu, Hongchen Pan ... · arXiv
Multistep inference is a bottleneck for real-time generative speech enhancement because flow- and diffusion-based systems learn an instantaneous velocity field and therefore rely on iterative ordinary differential equation (ODE) solvers. We introduce MeanFlowSE, a conditional gen...
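
The contrast the abstract draws can be sketched directly: multi-step Euler integration of an instantaneous velocity field versus a single call to an average-velocity (mean-flow) network. `v_net` and `u_net` are hypothetical stand-ins, not the paper's code.

```python
# Instantaneous-velocity models need an ODE solver (many network calls);
# a mean-flow model learns the average velocity over an interval, so a
# single call covers [0, 1].
import torch

def euler_enhance(v_net, noisy, steps=30):
    x = noisy
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + v_net(x, t, noisy) * dt          # one of `steps` network calls
    return x

def meanflow_enhance(u_net, noisy):
    t0 = torch.zeros(noisy.shape[0])
    t1 = torch.ones(noisy.shape[0])
    return noisy + u_net(noisy, t0, t1, noisy)   # single call: average velocity over [0, 1]
```
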
#64 (Score: 82)
Mingchen Shao, Bingshen Mu, Chengyou Wang ... · arXiv
Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness substantially degrades in low-resource languages such ...
#65 (Score: 83)
Michael Tatarjitzky, Boaz Rafaely · arXiv
Multichannel speech enhancement leverages spatial cues to improve intelligibility and quality, but most learning-based methods rely on specific microphone array geometry, unable to account for geometry changes. To mitigate this limitation, current array-agnostic approaches employ...
#66 (Score: 83)
Kentaro Seki, Yuki Okamoto, Kouei Yamaoka ... · arXiv
Contrastive language-audio pretraining (CLAP) has achieved remarkable success as an audio-text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatia...
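
As background, a minimal sketch of the CLAP-style contrastive objective (symmetric InfoNCE over paired audio/text embeddings); the spatial extension is the paper's contribution and is not shown here.

```python
# Paired audio/text embeddings are pulled together and mismatched pairs
# pushed apart via a symmetric cross-entropy over the similarity matrix.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature               # (batch, batch) similarities
    labels = torch.arange(a.shape[0])            # i-th audio matches i-th caption
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```
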
#67 (Score: 82)
Keyu An, Zhiyu Zhang, Changfeng Gao ... · arXiv
This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenizat...
#68 (Score: 80)
Samuel J. Broughton, Lahiru Samarakoon · In Proc. Interspeech 2025 (pp. 5218-5222)
In this paper, we present state-of-the-art diarization error rates (DERs) on multiple publicly available datasets, including AliMeeting-far, AliMeeting-near, AMI-Mix, AMI-SDM, DIHARD III, and MagicData RAMC. Leveraging EEND-TA, a single unified non-autoregressive model for end-to...
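
For readers new to the metric, the diarization error rate (DER) reported here is the fraction of reference speech time that is misattributed:

```python
# DER = (false alarm + missed speech + speaker confusion) / total speech.
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """All arguments in seconds; returns DER as a fraction."""
    return (false_alarm + missed + confusion) / total_speech

print(diarization_error_rate(1.2, 2.5, 3.1, 120.0))  # 0.0567 -> 5.67% DER
```
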
#69 (Score: 83)
Ye-Xin Lu, Yu Gu, Kun Wei ... · arXiv
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background...
#70 (Score: 80)
Kartik Hegde, Rehana Mahfuz, Yinyi Guo ... · arXiv
Current audio captioning systems rely heavily on supervised learning with paired audio-caption datasets, which are expensive to curate and may not reflect human preferences in real-world scenarios. To address this limitation, we propose a preference-aligned audio captioning frame...
#71 (Score: 83)
Seungmin Seo, Oleg Aulov, P. Jonathon Phillips · arXiv
We use the term re-identification to refer to the process of recovering the original speaker's identity from anonymized speech outputs. Speaker de-identification systems aim to reduce the risk of re-identification, but most evaluations focus only on individual-level measures and ...
#72 (Score: 82)
Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim ... · 8 pages, 5 figures, accepted at IEEE ASRU 2025
Self-supervised learning (SSL) has pushed speaker verification accuracy close to state-of-the-art levels, but the Transformer backbones used in most SSL encoders hinder on-device and real-time deployment. Prior compression work trims layer depth or width yet still inherits the qu...
#73 (Score: 82)
Xikun Lu, Fang Liu, Weizhi Shi ... · arXiv
High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, limiting their edge-device application. To address this, we propose the Lightweight Implicit Neural Network (LINN), a novel two-stage framewo...
#74 (Score: 83)
Junan Zhang, Yunjia Zhang, Xueyao Zhang ... · arXiv
Singing Accompaniment Generation (SAG) is the process of generating instrumental music for a given clean vocal input. However, existing SAG techniques use source-separated vocals as input and overfit to separation artifacts. This creates a critical train-test mismatch, leading to...
#75 (Score: 83)
Eric Zhang, Li Wei, Sarah Chen ... · arXiv
Stuttered and dysfluent speech detection systems have traditionally suffered from the trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper looks at the...
#76 (Score: 82)
Shun Huang, Zhihua Fang, Liang He · Accepted ICASSP 2025
Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines rema...
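
The standard recipe this abstract builds on, training an autoencoder on normal sounds only and scoring by reconstruction error, can be sketched as follows (not the paper's model):

```python
# Inputs the autoencoder reconstructs poorly are flagged as anomalous.
# False alarms arise when a different machine of the same type shifts
# the "normal" distribution -- the failure mode the paper addresses.
import torch
import torch.nn as nn

class SpectrogramAE(nn.Module):
    def __init__(self, n_mels=128, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, n_mels)

    def forward(self, x):                        # x: (frames, n_mels)
        return self.dec(self.enc(x))

def anomaly_score(model, x):
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean().item()   # high error = anomalous
```
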
#77 (Score: 75)
Janne Laakkonen, Ivan Kukanov, Ville Hautamäki · arXiv
Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in ...
#78 (Score: 83)
Fei Liu, Yang Ai, Zhen-Hua Ling · Accepted by APSIPA 2025
This paper proposes APSS, a novel neural speech separation model with parallel amplitude and phase spectrum estimation. Unlike most existing speech separation methods, the APSS distinguishes itself by explicitly estimating the phase spectrum for more complete and accurate separat...
#79 (Score: 83)
Younghoo Kwon, Dongheon Lee, Dohwan Kim ... · 5 pages, 2 figures, submitted to DCASE workshop 2025
This paper introduces a multi-stage self-directed framework designed to address the spatial semantic segmentation of sound scenes (S5) task in the DCASE 2025 Task 4 challenge. This framework integrates models focused on three distinct tasks: Universal Sound Separation (USS), Singl...
#80 (Score: 83)
Justin Lovelace, Rithesh Kumar, Jiaqi Su ... · arXiv
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To...
#81 (Score: 83)
Hui-Peng Du, Yang Ai, Zhen-Hua Ling · Accepted by APSIPA ASC 2025
The majority of mainstream neural vocoders primarily focus on speech quality and generation speed, while overlooking latency, which is a critical factor in real-time applications. Excessive latency leads to noticeable delays in user interaction, severely degrading the user experi...
#82 (Score: 82)
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota · arXiv
Dysarthric speech severity classification is crucial for objective clinical assessment and progress monitoring in individuals with motor speech disorders. Although prior methods have addressed this task, achieving robust generalization in speaker-independent (SID) scenarios remai...
#83 (Score: 77)
Han Yin, Jung-Woo Choi · arXiv
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs' audio understanding performance, researchers have proposed different benchmarks. However...
#84 (Score: 83)
Yudong Yang, Xiaokang Liu, Shaofeng Zhao ... · arXiv
Speech therapy plays a critical role in treating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback, constraining their practical uti...
#85 (Score: 82)
Yujie Guo, Jiaming Zhou, Yuhang Jia ... · arXiv
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which ...
#86 (Score: 83)
Jingyu Li, Guangyan Zhang, Zhen Ye ... · arXiv
Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reco...
#87 (Score: 77)
Zhan Jin, Bang Zeng, Peijun Yang ... · arXiv
Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This p...
#88 (Score: 83)
Wen-Yung Wu, Pei-Chin Hsieh, Tai-Shih Chi · arXiv
Voice activity detection (VAD) is essential in speech-based systems, but traditional methods detect only speech presence without identifying speakers. Target-speaker VAD (TS-VAD) extends this by detecting the speech of a known speaker using a short enrollment utterance, but this ...
#89 (Score: 80)
Adhiraj Banerjee, Vipul Arora · arXiv
Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to ...
#90 (Score: 82)
Milan Marocchi, Matthew Fynn, Kayapanda Mandana ... · arXiv
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to class...
#91 (Score: 77)
Emmanouil Karystinaios · Accepted at Large Language Models for Music & Audio Workshop (LLM4MA) 2025
Agentic AI has been standardized in industry as a practical paradigm for coordinating specialized models and tools to solve complex multimodal tasks. In this work, we present WeaveMuse, a multi-agent system for music understanding, symbolic composition, and audio synthesis. Each ...
#92 (Score: 80)
Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin ... · arXiv
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic repre...