Audio Papers

Week of September 21 - September 28, 2025

Thursday, September 25, 2025
Sitong Cheng, Weizhen Bian, Xinsheng Wang ... · arXiv
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data...
Rostislav Makarov, Lea Schönherr, Timo Gerkmann · arXiv
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be suscep...
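The vulnerability class described here is adversarial perturbation of the model input. A minimal, hypothetical FGSM-style sketch of the idea, with a toy Conv1d standing in for a trained enhancer (this is a generic illustration, not the paper's method):

```python
import torch

# Toy stand-in for a trained enhancement model: any differentiable
# waveform-to-waveform network exposes the same attack surface.
enhancer = torch.nn.Conv1d(1, 1, kernel_size=9, padding=4)

x = torch.randn(1, 1, 16000, requires_grad=True)  # 1 s of "noisy" input audio
target = torch.zeros_like(x)                      # attacker-chosen output

# One FGSM-style step: nudge the input so the enhanced output
# moves toward the attacker's target instead of the clean speech.
loss = torch.nn.functional.mse_loss(enhancer(x), target)
loss.backward()
eps = 1e-3                                        # perturbation budget
x_adv = (x - eps * x.grad.sign()).detach()        # near-inaudible perturbation
```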
Haolin He, Xingjian Du, Renhe Sun ... · arXiv
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-tra...
Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong ... · arXiv
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fideli...
Wednesday, September 24, 2025
Junchuan Zhao, Wei Zeng, Tianle Lyu ... · arXiv
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extend...
The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran ... · arXiv
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generat...
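For readers unfamiliar with masked generative decoding, the core loop iteratively commits the most confident token predictions and re-predicts the rest. A generic MaskGIT-style sketch, with predict as a hypothetical stand-in for a trained masked-token predictor (not MAGE's actual design):

```python
import numpy as np

VOCAB, T, STEPS, MASK = 1024, 100, 8, -1
rng = np.random.default_rng(0)

def predict(tokens):
    """Random stand-in; a real model would condition on committed tokens."""
    logits = rng.standard_normal((T, VOCAB))
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    return probs / probs.sum(-1, keepdims=True)

tokens = np.full(T, MASK)                      # start fully masked
for step in range(1, STEPS + 1):
    probs = predict(tokens)
    best, conf = probs.argmax(-1), probs.max(-1)
    conf[tokens != MASK] = np.inf              # never re-mask committed tokens
    k = int(T * step / STEPS)                  # unmasking schedule: commit top-k
    commit = np.argsort(-conf)[:k]             # most confident positions first
    tokens[commit] = np.where(tokens[commit] == MASK,
                              best[commit], tokens[commit])
```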
Ismail Rasim Ulgen, Zongyang Du, Junchen Lu ... · arXiv
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measur...
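For context on why WER counts as coarse: it is word-level edit distance normalized by reference length, blind to prosody, timing, and everything acoustic. A minimal reference implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Identical words, arbitrarily different prosody: WER cannot tell them apart.
print(wer("the cat sat", "the cat sat"))  # 0.0
```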
Jinyang Wu, Nana Hou, Zihan Pan ... · arXiv
The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this region. The omission is critical: detection models trained ...
Pin-Jui Ku, He Huang, Jean-Marie Lemercier ... · arXiv
This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronge...
Stefano Ciapponi, Leonardo Mannini, Jarek Scanferla ... · arXiv
This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, out...
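A "semi-learnable" spectral front end typically pairs a fixed FFT with a small set of trainable spectral weights. A generic sketch of that idea; the shapes and random initialization are assumptions, not WrenNet's design:

```python
import torch

N_FFT, N_BANDS = 512, 32
# Trainable band weights over fixed FFT bins (random init for the sketch).
fb = torch.nn.Parameter(torch.rand(N_BANDS, N_FFT // 2 + 1))

def features(wave: torch.Tensor) -> torch.Tensor:
    """Fixed STFT magnitude followed by a learnable band projection."""
    spec = torch.stft(wave, N_FFT, window=torch.hann_window(N_FFT),
                      return_complex=True).abs()   # (freq_bins, frames)
    return torch.log1p(fb @ spec)                  # (N_BANDS, frames)

print(features(torch.randn(16000)).shape)          # torch.Size([32, ...])
```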
Hongzhao Chen, XiaoYang Wang, Jing Lan ... · arXiv
Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, f...
Yifan Yang, Bing Han, Hui Wang ... · arXiv
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably qu...
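To make "partial views of prosodic variation" concrete: a common proxy is the standard deviation of F0, which collapses an entire pitch contour into one number. A naive sketch with a hypothetical f0_track input:

```python
import numpy as np

def f0_std(f0_track: np.ndarray) -> float:
    """Naive prosody-diversity proxy: st. dev. of voiced F0 in Hz."""
    voiced = f0_track[f0_track > 0]          # ignore unvoiced frames (F0 == 0)
    return float(np.std(voiced))

rising = np.linspace(100, 270, 50)           # monotone glide
alternating = np.tile([100.0, 200.0], 25)    # rapid oscillation
# Both score ~50 Hz despite perceptually very different prosody.
print(f0_std(rising), f0_std(alternating))
```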
Tuesday, September 23, 2025
Shaoshi Ling, Gang Liu, Guoli Ye ... · arXiv
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries di...
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This w...
Zhijun Liu, Dongya Jia, Xiaoqiang Wang ... · arXiv
Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising al...
Karen Rosero, Eunjung Yeo, David R. Mortensen ... · arXiv
We present ChiReSSD, a speech reconstruction framework that preserves a child speaker's identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with partic...
Jiarui Hai, Helin Wang, Weizhe Guo ... · arXiv
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-...
Runyan Yang, Yuke Si, Yingying Gao ... · arXiv
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge...
Niclas Pokel, Pehuén Moure, Roman Boehringer ... · arXiv
Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite rece...
Aditya Bhattacharjee, Marco Pasini, Emmanouil Benetos · Under review for International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, 2026
The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Fl...
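For orientation, a Rectified Flow generator is trained by regressing a velocity field along straight-line paths between noise and data. A generic training-step sketch; the latent dimension and toy network are assumptions, not the authors' code:

```python
import torch

D = 128                                    # assumed latent fingerprint dim
model = torch.nn.Sequential(               # toy velocity network v(x_t, t)
    torch.nn.Linear(D + 1, 256), torch.nn.ReLU(), torch.nn.Linear(256, D))

x1 = torch.randn(64, D)                    # batch of real latent fingerprints
x0 = torch.randn_like(x1)                  # noise samples
t = torch.rand(64, 1)                      # random interpolation times
xt = (1 - t) * x0 + t * x1                 # straight-line interpolant
v_target = x1 - x0                         # constant target velocity
v_pred = model(torch.cat([xt, t], dim=-1))
loss = torch.nn.functional.mse_loss(v_pred, v_target)
loss.backward()                            # one rectified-flow training step
```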
Seungyoun Shin, Dongha Ahn, Jiwoo Kim ... · arXiv
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into mon...
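Background for the collapse argument: GRPO needs no value network because it standardizes rewards within a group of samples from the same prompt, so whatever the reward favors is amplified relative to the group. A minimal sketch of the group-relative advantage, with hypothetical CER-based rewards:

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards (1 - CER) for 4 sampled utterances of one prompt.
# Monotone-but-clear speech (low CER) is pushed up; expressive takes with
# occasional transcription errors are pushed down, so prosody collapses.
rewards = np.array([0.98, 0.97, 0.80, 0.75])
print(group_advantages(rewards))
```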
Junyu Wang, Ziyang Ma, Zhengding Luo ... · arXiv
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, cau...
Monday, September 22, 2025
Chang Li, Zehua Chen, Liyuan Wang ... · Accepted at NeurIPS 2025
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models; however, previous methods often suffer from sub-optimal upsampling quality due to their uninformative gene...
Viola Negroni, Davide Salvi, Alessandro Ilic Mezza ... · Accepted @ IEEE WIFS 2025
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfa...
Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel · arXiv
Spatial target speaker extraction isolates a desired speaker's voice in multi-speaker environments using spatial information, such as the direction of arrival (DoA). Although recent deep neural network (DNN)-based discriminative methods have shown significant performance improvem...
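To ground the "spatial information" idea: the classical baseline is delay-and-sum beamforming, which time-aligns the microphones toward the DoA before summing. A minimal far-field sketch for a uniform linear array; the geometry and sampling rate are assumptions:

```python
import numpy as np

def delay_and_sum(mics, doa_deg, mic_dist=0.05, fs=16000, c=343.0):
    """Steer a uniform linear array toward doa_deg and average the channels.
    mics: (n_channels, n_samples) array of time-domain signals."""
    n_ch, n = mics.shape
    # Relative arrival delays of a far-field plane wave at each mic.
    delays = np.arange(n_ch) * mic_dist * np.cos(np.radians(doa_deg)) / c
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for ch in range(n_ch):
        spec = np.fft.rfft(mics[ch])                     # fractional delay
        spec *= np.exp(2j * np.pi * freqs * delays[ch])  # in frequency domain
        out += np.fft.irfft(spec, n)
    return out / n_ch

beamformed = delay_and_sum(np.random.randn(4, 16000), doa_deg=60)
```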
Qiushi Han, Yuan Liao, Youhao Si ... · 5 pages, 2 figures, conference
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-...
Mélisande Teng, Julien Boussard, David Rolnick ... · arXiv
Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great p...
Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba ... · arXiv
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely rema...
Sunday, September 21, 2025
Yan Rong, Chenxing Li, Dong Yu ... · arXiv
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training dat...
Junhyeok Lee, Helin Wang, Yaohan Guan ... · arXiv
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To furt...
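For context, combining multiple classifier-free guidances typically means adding several weighted (conditional minus unconditional) directions to one prediction. A generic sketch with stand-in model outputs, not MaskVCT's exact formulation:

```python
import numpy as np

def multi_cfg(pred_uncond, pred_conds, weights):
    """Combine several CFG directions: unconditional prediction plus a
    weighted sum of (conditional - unconditional) guidance terms."""
    out = pred_uncond.copy()
    for pred_c, w in zip(pred_conds, weights):
        out += w * (pred_c - pred_uncond)
    return out

pred_u = np.zeros(8)                            # stand-in model outputs
pred_speaker, pred_text = np.ones(8), -np.ones(8)
guided = multi_cfg(pred_u, [pred_speaker, pred_text], weights=[2.0, 1.5])
```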
Tianheng Zhu, Yinfeng Yu, Liejun Wang ... · Main paper (15 pages). Accepted for publication by ICONIP (International Conference on Neural Information Processing) 2025
Audio-driven talking head generation is crucial for applications in virtual reality, digital avatars, and film production. While NeRF-based methods enable high-fidelity reconstruction, they suffer from low rendering efficiency and suboptimal audio-visual synchronization. This wor...
Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa ... · Accepted on Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like Cros...
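The cross-correlation baseline alluded to above estimates a single constant lag, which is exactly why nonlinear clock drift defeats it. A minimal sketch:

```python
import numpy as np

def xcorr_lag(a: np.ndarray, b: np.ndarray) -> int:
    """Estimate the constant integer delay of b relative to a, in samples."""
    corr = np.correlate(b, a, mode="full")
    return int(np.argmax(corr)) - (len(a) - 1)

rng = np.random.default_rng(0)
a = rng.standard_normal(16000)          # 1 s reference channel at 16 kHz
b = np.roll(a, 800)                     # second channel, delayed 50 ms
print(xcorr_lag(a, b))                  # -> 800; one global lag, no drift model
```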
Dongheon Lee, Younghoo Kwon, Jung-Woo Choi · 26 pages, 13 figures, 8 tables, accepted in NeurIPS 2025
We propose DeepASA, a one-for-all model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA ...
Massa Baali, Sarthak Bisht, Francisco Teixeira ... · Accepted to EMNLP 2025 Findings
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions caus...
Zeyu Xie, Yaoyun Zhang, Xuenan Xu ... · arXiv
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largel...
Ruonan Zhang, Xiaoyang Hao, Yichen Han ... · arXiv
High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and semantic information within tokens, leading to ...
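Background on the tokens in question: most neural audio codecs discretize with residual vector quantization (RVQ), where each stage quantizes the previous stage's residual, so acoustic and semantic content end up interleaved across codebooks. A generic RVQ sketch with assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, STAGES = 64, 256, 4                       # dim, codebook size, n stages
codebooks = rng.standard_normal((STAGES, K, D))

def rvq_encode(z):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual, codes = z.copy(), []
    for cb in codebooks:
        idx = np.argmin(((residual[None] - cb) ** 2).sum(-1))  # nearest code
        codes.append(int(idx))
        residual -= cb[idx]                     # pass residual to next stage
    return codes, z - residual                  # token ids, reconstruction

codes, z_hat = rvq_encode(rng.standard_normal(D))
```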