Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
Primary: University of Hamburg
All Institutions: University of Hamburg, CISPA Helmholtz Center for Information Security, Signal Processing (SP)
The main contribution of this paper is the demonstration that modern speech enhancement systems are vulnerable to adversarial attacks, highlighting the need for robust defenses in the field. The comprehensive methodology and experimental evaluation provide valuable insights into the vulnerabilities of both predictive and generative models, making a meaningful contribution to the ongoing discourse on security in machine learning applications.
The paper proposes a novel approach to adversarial attacks on speech enhancement systems by leveraging psychoacoustic principles to mask adversarial noise. The methodology is well-structured, incorporating a white-box attack scenario where the adversary has full knowledge of the model. The introduction of a psychoacoustic model to optimize the inaudibility of the perturbation is particularly innovative. The authors also provide a detailed description of the optimization process, including the use of projected gradient descent and the incorporation of constraints to balance attack success and audibility. This methodological rigor enhances the credibility of the findings.
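To make the attack formulation concrete, the sketch below shows one plausible reading of the optimization loop: projected gradient descent on an additive perturbation, with a per-bin psychoacoustic masking threshold enforced as a projection after each step. The model interface, the target loss, and the precomputed `masking_threshold_db` tensor are illustrative assumptions, not the authors' implementation.

```python
import torch

def psychoacoustic_pgd(model, x, target_loss, masking_threshold_db,
                       steps=500, lr=1e-3, n_fft=512):
    """Sketch of a psychoacoustically masked attack on a speech enhancement model.

    x: clean waveform, shape (1, T)
    target_loss: hypothetical callable mapping enhanced audio -> scalar that is
        low when the enhanced output carries the adversarial semantics
        (e.g. an ASR loss toward a target transcription)
    masking_threshold_db: assumed precomputed per-STFT-bin threshold (dB);
        perturbation energy above it is treated as audible and clipped away
    """
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    window = torch.hann_window(n_fft)

    for _ in range(steps):
        loss = target_loss(model(x + delta))   # white-box forward pass
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Projection: attenuate STFT bins of the perturbation that exceed
        # the masking threshold derived from the clean input.
        with torch.no_grad():
            D = torch.stft(delta.squeeze(0), n_fft, window=window,
                           return_complex=True)
            mag_db = 20 * torch.log10(D.abs() + 1e-8)
            gain_db = torch.clamp(masking_threshold_db - mag_db, max=0.0)
            D = D * torch.pow(10.0, gain_db / 20)
            delta.data = torch.istft(D, n_fft, window=window,
                                     length=x.shape[-1]).unsqueeze(0)
    return (x + delta).detach()
```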
The experiments are comprehensive, utilizing the EARS-WHAM-v2 dataset, which is appropriate for evaluating speech enhancement systems. The evaluation metrics are well-chosen, including both attack success (WER, POLQA, ESTOI) and perturbation impact (SNR). The results are presented clearly, showing a systematic comparison between predictive and generative models, with insightful analysis on the effects of different configurations. The paper effectively demonstrates the vulnerability of speech enhancement systems to adversarial attacks and highlights the robustness of diffusion models.
The authors provide sufficient details regarding the experimental setup, including model architectures and training procedures. The inclusion of links to the project page and GitHub repository enhances reproducibility. However, the paper could benefit from more explicit instructions on replicating the psychoacoustic model and the adversarial attack process, as these are critical to understanding the full scope of the methodology.
One limitation of the study is that it primarily focuses on white-box attacks, which may not fully represent real-world scenarios where adversaries have limited knowledge of the model. Additionally, while the paper discusses the robustness of diffusion models, it does not explore the potential trade-offs in performance or the computational complexity associated with these models. The generalizability of the findings to other speech enhancement systems beyond those tested is also not addressed.
This research has significant implications for the security of speech enhancement systems, which are increasingly used in applications such as hearing aids and telecommunication devices. By demonstrating vulnerabilities to adversarial attacks, the work raises awareness about the need for more robust models in real-world applications. The findings could inform future research aimed at developing defenses against such attacks, ultimately contributing to safer and more reliable speech processing technologies.
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
The main contribution of this paper is the introduction of UniSS, a unified single-stage framework for expressive speech-to-speech translation that significantly advances the state of the art by integrating large language models and addressing key challenges in the field. The comprehensive methodology, rigorous experimental evaluation, and potential for broader applications underscore its significance in machine learning research.
The paper introduces a unified single-stage framework for expressive speech-to-speech translation (S2ST) called UniSS, which effectively addresses the challenges of preserving speaker identity and emotional style during translation. The methodology is innovative, employing a cross-modal chain-of-thought prompting process that allows for the integration of large language models (LLMs) into the speech domain. The use of a triple-tokenizer strategy to represent different aspects of speech (speaker, linguistic, and semantic tokens) is a notable strength, as it enhances the model's ability to capture and reproduce expressive characteristics. The progressive training strategy is well-structured, emphasizing the importance of data quality and alignment between speech and text modalities.
The experimental results are robust, demonstrating that UniSS significantly outperforms existing methods in translation fidelity, speech quality, and emotional preservation. The authors provide a comprehensive evaluation using both objective metrics (e.g., BLEU scores, prosody preservation) and subjective assessments (e.g., MOS scores), which lend credibility to their claims. The introduction of the UniST dataset, comprising 44.8k hours of expressive S2ST data, is a significant contribution that enhances the reproducibility of results and provides a valuable resource for future research.
The paper includes detailed implementation details, including the training configuration, hyperparameters, and the data construction process for the UniST dataset. The availability of the code and demo enhances reproducibility, allowing other researchers to replicate the findings and build upon the work. However, the complexity of the model and the extensive training data required may pose challenges for some researchers in terms of resource availability.
While the paper presents a strong framework, it acknowledges limitations such as the focus on only Chinese and English languages, which restricts the applicability of the model to multilingual scenarios. Additionally, the reliance on a large-scale dataset may limit the model's accessibility for smaller research teams or institutions. The authors also mention the need for a unified tokenizer to optimize vocabulary size, indicating potential areas for further improvement.
The proposed UniSS framework has significant implications for real-time interpretation, cross-lingual video dubbing, and other applications requiring high-quality expressive S2ST. By effectively preserving emotional style and speaker identity, this work could enhance user experiences in various communication technologies, making it particularly relevant in globalized contexts where multilingual interactions are common.
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2\% on MMAU-test-mini, 75.6\% on MMAU, 67.1\% on MMAR, and 70.7\% on MMSU, establishing new state-of-the-art performance across these benchmarks.
Primary: South China University of Technology
All Institutions: South China University of Technology, Antgroup, Shanghai Jiao Tong University, University of Rochester, The Chinese University of Hong Kong, King's College London
The paper presents a comprehensive approach to improving Large Audio Language Models through innovative dataset construction and training paradigms, addressing critical gaps in the current research landscape. The technical contributions, particularly in the context of audio contribution analysis, position this work as a notable advancement in the field of audio processing and multimodal AI.
The paper introduces a novel dataset, AudioMCQ, which is substantial in size (571k samples) and includes chain-of-thought annotations. The methodology for dataset construction is well-structured, avoiding reliance on existing LALMs to prevent hallucinations. The introduction of Audio-Contribution Filtering to categorize audio contributions into weak and strong subsets is a significant methodological advancement. The proposed post-training paradigms (Weak-to-Strong and Mixed-to-Strong) are innovative and provide a framework for enhancing LALM performance based on audio contributions, which is a relatively unexplored area in the field.
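As a rough illustration of Audio-Contribution Filtering (a sketch of the idea, not the authors' pipeline), each multiple-choice item can be probed with a text-only pass: items answered correctly without the audio count as weak audio-contribution, the rest as strong. The `answer_text_only` helper is hypothetical.

```python
def audio_contribution_filter(samples, answer_text_only):
    """Split MCQ samples by whether the audio is actually needed.

    samples: iterable of dicts with keys 'audio', 'question', 'choices', 'answer'
    answer_text_only: hypothetical callable returning the model's chosen
        option when the audio input is withheld.
    """
    weak, strong = [], []
    for s in samples:
        if answer_text_only(s["question"], s["choices"]) == s["answer"]:
            weak.append(s)    # solvable from text alone -> weak audio contribution
        else:
            strong.append(s)  # audio is needed to reach the correct answer
    return weak, strong

# Weak-to-Strong paradigm: SFT on `weak`, then RL on `strong`;
# Mixed-to-Strong: SFT on weak + strong, then RL on `strong`.
```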
The experiments are robust, demonstrating the effectiveness of the proposed methods through competitive results in the DCASE 2025 Audio-Question-Answering challenge, where the authors achieved first place. The performance metrics across multiple benchmarks (MMAU-test-mini, MMAU, MMAR, MMSU) indicate that the proposed strategies lead to state-of-the-art results. The systematic evaluation of the zero audio-contribution phenomenon adds depth to the experimental design, showcasing the authors' thorough understanding of the challenges in LALM training.
The paper provides sufficient details regarding the dataset construction pipeline, training strategies, and evaluation protocols, which enhances reproducibility. However, the absence of a public repository or demo URL limits the ease with which others can replicate the results.
One limitation is the reliance on the quality of the audio data and its annotations, which can introduce biases or inaccuracies. Additionally, while the dataset is large, the diversity of audio types and contexts may still be limited, potentially affecting generalizability. The paper does not address the potential computational costs associated with the proposed training paradigms.
The findings have significant implications for the development of more effective multimodal AI systems, particularly in audio understanding tasks. The methodologies proposed could be applied to other domains where audio and text modalities intersect, potentially influencing future research directions in LALMs and related fields.
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most assume a fixed target sampling rate, requiring external resampling that leads to redundant computations. We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the input bandwidth with a time-frequency dual-path encoder and reconstructs missing high-frequency bands through a light decoder with frequency extension queries. It enables efficient and universal restoration across arbitrary input-output rates without redundant resampling. To support adversarial training across diverse rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module, and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details. As a single model across sampling rates, TF-Restormer consistently outperforms prior systems, achieving balanced gains in signal fidelity and perceptual quality, while its streaming mode maintains competitive effectiveness for real-time application. Code and demos are available at https://tf-restormer.github.io/demo.
The main contribution of this paper is the introduction of TF-Restormer, a novel speech restoration model that effectively addresses the challenges of restoring speech signals under various distortions while maintaining efficiency across different sampling rates. The comprehensive methodology, rigorous experimental evaluation, and potential for real-world applications underscore its significance in the field of audio processing and machine learning.
The paper presents TF-Restormer, an innovative encoder-decoder architecture that utilizes a time-frequency dual-path approach to address the challenges of speech restoration under various distortions. The methodology is well-structured, focusing on the input bandwidth while employing a lightweight decoder for high-frequency reconstruction. The introduction of a shared sampling-frequency-independent (SFI) STFT discriminator for adversarial training is a notable contribution, allowing the model to operate efficiently across different sampling rates without the need for redundant resampling. The use of a scaled log-spectral loss to stabilize optimization under severe conditions is also a significant methodological advancement. Overall, the methodology is robust and addresses key limitations in existing approaches.
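The exact form of the scaled log-spectral loss is not reproduced in this review; a generic log-spectral distance with a compressive exponent, shown below, illustrates the kind of objective described (tempering very large errors while retaining sensitivity to fine spectral detail). All parameter choices here are assumptions.

```python
import torch

def scaled_log_spectral_loss(pred, target, n_fft=1024, hop=256,
                             alpha=0.5, eps=1e-7):
    """Hypothetical stand-in for a scaled log-spectral loss.

    pred, target: waveforms of shape (B, T).  The per-bin log-magnitude error
    is raised to a compressive power alpha < 1, which limits the influence of
    severely degraded bins while keeping sensitivity to well-predicted
    spectral details.  The paper's exact scaling may differ.
    """
    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop_length=hop, window=window,
                   return_complex=True).abs()
    T = torch.stft(target, n_fft, hop_length=hop, window=window,
                   return_complex=True).abs()
    err = (torch.log(P + eps) - torch.log(T + eps)).abs()
    return (err ** alpha).mean()
```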
The experiments are thorough, utilizing diverse datasets such as UNIVERSE and VCTK for evaluation across various tasks, including denoising and super-resolution. The results demonstrate consistent improvements over prior systems in terms of signal fidelity and perceptual quality, with detailed comparisons against state-of-the-art models. The use of multiple metrics (PESQ, SDR, LSD, etc.) provides a comprehensive assessment of the model's performance. However, the paper could benefit from additional ablation studies to further validate the impact of individual components within the architecture.
The authors provide a clear implementation strategy, including training details, model configurations, and the use of publicly available datasets. The availability of code and demos enhances reproducibility, although the lack of a direct GitHub repository link may hinder ease of access for some researchers. The detailed training pipeline and parameter settings are well-documented, which is a positive aspect for reproducibility.
While the paper presents a strong framework, it does not extensively address potential limitations, such as the computational cost associated with the dual-path architecture at higher sampling rates. Additionally, the model's performance in real-world scenarios could be further validated with more extensive testing on diverse datasets beyond the synthetic and controlled environments used in the experiments.
The TF-Restormer model has significant implications for real-time speech restoration applications, particularly in scenarios involving low-bandwidth communication and various distortions. Its ability to operate across different sampling rates without redundant resampling makes it a practical solution for real-world applications. The advancements in spectral prediction and adversarial training could also inspire further research in audio processing and enhancement.
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
The main contribution of this paper is the introduction of TTScore, a novel evaluation framework for synthesized speech that provides targeted assessments of intelligibility and prosody through conditional prediction of discrete speech tokens. This work significantly advances the field by addressing the limitations of existing metrics and aligning more closely with human perceptions of speech quality.
The paper introduces TTScore, a novel evaluation framework that utilizes conditional prediction of discrete speech tokens to assess intelligibility and prosody in synthesized speech. The methodology is well-structured, employing two distinct sequence-to-sequence models tailored for intelligibility (TTScore-int) and prosody (TTScore-pro). This targeted approach addresses the limitations of existing metrics, such as WER and F0-RMSE, by providing reference-free evaluations that align more closely with human perception. The use of discrete speech tokens derived from advanced models like HuBERT and FACodec adds robustness to the evaluation process.
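Conceptually, each TTScore variant reduces to scoring the likelihood of an utterance's discrete tokens under a text-conditioned predictor. A minimal sketch, assuming a hypothetical teacher-forced seq2seq `predictor`, is shown below.

```python
import torch
import torch.nn.functional as F

def ttscore(predictor, text_ids, speech_token_ids):
    """Sketch of the TTScore idea: score an utterance by the likelihood of its
    discrete speech tokens under a text-conditioned sequence predictor.

    predictor: hypothetical seq2seq model; given (text_ids, speech_token_ids)
        with teacher forcing it returns logits of shape (1, L, vocab_size).
    speech_token_ids: LongTensor of shape (1, L) -- content tokens for
        TTScore-int, prosody tokens for TTScore-pro.
    Returns the mean token log-likelihood (higher = better alignment).
    """
    with torch.no_grad():
        logits = predictor(text_ids, speech_token_ids)
        log_probs = F.log_softmax(logits, dim=-1)
        token_ll = log_probs.gather(-1, speech_token_ids.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()
```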
The experiments are comprehensive, utilizing multiple benchmarks (SOMOS, VoiceMOS, TTSArena) to validate the effectiveness of the proposed metrics. The paper reports strong correlations between TTScore metrics and human judgments of speech quality, outperforming traditional metrics. The evaluation setup is rigorous, comparing TTScore against established baselines and demonstrating its reliability across diverse datasets. However, the paper could benefit from a more detailed analysis of the statistical significance of the results.
The authors provide a GitHub repository with code and pre-trained models, enhancing the reproducibility of their work. Implementation details are sufficiently described, including model architectures and training procedures. However, the paper lacks specific hyperparameter settings and training configurations that could further aid in reproducing the results.
One limitation is the reliance on existing datasets for evaluation, which may not encompass all variations of synthesized speech. Additionally, while TTScore shows improved correlations with human judgments, it may still be sensitive to the quality of the underlying speech synthesis systems. The paper does not address potential biases in the datasets used for training and evaluation.
The proposed evaluation framework has significant implications for the field of speech synthesis, offering a more nuanced understanding of intelligibility and prosody. This can lead to improved speech generation systems, enhancing applications in assistive technologies, human-computer interaction, and language learning. The methodology could also inspire further research into targeted evaluation metrics in other domains of machine learning.
This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronger ASR performance, and faster inference. We provide a comprehensive analysis of applying DDMs to speech reconstruction, examining sampler choices, inference steps, and robustness to length-scale estimation errors. Furthermore, we improve the original TASTE by systematically comparing vector quantization modules, showing that FSQ yields up to a 35% relative WER reduction and +0.14 UT-MOS improvement over RVQ for AR models, while also enhancing DDM performance. Our model generates speech in just 10 denoising steps and even supports single-step generation with only minor quality degradation.
This paper presents a pioneering application of discrete diffusion models to speech tokenization and reconstruction, showcasing substantial improvements in efficiency and quality over traditional autoregressive methods. The comprehensive methodology and experimental validation contribute significantly to the field, paving the way for future research and applications in speech technology.
The paper introduces a novel discrete diffusion model (DDM) framework for speech tokenization and reconstruction, effectively replacing traditional autoregressive decoders with a more efficient DDM approach. The methodology is well-structured, providing a comprehensive analysis of various aspects such as sampler choices and vector quantization techniques. The use of finite scalar quantization (FSQ) as an alternative to residual vector quantization (RVQ) is a significant methodological improvement that enhances performance metrics like WER and UT-MOS. The detailed exploration of inference settings and robustness to length-scale estimation errors further strengthens the methodology's rigor.
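For readers unfamiliar with FSQ, the core operation is small enough to sketch: each latent dimension is bounded and rounded to a fixed number of levels, yielding an implicit codebook with no codebook or commitment losses. The snippet below is a generic FSQ sketch, not the paper's configuration.

```python
import torch

def fsq_quantize(z, levels=(8, 5, 5, 5)):
    """Minimal finite scalar quantization (FSQ) sketch.

    z: latent of shape (..., len(levels)).  Each dimension is bounded with
    tanh and rounded to a fixed number of levels, so the codebook is implicit
    (prod(levels) codes) -- one reason FSQ is an attractive drop-in
    alternative to residual vector quantization.
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half                 # each dim in [-half, half]
    quantized = torch.round(bounded)
    # straight-through estimator so gradients still reach the encoder
    return bounded + (quantized - bounded).detach()
```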
The experiments are robust, utilizing a large dataset (Granary English-only) and employing various evaluation metrics (WER, PESQ, MOS, etc.) to assess performance comprehensively. The comparison between AR and DDM models is well-articulated, showing clear advantages in both reconstruction quality and inference speed. The results substantiate the claims made in the paper, demonstrating the effectiveness of DDMs in speech applications. However, the paper could benefit from additional comparisons with state-of-the-art models beyond the baseline.
The paper provides sufficient details regarding the training setup, including the number of GPUs used, batch sizes, and training procedures. However, the absence of a publicly accessible code repository limits full reproducibility, as other researchers may struggle to replicate the results without the exact implementation details.
One limitation is the reliance on a specific dataset that may not generalize across all speech applications. Additionally, the paper does not address potential challenges in real-world applications, such as handling diverse accents or noisy environments beyond the dataset used. The assumption that the model has access to global S3 token lengths during inference may also pose practical challenges.
The proposed DDM-based TASTE framework has significant implications for the field of speech processing, particularly in applications requiring efficient and high-quality speech synthesis and recognition. The advancements could lead to improvements in voice assistants, automated transcription services, and other speech-related technologies, ultimately enhancing user experiences in various domains.
This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, outperforming standard mel-scale and fully-learnable alternatives. On an expert-curated 70-species dataset, WrenNet achieves up to 90.8\% accuracy on acoustically distinctive species and 70.1\% on the full task. When deployed on an AudioMoth device ($\leq$1MB RAM), it consumes only 77mJ per inference. Moreover, the proposed model is over 16x more energy-efficient compared to BirdNET when running on a Raspberry Pi 3B+. This work demonstrates the first practical framework for continuous, multi-species acoustic monitoring on low-power edge devices.
Primary: University of Trento
All Institutions: University of Trento
The main contribution of this paper is the introduction of WrenNet, a novel neural network architecture that enables efficient multi-species bird audio classification on low-power devices, significantly advancing the field of bioacoustic monitoring. This work is notable for its innovative methodology and practical applications, addressing critical challenges in environmental monitoring with a focus on energy efficiency and real-time processing.
The methodology presented in this paper is robust, featuring a well-thought-out neural architecture (WrenNet) that addresses the specific challenges of multi-species bird classification on low-power devices. The introduction of a semi-learnable spectral feature extractor is particularly innovative, allowing for adaptive frequency mapping that enhances the model's performance on avian vocalizations. The use of causal convolutions and a unidirectional GRU for temporal processing is a strong choice for maintaining memory efficiency while ensuring real-time processing capabilities. The paper effectively combines deep learning techniques with practical constraints of edge devices, showcasing a thoughtful approach to system-algorithm co-design.
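A hypothetical sketch of what a semi-learnable spectral front-end could look like is given below: a fixed STFT followed by a filterbank whose center frequencies and bandwidths are trainable parameters. The parameterization and initialization are assumptions for illustration only; the paper's exact design is not reproduced here.

```python
import math
import torch
import torch.nn as nn

class SemiLearnableFilterbank(nn.Module):
    """Illustrative semi-learnable spectral front-end: a fixed STFT followed
    by a filterbank whose center frequencies and bandwidths are trainable,
    letting the frequency warping adapt to avian vocalizations."""

    def __init__(self, n_fft=512, n_filters=40, sample_rate=16000):
        super().__init__()
        self.n_fft = n_fft
        # centers initialized on a log-frequency axis from 200 Hz to Nyquist
        centers = torch.logspace(math.log10(200.0),
                                 math.log10(sample_rate / 2), n_filters)
        self.centers = nn.Parameter(centers)
        self.bandwidths = nn.Parameter(torch.full((n_filters,), 300.0))
        self.register_buffer(
            "freqs", torch.linspace(0, sample_rate / 2, n_fft // 2 + 1))

    def forward(self, wave):                       # wave: (T,) mono waveform
        window = torch.hann_window(self.n_fft, device=wave.device)
        spec = torch.stft(wave, self.n_fft, window=window,
                          return_complex=True).abs()           # (freq, frames)
        fb = torch.exp(-0.5 * ((self.freqs[None, :] - self.centers[:, None])
                               / self.bandwidths[:, None]) ** 2)
        return torch.log(fb @ spec + 1e-6)                      # (n_filters, frames)
```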
The experimental evaluation is comprehensive, utilizing a well-curated dataset of 70 species and demonstrating the model's performance through various benchmarks. The accuracy results are promising, particularly for acoustically distinctive species, and the energy consumption metrics highlight the practical viability of the proposed system. The comparison with existing models like BirdNET, showcasing a significant reduction in energy consumption, adds to the credibility of the results. However, the paper could benefit from more detailed discussions on the statistical significance of the results and potential variations in performance across different environments.
The paper provides a clear overview of the experimental setup, including the dataset creation and training processes. The availability of scripts in a public repository enhances reproducibility. However, more detailed documentation on the specific configurations used for training and testing would further facilitate replication of the results by other researchers.
While the paper presents a significant advancement, it does have limitations. The model's performance on the full dataset (70 species) shows a drop in accuracy, indicating that further refinement may be necessary for closely related species. Additionally, the reliance on a specific dataset may limit generalizability to other geographical regions or bird species. The energy consumption metrics, while impressive, could vary significantly with different environmental conditions and hardware configurations.
The implications of this work are substantial for biodiversity monitoring and conservation efforts. By enabling real-time, low-power classification of bird species, this technology can facilitate large-scale ecological studies and contribute to the understanding of avian populations and their habitats. The approach could be extended to other wildlife monitoring applications, potentially transforming how ecological data is collected and analyzed.
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust across speech tokenizations derived from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
The paper presents a significant contribution to the field of TTS by introducing a new metric for assessing prosody diversity, which is crucial for improving the naturalness of synthesized speech. The methodology is innovative, and the experimental results support its effectiveness, marking a meaningful advancement in the evaluation of TTS systems.
The paper introduces a novel metric, Discretized Speech Weighted Edit Distance (DS-WED), which is a significant advancement in measuring prosody diversity in zero-shot TTS systems. The methodology is robust, leveraging weighted edit distance over semantic tokens, and is well-supported by the creation of the ProsodyEval dataset, which includes human ratings that enhance the reliability of the metric. The approach is methodologically sound, addressing a gap in the current literature regarding the correlation between acoustic metrics and human perception of prosody.
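The dynamic program underlying a weighted edit distance over semantic tokens is standard and is sketched below; the specific substitution weights and token pairing strategy used by DS-WED are not reproduced here.

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Weighted edit distance between two discrete-token sequences.

    sub_cost(x, y): substitution cost between tokens (0 when x == y).
    """
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost,
                          D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return D[n][m]

# A diversity score for a TTS system could then be the average pairwise
# distance over several generations of the same text, computed on HuBERT or
# WavLM semantic tokens.
```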
The experiments conducted on the ProsodyEval dataset are comprehensive, featuring 1000 speech samples and 2000 human ratings, which provide a solid foundation for evaluating the proposed metric. The results demonstrate that DS-WED correlates more strongly with human judgments than existing metrics, showcasing its effectiveness. Additionally, the benchmarking of state-of-the-art TTS systems reveals practical applications of the metric, although further details on the experimental setup could enhance transparency.
The paper provides sufficient details regarding the dataset and the proposed metric, which aids in reproducibility. However, the absence of a publicly available code repository limits the ease of reproduction for the proposed methods and results. Including implementation details or a link to a code repository would significantly enhance reproducibility.
One limitation noted is the reliance on human ratings, which can introduce variability and subjectivity into the evaluation process. Additionally, the paper mentions that current large audio language models (LALMs) are limited in capturing prosodic variations, indicating an area for further research. The scope of the dataset, while substantial, may not encompass all potential prosodic variations present in diverse languages and accents.
The development of a reliable metric for prosody diversity has significant implications for the TTS field, potentially enhancing the naturalness and expressiveness of synthesized speech. This work could influence future research directions in TTS systems, particularly in improving user experience and accessibility for diverse populations. The findings may also encourage further exploration into the integration of prosody in other areas of machine learning and natural language processing.
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology, Ho Chi Minh City University of Technology, AITech Lab
The main contribution of this paper is the introduction of MAGE, a novel Masked Audio Generative Enhancer that utilizes a scarcity-aware coarse-to-fine masking strategy and a lightweight corrector module, achieving state-of-the-art performance in speech enhancement with a significantly reduced model size. This work represents a meaningful advancement in the field of generative speech enhancement, balancing efficiency and perceptual quality, and setting a foundation for future research in practical applications.
The methodology presented in this paper is innovative, particularly with the introduction of the scarcity-aware coarse-to-fine masking strategy. This approach addresses the limitations of traditional masked generative models by prioritizing token frequencies, which enhances both efficiency and generalization. The inclusion of a lightweight corrector module for low-confidence predictions is a significant advancement, allowing for iterative refinement of predictions. The architecture is built upon established models like BigCodec and Qwen2.5-0.5B, yet the selective layer retention to achieve a compact model size of 200M parameters is a notable achievement in balancing performance and efficiency.
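A conceptual sketch of scarcity-aware iterative decoding with a confidence-based corrector is given below. The commit schedule, the frequency bias, and the single corrector pass are illustrative assumptions about how such a scheme could be realized, not MAGE's actual inference code.

```python
import torch

def masked_generative_decode(model, tokens, mask_id, token_freq, steps=8,
                             conf_threshold=0.9):
    """Conceptual sketch of scarcity-aware iterative decoding with a corrector.

    tokens: (T,) LongTensor with masked positions set to mask_id
    token_freq: (vocab,) normalized corpus frequencies; early steps favor
        committing frequent tokens, leaving rare tokens for later refinement.
    """
    for step in range(steps):
        masked = tokens == mask_id
        if masked.sum() == 0:
            break
        probs = model(tokens).softmax(-1)            # (T, vocab)
        conf, pred = probs.max(-1)
        # scarcity-aware priority: the frequency bias decays over the schedule
        score = conf + token_freq[pred] * (1 - step / steps)
        score = torch.where(masked, score, torch.full_like(score, -1e9))
        k = max(1, int(masked.sum().item() * (step + 1) / steps))
        commit = score.topk(k).indices
        tokens[commit] = pred[commit]

    # corrector: re-mask low-confidence positions and refine them once more
    conf, _ = model(tokens).softmax(-1).max(-1)
    tokens[conf < conf_threshold] = mask_id
    still_masked = tokens == mask_id
    tokens[still_masked] = model(tokens).softmax(-1).max(-1).indices[still_masked]
    return tokens
```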
The experimental evaluation is robust, utilizing well-established benchmarks such as the DNS Challenge and noisy LibriSpeech. The results demonstrate that MAGE outperforms larger baselines in terms of perceptual quality and word error rate, which is critical for downstream applications like speech recognition. The paper effectively compares MAGE against both discriminative and generative models, providing comprehensive metrics that highlight its advantages. However, the reliance on simulated distortions raises questions about its real-world applicability.
The paper provides sufficient details regarding the implementation, including model architecture, training parameters, and evaluation metrics. However, the lack of a publicly available code repository limits full reproducibility. The authors should consider releasing their code to facilitate further research and validation of their findings.
While MAGE shows strong results, its performance may be limited by its training on simulated data, which could affect generalization to real-world scenarios. Additionally, the evaluation metrics focus primarily on perceptual quality and WER, potentially overlooking other important aspects of speech enhancement like latency and computational efficiency in practical applications.
The advancements presented in this paper could have significant implications for real-world applications in speech enhancement, particularly in environments with background noise or reverberation. The compact design of MAGE makes it suitable for deployment in resource-constrained settings, which is crucial for applications like mobile devices and real-time communication systems. The potential for future extensions to multilingual and streaming scenarios further enhances its relevance in diverse applications.
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy that explicitly regularizes pitch redundancy between the acoustic prompt and melody input. Furthermore, we incorporate a lightweight encoder-only Singing Voice Transcription (SVT) module to align acoustic tokens with pitch and duration, offering fine-grained frame-level supervision. Experimental results demonstrate that CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.
Primary: National University of Singapore
All Institutions: National University of Singapore
CoMelSinger presents a novel framework for zero-shot singing voice synthesis that effectively addresses melody control and prosody leakage. The combination of innovative methodology and promising experimental results positions this work as a significant contribution to the field of machine learning in audio synthesis.
The methodology presented in CoMelSinger is innovative, leveraging a non-autoregressive MaskGCT architecture to replace traditional text inputs with discrete lyric and pitch tokens. This approach effectively addresses the challenge of prosody leakage by introducing a coarse-to-fine contrastive learning strategy, which regularizes pitch redundancy. The incorporation of a lightweight encoder-only Singing Voice Transcription (SVT) module for frame-level supervision is a significant enhancement, allowing for better alignment of acoustic tokens with pitch and duration. Overall, the methodology is well-structured and demonstrates a clear understanding of the challenges in singing voice synthesis.
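One plausible instantiation of a contrastive regularizer against prosody leakage is sketched below: it penalizes the similarity between an utterance's timbre-prompt embedding and its own pitch embedding relative to other utterances in the batch. This is an illustrative reading of the coarse-to-fine contrastive strategy, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pitch_disentangle_loss(prompt_emb, pitch_emb, temperature=0.1):
    """Illustrative contrastive regularizer against prosody leakage.

    prompt_emb: (B, D) timbre-prompt embeddings
    pitch_emb:  (B, D) pitch/melody embeddings of the same utterances
    Minimizing the returned value lowers the probability that a prompt
    embedding is matched to its own utterance's pitch embedding, i.e. it
    discourages pitch information from being encoded in the prompt pathway.
    """
    p = F.normalize(prompt_emb, dim=-1)
    q = F.normalize(pitch_emb, dim=-1)
    sim = p @ q.t() / temperature                     # (B, B) cosine similarities
    log_probs = F.log_softmax(sim, dim=-1)
    idx = torch.arange(sim.size(0), device=sim.device)
    return log_probs[idx, idx].mean()                 # log-prob of the matching pair
```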
The experimental setup is robust, with comprehensive evaluations against competitive baselines. The results indicate notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability, which are critical metrics in the field of singing voice synthesis. However, the paper could benefit from a more detailed analysis of the datasets used and the specific metrics employed to quantify improvements, as this would enhance the credibility of the findings.
While the paper outlines the methodology and experimental results, it lacks sufficient implementation details that would facilitate reproducibility. Key aspects such as hyperparameter settings, data preprocessing steps, and code availability are not mentioned, which could hinder other researchers from replicating the study.
One limitation is the potential overfitting to the training data, particularly in the context of zero-shot learning. The paper does not address how the model performs with unseen data outside of the training distribution. Additionally, the reliance on a discrete token-based approach may limit the expressiveness of the generated singing voices compared to continuous representations.
The advancements made in CoMelSinger have the potential to significantly impact the fields of music technology and artificial intelligence, particularly in applications such as music composition, voice cloning, and interactive entertainment. The ability to generate expressive singing voices with structured control could lead to new creative tools for artists and musicians, enhancing the accessibility of music production.
The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this critical region. This omission is critical: detection models trained on high-resource languages collapse when applied to SEA, due to mismatches in synthesis quality, language-specific characteristics, and data scarcity. To close this gap, we present SEA-Spoof, the first large-scale Audio Deepfake Detection (ADD) dataset especially for SEA languages. SEA-Spoof spans 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese. Spoof samples are generated from a diverse mix of state-of-the-art open-source and commercial systems, capturing wide variability in style and fidelity. Benchmarking state-of-the-art detection models reveals severe cross-lingual degradation, but fine-tuning on SEA-Spoof dramatically restores performance across languages and synthesis sources. These results highlight the urgent need for SEA-focused research and establish SEA-Spoof as a foundation for developing robust, cross-lingual, and fraud-resilient detection systems.
Primary: Institute for Infocomm Research (I2R)
All Institutions: The University of New South Wales, Nanyang Technological University, Institute for Infocomm Research (I2R), Alibaba Group
The paper presents SEA-Spoof, the first large-scale dataset for audio deepfake detection in six South-East Asian languages, filling a critical gap in existing resources and demonstrating significant improvements in detection performance through fine-tuning. The comprehensive methodology and experimental validation highlight its importance for advancing research in multilingual audio deepfake detection.
The methodology is robust, focusing on the creation of the SEA-Spoof dataset, which is a significant contribution to the field of audio deepfake detection. The authors carefully selected six South-East Asian languages based on linguistic diversity, population coverage, and practical relevance. The dataset construction is thorough, utilizing a mix of state-of-the-art open-source and commercial systems to generate spoofed audio, which ensures a wide variability in synthesis quality. The systematic pairing of real and spoofed audio for controlled evaluations is a strong methodological aspect that enhances the dataset's utility for future research.
The experimental evaluation is comprehensive, benchmarking multiple state-of-the-art models against the newly created SEA-Spoof dataset. The results clearly demonstrate the cross-lingual performance degradation of existing models when applied to SEA languages, validating the necessity of the dataset. Fine-tuning experiments show significant improvements in model performance, underscoring the dataset's effectiveness as a diagnostic tool and a resource for enhancing detection capabilities.
The paper provides sufficient details on the dataset's construction and the experimental setup, including the models used for benchmarking and the training protocols. However, the lack of a publicly available code repository limits the full reproducibility of the experiments. While the dataset is accessible, the absence of implementation details for the models may hinder other researchers from replicating the study completely.
One limitation is the focus on only six languages, which, while significant, does not cover the entire spectrum of languages in the SEA region. Additionally, the dataset's reliance on specific synthesis systems may introduce biases that could affect generalizability. The paper also mentions plans for future work, indicating that the dataset may evolve, but the current version may not be exhaustive.
The creation of SEA-Spoof has the potential to significantly impact the field of audio deepfake detection, particularly in multilingual contexts. By addressing the gap in resources for SEA languages, the dataset can facilitate the development of more effective detection systems tailored to the unique characteristics of these languages. This work emphasizes the importance of regional focus in AI research and could lead to broader applications in security, fraud detection, and speech technology.
Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured from a deployed AI assistant, the dataset comprises 5,805 annotated sessions with synchronized user and mixed-channel views, RTTM/CTM timing, and role labels. We introduce a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio for long-context recognition. ASR evaluation includes WER, CER, and HC-WER, which measures concept-level accuracy across healthcare settings. LLM-generated responses are assessed using rubric-based and pairwise protocols. MMedFD establishes a reproducible framework for benchmarking streaming ASR and end-to-end duplex agents in healthcare deployment. The dataset and related resources are publicly available at https://github.com/Kinetics-JOJO/MMedFD
The main contribution of this paper is the introduction of MMedFD, a novel healthcare ASR corpus and a robust framework for evaluating multi-turn, full-duplex speech recognition systems. This work addresses a critical gap in the ASR field, particularly in clinical dialogue, and lays the groundwork for future advancements in healthcare communication technologies.
The paper introduces a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, which is a significant contribution to the field of ASR in healthcare. The use of the Whisper-small model fine-tuned on role-concatenated audio for long-context recognition is innovative, addressing the challenges of multi-turn and full-duplex interactions in clinical settings. The methodology is well-structured and clearly articulated, although it would benefit from more detailed comparisons with existing methods.
The experiments are comprehensive, utilizing a dataset of 5,805 annotated sessions, which is substantial for the domain. The evaluation metrics, including WER, CER, and HC-WER, are appropriate for assessing ASR performance in healthcare. However, the paper could enhance its impact by providing more detailed results and comparisons with baseline models to better illustrate the effectiveness of the proposed methods.
The authors have made the dataset and related resources publicly available, which is commendable for reproducibility. However, the paper lacks detailed implementation instructions or code snippets that would facilitate replication of the results by other researchers. Including such details would strengthen the reproducibility aspect significantly.
One limitation is the focus on a specific language (Chinese), which may restrict the generalizability of the findings to other languages or dialects. Additionally, while the dataset is substantial, the paper does not discuss potential biases in the data collection process or the diversity of the speakers involved, which could affect the model's performance in real-world applications.
The development of MMedFD has the potential to significantly impact the healthcare sector by improving the efficiency and accuracy of ASR systems in clinical dialogues. This could lead to better patient interactions and streamlined workflows in healthcare settings. The framework established for benchmarking streaming ASR can also encourage further research and development in this area.
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.
Primary: Microsoft CoreAI
All Institutions: Microsoft CoreAI
The main contribution of this work is a novel multi-stage reinforcement learning framework that significantly enhances the speech summarization capabilities of multi-modal large language models. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and natural language processing.
The proposed methodology introduces a multi-stage reinforcement learning framework that effectively enhances speech summarization capabilities in multi-modal large language models (MLLMs). The combination of supervised fine-tuning on synthetic data, on-policy knowledge distillation, and Direct Preference Optimization is innovative and addresses key challenges in the field, such as error propagation and modality gaps. The approach is well-structured and leverages existing models and techniques, showcasing a thoughtful integration of various methodologies to improve performance.
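A minimal sketch of the on-policy knowledge distillation component mentioned above is given below; the shared tokenizer, toy tensors, and plain forward-KL objective are assumptions for illustration and not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def on_policy_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) averaged over student-sampled tokens."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

# toy per-token vocabulary logits on a summary sampled from the student
student_logits = torch.randn(8, 32000, requires_grad=True)  # (seq_len, vocab)
teacher_logits = torch.randn(8, 32000)                       # teacher scores on the same tokens
loss = on_policy_kd_loss(student_logits, teacher_logits)
loss.backward()
```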
The experimental evaluation is robust, utilizing multiple benchmarks (Golden3, AMI, and FLORAS) to assess the model's performance. The paper provides a thorough comparison with both open-source and state-of-the-art systems, demonstrating significant performance improvements. The ablation studies further validate the effectiveness of each component of the proposed framework, highlighting the importance of data quality and the choice of teacher models in knowledge distillation.
While the paper provides detailed descriptions of the training processes and datasets, it lacks specific URLs for code or datasets, which could hinder reproducibility. The absence of a public repository or demo limits the ability for other researchers to replicate the results independently. However, the methodology is described in sufficient detail for knowledgeable practitioners to implement similar experiments.
The paper acknowledges issues such as hallucinations and reward hacking, which are common in reinforcement learning settings. While the proposed methods mitigate these issues, they do not completely eliminate them. Additionally, the focus on English-only data in training may limit the model's applicability in multilingual contexts, despite showing some cross-lingual generalization.
The advancements in speech summarization have significant implications for accessibility, productivity, and information retrieval in various domains, including education, business, and media. The ability to generate coherent summaries from spoken content can enhance user experiences and facilitate better information management in an increasingly audio-centric world.
We present ChiReSSD, a speech reconstruction framework that preserves child speakers' identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show a Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.
Primary: Language Technologies Institute
All Institutions: Department of Computer Science, Language Technologies Institute, University of Texas at Austin, Analytical Imaging and Modeling Center, Carnegie Mellon University, Children's Health, Department of Plastic Surgery, University of Texas Southwestern Medical Center
The main contribution of this paper is the introduction of ChiReSSD, a novel speech reconstruction framework that effectively addresses the unique challenges of disordered speech in children while preserving speaker identity. This work represents a meaningful advancement in the intersection of machine learning and clinical speech pathology, with the potential to significantly impact both research and practical applications in the field.
The methodology presented in this paper is innovative, leveraging a modified version of StyleTTS2 to specifically address the challenges of reconstructing speech for children with speech sound disorders (SSD). The framework's ability to disentangle acoustic and prosodic features while preserving speaker identity is a significant advancement over traditional methods that often fail to account for the unique characteristics of children's speech. The adaptation of the model to handle the higher pitch and prosodic patterns of child speech is well-justified and effectively executed. However, the paper could benefit from a more detailed description of the training process and hyperparameter tuning, as these are critical for replicating the results.
The experimental evaluation is robust, utilizing multiple datasets (STAR, UltraSuite, and TORGO) to demonstrate the effectiveness of ChiReSSD across different populations. The results show substantial improvements in lexical accuracy and speaker identity preservation, with clear metrics such as WER, CER, and PCC providing quantitative support for the claims made. The correlation of automatic evaluations with human expert annotations (Pearson correlation of 0.63) is particularly noteworthy, as it suggests a practical application for reducing manual transcription efforts in clinical settings. The experiments are well-structured, but the paper could enhance clarity by providing more context for the choice of evaluation metrics.
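The correlation analysis referenced above amounts to comparing automatic and expert PCC estimates with a Pearson coefficient; the short sketch below uses scipy with purely illustrative numbers, not values from the STAR dataset.

```python
from scipy.stats import pearsonr

# illustrative per-speaker PCC estimates (not values from the paper)
automatic_pcc = [0.62, 0.71, 0.55, 0.80, 0.47, 0.68]
expert_pcc    = [0.60, 0.75, 0.50, 0.78, 0.52, 0.66]

r, p_value = pearsonr(automatic_pcc, expert_pcc)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```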
While the paper provides a general overview of the methods and datasets used, it lacks specific implementation details that would aid in reproducibility. For instance, the exact configurations of the model training, including learning rates, batch sizes, and the specific architecture of the StyleTTS2 modifications, are not thoroughly detailed. Including a supplementary material section with code snippets or a link to a repository would greatly enhance reproducibility.
One limitation of the study is the reliance on specific datasets that may not fully represent the diversity of speech disorders in children. The generalization to adult dysarthric speech is promising, but the paper does not address potential limitations in applying the model to other populations or languages. Additionally, while the model shows improvements in phonetic accuracy, residual errors remain, and the paper suggests future work to address these, indicating that the current version may not be fully optimized.
The implications of this research are significant, particularly in the fields of speech-language pathology and assistive technology. By providing a framework that can improve the intelligibility of disordered speech while preserving the speaker's identity, ChiReSSD has the potential to enhance communication for children with SSD, thereby improving their social and academic outcomes. Furthermore, the ability to automate clinical evaluations could alleviate some of the burdens on speech-language therapists, allowing them to focus on more complex cases.
Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising alternative to next-token prediction, avoiding the technical complexities associated with discrete speech tokenization. As a relatively new paradigm, research on reinforcement learning (RL)-based fine-tuning of speech ARDMs remains limited. In this paper, we propose Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO) to advance this research. By fine-tuning the recently proposed zero-shot text-to-speech model DiTAR with DPO, we achieve significant improvements in terms of speech expressiveness and robustness for long texts.
Primary: The Chinese University of Hong Kong
All Institutions: School of Data Science, ByteDance Seed, The Chinese University of Hong Kong, School of Artificial Intelligence, Nanjing University
The main contribution of this paper is the introduction of ARDM-DPO, a novel method for fine-tuning autoregressive diffusion models in speech generation, which enhances expressiveness and robustness while addressing the challenges of traditional TTS systems. The comprehensive evaluation of the method demonstrates its potential impact on the field of audio generation and reinforces the importance of preference alignment in machine learning models.
The proposed method, Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO), represents a significant advancement in the application of autoregressive diffusion models for text-to-speech (TTS) systems. The methodology effectively integrates reinforcement learning principles to fine-tune the DiTAR model, addressing the limitations of traditional next-token prediction approaches. The authors provide a clear framework for preference alignment, which is critical for enhancing the expressiveness and robustness of generated speech. However, the paper could benefit from a more detailed discussion on the implementation specifics of DPO in the context of ARDMs, as well as a deeper exploration of the underlying assumptions made during model training.
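For concreteness, the generic DPO objective discussed here can be written over sequence log-probabilities of a preferred and a dispreferred synthesis, as in the sketch below; how DiTAR exposes such log-probabilities for continuous-token diffusion is the paper's actual contribution and is not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed sequence log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# toy log-probabilities for one preference pair (chosen vs. rejected synthesis)
loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-47.5]),
                torch.tensor([-43.0]), torch.tensor([-45.0]))
print(loss.item())
```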
The experiments are well-structured, utilizing comprehensive datasets and benchmarks to evaluate the performance of ARDM-DPO against baseline methods. The authors present quantitative metrics such as F0 variance and character error rate, alongside qualitative assessments through listener evaluations, which provide a balanced view of the model's performance. The results indicate significant improvements in expressiveness and robustness, although the paper notes some instability in training, which warrants further investigation. The use of a large preference dataset strengthens the findings, but additional comparisons with more baseline models could enhance the robustness of the conclusions drawn.
The paper provides a reasonable level of detail regarding the experimental setup, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing the model and code to facilitate further research and validation by the community.
The paper acknowledges the instability of the ARDM-DPO training process, particularly in Task A, which can lead to degradation in speech quality. This instability raises questions about the robustness of the method in practical applications. Additionally, the reliance on preference datasets for training may introduce biases that affect the generalizability of the model. The authors also mention the need for early stopping, which could complicate the training process.
The advancements presented in this paper have the potential to significantly improve TTS systems, making them more expressive and aligned with human preferences. This could enhance applications in various fields, including virtual assistants, audiobooks, and entertainment. The work contributes to the growing body of research on autoregressive diffusion models, potentially influencing future developments in multimodal generation tasks.
Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite recent advancements, state-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. Moreover, collecting and annotating non-normative speech is burdensome: speaking is effortful for many affected individuals, while laborious annotation often requires caregivers familiar with the speaker. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. We validate our method on the English UA-Speech dataset and a newly collected German speech dataset, BF-Sprache, from a child with structural speech impairment. The dataset and approach are designed to reflect the challenges of low-resource settings that include individuals with speech impairments. Our method significantly improves ASR accuracy for impaired speech while maintaining data and annotation efficiency, offering a practical path toward inclusive ASR.
Primary: University of Zurich
All Institutions: Technical University of Munich, School of Computation, Information and Technology; Institute of Neuroinformatics, University of Zurich and ETH Zurich; University of Zurich, Department of Computational Linguistics
This paper presents a novel Bayesian Low-rank Adaptation framework for personalized impaired speech recognition, significantly improving ASR accuracy while addressing the challenges of data scarcity and variability in non-normative speech. The methodology and results contribute meaningfully to the field, offering practical solutions for inclusive communication technologies.
The proposed methodology introduces a novel Bayesian Low-rank Adaptation (VI LoRA) framework, which effectively addresses the challenges of data scarcity and high variability in impaired speech recognition. The incorporation of variational inference to estimate the posterior distributions of adaptation parameters is a significant advancement over traditional low-rank adaptation methods. The dual prior approach for layer-wise weight variations is particularly innovative, allowing for a more informed adaptation process. However, the assumption of independence in the factorization of the variational parameters may limit the model's ability to capture complex interactions between layers.
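A hedged sketch of what a variational low-rank adapter can look like is given below: the LoRA factors carry a mean and a log-variance, are sampled via the reparameterization trick, and a KL term to a Gaussian prior regularizes adaptation. The paper's dual, layer-wise priors and exact factorization are not reproduced; dimensions and initializations are illustrative.

```python
import torch
import torch.nn as nn

class VariationalLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                       # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.a_mu = nn.Parameter(torch.zeros(rank, d_in))
        self.a_logvar = nn.Parameter(torch.full((rank, d_in), -6.0))
        self.b_mu = nn.Parameter(torch.zeros(d_out, rank))
        self.b_logvar = nn.Parameter(torch.full((d_out, rank), -6.0))

    def _sample(self, mu, logvar):
        # reparameterization trick: mu + sigma * eps
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all adapter weights
        kl = 0.0
        for mu, logvar in [(self.a_mu, self.a_logvar), (self.b_mu, self.b_logvar)]:
            kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum()
        return kl

    def forward(self, x):
        a = self._sample(self.a_mu, self.a_logvar)
        b = self._sample(self.b_mu, self.b_logvar)
        return self.base(x) + x @ a.t() @ b.t()

layer = VariationalLoRALinear(nn.Linear(256, 256))
output = layer(torch.randn(4, 256))
kl_penalty = layer.kl()   # added to the task loss during fine-tuning
```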
The experiments are well-structured, utilizing two distinct datasets (UA-Speech and BF-Sprache) that highlight the effectiveness of the proposed method across different languages and intelligibility levels. The comparative analysis against various baselines, including full fine-tuning and standard LoRA, demonstrates the robustness and efficiency of VI LoRA, particularly in low-data scenarios. The results indicate substantial improvements in word and character error rates, especially for speakers with very low intelligibility, underscoring the practical applicability of the method.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility. The absence of a publicly available code repository or demo URL further hinders the ability of other researchers to replicate the findings. Clearer documentation of hyperparameters, training procedures, and data preprocessing steps would enhance reproducibility.
The study acknowledges limitations related to the small speaker pool in the BF-Sprache dataset, which may affect the generalizability of the findings. Additionally, the reliance on contextual pattern matching in existing ASR systems could hinder language learning for children with speech impairments. The assumption of independent factorization in the variational parameters may not fully capture the complexities of the model, potentially impacting performance.
The proposed framework has significant implications for the development of inclusive ASR systems that can accommodate individuals with speech impairments. By improving recognition accuracy and maintaining data efficiency, the method can enhance communication for affected individuals, fostering social inclusion and educational opportunities. The approach also opens avenues for further research in low-resource speech recognition across languages, contributing to the broader field of assistive technologies.
Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult and use these estimates for a targeted oversampling strategy. We validate our method on English and German datasets. Crucially, we demonstrate that our model-derived uncertainty strongly correlates with phonemes identified as challenging in an expert clinical logopedic report, marking, to our knowledge, the first work to successfully align model uncertainty with expert assessment of speech difficulty. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
Primary: University of Zurich
All Institutions: Technical University of Munich, School of Computation, Information and Technology; Institute of Neuroinformatics, University of Zurich and ETH Zurich; University of Zurich, Department of Computational Linguistics
This paper presents a novel framework for data-efficient ASR personalization that utilizes uncertainty-based phoneme difficulty scoring to improve recognition accuracy for non-normative speech. The integration of clinical validation with machine learning techniques represents a meaningful contribution to both the fields of speech recognition and assistive technology.
The methodology is robust, leveraging Monte Carlo Dropout to quantify phoneme-level uncertainty, which is innovative in the context of ASR personalization for non-normative speech. The introduction of the Phoneme Difficulty Score (PhDScore) is a significant advancement, as it combines multiple uncertainty metrics to guide oversampling effectively. The approach to link model uncertainty with clinical assessments is particularly noteworthy and demonstrates a thoughtful integration of machine learning with clinical insights.
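The Monte Carlo Dropout idea described above can be sketched as follows: dropout is kept active at inference, several stochastic passes are collected, and per-frame disagreement is summarized. The toy phoneme classifier and the use of predictive entropy as the difficulty signal are assumptions; the paper's PhDScore aggregates several such uncertainty metrics.

```python
import torch
import torch.nn as nn

def mc_dropout_entropy(model, frames, n_passes=20):
    """Run stochastic forward passes with dropout active and return predictive entropy."""
    model.train()                                  # keep dropout layers stochastic on purpose
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(frames), dim=-1) for _ in range(n_passes)])
    mean_probs = probs.mean(dim=0)                 # (frames, n_phonemes)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-9).log()).sum(dim=-1)
    return mean_probs, entropy                     # higher entropy = harder frame

phoneme_head = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Dropout(0.3), nn.Linear(128, 40))
features = torch.randn(50, 80)                     # toy acoustic frames
mean_probs, frame_uncertainty = mc_dropout_entropy(phoneme_head, features)
```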
The experiments are well-structured, utilizing both English and German datasets to validate the proposed method. The results show a clear improvement in ASR accuracy for non-normative speech, and the correlation with clinical assessments adds credibility to the findings. However, the limited number of speakers in the BF-Sprache dataset may affect the generalizability of the results.
While the paper provides a detailed description of the methodology, including the computation of the PhDScore and the experimental setup, it lacks specific implementation details or code availability, which could hinder reproducibility. Future work should consider sharing code or datasets to facilitate further research.
The primary limitation is the small size of the BF-Sprache dataset, which restricts the breadth of the findings. Additionally, the subjective nature of clinical assessments may introduce variability in the validation process. The trade-off between personalization and generalization is also a concern, as it may limit the practical application of the method in real-world scenarios.
This work has significant implications for the development of personalized ASR systems, particularly for individuals with speech impairments. By improving the accuracy of ASR for non-normative speech, the proposed method could enhance communication aids and assistive technologies, making them more effective and inclusive for users with diverse speech patterns.
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose MATA, a novel training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Specifically, MATA intervenes after the raw attention scoring step, targeting only the last token in intermediate layers without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
Primary: Corresponding author
All Institutions: Corresponding author
The main contribution of this paper is the introduction of MATA, a novel training-free method that enhances audio attention in LALMs, which significantly improves their performance on audio reasoning tasks. The study's findings are relevant and timely, addressing a crucial challenge in the field of multi-modal machine learning and paving the way for future advancements.
The proposed MATA method is innovative in its approach to addressing the audio-textual attention imbalance in Large Audio-Language Models (LALMs). By dynamically adjusting attention weights after the raw scoring step, without retraining the model, MATA offers a practical solution that is both efficient and effective. The choice to target only the last token in intermediate layers is particularly insightful, as it aligns with the model's architecture and the critical role of these layers in multi-modal fusion. However, the lack of detailed hyperparameter tuning and exploration of different enhancement strengths could limit the method's applicability across various models.
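The kind of intervention described above can be illustrated with the toy snippet below, which boosts the attention scores that the final query token assigns to audio positions before the softmax; the additive boost, the single-tensor shapes, and applying it uniformly are assumptions rather than MATA's exact formulation.

```python
import torch

def boost_audio_attention(scores, audio_mask, alpha=2.0):
    """scores: (heads, q_len, k_len); audio_mask: (k_len,) bool marking audio tokens."""
    scores = scores.clone()
    # raise the raw scores of the last query position toward audio keys
    scores[:, -1, audio_mask] = scores[:, -1, audio_mask] + alpha
    return torch.softmax(scores, dim=-1)

heads, q_len, k_len = 8, 16, 16
raw_scores = torch.randn(heads, q_len, k_len)
audio_mask = torch.zeros(k_len, dtype=torch.bool)
audio_mask[:6] = True                       # pretend the first 6 tokens are audio
attn = boost_audio_attention(raw_scores, audio_mask)
```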
The experiments conducted on the MMAU and MMAR benchmarks provide strong evidence for the efficacy of MATA, showcasing significant performance improvements over baseline models. The results are compelling, especially the claim that MATA enables an open-source model to outperform a proprietary one for the first time. However, the paper could benefit from additional details on the experimental setup, such as the specific configurations of the baseline models and the statistical significance of the results presented.
The paper does not provide a clear path for reproducing the results, as it lacks links to code repositories or detailed implementation instructions. While the methodology is described, the absence of a public implementation or demo limits the ability for other researchers to validate the findings independently.
One limitation is the focus on only two benchmarks, which may not fully capture the generalizability of MATA across diverse audio reasoning tasks. Additionally, the method's reliance on a single hyperparameter for attention enhancement may not be optimal for all scenarios, and further exploration of this aspect could yield more robust results.
The implications of this work are significant, as it addresses a critical gap in multi-modal model performance, particularly in audio reasoning tasks. By improving the attention allocation towards audio, MATA could enhance applications in various fields, including human-computer interaction, assistive technologies, and multimedia content analysis. This research opens avenues for further exploration into multi-modal learning, potentially leading to more balanced and capable AI systems.
The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Flow model on embeddings extracted by pre-trained neural audio fingerprinting systems. The synthetic fingerprints generated using our system act as realistic distractors and enable the simulation of retrieval performance at a large scale without requiring additional audio. We assess the fidelity of synthetic fingerprints by comparing the distributions to real data. We further benchmark the retrieval performances across multiple state-of-the-art audio fingerprinting frameworks by augmenting real reference databases with synthetic distractors, and show that the scaling trends obtained with synthetic distractors closely track those obtained with real distractors. Finally, we scale the synthetic distractor database to model retrieval performance for very large databases, providing a practical metric of system scalability that does not depend on access to audio corpora.
Primary: Queen Mary University of London
All Institutions: Queen Mary University of London, School of Electronic Engineering and Computer Science, UKRI Centre for Doctoral Training in Artificial Intelligence and Music
The paper presents a framework for scalable evaluation of audio fingerprinting systems using synthetic latent fingerprints generated by a rectified flow model. The methodology is innovative and addresses a critical challenge in the field, with potential applications that could enhance the performance and scalability of audio identification systems.
The paper introduces a novel approach to audio fingerprinting by synthesizing latent fingerprints using a Rectified Flow model, which is a significant advancement in the field. The methodology is well-structured, leveraging generative modeling to create realistic distractors without requiring additional audio data. The use of embeddings from pre-trained systems enhances the fidelity of the synthetic fingerprints, and the approach is theoretically sound, with a clear explanation of the model architecture and training process. The authors provide a comprehensive description of how the generative model approximates the distribution of real fingerprints, which is a critical aspect of their methodology.
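The rectified-flow training objective underlying the generator can be sketched as a straight-line interpolation between noise and a real fingerprint embedding, with the network regressing the velocity, as below; the tiny MLP and embedding size are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

dim = 128                                         # placeholder fingerprint dimension
velocity_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def rectified_flow_loss(x1):
    x0 = torch.randn_like(x1)                     # Gaussian noise sample
    t = torch.rand(x1.size(0), 1)                 # uniform time in [0, 1)
    xt = (1 - t) * x0 + t * x1                    # straight-line interpolation
    v_pred = velocity_net(torch.cat([xt, t], dim=-1))
    return ((v_pred - (x1 - x0)) ** 2).mean()     # regress onto the target velocity

real_fingerprints = torch.randn(32, dim)          # stand-in for embeddings from a pretrained system
loss = rectified_flow_loss(real_fingerprints)
loss.backward()
```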
The experimental setup is robust, employing a well-defined evaluation framework that assesses both the fidelity of synthetic fingerprints and their effectiveness as distractors in retrieval tasks. The use of multiple state-of-the-art audio fingerprinting systems for benchmarking adds credibility to the results. The experiments demonstrate that synthetic distractors can effectively simulate real-world conditions, with results indicating that retrieval scaling trends obtained with synthetic distractors closely track those obtained with real ones. However, the paper could benefit from more extensive statistical analysis to further validate the findings.
The authors have made their code and trained models available on GitHub, which is a positive aspect for reproducibility. The detailed description of the training process, including hyperparameters and dataset specifics, supports the reproducibility of the experiments. However, some equations and figures referenced in the text are not fully detailed, which could hinder complete replication of the results.
One limitation of the study is the reliance on a single dataset (Free Music Archive) for training and evaluation, which may affect the generalizability of the findings to other audio domains. Additionally, while the synthetic fingerprints closely match real distributions, there may still be nuances in real data that are not captured by the generative model. The paper also does not explore the potential biases introduced by the dataset used for training.
This research has significant implications for the field of music information retrieval, particularly in scenarios where large annotated audio datasets are not available. By enabling scalable evaluation of audio fingerprinting systems, the proposed framework can facilitate advancements in real-time audio identification applications, such as music recognition services and copyright enforcement. The approach could also inspire further research into generative modeling techniques in other areas of machine learning.
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Department of Electrical and Computer Engineering
The paper presents FlexSED, a novel open-vocabulary sound event detection framework that effectively addresses existing limitations in sound classification and adapts well to diverse real-world applications. The innovative integration of pretrained models and robust training strategies positions this work as a significant contribution to the field of audio machine learning.
The proposed FlexSED framework introduces a novel architecture that integrates pretrained audio and text models, addressing the limitations of traditional sound event detection systems. The encoder-decoder structure and adaptive fusion strategy are innovative, allowing for effective continuous training and improved performance in open-vocabulary contexts. The use of large language models for negative query filtering is particularly noteworthy, as it enhances the robustness of the training process by mitigating issues related to missing labels. Overall, the methodology is well-structured and leverages existing technologies in a creative manner.
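As an illustration of query-conditioned detection in the spirit of the adaptive fusion described above, the sketch below lets a text-query embedding modulate per-frame audio features and emits frame-wise presence probabilities; the FiLM-style gating, feature sizes, and pre-computed embeddings are assumptions, since the actual system composes a pretrained audio SSL encoder with the CLAP text encoder.

```python
import torch
import torch.nn as nn

class QueryConditionedSED(nn.Module):
    def __init__(self, audio_dim=768, text_dim=512):
        super().__init__()
        self.scale = nn.Linear(text_dim, audio_dim)
        self.shift = nn.Linear(text_dim, audio_dim)
        self.head = nn.Linear(audio_dim, 1)

    def forward(self, audio_frames, text_query):
        # audio_frames: (batch, frames, audio_dim); text_query: (batch, text_dim)
        gamma = self.scale(text_query).unsqueeze(1)
        beta = self.shift(text_query).unsqueeze(1)
        fused = audio_frames * (1 + gamma) + beta            # query-dependent modulation
        return torch.sigmoid(self.head(fused)).squeeze(-1)   # per-frame activity for the query

model = QueryConditionedSED()
probs = model(torch.randn(2, 100, 768), torch.randn(2, 512))
```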
The experiments conducted on the AudioSet-Strong dataset demonstrate the effectiveness of FlexSED, showcasing significant improvements over traditional models. The evaluation metrics used, including PSDS1, provide a fine-grained analysis of the model's performance in terms of temporal localization and sound event detection accuracy. The results from zero-shot and few-shot learning scenarios further validate the model's adaptability and generalization capabilities, which are crucial for real-world applications. However, the paper could benefit from additional comparisons with more diverse baseline models to strengthen its claims.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which facilitate reproducibility. The authors have also made the code and pretrained models available on GitHub, enhancing the accessibility of their work for further research and experimentation. However, the absence of a demo URL limits immediate practical engagement with the model.
One limitation of the study is the reliance on the AudioSet-Strong dataset, which, while substantial, may not encompass the full diversity of sound events encountered in real-world scenarios. Additionally, the model's performance in highly noisy environments or with overlapping sound events could be further explored. The paper also does not address potential computational costs associated with using large language models for negative query filtering, which may limit practical deployment in resource-constrained settings.
The FlexSED framework has the potential to significantly advance the field of sound event detection by enabling more flexible and user-friendly interactions through open-vocabulary capabilities. Its applications could extend to various domains, including smart home technologies, wildlife monitoring, and assistive devices for the hearing impaired. By improving the adaptability of sound event detection systems, this work could lead to more intelligent and responsive audio processing solutions in everyday environments.
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio model while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.
Primary: Peking University
All Institutions: Peking University, The State Key Laboratory of Multimedia Information Processing, Jiutian Artificial Intelligence Research Institute
This paper presents a novel framework for knowledge distillation that enhances reasoning capabilities in audio models by leveraging both textual and acoustic supervision. The comprehensive methodology and strong experimental results indicate a meaningful contribution to the field of machine learning, particularly in audio processing and reasoning tasks.
The proposed methodology introduces a dual-dimensional knowledge distillation framework that effectively addresses the challenges of reasoning in audio models by incorporating both source-wise and layer-wise distillation. This approach is innovative as it not only leverages the strengths of textual and acoustic teachers but also aligns the distillation process with the architecture of the student model, allowing for a more nuanced transfer of knowledge. The textualization of audio to bridge the modality gap is particularly noteworthy, as it enables the application of textual reasoning techniques to audio data.
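Layer-wise distillation of the kind characterized above can be sketched by projecting selected teacher hidden states to the student's width and matching them at chosen student depths; the layer pairing, projections, and dimensions below are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 1024, 768
layer_pairs = [(4, 2), (12, 6), (20, 10)]        # (teacher_layer, student_layer)
projections = nn.ModuleList([nn.Linear(teacher_dim, student_dim) for _ in layer_pairs])

def layerwise_kd_loss(teacher_hidden, student_hidden):
    """Both inputs: lists of (batch, seq, dim) hidden states, one per layer."""
    loss = 0.0
    for proj, (t_idx, s_idx) in zip(projections, layer_pairs):
        loss = loss + F.mse_loss(proj(teacher_hidden[t_idx]), student_hidden[s_idx])
    return loss / len(layer_pairs)

teacher_hidden = [torch.randn(2, 32, teacher_dim) for _ in range(24)]
student_hidden = [torch.randn(2, 32, student_dim) for _ in range(12)]
loss = layerwise_kd_loss(teacher_hidden, student_hidden)
```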
The experimental evaluation is robust, utilizing relevant datasets such as CoTA and MMAU to assess the performance of the proposed framework. The results demonstrate significant improvements in reasoning accuracy across various tasks, indicating the effectiveness of the proposed distillation methods. The comparison against baseline models and different distillation strategies provides a comprehensive understanding of the framework's impact.
The paper includes sufficient detail regarding the training setup, model configurations, and evaluation metrics, which aids in reproducibility. However, the absence of publicly available code or a project URL limits the ease with which other researchers can replicate the work.
One limitation is the reliance on specific datasets, which may not generalize across all audio reasoning tasks. Additionally, while the framework shows improvements, the paper does not extensively discuss potential computational costs associated with the dual-dimensional distillation process, which could impact scalability in real-world applications.
The proposed framework has significant implications for advancing audio models, particularly in applications requiring complex reasoning, such as automated transcription, sentiment analysis, and interactive voice assistants. By enhancing the reasoning capabilities of audio models, this work could lead to more intelligent and context-aware audio processing systems.
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.
Primary: Channel Corporation
All Institutions: Channel Corporation
The main contribution of this paper is the introduction of a novel preference-guided optimization approach for prosody learning in TTS systems, which effectively addresses the limitations of existing methods by utilizing human feedback to enhance the naturalness of synthesized speech. This work represents a meaningful step forward in the field of TTS, providing a practical solution to a longstanding challenge in achieving expressive and natural speech synthesis.
The paper introduces an iterative Direct Preference Optimization (DPO) scheme that innovatively addresses the challenge of optimizing prosody in TTS systems without a verifiable reward signal. The methodology is well-structured, leveraging human-labeled preference pairs to guide the model towards more natural prosody, which is a significant advancement over traditional methods that rely heavily on transcription-oriented signals. The regularization to the current model is a thoughtful addition that helps maintain stability during training, which is critical in TTS applications.
The experiments are robust, utilizing the KoCC-TTS dataset, which is specifically curated for authentic Korean call center interactions. The results demonstrate a clear improvement in human preference ratings (ELO) and competitive character error rates (CER) compared to both GRPO and commercial baselines. This empirical validation strengthens the claims made in the paper and showcases the effectiveness of the proposed method in a real-world context.
The paper provides sufficient detail regarding the methodology and experimental setup, which is crucial for reproducibility. However, it would benefit from the inclusion of hyperparameters, model architectures, and specific training procedures to enhance clarity for future researchers attempting to replicate or build upon this work.
One limitation acknowledged is the reliance on human preference pairs, which may introduce variability and subjectivity into the training process. Additionally, the method's performance in diverse linguistic contexts beyond Korean remains untested, which could limit its generalizability.
The findings have significant implications for the development of more natural and human-like TTS systems, which can enhance user experience in various applications, including virtual assistants, audiobooks, and customer service interactions. By improving prosody in TTS, this work contributes to the broader goal of creating more engaging and effective human-computer interactions.
Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great potential to alleviate the need for experts to label long audio recordings by hand. However, they still typically rely on the availability of labelled data for model training, restricting applicability to a few species and datasets. In this work, we build the first fully unsupervised algorithm to decompose birdsong recordings into sequences of syllables. We first detect syllable events, then cluster them to extract templates -- syllable representations -- before performing matching pursuit to decompose the recording as a sequence of syllables. We evaluate our automatic annotations against human labels on a dataset of Bengalese finch songs and find that our unsupervised method achieves high performance. We also demonstrate that our approach can distinguish individual birds within a species through their unique vocal signatures, for both Bengalese finches and another species, the great tit.
The main contribution of this paper is the development of a fully unsupervised method for annotating birdsongs at the syllable level, which addresses the significant challenge of data labeling in bioacoustics. The innovative approach and promising results position this work as a valuable addition to the field, with potential applications in conservation and animal behavior studies.
The paper presents a fully unsupervised algorithm for identifying and segmenting syllables in birdsong recordings, which is a significant advancement given the reliance on labeled data in previous methods. The methodology involves detecting syllable events, clustering them to create templates, and using a matching pursuit approach to decompose recordings into syllable sequences. The use of PCA and HDBSCAN for clustering, along with a split-merge strategy to refine syllable templates, demonstrates a thoughtful approach to handling the complexities of audio data. However, the methodology could benefit from more detailed explanations of the parameter choices and the impact of different thresholds on performance.
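The clustering stage described above can be sketched with PCA followed by HDBSCAN, as below; the random features, dimensionalities, and parameter values are placeholders, and the paper's split-merge refinement of templates is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN   # available in scikit-learn >= 1.3

rng = np.random.default_rng(0)
syllable_features = rng.normal(size=(500, 128))   # e.g. flattened spectrogram patches per detected event

reduced = PCA(n_components=20).fit_transform(syllable_features)
labels = HDBSCAN(min_cluster_size=10).fit_predict(reduced)   # -1 marks noise events

# one template per cluster: the mean of its members in the reduced space
templates = [reduced[labels == c].mean(axis=0) for c in set(labels) if c != -1]
```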
The experiments are well-structured, utilizing two distinct datasets (Bengalese finches and great tits) to validate the method's effectiveness. The evaluation metrics, including precision and recall, are appropriate for the task, and the results show promising performance, particularly in distinguishing individual birds. However, the paper lacks a comprehensive comparison with existing methods, which would provide context for the reported performance metrics. The choice of hyperparameters appears to be somewhat arbitrary, and further tuning could potentially enhance results.
The paper provides a reasonable level of detail regarding the experimental setup and methodology, but it lacks specific URLs for code or data access, which hinders reproducibility. The absence of a publicly available implementation means that other researchers cannot easily replicate the findings or build upon the work. Including a GitHub repository or similar would significantly improve this aspect.
The paper acknowledges that the method may not perform well in the presence of structured noise, which is a significant limitation for real-world applications. Additionally, the reliance on a fixed-size support set for template generation may restrict the method's adaptability to varying datasets. The potential for oversplitting clusters is also a concern, as it could lead to inaccuracies in syllable identification.
The implications of this research are substantial, particularly in the fields of bioacoustics and wildlife conservation. By enabling the automatic annotation of birdsong, the method could facilitate large-scale studies of bird populations and behaviors, contributing to biodiversity monitoring and conservation efforts. Furthermore, the approach has the potential to be adapted for other taxa, broadening its applicability beyond avian species.
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist, we introduce stereo object-awareness measures and validate it through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
Primary: UC San Diego
All Institutions: UC San Diego
The main contribution of this paper is the introduction of StereoFoley, an end-to-end framework for generating object-aware stereo audio from video, addressing a critical gap in the field of video-to-audio generation. This work significantly advances the state-of-the-art by combining innovative methodologies with a strong experimental foundation, paving the way for future research and applications in audio synthesis.
The methodology presented in StereoFoley is robust, integrating various components such as video analysis, object tracking, and audio synthesis to create a comprehensive framework for stereo audio generation. The introduction of a synthetic data generation pipeline to address the limitations of existing datasets is a notable strength, showcasing innovation in data handling. The use of latent diffusion models and the design of a two-stage audio generation process enhances the model's performance. However, the reliance on synthetic data may raise questions about the generalizability of the results to real-world scenarios.
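A toy version of the synthetic stereo supervision described above is sketched below: a mono event is placed in the stereo field with a constant-power pan law and attenuated by an inverse-distance gain. The trajectories are invented; the actual pipeline derives pan and distance from video object tracking.

```python
import numpy as np

sr = 48_000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
mono = 0.3 * np.sin(2 * np.pi * 440 * t)                 # placeholder sound event

pan = np.linspace(-1.0, 1.0, mono.size)                  # object moves left -> right
distance = np.linspace(1.0, 4.0, mono.size)              # and away from the camera

theta = (pan + 1.0) * np.pi / 4                          # constant-power pan law
gain = 1.0 / np.maximum(distance, 1.0)                   # distance-based loudness
left = mono * np.cos(theta) * gain
right = mono * np.sin(theta) * gain
stereo = np.stack([left, right], axis=-1)                # (samples, 2) at 48 kHz
```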
The experiments are well-structured, comparing the proposed model against state-of-the-art baselines. The use of both objective metrics and a human listening study provides a balanced evaluation of the model's performance. The results indicate that StereoFoley achieves competitive performance, particularly in object-aware audio generation. However, the marginal differences in some metrics suggest that while improvements are present, they may not be as pronounced as claimed.
The paper provides sufficient details regarding the model architecture, training process, and evaluation metrics, which supports reproducibility. However, the absence of publicly available code or datasets limits the ability of other researchers to fully replicate the study. The authors should consider releasing their code and synthetic datasets to enhance reproducibility and facilitate further research.
The primary limitation of the study is its reliance on synthetic data, which may not fully capture the complexities of real-world audio-visual interactions. Additionally, the evaluation metrics used may not be entirely suitable for high-sample-rate stereo sound, potentially underrepresenting the model's capabilities. The paper also acknowledges that the performance of the model may vary based on the quality of the input video data.
The implications of this research are significant, particularly in fields such as film production, gaming, and virtual reality, where high-quality audio-visual synchronization is crucial. The ability to generate object-aware stereo audio could enhance user experiences in immersive environments. Furthermore, the framework could serve as a foundation for future developments in audio generation, potentially influencing related areas such as sound design and machine learning applications in multimedia.
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-subject variability that limits the efficacy of generalized models. To address these issues, we propose Brainprint-Modulated Target Speaker Extraction (BM-TSE), a novel framework for personalized and high-fidelity extraction. BM-TSE first employs a spatio-temporal EEG encoder with an Adaptive Spectral Gain (ASG) module to extract stable features resilient to non-stationarity. The core of our framework is a personalized modulation mechanism, where a unified brainmap embedding is learned under the joint supervision of subject identification (SID) and auditory attention decoding (AAD) tasks. This learned brainmap, encoding both static user traits and dynamic attentional states, actively refines the audio separation process, dynamically tailoring the output to each user. Evaluations on the public KUL and Cocktail Party datasets demonstrate that BM-TSE achieves state-of-the-art performance, significantly outperforming existing methods. Our code is publicly accessible at: https://github.com/rosshan-orz/BM-TSE.
Primary: School of Data Science
All Institutions: School of Data Science, School of Artificial Intelligence
The paper presents the Brainprint-Modulated Target Speaker Extraction (BM-TSE) framework, which significantly advances personalized neuro-steered audio extraction by integrating EEG signal processing with innovative modulation techniques. The methodology is robust and well-structured, addressing critical challenges in the field and demonstrating substantial technical contributions with promising experimental results.
The proposed BM-TSE framework introduces a robust spatio-temporal EEG encoder combined with an Adaptive Spectral Gain (ASG) module, which addresses the non-stationarity of EEG signals effectively. The architecture's unique feature is the personalized brainmap modulation mechanism that integrates subject identification and auditory attention decoding tasks, enabling dynamic audio refinement based on individual neural patterns. This approach is innovative as it leverages stable, user-specific EEG features to enhance target speaker extraction, which is a significant advancement over existing generalized models.
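The joint supervision described above can be sketched as a multi-task loss in which a shared brainmap embedding feeds both a subject-identification head and an auditory attention decoding head, with their cross-entropy terms added to the separation loss; the loss weights, head sizes, and scalar separation-loss stand-in below are assumptions, and the embedding's modulation of the audio path is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_subjects, emb_dim = 16, 256
sid_head = nn.Linear(emb_dim, n_subjects)        # subject identification
aad_head = nn.Linear(emb_dim, 2)                 # attended speaker, e.g. left vs. right

def joint_loss(separation_loss, brainmap, subject_id, attention_label,
               w_sid=0.1, w_aad=0.1):
    sid_loss = F.cross_entropy(sid_head(brainmap), subject_id)
    aad_loss = F.cross_entropy(aad_head(brainmap), attention_label)
    return separation_loss + w_sid * sid_loss + w_aad * aad_loss

brainmap = torch.randn(4, emb_dim)               # per-trial embedding from the EEG encoder
loss = joint_loss(torch.tensor(1.7), brainmap,
                  torch.randint(0, n_subjects, (4,)), torch.randint(0, 2, (4,)))
```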
The experiments conducted on the KUL and Cocktail Party datasets demonstrate the model's superiority over existing methods, achieving state-of-the-art results in terms of speech quality and intelligibility. The ablation studies provide a clear understanding of the contributions of each component, reinforcing the importance of the proposed architecture. The metrics used for evaluation, including SI-SDR, PESQ, and STOI, are appropriate and relevant for assessing the model's performance in audio processing tasks.
The paper provides sufficient implementation details, including the use of PyTorch, the training setup, and the datasets. The code is publicly available on GitHub, which enhances reproducibility. However, the absence of a live demo or interactive visualization limits immediate accessibility for other researchers.
One limitation is the reliance on EEG data, which may not be universally applicable across all populations or settings due to inter-subject variability. Additionally, while the model shows promise, the performance may vary with different types of auditory stimuli or in more complex acoustic environments. The paper could also benefit from a discussion on the computational efficiency and real-time applicability of the proposed framework.
The BM-TSE framework has significant implications for the development of advanced hearing aids and assistive listening technologies, potentially improving the quality of life for individuals with hearing impairments. By personalizing audio extraction based on neural signatures, this research paves the way for more adaptive and user-centered auditory processing systems.
Spatial target speaker extraction isolates a desired speaker's voice in multi-speaker environments using spatial information, such as the direction of arrival (DoA). Although recent deep neural network (DNN)-based discriminative methods have shown significant performance improvements, the potential of generative approaches, such as generative adversarial networks (GANs), remains largely unexplored for this problem. In this work, we demonstrate that a GAN can effectively leverage both noisy mixtures and spatial information to extract and generate the target speaker's speech. By conditioning the GAN on intermediate features of a discriminative spatial filtering model in addition to DoA, we enable steerable target extraction with high spatial resolution of 5 degrees, outperforming state-of-the-art discriminative methods in perceptual quality-based objective metrics.
Primary: Fraunhofer IIS
All Institutions: Fraunhofer IIS, Erlangen
The main contribution of this paper is the introduction of a GAN-based framework for spatial target speaker extraction that effectively utilizes spatial information and intermediate features from discriminative models, demonstrating superior performance in perceptual quality metrics compared to existing methods. The comprehensive methodology and rigorous experimental evaluation underscore its significance in advancing the field of audio signal processing.
The paper introduces a novel GAN-based approach for multi-microphone spatial target speaker extraction, leveraging both spatial information (DoA) and intermediate features from discriminative models. The methodology is well-structured, employing an end-to-end training framework that combines adversarial, reconstruction, and feature-matching losses. The use of a U-Net-like architecture for the generator and a multi-scale STFT-based discriminator is appropriate for the task, allowing for effective feature extraction and conditioning. The conditioning on both DoA and intermediate discriminative features represents a significant methodological advancement, enhancing the model's ability to isolate target speakers in complex acoustic environments.
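As a rough illustration of how such a training objective can be assembled, the snippet below combines adversarial, reconstruction, and feature-matching terms for the generator; the specific loss forms and weights are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake_logits, fake_feats, real_feats,
                   fake_audio, target_audio, lambda_rec=45.0, lambda_fm=2.0):
    # Non-saturating adversarial term: push discriminator scores for generated audio up.
    adv = -torch.mean(disc_fake_logits)
    # L1 reconstruction between the extracted estimate and the clean target speaker.
    rec = F.l1_loss(fake_audio, target_audio)
    # Feature matching over lists of intermediate discriminator activations.
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats)) / len(fake_feats)
    return adv + lambda_rec * rec + lambda_fm * fm
```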
The experimental setup is robust, utilizing a comprehensive dataset generated through simulated acoustic environments. The authors provide a clear comparison against state-of-the-art discriminative methods, with results indicating superior performance in perceptual quality metrics (PESQ and SCOREQ) while maintaining strong spatial selectivity. The inclusion of multiple SNR levels in testing strengthens the evaluation, demonstrating the model's effectiveness across varying conditions. However, the reliance on synthetic data may limit the generalizability of the results in real-world applications.
The paper provides detailed descriptions of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors do mention the use of specific datasets and training configurations, which is helpful for reproducing the experiments.
One limitation of the proposed method is its dependence on the quality of the simulated data, which may not fully capture the complexities of real-world acoustic environments. Additionally, while the model shows improved performance in perceptual metrics, it may still struggle in scenarios with very low SNR or highly reverberant conditions. The paper also does not explore the computational efficiency of the proposed GAN model, which could be a concern for real-time applications.
The proposed method has significant implications for various applications, including hearing aids, conference systems, and automatic speech recognition. By improving the ability to isolate target speakers in noisy environments, this research could enhance communication technologies and accessibility tools for individuals with hearing impairments. The advancements in generative modeling for audio tasks could also inspire further research in related fields, such as speech synthesis and enhancement.
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfakes become nearly indistinguishable from real human speech, the need for robust detection methods and effective countermeasures has become critically urgent. In this paper, we present the ISPL's submission to the SAFE challenge at IH&MMSec 2025, where our system ranked first across all tasks. Our solution introduces a novel approach to audio deepfake detection based on a Mixture of Experts architecture. The proposed system leverages multiple state-of-the-art detectors, combining their outputs through an attention-based gating network that dynamically weights each expert based on the input speech signal. In this design, each expert develops a specialized understanding of the shared training data by learning to capture different complementary aspects of the same input through inductive biases. Experimental results indicate that our method outperforms existing approaches across multiple datasets. We further evaluate and analyze the performance of our system in the SAFE challenge.
The main contribution of this paper is the introduction of an attention-based Mixture of Experts architecture for robust speech deepfake detection, which combines the strengths of multiple detectors to improve performance. This innovative approach, along with strong experimental results, positions the work as a valuable addition to the ongoing efforts in combating audio deepfakes, though it requires further details for reproducibility and practical application.
The paper introduces a Mixture of Experts (MoE) architecture enhanced by an attention-based gating mechanism, which is a sophisticated approach to audio deepfake detection. The use of multiple state-of-the-art detectors allows the model to leverage complementary strengths, effectively addressing the challenge of distinguishing between real and synthetic speech. The attention mechanism dynamically adjusts the contribution of each expert based on the input, which is a notable innovation that enhances the model's adaptability and robustness. However, the paper could benefit from a more detailed description of the inductive biases employed and how they are integrated into the learning process.
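A minimal sketch of the gating idea is given below: an utterance embedding drives a small network whose softmax weights blend the scores of several frozen expert detectors. The layer sizes and the use of a single pooled embedding are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionGatedMoE(nn.Module):
    """Combine frozen expert scores with input-dependent gating weights (illustrative)."""
    def __init__(self, embed_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, num_experts)
        )

    def forward(self, utterance_emb, expert_scores):
        # utterance_emb: (batch, embed_dim); expert_scores: (batch, num_experts)
        weights = torch.softmax(self.gate(utterance_emb), dim=-1)
        return (weights * expert_scores).sum(dim=-1)   # fused real/fake score per utterance
```

The fused score can then be thresholded like any single detector's output, while the gate decides per input which experts to trust.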
The experimental results demonstrate that the proposed method outperforms existing approaches across multiple datasets, which is a strong indicator of its effectiveness. The authors participated in the SAFE challenge, achieving first place across all tasks, which adds credibility to their claims. However, the paper lacks detailed information about the datasets used, including their sizes, diversity, and how they were split for training and testing. This information is crucial for assessing the generalizability of the results.
The paper does not provide sufficient implementation details or access to code repositories, which raises concerns about reproducibility. While the methodology is described, the absence of a clear path for other researchers to replicate the experiments limits the impact of the findings. Providing a GitHub repository or similar would greatly enhance the paper's contribution to the field.
One limitation is the potential overfitting to the datasets used, especially if they are not sufficiently diverse. Additionally, the reliance on multiple experts may increase computational complexity, which could hinder real-time applications. The paper does not address how the model performs under adversarial conditions or with varying qualities of input audio, which is critical for practical deployment.
The implications of this research are significant, particularly in the context of increasing concerns about misinformation and biometric spoofing. Effective detection methods for speech deepfakes can enhance security in various applications, including virtual assistants and online communications. However, the potential for misuse of such technologies also warrants careful consideration of ethical implications and the need for responsible deployment.
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192 kHz audio SR. Demo at https://AudioLBM.github.io/.
Primary: Tsinghua University
All Institutions: Tsinghua University (Department of CST), Shengshu AI
The main contribution of this work is the introduction of a novel audio super-resolution system utilizing Latent Bridge Models, which significantly enhances the quality of audio upsampling beyond existing methods. The comprehensive methodology, rigorous experimental validation, and potential applications highlight its significance in advancing the field of audio processing.
The paper introduces a novel approach to audio super-resolution using Latent Bridge Models (LBMs), which compress audio waveforms into a continuous latent space. The methodology is well-structured, leveraging frequency-aware LBMs and a cascaded design to enhance the upsampling process beyond 48 kHz. The integration of informative priors from low-resolution (LR) signals into the generative framework is innovative, allowing for better quality audio synthesis. The paper also presents two prior augmentation strategies to mitigate cascading errors, which is a thoughtful addition to the overall framework. The use of variational autoencoders (VAEs) for compression and the detailed explanation of the bridge process further demonstrate the robustness of the proposed methodology.
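For intuition, the snippet below shows a generic Brownian-bridge style interpolation between an LR latent (the informative prior) and the HR latent (the target), which is the kind of latent-to-latent process a bridge model is trained to reverse; it is a schematic, not the paper's exact bridge SDE or noise schedule.

```python
import torch

def bridge_interpolate(z_lr, z_hr, t, sigma=1.0):
    """Sample an intermediate latent on a bridge pinned at z_lr (t=0) and z_hr (t=1)."""
    t = t.view(-1, *([1] * (z_lr.dim() - 1)))          # broadcast time over latent dims
    mean = (1.0 - t) * z_lr + t * z_hr
    std = sigma * torch.sqrt(t * (1.0 - t))
    return mean + std * torch.randn_like(mean)

z_lr, z_hr = torch.randn(2, 64, 50), torch.randn(2, 64, 50)
z_t = bridge_interpolate(z_lr, z_hr, torch.rand(2))    # one random bridge time per item
```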
The experimental setup is comprehensive, utilizing multiple benchmark datasets (VCTK, ESC-50, Song-Describer) and internal test sets to evaluate the performance of the proposed method. The results indicate a significant improvement over existing methods, achieving state-of-the-art performance in both objective and perceptual quality metrics. The paper effectively compares its results against various baselines, providing clear evidence of the advantages of the proposed approach. The ablation studies conducted further validate the contributions of each component of the model.
The paper includes sufficient details regarding the training setup, model architecture, and evaluation metrics, which enhances reproducibility. However, the absence of a public code repository limits the ability for independent verification of results. The authors mention a demo URL, which may provide some interactive insights, but a complete code release would be beneficial for the community.
While the proposed method shows promising results, it is important to note that the reliance on high-quality training data may limit its applicability in scenarios where such data is scarce. Additionally, the paper acknowledges potential misuse of the technology, such as unauthorized synthesis of audio, which raises ethical considerations. The cascading approach, while innovative, may still introduce artifacts that could affect the final output quality if not managed properly.
The implications of this research are significant for various applications, including audio restoration, music production, and hearing aids, where high-quality audio is essential. The ability to upscale audio beyond traditional limits opens new avenues for creative industries and enhances user experiences in audio consumption. However, the ethical concerns regarding misuse must be addressed to prevent potential negative impacts on the industry.
We propose DeepASA, a one-for-all model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA is designed for complex auditory scenes where multiple, often similar, sound sources overlap in time and move dynamically in space. To achieve robust and consistent inference across tasks, we introduce an object-oriented processing (OOP) strategy. This approach encapsulates diverse auditory features into object-centric representations and refines them through a chain-of-inference (CoI) mechanism. The pipeline comprises a dynamic temporal kernel-based feature extractor, a transformer-based aggregator, and an object separator that yields per-object features. These features feed into multiple task-specific decoders. Our object-centric representations naturally resolve the parameter association ambiguity inherent in traditional track-wise processing. However, early-stage object separation can lead to failure in downstream ASA tasks. To address this, we implement temporal coherence matching (TCM) within the chain-of-inference, enabling multi-task fusion and iterative refinement of object features using estimated auditory parameters. We evaluate DeepASA on representative spatial audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental results show that our model achieves state-of-the-art performance across all evaluated tasks, demonstrating its effectiveness in both source separation and auditory parameter estimation under diverse spatial auditory scenes.
The main contribution of this paper is the introduction of DeepASA, a unified framework for auditory scene analysis that effectively integrates multiple auditory tasks through innovative object-oriented processing and a chain-of-inference mechanism. This work significantly advances the state of the art in audio processing by providing a comprehensive solution to the challenges posed by complex auditory environments.
The proposed DeepASA framework introduces an innovative object-oriented processing (OOP) strategy that effectively encapsulates auditory features into object-centric representations, allowing for robust multi-task learning in auditory scene analysis. The integration of a chain-of-inference (CoI) mechanism to refine these representations through temporal coherence matching is a significant methodological advancement, addressing common pitfalls in traditional track-wise processing. The architecture's use of dynamic temporal kernels and transformer-based aggregators enhances its adaptability to complex auditory environments, showcasing a well-thought-out design that leverages state-of-the-art techniques in deep learning.
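One simple way to picture the object separator is as a set of learned queries cross-attending over aggregated scene features, in the spirit of DETR-style object queries; the sketch below is an illustrative analogue, not the actual DeepASA module.

```python
import torch
import torch.nn as nn

class ObjectSeparator(nn.Module):
    """Learned object queries attend over scene features to yield per-object features."""
    def __init__(self, num_objects: int = 4, dim: int = 256, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_objects, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, scene_feats):
        # scene_feats: (batch, time, dim) -> per-object features (batch, num_objects, dim)
        q = self.queries.unsqueeze(0).expand(scene_feats.size(0), -1, -1)
        obj_feats, _ = self.attn(q, scene_feats, scene_feats)
        return obj_feats
```

Each row of the output would then feed the task-specific decoders (separation, SED, DoAE), which is what removes the parameter-association ambiguity of track-wise processing.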
The experimental validation of DeepASA on multiple benchmark datasets (ASA2, MC-FUSS, and STARSS23) demonstrates its effectiveness across various tasks, achieving state-of-the-art performance metrics. The comprehensive ablation studies provide clear insights into the contributions of each component, reinforcing the robustness of the proposed architecture. However, the paper could benefit from additional comparisons with more diverse models to further contextualize its performance.
The paper provides a detailed description of the architecture, training procedures, and evaluation metrics, which supports reproducibility. The availability of a demo page enhances accessibility, allowing other researchers to explore the model's capabilities. However, the lack of a public code repository may hinder full reproducibility for some practitioners.
One notable limitation is the large parameter size associated with the ATST used in the SED decoder, which may restrict deployment in resource-constrained environments. Additionally, the reliance on specific datasets for training and evaluation may limit the generalizability of the results to other auditory scenarios.
The development of DeepASA has significant implications for advancing auditory scene analysis and sound separation technologies, particularly in applications such as hearing aids, surveillance systems, and interactive audio environments. By emulating human auditory processing, this research opens new avenues for improving machine understanding of complex auditory scenes, potentially benefiting various fields including robotics, virtual reality, and assistive technologies.
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to improve intelligibility and speaker similarity, and can use or omit the pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
Primary: Johns Hopkins University
All Institutions: Center for Language and Speech Processing, Johns Hopkins University
MaskVCT introduces a novel approach to zero-shot voice conversion, significantly enhancing controllability through multiple classifier-free guidances. The comprehensive analysis of its technical contributions, methodology, and experimental results positions it as a meaningful advancement in the field of audio machine learning.
The methodology presented in MaskVCT is innovative, leveraging a masked modeling approach combined with classifier-free guidance to enhance voice conversion capabilities. The integration of multiple factors for controllability—such as linguistic features and pitch contour—demonstrates a sophisticated understanding of the complexities involved in voice conversion. The ability to balance speaker identity, linguistic content, and prosody in a zero-shot setting is a significant advancement over traditional methods that rely on fixed conditioning schemes. However, the paper could benefit from a more detailed explanation of the underlying algorithms and the specific mechanisms by which the CFGs operate.
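The multi-guidance idea can be pictured as summing several classifier-free guidance corrections on top of an unconditional prediction, as in the sketch below; `model(x, cond=...)` is a hypothetical denoiser interface and the additive weighting scheme is an assumption, since the paper's exact CFG combination is not restated here.

```python
def multi_cfg_prediction(model, x_t, conds, weights):
    """Combine several classifier-free guidance terms (e.g., speaker, linguistic, pitch)."""
    uncond = model(x_t, cond=None)                 # unconditional estimate
    guided = uncond
    for cond, w in zip(conds, weights):
        # Each condition contributes its own guidance direction, scaled independently.
        guided = guided + w * (model(x_t, cond=cond) - uncond)
    return guided
```

Setting a weight to zero effectively drops that factor, which matches the described ability to use or omit, say, the pitch contour at inference time.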
The experiments are extensive and well-structured, showcasing the model's performance against existing baselines. The authors provide quantitative metrics such as target speaker and accent similarities, as well as word and character error rates, which are critical for evaluating the effectiveness of voice conversion systems. The results indicate that MaskVCT achieves competitive performance, suggesting that the proposed model is both robust and effective. However, further qualitative evaluations, such as user studies or perceptual tests, could strengthen the findings.
The paper does not provide sufficient details regarding the implementation of MaskVCT, which could hinder reproducibility. While the authors mention extensive experiments, the lack of a public code repository limits the ability of other researchers to replicate the results. Clearer documentation of the training process, hyperparameters, and dataset specifics would enhance reproducibility.
One limitation of the study is the reliance on existing datasets for evaluation, which may not fully capture the diversity of voice characteristics in real-world applications. Additionally, while the model offers increased controllability, the complexity of managing multiple factors may pose challenges for users unfamiliar with voice conversion technologies. The authors should also address potential biases in the datasets used, which could affect the generalizability of the results.
The advancements presented in MaskVCT have significant implications for applications in entertainment, accessibility, and telecommunication. The ability to perform zero-shot voice conversion with high controllability can enhance personalized user experiences in virtual assistants, dubbing, and gaming. Furthermore, the model's robustness could contribute to more inclusive technologies for individuals with speech impairments.
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources, thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.
Primary: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology
All Institutions: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology
The paper presents FakeSound2, a benchmark aimed at enhancing deepfake sound detection through improved explainability and generalization. This comprehensive approach addresses significant gaps in the current literature and has the potential to drive advancements in the field of audio forensics.
The paper introduces a novel benchmark, FakeSound2, which evaluates deepfake sound detection across three critical dimensions: localization, traceability, and generalization. The methodology is well-structured, leveraging a comprehensive dataset that includes various manipulation types and sources. The automated pipeline for dataset construction is a significant contribution, as it ensures a diverse and high-quality dataset for training and evaluation. However, the reliance on existing models for baseline comparisons may limit the perceived novelty of the proposed methods.
The experimental results are thorough, showcasing the performance of current models on the FakeSound2 benchmark. The results highlight the strengths of existing systems in binary classification while revealing their weaknesses in explainability and generalization. The use of metrics like Acc_identify, Acc_manipulation, and F1_segment provides a clear assessment of model capabilities. However, the paper could benefit from more detailed comparisons with state-of-the-art methods to contextualize the findings further.
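For readers unfamiliar with segment-level scoring, the snippet below computes a frame-wise F1 over binary manipulation masks; it is a simplified stand-in for the benchmark's F1_segment metric, whose exact segmentation rules may differ.

```python
import numpy as np

def segment_f1(pred_frames, true_frames):
    """Frame-wise F1 for localizing manipulated regions (1 = manipulated frame)."""
    pred = np.asarray(pred_frames)
    true = np.asarray(true_frames)
    tp = np.sum((pred == 1) & (true == 1))
    fp = np.sum((pred == 1) & (true == 0))
    fn = np.sum((pred == 0) & (true == 1))
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)

print(segment_f1([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.667: one correct, one spurious frame
```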
The paper provides sufficient details about the dataset construction and evaluation metrics, which supports reproducibility. However, the implementation specifics of the baseline model and the training process could be elaborated further to enhance clarity for future researchers attempting to replicate the study.
The primary limitations identified include the models' struggles with explainability and generalization, particularly in distinguishing between similar manipulation types. The dataset's reliance on specific generative models may also introduce biases that could affect generalization to unseen sources. Additionally, the current evaluation metrics may not fully capture the complexity of audio manipulation tasks.
The work addresses a pressing issue in the realm of synthetic media, particularly concerning ethical and security implications. By establishing a benchmark for explainable and generalizable deepfake sound detection, the paper has the potential to influence future research directions and promote the development of more robust detection systems. This is particularly relevant in contexts where audio authenticity is critical, such as journalism, law enforcement, and digital forensics.
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions causing signal degradations or mismatches between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive Speaker Verification tasks benchmark suite, assessing SV systems under stressors like recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several benchmarks do exist that each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of these, but also several other entirely new, but nonetheless important, real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, gender, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
The main contribution of this paper is the introduction of SVeritas, a comprehensive benchmark for evaluating speaker verification systems under diverse conditions, which significantly enhances the understanding of model robustness and fairness across demographic groups. This work is poised to influence future research directions in speaker verification, emphasizing the importance of robustness in real-world applications.
The paper introduces SVeritas, a comprehensive benchmark for speaker verification (SV) systems that evaluates models under a wide array of stress conditions. The methodology is robust, incorporating both real-world and synthetic stressors, which is a significant advancement over existing benchmarks that only cover limited scenarios. The modular design allows for easy integration of new models and evaluation settings, enhancing its utility for future research. However, the fixed levels of stress conditions may limit the granularity of the analysis.
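A building block for this kind of per-condition evaluation is an equal error rate (EER) computed from verification scores; the helper below is a plain NumPy sketch of that standard speaker-verification metric, not code from the benchmark itself.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate from verification scores (higher score = more likely same speaker)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fars, frrs = [], []
    for t in np.sort(scores):
        decisions = scores >= t
        far = np.mean(decisions[labels == 0]) if np.any(labels == 0) else 0.0   # false accepts
        frr = np.mean(~decisions[labels == 1]) if np.any(labels == 1) else 0.0  # false rejects
        fars.append(far)
        frrs.append(frr)
    i = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return (fars[i] + frrs[i]) / 2

print(compute_eer([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]))   # 0.0 for perfectly separated scores
```

Running such a metric once per stressor (noise level, codec, microphone distance, and so on) is what yields the per-condition breakdowns the benchmark reports.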
The experiments conducted using SVeritas are thorough, evaluating several state-of-the-art SV models across diverse conditions, including demographic factors. The results reveal critical insights into model performance under various stressors, highlighting disparities in robustness across different demographic groups. The statistical methods employed, such as paired t-tests, are appropriate for the analysis, although the paper could benefit from more extensive discussions on the implications of the findings.
The paper does not provide explicit URLs for code or datasets, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a public repository or demo limits the ability of other researchers to replicate the experiments or utilize the benchmark effectively.
The primary limitation noted is the fixed severity levels of stress conditions, which may not reflect real-world variability in audio degradation. Additionally, the paper acknowledges that the sample sizes for some demographic groups may be small, potentially affecting the statistical power of the comparisons.
The development of SVeritas has significant implications for the field of speaker verification, particularly in enhancing the robustness and fairness of SV systems. By addressing real-world challenges and providing a standardized evaluation framework, this work lays the groundwork for future advancements in equitable and reliable speaker verification technologies.
High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and semantic information within tokens, leading to a lack of fine-grained details in synthesized speech. In this study, we propose MBCodec, a novel multi-codebook audio codec based on Residual Vector Quantization (RVQ) that learns a hierarchically structured representation. MBCodec leverages self-supervised semantic tokenization and audio subband features from the raw signals to construct a functionally-disentangled latent space. In order to encourage comprehensive learning across various layers of the codec embedding space, we introduce adaptive dropout depths to differentially train codebooks across layers, and employ a multi-channel pseudo-quadrature mirror filter (PQMF) during training. By thoroughly decoupling semantic and acoustic features, our method not only achieves near-lossless speech reconstruction but also enables a remarkable 170x compression of 24 kHz audio, resulting in a low bit rate of just 2.2 kbps. Experimental evaluations confirm its consistent and substantial outperformance of baselines across all evaluations.
The main contribution of this paper is the introduction of MBCodec, a novel audio codec that effectively disentangles semantic and acoustic information, achieving near-lossless reconstruction at a remarkable compression ratio. This work represents a significant step forward in the field of neural audio codecs, addressing critical challenges in audio quality and compression efficiency while paving the way for future research and applications in high-fidelity audio processing.
The proposed MBCodec introduces a multi-codebook architecture that employs Residual Vector Quantization (RVQ) to achieve effective disentanglement of semantic and acoustic features. The use of self-supervised learning for semantic tokenization and the introduction of a pseudo-quadrature mirror filter (PQMF) to supervise acoustic information are innovative aspects that enhance model interpretability and performance. The adaptive dropout strategy is a thoughtful addition that optimizes training efficiency by dynamically adjusting the number of active codebooks based on their contribution to the reconstruction quality. Overall, the methodology is well-structured and addresses key limitations in existing audio codecs.
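The interplay of residual quantization and depth dropout can be sketched as below: each codebook quantizes the residual left by the previous one, and during training only a random prefix of codebooks is active. Treating the adaptive schedule as a uniform random depth is a simplification of the paper's adaptive dropout strategy, and gradient handling is omitted.

```python
import random
import torch
import torch.nn as nn

class RVQWithDepthDropout(nn.Module):
    """Residual VQ with a randomly truncated codebook depth during training (illustrative)."""
    def __init__(self, num_codebooks=8, codebook_size=1024, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)
        )

    def forward(self, z):
        # z: (batch, time, dim)
        depth = random.randint(1, len(self.codebooks)) if self.training else len(self.codebooks)
        residual, quantized = z, torch.zeros_like(z)
        for cb in self.codebooks[:depth]:
            table = cb.weight.unsqueeze(0).expand(residual.size(0), -1, -1)
            codes = cb(torch.cdist(residual, table).argmin(dim=-1))   # nearest codewords
            quantized = quantized + codes
            residual = residual - codes
        # (straight-through gradients and commitment losses omitted for brevity)
        return quantized
```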
The experimental setup is robust, benchmarking MBCodec against established baselines like DAC and Encodec. The authors provide comprehensive evaluations across multiple metrics, including PESQ and SI-SDR, demonstrating significant improvements in audio reconstruction quality and compression efficiency. The ablation study further strengthens the findings by isolating the contributions of key components, validating the necessity of both the PQMF and adaptive dropout mechanisms. However, the paper lacks clarity on the datasets used, which could affect the reproducibility of results.
The paper provides some implementation details, such as the architecture of the encoder and decoder, training duration, and hyperparameters. However, it lacks a clear description of the dataset preprocessing steps and the specific configurations used for the experiments. The absence of a publicly available code repository or demo limits the reproducibility of the results, as external researchers would need to replicate the entire setup without direct access to the code or data.
While MBCodec shows promising results, it is essential to note that the paper does not address potential limitations in terms of scalability or real-time application viability. The complexity of the model may hinder its deployment in resource-constrained environments. Additionally, the reliance on large-scale datasets for training may limit its applicability to domains with less available data.
The advancements in audio codec technology presented in this paper have significant implications for various applications, including speech synthesis, telecommunications, and multimedia streaming. The ability to achieve high-fidelity audio reconstruction at extremely low bitrates can enhance user experiences in voice communication and media consumption, particularly in bandwidth-limited scenarios.
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie-Reasoner (AGR), the first unified training-free multi-agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift that transforms audio deep reasoning into a complex text understanding task from a new perspective, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse-to-fine cognitive process. It first transforms the input audio into a coarse text-based document. Then, we design a novel proactive iterative document refinement loop, featuring tool-augmented routes and specialized agents, to continuously search for missing information and augment the evidence chain in a coarse-to-fine manner until sufficient question-related information is gathered for making final predictions. Experimental results show that AGR achieves state-of-the-art (SOTA) performance over existing open-source audio deep reasoning models across various benchmarks. The code will be made publicly available.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The main contribution of this paper is the introduction of AudioGenie-Reasoner, a training-free multi-agent framework that effectively integrates audio perception and reasoning through a novel iterative refinement process. This work significantly advances the field of audio deep reasoning by proposing a unique methodology that leverages large language models and human-like cognitive processes, although it could benefit from more comprehensive experimental details and reproducibility measures.
The proposed AudioGenie-Reasoner (AGR) framework introduces a novel approach to audio deep reasoning by leveraging a multi-agent system that operates in a training-free manner. The methodology is commendable as it mimics human cognitive processes, allowing for a coarse-to-fine transformation of audio inputs into textual evidence. The proactive iterative document refinement loop is particularly innovative, as it emphasizes active exploration and information augmentation, which are critical in reasoning tasks. However, the paper could benefit from a more detailed explanation of the specific algorithms used by the specialized agents and how they interact with the tool-augmented routes.
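The iterative refinement loop can be summarized schematically as follows; every callable (`perceive`, the agents, `judge`) and every dictionary key is a hypothetical placeholder standing in for LLM- or tool-backed components, since the released interfaces are not described here.

```python
def audio_reasoning_loop(audio, question, perceive, agents, judge, max_rounds=4):
    """Schematic coarse-to-fine loop over an evolving text-based evidence document."""
    document = perceive(audio)                      # coarse audio-to-text description
    for _ in range(max_rounds):
        verdict = judge(document, question)         # hypothetical dict: sufficient / missing / answer
        if verdict.get("sufficient"):
            break
        gap = verdict["missing"]                    # what evidence is still needed
        for agent in agents:
            document += "\n" + agent(audio, gap)    # tool-augmented refinement of the evidence chain
    return judge(document, question)["answer"]
```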
The experimental results indicate that AGR achieves state-of-the-art performance across various benchmarks, demonstrating its effectiveness compared to existing models. However, the paper lacks detailed descriptions of the datasets used, including their sizes and characteristics, which are essential for understanding the generalizability of the results. Additionally, the evaluation metrics employed should be clearly defined to assess the robustness of the findings.
The authors mention that the code will be made publicly available, which is a positive aspect for reproducibility. However, the paper does not provide sufficient implementation details or a clear methodology for reproducing the experiments, such as hyperparameter settings or the computational resources required. This lack of detail could hinder the ability of other researchers to replicate the study.
One significant limitation is the reliance on a training-free approach, which may restrict the model's adaptability to specific tasks or domains that require fine-tuning. Additionally, the absence of a comprehensive comparison with other training-free models in the literature leaves questions about the relative performance and applicability of AGR. The paper also does not address potential scalability issues when dealing with larger audio datasets.
The implications of this research are substantial, as it opens new avenues for audio processing and reasoning applications, such as in automated transcription, audio-based question answering, and interactive audio systems. By bridging the gap between perception and reasoning, AGR could enhance user experiences in various domains, including education, entertainment, and accessibility technologies.
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like Cross-correlation and Dynamic Time Warping assume simple drift patterns and provide no reliability measures. Meanwhile, recent deep learning models typically treat alignment as a binary classification task, overlooking inter-channel dependencies and uncertainty estimation. We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization. We extend BEATs encoders with cross-attention layers to model temporal relationships between channels. We also develop a confidence-weighted scoring function that uses the full prediction distribution instead of binary thresholding. Our method achieved first place in the BioDCASE 2025 Task 1 challenge with 0.30 MSE average across test datasets, compared to 0.58 for the deep learning baseline. On individual datasets, we achieved 0.14 MSE on ARU data (77% reduction) and 0.45 MSE on zebra finch data (18% reduction). The framework supports probabilistic temporal alignment, moving beyond point estimates. While validated in a bioacoustic context, the approach is applicable to a broader range of multi-channel audio tasks where alignment confidence is critical. Code available on: https://github.com/Ragib-Amin-Nihal/BEATsCA
The paper presents a significant advancement in multi-channel audio alignment through the integration of cross-attention mechanisms and confidence-weighted scoring, addressing key limitations of existing methods. The comprehensive evaluation and validation of the proposed approach underscore its potential impact on the field of audio processing and related applications.
The paper introduces a novel method that integrates cross-attention mechanisms with confidence-weighted scoring to enhance multi-channel audio alignment. The approach effectively models inter-channel dependencies and provides uncertainty quantification, addressing limitations of traditional methods and binary classification systems. The use of BEATs encoders with cross-attention layers is a significant innovation, allowing for better temporal relationship modeling. The confidence-weighted scoring function is well-conceived, utilizing the full prediction distribution rather than binary thresholds, which is a notable advancement in the field.
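The scoring idea, which uses the whole prediction distribution rather than a hard threshold, can be illustrated as below; the softmax-expectation form and the peak-probability confidence are assumptions standing in for the paper's exact scoring function.

```python
import torch

def confidence_weighted_offset(logits, candidate_offsets):
    """Turn a distribution over candidate time offsets into an estimate plus a confidence."""
    probs = torch.softmax(logits, dim=-1)            # (num_candidates,)
    estimate = (probs * candidate_offsets).sum()     # expectation instead of argmax
    confidence = probs.max()                         # peakedness as a crude confidence measure
    return estimate, confidence

logits = torch.tensor([0.1, 2.5, 0.3])
offsets = torch.tensor([-10.0, 0.0, 10.0])           # candidate offsets, e.g. in milliseconds
est, conf = confidence_weighted_offset(logits, offsets)
```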
The experimental results are compelling, demonstrating substantial improvements in Mean Squared Error (MSE) over baseline methods across multiple datasets. Achieving first place in the BioDCASE 2025 Task 1 challenge validates the effectiveness of the proposed method. The paper includes thorough validation and ablation studies that provide insight into the contributions of different components of the model, enhancing the credibility of the results.
The implementation details are well-documented, including the architecture, training configurations, and data augmentation techniques. The use of fixed random seeds for reproducibility is a strong point. However, the absence of a demo page limits the accessibility of the results for further exploration by the community.
While the method shows promise, it relies on affine drift approximations for candidate generation, which may not generalize to all scenarios of clock drift. Additionally, the paper could benefit from a more extensive discussion on the computational efficiency of the proposed method, especially given the complexity introduced by cross-attention mechanisms.
The proposed framework has significant implications for various applications requiring precise audio synchronization, such as bioacoustic monitoring, spatial audio systems, and distributed sensor networks. Its ability to quantify uncertainty in alignment decisions could enhance the reliability of systems in critical applications.
Audio-driven talking head generation is crucial for applications in virtual reality, digital avatars, and film production. While NeRF-based methods enable high-fidelity reconstruction, they suffer from low rendering efficiency and suboptimal audio-visual synchronization. This work presents PGSTalker, a real-time audio-driven talking head synthesis framework based on 3D Gaussian Splatting (3DGS). To improve rendering performance, we propose a pixel-aware density control strategy that adaptively allocates point density, enhancing detail in dynamic facial regions while reducing redundancy elsewhere. Additionally, we introduce a lightweight Multimodal Gated Fusion Module to effectively fuse audio and spatial features, thereby improving the accuracy of Gaussian deformation prediction. Extensive experiments on public datasets demonstrate that PGSTalker outperforms existing NeRF- and 3DGS-based approaches in rendering quality, lip-sync precision, and inference speed. Our method exhibits strong generalization capabilities and practical potential for real-world deployment.
Primary: Xinjiang University
All Institutions: School of Computer Science and Technology, Xinjiang University, Urumqi; Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center
PGSTalker presents a novel framework for real-time audio-driven talking head generation, significantly improving rendering efficiency and lip-sync accuracy through innovative methodologies. The work addresses critical challenges in the field, demonstrating strong experimental validation and practical potential for real-world applications.
The proposed methodology in PGSTalker is innovative, leveraging 3D Gaussian Splatting (3DGS) and introducing a pixel-aware density control strategy that enhances rendering efficiency and detail in dynamic facial regions. The Multimodal Gated Fusion Module (MGF) is a significant contribution, effectively integrating audio and spatial features for improved Gaussian deformation prediction. The approach is well-structured, with a clear delineation of the face and inside mouth branches, allowing for targeted modeling of distinct facial dynamics. The methodology is robust, addressing limitations of prior NeRF and 3DGS methods while maintaining real-time performance.
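A generic gated fusion of audio and spatial features might look like the sketch below, where a sigmoid gate decides per channel how much each modality contributes before deformation prediction; the actual MGF design in PGSTalker is richer, so treat this as an illustration of the idea only.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse audio and spatial (point) features with a learned per-channel gate."""
    def __init__(self, audio_dim: int, spatial_dim: int, out_dim: int):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, out_dim)
        self.proj_s = nn.Linear(spatial_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, audio_feat, spatial_feat):
        a, s = self.proj_a(audio_feat), self.proj_s(spatial_feat)
        g = self.gate(torch.cat([a, s], dim=-1))     # 0..1 blend per channel
        return g * a + (1 - g) * s

fused = GatedFusion(64, 32, 128)(torch.randn(8, 64), torch.randn(8, 32))
```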
The experiments are extensive, utilizing public datasets and comparing PGSTalker against several state-of-the-art methods. The evaluation metrics are comprehensive, including rendering quality, lip-sync accuracy, and inference speed, which are critical for real-time applications. The results demonstrate PGSTalker's superiority in all evaluated metrics, providing strong evidence of its effectiveness. The ablation study further validates the contributions of key components, enhancing the credibility of the findings.
While the paper provides a detailed description of the methods and experiments, it lacks specific URLs for code or demo access, which could hinder reproducibility. The implementation details, such as the training pipeline and loss functions, are well-explained, but without a public repository, independent verification of results may be challenging.
One identified limitation is the reliance on high-quality training data, which may not be readily available for all potential users. Additionally, the model's performance in highly variable audio conditions or with diverse speaker characteristics has not been extensively tested, which could affect generalization in real-world applications. The computational requirements, while improved, still necessitate significant resources for real-time performance.
The implications of PGSTalker are significant, particularly in fields such as virtual reality, digital avatars, and film production. The ability to generate realistic, audio-driven talking heads in real-time could revolutionize user interactions in digital environments and enhance content creation in media. The framework's potential for practical deployment suggests a wide range of applications, from entertainment to telecommunication.
State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling strategies that can both generalize and handle idiosyncrasy could be more effective for capturing atypical speech. To investigate this, we compare four strategies: (a) normative models trained on typical speech (no personalization), (b) idiosyncratic models completely personalized to individuals, (c) dysarthric-normative models trained on other dysarthric speakers, and (d) dysarthric-idiosyncratic models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than the idiosyncratic approach while requiring less than half as much personalized data (36.43 WER with a training size of 128 vs. 36.99 with 256). Further, we found that tuning the speech encoder alone (as opposed to the LM decoder) yielded the best results, reducing the word error rate from 71% to 32% on average. Our findings highlight the value of leveraging both normative (cross-speaker) and idiosyncratic (speaker-specific) patterns to improve ASR for underrepresented speech populations.
Primary: Vanderbilt University
All Institutions: Department of Computer Science, Stony Brook University; College of Connected Computing, Vanderbilt University
The main contribution of this paper is the introduction of a dysarthric-idiosyncratic modeling approach that effectively combines normative and personalized strategies to enhance ASR performance for individuals with dysarthria. This work not only advances the technical understanding of ASR in atypical speech contexts but also highlights the need for more inclusive and representative datasets in machine learning research.
The paper presents a systematic comparison of four modeling strategies for automatic speech recognition (ASR) tailored to dysarthric speech, which is a significant contribution to the field. The methodology is well-structured, employing a combination of normative and idiosyncratic modeling approaches. The use of transfer learning and fine-tuning techniques is appropriate given the limited data available for dysarthric speakers. However, while the methodology is sound, it could benefit from a more detailed explanation of the parameter-efficient strategies employed and their specific impacts on model performance.
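The encoder-only adaptation strategy the paper highlights can be expressed in a few lines with Hugging Face Transformers; the checkpoint name and learning rate below are illustrative choices, not the paper's reported configuration.

```python
from transformers import WhisperForConditionalGeneration

# Sketch of encoder-only fine-tuning: freeze everything, then unfreeze the speech encoder.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

for p in model.parameters():
    p.requires_grad = False                      # freeze all weights, including the LM decoder
for p in model.model.encoder.parameters():
    p.requires_grad = True                       # adapt only the speech encoder

trainable = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)   # plug into a standard fine-tuning loop
```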
The experiments are thorough, utilizing a well-defined dataset (TORGO) and employing leave-one-out cross-validation to ensure robustness. The results demonstrate clear improvements in word error rates (WER) across different modeling strategies, particularly highlighting the effectiveness of the dysarthric-idiosyncratic model. However, the reliance on a small dataset limits the generalizability of the findings, and the paper could have included more extensive comparisons with existing state-of-the-art models.
The paper provides sufficient details regarding the experimental setup, including model architecture, training parameters, and evaluation metrics, which supports reproducibility. The availability of the GitHub repository further enhances the potential for other researchers to replicate the study. However, a more detailed description of the data preprocessing steps would improve clarity.
The study is limited by the small number of speakers in the TORGO dataset, which may not capture the full diversity of dysarthric speech. Additionally, the paper acknowledges that factors such as regional accents and dialects were not controlled, which could influence results. The authors also note the need for larger datasets to validate their findings, particularly in real-world applications.
This research has significant implications for improving ASR systems for individuals with dysarthria, a population often underserved by current technologies. By demonstrating that a hybrid approach can outperform purely personalized models, the findings could lead to more accessible and effective speech recognition tools in clinical settings. The study also emphasizes the importance of inclusive AI development, which is crucial for ensuring that technological advancements benefit all users, including those with disabilities.
This paper introduces MR-CQTdiff, a novel neural-network architecture for diffusion-based audio generation that leverages a multi-resolution Constant-Q Transform (CQT). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time-frequency resolution on an octave-by-octave basis. This design addresses the issue of low temporal resolution at lower frequencies, enabling more flexible and expressive audio generation. We conduct an evaluation using the Fréchet Audio Distance (FAD) metric across various architectures and two datasets. Experimental results demonstrate that MR-CQTdiff achieves state-of-the-art audio quality, outperforming competing architectures.
Primary: Both authors contributed equally to this work. It was funded by Volkswagen Foundation (Volkswagen Stiftung) Germany
All Institutions: Both authors contributed equally to this work. It was funded by Volkswagen Foundation (Volkswagen Stiftung) Germany, under Grant no. 96 881
The paper introduces MR-CQTdiff, a novel architecture for diffusion-based audio generation that leverages a multi-resolution constant-Q transform to improve audio quality. The comprehensive analysis highlights its innovative methodology, robust experimental validation, and significant implications for the field of audio processing and generation.
The paper presents a well-structured methodology that introduces the MR-CQTdiff architecture, which innovatively employs a multi-resolution Constant-Q Transform (CQT) to enhance diffusion-based audio generation. The architecture's design addresses the critical trade-off between time and frequency resolution, particularly for low-frequency audio signals, by utilizing multiple parallel CQT filters. This approach allows for better capture of transient audio events and harmonically rich content, which is a significant improvement over existing methods. The use of a U-Net structure facilitates effective feature reuse and gradient flow, enhancing the model's training stability.
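The octave-wise resolution issue that MR-CQTdiff targets follows from the constant-Q constraint itself: the analysis window for a bin scales inversely with its center frequency, so the lowest octaves carry very coarse temporal resolution. The numbers below use only the standard CQT relations and are not taken from the paper's implementation.

```python
# Sketch of the constant-Q time-frequency trade-off that motivates octave-wise
# resolution: lower octaves require much longer analysis windows.
import numpy as np

sr = 44100          # sample rate (Hz)
fmin = 32.70        # C1, lowest analyzed pitch (assumption)
bins_per_octave = 12
n_octaves = 8

Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)    # quality factor

for octave in range(n_octaves):
    f_low = fmin * 2 ** octave                   # lowest bin in this octave
    win_len = int(np.ceil(Q * sr / f_low))       # filter length in samples
    print(f"octave {octave}: f_low={f_low:7.1f} Hz, "
          f"window ≈ {win_len} samples ({1000 * win_len / sr:.1f} ms)")
```

The lowest octave needs windows on the order of half a second, while the top octave needs only a few milliseconds, which is exactly the disparity the multi-resolution design exploits.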
The experimental evaluation is robust, utilizing two diverse datasets (FMA-Large and OpenSinger) to assess the performance of MR-CQTdiff against several strong baselines. The use of the Fréchet Audio Distance (FAD) metric provides a quantitative measure of audio quality, and the results demonstrate that MR-CQTdiff consistently outperforms other models, particularly in capturing transient details in vocal audio. The thoroughness of the experiments, including the comparison with latent diffusion models, adds credibility to the findings.
The paper provides sufficient implementation details, including the architecture specifications, training parameters, and dataset descriptions, which enhance reproducibility. The availability of the code on GitHub and the demo page with audio samples further supports this aspect, allowing other researchers to replicate the experiments and validate the results.
While the proposed architecture shows promising results, the paper acknowledges limitations in terms of computational efficiency compared to latent diffusion models. The focus on audio generation quality may lead to increased resource consumption, which could be a barrier for broader applications. Additionally, the evaluation primarily focuses on unconditional generation, and further exploration of conditional generation tasks could provide deeper insights into the model's capabilities.
The MR-CQTdiff architecture has significant potential applications in various audio generation tasks, including music synthesis, sound design, and audio restoration. By improving the quality of generated audio, this work could influence the development of more sophisticated audio generation tools and enhance user experiences in creative industries. The findings may also inspire further research into time-frequency representations in generative models, potentially leading to advancements in other domains of machine learning.
Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware mechanisms or the difficulty of modeling complex rhythmic patterns. Rhythmic information is crucial, as it defines structural similarity (e.g., tempo, BPM) and directly impacts the overall quality of the generated music. In this paper, we introduce Etude, a three-stage architecture consisting of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and applying a novel, simplified REMI-based tokenization, our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation through style injection. Subjective evaluations with human listeners show that Etude substantially outperforms prior models, achieving a quality level comparable to that of human composers.
The main contribution of this paper is the introduction of the Etude framework, a novel three-stage architecture for Automatic Piano Cover Generation that significantly enhances the quality and controllability of generated music. This work represents a substantial advancement in the field of music generation, addressing critical challenges in structural consistency and stylistic diversity through innovative methodologies and comprehensive evaluations.
The proposed methodology of Etude is well-structured and innovative, consisting of a three-stage architecture that effectively separates the extraction of musical features, the structuralization of rhythmic information, and the decoding of the final output. This modular approach addresses key challenges in Automatic Piano Cover Generation (APCG), particularly the need for structural consistency and stylistic control. The introduction of Tiny-REMI as a minimalistic token representation is a significant improvement over previous models, simplifying the learning task for the decoder. The use of a pre-trained Beat-Transformer for rhythmic analysis is also a notable enhancement, ensuring that the generated covers maintain a coherent rhythmic framework.
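To illustrate what a REMI-style event stream looks like, the toy tokenizer below maps notes to bar/position/pitch/duration tokens; the token names and quantization grid are illustrative assumptions, not the paper's actual Tiny-REMI vocabulary.

```python
# Illustrative REMI-style tokenization of a short piano phrase.
from dataclasses import dataclass

@dataclass
class Note:
    bar: int        # bar index
    position: int   # onset, in 16th-note steps within the bar (0..15)
    pitch: int      # MIDI pitch
    duration: int   # duration in 16th-note steps

def tokenize(notes):
    tokens, current_bar = [], None
    for n in sorted(notes, key=lambda n: (n.bar, n.position, n.pitch)):
        if n.bar != current_bar:               # emit a Bar token at every bar change
            tokens.append("Bar")
            current_bar = n.bar
        tokens += [f"Pos_{n.position}", f"Pitch_{n.pitch}", f"Dur_{n.duration}"]
    return tokens

phrase = [Note(0, 0, 60, 4), Note(0, 4, 64, 4), Note(0, 8, 67, 8)]
print(tokenize(phrase))
# ['Bar', 'Pos_0', 'Pitch_60', 'Dur_4', 'Pos_4', 'Pitch_64', 'Dur_4', 'Pos_8', 'Pitch_67', 'Dur_8']
```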
The experimental evaluation is comprehensive, utilizing both objective and subjective metrics to assess the performance of the Etude framework against several baseline models. The dataset of approximately 7,700 pop song and piano cover pairs is substantial, and the authors have taken care to ensure data quality through filtering and alignment methods. The results demonstrate that Etude significantly outperforms existing models in both objective metrics (WPD, RGC, IPE) and subjective evaluations (similarity, fluency, dynamic expression, overall quality), providing strong evidence for the effectiveness of the proposed approach.
The paper provides sufficient detail regarding the training process, model architecture, and evaluation metrics, which supports reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the results. The authors mention that all code and audio demonstrations are available on their project page, which is a positive aspect, but the absence of a GitHub link could hinder broader accessibility.
One identified limitation is the reliance on the performance of the front-end components, particularly the Beat-Detector and Extractor. The authors acknowledge that the framework's structural accuracy is constrained by the precision of the beat tracker and that the Extractor's flattening process may lead to information loss. This could affect the model's ability to capture the primary melody of the original song, resulting in incomplete melodic lines. Additionally, the subjective evaluation indicates that while the model performs well, it still falls short of human performance in certain aspects.
The potential applications of the Etude framework are significant, particularly in the realm of music generation and AI-assisted creativity. The ability to generate high-quality, stylistically diverse piano covers could enhance user engagement in music production and education. Furthermore, the framework's modular design allows for future extensions, such as integrating more advanced beat-tracking modules or exploring multi-stream extractors, which could further improve its capabilities.
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
Primary: Concordia University
All Institutions: Concordia University, Mila-Quebec AI Institute
The main contribution of this paper is the development of FocalCodec-Stream, a novel streaming low-bitrate speech codec that effectively balances reconstruction quality, semantic preservation, and latency, thereby advancing the state of the art in neural audio codecs. The comprehensive analysis of the technical contributions, methodology, and experimental results highlights its significance in addressing real-time audio processing challenges.
The methodology presented in this paper is robust and innovative, particularly in its use of multi-stage causal distillation to adapt the WavLM architecture for streaming applications. The introduction of a lightweight refiner module to enhance audio quality under latency constraints is a significant contribution, as it addresses a critical challenge in low-latency audio processing. The architectural modifications, such as the use of causal convolutions and sliding window attention, are well-justified and effectively enable the codec to maintain performance while achieving streamability. The paper also provides a clear and structured approach to the codec design, which is commendable.
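The streamability constraint discussed above rests largely on causal convolutions, which can be sketched in a few lines of PyTorch; kernel sizes and channel counts here are placeholders rather than FocalCodec-Stream's actual configuration.

```python
# Minimal sketch of the causal-convolution idea behind streamable codecs:
# pad only on the left so each output frame depends on past samples alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # left padding only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))    # no look-ahead into the future

x = torch.randn(1, 1, 16000)                         # 1 s of 16 kHz audio
y = CausalConv1d(1, 32, kernel_size=7)(x)
print(y.shape)                                       # torch.Size([1, 32, 16000])
```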
The experimental evaluation is thorough, comparing FocalCodec-Stream against several existing streaming codecs across multiple metrics, including speech resynthesis, voice conversion, and downstream task performance. The results demonstrate that FocalCodec-Stream consistently outperforms its competitors in terms of intelligibility and speaker fidelity, even at lower bitrates. The use of diverse datasets, such as LibriSpeech and Libri-Light, adds credibility to the findings. The ablation studies further substantiate the importance of the refiner and the multi-stage training approach, providing a comprehensive understanding of the model's performance.
The paper mentions that code and checkpoints will be made available on GitHub, which is a positive aspect for reproducibility. However, while the implementation details are described, the paper could benefit from more explicit guidance on hyperparameter settings and training procedures to facilitate easier replication of the results by other researchers.
One limitation noted in the paper is the performance gap between FocalCodec-Stream and the full-context FocalCodec, particularly at lower bitrates. This is expected due to the stricter constraints imposed by real-time streaming. Additionally, the paper does not address potential challenges in scaling the model to larger datasets or the implications of deploying such a codec in resource-constrained environments.
The potential applications of FocalCodec-Stream are significant, particularly in real-time speech applications such as virtual assistants, telecommunication, and interactive dialogue systems. By achieving low-latency, high-quality audio coding, this work could enhance user experiences in various audio-related technologies, making it a valuable contribution to the field of machine learning and audio processing.
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, thereby minimizing parameter exchange. To harness heterogeneity, our aggregation method, inspired by collaborative filtering, is introduced to create custom models for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
The main contribution of this paper is the introduction of Fed-PISA, a novel federated learning framework for voice cloning that effectively balances personalization and communication efficiency through a disentangled adaptation mechanism and personalized aggregation strategy. This work significantly advances the field of federated TTS systems, addressing key challenges in personalization and communication costs while demonstrating strong empirical results.
The proposed methodology of Fed-PISA is innovative in its use of a disentangled Low-Rank Adaptation (LoRA) mechanism to separate speaker timbre from stylistic features, allowing for efficient federated learning without compromising personalization. The introduction of a personalized aggregation strategy based on collaborative filtering is a significant advancement, enabling the model to leverage stylistic similarities among clients effectively. The detailed description of the LoRA parameterization and the client-server interaction provides a clear understanding of the framework's operational dynamics.
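The collaborative-filtering-style aggregation can be sketched as a similarity-weighted average of the clients' style-LoRA updates; the similarity features and softmax temperature below are assumptions for illustration, since the paper's exact aggregation rule is not reproduced here.

```python
# Sketch of per-client aggregation of style-LoRA weights: each client receives a
# custom average weighted by its stylistic similarity to peers.
import torch
import torch.nn.functional as F

def personalized_aggregate(style_loras, style_embs, temperature=0.1):
    """style_loras: (n_clients, n_params) flattened style-LoRA updates
       style_embs:  (n_clients, d) per-client style descriptors"""
    sims = F.cosine_similarity(style_embs.unsqueeze(1), style_embs.unsqueeze(0), dim=-1)
    weights = F.softmax(sims / temperature, dim=-1)    # row i: weights over all peers
    return weights @ style_loras                       # (n_clients, n_params)

loras = torch.randn(4, 1000)       # 4 clients, toy parameter vectors
embs = torch.randn(4, 64)
custom = personalized_aggregate(loras, embs)
print(custom.shape)                # torch.Size([4, 1000])
```

The private ID-LoRA never enters this step: only the style-LoRA vectors are exchanged, which is what keeps the timbre local and the communication footprint small.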
The experiments are robust, utilizing four public datasets with emotion annotations to evaluate the effectiveness of Fed-PISA against various baselines, including both federated and non-federated methods. The results demonstrate significant improvements in style expressivity, speaker similarity, and naturalness, with detailed metrics reported. The inclusion of ablation studies strengthens the findings, confirming the necessity of the proposed components.
The paper provides sufficient implementation details, including the architecture, training parameters, and evaluation metrics, which facilitates reproducibility. The availability of a demo page with audio samples further aids in understanding the practical implications of the research.
While the approach shows promise, it may still be limited by the reliance on the quality of the datasets used and the inherent challenges of federated learning, such as variability in client data distribution. Additionally, the communication costs, while minimized, could still be a concern in highly distributed environments.
The implications of this work extend to various applications in personalized speech synthesis, voice assistants, and accessibility technologies. By enabling effective voice cloning with privacy preservation, Fed-PISA could enhance user experiences in numerous domains, including entertainment, education, and assistive technologies.
In this work, we investigate multimodal foundation models (MFMs) for EmoFake detection (EFD) and hypothesize that they will outperform audio foundation models (AFMs). MFMs, owing to their cross-modal pre-training, learn emotional patterns from multiple modalities, while AFMs rely only on audio. As such, MFMs can better recognize unnatural emotional shifts and inconsistencies in manipulated audio, making them more effective at distinguishing real from fake emotional expressions. To validate our hypothesis, we conduct a comprehensive comparative analysis of state-of-the-art (SOTA) MFMs (e.g., LanguageBind) alongside AFMs (e.g., WavLM). Our experiments confirm that MFMs surpass AFMs for EFD. Beyond the performance of individual foundation models (FMs), we explore FM fusion, motivated by findings in related research areas such as synthetic speech detection and speech emotion recognition. To this end, we propose SCAR, a novel framework for effective fusion. SCAR introduces a nested cross-attention mechanism, where representations from FMs interact at two sequential stages to refine information exchange. Additionally, a self-attention refinement module further enhances feature representations by reinforcing important cross-FM cues while suppressing noise. Through SCAR with synergistic fusion of MFMs, we achieve SOTA performance, surpassing standalone FMs, conventional fusion approaches, and previous works on EFD.
The main contribution of this paper is the introduction of a novel framework, SCAR, for EmoFake detection that leverages multimodal foundation models and demonstrates superior performance compared to existing audio foundation models. This research significantly advances the understanding and capabilities in the detection of emotionally manipulated audio, addressing a critical gap in the field of audio deepfake detection.
The paper presents a well-structured methodology for EmoFake detection using multimodal foundation models (MFMs) and a novel framework called SCAR for fusing these models. The nested cross-attention mechanism is a significant innovation, allowing for enhanced interaction between different modalities, which is a critical aspect of the proposed approach. The authors provide a clear explanation of the architecture and the rationale behind their design choices, which strengthens the overall methodology. However, the paper could benefit from a more detailed comparison of the proposed SCAR framework with existing fusion techniques beyond simple concatenation.
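A rough sketch of the two-stage cross-attention idea, written with standard PyTorch attention modules, is given below; the dimensions, head counts, and exact wiring are assumptions and should not be read as the SCAR reference implementation.

```python
# Two-stage ("nested") cross-attention fusion of two foundation-model feature streams,
# followed by a self-attention refinement pass.
import torch
import torch.nn as nn

class NestedCrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.refine = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):                  # (batch, time, dim) each
        h, _ = self.cross1(feat_a, feat_b, feat_b)      # stage 1: A attends to B
        h, _ = self.cross2(h, feat_a, feat_a)           # stage 2: refined query attends back to A
        out, _ = self.refine(h, h, h)                   # self-attention refinement
        return out

a, b = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
print(NestedCrossAttentionFusion()(a, b).shape)         # torch.Size([2, 50, 256])
```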
The experiments are comprehensive, utilizing a unique dataset specifically designed for EmoFake detection. The authors validate their hypothesis through rigorous testing, demonstrating that MFMs outperform AFMs in EFD tasks. The use of Equal Error Rate (EER) as a metric is appropriate for the domain, and the results are clearly presented, showing significant improvements over baseline models. However, the paper lacks a thorough exploration of the statistical significance of the results, which would bolster the claims of superiority.
The authors provide a GitHub repository with accessible code and models, which is a positive aspect for reproducibility. The training details, including optimizer settings and architecture specifics, are adequately described, allowing other researchers to replicate the experiments. However, the paper could enhance reproducibility by including more extensive documentation on the dataset and preprocessing steps.
One limitation of the study is the reliance on a single dataset for evaluation, which may not capture the full variability of EmoFake detection scenarios. Additionally, while the proposed SCAR framework shows promise, its complexity may pose challenges for real-time applications. The paper also does not address potential biases in the dataset or the models used.
The implications of this research are significant, particularly in areas such as misinformation, security, and emotional manipulation detection. As deepfake technology becomes increasingly sophisticated, the ability to detect emotionally manipulated audio could play a crucial role in maintaining trust in digital communications. The findings could inform future research directions and applications in various fields, including forensics and media verification.
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology, Department of Speech, Music and Hearing
VoXtream presents a pioneering approach to streaming TTS with ultra-low latency, combining innovative transformer architectures to achieve competitive performance. The paper's contributions are substantial, addressing a critical need in real-time speech synthesis and setting a new benchmark for future research in the field.
The methodology presented in VoXtream is innovative, utilizing a combination of autoregressive transformers to achieve low-latency streaming TTS. The architecture's design, which includes an incremental Phoneme Transformer, a Temporal Transformer, and a Depth Transformer, is well thought out and addresses the critical issue of initial latency in TTS systems. The use of dynamic look-ahead for phoneme processing is particularly noteworthy, as it allows for immediate speech output without waiting for the entire input, which is a significant advancement over existing models. The integration of these components into a cohesive framework demonstrates a solid understanding of the challenges in TTS and offers a practical solution.
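The look-ahead idea can be illustrated with a simple attention mask in which audio frame t may attend to phonemes up to position t plus a small window; VoXtream's actual alignment is monotonic with a dynamic window, so the fixed-window mask below is a deliberate simplification.

```python
# Toy illustration of limited phoneme look-ahead for streaming TTS.
import torch

def lookahead_mask(n_frames, n_phonemes, lookahead=2):
    # True = masked (disallowed) position, as expected by PyTorch attention masks.
    frame_idx = torch.arange(n_frames).unsqueeze(1)       # (n_frames, 1)
    phon_idx = torch.arange(n_phonemes).unsqueeze(0)      # (1, n_phonemes)
    return phon_idx > frame_idx + lookahead               # (n_frames, n_phonemes)

print(lookahead_mask(4, 6, lookahead=1).int())
# frame 0 may attend to phonemes 0-1, frame 1 to phonemes 0-2, and so on,
# so synthesis can begin before the full phoneme sequence is available.
```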
The experimental evaluation is robust, with comprehensive testing on established datasets such as SEED-TTS and LibriSpeech. The paper provides clear comparisons with multiple baseline models, showcasing VoXtream's performance in terms of intelligibility, naturalness, and latency. The results indicate that VoXtream not only meets but often exceeds the performance of larger models, despite being trained on a smaller dataset. The use of both objective metrics (WER, SPK-SIM, UTMOS) and subjective evaluations through user studies strengthens the credibility of the findings.
The paper includes sufficient implementation details, such as model architecture specifications, training procedures, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the results. The authors mention the use of specific datasets and training setups, which is helpful, but a direct link to the code would enhance reproducibility further.
One limitation of the study is the reliance on a mid-scale dataset (9k hours), which may restrict the model's generalizability compared to systems trained on larger datasets. Additionally, while the model achieves low initial latency, the paper does not extensively discuss the trade-offs in quality that may arise from such optimizations. The subjective evaluations, while positive, could benefit from a larger participant pool to ensure broader applicability of the results.
The implications of VoXtream are significant for real-time applications in conversational AI, voice assistants, and simultaneous translation systems. The ability to generate speech with minimal latency enhances user experience and engagement, making it a valuable contribution to the field of speech synthesis. The model's architecture could inspire further research into low-latency systems and their applications in various domains, potentially leading to advancements in human-computer interaction.
In speech enhancement, knowledge distillation (KD) compresses models by transferring a high-capacity teacher's knowledge to a compact student. However, conventional KD methods train the student to mimic the teacher's output entirely, which forces the student to imitate the regions where the teacher performs poorly and to apply distillation to the regions where the student already performs well, which yields only marginal gains. We propose Distilling Selective Patches (DISPatch), a KD framework for speech enhancement that applies the distillation loss to spectrogram patches where the teacher outperforms the student, as determined by a Knowledge Gap Score. This approach guides optimization toward areas with the most significant potential for student improvement while minimizing the influence of regions where the teacher may provide unreliable instruction. Furthermore, we introduce Multi-Scale Selective Patches (MSSP), a frequency-dependent method that uses different patch sizes across low- and high-frequency bands to account for spectral heterogeneity. We incorporate DISPatch into conventional KD methods and observe consistent gains in compact students. Moreover, integrating DISPatch and MSSP into a state-of-the-art frequency-dependent KD method considerably improves performance across all metrics.
Primary: School of Electrical Engineering
All Institutions: School of Electrical Engineering, Republic of Korea
The main contribution of this paper is the introduction of the DISPatch framework, which innovatively applies selective knowledge distillation in speech enhancement, leading to significant performance improvements while addressing the limitations of traditional methods. This work represents a meaningful advancement in the field, with potential applications extending beyond speech enhancement to various machine learning tasks.
The proposed DISPatch framework introduces a novel approach to knowledge distillation in speech enhancement by selectively applying distillation losses to spectrogram patches where the teacher model outperforms the student. This is quantified using a Knowledge Gap Score (KGS), which is a significant advancement over traditional methods that indiscriminately apply distillation across all output regions. The introduction of Multi-Scale Selective Patches (MSSP) further enhances the methodology by adapting patch sizes based on frequency characteristics, addressing spectral heterogeneity effectively. The methodology is well-structured and clearly articulated, demonstrating a thoughtful integration of existing techniques with innovative modifications.
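A compact way to see the selective mechanism is to compute a per-patch gap as the student's error minus the teacher's error against the clean target and distill only where that gap is positive; the exact KGS definition and patch sizes used in DISPatch may differ from this illustrative reduction.

```python
# Sketch of patch-selective distillation on spectrograms: distill only where the
# teacher is the better of the two models.
import torch
import torch.nn.functional as F

def dispatch_loss(student_spec, teacher_spec, clean_spec, patch=(16, 16)):
    def patch_mse(a, b):
        err = (a - b) ** 2                                       # (B, F, T)
        return F.avg_pool2d(err.unsqueeze(1), patch).squeeze(1)  # mean error per patch
    kgs = patch_mse(student_spec, clean_spec) - patch_mse(teacher_spec, clean_spec)
    select = (kgs > 0).float()                                   # teacher outperforms student
    distill = patch_mse(student_spec, teacher_spec.detach())     # student mimics teacher
    return (select * distill).sum() / select.sum().clamp(min=1.0)

s, t, c = (torch.randn(2, 256, 128) for _ in range(3))
print(dispatch_loss(s, t, c).item())
```

A frequency-dependent variant in the spirit of MSSP would simply call this with smaller patches on the low-frequency rows and larger ones on the high-frequency rows (or vice versa, depending on the chosen granularity).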
The experiments are comprehensive, utilizing well-established datasets such as DNS2020 and VoiceBank+DEMAND, which provide a robust basis for evaluating the proposed method. The results indicate consistent improvements across various metrics when DISPatch is applied, particularly in conjunction with DFKD and MSSP. The ablation studies effectively demonstrate the importance of the KGS in selecting informative patches, reinforcing the method's validity. However, the paper could benefit from more extensive comparisons with a broader range of existing methods to contextualize the performance gains.
The implementation details are sufficiently detailed, including model configurations, training setups, and hyperparameters. The paper provides a GitHub link for accessing the code, which is crucial for reproducibility. However, the absence of a clear description of the environment and dependencies required for running the code could pose challenges for some researchers.
While the methodology shows promise, it may be limited by the assumptions made regarding the teacher model's superiority. If a teacher model is not adequately trained or is flawed, the selective distillation might not yield the expected benefits. Additionally, the paper does not explore the scalability of the approach to larger datasets or more complex models, which could be a potential area for future research.
The DISPatch framework has significant implications for real-world applications in speech enhancement, particularly in resource-constrained environments where computational efficiency is paramount. By improving the performance of compact models, this research could facilitate the deployment of advanced speech processing technologies in mobile devices and other low-power applications. The principles established in this work may also be applicable to other domains within machine learning, such as image processing and natural language processing, where selective learning could enhance model performance.
Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their dependence on multi-step generation is computationally expensive and vulnerable to discretization errors. Recent advances in one-step generative modeling, particularly MeanFlow, provide a promising alternative by reformulating dynamics through average velocity fields. In this work, we present COSE, a one-step FM framework tailored for SE. To address the high training overhead of Jacobian-vector product (JVP) computations in MeanFlow, we introduce a velocity composition identity to compute average velocity efficiently, eliminating expensive computation while preserving theoretical consistency and achieving competitive enhancement quality. Extensive experiments on standard benchmarks show that COSE delivers up to 5x faster sampling and reduces training cost by 40%, all without compromising speech quality. Code is available at https://github.com/ICDM-UESTC/COSE.
The main contribution of this paper is the introduction of COSE, a one-step flow matching framework for speech enhancement that significantly reduces computational costs while maintaining high-quality output. This work represents a meaningful advancement in the efficiency of generative models for audio processing, with promising implications for real-time applications in speech technology.
The proposed COSE framework introduces a novel approach to one-step flow matching for speech enhancement by utilizing a velocity composition identity to efficiently compute average velocities. This innovation addresses the computational overhead associated with Jacobian-vector product computations in existing MeanFlow models, which is a significant improvement in terms of efficiency while maintaining theoretical consistency. The methodology is well-structured, clearly delineating the steps taken to achieve the proposed enhancements. However, the paper could benefit from a more detailed explanation of the underlying mathematical principles and their implications for the broader context of generative modeling.
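For context, the JVP bottleneck arises from the MeanFlow formulation of the average velocity field, recalled below in its standard form; the paper's velocity composition identity, which sidesteps the derivative term, is not reproduced here.

```latex
% Average velocity over [r, t] and the MeanFlow identity; the total time derivative on
% the right requires a Jacobian-vector product during training.
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, \mathrm{d}\tau ,
\qquad
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t) .
```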
The authors conducted extensive experiments on standard benchmarks, demonstrating that COSE achieves up to 5x faster sampling and a 40% reduction in training costs without sacrificing speech quality. This is a compelling result that indicates the practical applicability of the framework. However, the paper lacks detailed descriptions of the datasets used, the specific metrics for evaluating speech quality, and comparisons with other state-of-the-art methods, which would strengthen the validation of the results.
The availability of code on GitHub is a positive aspect that enhances reproducibility. However, the paper does not provide sufficient details on the experimental setup, hyperparameter configurations, or specific versions of libraries used, which could hinder other researchers from replicating the results accurately.
One limitation is the reliance on standard benchmarks without exploring real-world applications or datasets that may present different challenges. Additionally, while the reduction in computational cost is significant, the paper does not discuss potential trade-offs in terms of model complexity or performance in edge cases.
The COSE framework has the potential to significantly impact the field of speech enhancement by providing a more efficient method that can be integrated into real-time applications, such as voice assistants and hearing aids. Its implications extend to various domains where clear speech quality is crucial, potentially improving user experiences across multiple technologies.
The spatial semantic segmentation task focuses on separating and classifying sound objects from multichannel signals. To achieve these two goals, conventional methods fine-tune a large classification model cascaded with the separation model and inject the classified labels as separation clues for the next iteration step. However, such integration is not ideal: fine-tuning on a smaller dataset loses the diversity of large classification models, the features from the source separation model differ from the inputs the pretrained classifier expects, and the injected one-hot class labels lack semantic depth, often leading to error propagation. To resolve these issues, we propose a Dual-Path Classifier (DPC) architecture that combines object features from a source separation model with semantic representations acquired from a pretrained classification model without fine-tuning. We also introduce a Semantic Clue Encoder (SCE) that enriches the semantic depth of injected clues. Our system achieves a state-of-the-art 11.19 dB CA-SDRi and enhanced semantic fidelity on the DCASE 2025 Task 4 evaluation set, surpassing the top-ranked performance of 11.00 dB. These results highlight the effectiveness of integrating separator-derived features and rich semantic clues.
Primary: School of Electrical Engineering
All Institutions: School of Electrical Engineering
The main contribution of this paper is the introduction of a novel Dual-Path Classifier and Semantic Clue Encoder that significantly enhance sound separation and classification performance. The methodology effectively addresses key limitations of existing approaches, leading to improved accuracy and robustness in audio processing tasks.
The proposed methodology introduces a Dual-Path Classifier (DPC) and a Semantic Clue Encoder (SCE) to address the challenges of sound separation and classification. The DPC architecture effectively combines object features from a source separation model with semantic representations from a pretrained classifier without fine-tuning, which is a significant improvement over conventional methods. The SCE enhances the semantic depth of the injected clues, mitigating the limitations of one-hot encoding. The architecture's design, which includes a dual-path CRNN and a robust fusion mechanism, demonstrates a thoughtful approach to leveraging existing models while preserving feature diversity and richness. However, the paper could benefit from a clearer explanation of the integration process between the DPC and SCE, as well as more detailed descriptions of the training process.
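The core of the dual-path idea, fusing separator-derived features with embeddings from a frozen pretrained classifier, can be sketched as follows; the feature dimensions, fusion layers, class count, and stand-in classifier are placeholders, not the paper's DPC.

```python
# Minimal sketch of dual-path fusion: object features from a separator are combined
# with embeddings from a frozen pretrained classifier (no fine-tuning).
import torch
import torch.nn as nn

class DualPathClassifier(nn.Module):
    def __init__(self, pretrained_classifier, sep_dim=128, sem_dim=512, n_classes=18):
        super().__init__()
        self.classifier = pretrained_classifier.eval()
        for p in self.classifier.parameters():
            p.requires_grad = False                      # keep the pretrained path frozen
        self.fuse = nn.Sequential(
            nn.Linear(sep_dim + sem_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, sep_features, waveform):
        with torch.no_grad():
            semantic = self.classifier(waveform)         # (B, sem_dim) embedding
        return self.fuse(torch.cat([sep_features, semantic], dim=-1))

dummy = nn.Sequential(nn.Flatten(), nn.Linear(16000, 512))   # stand-in for a real classifier
dpc = DualPathClassifier(dummy)
logits = dpc(torch.randn(2, 128), torch.randn(2, 16000))
print(logits.shape)                                           # torch.Size([2, 18])
```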
The experiments are well-structured, utilizing the DCASE 2025 task4 challenge as a benchmark for evaluation. The results indicate a clear performance improvement over previous systems, with a state-of-the-art CA-SDRi score of 11.19 dB. The paper provides comprehensive comparisons across different stages of the proposed framework, showcasing the effectiveness of both the DPC and SCE. However, the evaluation could be strengthened by including more diverse datasets and additional performance metrics to provide a broader perspective on the model's capabilities.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of pretrained weights and specific training configurations, it does not provide the exact architecture details, hyperparameters, or code repositories. Including this information would greatly enhance the ability of other researchers to replicate the results.
The paper identifies several limitations in existing methods, such as the loss of diversity in fine-tuning and the inadequacy of one-hot class labels. However, it does not thoroughly address potential weaknesses in the proposed approach, such as the reliance on pretrained models and the risk of overfitting to the training data. Additionally, the evaluation is limited to a specific challenge dataset, which may not fully represent real-world scenarios.
The proposed methods have significant implications for audio processing applications, particularly in environments where sound separation and classification are critical, such as in assistive technologies, smart environments, and multimedia content creation. By improving the accuracy and robustness of sound separation systems, this research could enhance user experiences in various audio-related applications.
The pipeline for multi-participant audiobook production primarily consists of three stages: script analysis, character voice timbre selection, and speech synthesis. Among these, script analysis can be automated with high accuracy using NLP models, whereas character voice timbre selection still relies on manual effort. Speech synthesis uses either manual dubbing or text-to-speech (TTS). While TTS boosts efficiency, it struggles with emotional expression, intonation control, and contextual scene adaptation. To address these challenges, we propose DeepDubbing, an end-to-end automated system for multi-participant audiobook production. The system comprises two main components: a Text-to-Timbre (TTT) model and a Context-Aware Instruct-TTS (CA-Instruct-TTS) model. The TTT model generates role-specific timbre embeddings conditioned on text descriptions. The CA-Instruct-TTS model synthesizes expressive speech by analyzing contextual dialogue and incorporating fine-grained emotional instructions. This system enables the automated generation of multi-participant audiobooks with both timbre-matched character voices and emotionally expressive narration, offering a novel solution for audiobook production.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, Beijing University of Civil Engineering and Architecture, Tencent Music Entertainment Lyra Lab
The main contribution of this paper is the introduction of DeepDubbing, an end-to-end automated system for multi-participant audiobook production that combines innovative text-to-timbre generation and context-aware speech synthesis. This work represents a significant advancement in the field of audio synthesis, addressing critical challenges in emotional expressiveness and character voice differentiation, thereby paving the way for more immersive audiobook experiences.
The methodology presented in the paper is robust and innovative, leveraging a dual-component architecture that includes a Text-to-Timbre (TTT) model and a Context-Aware Instruct-TTS (CA-Instruct-TTS) model. The use of conditional flow matching for timbre generation is a significant advancement, allowing for more nuanced and contextually appropriate voice synthesis. The integration of large language models (LLMs) for both timbre description generation and emotional instruction extraction showcases a sophisticated approach to automating audiobook production. However, the paper could benefit from a more detailed explanation of the training processes and hyperparameter choices for the models.
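For reference, the conditional flow matching objective underlying the TTT model is typically written as below, with c standing for the timbre-description conditioning; the model's exact parameterization and probability path may differ.

```latex
% Standard conditional flow-matching objective with a linear interpolation path;
% c denotes the timbre-description conditioning.
x_t = (1 - t)\, x_0 + t\, x_1, \quad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1],
\qquad
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}\left[ \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 \right].
```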
The experimental evaluation is comprehensive, utilizing a large-scale internal dataset and employing both subjective and objective metrics to assess the performance of the proposed models. The results indicate that the DeepDubbing system achieves high levels of naturalness and emotional expressiveness in synthesized speech, outperforming baseline methods. The use of a diverse set of evaluation metrics, including Character Matching Score and Mean Opinion Scores, adds credibility to the findings. However, the paper could improve by providing more comparative analysis against a wider range of existing systems.
The paper mentions the release of the BookVoice-50h dataset and provides a demo URL, which enhances reproducibility. However, specific implementation details, such as the exact configurations and training procedures for the models, are not thoroughly documented, making it challenging for other researchers to replicate the results without further guidance.
One notable limitation is the TTT model's struggle with generating child-like voices due to the lack of authentic child speech data in the training set. This limitation could hinder the system's applicability in scenarios requiring diverse character voices. Additionally, while the paper addresses the emotional expressiveness of the CA-Instruct-TTS model, it does not explore the potential biases that might arise from the training data or the implications of using LLMs in this context.
The proposed DeepDubbing system has significant potential applications in the audiobook industry, particularly in automating the production of multi-participant audiobooks, which could reduce costs and production times. The ability to generate emotionally expressive and contextually aware speech could enhance user engagement and experience. Furthermore, the methodologies developed could be adapted for other applications in voice synthesis, such as gaming, virtual reality, and interactive storytelling.
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.
Primary: Fudan University
All Institutions: Shanghai Key Laboratory of Intelligent Information Processing, Central Conservatory of Music, Department of Music AI and Music IT, Fudan University, School of Computer Science and Technology
The main contribution of this paper is the introduction of the TISDiSS framework, which effectively balances performance and computational efficiency in discriminative source separation tasks. This work presents a significant advancement in the field, particularly for applications requiring low-latency processing, while also providing a solid foundation for future research in scalable audio processing methodologies.
The proposed TISDiSS framework integrates several innovative components, including early-split multi-loss supervision and shared-parameter design, which are well-justified in the context of improving source separation tasks. The dynamic inference repetitions allow for a flexible trade-off between speed and performance, which is particularly relevant for real-time applications. However, while the methodology is robust, the paper could benefit from a clearer explanation of how these components interact and their specific contributions to the overall performance improvement.
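The shared-parameter, repeat-at-inference idea can be sketched with a single reusable block applied a caller-chosen number of times; the block internals below are placeholders rather than the TISDiSS architecture.

```python
# Sketch of inference-time scaling with a shared refinement block: the same weights
# are applied for a chosen number of repetitions, trading speed for quality without
# retraining.
import torch
import torch.nn as nn

class ScalableSeparator(nn.Module):
    def __init__(self, dim=128, n_sources=2):
        super().__init__()
        self.encode = nn.Linear(1, dim)
        self.shared_block = nn.GRU(dim, dim, batch_first=True)   # reused every repetition
        self.decode = nn.Linear(dim, n_sources)

    def forward(self, mixture, repetitions=4):                   # mixture: (B, T)
        h = self.encode(mixture.unsqueeze(-1))                   # (B, T, dim)
        for _ in range(repetitions):                             # deeper inference = more passes
            h = self.shared_block(h)[0] + h                      # residual refinement
        return self.decode(h)                                    # (B, T, n_sources)

model = ScalableSeparator()
fast = model(torch.randn(1, 8000), repetitions=1)                # low-latency setting
best = model(torch.randn(1, 8000), repetitions=8)                # quality setting
print(fast.shape, best.shape)
```

Early-split multi-loss supervision would attach the training loss to the output of each repetition, which is what makes shallow-inference operation viable at test time.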
The experiments are conducted on standard speech separation benchmarks, showcasing state-of-the-art performance with a reduced parameter count. The results are compelling and demonstrate the effectiveness of the TISDiSS framework. However, the paper lacks a comprehensive comparison with a broader range of existing methods, which could further validate the claimed advantages.
The authors provide a GitHub repository for code access, which is a positive aspect for reproducibility. However, the paper would benefit from more detailed documentation regarding the experimental setup, hyperparameters, and specific configurations used in the experiments to facilitate easier reproduction by other researchers.
One limitation is the potential overfitting to the specific benchmarks used, as the paper does not explore the generalizability of the TISDiSS framework across diverse datasets or tasks beyond speech separation. Additionally, the reliance on dynamic inference repetitions may introduce complexity in deployment, which could be a barrier for practical applications.
The TISDiSS framework has significant implications for real-time audio processing applications, such as virtual assistants and music production tools, where efficient source separation is crucial. By enabling scalable performance adjustments, it opens avenues for further research into adaptive models that can cater to varying computational resources.
While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale chain-of-thought audio data to teach LALM stepwise reasoning. To circumvent this data and modality gap, we present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student on the same audio-visual question answering (AVQA) dataset. SightSound-R1 consists of three core steps: (i) test-time scaling to generate audio-focused chains of thought (CoT) from an LVLM teacher, (ii) audio-grounded validation to filter hallucinations, and (iii) a distillation pipeline with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) for the LALM student. Results show that SightSound-R1 improves LALM reasoning performance both in the in-domain AVQA test set as well as in unseen auditory scenes and questions, outperforming both pretrained and label-only distilled baselines. Thus, we conclude that vision reasoning can be effectively transferred to audio models and scaled with abundant audio-visual data.
Primary: University of Washington
All Institutions: Columbia University, University of Washington
The main contribution of this paper is the introduction of SightSound-R1, a novel framework for cross-modal reasoning distillation that enhances the reasoning capabilities of audio-language models by leveraging the strengths of vision-language models. This work represents a significant step forward in bridging the modality gap in multimodal AI systems, with the potential for broad applications in various fields.
The proposed methodology of SightSound-R1 is innovative, leveraging a cross-modal distillation framework that effectively bridges the reasoning capabilities between LVLMs and LALMs. The three-step process—test-time scaling, audio-grounded validation, and a distillation pipeline—demonstrates a thoughtful approach to addressing the identified gap in reasoning capabilities. The use of self-consistency to generate diverse reasoning traces and the incorporation of a lightweight audio-grounded fact verification step are particularly noteworthy. However, the methodology could benefit from a more detailed explanation of the underlying assumptions and potential biases in the audio-grounded validation process.
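The GRPO stage mentioned in the abstract replaces a learned critic with group-relative advantages computed over several sampled responses per question; a minimal sketch of that normalization is given below, with the reward itself (e.g., whether the validated answer is correct) left as a placeholder.

```python
# Sketch of the group-relative advantage used in GRPO: sample a group of responses per
# prompt, score each, and normalize rewards within the group.
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (group_size,) scalar rewards for one prompt's sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])   # e.g. 1 if the answer passes validation
print(group_relative_advantages(rewards))
# Positive for above-average responses, negative otherwise; these advantages weight the
# clipped policy-gradient update in place of a value network.
```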
The experimental evaluation is robust, utilizing multiple datasets (AVQA, MMAU, and MUSIC-AVQA) to validate the effectiveness of the proposed framework. The results indicate significant improvements in LALM reasoning performance, particularly in sound tasks, which supports the hypothesis that reasoning can be effectively transferred from LVLMs. The comparative analysis against pretrained and label-only distilled baselines adds credibility to the findings. However, the paper could improve by providing more detailed statistical analysis and significance testing for the reported results.
The implementation details are described with sufficient clarity, including the use of specific models, training parameters, and evaluation metrics. However, the absence of a public code repository or supplementary materials limits the reproducibility of the results. Future work should consider making the code and trained models available to facilitate further research and validation of the findings.
One limitation identified is the potential for hallucinations in the reasoning generated by the LVLM teacher, which may mislead the LALM student during training. Additionally, the performance drop in certain categories (Speech and Music) suggests that the framework may not generalize equally across all audio types, indicating a need for further refinement and integration with LALM perception capabilities.
The implications of this research are significant, as it addresses a critical gap in multimodal reasoning capabilities, particularly in the audio domain. By enhancing LALMs' reasoning through cross-modal distillation, the framework has the potential to improve applications in audio understanding, accessibility technologies, and interactive AI systems. The approach could pave the way for more sophisticated audio-language models that can reason about complex soundscapes, ultimately contributing to advancements in human-computer interaction and multimedia content analysis.
Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs' reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: (1) convergence instability caused by ambiguous emotional boundaries and (2) limited reasoning ability when using relatively small models (e.g., 7B-parameter architectures). To overcome these limitations, we introduce EMO-RL, a novel framework incorporating reinforcement learning with two key innovations: Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). Built upon pretrained LALMs, our method employs group-relative policy optimization with emotion constraints. Comprehensive experiments demonstrate that our EMO-RL training strategies significantly enhance the emotional reasoning capabilities of LALMs, attaining state-of-the-art results on both the MELD and IEMOCAP datasets, while cross-dataset experiments demonstrate strong generalization.
Primary: and Author n
All Institutions: Address line, and Author n
The main contribution of this paper is the introduction of the EMO-RL framework, which enhances the emotional reasoning capabilities of large audio-language models for speech emotion recognition through innovative reinforcement learning techniques. This work represents a meaningful advancement in the field, addressing critical challenges in emotion recognition and setting a foundation for future research in multi-modal emotion detection systems.
The proposed EMO-RL framework effectively integrates reinforcement learning with emotion-specific strategies, namely Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). This innovative approach addresses the challenges of convergence instability and limited reasoning capabilities in speech emotion recognition tasks. The methodology is well-structured, providing a clear transformation of the SER problem into a regression framework that accommodates emotional nuances. However, the reliance on psychological models like Plutchik's wheel for reward structuring, while beneficial, may introduce biases based on the chosen emotional framework.
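To illustrate the similarity-weighted reward idea, here is a minimal sketch assuming a hand-crafted emotion-similarity matrix loosely inspired by Plutchik's wheel; the label set, similarity values, format bonus, and function name are illustrative assumptions rather than the paper's actual reward.

```python
import numpy as np

# Illustrative emotion label set and a hand-crafted similarity matrix (values are
# assumptions for demonstration, not taken from the paper).
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "neutral"]
SIM = np.array([
    # joy   sad   ang   fear  surp  neut
    [1.00, 0.10, 0.15, 0.10, 0.60, 0.40],  # joy
    [0.10, 1.00, 0.40, 0.55, 0.20, 0.40],  # sadness
    [0.15, 0.40, 1.00, 0.50, 0.30, 0.30],  # anger
    [0.10, 0.55, 0.50, 1.00, 0.45, 0.30],  # fear
    [0.60, 0.20, 0.30, 0.45, 1.00, 0.35],  # surprise
    [0.40, 0.40, 0.30, 0.30, 0.35, 1.00],  # neutral
])

def emotion_similarity_weighted_reward(predicted: str, gold: str,
                                       format_ok: bool, format_bonus: float = 0.2) -> float:
    """Reward = similarity(pred, gold), plus a small bonus when the structured
    reasoning output is well formed. Exact matches receive the full reward of 1.0."""
    i, j = EMOTIONS.index(predicted), EMOTIONS.index(gold)
    reward = SIM[i, j]
    if format_ok:
        reward += format_bonus
    return float(reward)

# A near-miss ("fear" vs. gold "sadness") is penalized less than a distant
# confusion ("joy" vs. "sadness"), which softens ambiguous emotional boundaries.
print(emotion_similarity_weighted_reward("fear", "sadness", format_ok=True))
print(emotion_similarity_weighted_reward("joy", "sadness", format_ok=True))
```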
The experiments conducted are comprehensive, utilizing multiple datasets (MELD, IEMOCAP, RAVDESS, SAVEE) to validate the effectiveness of the proposed approach. The results demonstrate significant improvements over baseline models, achieving state-of-the-art performance in SER tasks. The evaluation metrics used (Unweighted Accuracy, Weighted Accuracy, Macro F1 Score) are appropriate for the task, providing a well-rounded assessment of model performance. However, the paper could benefit from more detailed comparisons with additional state-of-the-art methods beyond those mentioned.
The implementation details are sufficiently detailed, including the model architecture, training parameters, and experimental setup. However, the absence of a publicly accessible code repository limits the reproducibility of the results. Providing access to the trained models or code would enhance the paper's impact and allow other researchers to validate the findings.
The paper acknowledges limitations, including the focus solely on the speech modality without exploring multi-modal contexts that could enhance the framework's applicability. Additionally, the computational complexity and inference efficiency issues may hinder real-time applications. These limitations suggest areas for future research and development.
The EMO-RL framework has significant implications for various applications in affective computing, such as mental health assessment, customer service, and human-computer interaction. By improving emotion recognition capabilities in audio-language models, this research paves the way for more emotionally aware AI systems, enhancing user experience and interaction quality.
The steered response power (SRP) method is one of the most popular approaches for acoustic source localization with microphone arrays. It is often based on simplifying acoustic assumptions, such as an omnidirectional sound source in the far field of the microphone array(s), free-field propagation, and spatially uncorrelated noise. In reality, however, there are many acoustic scenarios where such assumptions are violated. This paper proposes a generalization of the conventional SRP method that allows generic acoustic models to be applied for localization with arbitrary microphone constellations. These models may consider, for instance, level differences in distributed microphones, the directivity of sources and receivers, or acoustic shadowing effects. Moreover, measured acoustic transfer functions may also be used as the acoustic model. We show that the delay-and-sum beamforming of the conventional SRP is not optimal for localization with generic acoustic models. To this end, we propose a generalized SRP beamforming criterion that considers generic acoustic models and spatially correlated noise, and derive an optimal SRP beamformer. Furthermore, we propose and analyze appropriate frequency weightings. Unlike the conventional SRP, the proposed method can jointly exploit observed level and time differences between the microphone signals to infer the source location. Realistic simulations of three different microphone setups with speech under various noise conditions indicate that the proposed method can significantly reduce the mean localization error compared to the conventional SRP; in particular, a reduction of more than 60% can be achieved in noisy conditions.
Primary: University of Oldenburg
All Institutions: Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, University of Oldenburg, 26129 Oldenburg, Germany; Audio AI R&D Department. This project has received funding from the SOUNDS European Training Network, a European Union Horizon 2020 research and innovation programme, under Marie Skłodowska-Curie grant agreement No. 956369.
The paper presents a novel approach to sound source localization by generalizing the steered response power method to accommodate complex acoustic environments. This advancement is crucial for enhancing the accuracy and robustness of localization systems in real-world applications.
The paper introduces a generalized steered response power (GSRP) method that enhances traditional SRP techniques by incorporating generic acoustic models and addressing limitations in the conventional SRP method. The authors provide a comprehensive mathematical framework that allows for the inclusion of various acoustic propagation models and noise characteristics, which is a significant advancement over previous methods that relied on oversimplified assumptions. The proposed MVCNR and MPCNR beamformers demonstrate a robust design that optimizes localization accuracy under diverse acoustic conditions. The methodology is well-structured, with clear derivations and justifications for the proposed approaches.
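For intuition, the following numpy sketch evaluates a steered-power map with generic steering models and a noise covariance, using an MVDR-style weighting purely for illustration; it is not claimed to reproduce the paper's MVCNR/MPCNR beamformers or its frequency weightings.

```python
import numpy as np

def generalized_srp_map(X, steering, noise_cov):
    """Illustrative generalized SRP: power of a noise-aware beamformer steered
    with generic acoustic models (which may encode delays, level differences,
    directivity, or measured ATFs).

    X:         (M, F) microphone spectra for one frame (M mics, F bins)
    steering:  (Q, M, F) modeled transfer functions for Q candidate locations
    noise_cov: (F, M, M) noise covariance per frequency bin
    Returns a length-Q array of steered powers; argmax gives the estimate.
    """
    Q, M, F = steering.shape
    power = np.zeros(Q)
    for f in range(F):
        Phi_inv = np.linalg.pinv(noise_cov[f])
        x_f = X[:, f]
        for q in range(Q):
            d = steering[q, :, f]
            # MVDR-style weights for this candidate location and bin
            w = Phi_inv @ d / (np.conj(d) @ Phi_inv @ d + 1e-12)
            power[q] += np.abs(np.conj(w) @ x_f) ** 2
    return power

# Tiny synthetic example: 4 mics, 64 bins, 50 candidate locations.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64)) + 1j * rng.standard_normal((4, 64))
steer = rng.standard_normal((50, 4, 64)) + 1j * rng.standard_normal((50, 4, 64))
noise = np.stack([np.eye(4, dtype=complex) for _ in range(64)])
estimate = int(np.argmax(generalized_srp_map(X, steer, noise)))
```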
The experimental validation is thorough, utilizing realistic simulations across three different microphone setups with varying noise conditions. The results indicate a significant reduction in localization error compared to conventional methods, particularly in challenging acoustic environments. The paper provides detailed descriptions of the experimental setups, including the generation of microphone signals and the evaluation metrics used. The performance of the proposed methods is convincingly demonstrated through comparative analysis against established techniques, showcasing their effectiveness in real-world scenarios.
The paper lacks specific implementation details or code availability, which could hinder reproducibility. While the authors describe the methodologies and experimental setups in detail, the absence of a publicly available code repository or supplementary materials limits the ability of other researchers to replicate the results. Providing a demo or project URL would enhance reproducibility and facilitate further research in this area.
One limitation of the proposed methods is their dependency on accurate acoustic models, which may not always be feasible in practical applications. The performance of the GSRP methods could be sensitive to model inaccuracies or assumptions regarding noise characteristics. Additionally, while the paper addresses various noise conditions, the generalizability of the findings to all acoustic environments remains to be thoroughly validated.
The advancements presented in this paper have significant implications for various applications, including teleconferencing, robotics, and autonomous systems. By improving sound source localization in complex acoustic environments, the proposed methods can enhance the performance of systems that rely on accurate spatial awareness, ultimately leading to better user experiences and more effective technological solutions. The work contributes to the ongoing development of more sophisticated audio processing techniques, which are increasingly relevant in today's technology-driven world.
Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advances in Large Audio-Language Models (LALMs) have brought new insights to TS-ASR. However, significant room for optimization remains for the TS-ASR task within the LALM architecture. While Chain-of-Thought (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, TS-ASR, which requires the model to deeply comprehend speech signals, differentiate various speakers, and handle overlapping utterances, is particularly well-suited to a reasoning-guided approach. Therefore, we propose a novel framework that incorporates CoT and RL training into TS-ASR for performance improvement. A novel CoT dataset for TS-ASR is constructed, and the TS-ASR model is first trained on regular data and then fine-tuned on CoT data. Finally, the model is further trained with RL on selected data to enhance generalized reasoning capabilities. Experimental results demonstrate a significant improvement in TS-ASR performance with CoT and RL training, establishing state-of-the-art performance compared with previous TS-ASR works on comparable datasets.
Primary: †Equal contribution
All Institutions: †Equal contribution
The main contribution of this paper is the novel integration of Chain-of-Thought and Reinforcement Learning into the TS-ASR task, resulting in a significant performance improvement in transcribing target speakers from overlapping speech. This work represents a meaningful advancement in the field of speech recognition, particularly in challenging acoustic environments.
The proposed methodology is innovative, integrating Chain-of-Thought (CoT) and Reinforcement Learning (RL) into the Target Speaker Automatic Speech Recognition (TS-ASR) task. The construction of a novel CoT dataset tailored for TS-ASR is a significant contribution, as it allows for structured reasoning and enhances the model's ability to handle overlapping speech. The three-stage training paradigm—base model training, CoT fine-tuning, and RL refinement—demonstrates a comprehensive approach to improving model performance. However, the methodology could benefit from clearer descriptions of the CoT dataset construction process and the rationale behind certain design choices.
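As a rough illustration of what such a CoT sample could look like as a data structure, consider the hypothetical example below; the field names and reasoning wording are assumptions, since the paper's dataset schema is not reproduced here.

```python
# Illustrative CoT sample for target-speaker ASR (field names are assumptions).
cot_sample = {
    "mixture_audio": "mix_001.wav",           # two overlapping speakers
    "enrollment_audio": "spk_A_enroll.wav",   # short clip identifying the target speaker
    "reasoning": (
        "Step 1: The enrollment clip contains a low-pitched male voice. "
        "Step 2: In the mixture, two voices overlap between 1.2 s and 2.8 s. "
        "Step 3: The low-pitched voice says the phrase about the weather; "
        "the other speaker's words are ignored."
    ),
    "transcript": "it looks like rain later this afternoon",
}
# Stage 1 trains on (mixture, enrollment) -> transcript; stage 2 fine-tunes with
# the reasoning field as intermediate supervision; stage 3 applies RL on selected data.
```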
The experiments are well-structured, comparing the proposed model against both traditional and state-of-the-art LLM-based TS-ASR methods. The reported results show a significant reduction in word error rates (WER), indicating the effectiveness of the proposed framework. The use of ablation studies to assess the impact of different training strategies adds rigor to the evaluation. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance tests, which would strengthen the claims of improvement.
The paper provides a reasonable level of detail regarding the experimental setup, including the datasets used and the training parameters. The availability of the CoT dataset on GitHub enhances reproducibility. However, further details on the model architecture and specific hyperparameters used during training would improve the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets (LibriSpeech, Libri2Mix, Libri3Mix), which may restrict the generalizability of the findings to other domains or datasets. Additionally, while the proposed methods show improvements, the paper does not address potential overfitting issues or the model's performance in highly variable real-world scenarios. The complexity of the model may also pose challenges in terms of computational resources required for training and deployment.
The integration of reasoning capabilities into TS-ASR has significant implications for applications in real-time communication systems, assistive technologies, and multimedia content analysis. By improving the ability to transcribe overlapping speech in complex environments, this research could enhance accessibility for individuals with hearing impairments and improve the accuracy of automated transcription services in various industries.
Deep learning-based Sound Event Localization and Detection (SELD) systems degrade significantly on real-world, long-tailed datasets. Standard regression losses bias learning toward frequent classes, causing rare events to be systematically under-recognized. To address this challenge, we introduce MAGENTA (Magnitude And Geometry-ENhanced Training Approach), a unified loss function that counteracts this bias within a physically interpretable vector space. MAGENTA geometrically decomposes the regression error into radial and angular components, enabling targeted, rarity-aware penalties and strengthened directional modeling. Empirically, MAGENTA substantially improves SELD performance on imbalanced real-world data, providing a principled foundation for a new class of geometry-aware SELD objectives. Code is available at: https://github.com/itsjunwei/MAGENTA_ICASSP
Primary: Nanyang Technological University
All Institutions: School of Electrical and Electronic Engineering, Nanyang Technological University; Smart Nation TRANS Lab. This research is supported by the Singapore Ministry of Education, Academic Research Fund Tier 2, under research grant MOE-T2EP20224-0010.
The main contribution of this paper is the introduction of MAGENTA, a geometry- and rarity-aware loss function for Sound Event Localization and Detection that effectively addresses the challenges posed by long-tailed datasets. This work represents a significant advancement in the field, providing a principled and effective solution that enhances the detection of rare acoustic events while maintaining robust localization performance.
The proposed MAGENTA framework introduces a novel geometric decomposition of regression errors in Sound Event Localization and Detection (SELD), specifically addressing the challenges posed by long-tailed datasets. By separating the error into radial and angular components, the authors provide a targeted approach to mitigate detection timidity for rare classes. This methodology is well-grounded in the physical interpretation of the problem and is a significant advancement over traditional loss functions like Mean Squared Error (MSE), which do not account for the unique geometry of the ACCDOA representation. The modular design of the loss function allows for fine-tuning and flexibility, making it a robust solution for SELD tasks.
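The decomposition can be illustrated with a short PyTorch sketch that splits the ACCDOA regression error into a radial (activity magnitude) term, up-weighted by an assumed per-class rarity weight, and an angular (direction) term; MAGENTA's exact penalty shapes and weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def geometry_aware_seld_loss(pred, target, class_rarity, angular_weight=1.0):
    """Illustrative geometry-aware SELD loss.

    pred, target: (B, T, C, 3) ACCDOA-style vectors (length ~ activity, direction ~ DOA).
    class_rarity: (C,) weights >= 1, larger for rarer classes (assumed given,
                  e.g. from inverse class frequency).
    """
    eps = 1e-8
    pred_norm = pred.norm(dim=-1)                       # (B, T, C)
    tgt_norm = target.norm(dim=-1)

    # Radial term: error in activity magnitude, up-weighted for rare classes.
    radial = (pred_norm - tgt_norm) ** 2 * class_rarity.view(1, 1, -1)

    # Angular term: 1 - cosine similarity of directions, counted only where
    # the target class is active (non-zero vector).
    cos = F.cosine_similarity(pred, target, dim=-1, eps=eps)
    active = (tgt_norm > eps).float()
    angular = (1.0 - cos) * active

    return (radial + angular_weight * angular).mean()

# Usage with random tensors: 2 clips, 10 frames, 13 classes.
pred = torch.randn(2, 10, 13, 3)
tgt = torch.randn(2, 10, 13, 3)
rarity = torch.ones(13)
rarity[5] = 4.0   # pretend class 5 is rare
loss = geometry_aware_seld_loss(pred, tgt, rarity)
```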
The experiments are rigorously designed, utilizing the STARSS23 dataset, which is representative of real-world scenarios and characterized by a significant class imbalance. The authors provide a comprehensive evaluation of various loss function configurations, demonstrating the effectiveness of MAGENTA through empirical results that show substantial improvements in SELD performance metrics. The results are well-presented, with clear comparisons against baseline methods, and the statistical significance of improvements is implied through the structured experimentation. However, the paper lacks detailed statistical analysis of the results, which could strengthen the claims made.
The paper includes sufficient implementation details, including the architecture used (SELDNet), training parameters, and evaluation metrics. The availability of the code on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings. However, the paper could benefit from additional documentation or examples on how to run the code effectively.
One limitation is the reliance on a single dataset (STARSS23) for evaluation, which may not fully capture the diversity of real-world acoustic environments. Additionally, while the proposed method shows improvements, the potential for increased false positives due to heightened sensitivity in rare class detection is noted but not quantitatively analyzed. The authors also mention future work on adaptive priors, indicating that the current approach may not fully address all aspects of class imbalance.
The MAGENTA framework has significant implications for applications in audio surveillance, smart environments, and assistive technologies, where accurate sound event detection and localization are critical. By improving the recognition of rare sound events, this work could enhance situational awareness in various domains, including public safety and human-computer interaction. The methodology also sets a precedent for future research in long-tailed learning and geometry-aware training approaches.
Personalizing Automatic Speech Recognition (ASR) for dysarthric speech is crucial but challenging due to the need to train and store individual user adapters. We propose a hybrid meta-training method for a single model that excels at zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). Measuring Word Error Rate (WER) on state-of-the-art subsets, the model achieves 13.9% WER on Euphonia, which surpasses speaker-independent baselines (17.5% WER) and rivals user-specific personalized models. On SAP Test 1, its 5.3% WER significantly outperforms the 8% achieved by even personalized adapters. We also demonstrate the importance of example curation, where an oracle text-similarity method shows that 5 curated examples can achieve performance similar to 19 randomly selected ones, highlighting a key area for future efficiency gains. Finally, we conduct data ablations to measure the data efficiency of this approach. This work presents a practical, scalable, and personalized solution.
Primary: Google DeepMind
All Institutions: Google DeepMind
This paper presents a novel approach to dysarthric speech recognition through a hybrid meta-learning strategy, significantly advancing the state-of-the-art in personalized ASR systems. The methodology is innovative, and the results demonstrate substantial improvements, positioning the work as a meaningful contribution to the field of machine learning and accessibility.
The proposed methodology introduces a hybrid meta-training strategy that combines zero-shot and few-shot learning for dysarthric speech recognition. By utilizing in-context learning (ICL), the authors effectively eliminate the need for per-user model training, which is a significant advancement in the field. The use of a single model that can adapt to various users on-the-fly is innovative and practical, addressing the complexities of traditional ASR personalization methods. The exploration of example curation methods adds depth to the methodology, showcasing a thoughtful approach to data efficiency.
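The oracle text-similarity curation can be pictured with a small sketch: rank candidate examples by TF-IDF cosine similarity between their reference transcripts and the test utterance's reference, then keep the top k for the in-context prompt. The similarity measure, field names, and prompt handling are assumptions, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def curate_examples(test_reference, candidates, k=5):
    """Return the k candidate examples whose reference transcripts are most
    similar to the test utterance's reference (an oracle setting, since it
    uses ground-truth text). `candidates` is a list of dicts with at least a
    "text" field; audio paths and other fields are carried along untouched."""
    texts = [c["text"] for c in candidates]
    vec = TfidfVectorizer().fit(texts + [test_reference])
    sims = cosine_similarity(vec.transform([test_reference]),
                             vec.transform(texts))[0]
    ranked = sorted(zip(sims, range(len(candidates))), reverse=True)
    return [candidates[i] for _, i in ranked[:k]]

# Hypothetical usage: pick a few curated examples instead of many random ones.
pool = [{"text": "turn the lights off in the kitchen", "audio": "u1.wav"},
        {"text": "set a timer for ten minutes", "audio": "u2.wav"},
        {"text": "please turn off the kitchen lights", "audio": "u3.wav"}]
examples = curate_examples("turn off the lights in the kitchen", pool, k=2)
```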
The experiments are well-structured, utilizing two substantial datasets (Euphonia and SAP) to validate the proposed method. The results demonstrate a clear improvement in Word Error Rate (WER) compared to existing models, establishing new state-of-the-art benchmarks. The comparative analysis of different training strategies provides strong evidence for the effectiveness of the mixed-objective approach. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component in the model.
The paper provides a clear description of the datasets and evaluation metrics used, which is essential for reproducibility. However, the lack of a publicly available code repository or demo limits the ability for others to replicate the results precisely. The detailed methodology does allow for a reasonable attempt at reproduction, but the absence of shared resources is a drawback.
While the paper presents a robust solution, it does not address potential challenges in real-world deployment, such as the variability of dysarthric speech across different users and contexts. Additionally, the reliance on a large foundational model (Gemini 2.5 Flash) may limit accessibility for researchers without similar resources. The exploration of example curation methods, while promising, also raises questions about the practicality of implementing these strategies in real-time applications.
This research has significant implications for accessibility in technology, particularly for individuals with speech impairments. By improving ASR systems for dysarthric speech, the work can enhance communication tools for affected individuals, fostering greater inclusion in various domains. The findings could also inspire further research into personalized AI systems across different modalities, potentially benefiting a wider range of users with diverse needs.
While existing speech audio codecs designed for compression exploit limited forms of temporal redundancy and allow for multi-scale representations, they tend to represent all features of audio in the same way. In contrast, generative voice models designed for text-to-speech and voice transfer tasks have recently proved effective at factorizing audio signals into high-level semantic representations of fundamentally distinct features. In this paper, we leverage such representations in a novel semantic communications approach to achieve lower bitrates without sacrificing perceptual quality or suitability for specific downstream tasks. Our technique matches or outperforms existing audio codecs on transcription, sentiment analysis, and speaker verification when encoding at 2-4x lower bitrate -- notably surpassing Encodec in perceptual quality and speaker verification while using up to 4x less bitrate.
Primary: & Technology Research
All Institutions: & Technology Research
The main contribution of this paper is the introduction of a novel approach to semantic audio compression that significantly reduces bitrate while preserving perceptual quality and task-relevant information. This research represents a meaningful advancement in the field of audio processing and machine learning, offering a new direction for future exploration in efficient communication technologies.
The paper presents a novel semantic compression approach that leverages generative voice models to factor audio signals into high-level semantic representations. This method is innovative as it focuses on preserving semantic information relevant to specific downstream tasks rather than encoding all audio features uniformly. The approach is well-structured, utilizing a combination of content-style tokens and timbre samples to achieve lower bitrates while maintaining quality. However, the methodology could benefit from clearer explanations of the encoding schemes and the rationale behind the choices made, particularly regarding the use of Vevo and the auxiliary compression techniques.
The experiments are comprehensive, utilizing the VoxCeleb1 dataset and evaluating the proposed method against traditional and neural codecs across multiple downstream tasks, including transcription, sentiment analysis, and speaker verification. The results demonstrate that the proposed method consistently outperforms existing codecs at lower bitrates, which is a significant achievement. However, the paper lacks detailed statistical analyses of the results, such as confidence intervals or significance testing, which would strengthen the claims made about performance improvements.
The paper does not provide sufficient details on the implementation of the proposed methods, such as specific hyperparameters, training procedures, or the exact architecture of the models used. This lack of detail may hinder reproducibility for other researchers attempting to replicate the results. Including code or a detailed supplementary material section would greatly enhance reproducibility.
The paper acknowledges several limitations, including the inability to handle overlapping speakers and the potential for latency due to the timbre encoding approach. Additionally, the reliance on a single dataset (VoxCeleb1) may limit the generalizability of the findings. The authors also note that errors in timbre transmission can lead to permanent inaccuracies in voice reconstruction, which is a critical concern for real-time applications.
The proposed method has significant implications for ultra-low bandwidth voice communication, particularly in applications where bandwidth is constrained, such as remote areas or during emergencies. The ability to maintain high-quality audio while reducing bitrate could enhance communication technologies in various fields, including telecommunication, assistive technologies, and real-time translation services. The focus on semantic preservation aligns with ongoing trends in AI and machine learning, making this research relevant to future advancements in the field.
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.
This paper presents a phonetic perspective on adversarial attacks in audio processing, revealing how subtle perturbations can mislead both speech recognition and speaker verification systems. The innovative approach and thorough experimental evaluation contribute valuable insights to the field of machine learning and speech technology, emphasizing the need for phonetic-aware defenses.
The methodology is well-structured, employing a white-box attack approach on the DeepSpeech model to generate adversarial examples. The paper effectively formulates the problem of adversarial attacks at both the transcription and speaker identity levels, providing a clear mathematical framework for the attack success criteria. The phonetic analysis of perturbations is a novel angle that enriches the understanding of how adversarial attacks can exploit linguistic features. However, the paper could benefit from a more detailed explanation of the optimization process and the specific metrics used to quantify phonetic confusions.
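For context, a generic PGD-style targeted attack on a CTC-based ASR model looks roughly like the sketch below; the model interface, tokenization, and hyperparameters are placeholders, and this is not claimed to be the authors' exact optimization against DeepSpeech.

```python
import torch

def targeted_ctc_attack(model, waveform, target_ids, eps=0.01, alpha=0.001, steps=200):
    """Projected gradient descent toward a target transcription under an
    L-infinity budget. `model(wave)` is assumed to return per-frame log
    probabilities of shape (T, 1, V) suitable for torch.nn.CTCLoss."""
    ctc = torch.nn.CTCLoss(blank=0)
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        log_probs = model(waveform + delta)                   # (T, 1, V)
        input_len = torch.tensor([log_probs.shape[0]])
        target_len = torch.tensor([target_ids.shape[0]])
        loss = ctc(log_probs, target_ids.unsqueeze(0), input_len, target_len)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                # move toward the target transcript
            delta.clamp_(-eps, eps)                           # stay inside the L-inf budget
            delta.copy_((waveform + delta).clamp(-1.0, 1.0) - waveform)  # keep audio in range
        delta.grad.zero_()
    return (waveform + delta).detach()
```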
The experiments are comprehensive, utilizing a diverse dataset (VCTK corpus) and a variety of target phrases that cover a wide range of phonetic structures. The results clearly demonstrate the dual impact of adversarial perturbations on both transcription accuracy and speaker identity drift. The use of two state-of-the-art embedding models for speaker verification adds robustness to the findings. However, the paper could improve by including more detailed statistical analysis of the results and addressing potential variability in the experimental setup.
The paper provides a GitHub repository for additional figures and visualizations, which is a positive aspect for reproducibility. However, it lacks detailed implementation instructions or code snippets within the text that would facilitate easier replication of the experiments by other researchers.
The study is limited to white-box attacks in controlled environments, which may not fully represent real-world scenarios where adversarial attacks can be more complex due to environmental factors and black-box models. Additionally, the paper does not address the potential implications of over-the-air effects or the performance of defenses against such attacks.
The findings of this work have significant implications for the security of ASR and speaker verification systems, highlighting vulnerabilities that could be exploited in real-world applications. The focus on phonetic features in adversarial attacks opens new avenues for research in developing more robust speech technologies and defenses against adversarial manipulation. The insights gained from this study could inform the design of future systems that are more resilient to such threats.
The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values than well-established baselines for Mel vocoding that are not streaming-capable, including HiFi-GAN.
Primary: German Federal Ministry of Research, Technology and Space (BMFTR)
All Institutions: We acknowledge funding by the German Federal Ministry of Research, Technology and Space (BMFTR) under grant agreement No. 01IS24072A (COMFORT).
The main contribution of this paper is the introduction of MelFlow, a streaming generative Mel vocoder that achieves real-time performance with significantly improved audio quality metrics compared to existing methods. This work represents a meaningful advancement in the field of audio processing, particularly for applications requiring low-latency speech synthesis.
The paper introduces MelFlow, a novel streaming-capable generative Mel vocoder that leverages generative flow matching and builds on previous work in diffusion-based STFT phase retrieval. The methodology is well-structured, combining established techniques with new innovations to achieve real-time performance. The authors effectively define algorithmic and total latency, providing a clear framework for their streaming approach. The use of causal convolutional neural networks and an efficient caching mechanism for inference is particularly noteworthy, as it allows for real-time processing without compromising output quality. However, the paper could benefit from a more detailed explanation of the iterative inference scheme and how it compares to traditional methods.
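The caching mechanism can be illustrated with a minimal streaming causal convolution that stores the last kernel_size - 1 frames between chunks; MelFlow's actual architecture and cache layout are not reproduced here.

```python
import torch
import torch.nn as nn

class StreamingCausalConv1d(nn.Module):
    """Causal 1-D convolution with an internal cache of the last
    (kernel_size - 1) frames, so chunks can be processed one at a time."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.context = kernel_size - 1
        self.cache = None

    def forward(self, chunk):                      # chunk: (B, C, T_chunk)
        if self.cache is None:
            self.cache = chunk.new_zeros(chunk.shape[0], chunk.shape[1], self.context)
        x = torch.cat([self.cache, chunk], dim=-1)  # prepend cached left context
        self.cache = x[..., -self.context:].detach()
        return self.conv(x)                         # output keeps T_chunk frames

# Process a 20-frame signal in two 10-frame chunks.
layer = StreamingCausalConv1d(channels=8, kernel_size=3)
x = torch.randn(1, 8, 20)
streamed = torch.cat([layer(x[..., :10]), layer(x[..., 10:])], dim=-1)
```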
The experimental section is robust, demonstrating the effectiveness of MelFlow against established baselines such as HiFi-GAN. The authors provide comprehensive metrics (PESQ, SI-SDR, etc.) to evaluate performance, showing significant improvements in quality metrics while maintaining real-time capabilities. The use of multiple datasets (EARS-WHAM v2 and LibriTTS) adds credibility to the results. However, the experiments could be strengthened by including more diverse datasets and additional qualitative assessments of audio quality.
The paper mentions plans to provide a public code repository and model checkpoints, which is a positive step towards reproducibility. However, specific implementation details, such as hyperparameter settings and training configurations, could be more explicitly stated to facilitate replication by other researchers. The lack of a demo URL also limits immediate accessibility for interested parties.
One limitation is the potential trade-off between the number of inference steps and the quality of output, as indicated by the results showing differences between N=5 and N=25. Additionally, while the paper claims substantial improvements over non-streaming methods, it does not fully explore the implications of these improvements in practical applications. The focus on a single sampling rate (16 kHz) may also limit the generalizability of the findings.
The development of a real-time streaming Mel vocoder has significant implications for text-to-speech systems and other speech processing applications. By enabling more natural and interactive communication, this research could enhance user experiences in various domains, including virtual assistants, gaming, and telecommunication. The methodology could also inspire further innovations in real-time audio processing and other generative models.
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation model trained on 13,000 hours of multilingual child-centered long-form recordings spanning over 40 languages. We evaluate BabyHuBERT on speaker segmentation, identifying when target children speak versus female adults, male adults, or other children -- a fundamental preprocessing step for analyzing naturalistic language experiences. BabyHuBERT achieves F1-scores from 52.1% to 74.4% across six diverse datasets, consistently outperforming W2V2-LL4300 (trained on English long-forms) and standard HuBERT (trained on clean adult speech). Notable improvements include 13.2 absolute F1 points over HuBERT on Vanuatu and 15.9 points on Solomon Islands corpora, demonstrating effectiveness on underrepresented languages. By sharing code and models, BabyHuBERT serves as a foundation model for child speech research, enabling fine-tuning on diverse downstream tasks.
BabyHuBERT introduces a pioneering self-supervised speech representation model tailored for multilingual child-centered recordings, demonstrating substantial improvements in speaker segmentation tasks. The comprehensive methodology and significant technical contributions position this work as a valuable asset for advancing research in child speech processing and language development.
The methodology proposed in BabyHuBERT is robust and innovative, leveraging a large-scale multilingual dataset specifically tailored for child-centered recordings. The adoption of HuBERT's masked prediction approach is well-justified, considering the inherent noise in child-centered audio. The two-iteration pre-training strategy, utilizing features from different layers of WavLM, demonstrates a thoughtful adaptation of existing models to the unique challenges of the task. The fine-tuning strategy is also comprehensive, employing both frozen feature extraction and full fine-tuning to evaluate the model's performance effectively.
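The frozen-feature fine-tuning setting can be sketched as a linear probe over frame-level features from a frozen encoder; the encoder below is a placeholder module rather than the released BabyHuBERT weights, and the label set follows the four speaker categories named above.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4  # target child, other child, female adult, male adult

class FrameClassifier(nn.Module):
    """Linear probe on top of a frozen speech encoder. `encoder(wave)` is a
    placeholder assumed to return frame features of shape (B, T, D)."""
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False               # frozen feature extraction
        self.head = nn.Linear(feat_dim, NUM_CLASSES)

    def forward(self, wave):
        with torch.no_grad():
            feats = self.encoder(wave)            # (B, T, D)
        return self.head(feats)                   # (B, T, NUM_CLASSES) frame logits

def train_step(model, optimizer, wave, labels):
    """One step of frame-level cross-entropy training; labels: (B, T) class ids."""
    logits = model(wave)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```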
The experiments conducted are thorough, with a clear focus on speaker segmentation as the primary task. The paper presents a well-structured evaluation across multiple datasets, showcasing substantial improvements over existing models like W2V2-LL4300 and standard HuBERT. The reported F1-scores highlight the model's effectiveness, particularly in underrepresented languages, which is a significant contribution to the field. However, the paper could benefit from additional details on the datasets used for evaluation and comparisons against more diverse baselines.
The paper provides adequate implementation details, including training procedures, hyperparameters, and dataset partitioning strategies. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should prioritize sharing code and trained models to facilitate further research and application of BabyHuBERT.
The paper acknowledges limitations related to the computational expense of self-supervised pre-training, which restricts the exploration of hyperparameter configurations. Additionally, the performance on underrepresented classes suggests that further improvements are necessary. The reliance on human annotators for topline comparisons introduces variability that could affect the interpretation of results.
BabyHuBERT has the potential to significantly advance research in child language acquisition by providing a robust tool for analyzing naturalistic language experiences. Its multilingual focus addresses a critical gap in existing speech models, promoting inclusivity in language development research. The model's ability to improve speaker segmentation in diverse acoustic environments can facilitate better understanding of child interactions and language learning processes.
Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed due to the complexity and cost of overnight polysomnography. Acoustic-based screening provides a scalable alternative, yet performance is limited by environmental noise and the lack of physiological context. Respiratory effort is a key signal used in clinical scoring of OSA events, but current approaches require additional contact sensors that reduce scalability and patient comfort. This paper presents the first study to estimate respiratory effort directly from nocturnal audio, enabling physiological context to be recovered from sound alone. We propose a latent-space fusion framework that integrates the estimated effort embeddings with acoustic features for OSA detection. Using a dataset of 157 nights from 103 participants recorded in home environments, our respiratory effort estimator achieves a concordance correlation coefficient of 0.48, capturing meaningful respiratory dynamics. Fusing effort and audio improves sensitivity and AUC over audio-only baselines, especially at low apnoea-hypopnoea index thresholds. The proposed approach requires only smartphone audio at test time, which enables sensor-free, scalable, and longitudinal OSA monitoring.
Primary: University of Sheffield
All Institutions: School of Computer Science, University of Sheffield. The authors would like to thank Passion For Life Healthcare (UK) Ltd for providing sleep data. The study was partially funded by MRC IAA grant 182731, GOSH BRC grant 187217, and Innovate UK Open Grant 26767.
This paper presents a novel approach to estimating respiratory effort from nocturnal audio, significantly advancing the field of acoustic-based sleep apnoea detection. The integration of respiratory dynamics into OSA screening offers a promising direction for non-invasive monitoring, though challenges remain in reproducibility and model performance in noisy environments.
The proposed methodology introduces a latent-space fusion framework that innovatively combines respiratory effort embeddings inferred from nocturnal audio with acoustic features for obstructive sleep apnoea (OSA) detection. This approach is commendable as it addresses the limitations of existing methods that rely on additional sensors, thus enhancing scalability and patient comfort. The use of a CNN-LSTM architecture to extract features from audio signals is appropriate, and the decision to use the concordance correlation coefficient (CCC) as the optimization objective is well-justified, given its sensitivity to both correlation and bias. However, the methodology could benefit from more detailed explanations of the model training process and hyperparameter tuning.
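Since the concordance correlation coefficient is the optimization objective, a standard PyTorch formulation of a CCC-based loss (1 - CCC) is sketched below; the paper's exact loss wiring and batching may differ.

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - concordance correlation coefficient between two 1-D signals.
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1.0 - ccc

# A perfectly concordant prediction gives a loss near 0.
t = torch.linspace(0, 1, 100)
print(ccc_loss(t, t).item())
```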
The experiments are robust, utilizing a dataset of 157 nights from 103 participants, which is a significant sample size for a study of this nature. The results demonstrate that the respiratory effort estimator achieves a CCC of 0.478, indicating a meaningful relationship between audio and respiratory dynamics. The performance metrics for OSA severity classification show that the proposed model outperforms audio-only baselines, particularly at lower AHI thresholds, which is clinically relevant. However, the paper could improve by providing more comprehensive comparisons with existing state-of-the-art methods and discussing the implications of the performance metrics in a clinical context.
The paper lacks sufficient details regarding the implementation and code availability, which are crucial for reproducibility. While it describes the model architecture and training procedures, it does not provide information on the specific datasets used for training and validation splits, nor does it mention whether the code or trained models will be made publicly available. This limits the ability of other researchers to replicate the study.
Several limitations are noted, including the challenges posed by environmental noise and the variability of smartphone recordings, which can impact the accuracy of the respiratory effort predictions. Additionally, the temporal misalignment between audio and respiratory signals may lead to lower correlation values. The CCC of 0.478, while indicative of some predictive capability, suggests that the model may still struggle with certain segments of audio. The paper also does not address the potential for overfitting given the relatively small dataset size compared to the complexity of the model.
The implications of this research are significant, as it presents a non-invasive, scalable method for OSA screening that could improve early detection and management of the condition. The ability to monitor respiratory effort using only smartphone audio could lead to widespread adoption in home settings, reducing the burden on healthcare systems and improving patient outcomes. Future work could explore further enhancements to the model and its application in diverse populations.
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose {\epsilon}ar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss using its derivatives--Instantaneous Frequency and Group Delay--for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show {\epsilon}ar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and the spatial characteristics.
Primary: ar-LAB
All Institutions: ar-LAB
The paper presents a novel VAE architecture, epsilonar-VAE, that significantly enhances high-fidelity music reconstruction by integrating perceptual weighting and innovative loss functions. This comprehensive analysis highlights the technical contributions and potential impact on the audio processing field, showcasing a meaningful advancement in machine learning applications for audio.
The proposed methodology introduces several innovative components to the VAE architecture tailored for audio signal reconstruction. The integration of a K-weighting perceptual filter is a significant enhancement, aligning the model's objectives with psychoacoustic principles. The introduction of novel phase losses (Correlation Loss and Phase Loss) addresses critical aspects of audio fidelity, particularly in stereo coherence and transient clarity. The spectral supervision paradigm, which separates magnitude and phase supervision, is a thoughtful approach that reflects an understanding of the complexities involved in audio reconstruction. Overall, the methodology is well-structured and presents a comprehensive approach to improving high-fidelity audio generation.
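One plausible way to realize the phase supervision is sketched below: instantaneous frequency as a wrapped phase difference along time and group delay as a wrapped phase difference along frequency, compared with an L1 penalty; the STFT settings and the paper's exact formulation, weighting, and Mid/Side handling are not reproduced.

```python
import torch

def wrapped_diff(phase, dim):
    """Finite difference of a phase tensor, wrapped to (-pi, pi]."""
    d = torch.diff(phase, dim=dim)
    return torch.atan2(torch.sin(d), torch.cos(d))

def if_gd_phase_loss(pred_wave, ref_wave, n_fft=1024, hop=256):
    """L1 loss on instantaneous frequency (time axis) and group delay
    (frequency axis) derived from STFT phases of predicted and reference audio."""
    win = torch.hann_window(n_fft, device=pred_wave.device)
    P = torch.stft(pred_wave, n_fft, hop, window=win, return_complex=True)
    R = torch.stft(ref_wave, n_fft, hop, window=win, return_complex=True)
    phase_p, phase_r = torch.angle(P), torch.angle(R)

    def wrapped_l1(a, b):
        e = a - b
        return torch.atan2(torch.sin(e), torch.cos(e)).abs().mean()

    if_loss = wrapped_l1(wrapped_diff(phase_p, -1), wrapped_diff(phase_r, -1))
    gd_loss = wrapped_l1(wrapped_diff(phase_p, -2), wrapped_diff(phase_r, -2))
    return if_loss + gd_loss

# Usage on a random mono example of shape (B, samples).
x = torch.randn(1, 16000)
y = x + 0.01 * torch.randn_like(x)
loss = if_gd_phase_loss(y, x)
```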
The experimental setup is robust, utilizing a combination of public datasets and a proprietary in-house dataset, which strengthens the validity of the results. The paper provides detailed metrics for evaluating performance, including novel metrics for phase accuracy, which adds depth to the evaluation process. The comparison against leading models such as EnCodec and DAC demonstrates the effectiveness of the proposed approach. However, the results could benefit from additional qualitative assessments, such as listening tests, to complement the quantitative metrics.
The paper includes a clear description of the training process, model architecture, and loss functions, which aids in reproducibility. The availability of model weights and code on the provided demo URL is a positive aspect that encourages further exploration and validation by the research community. However, the paper could enhance reproducibility by providing more detailed hyperparameter settings and training configurations.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of audio signals encountered in real-world applications. Additionally, while the model shows improvements in reconstruction quality, the paper does not address potential computational costs or efficiency concerns associated with the proposed architecture. The focus on perceptual aspects may also overlook other factors influencing audio quality.
The advancements presented in this paper have significant implications for audio engineering, music production, and machine learning applications in audio synthesis. By improving the fidelity of audio reconstruction, this work could enhance various applications, including music streaming, audio restoration, and virtual reality audio experiences. The open-source nature of the model promotes accessibility and encourages further research in the field.
Multimodal acoustic event classification plays a key role in audio-visual systems. Although combining audio and visual signals improves recognition, it is still difficult to align them over time and to reduce the effect of noise across modalities. Existing methods often treat audio and visual streams separately, fusing features later with contrastive or mutual information objectives. Recent advances explore multimodal graph learning, but most fail to distinguish between intra- and inter-modal temporal dependencies. To address this, we propose Temporally Heterogeneous Graph-based Contrastive Learning (THGCL). Our framework constructs a temporal graph for each event, where audio and video segments form nodes and their temporal links form edges. We introduce Gaussian processes for intra-modal smoothness, Hawkes processes for inter-modal decay, and contrastive learning to capture fine-grained relationships. Experiments on AudioSet show that THGCL achieves state-of-the-art performance.
The main contribution of this paper is the development of a novel framework, THGCL, that effectively addresses the challenges of temporal alignment and noise reduction in multimodal acoustic event classification through a sophisticated graph-based approach. This work represents a meaningful advancement in the field of audio-visual machine learning, demonstrating both theoretical innovation and practical applicability.
The proposed Temporally Heterogeneous Graph-based Contrastive Learning (THGCL) framework innovatively constructs a temporal heterogeneous graph to model both intra- and inter-modal dependencies in acoustic event classification. The integration of Gaussian processes and Hawkes processes to manage temporal relationships is a significant methodological advancement. The contrastive learning component effectively enhances the robustness of the model against noise, showcasing a thoughtful design that addresses key challenges in multimodal learning.
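The two temporal kernels can be illustrated very simply: a Gaussian kernel over time lags for intra-modal edges and an exponentially decaying, Hawkes-style kernel for inter-modal edges; the actual THGCL graph construction and any learned parameters are not reproduced.

```python
import numpy as np

def intra_modal_weight(t_i, t_j, sigma=1.0):
    """Gaussian kernel: nearby segments of the same modality are strongly tied."""
    return np.exp(-((t_i - t_j) ** 2) / (2 * sigma ** 2))

def inter_modal_weight(t_audio, t_video, beta=1.0):
    """Hawkes-style kernel: cross-modal influence decays with the time gap."""
    return np.exp(-beta * abs(t_audio - t_video))

def build_edge_weights(audio_times, video_times, sigma=1.0, beta=1.0):
    """Return (intra_audio, intra_video, inter) weight matrices for one event."""
    A = np.array([[intra_modal_weight(a, b, sigma) for b in audio_times] for a in audio_times])
    V = np.array([[intra_modal_weight(a, b, sigma) for b in video_times] for a in video_times])
    X = np.array([[inter_modal_weight(a, v, beta) for v in video_times] for a in audio_times])
    return A, V, X

# Example: 1-second segments of a 4-second clip in both modalities.
A, V, X = build_edge_weights(audio_times=[0, 1, 2, 3], video_times=[0, 1, 2, 3])
```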
The experiments conducted on the AudioSet dataset demonstrate a thorough evaluation of the proposed method against existing state-of-the-art approaches. The use of mean average precision (mAP) and area under the ROC curve (AUC) as evaluation metrics is appropriate, and the results indicate a clear performance advantage of THGCL. However, further comparisons with more diverse datasets could strengthen the validation of the method's generalizability.
The paper provides sufficient implementation details, including the architecture of the Temporal Heterogeneous Graph Network (THGN), hyperparameters, and training procedures, which are essential for reproducibility. The availability of the code repository on GitHub further supports this aspect.
While the proposed method shows promise, it may be limited by its reliance on the quality of the input features from the audio and video modalities. Additionally, the complexity of the model could pose challenges in real-time applications where computational efficiency is critical. The paper could also benefit from a more extensive discussion on potential biases in the dataset used.
The advancements in multimodal acoustic event classification have significant implications for various applications, including surveillance systems, smart environments, and human-computer interaction. By improving the robustness of audio-visual systems, this research could enhance the reliability of automated systems in real-world scenarios.
In this paper, we show that discrete optimal transport (DOT) is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Evaluated on ASVspoof2019 and ASVspoof5 with AASIST baselines, DOT yields consistently high equal error rate (EER) across datasets and remains competitive after CM fine-tuning, outperforming several conventional attacks in cross-dataset transfer. Ablation analysis highlights the practical impact of vocoder overlap. Results indicate that distribution-level alignment is a powerful and stable attack surface for deployed CMs.
Primary: University of Rochester
All Institutions: University of Rochester. Submitted to ICASSP 2026; a demonstration webpage is TBA.
The main contribution of this paper is the introduction of discrete optimal transport as a powerful method for generating adversarial audio attacks against anti-spoofing systems, demonstrating both theoretical and practical advancements in the field. The comprehensive analysis of the methodology and results highlights the significance of distribution-level alignment in enhancing the effectiveness of audio adversarial attacks.
The paper introduces a novel approach using discrete optimal transport (DOT) as a black-box adversarial attack against audio anti-spoofing countermeasures, which is a significant advancement in the field of audio security. The methodology is well-structured, detailing the process of aligning frame-level WavLM embeddings to a bona fide pool using entropic OT and a top-$k$ barycentric projection. The use of a neural vocoder for waveform reconstruction is appropriate and adds to the realism of the generated audio. The authors provide a clear theoretical foundation for their approach, although the paper could benefit from a more detailed explanation of the entropic regularization and its implications.
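A compact numpy sketch of the alignment step as described: entropic OT (Sinkhorn iterations) between generated-frame embeddings and an unpaired bona fide pool, followed by a top-k barycentric projection of each frame onto its most strongly coupled bona fide embeddings. Vocoder decoding is omitted, and the regularization, k, and cost choices are assumptions.

```python
import numpy as np

def sinkhorn(C, reg=0.1, iters=200):
    """Entropic OT plan between two uniform empirical distributions,
    given a cost matrix C of shape (n, m)."""
    n, m = C.shape
    K = np.exp(-C / reg)
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v + 1e-12)
        v = b / (K.T @ u + 1e-12)
    return u[:, None] * K * v[None, :]            # transport plan (n x m)

def topk_barycentric_projection(gen_emb, bona_emb, reg=0.1, k=8):
    """Map each generated frame embedding to a weighted average of its
    top-k coupled bona fide embeddings."""
    C = ((gen_emb[:, None, :] - bona_emb[None, :, :]) ** 2).sum(-1)  # squared L2 costs
    P = sinkhorn(C, reg)
    out = np.zeros_like(gen_emb)
    for i in range(gen_emb.shape[0]):
        idx = np.argsort(P[i])[-k:]               # top-k coupling weights
        w = P[i, idx] / (P[i, idx].sum() + 1e-12)
        out[i] = w @ bona_emb[idx]
    return out

# Toy usage: 50 generated frames vs. a pool of 200 bona fide frames, dim 32.
rng = np.random.default_rng(0)
aligned = topk_barycentric_projection(rng.standard_normal((50, 32)),
                                      rng.standard_normal((200, 32)))
```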
The experiments are robust, utilizing established datasets (ASVspoof2019 and ASVspoof5) and employing AASIST baselines for evaluation. The results demonstrate that the DOT attack consistently achieves high equal error rates (EER), indicating its effectiveness across different datasets and after countermeasure fine-tuning. The ablation analysis regarding vocoder overlap is particularly insightful, showcasing the practical implications of the attack. However, the paper lacks a comprehensive comparison with more recent adversarial attack methodologies, which could further contextualize its contributions.
While the paper outlines the experimental setup and methodologies, it does not provide sufficient details for full reproducibility. Key parameters, such as the specific configurations for the neural vocoder and the exact implementation of the DOT algorithm, are not fully disclosed. Additionally, the absence of a code repository or demonstration URL limits the ability for other researchers to replicate the findings.
The primary limitation of the study is its reliance on specific datasets, which may not generalize to all audio environments or countermeasures. The effectiveness of the DOT attack may vary with different types of audio data or countermeasures not covered in the experiments. Furthermore, the paper does not address potential defenses against the proposed attack, which is critical for understanding its practical implications.
The findings of this research have significant implications for the field of audio security, particularly in enhancing the robustness of anti-spoofing systems. The methodology could be applied to improve the security of voice recognition systems in various applications, including banking, personal assistants, and security systems. However, the potential for misuse in creating more sophisticated adversarial attacks raises ethical considerations that need to be addressed.
Multistep inference is a bottleneck for real-time generative speech enhancement because flow- and diffusion-based systems learn an instantaneous velocity field and therefore rely on iterative ordinary differential equation (ODE) solvers. We introduce MeanFlowSE, a conditional generative model that learns the average velocity over finite intervals along a trajectory. Using a Jacobian-vector product (JVP) to instantiate the MeanFlow identity, we derive a local training objective that directly supervises finite-interval displacement while remaining consistent with the instantaneous-field constraint on the diagonal. At inference, MeanFlowSE performs single-step generation via a backward-in-time displacement, removing the need for multistep solvers; an optional few-step variant offers additional refinement. On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines. The method requires no knowledge distillation or external teachers, providing an efficient, high-fidelity framework for real-time generative speech enhancement. The proposed method is open-sourced at https://github.com/liduojia1/MeanFlowSE.
The main contribution of this paper is the introduction of MeanFlowSE, a novel framework for real-time generative speech enhancement that achieves high-quality results with significantly reduced computational costs. This work represents a meaningful advancement in the field of audio processing, particularly in the context of generative models for speech enhancement.
The proposed MeanFlowSE model innovatively addresses the bottleneck of multistep inference in generative speech enhancement by introducing a framework that learns an average velocity field for finite-interval displacement. This approach, leveraging the MeanFlow identity and Jacobian-vector product, allows for single-step inference, which is a significant advancement over traditional methods that rely on iterative ODE solvers. The methodology is well-structured, with a clear training objective that aligns with the instantaneous-field constraint, and the use of a backward-in-time displacement during inference is particularly noteworthy for its efficiency.
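As a rough illustration of how the MeanFlow identity can be instantiated with a Jacobian-vector product, the sketch below computes a finite-interval regression target for a hypothetical average-velocity network `u_net(z, r, t)`. The shapes, stop-gradient placement, and the conditional velocity `v_t` follow common flow-matching conventions and are assumptions, not the authors' exact objective.

```python
import torch
from torch.func import jvp

def meanflow_target(u_net, z_t, r, t, v_t):
    """MeanFlow regression target (sketch). z_t: (B, D); r, t: (B, 1); v_t: (B, D)
    is the instantaneous velocity at (z_t, t) under the chosen flow path. The
    total time-derivative of u along the trajectory comes from a single JVP."""
    def u_fn(z, t_):
        return u_net(z, r, t_)                     # r is held fixed
    tangents = (v_t, torch.ones_like(t))           # dz/dt = v, dt/dt = 1
    _, du_dt = jvp(u_fn, (z_t, t), tangents)
    target = v_t - (t - r) * du_dt                 # MeanFlow identity
    return target.detach()                         # stop-gradient on the target

# training (sketch): loss = ((u_net(z_t, r, t) - meanflow_target(...)) ** 2).mean()
# single-step inference (sketch, backward-in-time displacement from the terminal
# state z_1): x0_hat = z_1 - u_net(z_1, torch.zeros_like(t), torch.ones_like(t))
```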
The experiments conducted on the VoiceBank-DEMAND dataset are robust, showcasing the performance of MeanFlowSE against several state-of-the-art baselines. The reported metrics, including intelligibility, fidelity, and perceptual quality, demonstrate that MeanFlowSE not only matches but often surpasses existing methods while achieving a significantly lower real-time factor. The comprehensive comparison with other models, such as SGMSE and FlowSE, provides a strong validation of the proposed method's effectiveness.
The paper provides sufficient details regarding the implementation, including the architecture (NCSN++ U-Net with self-attention), training procedures, and evaluation metrics. The open-sourcing of the code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the findings presented.
One noted limitation is the reliance on a linear-Gaussian path for modeling, which may restrict the flexibility of the approach in more complex scenarios. Additionally, the use of first-order derivative estimation could introduce inaccuracies, particularly in non-linear contexts. Future work is suggested to explore more sophisticated modeling techniques that could mitigate these issues.
The implications of this research extend to various applications in real-time communication systems, automatic speech recognition, and assistive technologies for the hearing impaired. By improving the efficiency and quality of speech enhancement, this work has the potential to significantly enhance user experiences in noisy environments and contribute to advancements in human-computer interaction.
Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness substantially degrades in low-resource languages such as Thai. This limitation arises from three factors: (1) existing commonly used speech encoders, like the Whisper family, underperform in low-resource languages and lack support for broader spoken language understanding tasks; (2) the ASR-based alignment paradigm requires training the entire SLLM, leading to high computational cost; (3) paired speech-text data in low-resource languages is scarce. To overcome these challenges in the low-resource language Thai, we introduce XLSR-Thai, the first self-supervised learning (SSL) speech encoder for Thai. It is obtained by continuously training the standard SSL XLSR model on 36,000 hours of Thai speech data. Furthermore, we propose U-Align, a speech-text alignment method that is more resource-efficient and multitask-effective than typical ASR-based alignment. Finally, we present Thai-SUP, a pipeline for generating Thai spoken language understanding data from high-resource languages, yielding the first Thai spoken language understanding dataset of over 1,000 hours. Multiple experiments demonstrate the effectiveness of our methods in building a Thai multitask-understanding SLLM. We open-source XLSR-Thai and Thai-SUP to facilitate future research.
Primary: School of Computer Science
All Institutions: School of Computer Science
The main contribution of this paper is the introduction of XLSR-Thai, U-Align, and the Thai-SUP pipeline, which collectively address the challenges of building effective speech large language models for multitask understanding in low-resource languages. This work significantly advances the field by providing innovative solutions to longstanding issues in speech processing for underrepresented languages.
The paper proposes a comprehensive methodology that includes the development of XLSR-Thai, a self-supervised learning speech encoder specifically for Thai, which is a significant advancement given the scarcity of resources for low-resource languages. The introduction of U-Align as a more efficient speech-text alignment method is innovative, as it circumvents the computational costs associated with traditional ASR-based methods. The Thai-SUP pipeline for generating spoken language understanding data from high-resource languages is a practical solution to the data scarcity problem, showcasing a well-rounded approach to the challenges faced in low-resource language processing.
The experiments conducted are extensive and demonstrate the effectiveness of the proposed methods. The results show that XLSR-Thai outperforms existing models in ASR performance and multitask understanding tasks. The comparative analysis of U-Align against ASR-based alignment methods provides clear evidence of its advantages in terms of both performance and efficiency. The use of multiple metrics (e.g., character error rate, classification accuracy) adds robustness to the evaluation.
The paper mentions that XLSR-Thai and Thai-SUP are open-sourced, which is a positive aspect for reproducibility. However, the details regarding the implementation of U-Align and the specific datasets used could be more thoroughly documented to enhance reproducibility further. The reliance on various external datasets and models also necessitates careful attention to their availability and licensing.
One limitation is the focus on a single low-resource language (Thai), which may not generalize to other languages with different linguistic structures or phonetic characteristics. Additionally, while the proposed methods are resource-efficient, the initial training on large datasets (36,000 hours) may still pose a barrier for some researchers. The paper could also benefit from a more detailed discussion on the potential biases introduced by the data generation process in Thai-SUP.
The research has significant implications for the development of speech technologies in low-resource languages, which are often overlooked in the field of machine learning. By providing tools and datasets for Thai, the work encourages further research and development in similar languages, potentially leading to more inclusive and accessible AI technologies. The findings could also influence policy decisions regarding language preservation and technology deployment in multilingual societies.
Multichannel speech enhancement leverages spatial cues to improve intelligibility and quality, but most learning-based methods rely on specific microphone array geometry, unable to account for geometry changes. To mitigate this limitation, current array-agnostic approaches employ large multi-geometry datasets but may still fail to generalize to unseen layouts. We propose AmbiDrop (Ambisonics with Dropouts), an Ambisonics-based framework that encodes arbitrary array recordings into the spherical harmonics domain using Ambisonics Signal Matching (ASM). A deep neural network is trained on simulated Ambisonics data, combined with channel dropout for robustness against array-dependent encoding errors, therefore omitting the need for a diverse microphone array database. Experiments show that while the baseline and proposed models perform similarly on the training arrays, the baseline degrades on unseen arrays. In contrast, AmbiDrop consistently improves SI-SDR, PESQ, and STOI, demonstrating strong generalization and practical potential for array-agnostic speech enhancement.
Primary: Ben Gurion University of the Negev
All Institutions: School of Electrical and Computer Engineering, Ben Gurion University of the Negev
The paper presents AmbiDrop, a novel Ambisonics-based framework for array-agnostic speech enhancement, demonstrating strong generalization capabilities and practical potential for diverse applications. The technical contribution is significant, addressing a critical challenge in the field of multichannel speech enhancement.
The proposed AmbiDrop framework introduces a novel approach to array-agnostic speech enhancement by utilizing Ambisonics encoding and dropout-based learning. The methodology effectively addresses the limitations of existing multichannel speech enhancement techniques that rely on specific microphone geometries. By encoding arbitrary array recordings into the spherical harmonics domain, the authors create a robust input representation that is independent of array configuration. The incorporation of dropout during training simulates the challenges of real-world encoding errors, enhancing the model's robustness. This innovative approach is well-justified and theoretically sound, providing a clear pathway for practical application.
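The channel-dropout idea can be sketched as follows: during training, higher-order Ambisonics channels of the encoded signal are randomly zeroed so the network learns to tolerate array-dependent encoding errors. The per-example dropout rate `p_max` and the ACN channel ordering are assumptions; the paper's exact dropout schedule may differ.

```python
import torch

def ambisonics_channel_dropout(amb, p_max=0.5):
    """Randomly zero Ambisonics channels during training (sketch).
    amb: (batch, channels, time), channels in ACN order; the 0th-order
    omnidirectional channel is always kept."""
    B, C, T = amb.shape
    p = torch.rand(B, 1, 1, device=amb.device) * p_max          # per-example rate
    drop = (torch.rand(B, C, 1, device=amb.device) < p).float()
    drop[:, 0] = 0.0                                            # never drop W channel
    return amb * (1.0 - drop)
```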
The experiments are comprehensive, comparing the proposed model against a baseline that relies on specific microphone configurations. The results demonstrate that while both models perform similarly on training arrays, AmbiDrop significantly outperforms the baseline on unseen arrays, showcasing its generalization capabilities. The use of objective metrics such as SI-SDR, PESQ, and STOI provides a solid foundation for evaluating performance. However, the paper could benefit from additional qualitative assessments or user studies to further validate the perceptual quality improvements.
The paper includes sufficient detail regarding the experimental setup, including the generation of datasets and the training process. However, the absence of a publicly available code repository or demo URL limits reproducibility. Future work should consider releasing the code and datasets to facilitate further research and validation of the proposed methods.
One limitation of the study is the reliance on simulated data for training, which may not fully capture the complexities of real-world scenarios. Additionally, while the model shows strong performance on unseen arrays, the results on the AR glasses array indicate potential challenges in generalization to highly irregular configurations. Future work should explore these aspects further.
The AmbiDrop framework has significant implications for various applications, including telecommunication, hearing aids, and human-computer interaction. By providing a robust solution for speech enhancement across diverse microphone geometries, it can improve user experiences in real-world environments where array configurations vary widely. The potential for deployment in consumer devices could enhance accessibility and usability in everyday communication scenarios.
Contrastive language-audio pretraining (CLAP) has achieved remarkable success as an audio-text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatial information lies in multi-source conditions, where the correct correspondence between each sound source and its location is required. To tackle this problem, we propose Spatial-CLAP, which introduces a content-aware spatial encoder that enables spatial representations coupled with audio content. We further propose spatial contrastive learning (SCL), a training strategy that explicitly enforces the learning of the correct correspondence and promotes more reliable embeddings under multi-source conditions. Experimental evaluations, including downstream tasks, demonstrate that Spatial-CLAP learns effective embeddings even under multi-source conditions, and confirm the effectiveness of SCL. Moreover, evaluation on unseen three-source mixtures highlights the fundamental distinction between conventional single-source training and our proposed multi-source training paradigm. These findings establish a new paradigm for spatially-aware audio-text embeddings.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Keio University
The main contribution of this work is the introduction of Spatial-CLAP, a spatially-aware audio-text embedding model that effectively captures both content and spatial information in multi-source conditions. This research significantly advances the field of audio processing by addressing the limitations of existing models and providing a strong foundation for future developments in spatial audio understanding.
The paper introduces Spatial-CLAP, a novel audio-text embedding model that effectively integrates spatial information into the existing CLAP framework. The methodology is robust, featuring a content-aware spatial encoder (CA-SE) that captures spatial representations alongside audio content, and a spatial contrastive learning (SCL) strategy that enhances the model's ability to learn correct content-space correspondences in multi-source conditions. This dual approach is innovative and addresses a significant gap in existing models, which have primarily focused on single-source scenarios. The use of simulated room impulse responses (RIRs) for training and the incorporation of hard negative examples in SCL are particularly noteworthy, as they provide a rigorous framework for improving the model's performance in complex auditory environments.
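A plausible form of the spatial contrastive objective is an in-batch InfoNCE loss augmented with hard negatives. The sketch below assumes the hard negatives are captions whose content-location pairings have been swapped; this is an interpretation of SCL for illustration, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def scl_loss(a_emb, t_emb, t_hard_neg, tau=0.07):
    """Contrastive loss with hard negatives (sketch). a_emb, t_emb: (B, D)
    L2-normalized audio/text embeddings of matched multi-source clips and
    captions; t_hard_neg: (B, D) embeddings of captions with swapped
    source-location pairings, used as extra negatives for each clip."""
    logits = a_emb @ t_emb.t() / tau                         # (B, B) in-batch pairs
    hard = (a_emb * t_hard_neg).sum(-1, keepdim=True) / tau  # (B, 1) hard negatives
    logits = torch.cat([logits, hard], dim=1)                # append hard-negative column
    labels = torch.arange(a_emb.size(0), device=a_emb.device)
    return F.cross_entropy(logits, labels)
```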
The experimental evaluations are comprehensive, utilizing a variety of metrics to assess the performance of Spatial-CLAP across different conditions, including single-source and multi-source scenarios. The results demonstrate significant improvements over conventional methods, particularly in tasks that require understanding spatial relationships in audio. The paper includes detailed comparisons with baseline models, showcasing the effectiveness of the proposed methods. However, the reliance on synthetic data and simulated environments could limit the generalizability of the findings to real-world applications.
The authors have committed to releasing their code and pretrained models, which is crucial for reproducibility. The detailed descriptions of the model architecture, training procedures, and datasets used further enhance the reproducibility of the study. However, the paper could benefit from more explicit details regarding the hyperparameter tuning and the specific configurations used during training.
One limitation of the study is the potential overfitting to the synthetic training conditions, as the model's performance in real-world scenarios remains untested. Additionally, while the model shows promise in handling multi-source conditions, its performance in dynamic environments with moving sources has not been addressed. The paper also does not explore the computational efficiency of the model, which is an important consideration for practical applications.
The development of Spatial-CLAP has significant implications for various applications, including augmented reality (AR), virtual reality (VR), and robotics, where understanding spatial audio cues is critical. By advancing the state of the art in audio-text embeddings, this research could enhance the capabilities of systems that rely on accurate audio perception and interpretation, leading to more immersive and responsive user experiences.
This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines. To address the inherent difficulties of modeling continuous features, we propose a representation alignment module that aligns output representations of the transformer decoder with semantic embeddings from a pretrained ASR encoder during training. This mechanism not only speeds up training convergence, but also enhances cross-modal coherence between the textual and acoustic domains. Comprehensive experiments demonstrate that MELA-TTS achieves state-of-the-art performance across multiple evaluation metrics while maintaining robust zero-shot voice cloning capabilities, in both offline and streaming synthesis modes. Our results establish a new benchmark for continuous feature generation approaches in TTS, offering a compelling alternative to discrete-token-based paradigms.
Primary: The first two authors contribute equally to this work
All Institutions: The first two authors contribute equally to this work
The paper presents MELA-TTS, a novel joint transformer-diffusion framework for TTS synthesis that eliminates the need for speech tokenization, achieving state-of-the-art performance while enhancing training efficiency and output coherence. This work significantly advances the field of speech synthesis by addressing key limitations of existing models and providing a compelling alternative to traditional approaches.
The proposed MELA-TTS framework integrates a joint transformer-diffusion model that innovatively addresses the limitations of traditional TTS systems reliant on discrete tokenization. The introduction of a representation alignment module is a significant methodological advancement, as it aligns the outputs of the transformer with semantic embeddings from a pretrained ASR encoder, enhancing both training efficiency and output coherence. The autoregressive generation of continuous mel-spectrograms without tokenization is a notable shift in paradigm, indicating a robust approach to continuous feature modeling.
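One simple way to realize the representation alignment module is a cosine-similarity loss between projected decoder states and frozen ASR-encoder embeddings. The sketch below assumes the two sequences are already length-aligned and uses a hypothetical linear projection `proj`; it should be read as an illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def representation_alignment_loss(dec_hidden, asr_emb, proj):
    """Align transformer-decoder outputs with semantic embeddings from a frozen,
    pretrained ASR encoder (sketch). dec_hidden: (B, T, D_dec); asr_emb:
    (B, T, D_asr); proj: learned linear map from D_dec to D_asr."""
    z = proj(dec_hidden)                                  # (B, T, D_asr)
    cos = F.cosine_similarity(z, asr_emb.detach(), dim=-1)
    return (1.0 - cos).mean()                             # maximize cosine similarity
```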
The experiments are comprehensive, utilizing both small (LibriTTS) and large-scale datasets (170,000 hours) to validate the model's performance across various metrics, including WER and CER. The results demonstrate state-of-the-art performance in multiple scenarios, showcasing the model's effectiveness in both offline and streaming synthesis modes. The ablation studies provide strong evidence for the contributions of the representation alignment and utterance embedding, reinforcing the robustness of the experimental design.
While the paper provides a detailed description of the model architecture and training process, the absence of a publicly available code repository or demo limits reproducibility. The methodology is well-documented, but without access to the implementation, independent validation of the results is challenging.
One limitation noted is the model's performance in voice cloning, particularly in speaker similarity metrics compared to discrete-token-based systems. The authors acknowledge that the diffusion module's reliance on local context may hinder its ability to leverage broader input conditions, suggesting potential areas for future improvement. Additionally, the model's complexity may pose challenges for deployment in real-time applications.
The implications of MELA-TTS extend beyond TTS synthesis, potentially influencing areas such as audio generation, voice cloning, and even applications in music synthesis. By eliminating the need for tokenization and multi-stage processing, the framework could lead to more efficient and natural-sounding speech synthesis systems, enhancing user experiences in various domains.
In this paper, we present state-of-the-art diarization error rates (DERs) on multiple publicly available datasets, including AliMeeting-far, AliMeeting-near, AMI-Mix, AMI-SDM, DIHARD III, and MagicData RAMC. Leveraging EEND-TA, a single unified non-autoregressive model for end-to-end speaker diarization, we achieve new benchmark results, most notably a DER of 14.49% on DIHARD III. Our approach scales pretraining through 8-speaker simulation mixtures, ensuring each generated speaker mixture configuration is sufficiently represented. These experiments highlight that EEND-based architectures possess a greater capacity for learning than previously explored, surpassing many existing diarization solutions while maintaining efficient speeds during inference.
The paper makes a substantial contribution to the field of speaker diarization by presenting a state-of-the-art model that effectively leverages large-scale pre-training and demonstrates competitive performance across multiple datasets. The methodology is robust, though the lack of reproducibility resources and some limitations in performance on specific datasets suggest areas for future improvement.
The paper presents a novel approach to speaker diarization using the EEND-TA model, which is a unified non-autoregressive architecture. The methodology is well-structured, leveraging a combination of Conformer encoders and Transformer decoders, and introduces a significant innovation in scaling pre-training with simulated mixtures of up to 8 speakers. This addresses the challenge of limited annotated datasets in diarization tasks. The authors provide a clear explanation of their model architecture and the rationale behind their design choices, which enhances the understanding of their contributions.
The experiments are comprehensive, covering multiple publicly available datasets and demonstrating state-of-the-art performance in terms of Diarization Error Rates (DER). The authors effectively compare their results against existing methods, showcasing improvements across various configurations. The use of a large-scale pre-training dataset (over 80,000 hours) is particularly noteworthy, as it demonstrates the model's capacity to learn effectively from diverse speaker configurations. However, the paper could benefit from more detailed discussions on the experimental setup and the specific conditions under which the results were obtained.
While the paper includes sufficient details regarding the model architecture and training procedures, it lacks explicit links to code repositories or supplementary materials that would facilitate reproduction of the results. The absence of a demo or project URL is a significant limitation for reproducibility, as other researchers may find it challenging to replicate the experiments without access to the code or datasets used.
The paper acknowledges that the model does not outperform existing state-of-the-art results on all datasets, particularly AISHELL-4, CALLHOME, and VoxConverse. This limitation highlights the need for further refinement and tuning of the model for specific datasets. Additionally, the reliance on simulated mixtures may not fully capture the complexities of real-world audio recordings, which could affect generalization.
The findings of this research have significant implications for real-time applications in speech processing, such as automated transcription services, video conferencing, and customer service systems. By improving the efficiency and accuracy of speaker diarization, this work can enhance user experiences in various audio-based applications. The emphasis on end-to-end models also aligns with trends in machine learning towards more integrated and efficient solutions.
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual class-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing
The main contribution of this paper is the introduction of DAIEN-TTS, an innovative environment-aware zero-shot TTS framework that enables disentangled control of speaker timbre and background environments, significantly advancing the capabilities of text-to-speech synthesis. The methodology and experimental results demonstrate a meaningful step forward in the field, with potential applications that could reshape user interactions with synthesized speech.
The proposed DAIEN-TTS framework introduces a novel approach to zero-shot TTS by utilizing disentangled audio infilling, which allows for independent control over speaker timbre and environmental background. The incorporation of a pretrained speech-environment separation (SES) module is a significant methodological advancement, as it effectively disentangles the speech and environmental components. The use of random span masking during training and dual class-free guidance (DCFG) during inference enhances the model's controllability and adaptability to varying conditions. The methodology is well-structured, leveraging existing frameworks like F5-TTS while innovating on top of them, showcasing a clear progression in the field of TTS synthesis.
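Dual class-free guidance can be thought of as two independent guidance terms added to the unconditional prediction, one for the speech condition and one for the environment condition. The sketch below uses hypothetical guidance scales `w_s` and `w_e`; the paper's exact combination rule and conditioning dropout scheme are not reproduced here.

```python
def dual_cfg(v_uncond, v_speech, v_env, w_s=2.0, w_e=1.5):
    """Dual class-free guidance (sketch): combine the unconditional model output
    with separately conditioned outputs for the speech (text + speaker prompt)
    and environment components. w_s and w_e are illustrative guidance scales."""
    return v_uncond + w_s * (v_speech - v_uncond) + w_e * (v_env - v_uncond)
```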
The experiments are comprehensive, utilizing the LibriTTS corpus and the DNS-Challenge dataset to simulate a variety of environmental conditions. The evaluation metrics include both objective measures (WER, SIM-o) and subjective assessments (MOS for naturalness, speaker similarity, and environment similarity), providing a robust framework for assessing the model's performance. The results demonstrate that DAIEN-TTS outperforms existing baselines, including F5-TTS, in both silence and background environment scenarios, indicating its effectiveness in generating high-quality, environment-aware speech. The thoroughness of the experimental setup and the clarity of the results contribute positively to the paper's impact.
The paper provides detailed descriptions of the model architecture, training procedures, and evaluation metrics, which are essential for reproducibility. However, specific hyperparameters and the exact configurations of the training environment (e.g., GPU specifications) are mentioned but could be elaborated further to enhance reproducibility. The authors could also consider providing a code repository to facilitate implementation by other researchers.
One limitation of the study is the reliance on the LibriTTS corpus, which may not fully capture the diversity of real-world speech and environmental conditions. Additionally, while the model shows strong performance in controlled settings, its robustness in highly variable real-world scenarios remains to be tested. The paper does not address potential biases in the training data, which could affect the generalizability of the model.
The DAIEN-TTS framework has significant implications for applications in virtual reality, audiobooks, and personalized voice assistants, where the ability to synthesize speech with varying environmental contexts can enhance user experience. The ability to independently control speaker characteristics and background environments could lead to more immersive and realistic interactions in various multimedia applications. The research also contributes to the broader field of speech synthesis by addressing the challenge of environment-aware synthesis, paving the way for future advancements in TTS technologies.
Current audio captioning systems rely heavily on supervised learning with paired audio-caption datasets, which are expensive to curate and may not reflect human preferences in real-world scenarios. To address this limitation, we propose a preference-aligned audio captioning framework based on Reinforcement Learning from Human Feedback (RLHF). To effectively capture nuanced human preferences, we train a Contrastive Language-Audio Pretraining (CLAP)-based reward model using human-labeled pairwise preference data. This reward model is integrated into a reinforcement learning framework to fine-tune any baseline captioning system without relying on ground-truth caption annotations. Extensive human evaluations across multiple datasets show that our method produces captions preferred over those from baseline models, particularly in cases where the baseline models fail to provide correct and natural captions. Furthermore, our framework achieves performance comparable to supervised approaches with ground-truth data, demonstrating its effectiveness in aligning audio captioning with human preferences and its scalability in real-world scenarios.
The main contribution of this paper is the introduction of a novel RLHF framework for audio captioning that effectively aligns model outputs with human preferences, demonstrating competitive performance without the need for ground-truth captions. This work significantly advances the field of audio captioning by addressing the limitations of existing methods and providing a scalable solution for real-world applications.
The proposed methodology is innovative in its use of Reinforcement Learning from Human Feedback (RLHF) to align audio captions with human preferences without requiring paired audio-caption datasets. The integration of a Contrastive Language-Audio Pretraining (CLAP)-based reward model trained on pairwise human preference data is a significant advancement over traditional supervised learning approaches. The paper effectively addresses the challenges of audio captioning, particularly the ambiguity and temporal complexity inherent in audio data, by focusing on human alignment rather than static similarity metrics. The use of reward shaping techniques to mitigate reward hacking is a thoughtful addition that enhances the robustness of the approach.
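Reward models for pairwise human preferences are commonly trained with a Bradley-Terry objective; a minimal sketch is given below, assuming the CLAP-based reward model outputs a scalar score per (audio, caption) pair. This is a standard formulation, not necessarily the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry style objective for training a reward model on human
    pairwise preferences (sketch). r_chosen / r_rejected: (B,) scalar rewards
    for the preferred and dispreferred caption of the same audio clip."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```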
The experiments are comprehensive, utilizing both public and proprietary datasets to evaluate the performance of the proposed system. The results demonstrate that the RLHF-based method consistently outperforms baseline models, particularly in challenging scenarios where traditional models fail. The human evaluations provide strong evidence of the method's effectiveness in producing captions that are preferred by annotators, and the comparative analysis with supervised approaches highlights the scalability and cost-effectiveness of the proposed framework. However, the reliance on limited preference data for training the reward model may affect the generalizability of the results.
The paper provides detailed implementation details, including model architecture, training procedures, and hyperparameter settings, which enhances reproducibility. However, the absence of publicly available code or datasets limits the ability for others to fully replicate the experiments. The authors should consider releasing their code and datasets to facilitate further research and validation.
One notable limitation is the reliance on pairwise preference data, which may not capture the full spectrum of human judgment. Additionally, the method's performance may vary significantly depending on the quality and quantity of the preference data used for training the reward model. The authors acknowledge the potential for reward hacking and the challenges associated with human evaluation, which can introduce variability in the results.
The proposed framework has significant potential applications in various domains, including assistive technologies for the hearing impaired, content creation for multimedia platforms, and enhancing user experiences in audio-based applications. By aligning audio captioning systems more closely with human preferences, the research could lead to more intuitive and effective interactions with audio content, ultimately benefiting a wide range of users.
We use the term re-identification to refer to the process of recovering the original speaker's identity from anonymized speech outputs. Speaker de-identification systems aim to reduce the risk of re-identification, but most evaluations focus only on individual-level measures and overlook broader risks from soft biometric leakage. We introduce the Soft Biometric Leakage Score (SBLS), a unified method that quantifies resistance to zero-shot inference attacks on non-unique traits such as channel type, age range, dialect, sex of the speaker, or speaking style. SBLS integrates three elements: direct attribute inference using pre-trained classifiers, linkage detection via mutual information analysis, and subgroup robustness across intersecting attributes. Applying SBLS with publicly available classifiers, we show that all five evaluated de-identification systems exhibit significant vulnerabilities. Our results indicate that adversaries using only pre-trained models - without access to original speech or system details - can still reliably recover soft biometric information from anonymized output, exposing fundamental weaknesses that standard distributional metrics fail to capture.
Primary: National Institute of Standards and Technology
All Institutions: National Institute of Standards and Technology
The paper makes a significant contribution by introducing a novel metric for assessing soft biometric leakage in speaker de-identification systems, revealing vulnerabilities that traditional metrics overlook. The comprehensive methodology and experimental evaluation underscore the importance of addressing privacy concerns in speech processing, although further work is needed to enhance reproducibility and generalizability.
The paper introduces the Soft Biometric Leakage Score (SBLS), a novel metric that integrates three components: zero-shot attribute inference, systematic linkage detection, and subgroup robustness. This comprehensive approach addresses a significant gap in the evaluation of speaker de-identification systems by focusing on soft biometric leakage rather than traditional metrics. The methodology is well-structured and employs established statistical techniques, such as mutual information analysis, to quantify vulnerabilities effectively. However, the choice of heuristic weights in the SBLS calculation could benefit from further justification or exploration of alternative weighting strategies.
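Two of the SBLS ingredients can be illustrated with off-the-shelf metrics: an AUC for zero-shot attribute inference on anonymized speech, and mutual information between attribute predictions on original and anonymized utterances for linkage detection. The sketch below uses scikit-learn and hypothetical inputs; the heuristic weighting that combines the components into a single score is not reproduced.

```python
from sklearn.metrics import mutual_info_score, roc_auc_score

def attribute_leakage_auc(scores_anon, labels):
    """Zero-shot attribute inference (sketch): AUC of a pre-trained classifier's
    scores on anonymized speech against the true binary attribute labels."""
    return roc_auc_score(labels, scores_anon)

def linkage_mi(attr_pred_original, attr_pred_anonymized):
    """Linkage detection (sketch): mutual information between attribute
    predictions on original and anonymized versions of the same utterances;
    higher values indicate the de-identification system preserves the trait."""
    return mutual_info_score(attr_pred_original, attr_pred_anonymized)
```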
The experimental setup is robust, utilizing a diverse dataset (Mixer 3 corpus) with rich demographic annotations to evaluate five different speaker de-identification systems. The results demonstrate significant vulnerabilities in all evaluated systems, highlighting the effectiveness of the SBLS in revealing soft biometric leakage. The paper provides clear performance metrics, including AUC values and subgroup-specific leakage scores, which enhance the interpretability of the findings. However, the reliance on a single dataset may limit the generalizability of the results.
While the paper outlines the methodology and experimental setup in detail, it lacks specific implementation details or code availability, which could hinder reproducibility. The absence of a public repository for the SBLS implementation or the evaluated systems means that other researchers cannot easily replicate the experiments or build upon the findings.
The primary limitation is the focus on a single dataset, which may not capture the full spectrum of speaker characteristics and de-identification challenges. Additionally, the paper does not address potential variations in performance across different languages or dialects, which could affect the applicability of the findings. The heuristic nature of the SBLS component weights also raises questions about the robustness of the results.
This research has significant implications for privacy and security in speech processing applications, particularly in contexts where speaker anonymity is crucial. By quantifying soft biometric leakage, the findings can inform the development of more robust speaker de-identification systems and contribute to the broader discourse on privacy-preserving technologies in machine learning. The introduction of SBLS could lead to improved standards for evaluating de-identification systems, ultimately enhancing user trust and safety in voice-based applications.
Self-supervised learning (SSL) has pushed speaker verification accuracy close to state-of-the-art levels, but the Transformer backbones used in most SSL encoders hinder on-device and real-time deployment. Prior compression work trims layer depth or width yet still inherits the quadratic cost of self-attention. We propose SV-Mixer, the first fully MLP-based student encoder for SSL distillation. SV-Mixer replaces Transformer with three lightweight modules: Multi-Scale Mixing for multi-resolution temporal features, Local-Global Mixing for frame-to-utterance context, and Group Channel Mixing for spectral subspaces. Distilled from WavLM, SV-Mixer outperforms a Transformer student by 14.6% while cutting parameters and GMACs by over half, and at 75% compression, it closely matches the teacher's performance. Our results show that attention-free SSL students can deliver teacher-level accuracy with hardware-friendly footprints, opening the door to robust on-device speaker verification.
Primary: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government. (MSIT) (2023R1A2C1005744)
All Institutions: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government. (MSIT) (2023R1A2C1005744), Corresponding author
This paper presents a significant advancement in the field of speaker verification by introducing SV-Mixer, a lightweight, attention-free encoder that achieves competitive performance while being suitable for deployment in resource-constrained environments. The innovative approach and thorough experimental validation position this work as a valuable contribution to the ongoing evolution of self-supervised learning in audio applications.
The paper introduces SV-Mixer, a novel architecture that replaces Transformer encoders with a fully MLP-based design tailored for self-supervised learning in speaker verification. The methodology is well-structured, incorporating three specialized mixing modules (Multi-Scale Mixing, Local-Global Mixing, and Group Channel Mixing) that enhance temporal and spectral feature extraction while reducing computational complexity. The design is justified through a clear rationale for moving away from self-attention mechanisms, which are computationally expensive. The paper effectively demonstrates how these modules work together to maintain accuracy under aggressive model compression, showcasing a thoughtful approach to architectural design in the context of SSL.
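For intuition, an attention-free student block in the MLP-Mixer family alternates a token-mixing MLP over frames with a channel-mixing MLP over features. The sketch below is a generic block of this kind, not SV-Mixer's actual Multi-Scale, Local-Global, or Group Channel Mixing modules, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Generic attention-free mixing block (sketch): a token-mixing MLP over the
    time axis followed by a channel-mixing MLP over the feature axis."""
    def __init__(self, n_frames, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_frames, hidden), nn.GELU(), nn.Linear(hidden, n_frames))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                       # x: (batch, n_frames, dim)
        y = self.norm1(x).transpose(1, 2)       # mix across frames
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x)) # mix across channels
        return x
```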
The experiments are comprehensive, utilizing the VoxCeleb2 dataset for training and evaluating on multiple test sets, including VoxCeleb1 and VoxSRC 2023. The results consistently show that SV-Mixer outperforms both Transformer-based models and other MLP architectures, providing strong empirical evidence for its effectiveness. The ablation studies further substantiate the contributions of each mixing module, and the robustness of SV-Mixer under varying compression levels is particularly noteworthy. However, the paper could benefit from additional comparisons with more diverse architectures and a broader range of datasets.
The authors have made their code, pretrained models, and inference scripts publicly available on GitHub, which is a positive aspect for reproducibility. The implementation details are clearly outlined, including training parameters and data augmentation strategies. However, the paper could improve by providing more detailed instructions on how to replicate the experiments, such as specific configurations for the training environment.
The paper acknowledges limitations, such as the fixed training setup and reliance on a single distillation strategy, which may overlook more effective combinations of training objectives. Additionally, while the results are promising, the generalizability of the findings to other domains or tasks beyond speaker verification remains to be explored.
The proposed SV-Mixer architecture has significant implications for on-device speaker verification, particularly in resource-constrained environments. By reducing the computational burden associated with traditional Transformer architectures, this work could facilitate the deployment of advanced speaker verification systems in mobile and embedded applications, enhancing accessibility and usability in real-world scenarios. The findings may also inspire further research into attention-free architectures across various domains in machine learning.
High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, limiting their edge-device application. To address this, we propose the Lightweight Implicit Neural Network (LINN), a novel two-stage framework. LINN first generates initial estimates using a time-domain warping, which is then refined by an Implicit Binaural Corrector (IBC) module. IBC is an implicit neural network that predicts amplitude and phase corrections directly, resulting in a highly compact model architecture. Experimental results show that LINN achieves statistically comparable perceptual quality to the best-performing baseline model while significantly improving computational efficiency. Compared to the most efficient existing method, LINN achieves a 72.7% reduction in parameters and significantly fewer compute operations (MACs). This demonstrates that our approach effectively addresses the trade-off between synthesis quality and computational efficiency, providing a new solution for high-fidelity edge-device spatial audio applications.
Primary: East China Normal University
All Institutions: Shanghai Institute of Artificial Intelligence for Education, East China Normal University, School of Computer Science and Technology
The main contribution of this paper is the introduction of LINN, a lightweight framework for binaural audio synthesis that effectively balances high perceptual quality with computational efficiency, making it suitable for edge-device applications. This work represents a meaningful step forward in the field of audio synthesis, particularly in the context of deep learning and implicit neural representations.
The proposed Lightweight Implicit Neural Network (LINN) introduces a two-stage framework that effectively combines a Time-Domain Warping (TDW) module with an Implicit Binaural Corrector (IBC). The IBC's innovative approach of modeling spectral corrections as a continuous function is a significant advancement in binaural audio synthesis. The use of implicit neural representations allows for a compact model architecture, which is particularly beneficial for edge-device applications. The methodology is well-structured, with clear explanations of the architecture, loss functions, and positional encoding strategies, making it a robust contribution to the field.
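An implicit corrector of the kind described can be sketched as a small coordinate MLP: time-frequency coordinates are Fourier-encoded, concatenated with a conditioning vector, and mapped to per-bin amplitude and phase corrections. The architecture below is a hypothetical stand-in with illustrative layer sizes, not the paper's IBC module.

```python
import math
import torch
import torch.nn as nn

class ImplicitCorrector(nn.Module):
    """Sketch of an implicit corrector: maps Fourier-encoded (time, frequency)
    coordinates plus a conditioning vector to amplitude and phase corrections."""
    def __init__(self, cond_dim=64, n_freqs=8, hidden=128):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * math.pi)
        in_dim = 2 * 2 * n_freqs + cond_dim      # sin/cos of 2 coordinates + condition
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))                # amplitude and phase correction

    def forward(self, coords, cond):             # coords: (N, 2) in [0, 1]; cond: (N, cond_dim)
        x = coords[..., None] * self.freqs       # (N, 2, n_freqs)
        enc = torch.cat([x.sin(), x.cos()], dim=-1).flatten(1)
        return self.mlp(torch.cat([enc, cond], dim=-1))
```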
The paper presents a thorough experimental evaluation using the Binaural Speech dataset, comparing LINN against several state-of-the-art models. The results indicate that while LINN does not outperform all baselines in every metric, it achieves competitive performance with significantly lower computational requirements. The combination of quantitative metrics and perceptual evaluations (MOS tests) provides a comprehensive assessment of the model's effectiveness. The statistical analysis further strengthens the findings, showcasing LINN's ability to maintain quality while reducing complexity.
The implementation details are well-documented, including the architecture specifications, training procedures, and evaluation metrics. The authors provide a GitHub repository for source code and audio samples, which enhances the reproducibility of the results. However, the paper could benefit from more extensive documentation on the dataset preprocessing and specific hyperparameter choices to facilitate easier replication of the experiments.
One limitation is that LINN does not achieve the highest performance in all objective metrics when compared to some baseline models, which may raise questions about its applicability in scenarios where absolute performance is critical. Additionally, the reliance on a specific dataset may limit the generalizability of the results to other audio synthesis tasks or datasets.
The development of LINN has significant implications for the deployment of binaural audio synthesis in resource-constrained environments, such as mobile devices and IoT applications. By addressing the trade-off between computational efficiency and audio quality, this work opens avenues for more widespread use of spatial audio technologies in virtual reality, gaming, and immersive media experiences.
Singing Accompaniment Generation (SAG) is the process of generating instrumental music for a given clean vocal input. However, existing SAG techniques use source-separated vocals as input and overfit to separation artifacts. This creates a critical train-test mismatch, leading to failure on clean, real-world vocal inputs. We introduce AnyAccomp, a framework that resolves this by decoupling accompaniment generation from source-dependent artifacts. AnyAccomp first employs a quantized melodic bottleneck, using a chromagram and a VQ-VAE to extract a discrete and timbre-invariant representation of the core melody. A subsequent flow-matching model then generates the accompaniment conditioned on these robust codes. Experiments show AnyAccomp achieves competitive performance on separated-vocal benchmarks while significantly outperforming baselines on generalization test sets of clean studio vocals and, notably, solo instrumental tracks. This demonstrates a qualitative leap in generalization, enabling robust accompaniment for instruments - a task where existing models completely fail - and paving the way for more versatile music co-creation tools. Demo audio and code: https://anyaccomp.github.io
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of AnyAccomp, a framework that effectively resolves the train-test mismatch in singing accompaniment generation by utilizing a quantized melodic bottleneck to enhance generalization capabilities. This work represents a significant advancement in the field of audio machine learning, addressing critical challenges and paving the way for more versatile music co-creation tools.
The paper introduces a novel two-stage framework, AnyAccomp, which effectively decouples accompaniment generation from source-dependent artifacts through a quantized melodic bottleneck using VQ-VAE and a flow-matching model. This approach is innovative as it addresses the critical train-test mismatch prevalent in existing SAG models, which typically overfit to artifacts from source-separated vocals. The use of a chromagram for timbre-invariant representation is a significant methodological advancement, allowing the model to focus on the core melody rather than irrelevant acoustic details. The combination of a robust representation and a flow-matching transformer for accompaniment generation is well-conceived and demonstrates a clear understanding of the challenges in the field.
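The quantized melodic bottleneck can be approximated conceptually as chromagram extraction followed by nearest-codebook assignment. The sketch below uses librosa for the chromagram and a hypothetical learned codebook, omitting the VQ-VAE encoder/decoder and commitment losses.

```python
import numpy as np
import librosa

def quantized_melodic_codes(y, sr, codebook):
    """Sketch of a quantized melodic bottleneck: extract a chromagram as a
    timbre-invariant melody representation, then assign each frame to its
    nearest codebook entry. `codebook` is a hypothetical (K, 12) array of
    learned code vectors; the actual model learns codes inside a VQ-VAE."""
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).T          # (frames, 12)
    d = ((chroma[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                                      # discrete melody codes
```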
The experiments are comprehensive, utilizing a substantial dataset of 8k hours of paired singing voice and accompaniment data. The evaluation metrics are well-defined, including both objective measures (FAD, APA) and subjective assessments (MOS tests), which provide a holistic view of the model's performance. The results convincingly demonstrate that AnyAccomp outperforms existing models, particularly in generalization to clean vocals and instrumental tracks, validating the effectiveness of the proposed methodology. However, the paper could benefit from more detailed comparisons with additional baseline models to further substantiate its claims.
The implementation details are clear, including model architectures, training parameters, and data preparation processes, which facilitate reproducibility. The authors provide sufficient information about the training setup, including the number of parameters and optimization strategies. However, the absence of a publicly available code repository limits the ease of reproduction for other researchers, which is a significant consideration in machine learning research.
While the proposed method shows promising results, the paper does not address potential limitations in terms of computational efficiency and scalability. The reliance on a large dataset for training may also pose challenges for users with limited resources. Additionally, the model's performance on highly diverse or unconventional vocal inputs has not been thoroughly tested, which could affect its applicability in real-world scenarios.
The framework has the potential to significantly enhance music creation tools, enabling artists and producers to generate high-quality instrumental accompaniments from vocal inputs. This could democratize music production, making it more accessible to amateurs and non-professionals. Furthermore, the advancements in generalization could lead to more robust AI systems in creative fields, fostering innovation in music technology and related domains.
Stuttered and dysfluent speech detection systems have traditionally suffered from a trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper examines the Unconstrained Dysfluency Modeling (UDM) series, the current state-of-the-art framework developed at Berkeley, which combines a modular architecture, explicit phoneme alignment, and interpretable outputs for real-world clinical deployment. Through extensive experiments involving patients and certified speech-language pathologists (SLPs), we demonstrate that UDM achieves state-of-the-art performance (F1: 0.89±0.04) while providing clinically meaningful interpretability scores (4.2/5.0). Our deployment study shows an 87% clinician acceptance rate and a 34% reduction in diagnostic time. The results provide strong evidence that UDM represents a practical pathway toward AI-assisted speech therapy in clinical environments.
Primary: SSHealth Team, AI for Healthcare Laboratory
All Institutions: SSHealth Team, AI for Healthcare Laboratory
This paper presents a comprehensive evaluation of the UDM framework, demonstrating its effectiveness in clinical dysfluency detection while addressing the critical need for interpretability in AI applications within healthcare. The methodology and results contribute meaningfully to the field of machine learning in audio processing, particularly in enhancing the clinical utility of dysfluency detection systems.
The paper introduces the Unconstrained Dysfluency Modeling (UDM) framework, which is a modular and interpretable architecture designed to address the limitations of traditional dysfluency detection systems. The methodology is well-structured, incorporating multi-scale feature extraction, phoneme alignment, and explicit classification of dysfluency types. The explicit phoneme alignment module is a significant innovation that enhances interpretability, allowing clinicians to understand the model's decisions. However, the paper could benefit from a more detailed explanation of the equations and algorithms used, as well as the specific training and validation processes.
The experiments are robust, utilizing a large dataset from a clinical setting, which adds to the real-world applicability of the findings. The paper compares UDM against several baseline models, demonstrating superior performance across various metrics, including F1-score and interpretability scores. The inclusion of clinician feedback and acceptance rates provides valuable insights into the practical implications of the model. However, the results could be strengthened by including more diverse datasets and additional clinical settings to validate the model's generalizability.
The paper lacks explicit details regarding the implementation of the UDM framework, such as code availability or links to a repository. This absence limits the reproducibility of the results. Providing access to the model and datasets would enhance the credibility and usability of the research.
The paper acknowledges several limitations, including challenges with silent blocks, the current focus on Mandarin Chinese speakers, and the need for further validation in longitudinal studies. Additionally, the model's performance on different dysfluency types varies, indicating areas for improvement.
The UDM framework has significant potential for improving clinical practices in speech therapy by providing interpretable and accurate dysfluency detection. Its modular design could facilitate broader applications in other areas of healthcare where interpretability is crucial. The findings may also influence future research directions in AI-assisted speech therapy, particularly in under-resourced clinical environments.
Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines remains unresolved. This paper introduces a novel training technique called one-stage supervised contrastive learning (OS-SCL), which significantly alleviates this problem by perturbing features in the embedding space and employing a one-stage noisy supervised contrastive learning approach. On the DCASE 2020 Challenge Task 2, it achieved 94.64% AUC, 88.42% pAUC, and 89.24% mAUC using only Log-Mel features. Additionally, a time-frequency feature named TFgram is proposed, which is extracted from raw audio. This feature effectively captures critical information for anomalous sound detection, ultimately achieving 95.71% AUC, 90.23% pAUC, and 91.23% mAUC. The source code is available at: www.github.com/huangswt/OS-SCL.
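As an illustration of the two ingredients named in the abstract, the sketch below perturbs embeddings with Gaussian noise and applies a standard supervised contrastive loss over machine-ID labels. The noise scale, temperature, and loss form are assumptions that stand in for the paper's feature perturbation head and noisy-label scheme rather than reproducing them.

```python
import torch
import torch.nn.functional as F

def supcon_loss(emb, labels, temperature=0.1, noise_std=0.05):
    """Supervised contrastive loss on perturbed, L2-normalized embeddings."""
    # Feature perturbation in the embedding space (noise scale is an assumption).
    z = F.normalize(emb + noise_std * torch.randn_like(emb), dim=1)

    sim = z @ z.t() / temperature                         # (B, B) scaled cosine similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))       # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)       # avoid -inf * 0 = nan below
    pos_count = pos_mask.sum(1).clamp(min=1)
    return -(log_prob * pos_mask).sum(1).div(pos_count).mean()

# emb = encoder(log_mel)                  # (B, D) embeddings from any backbone
# loss = supcon_loss(emb, machine_ids)    # machine/section IDs as the supervision signal
```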
Primary: State Grid Xinjiang Electric Power Company
All Institutions: State Grid Xinjiang Electric Power Company, Xinjiang Siji Information Technology Co., Ltd. (work supported in part by the National Natural Science Foundation of China under Grant 62366051 and in part by these companies under Grant SGITXX00ZHXX2200262)
The paper presents a novel approach to anomalous sound detection that combines one-stage supervised contrastive learning with feature perturbation techniques, achieving state-of-the-art performance while challenging traditional beliefs about feature importance in audio analysis. The methodology is innovative, and the results have significant implications for real-world applications in industrial settings.
The proposed methodology introduces a novel one-stage supervised contrastive learning (OS-SCL) approach that effectively reduces false alarms in anomalous sound detection by perturbing features in the embedding space. The integration of a feature perturbation head (FPH) and the use of MixUp for data augmentation are innovative strategies that enhance the model's ability to learn decision boundaries. The introduction of TFgram as a time-frequency feature extraction method is also a significant contribution, as it challenges the conventional belief regarding the necessity of high-frequency components in anomaly detection. The methodology is well-structured, with clear explanations of the components and their roles in the overall framework.
The experiments are robust, utilizing the DCASE 2020 Challenge Task 2 dataset, which is a well-regarded benchmark in the field. The reported results demonstrate strong performance metrics (AUC, pAUC, mAUC) that surpass existing methods, indicating the effectiveness of the proposed approach. The paper includes a thorough comparison with state-of-the-art methods, and the ablation studies provide valuable insights into the contributions of each component of the proposed framework. However, more detailed statistical analyses and comparisons with additional datasets could further strengthen the findings.
The paper provides sufficient implementation details, including model architecture, training parameters, and the dataset used. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate the results. However, the paper could benefit from a more comprehensive description of the experimental setup and hyperparameter tuning processes to facilitate easier reproduction of the results.
One limitation of the study is the reliance on a specific dataset (DCASE 2020 Challenge Task 2), which may limit the generalizability of the results to other domains or types of anomalous sounds. Additionally, while the OS-SCL method shows promise, the impact of label noise introduced during training could be further explored, as it may lead to unintended consequences in certain scenarios.
The proposed method has significant implications for industrial applications where anomalous sound detection is critical for maintenance and operational efficiency. By reducing false alarms and improving detection stability, the approach can enhance the reliability of automated monitoring systems in various industries, potentially leading to cost savings and improved safety. The findings challenge existing paradigms regarding feature reliance in audio processing, paving the way for further research in this area.
Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in training. To address this, we propose a mixture-of-LoRA-experts approach that integrates multiple low-rank adapters (LoRA) into the model's attention layers. A routing mechanism selectively activates specialized experts, enhancing adaptability to evolving deepfake attacks. Experimental results show that our method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios, reducing equal error rates relative to baseline models. Notably, our best MoE-LoRA model lowers the average out-of-domain EER from 8.55% to 6.08%, demonstrating its effectiveness in achieving generalizable audio deepfake detection.
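A minimal mixture-of-LoRA-experts layer, for orientation: several low-rank adapters are attached to a frozen linear projection and a softmax router decides how much each expert contributes per utterance. The expert count, rank, scaling, and utterance-level routing granularity are assumptions; the paper integrates such adapters into Wav2Vec2's attention layers.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Frozen base projection + routed mixture of low-rank (LoRA) experts."""
    def __init__(self, dim: int, n_experts: int = 4, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)      # keep the pretrained weight frozen
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, dim))
        self.router = nn.Linear(dim, n_experts)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, dim)
        gates = torch.softmax(self.router(x.mean(dim=1)), dim=-1)    # (B, E) utterance-level routing
        lora = torch.einsum("btd,edr,erk->betk", x, self.A, self.B)  # per-expert low-rank update
        update = torch.einsum("be,betk->btk", gates, lora) * self.scale
        return self.base(x) + update

# y = MoELoRALinear(dim=768)(torch.randn(2, 100, 768))
```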
The main contribution of this paper is the introduction of a mixture-of-LoRA-experts framework that significantly improves the generalization capabilities of audio deepfake detection systems. This work represents a meaningful advancement in the field, combining innovative methodologies with a rigorous experimental evaluation, although it would benefit from improved reproducibility and a deeper exploration of limitations.
The proposed methodology introduces a mixture-of-LoRA-experts (MoE-LoRA) framework that enhances the adaptability of audio deepfake detection models by integrating multiple low-rank adapters into the attention layers of the Wav2Vec2 model. This approach is innovative in that it combines the benefits of parameter-efficient fine-tuning with the dynamic selection of specialized experts, allowing for improved generalization to unseen deepfake attacks. The routing mechanism for expert selection is well-conceived, promoting flexibility in model behavior depending on input characteristics. However, the paper could benefit from a more detailed explanation of the routing mechanism and its implications on computational efficiency.
The experimental setup is robust, utilizing multiple datasets that cover a range of spoofing techniques, which is crucial for evaluating the generalizability of the proposed method. The results demonstrate a clear improvement in equal error rates (EER) compared to baseline models, particularly in out-of-domain scenarios, which is a significant contribution to the field. The ablation studies conducted further strengthen the findings by highlighting the contributions of individual components of the MoE-LoRA framework. However, the paper lacks a discussion on the statistical significance of the results, which would enhance the credibility of the claims made.
The paper provides a thorough description of the experimental setup, including dataset details, training protocols, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should consider releasing the code to facilitate validation and further exploration by the research community.
One limitation is the reliance on the specific architecture of Wav2Vec2 and AASIST, which may not generalize well to other model architectures or domains outside audio deepfake detection. Additionally, while the MoE-LoRA approach shows promise, the complexity of the model may introduce challenges in deployment and real-time applications. The paper also does not address potential biases in the training datasets, which could affect the model's performance in real-world scenarios.
The implications of this research are significant, as audio deepfake detection is increasingly relevant in various domains, including security, media integrity, and misinformation prevention. The proposed method could enhance the robustness of voice authentication systems and contribute to the development of more reliable detection tools in the face of evolving deepfake technologies. The adaptability of the MoE-LoRA framework may also inspire similar approaches in other domains of machine learning where generalization to unseen data is critical.
This paper proposes APSS, a novel neural speech separation model with parallel amplitude and phase spectrum estimation. Unlike most existing speech separation methods, the APSS distinguishes itself by explicitly estimating the phase spectrum for more complete and accurate separation. Specifically, APSS first extracts the amplitude and phase spectra from the mixed speech signal. Subsequently, the extracted amplitude and phase spectra are fused by a feature combiner into joint representations, which are then further processed by a deep processor with time-frequency Transformers to capture temporal and spectral dependencies. Finally, leveraging parallel amplitude and phase separators, the APSS estimates the respective spectra for each speaker from the resulting features, which are then combined via inverse short-time Fourier transform (iSTFT) to reconstruct the separated speech signals. Experimental results indicate that APSS surpasses both time-domain separation methods and implicit-phase-estimation-based time-frequency approaches. Also, APSS achieves stable and competitive results on multiple datasets, highlighting its strong generalization capability and practical applicability.
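To ground what explicit amplitude and phase estimation means at the signal level, the sketch below splits a mixture into amplitude and phase spectra and resynthesizes a waveform from an (amplitude, phase) pair via iSTFT; the separator networks themselves are replaced by identity stand-ins, and the STFT parameters are assumptions.

```python
import torch

def split_amp_phase(wav: torch.Tensor, n_fft: int = 512, hop: int = 128):
    """Mixture waveform (B, N) -> amplitude and phase spectra (B, F, T)."""
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs(), spec.angle()

def resynthesize(amp: torch.Tensor, phase: torch.Tensor, n_fft: int = 512, hop: int = 128, length=None):
    """Estimated amplitude + phase for one speaker -> waveform via iSTFT."""
    window = torch.hann_window(n_fft, device=amp.device)
    spec = torch.polar(amp, phase)          # recombine into a complex spectrogram
    return torch.istft(spec, n_fft, hop_length=hop, window=window, length=length)

mix = torch.randn(1, 16000)                 # stand-in one-second mixture
amp, phase = split_amp_phase(mix)
# In APSS, parallel separators would predict (amp_k, phase_k) per speaker;
# identity stand-ins here simply reconstruct the mixture.
rec = resynthesize(amp, phase, length=mix.shape[-1])
```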
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing
The main contribution of this paper is the introduction of the APSS model, which explicitly estimates both amplitude and phase spectra for improved speech separation. This innovative approach, combined with rigorous experimental validation, positions the work as a significant advancement in the audio processing domain, addressing a critical challenge in speech separation tasks.
The proposed APSS model introduces a novel approach to speech separation by explicitly modeling both amplitude and phase spectra, which is a significant advancement over existing methods that often neglect phase information. The architecture is well-structured, utilizing a feature combiner and deep processors with time-frequency Transformers, which effectively captures the temporal and spectral dependencies. The parallel amplitude and phase separators are a clever design choice that allows for independent estimation while still leveraging the correlation between amplitude and phase. This dual modeling is a notable methodological contribution, addressing a critical gap in the field.
The experimental setup is robust, utilizing well-known datasets (WSJ0-2Mix and Libri2Mix) to validate the model's performance. The results demonstrate that APSS outperforms various baseline models, including both time-domain and implicit-phase estimation methods. The use of ablation studies to assess the contributions of different components of the model adds rigor to the evaluation, providing clear evidence of the importance of each part of the architecture. However, the paper could benefit from more detailed comparisons with additional state-of-the-art methods to further contextualize its contributions.
The paper provides sufficient details regarding the model architecture, training criteria, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which others can replicate the results. Including such resources would enhance the paper's impact and facilitate further research.
While the model shows strong performance, it is primarily focused on monaural two-speaker separation, which may limit its applicability in more complex scenarios involving multiple speakers or varying acoustic conditions. Additionally, the reliance on specific datasets for validation raises questions about generalization to real-world applications where conditions may differ significantly from those in the training data.
The advancements presented in this paper have the potential to significantly improve speech separation technologies, which are crucial for applications in automatic speech recognition, hearing aids, and communication systems in noisy environments. By effectively addressing the cocktail party problem, the APSS model could enhance user experience in various audio processing applications, making it a valuable contribution to the field.
This paper introduces a multi-stage self-guided framework designed to address the spatial semantic segmentation of sound scenes (S5) task in the DCASE 2025 Task 4 challenge. This framework integrates models focused on three distinct tasks: Universal Sound Separation (USS), Single-label Classification (SC), and Target Sound Extraction (TSE). Initially, USS breaks down a complex audio mixture into separate source waveforms. Each of these separated waveforms is then processed by an SC block, generating two critical pieces of information: the waveform itself and its corresponding class label. These serve as inputs for the TSE stage, which isolates the source that matches this information. Since these inputs are produced within the system, the extraction target is identified autonomously, removing the need for external guidance. The extracted waveform can be looped back into the classification task, creating a cycle of iterative refinement that progressively enhances both separability and labeling accuracy. We thus call our framework a multi-stage self-guided system due to these self-contained characteristics. On the official evaluation dataset, the proposed system achieves a class-aware signal-to-distortion ratio improvement (CA-SDRi) of 11.00 dB and 55.8% accuracy in label prediction, outperforming the ResUNetK baseline by 4.4 dB and 4.3 percentage points, respectively, and achieving first place among all submissions.
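The control flow the abstract describes can be summarized in a few lines, with the three models reduced to placeholder callables: separate the mixture, classify each estimate, extract with the system's own (label, cue) pair, and loop the result back through classification. Everything inside the callables, and the number of refinement passes, is an assumption; only the orchestration mirrors the text.

```python
from typing import Callable, List, Tuple
import torch

Tensor = torch.Tensor

def self_guided_pipeline(
    mixture: Tensor,
    separate: Callable[[Tensor], List[Tensor]],        # USS: mixture -> source estimates
    classify: Callable[[Tensor], Tuple[str, Tensor]],  # SC: waveform -> (label, cue embedding)
    extract: Callable[[Tensor, str, Tensor], Tensor],  # TSE: (mixture, label, cue) -> source
    n_refine: int = 2,
) -> List[Tuple[str, Tensor]]:
    results = []
    for est in separate(mixture):
        label, cue = classify(est)                     # self-generated guidance
        for _ in range(n_refine):                      # iterative refinement loop
            est = extract(mixture, label, cue)
            label, cue = classify(est)                 # re-label the cleaner estimate
        results.append((label, est))
    return results

# Toy stand-ins so the skeleton runs end to end:
sep = lambda m: [m * 0.5, m * 0.5]
cls = lambda w: ("speech", w.mean(dim=-1, keepdim=True))
tse = lambda m, lbl, cue: m * 0.5
out = self_guided_pipeline(torch.randn(1, 16000), sep, cls, tse)
```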
Primary: Fictional University
All Institutions: School of Electrical Engineering, University Imagination, Important Laboratory, Fictional University, Meta Reality Labs
This paper presents a novel self-guided multi-stage framework for sound scene analysis that significantly improves audio source separation and classification. The technical contributions are substantial, with a well-defined methodology and promising experimental results, although the lack of reproducibility resources and broader comparative analyses could be areas for improvement.
The paper proposes a multi-stage self-guided framework that integrates Universal Sound Separation (USS), Single-label Classification (SC), and Target Sound Extraction (TSE) in a novel way. The architecture is well-structured and leverages iterative refinement to enhance both separation and classification accuracy. The use of a modified DeFT-Mamba model for USS and TSE is innovative, as it allows for the simultaneous processing of audio and class labels, which is a significant improvement over traditional methods that rely on external cues. The methodology is robust, with clear delineation of stages and the rationale for each component's design.
The experimental results demonstrate a significant improvement in both CA-SDRi and classification accuracy compared to the baseline. Achieving first place in the DCASE 2025 Task 4 challenge indicates a strong validation of the proposed framework. The paper provides comprehensive details on the training setup, data augmentation strategies, and evaluation metrics, which are crucial for assessing the performance of the models. However, the absence of comparisons with a wider range of existing methods could limit the contextual understanding of the results.
The paper includes detailed descriptions of the model architectures, loss functions, and training procedures, which are essential for reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the findings fully. Providing a GitHub repository or similar resource would greatly enhance reproducibility.
One limitation is the reliance on a specific dataset (DCASE 2025 Task 4), which may not generalize to other audio separation tasks or real-world applications. Additionally, while the iterative refinement process is beneficial, it may introduce computational overhead, making the framework less practical for real-time applications. The paper does not address potential issues related to model complexity and inference time.
The proposed framework has significant implications for audio processing applications, particularly in environments where sound source separation and classification are critical, such as in robotics, surveillance, and assistive technologies. By improving the accuracy of sound event detection, this research could enhance user experiences in various audio-related fields, including augmented reality and smart home devices.
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities. Audio samples are available at https://justinlovelace.github.io/projects/speechop
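A hedged sketch of what inference-time task composition can look like for a latent diffusion model: at each sampler step, predictions under two different conditionings (noisy speech for enhancement, an ASR transcript for TTS-style guidance) are combined with tunable weights. The `denoiser` interface and the simple weighted average are assumptions, not the paper's exact ITC rule.

```python
import torch

@torch.no_grad()
def composed_denoise_step(x_t, t, denoiser, cond_enh, cond_tts, w_enh=1.0, w_tts=0.5):
    """One sampler step that mixes two conditional predictions.

    `denoiser(x_t, t, cond)` is a stand-in for a latent-diffusion model that
    predicts a clean latent; the weighting scheme is an assumption, not the
    paper's exact composition rule.
    """
    pred_enh = denoiser(x_t, t, cond_enh)    # conditioned on noisy speech (enhancement task)
    pred_tts = denoiser(x_t, t, cond_tts)    # conditioned on the ASR transcript (TTS task)
    total = w_enh + w_tts
    return (w_enh * pred_enh + w_tts * pred_tts) / total

# Toy stand-in so the step runs:
denoiser = lambda x, t, c: 0.9 * x + 0.1 * c
x_next = composed_denoise_step(torch.randn(1, 64, 100), 0.5, denoiser,
                               cond_enh=torch.randn(1, 64, 100),
                               cond_tts=torch.randn(1, 64, 100))
```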
Primary: Adobe (work done during an internship)
All Institutions: Adobe
The main contribution of this paper is the introduction of SpeechOp, a novel framework for generative speech processing that enables inference-time task composition, significantly improving the quality and versatility of speech-to-speech processing tasks. The technical contributions, particularly in task composition and leveraging pre-trained models, represent a meaningful advancement in the field of audio machine learning.
The methodology presented in SpeechOp is innovative, leveraging a multi-task latent diffusion model that repurposes pre-trained TTS models for a variety of speech processing tasks. The introduction of Implicit Task Composition (ITC) is particularly noteworthy, as it allows for dynamic task composition at inference time, which is a significant advancement in the field. The integration of ASR-derived transcripts to guide the generative process adds a layer of sophistication that enhances the model's ability to preserve content and speaker identity, addressing a critical challenge in speech-to-speech processing.
The experimental setup is robust, with a clear focus on evaluating the performance of SpeechOp across multiple speech tasks. The paper provides comparative results against state-of-the-art methods, demonstrating significant improvements in content preservation and task quality. However, the specifics of the datasets used and the metrics for evaluation could be elaborated further to strengthen the findings.
The paper mentions audio samples and provides a demo URL, which is a positive aspect for reproducibility. However, there is a lack of detailed information regarding the implementation, such as code availability or specific hyperparameters used in training, which could hinder full reproducibility by other researchers.
One limitation is the reliance on pre-trained TTS models, which may introduce biases inherent in those models. Additionally, while the paper claims state-of-the-art performance, it would benefit from a more extensive discussion on the generalizability of the approach across diverse speech datasets and languages.
The implications of SpeechOp are significant, as it offers a versatile framework for various speech processing tasks, potentially transforming applications in accessibility, voice synthesis, and real-time communication. The ability to compose tasks at inference time could lead to more adaptive and intelligent speech systems, enhancing user experience in numerous domains.
The majority of mainstream neural vocoders primarily focus on speech quality and generation speed, while overlooking latency, which is a critical factor in real-time applications. Excessive latency leads to noticeable delays in user interaction, severely degrading the user experience and rendering such systems impractical for real-time use. Therefore, this paper proposes DLL-APNet, a Distilled Low-Latency neural vocoder which first predicts the Amplitude and Phase spectra explicitly from input mel spectrogram and then reconstructs the speech waveform via inverse short-time Fourier transform (iSTFT). The DLL-APNet vocoder leverages causal convolutions to constrain the utilization of information to current and historical contexts, effectively minimizing latency. To mitigate speech quality degradation caused by causal constraints, a knowledge distillation strategy is proposed, where a pre-trained non-causal teacher vocoder guides intermediate feature generation of the causal student DLL-APNet vocoder. Experimental results demonstrate that the proposed DLL-APNet vocoder produces higher-quality speech than other causal vocoders, while requiring fewer computational resources. Furthermore, the proposed DLL-APNet vocoder achieves speech quality on par with mainstream non-causal neural vocoders, validating its ability to deliver both high perceptual quality and low latency.
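The two mechanisms the abstract names, causal convolution and intermediate-feature distillation, reduce to a few lines each. The kernel size, dilation, layer pairing, and L1 distillation loss below are assumptions; the point is that left-only padding keeps the receptive field strictly in the past while a frozen non-causal teacher supervises the student's hidden features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only sees current and past frames (no look-ahead latency)."""
    def __init__(self, ch_in, ch_out, kernel_size=5, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-pad only
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (B, C, T)
        return self.conv(F.pad(x, (self.pad, 0)))

def feature_distillation_loss(student_feats, teacher_feats):
    """L1 match between intermediate features of the causal student and a frozen
    non-causal teacher (layer pairing and loss form are assumptions)."""
    return sum(F.l1_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))

x = torch.randn(2, 80, 200)                                # batch of mel spectrograms
y = CausalConv1d(80, 256)(x)                               # same time length, causal receptive field
```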
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing
The paper presents DLL-APNet, a novel low-latency neural vocoder that effectively balances speech quality and latency through innovative use of causal convolutions and knowledge distillation. The methodology and experimental results indicate a meaningful contribution to the field of speech synthesis, particularly for applications requiring real-time performance.
The proposed methodology of DLL-APNet is well-structured, leveraging causal convolutions to minimize latency while employing knowledge distillation to enhance speech quality. The explicit prediction of amplitude and phase spectra from mel spectrograms is a significant improvement over traditional vocoders that often neglect latency. The integration of a pre-trained non-causal model as a teacher for the student model is innovative and effectively addresses the trade-off between latency and speech quality. The use of causal convolutions is appropriate for real-time applications, and the paper provides a clear explanation of how these convolutions operate to maintain causality.
The experimental setup is robust, utilizing the VCTK dataset, which is a standard benchmark for speech synthesis tasks. The authors compare their model against several state-of-the-art vocoders, both causal and non-causal, providing a comprehensive analysis of performance metrics. The results demonstrate that DLL-APNet outperforms other causal vocoders while maintaining quality comparable to non-causal models. The use of multiple objective metrics (SNR, RMSE, MCD, etc.) adds credibility to their findings. However, the paper could benefit from more qualitative evaluations, such as user studies or perceptual tests, to complement the objective metrics.
The paper includes sufficient implementation details, such as hyperparameter settings, model architecture, and training procedures, which facilitate reproducibility. The authors also mention the use of a demo page for generated speech samples, enhancing transparency. However, the lack of a publicly available code repository limits the ease of reproduction for other researchers.
One limitation is the reliance on a pre-trained non-causal model, which may not be readily available to all researchers. Additionally, while the paper addresses latency, it does not explore the potential trade-offs in terms of model size and complexity, which could impact deployment in resource-constrained environments. The paper also lacks a discussion on the generalizability of the model to different languages or accents, which could be an important consideration for real-world applications.
The proposed DLL-APNet vocoder has significant implications for real-time speech applications, such as telecommunication, virtual assistants, and interactive voice response systems. By addressing the critical issue of latency while maintaining high speech quality, this work contributes to the advancement of practical speech synthesis technologies. The findings could influence future research directions in low-latency vocoding and real-time audio processing.
Dysarthric speech severity classification is crucial for objective clinical assessment and progress monitoring in individuals with motor speech disorders. Although prior methods have addressed this task, achieving robust generalization in speaker-independent (SID) scenarios remains challenging. This work introduces DSSCNet, a novel deep neural architecture that combines convolutional, squeeze-and-excitation (SE), and residual networks, helping it extract discriminative representations of dysarthric speech from mel spectrograms. The SE block selectively focuses on the important features of the dysarthric speech, thereby minimizing loss and enhancing overall model performance. We also propose a cross-corpus fine-tuning framework for severity classification, adapted from detection-based transfer learning approaches. DSSCNet is evaluated on two benchmark dysarthric speech corpora, TORGO and UA-Speech, under speaker-independent evaluation protocols: One-Speaker-Per-Severity (OSPS) and Leave-One-Speaker-Out (LOSO). DSSCNet achieves accuracies of 56.84% and 62.62% under OSPS and 63.47% and 64.18% under LOSO on TORGO and UA-Speech, respectively, outperforming existing state-of-the-art methods. Upon fine-tuning, the performance improves substantially, with DSSCNet achieving up to 75.80% accuracy on TORGO and 68.25% on UA-Speech in OSPS, and up to 77.76% and 79.44%, respectively, in LOSO. These results demonstrate the effectiveness and generalizability of DSSCNet for fine-grained severity classification across diverse dysarthric speech datasets.
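The squeeze-and-excitation block at the heart of DSSCNet is a standard construction, sketched below for a 2-D convolutional feature map of a mel spectrogram; the channel count and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels of a conv feature map so the
    network can emphasize the most informative spectrogram features."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                     # squeeze: global average pool
        return x * w.unsqueeze(-1).unsqueeze(-1)            # excite: per-channel gating

# feat = SEBlock(64)(torch.randn(4, 64, 128, 32))           # e.g. conv features of a mel spectrogram
```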
Primary: Sikkim Manipal Institute of Technology (SMIT)
All Institutions: Sikkim Manipal Institute of Technology (SMIT), National Institute of Technology Sikkim
The main contribution of this paper is the development of DSSCNet, a novel deep learning architecture that significantly improves dysarthric speech severity classification through innovative use of transfer learning and advanced neural network components. This work represents a meaningful step forward in the field of speech processing, addressing critical challenges in speaker-independent scenarios and enhancing the potential for real-world applications.
The proposed DSSCNet architecture effectively integrates convolutional layers, squeeze-excitation blocks, and residual connections to enhance the classification of dysarthric speech severity. The methodology is well-structured, leveraging deep learning principles to address the challenges of speaker-independent classification. The incorporation of a cross-corpus fine-tuning strategy is particularly noteworthy, as it allows the model to generalize better across different datasets, which is a significant advancement over traditional methods that often struggle with speaker variability.
The experimental setup is robust, utilizing two benchmark datasets (TORGO and UA-Speech) and employing rigorous evaluation protocols (OSPS and LOSO) to assess model performance. The results demonstrate a clear improvement over baseline models and existing state-of-the-art methods, particularly after fine-tuning, which showcases the effectiveness of the proposed architecture. However, the paper could benefit from more extensive ablation studies to further validate the contributions of individual components.
The paper provides a detailed description of the methodology, including data preprocessing, model architecture, and training procedures, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to replicate the results directly.
While the model shows promising results, it still faces challenges related to class imbalance, particularly for medium severity levels. The reliance on two specific datasets may also limit the generalizability of the findings to broader dysarthric speech scenarios. Additionally, the paper does not address potential overfitting issues, which could arise from the model's complexity.
The implications of this research are significant for clinical applications, particularly in developing assistive technologies for individuals with dysarthria. By improving the accuracy of severity classification, the proposed model can enhance the effectiveness of therapy plans and assistive communication devices, ultimately contributing to better quality of life for affected individuals. The approach also lays the groundwork for future research in speaker-independent speech processing tasks.
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs' audio understanding performance, researchers have proposed different benchmarks. However, key aspects for real-world interactions are underexplored in existing benchmarks, i.e., audio signals typically contain both speech and non-speech components, and energy levels of these components can vary significantly across different scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in a joint understanding setting. To address this issue, we introduce Chain-of-Thought, which effectively improves LALMs' joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.
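The energy-level axis the benchmark varies can be reproduced with a small utility that scales a non-speech background so the speech-to-background ratio hits a chosen SNR before a model is asked about speech, scene, and events jointly; the sampling rate and signal lengths below are placeholders.

```python
import torch

def mix_at_snr(speech: torch.Tensor, background: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale `background` so that 10*log10(P_speech / P_background) == snr_db."""
    background = background[..., : speech.shape[-1]]
    p_speech = speech.pow(2).mean()
    p_bg = background.pow(2).mean().clamp_min(1e-12)
    gain = torch.sqrt(p_speech / (p_bg * 10.0 ** (snr_db / 10.0)))
    return speech + gain * background

# Speech buried 5 dB below the scene/event background:
mixture = mix_at_snr(torch.randn(16000), torch.randn(16000), snr_db=-5.0)
```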
Primary: School of Electrical Engineering
All Institutions: School of Electrical Engineering, Republic of Korea
The paper presents SSEU-Bench, a novel benchmark for evaluating Large Audio Language Models (LALMs) in joint audio understanding tasks, significantly advancing the field of audio processing by addressing key gaps in existing benchmarks. The innovative methodology and comprehensive evaluation highlight the potential of LALMs while revealing challenges that remain in achieving robust audio understanding.
The paper introduces SSEU-Bench, a novel benchmark that addresses the joint understanding of speech, scene, and events in audio signals, which is a significant advancement over existing benchmarks that typically treat these components separately. The methodology effectively incorporates energy differences between speech and non-speech audio, which is crucial for realistic audio understanding tasks. The introduction of the Chain-of-Thought (CoT) approach for improving joint understanding by decomposing tasks into simpler steps is innovative and adds depth to the methodology.
The experiments are well-structured, evaluating multiple Large Audio Language Models (LALMs) across various tasks (ASR, ASC, and AT) under different signal-to-noise ratios (SNRs). The results demonstrate the performance of LALMs in independent and joint understanding settings, providing a comprehensive view of their capabilities. However, the paper could benefit from more extensive comparisons with state-of-the-art models beyond CLAP-based approaches to contextualize the findings further.
The authors have committed to releasing all data and code, which is a positive step towards reproducibility. However, the paper lacks detailed descriptions of the experimental setups, including hyperparameters and specific configurations used for the LALMs, which could hinder full reproducibility.
One limitation is the reliance on a limited number of LALMs for evaluation, which may not fully represent the landscape of audio understanding models. Additionally, the performance degradation observed in joint understanding tasks raises questions about the robustness of LALMs in complex scenarios. The paper also does not address potential biases in the datasets used for training and evaluation.
The proposed benchmark and methodologies have significant implications for real-world applications, such as human-machine interaction, automatic transcription services, and environmental sound recognition. By improving the understanding of audio signals in a joint context, this work paves the way for more sophisticated audio processing systems that can better mimic human auditory perception.
Speech therapy plays a critical role in treating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback, constraining their practical utility. Recent advances in multimodal large language models (MLLMs) have demonstrated significant potential in healthcare, particularly through their ability to integrate multimodal data for adaptive assessment and therapeutic feedback. Nevertheless, challenges including insufficient acquisition and fusion of articulatory information, inadequate parsing of articulatory organ motion trajectories, and the scarcity of high-quality domain-specific datasets hinder the application of MLLMs in speech therapy. To address these limitations, we propose an MLLM-based speech rehabilitation assistance system that synergistically leverages ultrasound tongue imaging and speech signals to deliver precise, interactive articulatory feedback. We construct a high-quality domain-specific dataset comprising UTI-speech dialogue pairs, which facilitates fine-tuning to enhance the model's clinical adaptability. Building on this dataset, our method applies a spatiotemporal fusion training strategy to ultrasound videos and speech signals, enabling fine-grained articulatory impairment analysis and ultimately generating actionable feedback.
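One plausible reading of the spatiotemporal fusion step, sketched under explicit assumptions: speech-frame tokens cross-attend over ultrasound-tongue-imaging frame tokens before being handed to the language model, so articulatory motion can inform the generated feedback. The embedding size, single attention layer, and residual fusion are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class UTISpeechFusion(nn.Module):
    """Speech frames attend over ultrasound-tongue-imaging frames; the fused
    tokens would then be passed to the MLLM. Dimensions are assumptions."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech_tokens: torch.Tensor, uti_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=speech_tokens, key=uti_tokens, value=uti_tokens)
        return self.norm(speech_tokens + fused)     # residual cross-modal fusion

# fused = UTISpeechFusion()(torch.randn(1, 200, 512), torch.randn(1, 50, 512))
```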
Primary: The Eighth Affiliated Hospital of Sun Yat-sen University
All Institutions: Chinese Academy of Sciences, The Eighth Affiliated Hospital of Sun Yat-sen University, Key Laboratory of Biomedical Imaging Science and System, Department of Rehabilitation Medicine, Shenzhen Institute of Advanced Technology
The main contribution of this paper is the development of a multimodal large language model-based system for personalized speech therapy that integrates ultrasound imaging and speech signals, demonstrating significant potential to enhance the effectiveness of speech rehabilitation. The innovative methodology and promising experimental results position this work as a significant advancement in the intersection of machine learning and healthcare.
The proposed methodology is innovative, leveraging a multimodal large language model (MLLM) that integrates ultrasound tongue imaging (UTI) with speech signals to provide personalized feedback for speech rehabilitation. The authors construct a high-quality dataset of UTI-speech dialogue pairs, which is critical for fine-tuning the model. The dual-agent collaborative QA generation framework is a notable contribution, as it enhances the generation of dialogue data for therapy applications. The spatiotemporal fusion training strategy is well-conceived, allowing for a nuanced understanding of articulatory dynamics. However, the paper could benefit from a more detailed explanation of the model architecture and the specific algorithms used for data processing and feature extraction.
The experiments are comprehensive, utilizing a well-defined dataset and clear evaluation metrics, including BLEU, METEOR, and ROUGE-L for natural language generation, as well as accuracy and F1-score for dysarthria assessment. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. The ablation studies provide valuable insights into the contributions of different modalities, confirming the importance of integrating UTI data. However, the paper lacks a comparative analysis with more recent state-of-the-art models in the same domain, which could strengthen the claims of superiority.
The implementation details are adequately described, including the training configuration and dataset characteristics. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the dataset and model would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a relatively small dataset, which may affect the generalizability of the model. Additionally, the paper does not address potential biases in the dataset or the implications of using a single model architecture. The authors could also explore the scalability of their approach in real-world clinical settings.
This research has significant implications for the field of speech therapy, particularly in enhancing accessibility and personalization of treatment for individuals with speech disorders. By integrating advanced machine learning techniques with clinical practice, the proposed system could improve patient outcomes and reduce the burden on healthcare professionals. The work may inspire further research into multimodal approaches in other areas of rehabilitation and therapy.
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuses speaker-aware global information and fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To the best of our knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and training dataset can be found at https://github.com/NKU-HLT/GLAD.
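The global-local routing idea can be captured in a small gating module: each frame's expert weights are computed from the concatenation of the frame-level feature and an utterance-level summary. The mean-pooled global context, expert count, and softmax gate below are assumptions standing in for GLAD's speaker-aware fusion.

```python
import torch
import torch.nn as nn

class GlobalLocalRouter(nn.Module):
    """Per-frame expert weights from fused global (utterance) and local (frame) cues."""
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(2 * dim, n_experts)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:     # frames: (B, T, D)
        global_ctx = frames.mean(dim=1, keepdim=True).expand_as(frames)
        fused = torch.cat([frames, global_ctx], dim=-1)           # local + global cues
        return torch.softmax(self.gate(fused), dim=-1)            # (B, T, n_experts)

weights = GlobalLocalRouter(dim=256)(torch.randn(2, 300, 256))
# expert_out = (weights.unsqueeze(-1) * expert_outputs).sum(dim=2)  # assuming expert_outputs: (B, T, E, D)
```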
Primary: Nankai University
All Institutions: Nankai University, College of Computer Science
The paper presents GLAD, a novel framework for multi-talker ASR that significantly enhances transcription accuracy in overlapping speech scenarios. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on the field of speech processing and related applications.
The paper proposes a novel framework called GLAD, which utilizes a Mixture-of-Experts (MoE) approach to improve multi-talker automatic speech recognition (MTASR). The methodology is well-structured, introducing a global-local aware dynamic routing mechanism that effectively combines global context with local acoustic features. This dual approach allows for more precise expert selection, particularly in high-overlap scenarios where speaker identities and content are entangled. The integration of LoRA-based experts enhances the scalability and efficiency of the model, making it suitable for real-world applications. The design choices are justified, and the paper provides a clear explanation of how the proposed framework addresses existing limitations in MTASR.
The experiments conducted on the LibriSpeechMix dataset are comprehensive, comparing GLAD-SOT against various baseline models. The results demonstrate significant improvements in performance, especially in challenging multi-talker scenarios. The use of both Permutation-Invariant WER and Overlap-Aware WER as evaluation metrics provides a nuanced understanding of the model's capabilities. The ablation studies effectively highlight the contributions of different components of the GLAD architecture, reinforcing the importance of the proposed global-local fusion strategy.
The authors provide a link to their GitHub repository, which includes the code and training dataset, enhancing the reproducibility of their work. The detailed descriptions of the model architecture, training settings, and evaluation metrics further support the ability of other researchers to replicate the study. However, the paper could benefit from additional details regarding hyperparameter tuning and specific configurations used during training.
While the proposed GLAD framework shows promising results, it may still face challenges in extremely noisy environments or with highly variable speaker characteristics that were not extensively tested. Additionally, the reliance on the LibriSpeechMix dataset may limit the generalizability of the findings to other real-world datasets with different characteristics. The paper does not address potential computational costs associated with the dynamic routing mechanism, which could be a concern for deployment in resource-constrained environments.
The advancements in MTASR presented in this paper have significant implications for various applications, including meeting transcription, voice assistants, and multi-party dialogue systems. By improving the accuracy of speech recognition in overlapping scenarios, this research could enhance communication technologies and accessibility tools, benefiting diverse user groups. The dynamic routing approach could also inspire further research in other domains where multi-modal data processing is required.
Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reconstruction at competitive low bitrates while demonstrating an inherent ability for information disentanglement. We construct a two-stage language model for text-to-speech (TTS) synthesis using this codec, which, despite its lightweight design and minimal data requirements, achieves a state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to several larger models. Furthermore, the codec's design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody.
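The multi-scale residual structure is easiest to see as residual vector quantization: each stream quantizes whatever the previous streams failed to capture, so the per-stream codes sum back toward the original latent. The codebooks below are random stand-ins and the latent dimensions are assumptions; in the paper the four streams are further tied to semantic, timbre, prosody, and residual content.

```python
import torch

def residual_quantize(latent: torch.Tensor, codebooks: list):
    """Quantize `latent` (B, T, D) with a cascade of codebooks; each stage
    encodes the residual left by the previous ones (random codebooks here)."""
    residual, codes, recon = latent, [], torch.zeros_like(latent)
    for cb in codebooks:                                    # cb: (K, D)
        dists = torch.cdist(residual, cb.unsqueeze(0).expand(latent.size(0), -1, -1))
        idx = dists.argmin(-1)                              # (B, T) code indices for this stream
        q = cb[idx]                                         # nearest codewords (B, T, D)
        codes.append(idx)
        recon = recon + q
        residual = residual - q                             # next stream sees what is left
    return codes, recon

torch.manual_seed(0)
books = [torch.randn(256, 64) for _ in range(4)]            # 4 streams, e.g. semantic/timbre/prosody/residual
codes, recon = residual_quantize(torch.randn(2, 100, 64), books)
```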
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, Hong Kong SAR
This paper presents a novel low-bitrate multi-stream residual codec that effectively disentangles speech attributes for high-fidelity speech generation. The technical contributions are significant, with a well-structured methodology and robust experimental validation, positioning it as a valuable advancement in the field of audio processing and speech synthesis.
The proposed method introduces a multi-stream residual codec that effectively disentangles speech into semantic, timbre, prosody, and residual streams. This architecture is innovative in its approach to information disentanglement, achieving high compression rates while maintaining speech quality. The use of pre-trained models for feature extraction and the cascaded architecture for stream fusion are well-justified and contribute to the overall efficiency of the codec. The methodology is clearly articulated, with a logical flow from speech encoding to reconstruction, and the integration of auxiliary losses enhances the model's ability to capture prosodic features.
The experiments are comprehensive, utilizing diverse datasets that ensure robustness in speaker identity and prosody representation. The results demonstrate the codec's competitive performance against existing models, achieving state-of-the-art WER and speaker similarity metrics. The inclusion of various evaluation metrics, such as STOI, PESQ, and WER, provides a well-rounded assessment of the model's capabilities. However, the paper could benefit from more extensive comparisons with a wider range of existing codecs to further validate its claims.
The implementation details are described in sufficient depth, including the architecture of the codec and TTS model, as well as the training objectives and evaluation metrics. However, the lack of URLs for code repositories or demo pages limits the reproducibility of the work. Providing access to the code and trained models would significantly enhance the ability of other researchers to replicate the results.
One limitation is the reliance on pre-trained models, which may not generalize well across different languages or dialects. Additionally, the evaluation is primarily focused on English datasets, potentially limiting the applicability of the findings to other languages. The paper also does not address the scalability of the model to larger datasets or more complex speech scenarios.
The proposed codec has significant implications for applications in speech synthesis, voice conversion, and other audio processing tasks. Its ability to disentangle speech attributes could lead to advancements in personalized TTS systems and more efficient audio streaming technologies. The lightweight design and low bitrate requirements make it particularly relevant for mobile and real-time applications, potentially broadening access to high-quality speech generation technologies.
Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even under severe test-time missing modalities. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfection, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, Wuhan University
The main contribution of this paper is the introduction of a robust multimodal target speaker extraction system that integrates various speaker identity cues and demonstrates the importance of training with modality dropout to enhance real-world applicability. The comprehensive analysis of the interactions between modalities and their robustness under dropout conditions provides valuable insights for future research in audio-visual speech enhancement.
The methodology presented in this paper is robust and well-structured, focusing on the integration of multiple modalities (lip, voice, face, and expression embeddings) for target speaker extraction. The authors employ a state-of-the-art audio-visual speech enhancement system and introduce a novel dynamic expression embedding that adds significant value to the model. The systematic evaluation of different combinations of modalities under varying dropout conditions is a strong point, demonstrating a thorough understanding of the challenges in real-world applications. The use of cross-attention mechanisms for voice embeddings is particularly noteworthy, as it enhances the contextual relationship between the enrollment and mixed speech.
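As a rough illustration of the dropout regime discussed above, the snippet below sketches independent per-modality dropout during training; the zero-masking scheme and the 0.8 rate are assumptions made for illustration rather than the authors' exact implementation.

```python
# Hedged sketch of modality dropout over the four identity cues; illustrative only.
import torch


def apply_modality_dropout(embeddings: dict, p_drop: float = 0.8,
                           training: bool = True) -> dict:
    """Zero out each modality embedding independently with probability p_drop."""
    if not training:
        return embeddings
    out = {}
    for name, emb in embeddings.items():
        keep = torch.rand(1).item() >= p_drop
        out[name] = emb if keep else torch.zeros_like(emb)
    return out


cues = {
    "lip": torch.randn(1, 75, 512),         # frame-synchronous lip features
    "voice": torch.randn(1, 256),           # enrollment speaker embedding
    "face": torch.randn(1, 512),            # static face identity embedding
    "expression": torch.randn(1, 75, 128),  # frame-wise expression features
}
dropped = apply_modality_dropout(cues)
print({name: bool(emb.abs().sum() > 0) for name, emb in dropped.items()})
```

In practice one would likely guarantee that at least one cue survives each training example; the sketch omits that safeguard for brevity.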
The experiments are comprehensive, utilizing a well-defined dataset from the 3rd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-3). The paper presents clear results that illustrate the performance of different multimodal configurations under both ideal and challenging conditions. The findings indicate that while a full multimodal ensemble performs best under zero dropout conditions, the robustness of the model significantly improves when trained with high dropout rates. This highlights the practical implications of the research, as it addresses real-world scenarios where modality dropout is common.
The implementation details are adequately described, including the architecture of the baseline system, the training process, and the evaluation metrics used. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the model and datasets would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a specific dataset, which may not fully represent the diversity of real-world audio-visual scenarios. Additionally, while the paper discusses the robustness of the model under dropout conditions, it does not explore the impact of other potential noise factors or environmental variables that could affect performance. The sensitivity of the expression embedding to modality availability could also be a concern in practical applications.
The findings of this research have significant implications for the development of robust audio-visual systems in various applications, including telecommunications, assistive technologies, and interactive systems. By emphasizing the importance of training strategies that account for real-world imperfections, this work contributes to the advancement of more reliable multimodal systems that can operate effectively in unpredictable environments.
Voice activity detection (VAD) is essential in speech-based systems, but traditional methods detect only speech presence without identifying speakers. Target-speaker VAD (TS-VAD) extends this by detecting the speech of a known speaker using a short enrollment utterance, but this assumption fails in open-domain scenarios such as meetings or customer service calls, where the main speaker is unknown. We propose EEND-SAA, an enrollment-less, streaming-compatible framework for main-speaker VAD, which identifies the primary speaker without prior knowledge. Unlike TS-VAD, our method determines the main speaker as the one who talks more steadily and clearly, based on speech continuity and volume. We build our model on EEND using two self-attention attractors in a Transformer and apply causal masking for real-time use. Experiments on multi-speaker LibriSpeech mixtures show that EEND-SAA reduces main-speaker DER from 6.63% to 3.61% and improves F1 from 0.9667 to 0.9818 over the SA-EEND baseline, achieving state-of-the-art performance under conditions involving speaker overlap and noise.
Primary: National Yang Ming Chiao Tung University
All Institutions: Institute of Electrical and Computer Engineering, National Yang Ming Chiao Tung University
The paper presents a novel framework for enrollment-less main speaker voice activity detection using self-attention attractors, significantly advancing the state of the art in multi-speaker scenarios. The technical contributions, particularly the dual self-attention mechanism and real-time processing capabilities, position this work as a meaningful addition to the field of audio processing and speech technology.
The methodology presented in EEND-SAA is innovative in its approach to voice activity detection by eliminating the need for speaker enrollment, which is a significant limitation in traditional systems. The use of self-attention attractors within a Transformer framework is a novel contribution that enhances the model's ability to distinguish the main speaker from background noise and overlapping speech. The dual self-attention mechanism is particularly noteworthy, as it allows the model to focus on both the main speaker and background speakers simultaneously, improving detection accuracy. The incorporation of causal masking for real-time processing further enhances the practicality of the model in interactive environments. Overall, the methodology is sound and builds effectively on existing work in the field, particularly EEND and its variants.
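A minimal sketch of the two-attractor idea with a causal frame encoder is given below; the layer sizes, the learned-attractor readout, and the sigmoid activity head are assumptions for illustration and do not reproduce the exact EEND-SAA architecture.

```python
# Hedged sketch: causal Transformer frame encoder plus two learned attractors.
import torch
import torch.nn as nn


class MainSpeakerVAD(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Two learned attractors: one for the main speaker, one for the background.
        self.attractors = nn.Parameter(torch.randn(2, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, dim); the causal mask keeps inference streamable.
        frames = feats.size(1)
        causal = torch.triu(torch.full((frames, frames), float("-inf")), diagonal=1)
        enc = self.encoder(feats, mask=causal)                       # (B, T, D)
        # Frame-wise activity: similarity between each frame and each attractor.
        logits = torch.einsum("btd,kd->btk", enc, self.attractors)   # (B, T, 2)
        return torch.sigmoid(logits)


model = MainSpeakerVAD()
activity = model(torch.randn(2, 100, 256))
print(activity.shape)   # torch.Size([2, 100, 2]) -> main vs. background per frame
```

The causal (upper-triangular) mask is what makes the design streaming-compatible: each frame's decision depends only on past and current frames.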
The experimental evaluation is robust, utilizing the LibriSpeech dataset to simulate real-world conditions with overlapping speakers and noise. The authors provide comprehensive results that demonstrate significant improvements in main-speaker detection metrics, such as DER and F1 scores, compared to baseline models. The ablation studies are particularly useful in highlighting the contributions of various components, such as positional encoding and the dual attractor design. However, the paper could benefit from more extensive comparisons with additional state-of-the-art methods to further validate the performance claims.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which aids in reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for other researchers to easily replicate the results. Including such resources would significantly enhance the reproducibility of the findings.
One limitation of the proposed approach is its reliance on the quality of the input audio. In extremely noisy environments or with significant overlap from multiple speakers, the model's performance may degrade. Additionally, while the model shows promise in real-time applications, the computational efficiency and latency in practical deployments have not been extensively discussed. The paper also does not address potential biases in the training data that could affect the model's generalizability across different speaker demographics.
The implications of this research are significant for various applications in speech recognition, customer service, and interactive voice systems, where identifying the main speaker in noisy environments is crucial. The enrollment-less approach can facilitate more flexible and user-friendly systems, making them more accessible in real-world scenarios. This work could lead to advancements in smart assistants, meeting transcription tools, and other audio processing applications, ultimately improving user experience and system efficiency.
Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end, approximately 54× less compute (25× architecture-only) than spectrogram-domain separators like AudioSep, while remaining fully bitstream-compatible.
The main contribution of this paper is the introduction of CodecSep, a novel NAC-based model for universal sound separation that combines efficient audio processing with text-driven control, outperforming existing methods in both fidelity and computational efficiency. This work represents a significant advancement in the field of audio processing, particularly for applications requiring real-time performance on resource-constrained devices.
The paper introduces CodecSep, a novel approach to universal sound separation that leverages neural audio codecs (NACs) and a transformer-based masker modulated by text embeddings. This combination allows for efficient, on-device sound separation while maintaining high fidelity and perceptual quality. The use of Feature-wise Linear Modulation (FiLM) to condition the transformer on text embeddings is an innovative aspect that enhances the model's ability to interpret prompts semantically. The methodology is well-structured, with a clear rationale for the design choices, particularly the decision to operate in the latent space of the codec rather than the spectrogram domain. This choice significantly reduces computational requirements and improves separation performance.
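The FiLM conditioning described here reduces to a per-channel scale and shift predicted from the text embedding; the sketch below shows that pattern under assumed dimensions (a 512-dimensional CLAP-style prompt embedding modulating 256-channel masker features) and is not CodecSep's actual code.

```python
# Hedged sketch of FiLM conditioning on a text embedding; dimensions are assumptions.
import torch
import torch.nn as nn


class FiLM(nn.Module):
    def __init__(self, text_dim: int = 512, channels: int = 256):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)

    def forward(self, features: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, channels); text_emb: (batch, text_dim)
        gamma, beta = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)


film = FiLM()
codec_latents = torch.randn(2, 150, 256)      # frames of codec latent features
prompt_emb = torch.randn(2, 512)              # CLAP-style text embedding of the prompt
print(film(codec_latents, prompt_emb).shape)  # torch.Size([2, 150, 256])
```

Because the modulation acts on codec latents rather than spectrogram frames, the conditioning cost stays small relative to the overall compute budget quoted in the abstract.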
The authors conduct a comprehensive evaluation across multiple datasets, including both in-domain and open-domain benchmarks. The results demonstrate that CodecSep outperforms existing models like AudioSep in separation fidelity (measured by SI-SDR) while remaining competitive in perceptual quality (ViSQOL). The experiments are well-designed, with careful attention to matched training and prompt protocols, and the inclusion of ablation studies helps to isolate the contributions of different components of the model. The reported performance gains are substantial, indicating the effectiveness of the proposed approach.
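For reference, SI-SDR, the fidelity metric quoted above, has a standard closed form: the estimate is projected onto the reference to obtain a scaled target, and the ratio of target energy to residual energy is reported in dB. The snippet below computes it in that textbook form; it is not the authors' evaluation script.

```python
# Standard scale-invariant SDR computation (textbook definition).
import torch


def si_sdr(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Both tensors: (batch, samples). Returns SI-SDR in dB per item."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the scaled target signal.
    scale = (estimate * reference).sum(-1, keepdim=True) / (
        (reference ** 2).sum(-1, keepdim=True) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (target ** 2).sum(-1) / ((noise ** 2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)


ref = torch.randn(1, 16000)
est = ref + 0.1 * torch.randn(1, 16000)
print(si_sdr(est, ref))   # roughly 20 dB for 10% additive noise
```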
The paper mentions that supplementary code will be provided to facilitate reproducibility, which is a positive aspect. However, specific implementation details, hyperparameters, and training configurations are deferred to the appendix, which may pose challenges for full reproducibility unless the supplementary materials are made readily accessible.
The paper acknowledges several limitations, including the modest scale of training data and prompt diversity, which could affect generalization. Additionally, it notes that while the model is robust to synonymic paraphrases, it has not been tested with prompts that include explicit temporal structures. The perceptual quality of sound effects (SFX) in some cases trails behind the best competing scores, indicating room for improvement.
The ability to perform efficient, text-guided sound separation has significant implications for various applications, including media production, assistive technologies, and real-time audio editing. The model's efficiency makes it suitable for deployment on edge devices, which could democratize access to advanced audio processing capabilities.
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to classify abnormal heart sounds indicative of CVDs using synchronised phonocardiogram (PCG) and electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However, state-of-the-art architectures remain underutilised due to the limited availability of synchronised and multichannel datasets. Augmented datasets and pre-trained models provide a pathway to overcome these limitations, enabling transformer-based architectures to be trained effectively. This work combines traditional signal processing with denoising diffusion models, WaveGrad and DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based classifier on multimodal and multichannel heart sound datasets. The approach achieves state-of-the-art performance. On the Computing in Cardiology (CinC) 2016 dataset of single-channel PCG, accuracy, unweighted average recall (UAR), sensitivity, specificity and Matthews correlation coefficient (MCC) reach 92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%, 92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR, sensitivity, specificity and MCC, respectively. Using a wearable vest dataset consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR, 86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results demonstrate the effectiveness of transformer-based models for CVD detection when supported by augmented datasets, highlighting their potential to advance multimodal and multichannel heart sound classification.
Primary: inst2 Kayapanda Mandana
All Institutions: inst1 Yue Rong, inst1 Milan Marocchi, inst2 Kayapanda Mandana, inst1 Matthew Fynn
This paper effectively combines advanced machine learning techniques with practical applications in cardiovascular health, showcasing a novel approach to heart sound classification that could significantly impact clinical practices. The integration of synthetic data generation methods addresses critical data limitations, making it a valuable contribution to the field of medical machine learning.
The paper presents a comprehensive methodology that integrates traditional signal processing with advanced deep learning techniques, specifically leveraging Wav2Vec 2.0 and diffusion models (WaveGrad and DiffWave) for synthetic and augmented biosignal generation. The approach is innovative in that it addresses the scarcity of high-quality, synchronized datasets for heart sound classification by generating synthetic data, which is a significant contribution to the field. The methodology is well-structured, with clear steps for data augmentation, model training, and evaluation, although some details regarding hyperparameter tuning and model architecture could be elaborated further for clarity.
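As an illustration of the fine-tuning stage, the sketch below runs one training step of a Hugging Face Wav2Vec 2.0 sequence classifier on placeholder waveforms standing in for (augmented) PCG segments; the checkpoint name, two-class label set, and random audio are assumptions, and the WaveGrad/DiffWave augmentation is assumed to have been applied offline.

```python
# Hedged sketch of one fine-tuning step for a Wav2Vec 2.0 heart-sound classifier.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

checkpoint = "facebook/wav2vec2-base"                      # illustrative checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder batch: two 5-second recordings at 16 kHz standing in for PCG segments.
waveforms = [torch.randn(80000).numpy(), torch.randn(80000).numpy()]
labels = torch.tensor([0, 1])                              # 0 = normal, 1 = abnormal

inputs = extractor(waveforms, sampling_rate=16000, return_tensors="pt", padding=True)
outputs = model(**inputs, labels=labels)

outputs.loss.backward()                                    # one fine-tuning step
print(outputs.logits.shape)                                # torch.Size([2, 2])
```

A real training loop would add an optimizer, scheduler, and the diffusion-augmented data pipeline; the point here is only the shape of the classification interface.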
The experimental section is robust, utilizing multiple datasets (CinC 2016 and a wearable vest dataset) to validate the proposed models. The results demonstrate state-of-the-art performance on the CinC dataset and near-state-of-the-art results on the multichannel vest dataset. The metrics used for evaluation (accuracy, UAR, sensitivity, specificity, and MCC) are appropriate for the classification task, and the use of cross-validation enhances the reliability of the findings. However, the paper could benefit from a more detailed comparison with existing methods beyond just accuracy metrics to provide a clearer context of its contributions.
The paper provides a reasonable level of detail regarding the implementation, including the hardware used and the data preprocessing steps. However, it lacks a direct link to a code repository or demo, which would greatly enhance reproducibility. Additionally, while the hyperparameter optimization process is mentioned, specific values and configurations used in the experiments are not fully disclosed, which could hinder replication efforts by other researchers.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world heart sounds. The performance on the multichannel vest dataset is lower than on the CinC dataset, indicating challenges in generalizing the model to noisier, real-world data. Furthermore, the paper does not address potential biases in the datasets used, which could affect the model's applicability across diverse populations.
The implications of this research are significant, particularly in the context of early detection of cardiovascular diseases, which is a leading cause of mortality worldwide. The ability to classify heart sounds accurately and inexpensively could enhance pre-screening methods and facilitate better patient outcomes. The integration of multimodal data (PCG and ECG) also opens avenues for more comprehensive diagnostic tools in cardiology.
Agentic AI has been standardized in industry as a practical paradigm for coordinating specialized models and tools to solve complex multimodal tasks. In this work, we present WeaveMuse, a multi-agent system for music understanding, symbolic composition, and audio synthesis. Each specialist agent interprets user requests, derives machine-actionable requirements (modalities, formats, constraints), and validates its own outputs, while a manager agent selects and sequences tools, mediates user interaction, and maintains state across turns. The system is extendable and deployable either locally, using quantization and inference strategies to fit diverse hardware budgets, or via the HFApi to preserve free community access to open models. Beyond out-of-the-box use, the system emphasizes controllability and adaptation through constraint schemas, structured decoding, policy-based inference, and parameter-efficient adapters or distilled variants that tailor models to MIR tasks. A central design goal is to facilitate intermodal interaction across text, symbolic notation and visualization, and audio, enabling analysis-synthesis-render loops and addressing cross-format constraints. The framework aims to democratize, implement, and make accessible MIR tools by supporting interchangeable open-source models of various sizes, flexible memory management, and reproducible deployment paths.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
WeaveMuse presents a novel multi-agent system for music understanding and generation, emphasizing controllability and accessibility in MIR tasks. The paper's technical contributions are significant, but further empirical validation and methodological details are needed to fully realize its potential impact in the field.
The methodology presented in WeaveMuse is innovative in its approach to creating a multi-agent system for music understanding and generation. The architecture effectively combines various specialized agents that handle different aspects of music processing, such as symbolic composition and audio synthesis. The use of a manager agent to orchestrate these interactions is a significant contribution, allowing for a seamless user experience across modalities. The emphasis on controllability through structured decoding and parameter-efficient adapters is commendable, as it addresses a critical need for flexibility in music information retrieval (MIR) tasks. However, the paper could benefit from a more detailed explanation of the specific algorithms employed within the agents and how they interact with one another.
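The manager-routes-to-specialist pattern the review highlights can be illustrated with a toy registry, shown below; the agent names, request schema, and routing rule are invented for illustration and are not WeaveMuse's actual interfaces.

```python
# Toy illustration of a manager agent dispatching to specialist agents.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    task: str                                   # e.g. "analyze", "compose", "synthesize"
    modalities: List[str] = field(default_factory=list)


class ManagerAgent:
    def __init__(self):
        self.specialists: Dict[str, Callable[[Request], str]] = {}

    def register(self, task: str, agent: Callable[[Request], str]) -> None:
        self.specialists[task] = agent

    def handle(self, request: Request) -> str:
        agent = self.specialists.get(request.task)
        if agent is None:
            return f"No specialist registered for task '{request.task}'"
        return agent(request)


manager = ManagerAgent()
manager.register("compose", lambda r: f"symbolic score covering {r.modalities}")
manager.register("synthesize", lambda r: "rendered audio from the latest score")

print(manager.handle(Request(task="compose", modalities=["text", "symbolic"])))
```

In the actual framework the manager additionally validates outputs and carries state across turns, which a stateless registry like this does not capture.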
The paper provides a conceptual framework and initial behaviors under constrained settings, but lacks extensive experimental validation. While it outlines the deployment modes and resource management strategies, concrete results demonstrating the performance of the system in real-world scenarios are limited. The authors mention using various models and tools, but without empirical data or comparative analysis, it is challenging to assess the effectiveness of WeaveMuse against existing systems. Future work should include rigorous experiments with quantitative metrics to substantiate the claims made.
The authors have made efforts to ensure reproducibility by providing an open-source framework and a public repository. The identical planner and prompt templates for local and hosted configurations are a positive aspect for users looking to replicate the results. However, the paper could improve by including more detailed instructions on setting up the environment and running experiments, as well as providing sample datasets for testing.
The paper acknowledges several limitations, such as the dependency on the underlying large language model's capabilities and potential performance degradation with smaller models. Additionally, the orchestration of tools and agentic prompting may not always yield the expected results, which could hinder user experience. The authors also note that the system is still a work in progress, indicating that further refinements are necessary.
WeaveMuse has the potential to significantly impact the field of music information retrieval and generation by democratizing access to advanced music processing tools. Its open-source nature and support for various models could foster community engagement and innovation. The framework's ability to facilitate intermodal interaction across text, symbolic notation, and audio could lead to new applications in music education, composition, and analysis.
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
Primary: Work does not relate to position at Amazon
All Institutions: Work does not relate to position at Amazon
FuseCodec introduces a novel framework for speech tokenization that effectively integrates acoustic, semantic, and contextual signals, significantly advancing the state of the art in speech processing. The combination of innovative methodologies and strong empirical results positions this work as a meaningful contribution to the field of machine learning and audio processing.
The methodology proposed in FuseCodec is innovative, particularly in its approach to integrating semantic and contextual features into the encoder latent space. The three techniques—Latent Representation Fusion, Global Semantic-Contextual Supervision, and Temporally Aligned Contextual Supervision—are well-defined and address significant challenges in speech tokenization. The use of strong cross-modal alignment and globally informed supervision is a notable advancement, enhancing the robustness of the model. However, the paper could benefit from a more detailed explanation of the implementation specifics and how these techniques interact in practice.
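To ground two of these ideas, the sketch below shows latent fusion by projection-and-addition plus a globally pooled, broadcast cosine supervision target; the projection sizes and loss form are assumptions and do not reproduce FuseCodec's implementation.

```python
# Hedged sketch of latent representation fusion and global semantic-contextual supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentFusion(nn.Module):
    def __init__(self, dim: int = 256, sem_dim: int = 768, ctx_dim: int = 768):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)
        self.ctx_proj = nn.Linear(ctx_dim, dim)

    def forward(self, acoustic, semantic, contextual):
        # All inputs: (batch, frames, *_dim); fuse by projection and addition.
        return acoustic + self.sem_proj(semantic) + self.ctx_proj(contextual)


def global_supervision_loss(token_emb, semantic, contextual, fusion):
    """Pull every token embedding toward a global, broadcast semantic+context target."""
    target = fusion.sem_proj(semantic).mean(1) + fusion.ctx_proj(contextual).mean(1)
    target = target.unsqueeze(1).expand_as(token_emb)        # broadcast over time
    return 1 - F.cosine_similarity(token_emb, target, dim=-1).mean()


fusion = LatentFusion()
acoustic = torch.randn(2, 100, 256)
semantic = torch.randn(2, 100, 768)      # e.g. self-supervised speech features
contextual = torch.randn(2, 100, 768)    # e.g. text-LM features aligned to frames
fused = fusion(acoustic, semantic, contextual)
loss = global_supervision_loss(fused, semantic, contextual, fusion)
print(fused.shape, float(loss))
```

The third technique, temporally aligned contextual supervision, would replace the global pooled target with a locally windowed match between contextual and speech tokens, which this sketch does not attempt.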
The experiments conducted on the LibriSpeech dataset demonstrate the effectiveness of FuseCodec, achieving state-of-the-art results in multiple metrics such as transcription accuracy and perceptual quality. The comparison with existing models like EnCodec and SpeechTokenizer is thorough, showcasing clear improvements. However, the paper lacks a comprehensive analysis of the statistical significance of the results, which would strengthen the claims made about performance improvements.
The availability of code and pretrained models on GitHub is a positive aspect, promoting reproducibility. However, the paper should include more detailed instructions on the setup and any dependencies required to run the experiments, as well as the specific configurations used for training and evaluation.
One limitation is the potential overfitting to the LibriSpeech dataset, as the paper does not discuss generalization to other datasets or real-world applications. Additionally, while the methods are promising, the complexity of the model may pose challenges in deployment scenarios where computational resources are limited.
The implications of FuseCodec extend beyond speech tokenization, potentially impacting areas such as speech synthesis and natural language processing. The integration of semantic and contextual cues could enhance various applications, including virtual assistants, transcription services, and accessibility tools for the hearing impaired. The work encourages further exploration of multimodal approaches in audio processing, which could lead to more intuitive human-computer interactions.