Audio super-resolution (SR), i.e., upsampling a low-resolution (LR) waveform to its high-resolution (HR) version, has recently been explored with diffusion and bridge models, yet previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, making the first attempt to unlock audio upsampling beyond 48 kHz and enabling a seamless cascaded SR process that provides higher flexibility for audio post-production. Comprehensive experimental results on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, and set the first record for any-to-192 kHz audio SR. Demo at https://AudioLBM.github.io/.
Primary: Tsinghua University
All Institutions: Shengshu AI, Tsinghua University (Department of CST)
The main contribution of this work is the introduction of a novel audio super-resolution system utilizing Latent Bridge Models, which significantly enhances the quality of audio upsampling beyond existing methods. The comprehensive methodology, rigorous experimental validation, and potential applications highlight its significance in advancing the field of audio processing.
The paper introduces a novel approach to audio super-resolution using Latent Bridge Models (LBMs), which compress audio waveforms into a continuous latent space. The methodology is well-structured, leveraging frequency-aware LBMs and a cascaded design to enhance the upsampling process beyond 48 kHz. The integration of informative priors from low-resolution (LR) signals into the generative framework is innovative, allowing for better quality audio synthesis. The paper also presents two prior augmentation strategies to mitigate cascading errors, which is a thoughtful addition to the overall framework. The use of variational autoencoders (VAEs) for compression and the detailed explanation of the bridge process further demonstrate the robustness of the proposed methodology.
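To make the bridge process concrete, here is a minimal PyTorch sketch of what a latent-bridge training step could look like, assuming a standard Brownian-bridge interpolation between the HR latent and the LR-derived prior latent and an x0-prediction objective; the VAE interface, the noise scale, and the frequency-conditioning arguments are illustrative assumptions rather than the paper's exact design.

```python
import torch

def bridge_sample(x_hr, x_lr, t, sigma=1.0):
    """Sample x_t from a Brownian bridge pinned at x_hr (t=0) and x_lr (t=1):
        x_t = (1 - t) * x_hr + t * x_lr + sigma * sqrt(t * (1 - t)) * eps
    Latents are (B, C, T); t is (B,) in [0, 1]."""
    t = t.view(-1, 1, 1)
    mean = (1.0 - t) * x_hr + t * x_lr
    std = sigma * torch.sqrt(t * (1.0 - t))
    return mean + std * torch.randn_like(x_hr)

def training_step(model, vae, wav_hr, wav_lr, f_in, f_out):
    """One hedged training step: encode to latents, sample a bridge state,
    and regress the HR latent (x0-prediction is an assumed objective)."""
    with torch.no_grad():
        x_hr = vae.encode(wav_hr)   # assumed VAE interface
        x_lr = vae.encode(wav_lr)
    t = torch.rand(x_hr.shape[0], device=x_hr.device).clamp(1e-4, 1 - 1e-4)
    x_t = bridge_sample(x_hr, x_lr, t)
    # Frequency-aware conditioning: prior/target sampling rates as extra inputs.
    pred = model(x_t, t, x_lr, f_in, f_out)
    return torch.mean((pred - x_hr) ** 2)
```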
The experimental setup is comprehensive, utilizing multiple benchmark datasets (VCTK, ESC-50, Song-Describer) and internal test sets to evaluate the performance of the proposed method. The results indicate a significant improvement over existing methods, achieving state-of-the-art performance in both objective and perceptual quality metrics. The paper effectively compares its results against various baselines, providing clear evidence of the advantages of the proposed approach. The ablation studies conducted further validate the contributions of each component of the model.
The paper includes sufficient details regarding the training setup, model architecture, and evaluation metrics, which enhances reproducibility. However, the absence of a public code repository limits the ability for independent verification of results. The authors mention a demo URL, which may provide some interactive insights, but a complete code release would be beneficial for the community.
While the proposed method shows promising results, it is important to note that the reliance on high-quality training data may limit its applicability in scenarios where such data is scarce. Additionally, the paper acknowledges potential misuse of the technology, such as unauthorized synthesis of audio, which raises ethical considerations. The cascading approach, while innovative, may still introduce artifacts that could affect the final output quality if not managed properly.
The implications of this research are significant for various applications, including audio restoration, music production, and hearing aids, where high-quality audio is essential. The ability to upscale audio beyond traditional limits opens new avenues for creative industries and enhances user experiences in audio consumption. However, the ethical concerns regarding misuse must be addressed to prevent potential negative impacts on the industry.
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
The main contribution of this paper is the introduction of UniSS, a unified single-stage framework for expressive speech-to-speech translation that significantly advances the state of the art by integrating large language models and addressing key challenges in the field. The comprehensive methodology, rigorous experimental evaluation, and potential for broader applications underscore its significance in machine learning research.
The paper introduces a unified single-stage framework for expressive speech-to-speech translation (S2ST) called UniSS, which effectively addresses the challenges of preserving speaker identity and emotional style during translation. The methodology is innovative, employing a cross-modal chain-of-thought prompting process that allows for the integration of large language models (LLMs) into the speech domain. The use of a triple-tokenizer strategy to represent different aspects of speech (speaker, linguistic, and semantic tokens) is a notable strength, as it enhances the model's ability to capture and reproduce expressive characteristics. The progressive training strategy is well-structured, emphasizing the importance of data quality and alignment between speech and text modalities.
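As an illustration of how a cross-modal chain-of-thought example might be serialized for a decoder-only text-speech language model, the sketch below lays out one plausible token ordering (source speech, source text, translated text, style, target speech); the tag names, the character-level text, and the ordering are assumptions, not the authors' actual format.

```python
def build_cot_sequence(src_speech, src_text, tgt_text, style, tgt_speech):
    """Assemble one training example as a single token stream.
    Ordering (assumed): source speech -> recognized source text ->
    translated target text -> style -> target speech, forcing the model
    to 'think' through text before emitting speech tokens."""
    seq = []
    seq += ["<speech_src>"] + src_speech + ["</speech_src>"]
    seq += ["<text_src>"] + list(src_text) + ["</text_src>"]   # char-level, for illustration
    seq += ["<text_tgt>"] + list(tgt_text) + ["</text_tgt>"]
    seq += ["<style>"] + style + ["</style>"]
    seq += ["<speech_tgt>"] + tgt_speech + ["</speech_tgt>"]
    return seq
```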
The experimental results are robust, demonstrating that UniSS significantly outperforms existing methods in translation fidelity, speech quality, and emotional preservation. The authors provide a comprehensive evaluation using both objective metrics (e.g., BLEU scores, prosody preservation) and subjective assessments (e.g., MOS scores), which lend credibility to their claims. The introduction of the UniST dataset, comprising 44.8k hours of expressive S2ST data, is a significant contribution that enhances the reproducibility of results and provides a valuable resource for future research.
The paper includes detailed implementation details, including the training configuration, hyperparameters, and the data construction process for the UniST dataset. The availability of the code and demo enhances reproducibility, allowing other researchers to replicate the findings and build upon the work. However, the complexity of the model and the extensive training data required may pose challenges for some researchers in terms of resource availability.
While the paper presents a strong framework, it acknowledges limitations such as the focus on only Chinese and English languages, which restricts the applicability of the model to multilingual scenarios. Additionally, the reliance on a large-scale dataset may limit the model's accessibility for smaller research teams or institutions. The authors also mention the need for a unified tokenizer to optimize vocabulary size, indicating potential areas for further improvement.
The proposed UniSS framework has significant implications for real-time interpretation, cross-lingual video dubbing, and other applications requiring high-quality expressive S2ST. By effectively preserving emotional style and speaker identity, this work could enhance user experiences in various communication technologies, making it particularly relevant in globalized contexts where multilingual interactions are common.
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
Primary: University of Hamburg
All Institutions: University of Hamburg, CISPA Helmholtz Center for Information Security
The main contribution of this paper is the demonstration that modern speech enhancement systems are vulnerable to adversarial attacks, highlighting the need for robust defenses in the field. The comprehensive methodology and experimental evaluation provide valuable insights into the vulnerabilities of both predictive and generative models, making a meaningful contribution to the ongoing discourse on security in machine learning applications.
The paper proposes a novel approach to adversarial attacks on speech enhancement systems by leveraging psychoacoustic principles to mask adversarial noise. The methodology is well-structured, incorporating a white-box attack scenario where the adversary has full knowledge of the model. The introduction of a psychoacoustic model to optimize the inaudibility of the perturbation is particularly innovative. The authors also provide a detailed description of the optimization process, including the use of projected gradient descent and the incorporation of constraints to balance attack success and audibility. This methodological rigor enhances the credibility of the findings.
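The attack loop described above can be pictured with a short PGD-style sketch; the loss on the enhanced output, the step size, and the per-bin masking threshold (assumed to be precomputed from the clean input by a psychoacoustic model) are placeholders rather than the paper's exact procedure.

```python
import torch

def pgd_attack(model, loss_fn, x, target_text, mask_thresh, steps=200, alpha=1e-3):
    """Projected gradient descent toward a target transcription, projecting the
    perturbation's STFT magnitude under a given psychoacoustic masking threshold
    so it stays inaudible. x: clean waveform (B, T); mask_thresh: (B, F, N)."""
    delta = torch.zeros_like(x, requires_grad=True)
    win = torch.hann_window(512, device=x.device)
    for _ in range(steps):
        out = model(x + delta)              # enhanced speech
        loss = loss_fn(out, target_text)    # push output toward the adversarial target
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient step
            # Projection: clip the perturbation spectrogram under the masking threshold.
            D = torch.stft(delta, n_fft=512, window=win, return_complex=True)
            mag, phase = D.abs(), torch.angle(D)
            mag = torch.minimum(mag, mask_thresh)
            delta.copy_(torch.istft(torch.polar(mag, phase), n_fft=512,
                                    window=win, length=x.shape[-1]))
        delta.grad.zero_()
    return (x + delta).detach()
```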
The experiments are comprehensive, utilizing the EARS-WHAM-v2 dataset, which is appropriate for evaluating speech enhancement systems. The evaluation metrics are well-chosen, including both attack success (WER, POLQA, ESTOI) and perturbation impact (SNR). The results are presented clearly, showing a systematic comparison between predictive and generative models, with insightful analysis on the effects of different configurations. The paper effectively demonstrates the vulnerability of speech enhancement systems to adversarial attacks and highlights the robustness of diffusion models.
The authors provide sufficient details regarding the experimental setup, including model architectures and training procedures. The inclusion of links to the project page and GitHub repository enhances reproducibility. However, the paper could benefit from more explicit instructions on replicating the psychoacoustic model and the adversarial attack process, as these are critical to understanding the full scope of the methodology.
One limitation of the study is that it primarily focuses on white-box attacks, which may not fully represent real-world scenarios where adversaries have limited knowledge of the model. Additionally, while the paper discusses the robustness of diffusion models, it does not explore the potential trade-offs in performance or the computational complexity associated with these models. The generalizability of the findings to other speech enhancement systems beyond those tested is also not addressed.
This research has significant implications for the security of speech enhancement systems, which are increasingly used in applications such as hearing aids and telecommunication devices. By demonstrating vulnerabilities to adversarial attacks, the work raises awareness about the need for more robust models in real-world applications. The findings could inform future research aimed at developing defenses against such attacks, ultimately contributing to safer and more reliable speech processing technologies.
Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). Using AudioMCQ, we achieve first place in the DCASE 2025 Audio-Question-Answering challenge. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance across these benchmarks.
Primary: South China University of Technology
All Institutions: South China University of Technology, Ant Group, Shanghai Jiao Tong University, University of Rochester, The Chinese University of Hong Kong, King's College London
The paper presents a comprehensive approach to improving Large Audio Language Models through innovative dataset construction and training paradigms, addressing critical gaps in the current research landscape. The technical contributions, particularly in the context of audio contribution analysis, position this work as a notable advancement in the field of audio processing and multimodal AI.
The paper introduces a novel dataset, AudioMCQ, which is substantial in size (571k samples) and includes chain-of-thought annotations. The methodology for dataset construction is well-structured, avoiding reliance on existing LALMs to prevent hallucinations. The introduction of Audio-Contribution Filtering to categorize audio contributions into weak and strong subsets is a significant methodological advancement. The proposed post-training paradigms (Weak-to-Strong and Mixed-to-Strong) are innovative and provide a framework for enhancing LALM performance based on audio contributions, which is a relatively unexplored area in the field.
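One plausible reading of Audio-Contribution Filtering is sketched below: a sample counts as weak audio-contribution if the model already answers correctly with the audio withheld. The helper `answer_without_audio` and the exact criterion are assumptions, not the authors' implementation.

```python
def audio_contribution_filter(samples, lalm, answer_without_audio):
    """Split MCQ samples into weak vs. strong audio-contribution subsets.

    A sample is 'weak' if a text-only pass (audio withheld or replaced by
    silence) already yields the correct choice, i.e. the audio contributes
    little; otherwise it is 'strong'. `answer_without_audio` is an assumed
    helper that queries the model with the question and options only."""
    weak, strong = [], []
    for s in samples:
        pred = answer_without_audio(lalm, s["question"], s["options"])
        (weak if pred == s["answer"] else strong).append(s)
    return weak, strong
```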
The experiments are robust, demonstrating the effectiveness of the proposed methods through competitive results in the DCASE 2025 Audio-Question-Answering challenge, where the authors achieved first place. The performance metrics across multiple benchmarks (MMAU-test-mini, MMAU, MMAR, MMSU) indicate that the proposed strategies lead to state-of-the-art results. The systematic evaluation of the zero audio-contribution phenomenon adds depth to the experimental design, showcasing the authors' thorough understanding of the challenges in LALM training.
The paper provides sufficient details regarding the dataset construction pipeline, training strategies, and evaluation protocols, which enhances reproducibility. However, the absence of a public repository or demo URL limits the ease with which others can replicate the results.
One limitation is the reliance on the quality of the audio data and its annotations, which can introduce biases or inaccuracies. Additionally, while the dataset is large, the diversity of audio types and contexts may still be limited, potentially affecting generalizability. The paper does not address the potential computational costs associated with the proposed training paradigms.
The findings have significant implications for the development of more effective multimodal AI systems, particularly in audio understanding tasks. The methodologies proposed could be applied to other domains where audio and text modalities intersect, potentially influencing future research directions in LALMs and related fields.
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most assume a fixed target sampling rate, requiring external resampling that leads to redundant computations. We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the input bandwidth with a time-frequency dual-path encoder and reconstructs missing high-frequency bands through a light decoder with frequency extension queries. It enables efficient and universal restoration across arbitrary input-output rates without redundant resampling. To support adversarial training across diverse rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details. As a single model across sampling rates, TF-Restormer consistently outperforms prior systems, achieving balanced gains in signal fidelity and perceptual quality, while its streaming mode maintains competitive effectiveness for real-time applications. Code and demos are available at https://tf-restormer.github.io/demo.
The main contribution of this paper is the introduction of TF-Restormer, a novel speech restoration model that effectively addresses the challenges of restoring speech signals under various distortions while maintaining efficiency across different sampling rates. The comprehensive methodology, rigorous experimental evaluation, and potential for real-world applications underscore its significance in the field of audio processing and machine learning.
The paper presents TF-Restormer, an innovative encoder-decoder architecture that utilizes a time-frequency dual-path approach to address the challenges of speech restoration under various distortions. The methodology is well-structured, focusing on the input bandwidth while employing a lightweight decoder for high-frequency reconstruction. The introduction of a shared sampling-frequency-independent (SFI) STFT discriminator for adversarial training is a notable contribution, allowing the model to operate efficiently across different sampling rates without the need for redundant resampling. The use of a scaled log-spectral loss to stabilize optimization under severe conditions is also a significant methodological advancement. Overall, the methodology is robust and addresses key limitations in existing approaches.
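For reference, a plain log-spectral magnitude loss looks like the sketch below; the specific scaling that gives the paper's "scaled log-spectral loss" its name is not reproduced here, so treat the unweighted form and the STFT settings as assumptions.

```python
import torch

def log_spectral_loss(pred_wav, target_wav, n_fft=1024, hop=256, eps=1e-5):
    """Compare log-magnitude STFTs of predicted and target speech.
    The eps keeps the loss finite for near-silent bins; any additional
    scaling/emphasis used in the paper is omitted here."""
    win = torch.hann_window(n_fft, device=pred_wav.device)
    P = torch.stft(pred_wav, n_fft, hop, window=win, return_complex=True).abs()
    T = torch.stft(target_wav, n_fft, hop, window=win, return_complex=True).abs()
    return torch.mean(torch.abs(torch.log(P + eps) - torch.log(T + eps)))
```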
The experiments are thorough, utilizing diverse datasets such as UNIVERSE and VCTK for evaluation across various tasks, including denoising and super-resolution. The results demonstrate consistent improvements over prior systems in terms of signal fidelity and perceptual quality, with detailed comparisons against state-of-the-art models. The use of multiple metrics (PESQ, SDR, LSD, etc.) provides a comprehensive assessment of the model's performance. However, the paper could benefit from additional ablation studies to further validate the impact of individual components within the architecture.
The authors provide a clear implementation strategy, including training details, model configurations, and the use of publicly available datasets. The availability of code and demos enhances reproducibility, although the lack of a direct GitHub repository link may hinder ease of access for some researchers. The detailed training pipeline and parameter settings are well-documented, which is a positive aspect for reproducibility.
While the paper presents a strong framework, it does not extensively address potential limitations, such as the computational cost associated with the dual-path architecture at higher sampling rates. Additionally, the model's performance in real-world scenarios could be further validated with more extensive testing on diverse datasets beyond the synthetic and controlled environments used in the experiments.
The TF-Restormer model has significant implications for real-time speech restoration applications, particularly in scenarios involving low-bandwidth communication and various distortions. Its ability to operate across different sampling rates without redundant resampling makes it a practical solution for real-world applications. The advancements in spectral prediction and adversarial training could also inspire further research in audio processing and enhancement.
Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy that explicitly regularizes pitch redundancy between the acoustic prompt and melody input. Furthermore, we incorporate a lightweight encoder-only Singing Voice Transcription (SVT) module to align acoustic tokens with pitch and duration, offering fine-grained frame-level supervision. Experimental results demonstrate that CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.
Primary: National University of Singapore
All Institutions: National University of Singapore (School of Computing)
CoMelSinger presents a novel framework for zero-shot singing voice synthesis that effectively addresses melody control and prosody leakage. The combination of innovative methodology and promising experimental results positions this work as a significant contribution to the field of machine learning in audio synthesis.
The methodology presented in CoMelSinger is innovative, leveraging a non-autoregressive MaskGCT architecture to replace traditional text inputs with discrete lyric and pitch tokens. This approach effectively addresses the challenge of prosody leakage by introducing a coarse-to-fine contrastive learning strategy, which regularizes pitch redundancy. The incorporation of a lightweight encoder-only Singing Voice Transcription (SVT) module for frame-level supervision is a significant enhancement, allowing for better alignment of acoustic tokens with pitch and duration. Overall, the methodology is well-structured and demonstrates a clear understanding of the challenges in singing voice synthesis.
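The anti-leakage regularizer can be pictured as an InfoNCE-style objective such as the sketch below, where the timbre prompt is pulled toward another prompt segment of the same singer and pushed away from pitch embeddings; the paper's actual coarse-to-fine formulation is richer than this single-level version, so the interfaces and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def prompt_pitch_contrastive(prompt_emb, pos_prompt_emb, pitch_emb, tau=0.07):
    """InfoNCE-style regularizer discouraging pitch information in the timbre
    prompt: the positive is another segment of the same prompt, the negatives
    are pitch embeddings from the batch. All inputs are (B, D)."""
    a = F.normalize(prompt_emb, dim=-1)
    p = F.normalize(pos_prompt_emb, dim=-1)
    n = F.normalize(pitch_emb, dim=-1)
    pos = torch.sum(a * p, dim=-1, keepdim=True) / tau   # (B, 1)
    neg = a @ n.t() / tau                                 # (B, B)
    logits = torch.cat([pos, neg], dim=-1)
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)
```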
The experimental setup is robust, with comprehensive evaluations against competitive baselines. The results indicate notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability, which are critical metrics in the field of singing voice synthesis. However, the paper could benefit from a more detailed analysis of the datasets used and the specific metrics employed to quantify improvements, as this would enhance the credibility of the findings.
While the paper outlines the methodology and experimental results, it lacks sufficient implementation details that would facilitate reproducibility. Key aspects such as hyperparameter settings, data preprocessing steps, and code availability are not mentioned, which could hinder other researchers from replicating the study.
One limitation is the potential overfitting to the training data, particularly in the context of zero-shot learning. The paper does not address how the model performs with unseen data outside of the training distribution. Additionally, the reliance on a discrete token-based approach may limit the expressiveness of the generated singing voices compared to continuous representations.
The advancements made in CoMelSinger have the potential to significantly impact the fields of music technology and artificial intelligence, particularly in applications such as music composition, voice cloning, and interactive entertainment. The ability to generate expressive singing voices with structured control could lead to new creative tools for artists and musicians, enhancing the accessibility of music production.
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology, Ho Chi Minh City University of Technology, AITech Lab
The main contribution of this paper is the introduction of MAGE, a novel Masked Audio Generative Enhancer that utilizes a scarcity-aware coarse-to-fine masking strategy and a lightweight corrector module, achieving state-of-the-art performance in speech enhancement with a significantly reduced model size. This work represents a meaningful advancement in the field of generative speech enhancement, balancing efficiency and perceptual quality, and setting a foundation for future research in practical applications.
The methodology presented in this paper is innovative, particularly with the introduction of the scarcity-aware coarse-to-fine masking strategy. This approach addresses the limitations of traditional masked generative models by prioritizing token frequencies, which enhances both efficiency and generalization. The inclusion of a lightweight corrector module for low-confidence predictions is a significant advancement, allowing for iterative refinement of predictions. The architecture is built upon established models like BigCodec and Qwen2.5-0.5B, yet the selective layer retention to achieve a compact model size of 200M parameters is a notable achievement in balancing performance and efficiency.
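A toy version of the scarcity-aware schedule is sketched below: positions holding frequent tokens are assigned to early decoding steps and rare ones to later refinement steps. The corpus frequency table, step count, and equal-size chunking are illustrative assumptions, not MAGE's actual schedule.

```python
import torch

def scarcity_schedule(token_ids, corpus_freq, num_steps=8):
    """Assign each masked position to a decoding step: positions whose
    (ground-truth or currently predicted) tokens are frequent in the corpus
    are revealed early, rare ones in later refinement steps.
    token_ids: (L,); corpus_freq: dict token_id -> count.
    Returns one index tensor per step."""
    freqs = torch.tensor([corpus_freq.get(int(t), 0) for t in token_ids],
                         dtype=torch.float)
    order = torch.argsort(freqs, descending=True)   # frequent first
    return list(torch.chunk(order, num_steps))
```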
The experimental evaluation is robust, utilizing well-established benchmarks such as the DNS Challenge and noisy LibriSpeech. The results demonstrate that MAGE outperforms larger baselines in terms of perceptual quality and word error rate, which is critical for downstream applications like speech recognition. The paper effectively compares MAGE against both discriminative and generative models, providing comprehensive metrics that highlight its advantages. However, the reliance on simulated distortions raises questions about its real-world applicability.
The paper provides sufficient details regarding the implementation, including model architecture, training parameters, and evaluation metrics. However, the lack of a publicly available code repository limits full reproducibility. The authors should consider releasing their code to facilitate further research and validation of their findings.
While MAGE shows strong results, its performance may be limited by its training on simulated data, which could affect generalization to real-world scenarios. Additionally, the evaluation metrics focus primarily on perceptual quality and WER, potentially overlooking other important aspects of speech enhancement like latency and computational efficiency in practical applications.
The advancements presented in this paper could have significant implications for real-world applications in speech enhancement, particularly in environments with background noise or reverberation. The compact design of MAGE makes it suitable for deployment in resource-constrained settings, which is crucial for applications like mobile devices and real-time communication systems. The potential for future extensions to multilingual and streaming scenarios further enhances its relevance in diverse applications.
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
The main contribution of this paper is the introduction of TTScore, a novel evaluation framework for synthesized speech that provides targeted assessments of intelligibility and prosody through conditional prediction of discrete speech tokens. This work significantly advances the field by addressing the limitations of existing metrics and aligning more closely with human perceptions of speech quality.
The paper introduces TTScore, a novel evaluation framework that utilizes conditional prediction of discrete speech tokens to assess intelligibility and prosody in synthesized speech. The methodology is well-structured, employing two distinct sequence-to-sequence models tailored for intelligibility (TTScore-int) and prosody (TTScore-pro). This targeted approach addresses the limitations of existing metrics, such as WER and F0-RMSE, by providing reference-free evaluations that align more closely with human perception. The use of discrete speech tokens derived from advanced models like HuBERT and FACodec adds robustness to the evaluation process.
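The scoring rule can be summarized as an average token log-likelihood under a text-conditioned predictor, as in the sketch below; the predictor interface and tokenization are assumptions, and the same routine would be run once with content tokens (TTScore-int) and once with prosody tokens (TTScore-pro).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tt_score(predictor, text_ids, speech_tokens):
    """Average log-likelihood of an utterance's discrete speech tokens under a
    seq2seq predictor conditioned on the input text. It is assumed that
    predictor(text_ids, speech_tokens[:, :-1]) returns next-token logits of
    shape (1, L-1, V); speech_tokens has shape (1, L)."""
    logits = predictor(text_ids, speech_tokens[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    tgt = speech_tokens[:, 1:]
    token_logp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)   # (1, L-1)
    return token_logp.mean().item()
```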
The experiments are comprehensive, utilizing multiple benchmarks (SOMOS, VoiceMOS, TTSArena) to validate the effectiveness of the proposed metrics. The paper reports strong correlations between TTScore metrics and human judgments of speech quality, outperforming traditional metrics. The evaluation setup is rigorous, comparing TTScore against established baselines and demonstrating its reliability across diverse datasets. However, the paper could benefit from a more detailed analysis of the statistical significance of the results.
The authors provide a GitHub repository with code and pre-trained models, enhancing the reproducibility of their work. Implementation details are sufficiently described, including model architectures and training procedures. However, the paper lacks specific hyperparameter settings and training configurations that could further aid in reproducing the results.
One limitation is the reliance on existing datasets for evaluation, which may not encompass all variations of synthesized speech. Additionally, while TTScore shows improved correlations with human judgments, it may still be sensitive to the quality of the underlying speech synthesis systems. The paper does not address potential biases in the datasets used for training and evaluation.
The proposed evaluation framework has significant implications for the field of speech synthesis, offering a more nuanced understanding of intelligibility and prosody. This can lead to improved speech generation systems, enhancing applications in assistive technologies, human-computer interaction, and language learning. The methodology could also inspire further research into targeted evaluation metrics in other domains of machine learning.
The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this critical region. This omission matters: detection models trained on high-resource languages collapse when applied to SEA, due to mismatches in synthesis quality, language-specific characteristics, and data scarcity. To close this gap, we present SEA-Spoof, the first large-scale Audio Deepfake Detection (ADD) dataset dedicated to SEA languages. SEA-Spoof spans 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese. Spoof samples are generated from a diverse mix of state-of-the-art open-source and commercial systems, capturing wide variability in style and fidelity. Benchmarking state-of-the-art detection models reveals severe cross-lingual degradation, but fine-tuning on SEA-Spoof dramatically restores performance across languages and synthesis sources. These results highlight the urgent need for SEA-focused research and establish SEA-Spoof as a foundation for developing robust, cross-lingual, and fraud-resilient detection systems.
Primary: Institute for Infocomm Research (I2R)
All Institutions: The University of New South Wales, Nanyang Technological University, Institute for Infocomm Research (I2R), Alibaba Group
The paper presents SEA-Spoof, the first large-scale dataset for audio deepfake detection in six South-East Asian languages, filling a critical gap in existing resources and demonstrating significant improvements in detection performance through fine-tuning. The comprehensive methodology and experimental validation highlight its importance for advancing research in multilingual audio deepfake detection.
The methodology is robust, focusing on the creation of the SEA-Spoof dataset, which is a significant contribution to the field of audio deepfake detection. The authors carefully selected six South-East Asian languages based on linguistic diversity, population coverage, and practical relevance. The dataset construction is thorough, utilizing a mix of state-of-the-art open-source and commercial systems to generate spoofed audio, which ensures a wide variability in synthesis quality. The systematic pairing of real and spoofed audio for controlled evaluations is a strong methodological aspect that enhances the dataset's utility for future research.
The experimental evaluation is comprehensive, benchmarking multiple state-of-the-art models against the newly created SEA-Spoof dataset. The results clearly demonstrate the cross-lingual performance degradation of existing models when applied to SEA languages, validating the necessity of the dataset. Fine-tuning experiments show significant improvements in model performance, underscoring the dataset's effectiveness as a diagnostic tool and a resource for enhancing detection capabilities.
The paper provides sufficient details on the dataset's construction and the experimental setup, including the models used for benchmarking and the training protocols. However, the lack of a publicly available code repository limits the full reproducibility of the experiments. While the dataset is accessible, the absence of implementation details for the models may hinder other researchers from replicating the study completely.
One limitation is the focus on only six languages, which, while significant, does not cover the entire spectrum of languages in the SEA region. Additionally, the dataset's reliance on specific synthesis systems may introduce biases that could affect generalizability. The paper also mentions plans for future work, indicating that the dataset may evolve, but the current version may not be exhaustive.
The creation of SEA-Spoof has the potential to significantly impact the field of audio deepfake detection, particularly in multilingual contexts. By addressing the gap in resources for SEA languages, the dataset can facilitate the development of more effective detection systems tailored to the unique characteristics of these languages. This work emphasizes the importance of regional focus in AI research and could lead to broader applications in security, fraud detection, and speech technology.
This paper introduces a discrete diffusion model (DDM) framework for text-aligned speech tokenization and reconstruction. By replacing the auto-regressive speech decoder with a discrete diffusion counterpart, our model achieves significantly better reconstruction quality, stronger ASR performance, and faster inference. We provide a comprehensive analysis of applying DDMs to speech reconstruction, examining sampler choices, inference steps, and robustness to length-scale estimation errors. Furthermore, we improve the original TASTE by systematically comparing vector quantization modules, showing that FSQ yields up to a 35% relative WER reduction and +0.14 UT-MOS improvement over RVQ for AR models, while also enhancing DDM performance. Our model generates speech in just 10 denoising steps and even supports single-step generation with only minor quality degradation.
This paper presents a pioneering application of discrete diffusion models to speech tokenization and reconstruction, showcasing substantial improvements in efficiency and quality over traditional autoregressive methods. The comprehensive methodology and experimental validation contribute significantly to the field, paving the way for future research and applications in speech technology.
The paper introduces a novel discrete diffusion model (DDM) framework for speech tokenization and reconstruction, effectively replacing traditional autoregressive decoders with a more efficient DDM approach. The methodology is well-structured, providing a comprehensive analysis of various aspects such as sampler choices and vector quantization techniques. The use of finite scalar quantization (FSQ) as an alternative to residual vector quantization (RVQ) is a significant methodological improvement that enhances performance metrics like WER and UT-MOS. The detailed exploration of inference settings and robustness to length-scale estimation errors further strengthens the methodology's rigor.
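Since FSQ is central to the reported gains, a minimal sketch of the finite-scalar-quantization idea is included below: each latent channel is bounded, rounded to a small fixed set of levels, and trained with a straight-through estimator. The level counts and the tanh bounding are generic choices, not necessarily those of the paper's codec.

```python
import torch

def fsq_quantize(z, levels):
    """Finite scalar quantization: bound each channel, round to one of
    levels[i] values, and pass gradients straight through.
    z: (..., D) with D == len(levels), e.g. levels = [8, 5, 5, 5]."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half            # per-dim range [-half, half]
    quantized = torch.round(bounded)
    # Straight-through estimator: rounded values in the forward pass,
    # gradients flow as if no rounding happened.
    return bounded + (quantized - bounded).detach()
```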
The experiments are robust, utilizing a large dataset (Granary English-only) and employing various evaluation metrics (WER, PESQ, MOS, etc.) to assess performance comprehensively. The comparison between AR and DDM models is well-articulated, showing clear advantages in both reconstruction quality and inference speed. The results substantiate the claims made in the paper, demonstrating the effectiveness of DDMs in speech applications. However, the paper could benefit from additional comparisons with state-of-the-art models beyond the baseline.
The paper provides sufficient details regarding the training setup, including the number of GPUs used, batch sizes, and training procedures. However, the absence of a publicly accessible code repository limits full reproducibility, as other researchers may struggle to replicate the results without the exact implementation details.
One limitation is the reliance on a specific dataset that may not generalize across all speech applications. Additionally, the paper does not address potential challenges in real-world applications, such as handling diverse accents or noisy environments beyond the dataset used. The assumption that the model has access to global S3 token lengths during inference may also pose practical challenges.
The proposed DDM-based TASTE framework has significant implications for the field of speech processing, particularly in applications requiring efficient and high-quality speech synthesis and recognition. The advancements could lead to improvements in voice assistants, automated transcription services, and other speech-related technologies, ultimately enhancing user experiences in various domains.
This paper introduces WrenNet, an efficient neural network enabling real-time multi-species bird audio classification on low-power microcontrollers for scalable biodiversity monitoring. We propose a semi-learnable spectral feature extractor that adapts to avian vocalizations, outperforming standard mel-scale and fully-learnable alternatives. On an expert-curated 70-species dataset, WrenNet achieves up to 90.8% accuracy on acoustically distinctive species and 70.1% on the full task. When deployed on an AudioMoth device (≤1 MB RAM), it consumes only 77 mJ per inference. Moreover, the proposed model is over 16x more energy-efficient than BirdNET when running on a Raspberry Pi 3B+. This work demonstrates the first practical framework for continuous, multi-species acoustic monitoring on low-power edge devices.
Primary: University of Trento
All Institutions: University of Trento
The main contribution of this paper is the introduction of WrenNet, a novel neural network architecture that enables efficient multi-species bird audio classification on low-power devices, significantly advancing the field of bioacoustic monitoring. This work is notable for its innovative methodology and practical applications, addressing critical challenges in environmental monitoring with a focus on energy efficiency and real-time processing.
The methodology presented in this paper is robust, featuring a well-thought-out neural architecture (WrenNet) that addresses the specific challenges of multi-species bird classification on low-power devices. The introduction of a semi-learnable spectral feature extractor is particularly innovative, allowing for adaptive frequency mapping that enhances the model's performance on avian vocalizations. The use of causal convolutions and a unidirectional GRU for temporal processing is a strong choice for maintaining memory efficiency while ensuring real-time processing capabilities. The paper effectively combines deep learning techniques with practical constraints of edge devices, showcasing a thoughtful approach to system-algorithm co-design.
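As a rough illustration of what a semi-learnable spectral front-end can look like, the sketch below parameterizes a triangular filterbank whose band edges are initialized on the mel scale and then trained; this is an assumed formulation for illustration, not WrenNet's exact extractor.

```python
import math
import torch
import torch.nn as nn

class SemiLearnableFilterbank(nn.Module):
    """Triangular filterbank over STFT magnitudes with learnable band edges,
    initialized on the mel scale; only the frequency warping adapts during training."""

    def __init__(self, n_fft=512, n_bands=40, sample_rate=16000):
        super().__init__()
        n_bins = n_fft // 2 + 1
        max_mel = 2595 * math.log10(1 + sample_rate / 2 / 700)
        mel = torch.linspace(0, max_mel, n_bands + 2)
        hz = 700 * (10 ** (mel / 2595) - 1)                 # mel-spaced edges in Hz
        self.edges = nn.Parameter(hz)                       # (n_bands + 2,), learnable
        self.register_buffer("freqs", torch.linspace(0, sample_rate / 2, n_bins))

    def forward(self, spec_mag):                            # spec_mag: (batch, n_bins, time)
        lo, center, hi = self.edges[:-2], self.edges[1:-1], self.edges[2:]
        f = self.freqs[None, :]                             # (1, n_bins)
        rise = (f - lo[:, None]) / (center - lo).clamp(min=1e-3)[:, None]
        fall = (hi[:, None] - f) / (hi - center).clamp(min=1e-3)[:, None]
        weights = torch.clamp(torch.minimum(rise, fall), min=0.0)   # (n_bands, n_bins)
        return torch.log1p(weights @ spec_mag)              # (batch, n_bands, time)
```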
The experimental evaluation is comprehensive, utilizing a well-curated dataset of 70 species and demonstrating the model's performance through various benchmarks. The accuracy results are promising, particularly for acoustically distinctive species, and the energy consumption metrics highlight the practical viability of the proposed system. The comparison with existing models like BirdNET, showcasing a significant reduction in energy consumption, adds to the credibility of the results. However, the paper could benefit from more detailed discussions on the statistical significance of the results and potential variations in performance across different environments.
The paper provides a clear overview of the experimental setup, including the dataset creation and training processes. The availability of scripts in a public repository enhances reproducibility. However, more detailed documentation on the specific configurations used for training and testing would further facilitate replication of the results by other researchers.
While the paper presents a significant advancement, it does have limitations. The model's performance on the full dataset (70 species) shows a drop in accuracy, indicating that further refinement may be necessary for closely related species. Additionally, the reliance on a specific dataset may limit generalizability to other geographical regions or bird species. The energy consumption metrics, while impressive, could vary significantly with different environmental conditions and hardware configurations.
The implications of this work are substantial for biodiversity monitoring and conservation efforts. By enabling real-time, low-power classification of bird species, this technology can facilitate large-scale ecological studies and contribute to the understanding of avian populations and their habitats. The approach could be extended to other wildlife monitoring applications, potentially transforming how ecological data is collected and analyzed.
Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured from a deployed AI assistant, the dataset comprises 5,805 annotated sessions with synchronized user and mixed-channel views, RTTM/CTM timing, and role labels. We introduce a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio for long-context recognition. ASR evaluation includes WER, CER, and HC-WER, which measures concept-level accuracy across healthcare settings. LLM-generated responses are assessed using rubric-based and pairwise protocols. MMedFD establishes a reproducible framework for benchmarking streaming ASR and end-to-end duplex agents in healthcare deployment. The dataset and related resources are publicly available at https://github.com/Kinetics-JOJO/MMedFD
The main contribution of this paper is the introduction of MMedFD, a novel healthcare ASR corpus and a robust framework for evaluating multi-turn, full-duplex speech recognition systems. This work addresses a critical gap in the ASR field, particularly in clinical dialogue, and lays the groundwork for future advancements in healthcare communication technologies.
The paper introduces a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, which is a significant contribution to the field of ASR in healthcare. The use of the Whisper-small model fine-tuned on role-concatenated audio for long-context recognition is innovative, addressing the challenges of multi-turn and full-duplex interactions in clinical settings. The methodology is well-structured and clearly articulated, although it would benefit from more detailed comparisons with existing methods.
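A minimal sketch of the role-concatenation idea for long-context fine-tuning is given below; the turn fields, inline role tags, and the 30-second budget are assumptions rather than MMedFD's actual preprocessing.

```python
import numpy as np

def build_role_concatenated_example(turns, sample_rate=16000, max_seconds=30.0):
    """Concatenate consecutive dialogue turns into one long-context ASR training sample.
    Each turn is assumed to be a dict with 'role', 'audio' (1-D array), and 'text';
    the role tags embedded in the transcript are illustrative placeholders."""
    audio_chunks, text_parts, total = [], [], 0.0
    for turn in turns:
        dur = len(turn["audio"]) / sample_rate
        if total + dur > max_seconds:
            break
        audio_chunks.append(turn["audio"])
        text_parts.append(f"<{turn['role']}> {turn['text']}")
        total += dur
    return np.concatenate(audio_chunks), " ".join(text_parts)
```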
The experiments are comprehensive, utilizing a dataset of 5,805 annotated sessions, which is substantial for the domain. The evaluation metrics, including WER, CER, and HC-WER, are appropriate for assessing ASR performance in healthcare. However, the paper could enhance its impact by providing more detailed results and comparisons with baseline models to better illustrate the effectiveness of the proposed methods.
The authors have made the dataset and related resources publicly available, which is commendable for reproducibility. However, the paper lacks detailed implementation instructions or code snippets that would facilitate replication of the results by other researchers. Including such details would strengthen the reproducibility aspect significantly.
One limitation is the focus on a specific language (Chinese), which may restrict the generalizability of the findings to other languages or dialects. Additionally, while the dataset is substantial, the paper does not discuss potential biases in the data collection process or the diversity of the speakers involved, which could affect the model's performance in real-world applications.
The development of MMedFD has the potential to significantly impact the healthcare sector by improving the efficiency and accuracy of ASR systems in clinical dialogues. This could lead to better patient interactions and streamlined workflows in healthcare settings. The framework established for benchmarking streaming ASR can also encourage further research and development in this area.
Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining robust to the choice of speech tokenizer (HuBERT or WavLM). Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
Primary: ~Corresponding authors
All Institutions: ~Equal contribution, ~Corresponding authors
The paper presents a significant contribution to the field of TTS by introducing a new metric for assessing prosody diversity, which is crucial for improving the naturalness of synthesized speech. The methodology is innovative, and the experimental results support its effectiveness, marking a meaningful advancement in the evaluation of TTS systems.
The paper introduces a novel metric, Discretized Speech Weighted Edit Distance (DS-WED), which is a significant advancement in measuring prosody diversity in zero-shot TTS systems. The methodology is robust, leveraging weighted edit distance over semantic tokens, and is well-supported by the creation of the ProsodyEval dataset, which includes human ratings that enhance the reliability of the metric. The approach is methodologically sound, addressing a gap in the current literature regarding the correlation between acoustic metrics and human perception of prosody.
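The core of a weighted edit distance over discrete speech tokens is a small dynamic program; the sketch below uses fixed per-operation weights and a naive pairwise aggregation, which simplifies the DS-WED weighting described in the paper.

```python
def weighted_edit_distance(a, b, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Weighted Levenshtein distance between two token sequences (e.g. HuBERT units).
    The per-operation weights are placeholders; DS-WED's actual weighting may differ."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * w_del
    for j in range(1, n + 1):
        dp[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else w_sub)
            dp[i][j] = min(sub, dp[i - 1][j] + w_del, dp[i][j - 1] + w_ins)
    return dp[m][n]

def pairwise_diversity(token_seqs):
    """Diversity of a set of generations: mean pairwise length-normalized distance."""
    pairs = [(x, y) for i, x in enumerate(token_seqs) for y in token_seqs[i + 1:]]
    return sum(weighted_edit_distance(x, y) / max(len(x), len(y)) for x, y in pairs) / max(len(pairs), 1)
```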
The experiments conducted on the ProsodyEval dataset are comprehensive, featuring 1000 speech samples and 2000 human ratings, which provide a solid foundation for evaluating the proposed metric. The results demonstrate that DS-WED correlates more strongly with human judgments than existing metrics, showcasing its effectiveness. Additionally, the benchmarking of state-of-the-art TTS systems reveals practical applications of the metric, although further details on the experimental setup could enhance transparency.
The paper provides sufficient details regarding the dataset and the proposed metric, which aids in reproducibility. However, the absence of a publicly available code repository limits the ease of reproduction for the proposed methods and results. Including implementation details or a link to a code repository would significantly enhance reproducibility.
One limitation noted is the reliance on human ratings, which can introduce variability and subjectivity into the evaluation process. Additionally, the paper mentions that current large audio language models (LALMs) are limited in capturing prosodic variations, indicating an area for further research. The scope of the dataset, while substantial, may not encompass all potential prosodic variations present in diverse languages and accents.
The development of a reliable metric for prosody diversity has significant implications for the TTS field, potentially enhancing the naturalness and expressiveness of synthesized speech. This work could influence future research directions in TTS systems, particularly in improving user experience and accessibility for diverse populations. The findings may also encourage further exploration into the integration of prosody in other areas of machine learning and natural language processing.
Speech summarization is a critical component of spoken content understanding, particularly in the era of rapidly growing spoken and audiovisual data. Recent advances in multi-modal large language models (MLLMs), leveraging the power of LLMs, enable generating textual summaries directly from speech without intermediate transcriptions, while supporting controllable styles and zero-shot generalization. However, open-source MLLMs continue to lag behind the state-of-the-art text-based LLMs, limiting their practical deployment for speech summarization. In this work, we present a novel multi-stage reinforcement learning training framework to enhance the speech summarization capabilities in MLLMs. Our model delivers substantial improvements over strong baselines, outperforms much larger MLLMs, and significantly narrows the gap with state-of-the-art text-based LLMs.
Primary: Microsoft CoreAI
All Institutions: Microsoft CoreAI
The main contribution of this work is a novel multi-stage reinforcement learning framework that significantly enhances the speech summarization capabilities of multi-modal large language models. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and natural language processing.
The proposed methodology introduces a multi-stage reinforcement learning framework that effectively enhances speech summarization capabilities in multi-modal large language models (MLLMs). The combination of supervised fine-tuning on synthetic data, on-policy knowledge distillation, and Direct Preference Optimization is innovative and addresses key challenges in the field, such as error propagation and modality gaps. The approach is well-structured and leverages existing models and techniques, showcasing a thoughtful integration of various methodologies to improve performance.
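The preference-optimization stage presumably follows the standard DPO objective; the sketch below states that objective over summed sequence log-probabilities and should be read as a generic reference, not the paper's exact training code.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: widen the log-probability margin of the preferred summary
    over the dispreferred one, measured relative to a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```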
The experimental evaluation is robust, utilizing multiple benchmarks (Golden3, AMI, and FLORAS) to assess the model's performance. The paper provides a thorough comparison with both open-source and state-of-the-art systems, demonstrating significant performance improvements. The ablation studies further validate the effectiveness of each component of the proposed framework, highlighting the importance of data quality and the choice of teacher models in knowledge distillation.
While the paper provides detailed descriptions of the training processes and datasets, it lacks specific URLs for code or datasets, which could hinder reproducibility. The absence of a public repository or demo limits the ability for other researchers to replicate the results independently. However, the methodology is described in sufficient detail for knowledgeable practitioners to implement similar experiments.
The paper acknowledges issues such as hallucinations and reward hacking, which are common in reinforcement learning settings. While the proposed methods mitigate these issues, they do not completely eliminate them. Additionally, the focus on English-only data in training may limit the model's applicability in multilingual contexts, despite showing some cross-lingual generalization.
The advancements in speech summarization have significant implications for accessibility, productivity, and information retrieval in various domains, including education, business, and media. The ability to generate coherent summaries from spoken content can enhance user experiences and facilitate better information management in an increasingly audio-centric world.
Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult and use these estimates for a targeted oversampling strategy. We validate our method on English and German datasets. Crucially, we demonstrate that our model-derived uncertainty strongly correlates with phonemes identified as challenging in an expert clinical logopedic report, marking, to our knowledge, the first work to successfully align model uncertainty with expert assessment of speech difficulty. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
Primary: University of Zurich
All Institutions: Technical University of Munich, School of Computation, Information and Technology, Institute of Neuroinformatics, University of Zurich and ETH Zurich, University of Zurich, Department of Computational Linguistics
This paper presents a novel framework for data-efficient ASR personalization that utilizes uncertainty-based phoneme difficulty scoring to improve recognition accuracy for non-normative speech. The integration of clinical validation with machine learning techniques represents a meaningful contribution to both the fields of speech recognition and assistive technology.
The methodology is robust, leveraging Monte Carlo Dropout to quantify phoneme-level uncertainty, which is innovative in the context of ASR personalization for non-normative speech. The introduction of the Phoneme Difficulty Score (PhDScore) is a significant advancement, as it combines multiple uncertainty metrics to guide oversampling effectively. The approach to link model uncertainty with clinical assessments is particularly noteworthy and demonstrates a thoughtful integration of machine learning with clinical insights.
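The uncertainty estimation step can be sketched as follows: dropout is left active at inference and disagreement across stochastic passes is summarized as predictive entropy. Aggregating these per-frame entropies by phoneme label (via a forced alignment) would yield a per-phoneme difficulty estimate; the exact PhDScore combination is not reproduced here.

```python
import torch

def mc_dropout_entropy(model, batch, n_passes=10):
    """Per-frame predictive entropy via Monte Carlo Dropout: run the ASR model several
    times with dropout active and measure disagreement across passes. `model(batch)`
    is assumed to return frame-level logits over phoneme classes."""
    model.train()                                            # keep dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([model(batch).softmax(-1) for _ in range(n_passes)])
    mean_p = probs.mean(dim=0)                               # (batch, frames, classes)
    return -(mean_p * mean_p.clamp_min(1e-8).log()).sum(-1)  # (batch, frames)
```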
The experiments are well-structured, utilizing both English and German datasets to validate the proposed method. The results show a clear improvement in ASR accuracy for non-normative speech, and the correlation with clinical assessments adds credibility to the findings. However, the limited number of speakers in the BF-Sprache dataset may affect the generalizability of the results.
While the paper provides a detailed description of the methodology, including the computation of the PhDScore and the experimental setup, it lacks specific implementation details or code availability, which could hinder reproducibility. Future work should consider sharing code or datasets to facilitate further research.
The primary limitation is the small size of the BF-Sprache dataset, which restricts the breadth of the findings. Additionally, the subjective nature of clinical assessments may introduce variability in the validation process. The trade-off between personalization and generalization is also a concern, as it may limit the practical application of the method in real-world scenarios.
This work has significant implications for the development of personalized ASR systems, particularly for individuals with speech impairments. By improving the accuracy of ASR for non-normative speech, the proposed method could enhance communication aids and assistive technologies, making them more effective and inclusive for users with diverse speech patterns.
Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising alternative to next-token prediction, avoiding the technical complexities associated with discrete speech tokenization. As a relatively new paradigm, research on reinforcement learning (RL)-based fine-tuning of speech ARDMs remains limited. In this paper, we propose Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO) to advance this research. By fine-tuning the recently proposed zero-shot text-to-speech model DiTAR with DPO, we achieve significant improvements in terms of speech expressiveness and robustness for long texts.
Primary: The Chinese University of Hong Kong
All Institutions: School of Data Science, ByteDance Seed, The Chinese University of Hong Kong, School of Artificial Intelligence, Nanjing University
The main contribution of this paper is the introduction of ARDM-DPO, a novel method for fine-tuning autoregressive diffusion models in speech generation, which enhances expressiveness and robustness while addressing the challenges of traditional TTS systems. The comprehensive evaluation of the method demonstrates its potential impact on the field of audio generation and reinforces the importance of preference alignment in machine learning models.
The proposed method, Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO), represents a significant advancement in the application of autoregressive diffusion models for text-to-speech (TTS) systems. The methodology effectively integrates reinforcement learning principles to fine-tune the DiTAR model, addressing the limitations of traditional next-token prediction approaches. The authors provide a clear framework for preference alignment, which is critical for enhancing the expressiveness and robustness of generated speech. However, the paper could benefit from a more detailed discussion on the implementation specifics of DPO in the context of ARDMs, as well as a deeper exploration of the underlying assumptions made during model training.
The experiments are well-structured, utilizing comprehensive datasets and benchmarks to evaluate the performance of ARDM-DPO against baseline methods. The authors present quantitative metrics such as F0 variance and character error rate, alongside qualitative assessments through listener evaluations, which provide a balanced view of the model's performance. The results indicate significant improvements in expressiveness and robustness, although the paper notes some instability in training, which warrants further investigation. The use of a large preference dataset strengthens the findings, but additional comparisons with more baseline models could enhance the robustness of the conclusions drawn.
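For reference, an F0-variance expressiveness proxy of the kind reported here can be computed in a few lines; the pitch range and the use of pYIN below are assumptions, not necessarily the authors' measurement setup.

```python
import librosa
import numpy as np

def f0_variance(wav_path, fmin=65.0, fmax=600.0):
    """Expressiveness proxy: variance of the voiced F0 contour of a synthesized utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]                  # keep voiced, finite frames only
    return float(np.var(f0)) if f0.size else 0.0
```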
The paper provides a reasonable level of detail regarding the experimental setup, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing the model and code to facilitate further research and validation by the community.
The paper acknowledges the instability of the ARDM-DPO training process, particularly in Task A, which can lead to degradation in speech quality. This instability raises questions about the robustness of the method in practical applications. Additionally, the reliance on preference datasets for training may introduce biases that affect the generalizability of the model. The authors also mention the need for early stopping, which could complicate the training process.
The advancements presented in this paper have the potential to significantly improve TTS systems, making them more expressive and aligned with human preferences. This could enhance applications in various fields, including virtual assistants, audiobooks, and entertainment. The work contributes to the growing body of research on autoregressive diffusion models, potentially influencing future developments in multimodal generation tasks.
We present ChiReSSD, a speech reconstruction framework that preserves child speakers' identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content of the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show a Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization to reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.
Primary: Language Technologies Institute
All Institutions: Department of Computer Science, Language Technologies Institute, University of Texas at Austin, Analytical Imaging and Modeling Center, Carnegie Mellon University, Children's Health, Department of Plastic Surgery, University of Texas Southwestern Medical Center
The main contribution of this paper is the introduction of ChiReSSD, a novel speech reconstruction framework that effectively addresses the unique challenges of disordered speech in children while preserving speaker identity. This work represents a meaningful advancement in the intersection of machine learning and clinical speech pathology, with the potential to significantly impact both research and practical applications in the field.
The methodology presented in this paper is innovative, leveraging a modified version of StyleTTS2 to specifically address the challenges of reconstructing speech for children with speech sound disorders (SSD). The framework's ability to disentangle acoustic and prosodic features while preserving speaker identity is a significant advancement over traditional methods that often fail to account for the unique characteristics of children's speech. The adaptation of the model to handle the higher pitch and prosodic patterns of child speech is well-justified and effectively executed. However, the paper could benefit from a more detailed description of the training process and hyperparameter tuning, as these are critical for replicating the results.
The experimental evaluation is robust, utilizing multiple datasets (STAR, UltraSuite, and TORGO) to demonstrate the effectiveness of ChiReSSD across different populations. The results show substantial improvements in lexical accuracy and speaker identity preservation, with clear metrics such as WER, CER, and PCC providing quantitative support for the claims made. The correlation of automatic evaluations with human expert annotations (Pearson correlation of 0.63) is particularly noteworthy, as it suggests a practical application for reducing manual transcription efforts in clinical settings. The experiments are well-structured, but the paper could enhance clarity by providing more context for the choice of evaluation metrics.
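The clinical metric and the agreement analysis are simple to state in code; the consonant inventory and the one-to-one phone alignment assumed below are simplifications of the actual evaluation.

```python
import numpy as np
from scipy.stats import pearsonr

CONSONANTS = set("p b t d k g f v s z sh zh ch jh th dh m n ng l r w y hh".split())

def percent_correct_consonants(target_phones, produced_phones):
    """Percentage of Correct Consonants (PCC): of the consonants in the target
    transcription, how many were realized correctly. Assumes the two phone sequences
    are already aligned one-to-one, which is a simplification."""
    pairs = [(t, p) for t, p in zip(target_phones, produced_phones) if t in CONSONANTS]
    if not pairs:
        return 0.0
    return 100.0 * sum(t == p for t, p in pairs) / len(pairs)

def agreement(auto_scores, expert_scores):
    """Pearson correlation between automatic and expert scores, as in the reported r = 0.63."""
    r, _ = pearsonr(np.asarray(auto_scores), np.asarray(expert_scores))
    return r
```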
While the paper provides a general overview of the methods and datasets used, it lacks specific implementation details that would aid in reproducibility. For instance, the exact configurations of the model training, including learning rates, batch sizes, and the specific architecture of the StyleTTS2 modifications, are not thoroughly detailed. Including a supplementary material section with code snippets or a link to a repository would greatly enhance reproducibility.
One limitation of the study is the reliance on specific datasets that may not fully represent the diversity of speech disorders in children. The generalization to adult dysarthric speech is promising, but the paper does not address potential limitations in applying the model to other populations or languages. Additionally, while the model shows improvements in phonetic accuracy, residual errors remain, and the paper suggests future work to address these, indicating that the current version may not be fully optimized.
The implications of this research are significant, particularly in the fields of speech-language pathology and assistive technology. By providing a framework that can improve the intelligibility of disordered speech while preserving the speaker's identity, ChiReSSD has the potential to enhance communication for children with SSD, thereby improving their social and academic outcomes. Furthermore, the ability to automate clinical evaluations could alleviate some of the burdens on speech-language therapists, allowing them to focus on more complex cases.
Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Department of Electrical and Computer Engineering
The paper presents FlexSED, a novel open-vocabulary sound event detection framework that effectively addresses existing limitations in sound classification and adapts well to diverse real-world applications. The innovative integration of pretrained models and robust training strategies positions this work as a significant contribution to the field of audio machine learning.
The proposed FlexSED framework introduces a novel architecture that integrates pretrained audio and text models, addressing the limitations of traditional sound event detection systems. The encoder-decoder structure and adaptive fusion strategy are innovative, allowing for effective continuous training and improved performance in open-vocabulary contexts. The use of large language models for negative query filtering is particularly noteworthy, as it enhances the robustness of the training process by mitigating issues related to missing labels. Overall, the methodology is well-structured and leverages existing technologies in a creative manner.
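At inference time, open-vocabulary detection of this kind reduces to scoring audio frames against a text-query embedding and thresholding the result into time segments. The sketch below uses a plain dot-product fusion with stand-in encoder callables, whereas FlexSED's fusion is a learned encoder-decoder composition.

```python
import torch

def detect_events(audio_frames, text_query, audio_encoder, text_encoder, threshold=0.5, hop_s=0.02):
    """Open-vocabulary SED inference sketch. `audio_encoder` and `text_encoder` stand in
    for the pretrained audio SSL model and the CLAP text encoder; both interfaces are
    assumptions made for illustration."""
    with torch.no_grad():
        frame_emb = audio_encoder(audio_frames)                 # (frames, dim)
        query_emb = text_encoder([text_query])[0]               # (dim,)
        probs = torch.sigmoid(frame_emb @ query_emb)            # (frames,)
    active = probs > threshold
    segments, start = [], None
    for i, on in enumerate(active.tolist() + [False]):          # sentinel closes an open segment
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start * hop_s, i * hop_s))
            start = None
    return segments
```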
The experiments conducted on the AudioSet-Strong dataset demonstrate the effectiveness of FlexSED, showcasing significant improvements over traditional models. The evaluation metrics used, including PSDS1, provide a fine-grained analysis of the model's performance in terms of temporal localization and sound event detection accuracy. The results from zero-shot and few-shot learning scenarios further validate the model's adaptability and generalization capabilities, which are crucial for real-world applications. However, the paper could benefit from additional comparisons with more diverse baseline models to strengthen its claims.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which facilitate reproducibility. The authors have also made the code and pretrained models available on GitHub, enhancing the accessibility of their work for further research and experimentation. However, the absence of a demo URL limits immediate practical engagement with the model.
One limitation of the study is the reliance on the AudioSet-Strong dataset, which, while substantial, may not encompass the full diversity of sound events encountered in real-world scenarios. Additionally, the model's performance in highly noisy environments or with overlapping sound events could be further explored. The paper also does not address potential computational costs associated with using large language models for negative query filtering, which may limit practical deployment in resource-constrained settings.
The FlexSED framework has the potential to significantly advance the field of sound event detection by enabling more flexible and user-friendly interactions through open-vocabulary capabilities. Its applications could extend to various domains, including smart home technologies, wildlife monitoring, and assistive devices for the hearing impaired. By improving the adaptability of sound event detection systems, this work could lead to more intelligent and responsive audio processing solutions in everyday environments.
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio model while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.
Primary: Peking University
All Institutions: Peking University, The State Key Laboratory of Multimedia Information Processing, Jiutian Artificial Intelligence Research Institute
This paper presents a novel framework for knowledge distillation that enhances reasoning capabilities in audio models by leveraging both textual and acoustic supervision. The comprehensive methodology and strong experimental results indicate a meaningful contribution to the field of machine learning, particularly in audio processing and reasoning tasks.
The proposed methodology introduces a dual-dimensional knowledge distillation framework that effectively addresses the challenges of reasoning in audio models by incorporating both source-wise and layer-wise distillation. This approach is innovative as it not only leverages the strengths of textual and acoustic teachers but also aligns the distillation process with the architecture of the student model, allowing for a more nuanced transfer of knowledge. The textualization of audio to bridge the modality gap is particularly noteworthy, as it enables the application of textual reasoning techniques to audio data.
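A minimal rendering of the dual-dimensional distillation objective is shown below: selected student layers are matched against hidden states from both a textual and an acoustic teacher. The projection-free MSE and the explicit layer map are assumptions made for brevity, not the paper's exact loss.

```python
import torch.nn.functional as F

def dual_distillation_loss(student_hidden, text_teacher_hidden, audio_teacher_hidden,
                           layer_map, w_text=1.0, w_audio=1.0):
    """Source-wise + layer-wise distillation sketch. `layer_map` pairs a student layer
    index with one teacher layer index per source, e.g. {4: (10, 6), 8: (20, 12)};
    hidden states are assumed to share a common dimensionality."""
    loss = 0.0
    for s_idx, (t_idx, a_idx) in layer_map.items():
        loss = loss + w_text * F.mse_loss(student_hidden[s_idx], text_teacher_hidden[t_idx])
        loss = loss + w_audio * F.mse_loss(student_hidden[s_idx], audio_teacher_hidden[a_idx])
    return loss
```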
The experimental evaluation is robust, utilizing relevant datasets such as CoTA and MMAU to assess the performance of the proposed framework. The results demonstrate significant improvements in reasoning accuracy across various tasks, indicating the effectiveness of the proposed distillation methods. The comparison against baseline models and different distillation strategies provides a comprehensive understanding of the framework's impact.
The paper includes sufficient detail regarding the training setup, model configurations, and evaluation metrics, which aids in reproducibility. However, the absence of publicly available code or a project URL limits the ease with which other researchers can replicate the work.
One limitation is the reliance on specific datasets, which may not generalize across all audio reasoning tasks. Additionally, while the framework shows improvements, the paper does not extensively discuss potential computational costs associated with the dual-dimensional distillation process, which could impact scalability in real-world applications.
The proposed framework has significant implications for advancing audio models, particularly in applications requiring complex reasoning, such as automated transcription, sentiment analysis, and interactive voice assistants. By enhancing the reasoning capabilities of audio models, this work could lead to more intelligent and context-aware audio processing systems.
Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite recent advancements, state-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. Moreover, collecting and annotating non-normative speech is burdensome: speaking is effortful for many affected individuals, while laborious annotation often requires caregivers familiar with the speaker. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. We validate our method on the English UA-Speech dataset and a newly collected German speech dataset, BF-Sprache, from a child with structural speech impairment. The dataset and approach are designed to reflect the challenges of low-resource settings that include individuals with speech impairments. Our method significantly improves ASR accuracy for impaired speech while maintaining data and annotation efficiency, offering a practical path toward inclusive ASR.
Primary: University of Zurich
All Institutions: Technical University of Munich, School of Computation, Information and Technology, Institute of Neuroinformatics, University of Zurich and ETH Zurich, University of Zurich, Department of Computational Linguistics
This paper presents a novel Bayesian Low-rank Adaptation framework for personalized impaired speech recognition, significantly improving ASR accuracy while addressing the challenges of data scarcity and variability in non-normative speech. The methodology and results contribute meaningfully to the field, offering practical solutions for inclusive communication technologies.
The proposed methodology introduces a novel Bayesian Low-rank Adaptation (VI LoRA) framework, which effectively addresses the challenges of data scarcity and high variability in impaired speech recognition. The incorporation of variational inference to estimate the posterior distributions of adaptation parameters is a significant advancement over traditional low-rank adaptation methods. The dual prior approach for layer-wise weight variations is particularly innovative, allowing for a more informed adaptation process. However, the assumption of independence in the factorization of the variational parameters may limit the model's ability to capture complex interactions between layers.
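The following sketch illustrates a variational low-rank adapter in the spirit described above: each LoRA factor carries a mean and a log-variance, samples are drawn with the reparameterization trick, and a Gaussian KL term regularizes the posterior. The isotropic prior and the initialization are illustrative choices rather than the paper's layer-wise priors.

```python
import torch
import torch.nn as nn

class VariationalLoRALinear(nn.Module):
    """Bayesian low-rank adapter around a frozen linear layer."""

    def __init__(self, base: nn.Linear, rank=8, prior_std=0.1):
        super().__init__()
        self.base = base.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.mu_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.mu_B = nn.Parameter(torch.zeros(d_out, rank))
        self.logvar_A = nn.Parameter(torch.full((rank, d_in), -6.0))
        self.logvar_B = nn.Parameter(torch.full((d_out, rank), -6.0))
        self.prior_std = prior_std

    def _sample(self, mu, logvar):
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick

    def forward(self, x):
        A = self._sample(self.mu_A, self.logvar_A)
        B = self._sample(self.mu_B, self.logvar_B)
        return self.base(x) + x @ A.t() @ B.t()

    def kl(self):
        """KL from the factorized Gaussian posterior to an isotropic N(0, prior_std^2) prior."""
        kl = 0.0
        for mu, logvar in [(self.mu_A, self.logvar_A), (self.mu_B, self.logvar_B)]:
            var = logvar.exp()
            kl = kl + 0.5 * ((var + mu ** 2) / self.prior_std ** 2 - 1.0
                             + 2 * torch.log(torch.tensor(self.prior_std)) - logvar).sum()
        return kl
```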
The experiments are well-structured, utilizing two distinct datasets (UA-Speech and BF-Sprache) that highlight the effectiveness of the proposed method across different languages and intelligibility levels. The comparative analysis against various baselines, including full fine-tuning and standard LoRA, demonstrates the robustness and efficiency of VI LoRA, particularly in low-data scenarios. The results indicate substantial improvements in word and character error rates, especially for speakers with very low intelligibility, underscoring the practical applicability of the method.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility. The absence of a publicly available code repository or demo URL further hinders the ability of other researchers to replicate the findings. Clearer documentation of hyperparameters, training procedures, and data preprocessing steps would enhance reproducibility.
The study acknowledges limitations related to the small speaker pool in the BF-Sprache dataset, which may affect the generalizability of the findings. Additionally, the reliance on contextual pattern matching in existing ASR systems could hinder language learning for children with speech impairments. The assumption of independent factorization in the variational parameters may not fully capture the complexities of the model, potentially impacting performance.
The proposed framework has significant implications for the development of inclusive ASR systems that can accommodate individuals with speech impairments. By improving recognition accuracy and maintaining data efficiency, the method can enhance communication for affected individuals, fostering social inclusion and educational opportunities. The approach also opens avenues for further research in low-resource speech recognition across languages, contributing to the broader field of assistive technologies.
The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Flow model on embeddings extracted by pre-trained neural audio fingerprinting systems. The synthetic fingerprints generated using our system act as realistic distractors and enable the simulation of retrieval performance at a large scale without requiring additional audio. We assess the fidelity of synthetic fingerprints by comparing the distributions to real data. We further benchmark the retrieval performances across multiple state-of-the-art audio fingerprinting frameworks by augmenting real reference databases with synthetic distractors, and show that the scaling trends obtained with synthetic distractors closely track those obtained with real distractors. Finally, we scale the synthetic distractor database to model retrieval performance for very large databases, providing a practical metric of system scalability that does not depend on access to audio corpora.
Primary: Queen Mary University of London
All Institutions: Queen Mary University of London, School of Electronic Engineering and Computer Science
The paper presents a framework for scalable evaluation of audio fingerprinting systems using synthetic latent fingerprints generated by a rectified flow model. The methodology is innovative and addresses a critical challenge in the field, with potential applications that could enhance the performance and scalability of audio identification systems.
The paper introduces a novel approach to audio fingerprinting by synthesizing latent fingerprints using a Rectified Flow model, which is a significant advancement in the field. The methodology is well-structured, leveraging generative modeling to create realistic distractors without requiring additional audio data. The use of embeddings from pre-trained systems enhances the fidelity of the synthetic fingerprints, and the approach is theoretically sound, with a clear explanation of the model architecture and training process. The authors provide a comprehensive description of how the generative model approximates the distribution of real fingerprints, which is a critical aspect of their methodology.
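A rectified-flow training step on fingerprint embeddings is compact enough to sketch directly; the `model(x_t, t)` velocity-prediction interface is an assumed signature, not the authors' code.

```python
import torch

def rectified_flow_step(model, real_fp, optimizer):
    """One training step of a rectified flow on fingerprint embeddings: interpolate between
    Gaussian noise and a real embedding, and regress the constant velocity that transports
    one to the other."""
    noise = torch.randn_like(real_fp)
    t = torch.rand(real_fp.shape[0], 1, device=real_fp.device)     # one time per sample
    x_t = (1.0 - t) * noise + t * real_fp                          # straight-line interpolation
    target_v = real_fp - noise                                     # constant velocity field
    loss = ((model(x_t, t) - target_v) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Sampling then integrates the learned velocity field from noise (t = 0) to data (t = 1) with a few Euler steps, which keeps large-scale distractor generation inexpensive relative to collecting audio.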
The experimental setup is robust, employing a well-defined evaluation framework that assesses both the fidelity of synthetic fingerprints and their effectiveness as distractors in retrieval tasks. The use of multiple state-of-the-art audio fingerprinting systems for benchmarking adds credibility to the results. The experiments demonstrate that synthetic distractors can effectively simulate real-world conditions, with results indicating that scaling trends are closely tracked. However, the paper could benefit from more extensive statistical analysis to further validate the findings.
The authors have made their code and trained models available on GitHub, which is a positive aspect for reproducibility. The detailed description of the training process, including hyperparameters and dataset specifics, supports the reproducibility of the experiments. However, some equations and figures referenced in the text are not fully detailed, which could hinder complete replication of the results.
One limitation of the study is the reliance on a single dataset (Free Music Archive) for training and evaluation, which may affect the generalizability of the findings to other audio domains. Additionally, while the synthetic fingerprints closely match real distributions, there may still be nuances in real data that are not captured by the generative model. The paper also does not explore the potential biases introduced by the dataset used for training.
This research has significant implications for the field of music information retrieval, particularly in scenarios where large annotated audio datasets are not available. By enabling scalable evaluation of audio fingerprinting systems, the proposed framework can facilitate advancements in real-time audio identification applications, such as music recognition services and copyright enforcement. The approach could also inspire further research into generative modeling techniques in other areas of machine learning.
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.
Primary: Channel Corporation
All Institutions: Channel Corporation
The main contribution of this paper is the introduction of a novel preference-guided optimization approach for prosody learning in TTS systems, which effectively addresses the limitations of existing methods by utilizing human feedback to enhance the naturalness of synthesized speech. This work represents a meaningful step forward in the field of TTS, providing a practical solution to a longstanding challenge in achieving expressive and natural speech synthesis.
The paper introduces an iterative Direct Preference Optimization (DPO) scheme that innovatively addresses the challenge of optimizing prosody in TTS systems without a verifiable reward signal. The methodology is well-structured, leveraging human-labeled preference pairs to guide the model towards more natural prosody, which is a significant advancement over traditional methods that rely heavily on transcription-oriented signals. The regularization to the current model is a thoughtful addition that helps maintain stability during training, which is critical in TTS applications.
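The outer loop of the iterative scheme can be sketched at a high level as follows; all four callables are hypothetical stand-ins introduced for illustration, not the paper's implementation.

```python
def iterative_dpo(model, sample_candidates, collect_preferences, dpo_update,
                  rounds=3, pairs_per_round=300):
    """Outer loop: each round samples candidate utterances, collects a few hundred human
    preference pairs, and runs a DPO update regularized to the current model, which then
    becomes the new reference."""
    reference = model
    for _ in range(rounds):
        candidates = sample_candidates(model, n=pairs_per_round)
        preferences = collect_preferences(candidates)         # human-labelled (chosen, rejected)
        model = dpo_update(model, reference, preferences)
        reference = model                                      # re-anchor the KL regularizer
    return model
```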
The experiments are robust, utilizing the KoCC-TTS dataset, which is specifically curated for authentic Korean call center interactions. The results demonstrate a clear improvement in human preference ratings (ELO) and competitive character error rates (CER) compared to both GRPO and commercial baselines. This empirical validation strengthens the claims made in the paper and showcases the effectiveness of the proposed method in a real-world context.
The paper provides sufficient detail regarding the methodology and experimental setup, which is crucial for reproducibility. However, it would benefit from the inclusion of hyperparameters, model architectures, and specific training procedures to enhance clarity for future researchers attempting to replicate or build upon this work.
One limitation acknowledged is the reliance on human preference pairs, which may introduce variability and subjectivity into the training process. Additionally, the method's performance in diverse linguistic contexts beyond Korean remains untested, which could limit its generalizability.
The findings have significant implications for the development of more natural and human-like TTS systems, which can enhance user experience in various applications, including virtual assistants, audiobooks, and customer service interactions. By improving prosody in TTS, this work contributes to the broader goal of creating more engaging and effective human-computer interactions.
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose MATA, a novel training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Specifically, MATA intervenes after raw attention scoring, targeting only the last token in intermediate layers without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
Primary: Corresponding author
All Institutions: Corresponding author
The main contribution of this paper is the introduction of MATA, a novel training-free method that enhances audio attention in LALMs, which significantly improves their performance on audio reasoning tasks. The study's findings are relevant and timely, addressing a crucial challenge in the field of multi-modal machine learning and paving the way for future advancements.
The proposed MATA method is innovative in its approach to addressing the audio-textual attention imbalance in Large Audio-Language Models (LALMs). By dynamically adjusting attention weights post raw scoring without retraining the model, MATA offers a practical solution that is both efficient and effective. The choice to target only the last token in intermediate layers is particularly insightful, as it aligns with the model's architecture and the critical role of these layers in multi-modal fusion. However, the lack of detailed hyperparameter tuning and exploration of different enhancement strengths could limit the method's applicability across various models.
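The intervention itself is small enough to sketch: after raw attention scores are computed and before the softmax, a constant boost is added at audio-token key positions, but only for the last query token and only in selected intermediate layers. The single strength parameter and the layer set below are assumptions, not MATA's exact settings.

```python
import torch

def boost_audio_attention(attn_scores, audio_mask, layer_idx, target_layers, alpha=1.0):
    """Training-free attention intervention in the spirit of MATA.
    attn_scores: (batch, heads, query_len, key_len) raw pre-softmax scores.
    audio_mask:  (batch, key_len) booleans marking audio-token positions."""
    if layer_idx not in target_layers:
        return attn_scores
    boosted = attn_scores.clone()
    boosted[:, :, -1, :] = boosted[:, :, -1, :] + alpha * audio_mask[:, None, :].to(attn_scores.dtype)
    return boosted
```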
The experiments conducted on the MMAU and MMAR benchmarks provide strong evidence for the efficacy of MATA, showcasing significant performance improvements over baseline models. The results are compelling, especially the claim that MATA enables an open-source model to outperform a proprietary one for the first time. However, the paper could benefit from additional details on the experimental setup, such as the specific configurations of the baseline models and the statistical significance of the results presented.
The paper does not provide a clear path for reproducing the results, as it lacks links to code repositories or detailed implementation instructions. While the methodology is described, the absence of a public implementation or demo limits the ability for other researchers to validate the findings independently.
One limitation is the focus on only two benchmarks, which may not fully capture the generalizability of MATA across diverse audio reasoning tasks. Additionally, the method's reliance on a single hyperparameter for attention enhancement may not be optimal for all scenarios, and further exploration of this aspect could yield more robust results.
The implications of this work are significant, as it addresses a critical gap in multi-modal model performance, particularly in audio reasoning tasks. By improving the attention allocation towards audio, MATA could enhance applications in various fields, including human-computer interaction, assistive technologies, and multimedia content analysis. This research opens avenues for further exploration into multi-modal learning, potentially leading to more balanced and capable AI systems.
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-toHR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.
Primary: Tsinghua University
All Institutions: Shengshu AI; Tsinghua University, Department of CST
The main contribution of this work is the introduction of a novel audio super-resolution system utilizing Latent Bridge Models, which significantly enhances the quality of audio upsampling beyond existing methods. The comprehensive methodology, rigorous experimental validation, and potential applications highlight its significance in advancing the field of audio processing.
The paper introduces a novel approach to audio super-resolution using Latent Bridge Models (LBMs), which compress audio waveforms into a continuous latent space. The methodology is well-structured, leveraging frequency-aware LBMs and a cascaded design to enhance the upsampling process beyond 48 kHz. The integration of informative priors from low-resolution (LR) signals into the generative framework is innovative, allowing for better quality audio synthesis. The paper also presents two prior augmentation strategies to mitigate cascading errors, which is a thoughtful addition to the overall framework. The use of variational autoencoders (VAEs) for compression and the detailed explanation of the bridge process further demonstrate the robustness of the proposed methodology.
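For intuition, the latent-to-latent generation can be written as a bridge pinned at the two VAE latents; the sketch below is a simplified training step under a generic Brownian-bridge assumption, with the noise scale, endpoint convention, and MSE objective chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def bridge_training_step(z_lr: torch.Tensor, z_hr: torch.Tensor, model, sigma: float = 1.0):
    """One illustrative latent-bridge training step.

    z_lr, z_hr: VAE latents of the low- and high-resolution waveforms,
                shape (batch, channels, frames). `model`, `sigma`, the
                endpoint convention, and the MSE objective are placeholders.
    """
    t = torch.rand(z_hr.size(0), 1, 1, device=z_hr.device)    # t ~ U(0, 1)
    eps = torch.randn_like(z_hr)
    # Brownian-bridge marginal pinned at z_hr (t = 0) and z_lr (t = 1),
    # so generation starts from the informative LR latent and moves to HR.
    z_t = (1.0 - t) * z_hr + t * z_lr + sigma * torch.sqrt(t * (1.0 - t)) * eps
    pred = model(z_t, t.view(-1))                               # network predicts the HR endpoint
    return F.mse_loss(pred, z_hr)
```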
The experimental setup is comprehensive, utilizing multiple benchmark datasets (VCTK, ESC-50, Song-Describer) and internal test sets to evaluate the performance of the proposed method. The results indicate a significant improvement over existing methods, achieving state-of-the-art performance in both objective and perceptual quality metrics. The paper effectively compares its results against various baselines, providing clear evidence of the advantages of the proposed approach. The ablation studies conducted further validate the contributions of each component of the model.
The paper includes sufficient details regarding the training setup, model architecture, and evaluation metrics, which enhances reproducibility. However, the absence of a public code repository limits the ability for independent verification of results. The authors mention a demo URL, which may provide some interactive insights, but a complete code release would be beneficial for the community.
While the proposed method shows promising results, it is important to note that the reliance on high-quality training data may limit its applicability in scenarios where such data is scarce. Additionally, the paper acknowledges potential misuse of the technology, such as unauthorized synthesis of audio, which raises ethical considerations. The cascading approach, while innovative, may still introduce artifacts that could affect the final output quality if not managed properly.
The implications of this research are significant for various applications, including audio restoration, music production, and hearing aids, where high-quality audio is essential. The ability to upscale audio beyond traditional limits opens new avenues for creative industries and enhances user experiences in audio consumption. However, the ethical concerns regarding misuse must be addressed to prevent potential negative impacts on the industry.
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfakes become nearly indistinguishable from real human speech, the need for robust detection methods and effective countermeasures has become critically urgent. In this paper, we present the ISPL's submission to the SAFE challenge at IH&MMSec 2025, where our system ranked first across all tasks. Our solution introduces a novel approach to audio deepfake detection based on a Mixture of Experts architecture. The proposed system leverages multiple state-of-the-art detectors, combining their outputs through an attention-based gating network that dynamically weights each expert based on the input speech signal. In this design, each expert develops a specialized understanding of the shared training data by learning to capture different complementary aspects of the same input through inductive biases. Experimental results indicate that our method outperforms existing approaches across multiple datasets. We further evaluate and analyze the performance of our system in the SAFE challenge.
The main contribution of this paper is the introduction of an attention-based Mixture of Experts architecture for robust speech deepfake detection, which combines the strengths of multiple detectors to improve performance. This innovative approach, along with strong experimental results, positions the work as a valuable addition to the ongoing efforts in combating audio deepfakes, though it requires further details for reproducibility and practical application.
The paper introduces a Mixture of Experts (MoE) architecture enhanced by an attention-based gating mechanism, which is a sophisticated approach to audio deepfake detection. The use of multiple state-of-the-art detectors allows the model to leverage complementary strengths, effectively addressing the challenge of distinguishing between real and synthetic speech. The attention mechanism dynamically adjusts the contribution of each expert based on the input, which is a notable innovation that enhances the model's adaptability and robustness. However, the paper could benefit from a more detailed description of the inductive biases employed and how they are integrated into the learning process.
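As a rough illustration of the gating idea, the sketch below mixes expert embeddings with a scaled-dot-product gate before a shared real/fake head; the embedding dimension, projection layers, and the assumption that the experts are reused as frozen feature extractors are illustrative choices, not the ISPL system's actual architecture.

```python
import torch
import torch.nn as nn

class AttentionGatedMoE(nn.Module):
    """Attention-style gating over a set of speech deepfake detectors.

    Each expert maps a waveform to an embedding; a learned query scores the
    (projected) expert embeddings, and softmax weights mix them before a
    shared real/fake head. Dimensions and frozen experts are assumptions.
    """

    def __init__(self, experts, embed_dim: int = 256):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.proj = nn.ModuleList([nn.LazyLinear(embed_dim) for _ in experts])
        self.query = nn.Parameter(torch.randn(embed_dim))
        self.head = nn.Linear(embed_dim, 1)                            # single logit: real vs. fake

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        embs = [p(e(wav)) for e, p in zip(self.experts, self.proj)]    # per-expert embeddings
        stacked = torch.stack(embs, dim=1)                             # (batch, n_experts, embed_dim)
        scores = (stacked * self.query).sum(-1) / stacked.size(-1) ** 0.5  # scaled dot-product gate
        weights = scores.softmax(dim=1).unsqueeze(-1)                  # input-dependent expert weights
        fused = (weights * stacked).sum(dim=1)
        return self.head(fused)
```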
The experimental results demonstrate that the proposed method outperforms existing approaches across multiple datasets, which is a strong indicator of its effectiveness. The authors participated in the SAFE challenge, achieving first place across all tasks, which adds credibility to their claims. However, the paper lacks detailed information about the datasets used, including their sizes, diversity, and how they were split for training and testing. This information is crucial for assessing the generalizability of the results.
The paper does not provide sufficient implementation details or access to code repositories, which raises concerns about reproducibility. While the methodology is described, the absence of a clear path for other researchers to replicate the experiments limits the impact of the findings. Providing a GitHub repository or similar would greatly enhance the paper's contribution to the field.
One limitation is the potential overfitting to the datasets used, especially if they are not sufficiently diverse. Additionally, the reliance on multiple experts may increase computational complexity, which could hinder real-time applications. The paper does not address how the model performs under adversarial conditions or with varying qualities of input audio, which is critical for practical deployment.
The implications of this research are significant, particularly in the context of increasing concerns about misinformation and biometric spoofing. Effective detection methods for speech deepfakes can enhance security in various applications, including virtual assistants and online communications. However, the potential for misuse of such technologies also warrants careful consideration of ethical implications and the need for responsible deployment.
Spatial target speaker extraction isolates a desired speaker's voice in multi-speaker environments using spatial information, such as the direction of arrival (DoA). Although recent deep neural network (DNN)-based discriminative methods have shown significant performance improvements, the potential of generative approaches, such as generative adversarial networks (GANs), remains largely unexplored for this problem. In this work, we demonstrate that a GAN can effectively leverage both noisy mixtures and spatial information to extract and generate the target speaker's speech. By conditioning the GAN on intermediate features of a discriminative spatial filtering model in addition to the DoA, we enable steerable target extraction with a spatial resolution of 5 degrees, outperforming state-of-the-art discriminative methods on perceptual quality-based objective metrics.
Primary: Fraunhofer IIS
All Institutions: Fraunhofer IIS, Am Wolfsmantel 33, 91058 Erlangen
The main contribution of this paper is the introduction of a GAN-based framework for spatial target speaker extraction that effectively utilizes spatial information and intermediate features from discriminative models, demonstrating superior performance in perceptual quality metrics compared to existing methods. The comprehensive methodology and rigorous experimental evaluation underscore its significance in advancing the field of audio signal processing.
The paper introduces a novel GAN-based approach for multi-microphone spatial target speaker extraction, leveraging both spatial information (DoA) and intermediate features from discriminative models. The methodology is well-structured, employing an end-to-end training framework that combines adversarial, reconstruction, and feature-matching losses. The use of a U-Net-like architecture for the generator and a multi-scale STFT-based discriminator is appropriate for the task, allowing for effective feature extraction and conditioning. The conditioning on both DoA and intermediate discriminative features represents a significant methodological advancement, enhancing the model's ability to isolate target speakers in complex acoustic environments.
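A compact way to see how the three loss terms interact is the generator-side objective sketched below, combining least-squares adversarial, L1 reconstruction, and discriminator feature-matching terms; the loss weights and the least-squares form are placeholders, not the paper's exact formulation.

```python
import torch.nn.functional as F

def generator_loss(fake_logits, fake_feats, real_feats, fake_audio, target_audio,
                   lambda_rec: float = 45.0, lambda_fm: float = 2.0):
    """Generator-side objective: adversarial + reconstruction + feature matching.

    fake_logits: list of discriminator outputs on generated audio (one per STFT scale).
    fake_feats/real_feats: lists of intermediate discriminator features.
    The loss weights are illustrative, not the paper's values.
    """
    adv = sum(((1.0 - d) ** 2).mean() for d in fake_logits)             # least-squares adversarial term
    rec = F.l1_loss(fake_audio, target_audio)                           # waveform reconstruction
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))  # feature matching
    return adv + lambda_rec * rec + lambda_fm * fm
```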
The experimental setup is robust, utilizing a comprehensive dataset generated through simulated acoustic environments. The authors provide a clear comparison against state-of-the-art discriminative methods, with results indicating superior performance in perceptual quality metrics (PESQ and SCOREQ) while maintaining strong spatial selectivity. The inclusion of multiple SNR levels in testing strengthens the evaluation, demonstrating the model's effectiveness across varying conditions. However, the reliance on synthetic data may limit the generalizability of the results in real-world applications.
The paper provides detailed descriptions of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors do mention the use of specific datasets and training configurations, which is helpful for reproducing the experiments.
One limitation of the proposed method is its dependence on the quality of the simulated data, which may not fully capture the complexities of real-world acoustic environments. Additionally, while the model shows improved performance in perceptual metrics, it may still struggle in scenarios with very low SNR or highly reverberant conditions. The paper also does not explore the computational efficiency of the proposed GAN model, which could be a concern for real-time applications.
The proposed method has significant implications for various applications, including hearing aids, conference systems, and automatic speech recognition. By improving the ability to isolate target speakers in noisy environments, this research could enhance communication technologies and accessibility tools for individuals with hearing impairments. The advancements in generative modeling for audio tasks could also inspire further research in related fields, such as speech synthesis and enhancement.
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-subject variability that limits the efficacy of generalized models. To address these issues, we propose Brainprint-Modulated Target Speaker Extraction (BM-TSE), a novel framework for personalized and high-fidelity extraction. BM-TSE first employs a spatio-temporal EEG encoder with an Adaptive Spectral Gain (ASG) module to extract stable features resilient to non-stationarity. The core of our framework is a personalized modulation mechanism, where a unified brainmap embedding is learned under the joint supervision of subject identification (SID) and auditory attention decoding (AAD) tasks. This learned brainmap, encoding both static user traits and dynamic attentional states, actively refines the audio separation process, dynamically tailoring the output to each user. Evaluations on the public KUL and Cocktail Party datasets demonstrate that BM-TSE achieves state-of-the-art performance, significantly outperforming existing methods. Our code is publicly accessible at: https://github.com/rosshan-orz/BM-TSE.
All Institutions: School of Data Science, School of Artificial Intelligence
The paper presents the Brainprint-Modulated Target Speaker Extraction (BM-TSE) framework, which significantly advances personalized neuro-steered audio extraction by integrating EEG signal processing with innovative modulation techniques. The methodology is robust and well-structured, addressing critical challenges in the field and demonstrating substantial technical contributions with promising experimental results.
The proposed BM-TSE framework introduces a robust spatio-temporal EEG encoder combined with an Adaptive Spectral Gain (ASG) module, which addresses the non-stationarity of EEG signals effectively. The architecture's unique feature is the personalized brainmap modulation mechanism that integrates subject identification and auditory attention decoding tasks, enabling dynamic audio refinement based on individual neural patterns. This approach is innovative as it leverages stable, user-specific EEG features to enhance target speaker extraction, which is a significant advancement over existing generalized models.
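One way to picture the personalized modulation is a FiLM-style layer driven by the brainmap embedding, trained jointly with subject-ID and attention-decoding heads; the sketch below follows that reading, with the GRU encoder, dimensions, and two-class attention target chosen for illustration rather than taken from the released BM-TSE code.

```python
import torch
import torch.nn as nn

class BrainmapModulation(nn.Module):
    """FiLM-style modulation of separator features by a brainmap embedding,
    supervised jointly by subject-ID and auditory-attention heads (schematic)."""

    def __init__(self, eeg_dim: int, audio_channels: int, n_subjects: int, embed_dim: int = 128):
        super().__init__()
        self.eeg_encoder = nn.GRU(eeg_dim, embed_dim, batch_first=True)
        self.to_film = nn.Linear(embed_dim, 2 * audio_channels)   # per-channel scale and shift
        self.sid_head = nn.Linear(embed_dim, n_subjects)          # subject identification
        self.aad_head = nn.Linear(embed_dim, 2)                   # attended-speaker decoding

    def forward(self, eeg: torch.Tensor, audio_feats: torch.Tensor):
        # eeg: (batch, time, eeg_dim); audio_feats: (batch, channels, frames)
        _, h = self.eeg_encoder(eeg)
        brainmap = h.squeeze(0)                                    # (batch, embed_dim)
        gamma, beta = self.to_film(brainmap).chunk(2, dim=-1)
        modulated = audio_feats * (1.0 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)
        return modulated, self.sid_head(brainmap), self.aad_head(brainmap)
```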
The experiments conducted on the KUL and Cocktail Party datasets demonstrate the model's superiority over existing methods, achieving state-of-the-art results in terms of speech quality and intelligibility. The ablation studies provide a clear understanding of the contributions of each component, reinforcing the importance of the proposed architecture. The metrics used for evaluation, including SI-SDR, PESQ, and STOI, are appropriate and relevant for assessing the model's performance in audio processing tasks.
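For reference, SI-SDR, one of the metrics cited above, has a standard closed form; a minimal batched implementation is sketched below, assuming time-domain signals of shape (batch, samples).

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for batched time-domain signals (batch, samples)."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref                        # projection of the estimate onto the reference
    noise = est - target
    return 10.0 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
```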
The paper provides sufficient implementation details, including the use of PyTorch, the training setup, and the datasets. The code is publicly available on GitHub, which enhances reproducibility. However, the absence of a live demo or interactive visualization limits immediate accessibility for other researchers.
One limitation is the reliance on EEG data, which may not be universally applicable across all populations or settings due to inter-subject variability. Additionally, while the model shows promise, the performance may vary with different types of auditory stimuli or in more complex acoustic environments. The paper could also benefit from a discussion on the computational efficiency and real-time applicability of the proposed framework.
The BM-TSE framework has significant implications for the development of advanced hearing aids and assistive listening technologies, potentially improving the quality of life for individuals with hearing impairments. By personalizing audio extraction based on neural signatures, this research paves the way for more adaptive and user-centered auditory processing systems.
Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great potential to alleviate the need for experts to label long audio recordings by hand. However, they still typically rely on the availability of labelled data for model training, restricting applicability to a few species and datasets. In this work, we build the first fully unsupervised algorithm to decompose birdsong recordings into sequences of syllables. We first detect syllable events, then cluster them to extract templates -- syllable representations -- before performing matching pursuit to decompose the recording as a sequence of syllables. We evaluate our automatic annotations against human labels on a dataset of Bengalese finch songs and find that our unsupervised method achieves high performance. We also demonstrate that our approach can distinguish individual birds within a species through their unique vocal signatures, for both Bengalese finches and another species, the great tit.
The main contribution of this paper is the development of a fully unsupervised method for annotating birdsongs at the syllable level, which addresses the significant challenge of data labeling in bioacoustics. The innovative approach and promising results position this work as a valuable addition to the field, with potential applications in conservation and animal behavior studies.
The paper presents a fully unsupervised algorithm for identifying and segmenting syllables in birdsong recordings, which is a significant advancement given the reliance on labeled data in previous methods. The methodology involves detecting syllable events, clustering them to create templates, and using a matching pursuit approach to decompose recordings into syllable sequences. The use of PCA and HDBSCAN for clustering, along with a split-merge strategy to refine syllable templates, demonstrates a thoughtful approach to handling the complexities of audio data. However, the methodology could benefit from more detailed explanations of the parameter choices and the impact of different thresholds on performance.
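A bare-bones version of the detect-then-cluster stage might look like the sketch below, which pairs onset detection with PCA and HDBSCAN to form per-cluster templates; the window length, mel features, and clustering parameters are guesses rather than the paper's settings, and the split-merge refinement and matching-pursuit decomposition are omitted.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN   # scikit-learn >= 1.3

def cluster_syllables(y: np.ndarray, sr: int, win_s: float = 0.12):
    """Detect syllable candidates at onsets, embed them, and cluster into templates."""
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")
    win = int(win_s * sr)
    slices = [y[o:o + win] for o in onsets if o + win <= len(y)]
    feats = np.stack([librosa.feature.melspectrogram(y=s, sr=sr, n_mels=64).flatten()
                      for s in slices])
    emb = PCA(n_components=20).fit_transform(np.log1p(feats))
    labels = HDBSCAN(min_cluster_size=5).fit_predict(emb)
    templates = {c: np.mean([s for s, l in zip(slices, labels) if l == c], axis=0)
                 for c in set(labels) if c != -1}   # one averaged template per cluster
    return labels, templates
```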
The experiments are well-structured, utilizing two distinct datasets (Bengalese finches and great tits) to validate the method's effectiveness. The evaluation metrics, including precision and recall, are appropriate for the task, and the results show promising performance, particularly in distinguishing individual birds. However, the paper lacks a comprehensive comparison with existing methods, which would provide context for the reported performance metrics. The choice of hyperparameters appears to be somewhat arbitrary, and further tuning could potentially enhance results.
The paper provides a reasonable level of detail regarding the experimental setup and methodology, but it lacks specific URLs for code or data access, which hinders reproducibility. The absence of a publicly available implementation means that other researchers cannot easily replicate the findings or build upon the work. Including a GitHub repository or similar would significantly improve this aspect.
The paper acknowledges that the method may not perform well in the presence of structured noise, which is a significant limitation for real-world applications. Additionally, the reliance on a fixed-size support set for template generation may restrict the method's adaptability to varying datasets. The potential for oversplitting clusters is also a concern, as it could lead to inaccuracies in syllable identification.
The implications of this research are substantial, particularly in the fields of bioacoustics and wildlife conservation. By enabling the automatic annotation of birdsong, the method could facilitate large-scale studies of bird populations and behaviors, contributing to biodiversity monitoring and conservation efforts. Furthermore, the approach has the potential to be adapted for other taxa, broadening its applicability beyond avian species.
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art performance in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate, object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
Primary: UC San Diego
The main contribution of this paper is the introduction of StereoFoley, an end-to-end framework for generating object-aware stereo audio from video, addressing a critical gap in the field of video-to-audio generation. This work significantly advances the state-of-the-art by combining innovative methodologies with a strong experimental foundation, paving the way for future research and applications in audio synthesis.
The methodology presented in StereoFoley is robust, integrating various components such as video analysis, object tracking, and audio synthesis to create a comprehensive framework for stereo audio generation. The introduction of a synthetic data generation pipeline to address the limitations of existing datasets is a notable strength, showcasing innovation in data handling. The use of latent diffusion models and the design of a two-stage audio generation process enhances the model's performance. However, the reliance on synthetic data may raise questions about the generalizability of the results to real-world scenarios.
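The dynamic-panning and distance-based-loudness controls in the synthetic pipeline can be approximated with textbook DSP. The toy function below applies a constant-power pan law and an inverse-distance gain to a mono source given a tracked object trajectory; it is a stand-in sketch, not the paper's actual synthesis code, and the pan/distance inputs are assumed to be pre-resampled to the audio rate.

```python
import numpy as np

def spatialize(mono: np.ndarray, pan: np.ndarray, dist: np.ndarray,
               ref_dist: float = 1.0) -> np.ndarray:
    """Constant-power panning plus inverse-distance gain for a mono source.

    pan:  per-sample horizontal position in [-1, 1] (left to right).
    dist: per-sample object distance in metres.
    """
    theta = (pan + 1.0) * np.pi / 4.0              # map [-1, 1] to [0, pi/2]
    gain = ref_dist / np.maximum(dist, ref_dist)   # clamp so gain <= 1 near the listener
    left = np.cos(theta) * mono * gain
    right = np.sin(theta) * mono * gain
    return np.stack([left, right], axis=0)         # (2, samples) stereo output
```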
The experiments are well-structured, comparing the proposed model against state-of-the-art baselines. The use of both objective metrics and a human listening study provides a balanced evaluation of the model's performance. The results indicate that StereoFoley achieves competitive performance, particularly in object-aware audio generation. However, the marginal differences in some metrics suggest that while improvements are present, they may not be as pronounced as claimed.
The paper provides sufficient details regarding the model architecture, training process, and evaluation metrics, which supports reproducibility. However, the absence of publicly available code or datasets limits the ability of other researchers to fully replicate the study. The authors should consider releasing their code and synthetic datasets to enhance reproducibility and facilitate further research.
The primary limitation of the study is its reliance on synthetic data, which may not fully capture the complexities of real-world audio-visual interactions. Additionally, the evaluation metrics used may not be entirely suitable for high-sample-rate stereo sound, potentially underrepresenting the model's capabilities. The paper also acknowledges that the performance of the model may vary based on the quality of the input video data.
The implications of this research are significant, particularly in fields such as film production, gaming, and virtual reality, where high-quality audio-visual synchronization is crucial. The ability to generate object-aware stereo audio could enhance user experiences in immersive environments. Furthermore, the framework could serve as a foundation for future developments in audio generation, potentially influencing related areas such as sound design and machine learning applications in multimedia.