Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field towards multi-stage pipelines that rely on pre-trained speech tokenizers, but these create a semantic-acoustic divide, limiting holistic and expressive speech generation. We resolve this dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations and present VoxCPM, a novel tokenizer-free TTS model. Our framework introduces a differentiable quantization bottleneck that induces natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Language Model (RALM) recovers fine-grained acoustic details. This hierarchical semantic-acoustic representation guides a local diffusion-based decoder to generate high-fidelity speech latents. Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating the dependency on external speech tokenizers. Trained on a massive bilingual corpus of 1.8 million hours, our VoxCPM-0.5B model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers expressive and stable synthesis. Moreover, VoxCPM comprehends the input text to infer and generate appropriate prosody and style, delivering speech with context-aware expressiveness and natural flow. To facilitate community-driven research and development, VoxCPM is publicly available under the Apache 2.0 license.
Primary: Tsinghua University
All Institutions: Tsinghua University, Tsinghua Shenzhen International Graduate School, ModelBest
The paper introduces VoxCPM, a tokenizer-free TTS model that effectively resolves the expressivity-stability trade-off in speech synthesis through hierarchical semantic-acoustic modeling. This work significantly advances the field by providing a robust framework that enhances both the quality and expressiveness of generated speech, while also addressing critical limitations of existing approaches.
The paper presents a novel hierarchical architecture for text-to-speech synthesis that effectively separates semantic-prosodic planning from acoustic rendering. The introduction of a differentiable quantization bottleneck (FSQ) is a significant innovation, allowing for a semi-discrete representation that stabilizes the model while preserving expressive capabilities. The use of a Text-Semantic Language Model (TSLM) and a Residual Acoustic Language Model (RALM) showcases a thoughtful approach to addressing the limitations of existing models that either rely on discrete tokens or continuous representations. The methodology is well-structured, with clear delineation of roles for each component, which is essential for achieving high-quality speech synthesis.
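To make the role of the FSQ bottleneck concrete, the following minimal sketch illustrates a finite-scalar-quantization layer with a straight-through estimator; the latent dimensionality and number of levels are illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal sketch of a finite scalar quantization (FSQ) bottleneck with a
# straight-through estimator, in the spirit of the paper's differentiable
# quantization step. Dimension and level counts are illustrative assumptions.
import torch
import torch.nn as nn

class FSQBottleneck(nn.Module):
    def __init__(self, dim: int = 8, levels: int = 5):
        super().__init__()
        self.dim, self.levels = dim, levels

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Bound each latent dimension to (-1, 1), then snap it to one of
        # `levels` evenly spaced values.
        z = torch.tanh(z)
        half = (self.levels - 1) / 2
        z_q = torch.round(z * half) / half
        # Straight-through estimator: quantized values in the forward pass,
        # identity gradient in the backward pass.
        return z + (z_q - z).detach()

# Usage: semi-discrete representations for a batch of 2 sequences of length 10.
codes = FSQBottleneck()(torch.randn(2, 10, 8))
```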
The experiments are robust, utilizing a massive bilingual corpus of 1.8 million hours, which significantly enhances the model's training and evaluation. The results demonstrate state-of-the-art performance on multiple benchmarks, including zero-shot TTS capabilities, which is a critical advancement in the field. The paper provides extensive ablation studies that validate the contributions of the FSQ and RALM components, reinforcing the effectiveness of the proposed architecture.
The paper lacks detailed implementation specifics that would facilitate full reproducibility, such as hyperparameter settings and training configurations for the models beyond general descriptions. While it mentions the use of the Megatron framework and provides some training details, a more comprehensive breakdown would be beneficial for other researchers looking to replicate the results.
The model's multilingual capabilities are limited, primarily optimized for English and Chinese, which may restrict its applicability in diverse linguistic contexts. Additionally, the controllability of speech attributes is not fully developed, and the current AudioVAE supports only 16kHz audio generation, which may not meet high-fidelity application standards. These limitations suggest areas for future research and development.
The advancements in TTS technology presented in this paper have significant implications for various applications, including virtual assistants, gaming, and accessibility tools. However, the potential for misuse, such as voice cloning for deceptive purposes, raises ethical concerns that must be addressed through responsible deployment and detection mechanisms.
Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, and costly, which can limit the reproducibility of clinical studies and place a strain on healthcare resources. While automated methods exist, they have significant drawbacks. Reference-based approaches require transcriptions or healthy speech samples, restricting them to read speech and limiting their applicability. Existing reference-free methods are also flawed; supervised models often learn spurious shortcuts from data, while handcrafted features are often unreliable and restricted to specific speech tasks. This paper introduces XPPG-PCA (x-vector phonetic posteriorgram principal component analysis), a novel, unsupervised, reference-free method for speech severity evaluation. Using three Dutch oral cancer datasets, we demonstrate that XPPG-PCA performs comparably to, or exceeds established reference-based methods. Our experiments confirm its robustness against data shortcuts and noise, showing its potential for real-world clinical use. Taken together, our results show that XPPG-PCA provides a robust, generalizable solution for the objective assessment of speech pathology, with the potential to significantly improve the efficiency and reliability of clinical evaluations across a range of disorders. An open-source implementation is available.
Primary: Netherlands Cancer Institute
All Institutions: Netherlands Cancer Institute, University of Groningen, University Medical Center Groningen, University of Amsterdam, CNRS, Nagoya University
The main contribution of this paper is the introduction of XPPG-PCA, a novel reference-free method for automatic speech severity evaluation that outperforms traditional approaches, demonstrating robustness against data shortcuts and noise. This work represents a meaningful advancement in the automation of speech pathology assessments, with potential applications in clinical practice and research.
The proposed methodology, XPPG-PCA, is innovative as it combines x-vectors and phonetic posteriorgrams to create a reference-free evaluation method for speech severity. The use of principal component analysis (PCA) in an unsupervised manner is a significant departure from traditional supervised methods, which often rely on reference data. The approach is well-structured and addresses the limitations of existing methods by focusing on robustness against data shortcuts and noise. However, the reliance on PCA without labels raises questions about the interpretability of the derived severity scores.
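As a rough illustration of the described pipeline, the sketch below pools x-vector and phonetic-posteriorgram (PPG) features per utterance and reads the first principal component of an unsupervised PCA as a severity score; the feature dimensions and the sign convention are assumptions, not the authors' exact implementation.

```python
# Sketch of the XPPG-PCA idea as described in the review: pool x-vector and
# PPG features per utterance, fit an unsupervised PCA, and use the first
# principal component as a severity score. Dimensions are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def severity_scores(xvectors: np.ndarray, ppg_means: np.ndarray) -> np.ndarray:
    """xvectors: (N, Dx) utterance embeddings; ppg_means: (N, Dp) time-averaged PPGs."""
    feats = StandardScaler().fit_transform(np.hstack([xvectors, ppg_means]))
    pc1 = PCA(n_components=1).fit_transform(feats)[:, 0]
    # The sign of a principal component is arbitrary; in practice it would be
    # oriented so that larger values mean higher severity (an assumption).
    return pc1

scores = severity_scores(np.random.randn(50, 192), np.random.randn(50, 40))
```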
The experiments conducted on three Dutch oral cancer datasets are comprehensive and demonstrate the effectiveness of the proposed method. The results indicate that XPPG-PCA performs comparably or better than established reference-based methods, showcasing its potential for real-world clinical applications. The evaluation metrics used, including correlation and root mean square error, provide a solid basis for assessing performance across different datasets and conditions. The ablation studies further enhance the credibility of the findings by isolating the contributions of individual components of the model.
The paper mentions that an open-source implementation is available, which is crucial for reproducibility. However, detailed implementation specifics, such as the exact configurations used during training and evaluation, could enhance reproducibility further. The inclusion of code and clear documentation would facilitate replication of the results by other researchers.
One limitation is the potential overfitting to the specific datasets used for training and evaluation, which may not generalize to other speech disorders outside the studied populations. Additionally, the unsupervised nature of the method may lead to challenges in interpreting the severity scores, as they do not directly correlate with human judgments without reference data. The reliance on PCA could also obscure the contributions of individual features, making it difficult to ascertain which aspects of speech are most indicative of severity.
The proposed method has significant implications for the field of speech pathology, particularly in improving the efficiency and reliability of speech assessments in clinical settings. By reducing the reliance on expert evaluations, XPPG-PCA could lower costs and increase accessibility to speech evaluations for patients. The open-source nature of the project also encourages further research and development in this area, potentially leading to advancements in automated speech analysis across various disorders.
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through a diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, UC San Diego
The main contribution of this paper is the introduction of SoundReactor, a novel framework for online video-to-audio generation that operates frame-by-frame, enabling real-time audio synthesis with low latency and high-quality synchronization. This work represents a meaningful advancement in the field of audio-visual machine learning, with significant implications for interactive applications and content creation.
The paper introduces a novel approach to frame-level online video-to-audio generation, which is a significant advancement over traditional offline methods. The use of a decoder-only causal transformer for audio generation, combined with grid features from the DINOv2 vision encoder, is innovative and effectively addresses the challenges of maintaining causality and low latency. The methodology is well-structured, with a clear focus on end-to-end causality and synchronization between audio and video. The training process, which involves diffusion pre-training followed by consistency fine-tuning, is a thoughtful approach to enhance model performance.
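The per-frame vision conditioning described above can be pictured with the following sketch, which aggregates DINOv2 patch features into a single token per frame via attention pooling; the pooling mechanism and the patch count are illustrative assumptions.

```python
# Sketch of per-frame vision conditioning: DINOv2 grid (patch) features for
# each frame are aggregated into one conditioning token. The attention-pooling
# aggregator is an assumption, not necessarily the paper's exact scheme.
import torch
import torch.nn as nn

class FrameTokenAggregator(nn.Module):
    def __init__(self, dim: int = 384):  # 384 = DINOv2 ViT-S feature width
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (frames, num_patches, dim) -> (frames, dim)
        q = self.query.expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_feats, patch_feats)
        return pooled.squeeze(1)

# One token per frame for 8 frames with an illustrative patch count of 1024.
tokens = FrameTokenAggregator()(torch.randn(8, 1024, 384))
```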
The experiments are robust, utilizing a benchmark of diverse gameplay videos from AAA titles, which adds credibility to the findings. The reported results demonstrate that the model generates high-quality audio that is semantically and temporally aligned with the video content. Objective and human evaluations further validate the effectiveness of the model, although the paper could benefit from more detailed comparisons with existing state-of-the-art methods to contextualize its performance.
The paper provides a clear description of the model architecture and training procedure, which is essential for reproducibility. However, it lacks detailed information regarding the datasets used, including their size and specific characteristics, which could hinder full reproducibility by other researchers. The availability of demo samples is a positive aspect, but a code repository would further enhance reproducibility.
One limitation is the reliance on a specific type of video content (gameplay videos), which may not generalize well to other video domains. Additionally, while the model achieves low per-frame latency, the computational requirements may limit its applicability in resource-constrained environments. The paper could also explore the model's performance on a wider variety of audio-visual content to assess its versatility.
The potential applications of this research are significant, particularly in interactive media, live content creation, and generative models for virtual environments. The ability to generate audio in real-time from video could revolutionize fields such as gaming, virtual reality, and content creation, making it easier for creators to produce immersive experiences. The implications for accessibility in media production are also noteworthy, as this technology could enable more inclusive content creation.
Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
Primary: University of Cambridge
All Institutions: University of Cambridge, Imperial College London, Huawei, Koç University, Queen Mary University of London
The main contribution of this paper is the introduction of HRTFformer, a novel transformer-based model for personalized HRTF upsampling that significantly improves spatial audio rendering accuracy. This work represents a meaningful advancement in the field of immersive audio, combining innovative methodologies with rigorous experimental validation to address a critical challenge in personalized audio experiences.
The proposed methodology, HRTFformer, leverages a transformer-based architecture to address the challenges of HRTF upsampling. The use of spherical harmonics for data representation is innovative, allowing the model to effectively capture spatial correlations. The introduction of a neighbor dissimilarity loss function is a significant contribution that enhances spatial coherence, addressing a key limitation in existing methods. The combination of global attention mechanisms with local feature extraction through convolutional blocks is well-justified and effectively implemented.
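A minimal rendering of the neighbor dissimilarity idea is given below: the predicted log-magnitude response at each direction is penalized for deviating from those of its spherical neighbors. The k-nearest-neighbor construction and the L1 penalty are assumptions; the paper's exact formulation may differ.

```python
# Sketch of a neighbor dissimilarity loss promoting magnitude smoothness
# across adjacent HRTF directions. Neighbor indices would be precomputed from
# the measurement grid; the L1 penalty is an assumed choice.
import torch

def neighbor_dissimilarity_loss(log_mag: torch.Tensor,
                                neighbors: torch.Tensor) -> torch.Tensor:
    """
    log_mag:   (num_directions, num_freq_bins) predicted log-magnitude HRTFs.
    neighbors: (num_directions, k) indices of each direction's k nearest
               neighbors on the sphere.
    """
    diffs = log_mag.unsqueeze(1) - log_mag[neighbors]   # (D, k, F)
    return diffs.abs().mean()

# Toy usage: 440 directions, 128 frequency bins, 4 neighbors each.
loss = neighbor_dissimilarity_loss(torch.randn(440, 128),
                                   torch.randint(0, 440, (440, 4)))
```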
The experiments are comprehensive, utilizing a well-defined dataset (SONICOM HRTF dataset) and evaluating the model across varying sparsity levels. The results demonstrate that HRTFformer outperforms existing methods in both objective metrics (LSD, ILD, ITD) and perceptual evaluations, indicating robust performance in realistic scenarios. The inclusion of ablation studies further strengthens the evaluation by isolating the contributions of different model components.
The paper provides sufficient implementation details, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Future work could benefit from sharing the model and dataset to facilitate further research.
While the model shows strong performance, the reliance on perceptual models for evaluations may not fully capture human listener experiences. Additionally, the paper acknowledges the limitations of dataset diversity, which could affect generalization to unseen subjects. Future work should address these limitations through subjective evaluations and potentially expanding the dataset.
The advancements in personalized HRTF rendering have significant implications for immersive audio applications, including virtual and augmented reality, gaming, and assistive technologies for the hearing impaired. By improving the practicality of personalized HRTF measurements, this work could enhance user experiences across various domains.
Neural audio codecs have recently emerged as powerful tools for high-quality and low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches either rely on a single quantizer that only processes the speech domain, or on multiple quantizers that are not well suited for downstream tasks. To address this issue, we propose MelCap, a unified "one-codebook-for-all" neural codec that effectively handles speech, music, and general sound. By decomposing audio reconstruction into two stages, our method preserves more acoustic details than previous single-codebook approaches, while achieving performance comparable to mainstream multi-codebook methods. In the first stage, audio is transformed into mel-spectrograms, which are compressed and quantized into compact single tokens using a 2D tokenizer. A perceptual loss is further applied to mitigate the over-smoothing artifacts observed in spectrogram reconstruction. In the second stage, a vocoder recovers waveforms from the discrete mel tokens in a single forward pass, enabling real-time decoding. Both objective and subjective evaluations demonstrate that MelCap achieves quality comparable to state-of-the-art multi-codebook codecs, while retaining the computational simplicity of a single-codebook design, thereby providing an effective representation for downstream tasks.
Primary: Chinese University of Hong Kong
All Institutions: Chinese University of Hong Kong, International Digital Economy Academy (IDEA), Peking University
MelCap presents a unified neural codec that effectively compresses diverse audio types using a single codebook, achieving high fidelity while simplifying the architecture. This paper makes a meaningful contribution to the field of audio processing by addressing key challenges in audio codec design and demonstrating strong performance through rigorous evaluation.
The methodology proposed in MelCap is innovative, combining a two-stage approach to audio codec design that leverages mel-spectrograms and discrete tokens. The use of perceptual loss to mitigate over-smoothing in audio reconstruction is a significant contribution, as it addresses a common issue in audio codecs. The architecture's reliance on a single codebook for various audio types (speech, music, general sounds) is a notable simplification over existing multi-codebook systems, enhancing both efficiency and usability. The integration of a GAN-based vocoder for waveform recovery from mel-spectrogram tokens is well-justified and effectively stabilizes training.
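The two-stage structure described above can be summarized by the following schematic sketch, in which a 2D tokenizer quantizes mel-spectrograms against a single codebook and a vocoder maps the reconstructed mel frames to a waveform; all module internals are toy stand-ins rather than the paper's networks.

```python
# Schematic sketch of MelCap's two-stage design: a 2D tokenizer quantizes
# mel-spectrograms into single-codebook tokens, then a vocoder maps decoded
# mel frames back to a waveform in one forward pass. Modules are stand-ins.
import torch
import torch.nn as nn

class MelCapSketch(nn.Module):
    def __init__(self, n_mels: int = 80, codebook_size: int = 8192, dim: int = 256):
        super().__init__()
        self.encoder = nn.Conv2d(1, dim, kernel_size=4, stride=4)    # 2D tokenizer (stand-in)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose2d(dim, 1, kernel_size=4, stride=4)
        self.vocoder = nn.Conv1d(n_mels, 1, kernel_size=1)            # vocoder (stand-in)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # Stage 1: encode the mel-spectrogram and snap each cell to its
        # nearest entry in the single shared codebook.
        z = self.encoder(mel.unsqueeze(1))                            # (B, D, F', T')
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.size(1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        zq = self.codebook(idx).view(z.size(0), z.size(2), z.size(3), -1).permute(0, 3, 1, 2)
        mel_hat = self.decoder(zq).squeeze(1)
        # Stage 2: recover the waveform from the reconstructed mel frames.
        return self.vocoder(mel_hat)

wave = MelCapSketch()(torch.randn(2, 80, 100))
```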
The experimental evaluation is robust, utilizing both objective metrics (VISQOL, LSD, Mel Distance, STFT Distance) and subjective listening tests to validate the codec's performance. The results indicate that MelCap achieves competitive perceptual quality compared to state-of-the-art multi-codebook codecs, which is a strong endorsement of its effectiveness. The ablation studies conducted to assess the impact of different loss functions provide valuable insights into the model's performance and optimization.
The paper includes a commitment to reproducibility, stating that all code, model weights, and training data will be publicly released upon acceptance. This transparency is crucial for the validation of results and for fostering further research in the field. However, specific implementation details, such as hyperparameters and training configurations, could be elaborated upon to enhance reproducibility further.
One limitation is the reliance on a single codebook, which may not capture the full diversity of audio signals as effectively as multi-codebook approaches in certain scenarios. Additionally, while the model demonstrates strong performance on the datasets used, its generalization to other audio domains or more complex soundscapes remains to be thoroughly tested.
The implications of this work are significant for audio compression technologies, particularly in applications requiring high fidelity and low bitrate, such as streaming services, telecommunications, and virtual reality. The ability to handle diverse audio types with a single codec could lead to more efficient audio processing pipelines and improved user experiences in various multimedia applications.
Spoofing-robust speaker verification (SASV) combines the tasks of speaker and spoof detection to authenticate speakers under adversarial settings. Many SASV systems rely on fusion of speaker and spoof cues at the embedding, score, or decision level, based on independently trained subsystems. In this study, we retain a similar modularity between the two subsystems by integrating their outputs using trainable back-end classifiers. In particular, we explore various approaches for directly optimizing the back-end for the recently proposed SASV performance metric (a-DCF) as a training objective. Our experiments on the ASVspoof 5 dataset demonstrate two important findings: (i) nonlinear score fusion consistently improves a-DCF over linear fusion, and (ii) the combination of weighted cosine scoring for speaker detection with SSL-AASIST for spoof detection achieves state-of-the-art performance, reducing min a-DCF to 0.196 and SPF-EER to 7.6%. These contributions highlight the importance of modular design, calibrated integration, and task-aligned optimization for advancing robust and interpretable SASV systems.
Primary: Bursa Technical University
All Institutions: Bursa Technical University, University of Eastern Finland
This paper presents a novel approach to joint optimization of speaker and spoof detectors, significantly advancing the field of spoofing-robust automatic speaker verification. The integration of modular design with task-aligned optimization showcases a meaningful contribution to the robustness and interpretability of SASV systems.
The paper presents a joint optimization framework for speaker verification and spoof detection that maintains modularity while allowing for integrated training. The architecture is well-structured, utilizing trainable back-end classifiers for score fusion and employing calibration layers to align score distributions. The methodology is innovative in its approach to directly optimizing for the a-DCF metric, which is crucial for SASV performance. The nonlinear score fusion method is particularly noteworthy, as it demonstrates a clear advancement over traditional linear methods.
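To illustrate what task-aligned optimization of the back-end might look like, the sketch below trains a small nonlinear fusion network with a differentiable surrogate of the a-DCF, replacing hard error counts with sigmoid soft counts; the cost weights, threshold, and sigmoid sharpness are placeholders rather than the paper's settings.

```python
# Sketch of a trainable nonlinear score-fusion back-end optimized with a
# differentiable a-DCF surrogate: hard error counts become sigmoid soft
# counts. Weights, priors, and sharpness are placeholder assumptions.
import torch
import torch.nn as nn

fusion = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))  # (ASV, CM) -> fused score

def soft_a_dcf(scores, labels, tau=0.0, alpha=10.0,
               c_miss=1.0, c_fa_non=1.0, c_fa_spf=1.0):
    """labels: 0 = target, 1 = non-target speaker, 2 = spoof."""
    s = torch.sigmoid(alpha * (scores.squeeze(-1) - tau))   # soft "accept" decision
    p_miss = (1 - s)[labels == 0].mean()                    # soft miss rate on targets
    p_fa_non = s[labels == 1].mean()                         # soft false accepts (non-targets)
    p_fa_spf = s[labels == 2].mean()                         # soft false accepts (spoofs)
    return c_miss * p_miss + c_fa_non * p_fa_non + c_fa_spf * p_fa_spf

scores = fusion(torch.randn(64, 2))                          # toy (ASV, CM) score pairs
loss = soft_a_dcf(scores, torch.arange(64) % 3)              # toy labels covering all classes
loss.backward()
```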
The experiments conducted on the ASVspoof 5 dataset are rigorous and well-documented, showcasing the effectiveness of the proposed methods. The reported results, including a minimum a-DCF of 0.196 and SPF-EER of 7.6%, indicate a significant improvement over previous benchmarks. The use of state-of-the-art techniques like SSL-AASIST for spoof detection further strengthens the experimental validation of the proposed approach.
While the paper provides a detailed description of the methodology and experimental setup, it lacks a dedicated section on implementation details or code availability, which may hinder reproducibility. Clearer guidelines or supplementary materials would enhance the ability of other researchers to replicate the results.
One limitation of the study is the reliance on a single dataset (ASVspoof 5), which may affect the generalizability of the results. Additionally, the paper does not address potential challenges in real-world applications, such as variations in environmental conditions or speaker characteristics.
The proposed SASV framework has significant implications for security applications where robust speaker verification is critical, such as in banking or personal device authentication. The modular design allows for flexibility in deployment and adaptation to different use cases, potentially leading to broader adoption in the industry.
While recent years have seen remarkable progress in music generation models, research on their biases across countries, languages, cultures, and musical genres remains underexplored. This gap is compounded by the lack of datasets and benchmarks that capture the global diversity of music. To address these challenges, we introduce GlobalDISCO, a large-scale dataset consisting of 73k music tracks generated by state-of-the-art commercial generative music models, along with paired links to 93k reference tracks in LAION-DISCO-12M. The dataset spans 147 languages and includes musical style prompts extracted from MusicBrainz and Wikipedia. The dataset is globally balanced, representing musical styles from artists across 79 countries and five continents. Our evaluation reveals large disparities in music quality and alignment with reference music between high-resource and low-resource regions. Furthermore, we find marked differences in model performance between mainstream and geographically niche genres, including cases where models generate music for regional genres that more closely align with the distribution of mainstream styles.
Primary: unknown
All Institutions: unknown
This paper introduces GlobalDISCO, a large-scale dataset aimed at exploring biases in AI-generated music across cultures and languages. The comprehensive methodology and significant findings regarding disparities in music generation quality across regions and genres contribute meaningfully to the field, emphasizing the importance of addressing biases in AI systems.
The methodology for constructing the GlobalDISCO dataset is robust and well-structured, involving a comprehensive process of artist selection, metadata enrichment, and music generation using multiple state-of-the-art models. The inclusion of diverse musical styles and languages is commendable, and the systematic approach to ensure global representation is a significant strength. However, the reliance on commercial black-box models limits transparency in the generation process, which could affect reproducibility.
The experimental evaluation is thorough, employing multiple audio embedding models and metrics (FAD and KAD) to assess the quality of generated music. The results clearly illustrate the disparities in performance across different regions and genres, providing valuable insights into the biases present in current music generation models. The use of both objective metrics and human evaluations strengthens the findings, although the paper could benefit from more detailed statistical analysis of the results.
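For readers unfamiliar with the FAD metric referenced here, the following sketch computes the Fréchet distance between Gaussians fitted to reference and generated embedding sets; the embeddings are random placeholders, whereas the paper uses several pretrained audio embedding models.

```python
# Minimal sketch of the Fréchet Audio Distance (FAD): fit Gaussians to
# reference and generated embedding sets and compute the Fréchet distance
# between them. Embeddings here are random placeholders.
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    mu_r, mu_g = ref_emb.mean(0), gen_emb.mean(0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

fad = frechet_audio_distance(np.random.randn(500, 128), np.random.randn(500, 128))
```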
While the methodology is detailed, the use of proprietary models without access to their internal workings poses challenges for reproducibility. The paper does provide a clear pipeline for dataset construction and evaluation, but the lack of access to the models themselves may hinder other researchers from replicating the results fully.
One limitation is the potential bias introduced by using commercial models, which may not generalize well to other contexts or datasets. Additionally, the dataset's focus on generated music may overlook the nuances of live performance and cultural context, which are crucial in music. The paper also does not address how the dataset will be maintained or updated over time, which is important for long-term usability.
The findings have significant implications for the music generation field, highlighting the need for more inclusive datasets that represent global musical diversity. By addressing biases in AI-generated music, this work can contribute to more equitable AI systems that respect and preserve cultural heritage. The release of GlobalDISCO as a public resource is a positive step towards fostering research in this area.
Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.
Primary: unknown
All Institutions: unknown
The paper presents a novel hybrid model for speech enhancement that combines Mamba and attention mechanisms, achieving state-of-the-art performance while maintaining efficiency. The methodology is innovative, but further details on implementation and broader evaluations would enhance its impact and reproducibility.
The proposed RWSA-MambaUNet model introduces a novel approach by integrating resolution-wise shared attention into the Mamba-U-Net architecture. This method allows for efficient attention sharing across different resolutions, which is a significant advancement over traditional attention mechanisms. The paper provides a clear description of the architecture and the rationale behind the design choices, demonstrating a solid understanding of both Mamba and attention mechanisms. However, the methodology could benefit from a deeper exploration of the theoretical underpinnings of the proposed attention mechanism and how it compares to existing methods in more detail.
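A compact way to picture resolution-wise shared attention is the sketch below, in which a single attention module per resolution level is reused by the encoder and decoder stages operating at that resolution; the dimensions and the wrapper class are illustrative assumptions.

```python
# Sketch of resolution-wise shared attention (RWSA): one attention module per
# U-Net resolution level, shared between the encoder and decoder blocks that
# operate at that resolution. Dimensions and wrapper are illustrative.
import torch
import torch.nn as nn

class RWSAUNetSketch(nn.Module):
    def __init__(self, dims=(32, 64, 128), heads=4):
        super().__init__()
        # One shared attention module per resolution level.
        self.shared_attn = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True) for d in dims
        )

    def attend(self, x: torch.Tensor, level: int) -> torch.Tensor:
        # x: (batch, seq_len, dims[level]); residual self-attention with shared weights.
        out, _ = self.shared_attn[level](x, x, x)
        return x + out

model = RWSAUNetSketch()
enc_feat = model.attend(torch.randn(2, 100, 64), level=1)   # encoder, level 1
dec_feat = model.attend(torch.randn(2, 100, 64), level=1)   # decoder reuses the same attention
```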
The experiments are well-structured, comparing the proposed model against several baselines on multiple out-of-domain datasets. The results show strong performance improvements in key metrics such as PESQ, SSNR, and ESTOI, which are critical for speech enhancement tasks. The paper effectively highlights the advantages of the RWSA-MambaUNet in terms of model size and computational efficiency, providing a compelling case for its practical applicability. However, the evaluation could be strengthened by including more diverse datasets and a broader range of performance metrics to assess robustness.
The paper lacks detailed implementation specifics, such as hyperparameter settings, training procedures, and the exact architecture configurations used in experiments. This omission raises concerns about the reproducibility of the results. Including a link to a code repository or supplementary materials would greatly enhance the paper's reproducibility and allow for further validation of the findings.
One limitation is the lack of extensive ablation studies to isolate the contributions of different components of the proposed model. Additionally, the paper does not address potential limitations in terms of generalization to highly diverse or noisy environments, which are common in real-world applications of speech enhancement.
The proposed model has significant implications for various applications in speech processing, particularly in enhancing speech intelligibility in challenging acoustic environments. This could benefit areas such as telecommunications, hearing aids, and automated transcription services. The efficiency of the model also suggests potential for deployment in resource-constrained environments, making it accessible for broader use.
Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DAC-SE1, a novel language model-based framework for speech enhancement that operates directly on discrete audio tokens, achieving high fidelity and outperforming existing methods. This work represents a meaningful advancement in the field of speech enhancement, combining innovative methodology with rigorous experimental validation, and has the potential to influence future research directions in audio processing.
The methodology presented in this paper is innovative, leveraging discrete audio tokens for high-fidelity speech enhancement without the complexity of multi-stage pipelines. The authors effectively simplify the architecture by using a single-stage language model based on the LLaMA architecture, which is a notable departure from existing autoregressive models that typically rely on auxiliary encoders or dual-channel conditioning. The approach to flatten the multi-codebook structure into a single token sequence is a clever strategy that aligns well with the scaling laws observed in large language models, allowing for efficient processing of high-resolution audio.
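The flattening strategy described above can be illustrated as follows: residual-codebook tokens are interleaved per frame into one sequence, with per-codebook ID offsets so that a single LM vocabulary covers all codebooks; the codebook count and size follow common DAC configurations and are assumptions here.

```python
# Sketch of flattening a multi-codebook token grid into a single LM sequence:
# per-frame interleaving with per-codebook ID offsets. Codebook count and
# size are assumed, not taken from the paper.
import numpy as np

def flatten_codes(codes: np.ndarray, codebook_size: int = 1024) -> np.ndarray:
    """codes: (num_codebooks, num_frames) integer tokens -> (num_frames * num_codebooks,)."""
    n_q, _ = codes.shape
    offsets = (np.arange(n_q) * codebook_size)[:, None]      # disjoint ID range per codebook
    return (codes + offsets).T.reshape(-1)                   # frame-major interleaving

def unflatten_codes(flat: np.ndarray, n_q: int, codebook_size: int = 1024) -> np.ndarray:
    codes = flat.reshape(-1, n_q).T
    return codes - (np.arange(n_q) * codebook_size)[:, None]

codes = np.random.randint(0, 1024, size=(9, 50))             # 9 codebooks, 50 frames
assert np.array_equal(unflatten_codes(flatten_codes(codes), 9), codes)
```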
The experiments conducted are thorough, with a clear focus on both objective and subjective evaluation metrics. The use of MUSHRA for human evaluation adds credibility to the findings, and the results demonstrate that DAC-SE1 consistently outperforms existing state-of-the-art methods across various metrics. The inclusion of diverse datasets for training and evaluation strengthens the robustness of the results, indicating that the model generalizes well across different noise profiles and conditions.
The authors have taken steps to ensure reproducibility by releasing their codebase and model checkpoints, which is critical for the advancement of research in this area. The implementation details, including training strategies and dataset preparation, provide a solid foundation for other researchers to replicate the study. However, the absence of institutional affiliations makes it harder to gauge the research group's track record.
One limitation of the study is the reliance on a single model architecture, which may not capture all nuances of speech enhancement across various contexts. Additionally, while the results are promising, the paper does not extensively discuss the potential computational costs associated with training and deploying such large models, which could be a barrier for practical applications.
The implications of this research are significant, as high-fidelity speech enhancement has applications in various fields, including telecommunications, assistive technologies, and entertainment. The ability to enhance speech quality without complex architectures could lead to more accessible and efficient solutions in real-world applications.
Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first model aligning speech with natural language descriptions of speaker and health metadata through contrastive learning. SLAP combines a Vision Transformer audio encoder with text encoders, trained on more than 3400 hours across 9 datasets with diverse speaker annotations. We evaluated on 38 binary classification tasks spanning demographics, voice characteristics, and clinical assessments across 14 datasets in 7 languages. SLAP achieves 62.9% average F1 in zero-shot evaluation, a 48% relative improvement over CLAP (42.4%), while demonstrating strong OOD generalization to unseen languages and clinical populations. When fine-tuned with linear probing, SLAP reaches 69.3% F1 overall and achieves best-in-class performance on health tasks (57.9% F1), surpassing larger foundation models.
Primary: unknown
All Institutions: unknown
SLAP introduces a novel framework for aligning speech with natural language descriptions to enhance speaker and health-related classification tasks. The combination of contrastive learning and extensive evaluation across diverse datasets positions this work as a meaningful contribution to the field of audio representation learning, with potential applications in healthcare and beyond.
The methodology presented in SLAP is innovative, utilizing a contrastive learning approach to align audio and text representations for speaker and health-related tasks. The model architecture, which combines a Vision Transformer audio encoder with text encoders, is well-justified and leverages a substantial dataset of over 3400 hours of audio. The use of natural language descriptions for speaker metadata is a significant advancement, allowing for a more nuanced understanding of speaker characteristics. The incorporation of a masked autoencoder objective alongside the contrastive learning framework adds robustness to the model's training process. However, the paper could benefit from a more detailed explanation of the hyperparameter choices and their impacts on performance.
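The contrastive objective underlying this alignment can be sketched as a symmetric, CLAP-style InfoNCE loss over paired audio and description embeddings, as below; the encoders are stand-ins and the temperature is an assumed value.

```python
# Sketch of a symmetric contrastive (InfoNCE) loss aligning audio embeddings
# with embeddings of natural-language speaker/health descriptions. Encoders
# are stand-ins; the temperature is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                   # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))                # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```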
The experimental evaluation is comprehensive, covering 38 binary classification tasks across 14 datasets in 7 languages. The results demonstrate a clear improvement over existing models, particularly in zero-shot scenarios, which is a critical aspect for practical applications in healthcare. The paper provides thorough comparisons with baseline models, showcasing SLAP's superior performance in various tasks. However, the evaluation could be strengthened by including more qualitative analyses of the model's predictions and potential failure cases.
The paper provides a reasonable level of detail regarding the training process, including optimizer settings and dataset descriptions. However, the lack of specific URLs for code or model access limits reproducibility. Additionally, the absence of detailed information on data preprocessing steps and the exact architecture specifications may hinder other researchers from replicating the study accurately.
One limitation noted is the reliance on a large amount of annotated data, which may not be readily available for all languages or demographics. The model's performance in low-data regimes, while mentioned, could be explored in more depth. Furthermore, the paper does not address potential biases in the training data that could affect model generalization across diverse populations.
The implications of SLAP are significant, particularly in healthcare settings where speech can serve as a non-invasive diagnostic tool. The ability to generalize across languages and clinical populations opens up new avenues for remote monitoring and early detection of health issues. This model could facilitate personalized interventions and improve healthcare delivery, especially in underserved communities.
Neural audio codecs are foundational to speech language models. Ideally, a codec should have a low frame rate and decouple semantic from acoustic information. A lower frame rate reduces the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rates remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. The dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec surpasses baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io
Primary: Amphion Space
All Institutions: Amphion Space
The main contribution of this paper is the introduction of FlexiCodec, a dynamic neural audio codec that effectively preserves semantic information at very low frame rates, significantly advancing the field of audio processing and codec design. The methodology is innovative, and the experimental results validate its effectiveness, marking a meaningful contribution to the machine learning community.
The paper proposes FlexiCodec, a novel neural audio codec that operates at very low frame rates while preserving semantic information. The methodology is robust, leveraging a dynamic frame rate mechanism to adaptively merge semantically similar frames, thus addressing the challenge of information loss at lower frame rates. The architecture integrates ASR feature-assisted dual-stream encoding and Transformer bottlenecks, which enhances both semantic preservation and audio reconstruction quality. The dynamic frame merging process is well-explained, and the use of cosine similarity to guide frame merging is innovative.
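To make the merging mechanism concrete, the sketch below greedily pools adjacent frame embeddings whose cosine similarity to the running group mean exceeds a threshold; the threshold value, the mean-pooling rule, and the feature dimensionality are illustrative assumptions rather than FlexiCodec's actual algorithm.

```python
import torch
import torch.nn.functional as F

def merge_similar_frames(frames: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedily merge adjacent frames whose cosine similarity exceeds `threshold`.

    frames: (T, D) sequence of frame embeddings.
    Returns a (T', D) sequence with T' <= T, where each output frame is the
    mean of a run of consecutive, mutually similar input frames.
    """
    merged, current = [], [frames[0]]
    for t in range(1, frames.shape[0]):
        # Compare the incoming frame with the running mean of the current group.
        group_mean = torch.stack(current).mean(dim=0)
        sim = F.cosine_similarity(group_mean, frames[t], dim=0)
        if sim >= threshold:
            current.append(frames[t])              # information-sparse region: keep merging
        else:
            merged.append(torch.stack(current).mean(dim=0))
            current = [frames[t]]                  # start a new group at a semantic change
    merged.append(torch.stack(current).mean(dim=0))
    return torch.stack(merged)

# Example: 50 frames collapse to fewer frames when the content is static.
x = torch.randn(50, 256)
print(merge_similar_frames(x).shape)
```

Under a scheme like this, static regions collapse into a few merged frames while transitions keep the full frame rate, which is the behavior the dynamic frame rate is designed to exploit.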
The experiments are comprehensive, comparing FlexiCodec against multiple baseline systems at various frame rates. The evaluation metrics include semantic preservation (measured by WER) and acoustic quality (measured by PESQ, MCD, SIM, and UTMOS). The results demonstrate that FlexiCodec significantly outperforms existing codecs in semantic information retention, particularly at lower frame rates, and achieves competitive acoustic quality. The use of a large dataset (Librilight-Large) and detailed experimental setup enhances the credibility of the findings.
The paper provides thorough details on the methodology, training configurations, and evaluation setup, which supports reproducibility. The authors also commit to releasing their code and datasets post-review, further facilitating reproducibility.
One limitation is the reliance on ASR features, which may not generalize well across all types of audio data. Additionally, while the dynamic frame rate is a significant advancement, the potential for increased complexity in real-time applications may pose challenges. The paper also suggests that the acoustic quality may not scale linearly with bitrate, which could limit its applicability in certain scenarios.
The implications of this work are significant for applications in speech synthesis, audio compression, and multimodal language models. By enabling efficient low-frame-rate audio processing, FlexiCodec could enhance real-time applications in resource-constrained environments, such as mobile devices or IoT systems. The approach could also pave the way for future research in dynamic audio codecs and their integration into broader machine learning frameworks.
Pressure sensors are widely integrated into modern Heating, Ventilation and Air Conditioning (HVAC) systems. As they are sensitive to acoustic pressure, they can be a source of eavesdropping. This paper introduces HVAC-EAR, which reconstructs intelligible speech from low-resolution, noisy pressure data with two key contributions: (i) we achieve intelligible reconstruction from sampling rates as low as 0.5 kHz, surpassing prior work limited to hot-word detection, by employing a complex-valued conformer with a Complex Unified Attention Block to capture phoneme dependencies; (ii) HVAC-EAR mitigates transient HVAC noise by reconstructing both the magnitude and phase of missing frequencies. For the first time, evaluations on real-world HVAC deployments show significant intelligibility, raising novel privacy concerns.
Primary: George Mason University
All Institutions: George Mason University
The paper presents HVAC-EAR, a novel approach to eavesdropping human speech using HVAC systems, demonstrating significant advancements in speech reconstruction from low-resolution pressure data. The methodology and experimental results indicate a meaningful contribution to the fields of audio processing and security, although further work is needed to enhance reproducibility and address limitations.
The methodology presented in HVAC-EAR is innovative, leveraging a complex-valued conformer architecture with a Complex Unified Attention Block to reconstruct intelligible speech from low-resolution pressure sensor data. The approach is well-structured, addressing both the reconstruction of missing frequencies and the mitigation of transient HVAC noise. The use of complex-valued networks for capturing phoneme dependencies is a notable advancement over traditional methods. However, the paper could benefit from a more detailed explanation of the training process and hyperparameter tuning.
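The complex-valued processing can be illustrated with the standard complex linear map used throughout complex-valued networks; the block below is a generic building block under that convention, not the paper's Complex Unified Attention Block or conformer layer.

```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    """Complex linear map over inputs stored as separate real/imag tensors.

    Implements (Wr + i Wi)(xr + i xi) = (Wr xr - Wi xi) + i (Wr xi + Wi xr);
    a complex bias could be added analogously but is omitted so the identity
    above stays exact.
    """
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.wr = nn.Linear(in_features, out_features, bias=False)
        self.wi = nn.Linear(in_features, out_features, bias=False)

    def forward(self, xr: torch.Tensor, xi: torch.Tensor):
        return self.wr(xr) - self.wi(xi), self.wr(xi) + self.wi(xr)

# A batch of complex STFT frames (real/imag halves) keeps its shape.
layer = ComplexLinear(257, 257)
yr, yi = layer(torch.randn(8, 257), torch.randn(8, 257))
```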
The experimental evaluation is robust, utilizing real-world HVAC deployments and multiple metrics (LSD, NISQA-MOS, PESQ, STOI, SI-SDR) to assess intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed method. The inclusion of subjective analysis through Mean Opinion Scores adds credibility to the findings. Nonetheless, the limited dataset and focus on English-only speech may restrict the generalizability of the results.
The paper provides a reasonable level of detail regarding the architecture and evaluation metrics, but lacks comprehensive implementation details that would facilitate reproducibility. The absence of a publicly available code repository is a significant drawback.
Key limitations include the focus on a single language (English), the performance drop beyond 1.2 meters from the sensor, and the need for a minimum sampling frequency of 500 Hz. Additionally, the paper does not explore the implications of varying environmental conditions on the model's performance.
The findings raise critical privacy concerns regarding the use of HVAC systems in sensitive environments, such as healthcare and cleanrooms. The potential for eavesdropping using existing infrastructure highlights the need for improved security measures in building management systems. This research could lead to further investigations into privacy-preserving technologies in acoustic sensing.
Acoustic-to-articulatory inversion has often been limited to a small part of the vocal tract because the data are generally EMA (ElectroMagnetic Articulography) data, which require sensors to be glued to easily accessible articulators. The presented acoustic-to-articulatory model covers the inversion of the entire vocal tract, from the glottis, through the complete tongue and the velum, to the lips. It relies on a real-time dynamic MRI database of more than 3 hours of speech. The data are the denoised speech signal and the automatically segmented articulator contours. Several bidirectional LSTM-based approaches have been used, either inverting each articulator individually or inverting all articulators simultaneously. To our knowledge, this is the first complete inversion of the vocal tract. The average RMSE on the test set is 1.65 mm, to be compared with the pixel size of 1.62 mm.
Primary: Université de Lorraine
All Institutions: Université de Lorraine
The main contribution of this paper is the successful inversion of the complete vocal tract contour from acoustic signals using real-time MRI data, which represents a substantial advancement in articulatory modeling. This work not only pushes the boundaries of existing methodologies but also opens new avenues for research in speech technology and related fields.
The paper presents a novel approach to acoustic to articulatory inversion using real-time MRI data, which is a significant advancement over traditional methods that rely on EMA data. The use of bidirectional LSTM networks to simultaneously predict the contours of multiple articulators is innovative, and the integration of phonetic segmentation as an auxiliary task is a thoughtful addition that enhances the model's performance. However, the methodology could benefit from a more detailed discussion on the choice of hyperparameters and the rationale behind the specific model architectures employed.
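A compact sketch of this kind of bidirectional LSTM regressor with an auxiliary phonetic-classification head is given below; the feature sizes, number of contour points, and the loss weighting are placeholder assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class BiLSTMInversion(nn.Module):
    """Acoustic-to-articulatory inversion with an auxiliary phone classifier."""
    def __init__(self, n_acoustic=40, n_contour_pts=200, n_phones=40, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_acoustic, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.contour_head = nn.Linear(2 * hidden, 2 * n_contour_pts)  # (x, y) per point
        self.phone_head = nn.Linear(2 * hidden, n_phones)             # auxiliary task

    def forward(self, acoustics):                  # acoustics: (B, T, n_acoustic)
        h, _ = self.encoder(acoustics)
        return self.contour_head(h), self.phone_head(h)

model = BiLSTMInversion()
x = torch.randn(4, 120, 40)                        # 4 utterances, 120 frames each
contours, phone_logits = model(x)
phones = torch.randint(0, 40, (4, 120))            # frame-level phone labels
target = torch.randn(4, 120, 400)                  # flattened (x, y) contour targets
loss = nn.functional.mse_loss(contours, target) \
     + 0.1 * nn.functional.cross_entropy(phone_logits.transpose(1, 2), phones)
loss.backward()
```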
The experiments conducted are robust, comparing different model architectures and training strategies (ABA vs. AAT). The dataset is well-defined, consisting of a substantial amount of high-quality rt-MRI data. The evaluation metrics (RMSE and median error) are appropriate, and the results indicate a meaningful improvement in articulatory inversion accuracy. However, the paper could improve by including more comparative analyses with existing state-of-the-art methods to contextualize the results further.
While the paper provides a clear description of the dataset and the models used, it lacks specific implementation details such as the exact code or a link to a repository. This omission makes it difficult for other researchers to reproduce the results. Including a supplementary material section with code snippets or a GitHub repository would enhance reproducibility.
The primary limitation is the reliance on a single speaker's data, which may not generalize well across different speakers or languages. Additionally, the use of denoised audio recorded in a noisy environment could introduce biases that affect the model's performance. The paper also acknowledges the imperfections in contour tracking, which could impact the accuracy of the predictions.
This research has significant implications for fields such as speech therapy, linguistics, and human-computer interaction, where understanding and simulating human speech production is crucial. The ability to accurately model the entire vocal tract could lead to advancements in speech synthesis and recognition technologies, enhancing user experiences in various applications.
Low-latency symbolic music generation is essential for real-time improvisation and human-AI co-creation. Existing transformer-based models, however, face a trade-off between inference speed and musical quality. Traditional acceleration techniques such as embedding pooling significantly degrade quality, while recently proposed Byte Pair Encoding (BPE) methods - though effective on single-track piano data - suffer large performance drops in multi-track settings, as revealed by our analysis. We propose Attribute-Specialized Key-Value Head Sharing (AS-KVHS), adapted to music's structured symbolic representation, achieving about 30% inference speedup with only a negligible (about 0.4%) quality drop in objective evaluations and slight improvements in subjective listening tests. Our main contributions are (1) the first systematic study of BPE's generalizability in multi-track symbolic music, and (2) the introduction of AS-KVHS for low-latency symbolic music generation. Beyond these, we also release SAGE-Music, an open-source benchmark that matches or surpasses state-of-the-art models in generation quality.
Primary: University of Pennsylvania
All Institutions: Stanford University, University of Michigan, University of Pennsylvania, University of Waterloo
The paper presents a significant contribution to the field of symbolic music generation by introducing a novel method that balances efficiency and quality, supported by rigorous empirical evaluations. The findings have the potential to influence future research directions and practical applications in music technology.
The paper introduces a novel method called Attribute-Specialized Key-Value Head Sharing (AS-KVHS), which adapts existing transformer architectures specifically for symbolic music generation. This method is innovative in that it leverages the structured nature of musical attributes, allowing for efficient inference without significant quality degradation. The systematic study of Byte Pair Encoding (BPE) in multi-track settings is a valuable contribution, revealing important limitations of existing methods. The methodology is well-articulated, with clear explanations of the design choices and their implications for music generation.
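AS-KVHS is closest in spirit to grouped key-value head sharing, where several query heads read from one shared key/value projection; the sketch below shows that generic mechanism, with the head counts and the attribute-to-group assignment left as assumptions (the causal mask is also omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedKVAttention(nn.Module):
    """Attention in which groups of query heads share one key/value head.

    With n_q_heads=8 and n_kv_heads=4, each pair of query heads (e.g. heads
    specialized for the same token attribute such as pitch or duration)
    reads from a single shared K/V projection, shrinking the KV cache.
    """
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=4):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.hd = d_model // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.q = nn.Linear(d_model, n_q_heads * self.hd)
        self.k = nn.Linear(d_model, n_kv_heads * self.hd)
        self.v = nn.Linear(d_model, n_kv_heads * self.hd)
        self.o = nn.Linear(n_q_heads * self.hd, d_model)

    def forward(self, x):                                # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.k(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.v(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        rep = self.n_q // self.n_kv                      # query heads per shared KV head
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        att = F.softmax(q @ k.transpose(-2, -1) / self.hd ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o(out)

y = GroupedKVAttention()(torch.randn(2, 64, 512))
```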
The experiments are robust, utilizing a newly curated dataset (VirtuMIDI) that significantly enhances the evaluation of the proposed methods. The authors provide a comprehensive analysis of the efficiency-quality trade-off, comparing their approach against traditional methods and demonstrating superior performance in both objective and subjective evaluations. The use of the Normalized Musical Similarity Index (NMSI) as a composite metric for musical quality is a thoughtful addition, allowing for a nuanced assessment of generated outputs.
The paper includes detailed descriptions of the model architecture, training configurations, and evaluation protocols, which are essential for reproducibility. The authors provide sufficient information on the preprocessing steps and the evaluation metrics used, although access to the full code and model weights would further enhance reproducibility.
While the study presents significant advancements, it does not address potential limitations in the generalizability of the findings across different musical genres or styles. The reliance on a specific dataset may limit the applicability of the results to broader contexts. Additionally, the subjective listening tests, while rigorous, may introduce variability based on the annotators' musical backgrounds.
The advancements in low-latency symbolic music generation have substantial implications for real-time applications in music composition and performance, particularly in enhancing human-AI collaboration. The open-source release of SAGE-Music as a benchmark can foster further research and development in the field, potentially leading to more sophisticated AI-driven music generation tools.
Automated birdsong classification is essential for advancing ecological monitoring and biodiversity studies. Despite recent progress, existing methods often depend heavily on labeled data, use limited feature representations, and overlook temporal dynamics essential for accurate species identification. In this work, we propose a self-supervised contrastive network, ARIONet (Acoustic Representation for Interframe Objective Network), that jointly optimizes contrastive classification and future frame prediction using augmented audio representations. The model simultaneously integrates multiple complementary audio features within a transformer-based encoder model. Our framework is designed with two key objectives: (1) to learn discriminative species-specific representations for contrastive learning through maximizing similarity between augmented views of the same audio segment while pushing apart different samples, and (2) to model temporal dynamics by predicting future audio frames, both without requiring large-scale annotations. We validate our framework on four diverse birdsong datasets, including the British Birdsong Dataset, Bird Song Dataset, and two extended Xeno-Canto subsets (A-M and N-Z). Our method consistently outperforms existing baselines and achieves classification accuracies of 98.41%, 93.07%, 91.89%, and 91.58%, and F1-scores of 97.84%, 94.10%, 91.29%, and 90.94%, respectively. Furthermore, it demonstrates low mean absolute errors and high cosine similarity, up to 95%, in future frame prediction tasks. Extensive experiments further confirm the effectiveness of our self-supervised learning strategy in capturing complex acoustic patterns and temporal dependencies, as well as its potential for real-world applicability in ecological conservation and monitoring.
Primary: United International University
All Institutions: United International University, Charles Darwin University
The paper presents ARIONet, a self-supervised framework that effectively captures both species-specific acoustic signatures and their temporal evolution in birdsong, marking a significant contribution to the field of machine learning in audio classification. The innovative methodology and strong experimental results position this work as a valuable resource for future research in ecological monitoring and machine learning applications.
The proposed methodology is innovative, leveraging a self-supervised contrastive learning framework that integrates future frame prediction, which is particularly relevant for audio classification tasks. The use of a transformer-based encoder to process multiple complementary audio features is a strong point, enhancing the model's capacity to capture temporal dynamics in birdsong. The dual-objective approach, combining contrastive learning with predictive modeling, is a significant advancement over traditional methods that often rely on static features or extensive labeled data. The domain-specific augmentations introduced for birdsong classification further enhance the robustness of the model.
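The dual objective can be summarized as a contrastive term plus a future-frame regression term; the sketch below uses a standard NT-Xent loss and an MSE prediction loss with an assumed weighting, and is only a schematic stand-in for ARIONet's actual training objective.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent contrastive loss between two augmented views, each (B, D)."""
    B = z1.size(0)
    z = torch.cat([F.normalize(z1, dim=1), F.normalize(z2, dim=1)], dim=0)
    sim = z @ z.t() / tau                                     # (2B, 2B) similarity logits
    mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # exclude self-similarity
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)

def arionet_style_loss(view1, view2, pred_next, true_next, lam=1.0):
    """Joint objective: contrastive agreement + future-frame regression."""
    return nt_xent(view1, view2) + lam * F.mse_loss(pred_next, true_next)

loss = arionet_style_loss(torch.randn(16, 128), torch.randn(16, 128),
                          torch.randn(16, 80), torch.randn(16, 80))
```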
The experiments are extensive, validating the proposed framework across four diverse birdsong datasets. The reported classification accuracies and F1-scores demonstrate that ARIONet outperforms existing baselines, which underscores the effectiveness of the proposed approach. However, the paper could benefit from additional comparisons with a broader range of state-of-the-art methods to further contextualize the results.
The paper provides a detailed description of the methodology, including data preprocessing, feature extraction, and model training procedures. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Including such resources would enhance the paper's impact and facilitate further research in this area.
While the model shows strong performance, it relies on a lightweight transformer architecture, which may limit its capacity for more complex feature learning. Additionally, the preprocessing steps that discard low-energy segments could potentially exclude relevant data, impacting the overall performance in certain acoustic environments. The paper also does not address potential ethical concerns related to the deployment of automated monitoring systems in sensitive ecological contexts.
The implications of this research are significant for ecological monitoring and biodiversity studies, providing a scalable solution for automated birdsong classification. The ability to predict future vocalizations could enhance behavioral modeling and contribute to conservation efforts by enabling real-time tracking of bird populations. However, ethical considerations regarding the use of AI in ecological monitoring must be carefully addressed to avoid misclassification and ensure responsible deployment.
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a novel vocoder-free audio super-resolution framework that leverages flow matching generative models to directly reconstruct high-fidelity waveforms. This approach simplifies the audio generation process and addresses critical limitations of existing methods, marking a significant advancement in the field of audio processing.
The paper introduces a vocoder-free framework for audio super-resolution that utilizes a flow matching generative model to directly reconstruct waveforms from complex-valued spectral coefficients. This approach is innovative as it bypasses the traditional two-stage process that relies on a pre-trained neural vocoder, thus simplifying the optimization process and potentially improving audio quality. The use of inverse Short-Time Fourier Transform (iSTFT) is a significant methodological advancement that addresses the limitations of existing methods.
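The two ingredients, flow matching over complex spectral coefficients and vocoder-free waveform recovery, can be sketched as follows; the linear interpolation path, the real/imaginary channel layout, and the STFT settings are assumptions for illustration, and `model` is a placeholder network.

```python
import torch

def flow_matching_step(model, x1, cond):
    """One conditional flow-matching training step on complex STFT coefficients.

    x1:   target spectrogram, real/imag stacked as channels, shape (B, 2, F, T)
    cond: conditioning input (e.g. an upsampled low-resolution spectrogram)
    The model predicts the velocity field v(x_t, t, cond); under the linear path
    x_t = (1 - t) x0 + t x1 the regression target is simply x1 - x0.
    """
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.size(0), 1, 1, 1, device=x1.device)   # per-example time
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()

def spectrogram_to_waveform(spec_ri, n_fft=1024, hop=256):
    """Vocoder-free reconstruction: complex coefficients -> waveform via iSTFT."""
    spec = torch.complex(spec_ri[:, 0], spec_ri[:, 1])      # (B, F, T) complex
    window = torch.hann_window(n_fft, device=spec_ri.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)

# Minimal usage with a placeholder "model" that just predicts zeros.
dummy = lambda xt, t, cond: torch.zeros_like(xt)
loss = flow_matching_step(dummy, torch.randn(2, 2, 513, 100), torch.randn(2, 2, 513, 100))
wav = spectrogram_to_waveform(torch.randn(2, 2, 513, 100))
```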
The experiments conducted demonstrate the model's capability to produce high-fidelity audio at 48 kHz across various upsampling factors. The authors provide comparative results against state-of-the-art methods on both speech and general audio datasets, showcasing the effectiveness of their approach. However, the paper would benefit from more extensive ablation studies to further validate the contributions of each component of the proposed framework.
The paper lacks detailed implementation specifics, which may hinder reproducibility. While the results are promising, the absence of a publicly available code repository or demo limits the ability for other researchers to replicate the findings. Clearer guidelines on the experimental setup and parameter settings would enhance reproducibility.
One limitation of the proposed method is its reliance on the flow matching generative model, which may not generalize well across all audio types or conditions. Additionally, the paper does not address potential computational costs associated with the flow matching process, which could impact its practicality for real-time applications.
The proposed framework has significant implications for audio processing applications, particularly in enhancing audio quality for streaming, gaming, and virtual reality environments. By eliminating the need for a vocoder, it opens up new avenues for research and development in audio synthesis and restoration technologies.
Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MOS-RMBench and the MOS-aware GRM, which collectively enhance the evaluation and modeling of speech quality in synthetic speech generation. This work is significant as it addresses critical limitations in existing methodologies, paving the way for more reliable and scalable assessments in the field.
The paper introduces MOS-RMBench, a novel benchmark that reformulates existing MOS datasets into a preference-comparison framework. This approach addresses the limitations of traditional MOS ratings, such as inconsistency and poor reproducibility. The authors propose three paradigms for reward modeling, which are well-structured and provide a comprehensive comparison of their performance. The introduction of the MOS-aware GRM, which adapts rewards based on the difficulty of sample pairs, is a significant methodological advancement that enhances the model's ability to discriminate fine-grained quality differences.
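The paper's exact reward function is not reproduced here; the snippet below illustrates one plausible difficulty-aware form in which a correct preference on a small-MOS-gap pair earns a larger reward, with the decay shape and constants being pure assumptions.

```python
import math

def mos_aware_reward(pred_prefers_a: bool, mos_a: float, mos_b: float,
                     tau: float = 0.5, base: float = 1.0, bonus: float = 1.0) -> float:
    """Difficulty-scaled preference reward (illustrative form only).

    Pairs with a small MOS gap are the hardest to rank, so a correct verdict on
    them earns a larger bonus; the gap-dependent term decays as exp(-|dMOS| / tau).
    """
    gap = abs(mos_a - mos_b)
    correct = pred_prefers_a == (mos_a >= mos_b)
    weight = base + bonus * math.exp(-gap / tau)
    return weight if correct else -weight

# A correct call on a 0.1-MOS gap pays more than one on a 1.5-MOS gap.
print(mos_aware_reward(True, 3.6, 3.5), mos_aware_reward(True, 4.5, 3.0))
```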
The experiments are robust, with clear metrics and thorough evaluations across multiple datasets. The findings are well-supported by empirical evidence, demonstrating that scalar models outperform others and that the proposed MOS-aware GRM improves performance on challenging cases. However, the paper could benefit from additional details on the datasets used and the specific experimental setup to facilitate reproducibility.
The paper includes a reproducibility statement, but it lacks detailed implementation instructions or code availability, which are crucial for enabling other researchers to replicate the results. This limits the overall impact of the findings, as reproducibility is a key aspect of scientific research.
One limitation noted is the performance gap between models on synthetic versus human speech, indicating that the models may not generalize well across different types of speech data. Additionally, the struggle with pairs that have small MOS differences suggests that further refinement is needed in the reward modeling approach.
The proposed methodologies and benchmark have the potential to significantly advance the field of automatic speech quality assessment, leading to improvements in synthetic speech generation. By establishing a more rigorous evaluation framework, this work could foster further research and development in speech technologies, impacting various applications such as virtual assistants, audiobooks, and accessibility tools.
While the technologies empowering malicious audio deepfakes have evolved dramatically in recent years due to advances in generative AI, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real-world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g., by phone. We propose a new framework for data creation and research methodology, allowing for the development of spoofing countermeasures that are more effective in real-world scenarios. By following the guidelines outlined here, we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate that improving datasets has a bigger impact on deepfake detection accuracy than choosing larger SOTA models over smaller ones; that is, it would be more important for the scientific community to make greater investments in comprehensive data collection programs than to simply train larger models with higher computational demands.
Primary: Microsoft
All Institutions: Microsoft
This paper presents a comprehensive framework for deepfake voice detection that significantly enhances the realism of training datasets, leading to improved generalization and effectiveness in real-world applications. The methodology and results provide valuable insights for researchers and practitioners in the field of audio deepfake detection, emphasizing the critical need for realistic data in developing robust countermeasures.
The paper introduces a novel framework for deepfake voice detection that emphasizes the importance of realistic dataset creation. It critiques existing methodologies for their lack of generalizability to real-world scenarios, proposing a comprehensive approach that includes various presentation methods (e.g., direct injection and loudspeaker playback) to simulate real-world conditions. This holistic view is a significant advancement over prior work that often focused on isolated aspects of deepfake generation. The methodology is well-structured, leveraging multiple data categories and augmentations to enhance the robustness of the detection systems.
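One concrete instance of the presentation gap the authors emphasize is narrowband telephony; the sketch below band-limits and re-samples a raw clip so that it resembles audio passed through a phone channel. The passband, filter order, and the absence of codec compression are generic assumptions, not the paper's data-creation recipe.

```python
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

def simulate_phone_channel(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Approximate a narrowband telephone presentation of a raw audio clip."""
    # 300-3400 Hz band-pass, the classic telephony passband.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    bandlimited = sosfilt(sos, audio)
    # Downsample to 8 kHz and back, baking the bandwidth loss into the clip.
    narrow = resample_poly(bandlimited, 8000, sr)
    return resample_poly(narrow, sr, 8000).astype(np.float32)

clip = np.random.randn(16000).astype(np.float32)   # 1 s of placeholder audio
augmented = simulate_phone_channel(clip)
```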
The experiments conducted are thorough, utilizing a range of state-of-the-art (SOTA) models and comparing their performance across different training setups. The results demonstrate a clear improvement in detection accuracy when using the proposed realistic datasets compared to traditional approaches. The paper provides detailed metrics, including Equal Error Rate (EER) and Missed Detection Rate (MDR), which are crucial for evaluating the effectiveness of deepfake detection systems. The findings are statistically significant and provide a strong case for the proposed methodology.
The paper includes detailed implementation information, including training conditions, data augmentation techniques, and the specific models used. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. While the methodology is clearly described, access to the datasets and models would enhance the ability of other researchers to replicate the findings.
One limitation is the relatively small size of the real-world dataset (Fraud Academy), which may affect the generalizability of the results. Additionally, while the paper emphasizes the importance of realistic datasets, it does not extensively explore the computational costs associated with the proposed methods, particularly in terms of model training and inference times. The reliance on specific TTS engines may also limit the applicability of the findings to other systems.
The implications of this research are significant, particularly in the context of increasing concerns over the misuse of deepfake technology. By improving deepfake detection systems, the work contributes to enhancing security in applications such as phone banking and fraud prevention. The emphasis on realistic data collection also encourages future research to prioritize practical applicability, potentially influencing the direction of the field.
We introduce a computationally efficient and tunable feedback delay network (FDN) architecture for real-time room impulse response (RIR) rendering that addresses the computational and latency challenges inherent in traditional convolution and Fourier transform based methods. Our approach directly optimizes FDN parameters to match target RIR acoustic and psychoacoustic metrics such as clarity and definition through novel differentiable programming-based optimization. Our method enables dynamic, real-time adjustments of room impulse responses that accommodate listener and source movement. When combined with previous work on representing head-related impulse responses via infinite impulse responses, efficient rendering of auditory objects is possible when the HRIR and RIR are known. Our method produces renderings with quality similar to convolution with long binaural room impulse response (BRIR) filters, but at a fraction of the computational cost.
Primary: University of Maryland
All Institutions: University of Maryland
The main contribution of this paper is the introduction of a computationally efficient feedback delay network for real-time room impulse response rendering, which significantly reduces computational costs while maintaining audio quality. This work represents a meaningful advancement in the field of spatial audio rendering, addressing key challenges in real-time applications.
The paper proposes a novel feedback delay network (FDN) architecture for real-time room impulse response (RIR) rendering, which is a significant advancement over traditional convolution and Fourier transform methods. The methodology is well-structured, utilizing differentiable programming for optimizing FDN parameters based on psychoacoustic metrics. The approach is innovative in that it does not rely on real RIR measurements, instead directly optimizing for perceptual metrics, which enhances its applicability in dynamic environments. The separation of early reflections and reverberant tails is a clever design choice that effectively reduces computational complexity while maintaining audio quality.
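Clarity is one of the psychoacoustic targets named above, and its C50 form is straightforward to express as a differentiable tensor operation that could serve as such an optimization target; the sampling rate and the way it is used as a loss below are assumptions.

```python
import torch

def clarity_c50(rir: torch.Tensor, sr: int = 48000, eps: float = 1e-10) -> torch.Tensor:
    """Differentiable C50: early-to-late energy ratio of an impulse response (dB).

    C50 = 10 * log10( sum_{t < 50 ms} h(t)^2 / sum_{t >= 50 ms} h(t)^2 )
    Being a pure tensor expression, it can back-propagate into whatever
    parameters synthesized `rir`.
    """
    split = int(0.050 * sr)
    early = rir[..., :split].pow(2).sum(dim=-1)
    late = rir[..., split:].pow(2).sum(dim=-1)
    return 10.0 * torch.log10((early + eps) / (late + eps))

# Used as a loss term: push a synthesized RIR toward a target clarity value.
synth_rir = torch.randn(48000, requires_grad=True)
loss = (clarity_c50(synth_rir) - torch.tensor(2.0)).pow(2)
loss.backward()
```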
The experiments conducted demonstrate the effectiveness of the proposed method in synthesizing RIRs that closely match actual RIR metrics. The results are quantitatively supported by objective metrics such as clarity and definition, and the authors provide a thorough comparison of computational costs against traditional methods. However, the paper could benefit from additional subjective evaluations involving listener assessments to further validate the perceptual quality of the synthesized RIRs.
The paper lacks detailed implementation specifics, such as code availability or a clear description of the experimental setup, which may hinder reproducibility. While the methodology is described in detail, providing access to the code and datasets used would significantly enhance the reproducibility of the results.
One limitation is the reliance on psychoacoustic metrics for optimization, which may not capture all aspects of perceptual audio quality. Additionally, the paper does not address potential artifacts that may arise from the FDN synthesis, particularly in complex acoustic environments. The authors also acknowledge a slight discrepancy in the T_30 metric, which could be a concern for applications requiring high precision.
The proposed method has significant implications for spatial audio rendering in augmented and virtual reality applications, where real-time processing and adaptability to user movement are critical. The efficiency gains in computational cost and latency make it suitable for deployment on edge devices, potentially broadening the accessibility of high-quality spatial audio experiences.
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
Primary: Duke University
All Institutions: Duke University
This paper presents a comprehensive evaluation of reasoning ability in voice-interactive systems, introducing the VERA benchmark and systematically documenting the Voice Reasoning Gap. The methodology and experimental rigor provide valuable insights into the architectural challenges faced by current voice models, paving the way for future innovations in the field.
The methodology is robust and well-structured, introducing the Voice Evaluation of Reasoning Ability (VERA) benchmark, which is a significant advancement in evaluating reasoning capabilities of voice-interactive systems. The authors have meticulously adapted existing text benchmarks for voice interaction, ensuring that the reasoning difficulty is preserved while addressing the unique challenges posed by real-time conversational constraints. The multi-stage pipeline for adapting benchmarks is particularly noteworthy, as it emphasizes reproducibility and quality control in the dataset creation process. The systematic evaluation of various voice systems against text baselines provides a comprehensive framework for understanding the Voice Reasoning Gap (VRG).
The experimental setup is thorough, assessing 12 contemporary voice systems alongside strong text baselines, which allows for a direct comparison of performance across modalities. The results reveal significant modality gaps, with detailed analyses of latency-accuracy trade-offs and failure modes. The use of statistical validation methods, such as McNemar's test, adds rigor to the findings. The paper also includes a well-defined error taxonomy that categorizes failures, providing insights into the underlying architectural challenges faced by voice systems.
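For reference, the paired comparison behind such a test can be computed from per-item correctness of a text model and its voice counterpart; the helper below uses one common exact binomial form of McNemar's test and is purely illustrative of the procedure, not the authors' evaluation code.

```python
from scipy.stats import binomtest

def mcnemar_exact(correct_text, correct_voice):
    """Exact McNemar test on paired per-item correctness lists (booleans)."""
    b = sum(t and not v for t, v in zip(correct_text, correct_voice))  # text right, voice wrong
    c = sum(v and not t for t, v in zip(correct_text, correct_voice))  # voice right, text wrong
    if b + c == 0:
        return b, c, 1.0
    return b, c, binomtest(min(b, c), n=b + c, p=0.5).pvalue

# Paired correctness on the same episodes for a text model and its voice counterpart.
text = [True, True, True, False, True, True, False, True]
voice = [False, True, False, False, True, False, False, False]
print(mcnemar_exact(text, voice))
```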
The authors have made efforts to ensure reproducibility by providing a clear description of the benchmark construction process and making the dataset available on GitHub. However, the paper does not provide detailed implementation instructions for the models evaluated, which may hinder full reproducibility of the experimental results.
One limitation is the focus on existing voice systems without exploring potential novel architectures that could bridge the VRG. Additionally, the paper acknowledges that the evaluation is not a controlled experiment designed to prove causality, which may limit the generalizability of the findings. The reliance on commercial systems may also introduce variability that is not fully accounted for in the analysis.
The findings have significant implications for the development of voice-interactive systems, particularly in enhancing reasoning capabilities in real-time applications. The VERA benchmark can serve as a critical tool for future research, guiding the design of more effective voice assistants that balance fluency and reasoning depth. The insights gained from this work could influence the direction of research in voice technology and artificial intelligence, potentially leading to more intelligent and conversational systems.
Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), a framework that treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping technique that lowers height and width without discarding information. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large-scale dataset demonstrate that MARS performs comparably or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity audio generation.
Primary: AI Lab, Università di Firenze
All Institutions: AI Lab, Università di Firenze
The main contribution of this paper is the introduction of MARS, a novel framework for audio generation that employs multi-channel autoregression on spectrograms, leveraging innovative techniques like channel multiplexing to enhance efficiency and fidelity. This work represents a meaningful advancement in the field of audio synthesis, offering a scalable approach that balances computational efficiency with high-quality output, thus paving the way for future research and applications in audio generation.
The proposed MARS framework innovatively adapts multi-channel autoregressive modeling from image synthesis to audio generation, utilizing a novel channel multiplexing (CMX) technique that effectively reduces spatial resolution while preserving critical frequency information. The shared tokenizer across different scales is a significant methodological advancement, allowing for coherent representations and hierarchical refinement in audio synthesis. The approach is well-structured, with a clear delineation of preprocessing, tokenizer training, and autoregressive modeling stages, which collectively enhance the model's efficiency and fidelity.
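CMX is described as lowering height and width without discarding information, which is exactly the property of a space-to-depth (pixel-unshuffle) reshape; the sketch below uses that operation as a stand-in, so the factor of 2 and the channel ordering are assumptions rather than MARS's precise layout.

```python
import torch
import torch.nn.functional as F

def cmx_like_pack(spec: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Fold a (B, C, F, T) spectrogram into (B, C*r*r, F//r, T//r) without loss."""
    return F.pixel_unshuffle(spec, downscale_factor=r)

def cmx_like_unpack(packed: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Exact inverse of the packing step."""
    return F.pixel_shuffle(packed, upscale_factor=r)

spec = torch.randn(1, 1, 128, 256)            # mono spectrogram as a 1-channel image
packed = cmx_like_pack(spec)                  # -> (1, 4, 64, 128), same number of values
assert torch.equal(cmx_like_unpack(packed), spec)
```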
The experiments conducted on the NSynth dataset are comprehensive, employing a variety of metrics that assess both reconstruction quality and sample diversity. The results indicate that MARS performs competitively against state-of-the-art models, achieving superior scores in several key metrics such as NDB/$k$, PKID, and IKID, which are crucial for evaluating audio generation quality. The paper provides sufficient detail on the experimental setup, including training epochs and evaluation protocols, which supports the validity of the findings.
While the paper outlines the methodology and experimental setup in detail, it lacks specific implementation details such as code availability and hyperparameter settings, which are essential for reproducibility. The absence of a public demo or project URL further limits the ability of other researchers to replicate the results or build upon this work.
One limitation of the study is the reliance on a single dataset (NSynth) for evaluation, which may not fully capture the generalizability of the MARS framework across diverse audio generation tasks. Additionally, while CMX shows promise in maintaining audio fidelity, the slight trade-off in perceptual quality suggests that further optimization may be necessary. The paper could also benefit from a discussion on potential biases in the dataset and their impact on model performance.
The MARS framework has significant implications for various applications in audio generation, including music synthesis, sound design, and potentially in fields like virtual reality and gaming where high-fidelity audio is crucial. The methodology could inspire further research into multi-channel representations in other domains, such as video processing or medical imaging, highlighting its versatility and potential for broader adoption.
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room-impulse responses at training time, while requiring only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to rationalize over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25}$\% over BAT.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of OWL, an innovative geometry-aware audio large language model that significantly enhances spatial reasoning capabilities in auditory perception. The comprehensive analysis of the technical contributions, including the novel SAGE encoder and the extensive BiDepth dataset, positions this work as a meaningful advancement in the field of audio machine learning.
The paper introduces the Spatial-Acoustic Geometry Encoder (SAGE), which innovatively combines binaural audio features with 3D spatial structures through panoramic depth images and room impulse responses. This approach is a significant advancement over existing methods, as it allows for more precise spatial reasoning in audio large language models (ALLMs). The integration of SAGE with a spatially grounded chain-of-thought reasoning mechanism in OWL is particularly noteworthy, as it enhances the model's ability to perform complex reasoning tasks, such as direction-of-arrival (DoA) and distance estimation.
The authors construct and release the BiDepth dataset, which is a substantial contribution to the field, containing over one million QA pairs. The evaluation across benchmark datasets demonstrates a clear improvement in performance metrics, with a notable reduction in mean DoA error and significant gains in spatial reasoning QA accuracy. This empirical validation strengthens the claims made in the paper and showcases the effectiveness of the proposed methods.
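For context, mean DoA error has to account for angular wrap-around, and o'clock-level azimuth reporting is a simple 30-degree binning; the helpers below show one conventional way to compute both, with the clock orientation (0 degrees at 12 o'clock, clockwise) being an assumption rather than the paper's convention.

```python
import numpy as np

def mean_doa_error(pred_deg: np.ndarray, true_deg: np.ndarray) -> float:
    """Mean absolute direction-of-arrival error with 360-degree wrap-around."""
    diff = np.abs(pred_deg - true_deg) % 360.0
    return float(np.mean(np.minimum(diff, 360.0 - diff)))

def azimuth_to_oclock(azimuth_deg: float) -> int:
    """Map an azimuth (assumed 0 deg = 12 o'clock, clockwise) to the nearest clock hour."""
    hour = int(round((azimuth_deg % 360.0) / 30.0)) % 12
    return 12 if hour == 0 else hour

print(mean_doa_error(np.array([350.0, 10.0]), np.array([5.0, 355.0])))  # 15.0
print(azimuth_to_oclock(90.0))                                          # 3
```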
The paper provides detailed information on model architecture, training procedures, and evaluation metrics, which is essential for reproducibility. However, the absence of a public code repository or demo page limits the ability of other researchers to replicate the results fully.
One limitation is the reliance on panoramic depth images and room impulse responses during training, which may not be readily available in all practical applications. Additionally, while the model shows improved performance, the paper does not extensively address the scalability of the approach or its performance in real-world scenarios outside the controlled datasets.
The advancements in spatial reasoning for audio models could have significant implications for various applications, including augmented reality, virtual reality, and assistive technologies for the visually impaired. By improving auditory perception and reasoning, this research could enhance user experiences in immersive environments and contribute to the development of more intelligent audio systems.
This paper introduces Zimtohrli, a novel, full-reference audio similarity metric designed for efficient and perceptually accurate quality assessment. In an era dominated by computationally intensive deep learning models and proprietary legacy standards, there is a pressing need for an interpretable, psychoacoustically-grounded metric that balances performance with practicality. Zimtohrli addresses this gap by combining a 128-bin gammatone filterbank front-end, which models the frequency resolution of the cochlea, with a unique non-linear resonator model that mimics the human eardrum's response to acoustic stimuli. Similarity is computed by comparing perceptually-mapped spectrograms using modified Dynamic Time Warping (DTW) and Neurogram Similarity Index Measure (NSIM) algorithms, which incorporate novel non-linearities to better align with human judgment. Zimtohrli achieves superior performance to the baseline open-source ViSQOL metric, and significantly narrows the performance gap with the latest commercial POLQA metric. It offers a compelling balance of perceptual relevance and computational efficiency, positioning it as a strong alternative for modern audio engineering applications, from codec development to the evaluation of generative audio systems.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Zimtohrli, a novel psychoacoustic audio similarity metric that balances perceptual accuracy with computational efficiency. This work represents a meaningful advancement in audio quality assessment methodologies, offering a fresh perspective that could influence both academic research and practical applications in the audio engineering domain.
The methodology presented in Zimtohrli is grounded in psychoacoustic principles, utilizing a 128-bin gammatone filterbank to simulate human auditory perception. The integration of a non-linear resonator model to represent the eardrum's response is innovative and aligns well with the goal of creating a perceptually relevant audio similarity metric. The use of modified DTW and NSIM algorithms to compute similarity is a thoughtful approach that enhances the metric's ability to reflect human judgment in audio quality assessment. However, the paper could benefit from a more detailed explanation of the implementation specifics and the rationale behind the choice of parameters in the models.
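The DTW component can be illustrated with the textbook recursion over frame-wise distances; Zimtohrli's modified non-linearities and the NSIM comparison are not reproduced here, so the sketch below is only the unmodified baseline algorithm.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dynamic time warping between two spectrograms of shape (T, F).

    The local cost is the Euclidean distance between frames; Zimtohrli applies
    additional perceptual non-linearities on top of this basic recursion.
    """
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[Ta, Tb])

ref = np.random.rand(100, 128)      # reference perceptual spectrogram
deg = np.random.rand(110, 128)      # degraded signal, slightly time-warped
print(dtw_distance(ref, deg))
```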
The experiments conducted demonstrate Zimtohrli's performance against established metrics like ViSQOL and POLQA, showing a clear advantage in perceptual accuracy and computational efficiency. The results are promising, indicating that Zimtohrli can serve as a viable alternative in audio engineering applications. However, the paper lacks a comprehensive discussion on the datasets used for evaluation, which is critical for assessing the generalizability of the results.
The paper does not provide sufficient details regarding the implementation of Zimtohrli, which raises concerns about reproducibility. Key aspects such as the exact configuration of the gammatone filterbank, the specifics of the non-linear resonator model, and the datasets used for training and testing are not adequately described. This lack of detail could hinder other researchers from replicating the study or building upon its findings.
One limitation of the study is the absence of a thorough comparison with other contemporary metrics beyond ViSQOL and POLQA. Additionally, the paper does not address potential edge cases or scenarios where Zimtohrli might underperform, which is essential for understanding its applicability in diverse audio contexts. The reliance on psychoacoustic models, while innovative, may also introduce biases that need to be acknowledged.
Zimtohrli has significant potential applications in various fields, including audio codec development, music production, and the evaluation of generative audio systems. Its emphasis on perceptual relevance makes it particularly valuable for industries focused on user experience and audio quality. As the demand for efficient audio processing solutions grows, Zimtohrli could play a crucial role in shaping future standards for audio quality assessment.
While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of WeSCon, a self-training framework that enables word-level emotional expression control in TTS synthesis, significantly advancing the state-of-the-art in this area. The comprehensive analysis of the methodology, experimental results, and implications highlights its significance in the field of machine learning and speech synthesis.
The proposed WeSCon framework introduces a novel two-stage self-training approach that enables word-level emotional expression control in text-to-speech synthesis without the need for extensive emotional datasets. The methodology is well-structured, incorporating a multi-round inference process with transition smoothing and dynamic speed control, which effectively addresses the challenges of emotional transitions in speech synthesis. The dynamic emotional attention bias mechanism further enhances the model's ability to maintain emotional coherence during synthesis. The approach is innovative in its use of self-training to leverage limited data, demonstrating a significant advancement over existing methods that typically rely on extensive emotional datasets.
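To make the multi-round inference idea concrete, here is a minimal sketch of word-level synthesis that stitches per-segment zero-shot TTS calls with a linear crossfade standing in for transition smoothing. The `tts` callable, `emotion_prompts` mapping, overlap length, and `speed` argument are hypothetical placeholders, not the WeSCon API, and the paper's dynamic speed control and emotional attention bias are not reproduced here.

```python
import numpy as np

def crossfade(a, b, sr, overlap_ms=50):
    """Linear crossfade between consecutive segments (a stand-in for transition smoothing)."""
    n = int(sr * overlap_ms / 1000)
    if len(a) < n or len(b) < n:
        return np.concatenate([a, b])
    fade = np.linspace(1.0, 0.0, n)
    mixed = a[-n:] * fade + b[:n] * (1.0 - fade)
    return np.concatenate([a[:-n], mixed, b[n:]])

def word_level_synthesis(segments, tts, emotion_prompts, sr=24000):
    """Multi-round inference sketch: one zero-shot TTS call per
    (text, emotion, speed) segment, stitched with crossfades.
    `tts` and `emotion_prompts` are hypothetical placeholders."""
    out = np.zeros(0, dtype=np.float32)
    for text, emotion, speed in segments:
        wav = tts(text, prompt_wav=emotion_prompts[emotion], speed=speed)
        out = crossfade(out, wav, sr) if len(out) else wav
    return out
```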
The experimental setup is robust, utilizing both objective and subjective evaluation metrics to assess the performance of WeSCon against strong baseline models. The results indicate that WeSCon achieves state-of-the-art performance in word-level emotional expression control while preserving the zero-shot capabilities of the original TTS model. The ablation studies effectively demonstrate the contributions of each component of the framework, validating the importance of the transition-smoothing and dynamic speed control mechanisms.
The paper provides sufficient details regarding the training setup, model architecture, and evaluation metrics, which should facilitate reproducibility. However, the reliance on specific datasets and the absence of a publicly available implementation may hinder full reproducibility for some researchers.
The paper acknowledges limitations, including the lack of gradual emotion transitions and the fixed set of discrete emotions, which restricts the model's expressiveness. Additionally, the model's reliance on predefined emotional transitions may limit its adaptability in dynamic contexts. Future work is suggested to explore more flexible, context-aware control strategies.
The WeSCon framework has potential applications in various domains, including virtual agents, emotional storytelling, and expressive speech synthesis. However, it also raises ethical concerns related to speaker impersonation and the potential for generating biased or inappropriate outputs, necessitating careful consideration in deployment.
Versatile audio super-resolution (SR) aims to predict high-frequency components from low-resolution audio across diverse domains such as speech, music, and sound effects. Existing diffusion-based SR methods often fail to produce semantically aligned outputs and struggle with consistent high-frequency reconstruction. In this paper, we propose SAGA-SR, a versatile audio SR model that combines semantic and acoustic guidance. Based on a DiT backbone trained with a flow matching objective, SAGA-SR is conditioned on text and spectral roll-off embeddings. Due to the effective guidance provided by its conditioning, SAGA-SR robustly upsamples audio from arbitrary input sampling rates between 4 kHz and 32 kHz to 44.1 kHz. Both objective and subjective evaluations show that SAGA-SR achieves state-of-the-art performance across all test cases. Sound examples and code for the proposed model are available online.
Primary: Graduate School of Culture Technology, Republic of Korea
All Institutions: Graduate School of Culture Technology, Republic of Korea
SAGA-SR presents a novel approach to audio super-resolution by combining semantic and acoustic guidance, achieving state-of-the-art performance across diverse audio types. The technical contributions are significant, addressing key limitations in existing methods and demonstrating robust experimental results that could influence future research and applications in audio processing.
The methodology of SAGA-SR is well-structured, leveraging a DiT backbone and integrating semantic and acoustic guidance through text and spectral roll-off embeddings. This dual conditioning approach is innovative, addressing the common pitfalls of previous audio super-resolution models that often lack semantic alignment and struggle with high-frequency reconstruction. The use of a flow matching objective is a notable enhancement, enabling the model to generate more coherent audio outputs. The architecture is clearly defined, and the incorporation of both embeddings into the training process is methodologically sound.
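As an illustration of how a spectral roll-off condition might be derived, the snippet below computes a per-frame roll-off with librosa and quantizes an utterance-level statistic into a one-hot vector. The percentile, bin count, and pooling choice are assumptions for illustration; SAGA-SR's actual embedding scheme may differ.

```python
import numpy as np
import librosa

def rolloff_condition(path, roll_percent=0.99, n_bins=64, fmax=22050.0):
    """Illustrative conditioning signal: median spectral roll-off of the input,
    quantized into a one-hot vector. Not SAGA-SR's actual embedding scheme."""
    y, sr = librosa.load(path, sr=44100)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=roll_percent)[0]
    cutoff_hz = float(np.median(rolloff))
    bin_idx = min(int(cutoff_hz / fmax * n_bins), n_bins - 1)
    cond = np.zeros(n_bins, dtype=np.float32)
    cond[bin_idx] = 1.0
    return cutoff_hz, cond
```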
The experimental evaluation is comprehensive, utilizing both objective metrics (Log-Spectral Distance, Fréchet Distance) and subjective assessments through listening tests. The paper presents a robust dataset comprising diverse audio types, enhancing the generalizability of the results. The comparison with state-of-the-art models, alongside ablation studies, effectively demonstrates the advantages of the proposed method. However, the paper could benefit from more detailed statistical analysis of the subjective results to strengthen claims of superiority.
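For reference, one common formulation of the Log-Spectral Distance used in audio super-resolution evaluation is sketched below. Exact conventions (log base, power vs. magnitude spectra, frame parameters) vary across papers, so this is a reasonable generic reading rather than the specific implementation used by SAGA-SR.

```python
import numpy as np
import librosa

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-8):
    """RMS (over frequency) of the difference of log10 power spectra,
    averaged over frames. Conventions differ; this is one common reading."""
    S_ref = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop)) ** 2
    S_est = np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop)) ** 2
    n = min(S_ref.shape[1], S_est.shape[1])  # align frame counts
    diff = np.log10(S_ref[:, :n] + eps) - np.log10(S_est[:, :n] + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))
```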
The paper provides sufficient implementation details, including training configurations, dataset descriptions, and evaluation metrics, which support reproducibility. Although code and sound examples are stated to be available online, no direct repository link is provided, which limits access to the implementation for researchers attempting to replicate the findings.
The paper acknowledges limitations, such as challenges in handling overlapping audio sources and potential artifacts from low-frequency replacement post-processing. These are important considerations that could impact the practical applicability of the model in real-world scenarios. Future work is suggested to address these issues, indicating a forward-looking perspective.
SAGA-SR has significant potential applications in various domains, including music restoration, telecommunication, and audio enhancement for streaming services. By improving audio quality across different types of content, the model could enhance user experiences and accessibility in audio consumption. The integration of semantic guidance also opens avenues for more intelligent audio processing systems that can adapt to user preferences.
Generative universal speech enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. Diffusion- or flow-based generative models are capable of producing enhanced speech with high quality and fidelity. However, they typically achieve speech enhancement by learning an acoustic feature mapping from degraded speech to clean speech, while lacking awareness of high-level semantic information. This deficiency tends to cause semantic ambiguity and acoustic discontinuities in the enhanced speech. In contrast, humans can often comprehend heavily corrupted speech by relying on semantic priors, suggesting that semantics play a crucial role in speech enhancement. Therefore, in this paper, we propose SenSE, which leverages a language model to capture the semantic information of distorted speech and effectively integrates it into a flow-matching-based speech enhancement framework. Specifically, we introduce a semantic-aware speech language model to capture the semantics of degraded speech and generate semantic tokens. We then design a semantic guidance mechanism that incorporates semantic information into the flow-matching-based speech enhancement process, effectively mitigating semantic ambiguity. In addition, we propose a prompt guidance mechanism, which leverages a short reference utterance to alleviate the loss of speaker similarity under severe distortion conditions. Results on several benchmark datasets demonstrate that SenSE not only ensures high perceptual quality but also substantially improves speech fidelity while maintaining strong robustness under severe distortions. Code and demos are available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The main contribution of this paper is the introduction of SenSE, a semantic-aware universal speech enhancement framework that effectively integrates high-level semantic information into the speech enhancement process. This innovative approach addresses semantic ambiguity and improves speaker similarity, marking a significant advancement in the field of speech enhancement.
The proposed methodology, SenSE, introduces a novel two-stage framework that integrates semantic information into flow-matching-based speech enhancement. The first stage utilizes a semantic-aware speech language model to generate semantic tokens from degraded speech, while the second stage employs a semantic guidance mechanism to enhance speech quality. This dual approach is innovative as it addresses the semantic ambiguity often present in existing generative models. The incorporation of a prompt guidance mechanism further enhances speaker similarity, particularly under severe distortion conditions, showcasing a thoughtful integration of language modeling techniques into speech enhancement.
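The second stage can be pictured with a generic conditional flow-matching objective, sketched below with semantic-token embeddings passed in as conditioning. This follows the standard rectified-flow-style formulation and is not the exact SenSE training loss; the conditioning interface and guidance wiring are assumptions here.

```python
import torch

def flow_matching_loss(model, x1, cond, sigma_min=1e-4):
    """Generic conditional flow-matching loss (rectified-flow style).
    x1: clean speech latents [B, T, D]; cond: conditioning tensors such as
    degraded-speech features and semantic-token embeddings. `model` is a
    placeholder velocity predictor, not the actual SenSE network."""
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-sample time
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1     # interpolation path
    target = x1 - (1.0 - sigma_min) * x0                 # target velocity
    pred = model(xt, t, cond)
    return torch.mean((pred - target) ** 2)
```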
The experimental evaluation is robust, utilizing a comprehensive set of datasets, including the DNS Challenge and VCTK GSR test sets, to validate the model's performance across various distortion types. The results demonstrate that SenSE achieves competitive or superior performance in perceptual quality and semantic fidelity compared to state-of-the-art models. Metrics such as DNSMOS, NISQA, and SpeechBERTScore provide a well-rounded assessment of the model's capabilities, indicating strong robustness under challenging conditions.
The paper provides sufficient implementation details, including model configurations, training setups, and URLs for code and demos, which enhances reproducibility. The authors mention the use of pretrained models and specific training strategies, which are crucial for other researchers attempting to replicate the findings.
The paper identifies limitations related to the diversity of training data, particularly in noise types and room impulse responses, which may restrict the model's generalizability. Additionally, the complexity of the model architecture could pose challenges in terms of computational cost and efficiency, suggesting a need for future optimization.
The potential applications of SenSE extend to various domains requiring high-quality speech enhancement, such as telecommunications, assistive technologies, and media production. By improving speech intelligibility and fidelity in adverse conditions, this work could significantly enhance user experiences in real-world scenarios.
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text-speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalised long-horizon speech generation.
Primary: unknown
All Institutions: unknown
MGM-Omni presents a unified approach to omni-modal understanding and long-horizon speech generation, showcasing significant advancements in audio processing and generation methodologies. The innovative architecture and experimental validation highlight its potential impact on the field, despite some limitations in reproducibility and institutional transparency.
The methodology of MGM-Omni is innovative, featuring a dual-track architecture that separates multimodal reasoning from speech generation, which is a significant departure from traditional cascaded systems. The use of a dual audio encoder and a chunk-based parallel decoding mechanism is particularly noteworthy, as it addresses the challenges of long-form audio understanding and generation effectively. The unified training strategy is well-conceived, allowing for robust cross-modal reasoning and efficient processing of diverse audio inputs.
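The chunk-based parallel decoding idea can be sketched as an interleaved loop in which every step of the reasoning track yields k speech tokens at once. Both callables below are hypothetical placeholders standing in for the dual tracks, not MGM-Omni's actual interfaces.

```python
def chunked_speech_decode(text_lm_step, speech_decoder, state, k=4, max_steps=200):
    """Illustrative chunk-based parallel decoding: each reasoning ("brain") step
    yields k speech ("mouth") tokens at once, narrowing the text-speech
    token-rate gap. Placeholder callables only."""
    speech_tokens = []
    for _ in range(max_steps):
        state, hidden, done = text_lm_step(state)        # one reasoning-track step
        speech_tokens.extend(speech_decoder(hidden, k))  # k speech tokens in parallel
        if done:
            break
    return speech_tokens
```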
The experiments are extensive and demonstrate the capabilities of MGM-Omni across various tasks, including long audio understanding and speech generation. The introduction of the Long-TTS-Eval benchmark adds value by systematically assessing long-form speech generation, which is often overlooked in existing evaluations. The results indicate that MGM-Omni outperforms existing models in several key areas, including timbre consistency and context-aware speech generation, providing strong evidence of its effectiveness.
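A simple way to probe the timbre-consistency claim is to compare speaker embeddings of successive chunks of a long generation against the first chunk, as in the sketch below. The speaker encoder (e.g., an ECAPA-TDNN model) is an assumption and not part of the paper's stated evaluation protocol.

```python
import numpy as np

def timbre_consistency(chunk_embeddings):
    """Cosine similarity of each chunk's speaker embedding to the first chunk's.
    The speaker encoder producing the embeddings is assumed, not specified here."""
    if len(chunk_embeddings) < 2:
        return 1.0, 1.0
    ref = chunk_embeddings[0] / np.linalg.norm(chunk_embeddings[0])
    sims = [float(np.dot(ref, e / np.linalg.norm(e))) for e in chunk_embeddings[1:]]
    return float(np.mean(sims)), float(np.min(sims))
```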
The paper lacks detailed implementation specifics, such as code availability or clear instructions for reproducing the results. While the methodology is described in detail, the absence of a project URL or demo limits the ability of other researchers to replicate the findings independently.
One limitation is the lack of information regarding the primary institution and the absence of a demo or project URL, which hinders accessibility for further exploration of the model. Additionally, while the model shows promise, it may still face challenges in real-world applications, such as varying acoustic conditions and the need for extensive training data.
MGM-Omni has the potential to significantly advance the field of multimodal AI by improving the integration of audio with other modalities, which could enhance applications in areas such as virtual assistants, automated transcription services, and personalized speech synthesis. The ability to generate long-form speech with consistent timbre opens up new possibilities for content creation and accessibility.