Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, are mostly adapted from NLP and audio captioning; they rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address this gap in the literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting their weak correlation with human judgment, especially for longer answers. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and to motivate better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of AQEval and the AURA score, which together provide a comprehensive framework for evaluating open-ended responses in audio question answering. This work addresses critical shortcomings in existing evaluation metrics and sets a new standard for future research in the field.
The paper introduces AQEval, a novel benchmark for Audio Question Answering (AQA) metrics, which is a significant advancement in evaluating open-ended responses in audio contexts. The methodology employs a combination of human annotations and a new metric, AURA, which integrates reasoning capabilities of large language models (LLMs) with an audio entailment component. This dual approach allows for a more nuanced evaluation of responses, addressing the limitations of existing metrics that primarily focus on surface-level similarity.
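Since this summary does not give AURA's exact formulation, the following is only a hypothetical sketch of how an LLM judge could be blended with an audio-entailment signal; `judge_fn`, `entail_fn`, the 0-5 rating scale, and the weighting `alpha` are illustrative assumptions rather than the paper's definition.

```python
# Hypothetical sketch of an LLM-plus-entailment scorer in the spirit of AURA.
# judge_fn and entail_fn are stand-ins for an LLM judge and an audio-entailment
# model; the scale and weighting are illustrative assumptions.
from typing import Callable

def aura_like_score(
    question: str,
    reference: str,
    response: str,
    audio_caption: str,
    judge_fn: Callable[[str], float],        # returns an LLM rating in [0, 5]
    entail_fn: Callable[[str, str], float],  # returns P(premise entails hypothesis)
    alpha: float = 0.7,
) -> float:
    """Blend an LLM correctness judgment with an audio-entailment check."""
    prompt = (
        "Rate the response for correctness and relevance to the question, "
        f"given the reference answer (0-5).\nQuestion: {question}\n"
        f"Reference: {reference}\nResponse: {response}"
    )
    llm_score = judge_fn(prompt) / 5.0                  # normalize to [0, 1]
    entailment = entail_fn(audio_caption, response)     # grounding in the audio
    return alpha * llm_score + (1.0 - alpha) * entailment
```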
The experimental setup is robust, utilizing a dataset of 10k annotated responses that allows for systematic benchmarking of AQA metrics. The authors provide a comprehensive analysis of existing metrics, demonstrating their weak correlation with human judgments, particularly for longer answers. AURA is shown to outperform traditional metrics significantly, achieving state-of-the-art correlation with human ratings. The ablation studies further validate the effectiveness of the proposed methodology.
The paper includes detailed descriptions of the dataset construction, annotation process, and experimental setup, which enhances reproducibility. However, the reliance on specific LLMs for scoring may limit the generalizability of the results to other models or contexts.
While the paper addresses significant gaps in AQA evaluation, it does not explore the potential biases in human annotations or the limitations of the LLMs used. Additionally, the performance of AURA in real-world applications remains to be fully validated.
The introduction of AQEval and AURA has the potential to significantly influence future research in audio-language models and their evaluation. By providing a more accurate assessment of model responses, this work can lead to improvements in the development of ALMs and their applications in various domains, including accessibility, education, and content creation.
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) diminishing semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Tsinghua University
The main contribution of this paper is the introduction of the Siren framework, which effectively bridges the gap between language models and diffusion models in text-to-audio generation. The comprehensive analysis of the methodology and experimental results highlights its potential to reshape the landscape of audio synthesis, making it a notable advancement in the field.
The proposed methodology introduces a novel framework, Siren, which utilizes multiple isolated transformers with causal conditioning and anti-causal alignment. This approach effectively addresses the limitations of existing RVQ tokenizers in T2A generation by mitigating gradient conflicts and enhancing audio reconstruction fidelity. The use of reinforcement learning for alignment is innovative, although the complexity of the architecture may pose challenges for implementation and scalability.
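As a rough illustration of the "multiple isolated transformers with causal conditioning" idea (the reinforcement-learning alignment is omitted), the sketch below assigns one small transformer per RVQ layer and lets deeper layers condition on embeddings of shallower layers' codes; all dimensions and module choices are assumptions, not the released Siren architecture.

```python
# Minimal sketch (not the Siren code): one isolated transformer per RVQ layer,
# where layer k is conditioned on the embeddings of layers < k.
import torch
import torch.nn as nn

class PerLayerRVQModel(nn.Module):
    def __init__(self, n_rvq_layers=4, vocab=1024, d_model=256):
        super().__init__()
        self.embed = nn.ModuleList(nn.Embedding(vocab, d_model) for _ in range(n_rvq_layers))
        make_block = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.blocks = nn.ModuleList(make_block() for _ in range(n_rvq_layers))
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_rvq_layers))

    def forward(self, text_cond, rvq_tokens):
        # text_cond: (B, T, d_model); rvq_tokens: (B, n_layers, T) code indices
        logits, carry = [], text_cond
        for k, (emb, block, head) in enumerate(zip(self.embed, self.blocks, self.heads)):
            h = block(carry)                      # isolated transformer for layer k
            logits.append(head(h))
            # causal conditioning: deeper layers see shallower layers' codes
            carry = carry + emb(rvq_tokens[:, k])
        return torch.stack(logits, dim=1)         # (B, n_layers, T, vocab)
```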
The experiments are extensive and demonstrate that Siren outperforms both existing LM-based and diffusion-based systems, achieving state-of-the-art results. However, the paper mentions the use of a curated dataset smaller than those in prior work, which raises questions about the generalizability of the results. The evaluation metrics, particularly in terms of fidelity, are well-defined, but further comparisons with a broader range of benchmarks would strengthen the findings.
The paper provides a GitHub repository link for the implementation, which is crucial for reproducibility. However, details on the training process, hyperparameters, and specific datasets used are somewhat limited, which could hinder replication efforts by other researchers.
The authors acknowledge several limitations, including training efficiency due to the sequential training of transformer modules, the trade-off between model size and semantic richness, and the need for larger, more diverse datasets. Addressing these limitations in future work will be essential for advancing the field.
The work has significant implications for multi-modal generation frameworks, potentially enabling more cohesive integration of audio and text. By repositioning LMs as competitive in T2A tasks, it opens pathways for applications in content creation, gaming, and accessibility technologies.
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with a neural network and the differential operator is computed with a differentiable numerical solver. The use of a numerical solver enables stable network training while enforcing the physics as a strong constraint, in contrast to conventional physics-informed neural networks, which include the physics as a soft constraint in the loss function. We introduce an additional sparsity-promoting constraint to achieve meaningful solutions even under severe undersampling conditions. Experiments demonstrate that the proposed approach can reconstruct sound fields under extreme data scarcity, achieving higher accuracy and better convergence compared to physics-informed neural networks.
Primary: Technical University of Denmark (DTU)
All Institutions: Technical University of Denmark (DTU), Universidad Politécnica de Madrid (UPM)
This work introduces a differentiable physics approach for sound field reconstruction, significantly enhancing accuracy and convergence under data scarcity. The methodology combines neural networks with numerical PDE solvers, showcasing a promising direction for future research in acoustics and machine learning.
The paper presents a novel differentiable physics approach that integrates a neural network with a numerical PDE solver for sound field reconstruction. This method improves stability and convergence compared to traditional physics-informed neural networks (PINNs) by directly incorporating physical constraints through the numerical solver rather than as a penalty in the loss function. The introduction of a sparsity-promoting constraint is particularly innovative, allowing the model to perform well under extreme data scarcity. The use of automatic differentiation (AD) to compute gradients through the numerical solver is a significant methodological advancement, streamlining the training process.
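The core training loop can be illustrated with a minimal sketch: a small network parameterizes the initial pressure field, a differentiable leapfrog finite-difference solver advances the 2D wave equation, and the loss combines a data misfit at microphone positions with an L1 sparsity term, with gradients obtained by automatic differentiation through the solver. Grid size, time step, and the placeholder measurements are assumptions, not the authors' configuration.

```python
# Illustrative sketch (not the authors' code) of the differentiable-physics idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

def laplacian(p, h):
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]) / h**2
    return F.conv2d(p[None, None], k[None, None], padding=1)[0, 0]

def simulate(p0, steps, c=343.0, h=0.1, dt=1e-4):
    """Leapfrog update p_{t+1} = 2 p_t - p_{t-1} + (c dt)^2 Lap(p_t), zero initial velocity."""
    prev, cur, frames = p0, p0, []
    for _ in range(steps):
        nxt = 2 * cur - prev + (c * dt) ** 2 * laplacian(cur, h)
        prev, cur = cur, nxt
        frames.append(cur)
    return torch.stack(frames)                          # (steps, H, W)

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
xy = torch.stack(torch.meshgrid(torch.linspace(0, 1, 64),
                                torch.linspace(0, 1, 64), indexing="ij"), -1)
mic_idx = (torch.randint(0, 64, (8,)), torch.randint(0, 64, (8,)))  # 8 observation points
observed = torch.randn(50, 8)                           # placeholder measurements

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    p0 = net(xy.reshape(-1, 2)).reshape(64, 64)         # NN-parameterized initial condition
    frames = simulate(p0, steps=50)
    pred = frames[:, mic_idx[0], mic_idx[1]]            # sample the field at the mics
    loss = F.mse_loss(pred, observed) + 1e-3 * p0.abs().mean()  # data misfit + sparsity
    opt.zero_grad()
    loss.backward()                                     # AD through the numerical solver
    opt.step()
```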
The experiments conducted are rigorous and demonstrate the effectiveness of the proposed method across various scenarios, including single Gaussian pulses and complex source distributions. The results indicate that the differentiable physics approach significantly outperforms PINNs in terms of accuracy and convergence speed, particularly in highly undersampled conditions. The use of normalized mean squared error (NMSE) as a performance metric is appropriate, and the experiments are well-structured to showcase the strengths of the proposed method.
The paper provides sufficient detail regarding the implementation, including the architecture of the neural networks, the training process, and the numerical methods used. The availability of the code repository enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work.
While the proposed method shows promising results, the paper does not extensively discuss potential limitations, such as the sensitivity of the model to the choice of hyperparameters or the specific numerical methods employed. Additionally, the method may face challenges in more complex acoustic environments that were not tested in the experiments.
The proposed differentiable physics approach has significant implications for sound field reconstruction and could be applied to various fields, including acoustics, audio engineering, and environmental monitoring. The ability to reconstruct sound fields from limited data could enhance applications in virtual reality, architectural acoustics, and audio signal processing. The integration of physics with machine learning also opens avenues for addressing other inverse problems in different domains.
Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long-context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: (1) long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, corresponding to encoded sequences of 2,250 to 7,500 audio tokens; (2) full domain coverage across speech, sound, and music; and (3) complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The main contribution of this paper is the introduction of AudioMarathon, a benchmark designed to evaluate long-context audio understanding and efficiency in LALMs, addressing a critical gap in current audio processing research. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the potential for driving advancements in audio understanding models.
The paper introduces AudioMarathon, a benchmark that addresses the limitations of existing audio benchmarks by focusing on long-form audio processing. The methodology is well-structured, emphasizing the need for long-context inputs and complex reasoning. The authors provide a clear framework for evaluating LALMs, which includes diverse tasks and a comprehensive approach to assessing both understanding and efficiency. The exploration of acceleration techniques such as token pruning and KV cache eviction adds depth to the methodology, demonstrating a thoughtful approach to optimizing model performance.
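For concreteness, a very simple KV-cache eviction policy (a sliding window that retains a few initial "sink" tokens) is sketched below; the benchmark's actual acceleration strategies are not specified here, so treat this as an illustrative baseline only.

```python
# Schematic sketch of one simple KV-cache eviction policy: keep a few sink tokens
# plus the most recent tokens. Window sizes are illustrative.
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             max_len: int = 2048, n_sink: int = 4):
    """keys/values: (batch, heads, seq_len, head_dim). Keep sink + most recent tokens."""
    seq_len = keys.shape[2]
    if seq_len <= max_len:
        return keys, values
    recent = max_len - n_sink
    idx = torch.cat([torch.arange(n_sink), torch.arange(seq_len - recent, seq_len)])
    return keys[:, :, idx], values[:, :, idx]

# Usage: call after appending each new token's K/V during decoding.
k = torch.randn(1, 8, 3000, 64); v = torch.randn(1, 8, 3000, 64)
k, v = evict_kv(k, v)            # -> (1, 8, 2048, 64)
```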
The experiments are robust, involving state-of-the-art LALMs and a variety of tasks that reflect real-world audio processing challenges. The results clearly indicate performance drops with increasing audio length, which is a significant finding that underscores the current limitations of LALMs. The analysis of trade-offs in acceleration techniques provides valuable insights into the practical implications of model efficiency, though further quantitative details on the performance metrics would enhance the evaluation.
The paper lacks specific implementation details and code availability, which are critical for reproducibility in machine learning research. While the methodology is sound, the absence of a publicly accessible implementation or dataset limits the ability of other researchers to replicate the findings and build upon this work.
One limitation is the lack of a comprehensive comparison with existing benchmarks, which could provide a clearer context for the performance of LALMs on AudioMarathon. Additionally, the paper does not address potential biases in the dataset or the implications of model performance across different audio domains, which could affect generalizability.
The introduction of AudioMarathon has the potential to significantly influence the audio and multimodal research communities by providing a standardized benchmark for long-context audio understanding. This could lead to advancements in model architectures and techniques that improve audio processing capabilities, ultimately benefiting applications in various fields such as speech recognition, music analysis, and sound event detection.
The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease. We investigate the factors on which human experts base their judgments of the presence of disease in speech samples across five different speech tasks: phonations, sentence repetition, reading, recall, and picture description. We make comparisons by conducting listening tests to determine clinicians' accuracy at recognizing signs of PD from audio alone, and we conduct experiments with a machine learning system for detection based on Whisper. Across tasks, Whisper performs on par with or better than human experts when only audio is available, especially on challenging but important subgroups of the data: younger patients, mild cases, and female patients. Whisper's ability to recognize acoustic cues in difficult cases complements the multimodal and contextual strengths of human experts.
Primary: Concordia University
All Institutions: Concordia University, McGill University, Nouvelle Voix, CRBLM, Mila Quebec AI Institute, Montreal Neurological Institute
The main contribution of this paper is the comparative analysis of human expert and machine learning performance in detecting Parkinson's Disease from speech samples, demonstrating that the Whisper model can match or exceed human accuracy in specific demographic groups. This work is significant as it bridges the gap between clinical expertise and machine learning capabilities, highlighting the potential for AI to enhance diagnostic processes in healthcare.
The methodology is well-structured, combining human expert evaluations with machine learning experiments using a frozen Whisper model. The authors effectively designed listening tests to gather qualitative insights from experienced clinicians, which adds depth to the analysis. The use of a minimal configuration on the Whisper model to preserve pretraining effects is a thoughtful approach, although the paper could benefit from a more detailed description of the training process and hyperparameter tuning. The inclusion of data augmentation techniques is commendable, as it helps mitigate overfitting and enhances model robustness.
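A hedged sketch of the general frozen-Whisper setup follows: the encoder is kept frozen and a small linear head is trained on pooled encoder states. The model size, pooling, and classification head are assumptions and not necessarily the authors' exact configuration.

```python
# Hedged sketch of a frozen-Whisper classifier (not the authors' exact setup).
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class FrozenWhisperClassifier(nn.Module):
    def __init__(self, name="openai/whisper-base", n_classes=2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(name).encoder
        for p in self.encoder.parameters():
            p.requires_grad = False               # keep the pretrained encoder intact
        self.head = nn.Linear(self.encoder.config.d_model, n_classes)

    def forward(self, input_features):
        with torch.no_grad():
            h = self.encoder(input_features).last_hidden_state  # (B, T, d_model)
        return self.head(h.mean(dim=1))           # mean-pool over time, then classify

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
wav = torch.randn(16000 * 5).numpy()              # 5 s of placeholder audio at 16 kHz
feats = extractor(wav, sampling_rate=16000, return_tensors="pt").input_features
logits = FrozenWhisperClassifier()(feats)         # (1, 2): PD vs. control
```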
The experiments are comprehensive, utilizing a well-defined dataset from the Quebec Parkinson Network. The performance comparisons across various tasks and demographic groups provide valuable insights into the strengths and weaknesses of both human experts and the Whisper model. However, the paper lacks detailed statistical analysis or significance testing for the reported results, which would strengthen the claims made regarding performance differences.
While the paper outlines the experimental setup and model architecture, it lacks sufficient detail for complete reproducibility. Key aspects such as the exact training procedure, parameter settings, and data preprocessing steps are not fully elaborated. Providing a supplementary material or a GitHub repository with code and data would enhance reproducibility.
The study has several limitations, including the small sample size and potential biases in the dataset. The reliance on audio alone for diagnosis may not fully capture the complexities of Parkinson's Disease, as clinicians typically integrate multimodal information. Additionally, the model's "black box" nature raises concerns about interpretability and accountability in clinical settings.
This research has significant implications for the early detection and monitoring of Parkinson's Disease, potentially improving access to diagnostic care. The findings suggest that machine learning models like Whisper can complement human expertise, particularly in challenging cases. However, the integration of such models into clinical practice will require careful consideration of ethical and interpretative challenges.
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces EmoHRNet, a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure maintains high-resolution representations from the initial to the final layers, capturing both granular and overarching emotional cues from speech signals. By transforming audio samples into spectrograms, EmoHRNet leverages this architecture to extract high-level features. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO, setting a new benchmark in the SER domain.
Primary: Stony Brook University
All Institutions: Stony Brook University
The main contribution of this paper is the introduction of EmoHRNet, a novel high-resolution neural network architecture for speech emotion recognition that achieves state-of-the-art performance across multiple datasets. This work significantly advances the field of SER by effectively capturing emotional nuances through its innovative architecture and methodological rigor.
The methodology presented in EmoHRNet is robust, leveraging the HRNet architecture to maintain high-resolution representations throughout the network. The transformation of audio signals into Mel-spectrograms is a well-established approach in SER, but the adaptation of HRNet for this specific task is innovative. The use of data augmentation techniques, such as frequency and time masking, is appropriate and enhances the model's ability to generalize across different emotional expressions. The architecture's design, which includes high-resolution input modules and multi-resolution stages, is well thought out and addresses the challenges of capturing emotional nuances in speech.
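A minimal sketch of the described input pipeline, log-Mel spectrograms with frequency and time masking, is shown below using torchaudio; the specific parameter values are illustrative, not those reported for EmoHRNet.

```python
# Minimal sketch of the input pipeline: log-Mel spectrograms with SpecAugment-style
# frequency/time masking. Parameter values are illustrative.
import torch
import torchaudio

waveform, sr = torch.randn(1, 48000), 16000       # placeholder 3 s utterance
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                              hop_length=256, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)

spec = to_db(to_mel(waveform))                    # (1, 128, frames)
spec = time_mask(freq_mask(spec))                 # masked spectrogram fed to the CNN
```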
The experimental evaluation is thorough, utilizing three benchmark datasets (RAVDESS, IEMOCAP, and EMOVO) to validate the model's performance. The reported accuracies are impressive, particularly the 92.45% on RAVDESS, which suggests that EmoHRNet significantly outperforms existing models. The comparison with state-of-the-art techniques is comprehensive, providing a clear context for the model's performance. However, the paper could benefit from additional details on the experimental setup, such as the specific training and validation splits used.
The paper provides a reasonable level of detail regarding the training process, including the optimizer settings and loss function. However, the absence of a publicly available code repository or demo limits reproducibility. Future iterations should consider sharing the implementation to facilitate further research and validation.
While EmoHRNet demonstrates strong performance, the paper does not address potential limitations such as the model's computational efficiency or real-time applicability in practical scenarios. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other contexts or languages.
The implications of EmoHRNet are significant for applications in human-machine interaction, particularly in enhancing the emotional intelligence of AI systems. Improved SER capabilities can lead to more empathetic and effective communication in various domains, including customer service, mental health support, and interactive entertainment. The research sets a new benchmark in SER, paving the way for future advancements in the field.
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. Moreover, we further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
Primary: Hangzhou Institute for Advanced Study
All Institutions: Hangzhou Institute for Advanced Study, National Natural Science Foundation of China, Zhejiang Provincial Natural Science Foundation of China
The paper presents a novel approach to fine-grained emotion control in LLM-based TTS systems, leveraging reinforcement learning to enhance emotional expressiveness while maintaining synthesis quality. The combination of global and local prosody control mechanisms represents a significant advancement in the field, with promising implications for future research and applications.
The proposed EMORL-TTS framework effectively integrates supervised fine-tuning with reinforcement learning to achieve fine-grained emotional control in LLM-based TTS systems. The unification of global intensity control in the VAD space with local emphasis regulation is a significant methodological advancement. The use of task-specific rewards tailored for emotion category, intensity, and emphasis enhances the model's ability to synthesize emotionally expressive speech. The methodology is well-structured, with clear stages of SFT and GRPO, although the reliance on discrete speech tokens presents inherent challenges that the authors address through innovative reinforcement learning strategies.
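To illustrate the flavor of task-specific rewards, the sketch below combines emotion-category agreement, closeness to a target point in VAD space, and emphasis-placement overlap into one scalar; the component scorers and weights are stand-ins, not the paper's reward definition.

```python
# Hypothetical composite reward: weighted sum of emotion-category agreement,
# VAD-space intensity closeness, and emphasis overlap. Weights are illustrative.
import numpy as np

def composite_reward(pred_emotion: str, target_emotion: str,
                     pred_vad: np.ndarray, target_vad: np.ndarray,
                     pred_emphasis: set, target_emphasis: set,
                     w=(0.4, 0.4, 0.2)) -> float:
    r_cat = float(pred_emotion == target_emotion)
    r_int = float(np.exp(-np.linalg.norm(pred_vad - target_vad)))   # 1 when VAD matches
    inter = len(pred_emphasis & target_emphasis)
    union = len(pred_emphasis | target_emphasis) or 1
    r_emp = inter / union                                           # Jaccard on emphasized words
    return w[0] * r_cat + w[1] * r_int + w[2] * r_emp

r = composite_reward("happy", "happy",
                     np.array([0.8, 0.6, 0.5]), np.array([0.7, 0.6, 0.5]),
                     {"really"}, {"really"})
```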
The experimental setup is robust, utilizing both objective and subjective evaluation metrics to assess the performance of EMORL-TTS. The use of multiple emotional corpora and the design of comprehensive evaluation tasks, such as Emotion Accuracy Test and Emphasis Accuracy Test, provide a thorough assessment of the model's capabilities. Results indicate significant improvements in emotional accuracy and emphasis clarity compared to baseline models, demonstrating the effectiveness of the proposed method. However, the lack of detailed statistical analysis of the results may limit the depth of the findings.
The paper provides a reasonable level of detail regarding the experimental setup, including training epochs, batch sizes, and learning rates. However, the absence of a publicly available code repository or detailed implementation instructions may hinder full reproducibility. The authors mention that synthesized samples are available online, which is a positive aspect for validation but does not fully address reproducibility concerns.
One limitation of the study is the potential challenge in generalizing the findings across different languages and cultural contexts, as the experiments are conducted solely in English. Additionally, while the model shows improvements in emotional expressiveness, the reliance on discrete token representations may still restrict the model's ability to capture the full spectrum of emotional nuances. The paper also does not address the computational complexity of the proposed method, which could be a concern for practical applications.
The advancements in fine-grained emotional control in TTS systems have significant implications for various applications, including virtual assistants, audiobooks, and interactive gaming. By enhancing the expressiveness of synthesized speech, EMORL-TTS can lead to more engaging and human-like interactions in technology. The potential for cross-lingual extensions and multimodal integration further broadens the scope of its impact, making it a valuable contribution to the field of machine learning and audio synthesis.
Although audio generation has been widely studied over recent years, video-aligned audio generation remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Given an input video, our approach extracts spatial cues from depth maps and bounding boxes and uses them as cross-attention conditioning in a diffusion-based audio generation model. This allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome, Sony AI, Sony Group Corporation
The main contribution of this paper is the introduction of StereoSync, a novel framework for generating spatially-aware stereo audio from video, which significantly enhances the quality and immersion of audio-visual experiences. The technical contributions, particularly the integration of depth and bounding box information into the audio generation process, represent a meaningful advancement in the field of machine learning and audio synthesis.
The methodology presented in StereoSync is innovative, leveraging pretrained foundation models for efficient audio generation that is spatially aware and temporally synchronized with video content. The integration of depth maps and bounding boxes as cross-attention conditioning signals in a diffusion-based audio generation model is a notable advancement. The authors effectively combine various modalities to enhance the audio generation process, ensuring that the generated audio reflects the spatial dynamics of the video scene. However, the paper could benefit from a more detailed explanation of the conditioning mechanisms and the specific architecture of the diffusion model used.
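A schematic of cross-attention conditioning on spatial cues is given below, with audio latents as queries and depth/bounding-box embeddings as keys and values; the dimensions and residual layout are assumptions rather than the StereoSync implementation.

```python
# Schematic sketch of cross-attention conditioning on spatial cues (not the
# StereoSync code): audio latents attend to per-frame depth/bounding-box features.
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    def __init__(self, d_audio=256, d_cond=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_audio, n_heads, kdim=d_cond,
                                          vdim=d_cond, batch_first=True)
        self.norm = nn.LayerNorm(d_audio)

    def forward(self, audio_latents, spatial_tokens):
        # audio_latents: (B, T_audio, d_audio); spatial_tokens: (B, T_video, d_cond)
        attended, _ = self.attn(audio_latents, spatial_tokens, spatial_tokens)
        return self.norm(audio_latents + attended)     # residual conditioning

depth_emb = torch.randn(2, 32, 128)                    # per-frame depth features
bbox_emb = torch.randn(2, 32, 128)                     # per-frame bounding-box features
spatial = torch.cat([depth_emb, bbox_emb], dim=-1)     # (2, 32, 256)
out = SpatialCrossAttention()(torch.randn(2, 200, 256), spatial)
```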
The experimental evaluation is robust, utilizing a well-defined dataset (Walking The Maps) that is appropriate for the task of video-to-audio generation. The metrics employed, including FAD, FAVD, and Spatial AV-Align, provide a comprehensive assessment of audio quality, semantic alignment, and spatial coherence. The results demonstrate that StereoSync achieves significant improvements over a baseline model without spatial conditioning, indicating the effectiveness of the proposed approach. However, the paper lacks a comparative analysis with existing state-of-the-art methods, which would strengthen the claims of advancement.
The paper provides sufficient details about the training process, including the use of specific models and parameters, which aids in reproducibility. However, the lack of publicly available code or a demo URL limits the ability of other researchers to replicate the results directly. Providing access to the trained models or a code repository would enhance reproducibility.
One limitation noted is the reliance on a relatively small dataset, which may affect the generalization of the model. Additionally, while the Spatial AV-Align metric is useful, it may not fully capture the nuances of spatial audio generation, as acknowledged by the authors. Future work should address these limitations by exploring larger datasets and refining evaluation metrics.
The implications of this work are significant for fields such as film production, video game design, and virtual reality, where immersive audio experiences are crucial. By advancing the state of video-to-audio generation, StereoSync could enhance the quality of sound design in multimedia applications, leading to more engaging and realistic experiences for users.
Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
TokenChain introduces a novel discrete speech chain framework that effectively integrates semantic-token ASR with a two-stage TTS system, demonstrating significant improvements in performance and convergence. This work represents a meaningful advancement in the field of speech processing, with potential applications across various domains.
The methodology presented in TokenChain is innovative, leveraging a fully discrete speech chain that integrates semantic-token ASR with a two-stage TTS system. The authors employ advanced techniques such as straight-through estimators and Gumbel-Softmax to facilitate end-to-end feedback, which is a significant improvement over traditional continuous intermediate approaches. The dynamic weight averaging for balancing the ASR and TTS components is a noteworthy addition that enhances the training process.
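The straight-through token interface can be made concrete with a small PyTorch example: the forward pass uses hard one-hot tokens while gradients flow through the Gumbel-Softmax relaxation, so a downstream TTS loss can update the ASR logits. The temperature and codebook here are illustrative, not TokenChain's settings.

```python
# Minimal sketch of a straight-through Gumbel-Softmax token interface:
# hard (one-hot) tokens in the forward pass, soft gradients in the backward pass.
import torch
import torch.nn.functional as F

logits = torch.randn(2, 50, 1024, requires_grad=True)      # ASR logits over semantic tokens
hard_tokens = F.gumbel_softmax(logits, tau=1.0, hard=True)  # one-hot forward, soft backward

codebook = torch.randn(1024, 256)                        # TTS-side token embeddings
tts_input = hard_tokens @ codebook                       # differentiable lookup
loss = tts_input.pow(2).mean()                           # placeholder downstream TTS loss
loss.backward()                                          # gradients reach the ASR logits
print(logits.grad.shape)                                 # torch.Size([2, 50, 1024])
```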
The experimental evaluation is rigorous, utilizing well-established datasets such as LibriSpeech and TED-LIUM. The results demonstrate that TokenChain surpasses baseline models in terms of accuracy and convergence speed, achieving improvements in word error rates (WER) and character error rates (CER). The ablation studies on temperature schedules for in- and cross-domain transfer further strengthen the findings, showcasing a comprehensive approach to model evaluation.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the absence of a public code repository or demo URL limits the ease of reproducibility.
One limitation is the reliance on specific datasets, which may not generalize across all speech recognition and synthesis tasks. Additionally, the paper does not address potential computational overheads associated with the two-stage TTS system, which could affect real-time applications.
The implications of this work are significant for the fields of automatic speech recognition and text-to-speech synthesis, particularly in enhancing the efficiency and effectiveness of machine speech systems. The approach could lead to more robust applications in voice assistants, accessibility tools, and language learning technologies.
Diffusion models have demonstrated remarkable performance in speech synthesis, but typically require multi-step sampling, resulting in low inference efficiency. Recent studies address this issue by distilling diffusion models into consistency models, enabling efficient one-step generation. However, these approaches introduce additional training costs and rely heavily on the performance of pre-trained teacher models. In this paper, we propose ECTSpeech, a simple and effective one-step speech synthesis framework that, for the first time, incorporates the Easy Consistency Tuning (ECT) strategy into speech synthesis. By progressively tightening consistency constraints on a pre-trained diffusion model, ECTSpeech achieves high-quality one-step generation while significantly reducing training complexity. In addition, we design a multi-scale gate module (MSGate) to enhance the denoiser's ability to fuse features at different scales. Experimental results on the LJSpeech dataset demonstrate that ECTSpeech achieves audio quality comparable to state-of-the-art methods under single-step sampling, while substantially reducing the model's training cost and complexity.
Primary: Xinjiang University
All Institutions: Xinjiang University, Tsinghua University, Tianjin University of Technology
The main contribution of this paper is the introduction of ECTSpeech, a novel framework that leverages Easy Consistency Tuning to achieve efficient one-step speech synthesis while maintaining high audio quality. This work significantly advances the field by addressing the limitations of existing diffusion models and consistency models, thereby enhancing the practical applicability of speech synthesis technologies.
The methodology presented in ECTSpeech is innovative as it introduces the Easy Consistency Tuning (ECT) strategy to the domain of speech synthesis for the first time. This approach allows for high-quality one-step generation without the need for a separate student model, significantly streamlining the training process. The incorporation of the multi-scale gate module (MSGate) enhances the model's ability to fuse features at different scales, which is crucial for capturing the nuances of speech signals. The two-stage training process, consisting of diffusion pretraining followed by consistency tuning, is well-structured and effectively addresses the challenges of inference efficiency and training complexity.
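A hedged sketch of a consistency-tuning step is given below: the model's prediction at noise level t is matched to its own stopped-gradient prediction at a nearby level t - dt, and dt is annealed toward zero so the constraint tightens over training. The noise parameterization and the dummy denoiser are simplifying assumptions, not the ECTSpeech formulation.

```python
# Hedged sketch of a consistency-tuning step (not the ECTSpeech code).
import torch
import torch.nn.functional as F

def consistency_step(model, x0, t, dt):
    """x0: clean mel batch; t, dt: noise levels in (0, 1]; returns the consistency loss."""
    noise = torch.randn_like(x0)
    x_t = x0 + t * noise                     # noised sample at level t
    x_s = x0 + (t - dt) * noise              # same trajectory at the closer level t - dt
    pred_t = model(x_t, t)
    with torch.no_grad():
        pred_s = model(x_s, t - dt)          # target: the model's own output at t - dt
    return F.mse_loss(pred_t, pred_s)

dummy = lambda x, t: x                       # stand-in denoiser for illustration
loss = consistency_step(dummy, torch.randn(4, 80, 200), t=0.8, dt=0.2)
# As training proceeds, dt is annealed toward 0 so the model approaches a
# one-step (t = 1 -> 0) generator.
```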
The experimental evaluation is robust, utilizing the LJSpeech dataset to benchmark the proposed model against several state-of-the-art methods. The results indicate that ECTSpeech achieves comparable or superior audio quality with significantly reduced training costs and inference times. The use of both subjective (Mean Opinion Score) and objective (Fréchet Distance, Fréchet Audio Distance) metrics provides a comprehensive assessment of the model's performance. The ablation studies further validate the contributions of the MSGate and consistency tuning, demonstrating their importance in enhancing synthesis quality.
The paper provides sufficient details regarding the model architecture, training protocols, and evaluation metrics, which would allow for reproducibility of the results. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation of the study is the reliance on a single dataset (LJSpeech) for evaluation, which may not fully represent the diversity of speech synthesis tasks. Additionally, while the model shows promising results in one-step generation, the paper does not extensively discuss its performance in more complex scenarios, such as multi-speaker or emotional speech synthesis.
The advancements made in efficient speech synthesis through ECTSpeech have significant implications for applications in voice assistants, content creation, and accessibility technologies. By reducing training complexity and improving inference efficiency, this research could facilitate broader adoption of high-quality speech synthesis in real-time applications.
Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters who select emotions from predefined categories. However, disagreements among raters are common. Conventional methods treat these disagreements as noise, aggregating labels into a single consensus target. While this simplifies SER as a single-label task, it ignores the inherent subjectivity of human emotion perception. This dissertation challenges such assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, with overlapping emotional boundaries. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them with soft-label distributions. Models trained on individual annotator ratings and jointly optimized with standard SER systems improve performance on consensus-labeled tests. (2) Redefine SER evaluation by including all emotional data and allowing co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. Experiments on four English emotion databases show superior performance over majority and plurality labeling. (3) Construct a penalization matrix to discourage unlikely emotion combinations during training. Integrating it into loss functions further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields more robust and human-aligned SER systems.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of innovative modeling and evaluation approaches in Speech Emotion Recognition that account for the subjectivity of annotators and the ambiguity of emotions, significantly enhancing the performance and applicability of SER systems.
The paper proposes a novel approach to Speech Emotion Recognition (SER) by addressing the subjectivity of emotion perception and the ambiguity of emotional boundaries. It introduces three main methodologies: (1) retaining all emotional ratings and using soft-label distributions for training, (2) redefining evaluation methods to include co-occurring emotions through an "all-inclusive rule," and (3) employing a penalization matrix to discourage unlikely emotion combinations during training. This multifaceted approach is well-justified by psychological findings and shows a clear departure from traditional single-label methods, making it a significant contribution to the field.
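The soft-label idea can be sketched briefly: annotator votes become a distribution over emotion classes, the loss is a soft-label cross-entropy, and a penalization matrix can up-weight errors on implausible emotion combinations. The matrix values and the weighting mechanism below are illustrative simplifications, not the dissertation's exact loss.

```python
# Illustrative sketch: annotator votes -> soft labels, plus a simplified
# penalization-matrix weighting. Values are placeholders.
import torch
import torch.nn.functional as F

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes):
    """votes: list of annotator labels -> probability distribution over EMOTIONS."""
    counts = torch.tensor([votes.count(e) for e in EMOTIONS], dtype=torch.float)
    return counts / counts.sum()

target = soft_label(["angry", "angry", "sad"])        # -> [0.67, 0.00, 0.00, 0.33]
logits = torch.randn(1, 4, requires_grad=True)
log_probs = F.log_softmax(logits, dim=-1)

# Soft-label cross-entropy (all ratings retained, not just the majority vote)
ce = -(target * log_probs).sum()

# Penalization matrix: larger entries discourage implausible emotion combinations
penalty = torch.ones(4, 4)
penalty[1, 3] = penalty[3, 1] = 2.0                   # e.g., happy + sad penalized more
weighted = -(target * log_probs * penalty[target.argmax()]).sum()
(ce + weighted).backward()
```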
The experiments conducted on four English emotion databases demonstrate the effectiveness of the proposed methodologies. The results indicate that the new methods outperform conventional majority and plurality labeling approaches, showcasing improvements in SER system performance across various test conditions. The use of multiple datasets strengthens the validity of the findings, although the paper could benefit from more extensive comparative analysis with other state-of-the-art methods.
The paper provides a detailed account of the methodologies, datasets, and experimental setups, which aids reproducibility. However, it lacks explicit URLs or links to code repositories or demo pages, which would enhance the ability of other researchers to replicate the work. Clear documentation of the datasets used and the specific configurations for experiments would further support reproducibility.
One limitation is the reliance on subjective annotations, which can introduce variability and noise in the data. While the paper addresses this by proposing methods to incorporate all ratings, the inherent subjectivity of emotion perception remains a challenge. Additionally, the paper does not explore the potential impact of demographic factors on emotion perception, which could be an avenue for future research.
The findings have significant implications for the development of more robust and human-aligned SER systems, which can be applied in various domains such as customer service, mental health monitoring, and human-computer interaction. By embracing the complexity of human emotions, the proposed methodologies could lead to advancements in emotional AI technologies that better understand and respond to human emotional states.
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from NLP and audio captioning, rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address the gap in literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting weak correlation with human judgment, especially for longer answers. Third, we propose a new metric - AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and motivate better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of AQEval and the AURA score, which together provide a comprehensive framework for evaluating open-ended responses in audio question answering. This work addresses critical shortcomings in existing evaluation metrics and sets a new standard for future research in the field.
The paper introduces AQEval, a novel benchmark for Audio Question Answering (AQA) metrics, which is a significant advancement in evaluating open-ended responses in audio contexts. The methodology employs a combination of human annotations and a new metric, AURA, which integrates reasoning capabilities of large language models (LLMs) with an audio entailment component. This dual approach allows for a more nuanced evaluation of responses, addressing the limitations of existing metrics that primarily focus on surface-level similarity.
The experimental setup is robust, utilizing a dataset of 10k annotated responses that allows for systematic benchmarking of AQA metrics. The authors provide a comprehensive analysis of existing metrics, demonstrating their weak correlation with human judgments, particularly for longer answers. AURA is shown to outperform traditional metrics significantly, achieving state-of-the-art correlation with human ratings. The ablation studies further validate the effectiveness of the proposed methodology.
The paper includes detailed descriptions of the dataset construction, annotation process, and experimental setup, which enhances reproducibility. However, the reliance on specific LLMs for scoring may limit the generalizability of the results to other models or contexts.
While the paper addresses significant gaps in AQA evaluation, it does not explore the potential biases in human annotations or the limitations of the LLMs used. Additionally, the performance of AURA in real-world applications remains to be fully validated.
The introduction of AQEval and AURA has the potential to significantly influence future research in audio-language models and their evaluation. By providing a more accurate assessment of model responses, this work can lead to improvements in the development of ALMs and their applications in various domains, including accessibility, education, and content creation. The main contribution of this paper is the introduction of AQEval and the AURA score, which together provide a comprehensive framework for evaluating open-ended responses in audio question answering. This work addresses critical shortcomings in existing evaluation metrics and sets a new standard for future research in the field.
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with a neural network, and the differential operator is computed with a differentiable numerical solver. The use of a numerical solver enables a stable network training while enforcing the physics as a strong constraint, in contrast to conventional physics-informed neural networks, which include the physics as a constraint in the loss function. We introduce an additional sparsity-promoting constraint to achieve meaningful solutions even under severe undersampling conditions. Experiments demonstrate that the proposed approach can reconstruct sound fields under extreme data scarcity, achieving higher accuracy and better convergence compared to physics-informed neural networks.
Primary: Technical University of Denmark (DTU)
All Institutions: Technical University of Denmark (DTU), Universidad Politécnica de Madrid (UPM)
This work introduces a differentiable physics approach for sound field reconstruction, significantly enhancing accuracy and convergence under data scarcity. The methodology combines neural networks with numerical PDE solvers, showcasing a promising direction for future research in acoustics and machine learning.
The paper presents a novel differentiable physics approach that integrates a neural network with a numerical PDE solver for sound field reconstruction. This method improves stability and convergence compared to traditional physics-informed neural networks (PINNs) by directly incorporating physical constraints through the numerical solver rather than as a penalty in the loss function. The introduction of a sparsity-promoting constraint is particularly innovative, allowing the model to perform well under extreme data scarcity. The use of automatic differentiation (AD) to compute gradients through the numerical solver is a significant methodological advancement, streamlining the training process.
The experiments conducted are rigorous and demonstrate the effectiveness of the proposed method across various scenarios, including single Gaussian pulses and complex source distributions. The results indicate that the differentiable physics approach significantly outperforms PINNs in terms of accuracy and convergence speed, particularly in highly undersampled conditions. The use of normalized mean squared error (NMSE) as a performance metric is appropriate, and the experiments are well-structured to showcase the strengths of the proposed method.
The paper provides sufficient detail regarding the implementation, including the architecture of the neural networks, the training process, and the numerical methods used. The availability of the code repository enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work.
While the proposed method shows promising results, the paper does not extensively discuss potential limitations, such as the sensitivity of the model to the choice of hyperparameters or the specific numerical methods employed. Additionally, the method may face challenges in more complex acoustic environments that were not tested in the experiments.
The proposed differentiable physics approach has significant implications for sound field reconstruction and could be applied to various fields, including acoustics, audio engineering, and environmental monitoring. The ability to reconstruct sound fields from limited data could enhance applications in virtual reality, architectural acoustics, and audio signal processing. The integration of physics with machine learning also opens avenues for addressing other inverse problems in different domains. This work introduces a differentiable physics approach for sound field reconstruction, significantly enhancing accuracy and convergence under data scarcity. The methodology combines neural networks with numerical PDE solvers, showcasing a promising direction for future research in acoustics and machine learning.
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LMs training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Tsinghua University
The main contribution of this paper is the introduction of the Siren framework, which effectively bridges the gap between language models and diffusion models in text-to-audio generation. The comprehensive analysis of the methodology and experimental results highlights its potential to reshape the landscape of audio synthesis, making it a notable advancement in the field.
The proposed methodology introduces a novel framework, Siren, which utilizes multiple isolated transformers with causal conditioning and anti-causal alignment. This approach effectively addresses the limitations of existing RVQ tokenizers in T2A generation by mitigating gradient conflicts and enhancing audio reconstruction fidelity. The use of reinforcement learning for alignment is innovative, although the complexity of the architecture may pose challenges for implementation and scalability.
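As background for the RVQ discussion, the sketch below shows generic residual vector quantization: each layer quantizes whatever residual the previous layers left behind, which is why deeper layers carry progressively lower-energy, less semantic detail. This is a textbook illustration, not Siren's tokenizer; codebook sizes and dimensions are arbitrary.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """Minimal residual VQ: each layer quantizes the residual left by the
    previous layers, so deeper layers encode progressively finer detail."""
    residual = x
    codes, quantized = [], torch.zeros_like(x)
    for cb in codebooks:                                  # cb: (codebook_size, dim)
        d = torch.cdist(residual, cb)                     # distances to all codewords
        idx = d.argmin(dim=-1)                            # token ids for this layer
        q = cb[idx]
        quantized = quantized + q
        residual = residual - q
        codes.append(idx)
    return torch.stack(codes, dim=0), quantized           # (n_layers, T), (T, dim)

# Illustration: 4 RVQ layers over a toy feature sequence.
torch.manual_seed(0)
feats = torch.randn(100, 8)                               # (T, dim) audio features
books = [torch.randn(256, 8) for _ in range(4)]
codes, recon = residual_vector_quantize(feats, books)
print(codes.shape, ((feats - recon) ** 2).mean())         # residual energy shrinks with depth
```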
The experiments are extensive and demonstrate that Siren outperforms both existing LM-based and diffusion-based systems, achieving state-of-the-art results. However, the paper mentions the use of a curated dataset smaller than those in prior work, which raises questions about the generalizability of the results. The evaluation metrics, particularly in terms of fidelity, are well-defined, but further comparisons with a broader range of benchmarks would strengthen the findings.
The paper provides a GitHub repository link for the implementation, which is crucial for reproducibility. However, details on the training process, hyperparameters, and specific datasets used are somewhat limited, which could hinder replication efforts by other researchers.
The authors acknowledge several limitations, including training efficiency due to the sequential training of transformer modules, the trade-off between model size and semantic richness, and the need for larger, more diverse datasets. Addressing these limitations in future work will be essential for advancing the field.
The work has significant implications for multi-modal generation frameworks, potentially enabling more cohesive integration of audio and text. By repositioning LMs as competitive in T2A tasks, it opens pathways for applications in content creation, gaming, and accessibility technologies. The main contribution of this paper is the introduction of the Siren framework, which effectively bridges the gap between language models and diffusion models in text-to-audio generation. The comprehensive analysis of the methodology and experimental results highlights its potential to reshape the landscape of audio synthesis, making it a notable advancement in the field.
We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in speech editing and very competitive results in zero-shot TTS, while not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real-world audio. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, 24.8% preferred the original, and 18.0% preferred the MAVE edit, demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech in both pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post-processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.
Primary: unknown
All Institutions: unknown
MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the integration of structured state-space modeling and cross-modal attention. The paper presents a compelling advancement in the field of audio processing, demonstrating significant improvements in both efficiency and quality, although it would benefit from enhanced reproducibility measures and a discussion of ethical implications.
The paper presents MAVE, an autoregressive architecture that integrates a cross-attentive mechanism with a Mamba backbone for voice editing and TTS synthesis. The methodology is well-structured, leveraging state-space modeling and cross-modal attention to achieve high fidelity in voice editing and synthesis. The use of cross-attention for text-acoustic alignment is particularly innovative, allowing for context-aware modifications to audio. The autoregressive formulation is a natural fit for sequence generation, although more detail on the training regimen and hyperparameter tuning would aid understanding of the model's performance.
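The general pattern described here, a state-space-style mixer over the audio sequence followed by cross-attention onto text embeddings, can be sketched as below. This is not MAVE's actual block: dimensions, normalization placement, and the mixer itself are assumptions, with a GRU standing in for a Mamba layer so the snippet runs without extra dependencies.

```python
import torch
import torch.nn as nn

class CrossAttentiveBlock(nn.Module):
    """One decoder block: a sequence mixer over audio tokens (a Mamba/SSM layer
    in the paper; a GRU stands in here) followed by cross-attention onto text."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)   # placeholder for a Mamba layer
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, audio, text):
        h, _ = self.mixer(self.norm1(audio))                       # left-to-right mixing over audio
        audio = audio + h
        attn, _ = self.cross_attn(self.norm2(audio), text, text)   # align audio queries to text
        audio = audio + attn
        return audio + self.ff(audio)

block = CrossAttentiveBlock()
audio_tokens = torch.randn(1, 200, 256)   # acoustic token embeddings
text_embed = torch.randn(1, 40, 256)      # phoneme/text embeddings
out = block(audio_tokens, text_embed)     # (1, 200, 256)
```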
The experiments are robust, utilizing a variety of benchmarks, including the RealEdit dataset, to evaluate the model's performance. The human evaluation metrics, including pairwise comparisons and MOS scores, provide a comprehensive view of the model's effectiveness. The results indicate that MAVE not only matches but often exceeds existing models like VoiceCraft and FluentSpeech, particularly in terms of memory efficiency and naturalness. However, the paper could benefit from a more detailed analysis of the datasets used and the statistical significance of the results.
The paper lacks sufficient detail regarding the implementation specifics, such as the training process, data preprocessing, and evaluation metrics, which are crucial for reproducibility. While the architecture is described, the absence of code or a demo URL limits the ability for other researchers to replicate the findings. Including a link to a GitHub repository or supplementary materials would significantly enhance reproducibility.
One limitation is the reliance on subjective human evaluations, which can introduce variability and bias. Additionally, while the model shows promise in zero-shot TTS, the performance on diverse speaker characteristics and accents remains unexplored. The paper does not address potential ethical concerns related to voice synthesis technology, such as misuse in deepfakes or privacy violations.
The implications of MAVE are significant, particularly in applications like voice dubbing, personalized voice assistants, and content creation. The ability to edit voice recordings seamlessly has the potential to revolutionize industries reliant on audio content. However, the technology also raises ethical questions regarding consent and the potential for misuse, necessitating careful consideration in its deployment. MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the integration of structured state-space modeling and cross-modal attention. The paper presents a compelling advancement in the field of audio processing, demonstrating significant improvements in both efficiency and quality, although it would benefit from enhanced reproducibility measures and a discussion of ethical implications.
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation.
Primary: unknown
All Institutions: unknown
The paper presents UniVoice, a unified transformer framework that integrates autoregressive speech recognition with flow-matching-based speech synthesis. This work is significant as it explores a novel approach to joint modeling in speech processing, addressing critical limitations in current methodologies and demonstrating robust performance across multiple tasks.
The methodology presented in this paper is innovative, as it proposes a unified framework that integrates autoregressive speech recognition with flow-matching-based synthesis. The dual attention mechanism and text-prefix-guided speech infilling method are significant contributions that address the limitations of existing models that treat ASR and TTS as separate tasks. The continuous representation approach is a notable departure from traditional discrete tokenization methods, which often suffer from information loss. The paper also provides a clear description of the model architecture, training objectives, and attention mask design, which enhances the understanding of the proposed methods.
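A minimal sketch of the mode-dependent masking idea follows, assuming the boolean attn_mask convention of torch.nn.MultiheadAttention (True = position blocked); the paper's actual handling of text-prefix and speech positions is more involved.

```python
import torch

def build_attn_mask(seq_len, mode):
    """Dual attention masking as described: causal for recognition,
    bidirectional (no masking) for synthesis."""
    if mode == "asr":        # autoregressive recognition: token t attends to tokens <= t
        return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    if mode == "tts":        # flow-matching synthesis: full bidirectional context
        return torch.zeros(seq_len, seq_len, dtype=torch.bool)
    raise ValueError(mode)

print(build_attn_mask(4, "asr"))
print(build_attn_mask(4, "tts"))
```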
The experimental evaluation is thorough, utilizing the LibriHeavy dataset for both ASR and TTS tasks. The results demonstrate that UniVoice achieves competitive performance compared to state-of-the-art models in both domains, with specific metrics provided for robustness, similarity, and quality. The ablation studies effectively showcase the advantages of the proposed methods over baseline models, although the paper acknowledges trade-offs in performance when compared to specialized models.
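For context on the robustness numbers, ASR evaluations of this kind typically report word error rate; a self-contained sketch of the standard edit-distance definition is given below (the paper's exact evaluation toolkit is not specified here).

```python
def wer(ref, hyp):
    """Word error rate via edit distance; the standard ASR robustness metric."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```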
The paper provides sufficient implementation details, including model architecture, training procedures, and evaluation metrics, which supports reproducibility. The authors also mention plans to open-source the code and checkpoints, which is a positive step towards enabling other researchers to replicate and build upon their work.
The paper identifies several limitations, including the focus on only ASR and TTS tasks, the relatively small dataset and model size, and the underutilization of the conversational capabilities of LLMs. These limitations suggest that while the work is a significant step forward, there is potential for further development and exploration in future research.
The unified framework proposed in this paper has the potential to advance the field of speech processing by enabling more efficient and effective models that can handle both recognition and synthesis tasks. This could lead to improvements in applications such as virtual assistants, automated transcription services, and voice cloning technologies, ultimately enhancing user experience and accessibility in various domains. The paper presents UniVoice, a unified transformer framework that integrates autoregressive speech recognition with flow-matching-based speech synthesis. This work is significant as it explores a novel approach to joint modeling in speech processing, addressing critical limitations in current methodologies and demonstrating robust performance across multiple tasks.