FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency



Rui Liu1, Jiatian Xi1,*, Ziyue Jiang2, Haizhou Li3,4
1Speech Understanding and Speech Generation (S2) Lab, Inner Mongolia University, Hohhot, China
2Zhejiang University, China
3Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
4National University of Singapore, Singapore
liurui_imu@163.com, x_jiatian@163.com, ziyuejiang@zju.edu.cn, haizhouli@cuhk.edu.cn
ABSTRACT
Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly, without altering the original recording. Current TSE techniques often focus on minimizing the discrepancy between the generated speech and the reference within the edited region during training to achieve fluent output. However, the generated speech in the edited region should also maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluent speech editing scheme based on our previous FluentEditor model, termed FluentEditor2, which introduces multi-scale acoustic and prosody consistency training criteria into TSE training. Specifically, for local acoustic consistency, we propose the Hierarchical Local Acoustic Smoothness Constraint (\( \mathcal{L}_{H L A C} \) ) to align the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose the Contrastive Global Prosody Consistency Constraint (\( \mathcal{L}_{C G P C} \) ) to keep the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that FluentEditor2 surpasses existing neural network-based TSE methods, including EditSpeech, CampNet, \( A^3T \), FluentSpeech, and our FluentEditor, in both subjective and objective evaluations. Ablation studies further highlight the contribution of each module to the overall effectiveness of the system. Speech demos are available at: https://github.com/Ai-S2-Lab/FluentEditor2
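The two constraints described in the abstract can be illustrated with a minimal sketch. It is not the FluentEditor2 implementation: the function names, the mean-absolute-difference boundary measure, the cosine-similarity contrastive form, and the `temperature` parameter are all assumptions made for illustration. The sketch only shows the general idea of penalizing acoustic jumps at the edit boundaries and pulling the prosody of the edited region toward that of the original utterance.

```python
import math


def boundary_smoothness(frames, start, end):
    """Mean absolute jump across the two boundaries of the edited region.

    frames: list of equal-length feature vectors (e.g. mel frames).
    [start, end) marks the edited region (hypothetical convention).
    A boundary that coincides with the utterance edge is skipped.
    """
    def jump(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    jumps = []
    if start > 0:
        jumps.append(jump(frames[start - 1], frames[start]))
    if end < len(frames):
        jumps.append(jump(frames[end - 1], frames[end]))
    return sum(jumps) / len(jumps) if jumps else 0.0


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_prosody_loss(edited, original, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    edited:    prosody vector of the edited region
    original:  prosody vector of the original utterance (positive)
    negatives: prosody vectors of other utterances
    """
    pos = math.exp(cosine(edited, original) / temperature)
    neg = sum(math.exp(cosine(edited, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Under these assumptions, the same `boundary_smoothness` function could be applied to frame-level features as well as to features pooled per phoneme or per word, which is one way to read the "hierarchical" part of the constraint.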
Speech Demo
Dataset: VCTK and LibriTTS
Operations: Insertion, Replacement and Deletion
1. Operation of text-based speech editing based on FluentEditor2
Insertion
item_name GT FluentEditor2
p274_339



original_text: There is a handful of rewarding paintings .



edited_text: There is a handful of rewarding but challenging paintings .
4640_19187_000026_000005



original_text: The mouse , plus the cat , is the proof of creation revised and corrected .



edited_text: The mouse , plus the cat , is the ultimate proof of creation revised and corrected .
Replacement
item_name GT FluentEditor2
p260_179



original_text: I saw military vehicles in the distance .



edited_audio: I observed (saw) military vehicles in the distance .
3983_5371_000008_000001



original_text: Much he knows of heaven !



edited_audio: Much he understands (knows) of heaven !
Deletion
item_name GT FluentEditor2
p293_073



original_text: It has been a lovely family occasion .



edited_text: it has been a lovely family occasion .
5808_54425_000010_000005



original_text: She kept her promise and threw him over .



edited_text: She kept her promise and threw him over .
2. Comparison between FluentEditor2 and Other Systems
  • FluentEditor2: FluentEditor2 improves TSE by enforcing local acoustic smoothness and global prosody consistency, ensuring fluency between edited and unedited regions; [paper] [code]
  • FluentEditor: FluentEditor proposes a fluent speech editing model that maintains speech fluency by adding fluency-aware training criteria to TSE training; [paper] [code]
  • FluentSpeech: FluentSpeech uses a diffusion model as its backbone and predicts the masked features with the help of the surrounding speech context; [paper] [code]
  • \( A^3T \): \( A^3T \) proposes alignment-aware acoustic-text pre-training that takes both phonemes and partially masked spectrograms as inputs; [paper] [code]
  • CampNet: CampNet proposes a context-aware mask prediction network to simulate the process of text-based speech editing. [paper]
  • FluentEditor2 FluentEditor FluentSpeech \( A^3T \) CampNet EditSpeech
    operation: Insert
    item_name: p288_080
    original_text: It's all going to clear the debt .
    edited_audio: It's eventually all going to clear the debt .






    operation: Insert
    item_name: 8580_287364_000068_000001
    original_text: But i hoped to accomplish it by other means .
    edited_audio: But I sincerely hoped to accomplish it by other means .






    operation: Replace
    item_name: p278_145
    original_text: It was his job to check .
    edited_audio: It was his responsibility (job) to check .






    operation: Replace
    item_name: p4195_186238_000047_000001
    original_text: We're dreadfully rich , uncle john ; so you neednt worry if you dont strike a job yourself all at once .
    edited_audio: We're incredibly (dreadfully) rich , uncle john ; so you neednt worry if you dont strike a job yourself all at once .






    operation: Deletion
    item_name: p238_047
    original_text: They have shown a great desire and attitude .
    edited_audio: They have shown a great desire and attitude .






    operation: Deletion
    item_name: 1867_154071_000012_000003
    original_text: There could be no shadow of a doubt about it .
    edited_audio: There could be no no shadow of a doubt about it .






3. Ablation study on FluentEditor2
To further validate the individual contributions of \( \mathcal{L}_{H L A C} \) and \( \mathcal{L}_{C G P C} \), we design two ablation experiments:
  • \( \text { w/o } \mathcal{L}_{H L A C} \): removes the local acoustic consistency training criterion
  • \( \text { w/o } \mathcal{L}_{C G P C} \): removes the global prosody consistency training criterion
  • GT FluentEditor2 \( \text { w/o } \mathcal{L}_{H L A C} \) \( \text { w/o } \mathcal{L}_{C G P C} \)
    operation: Insert
    item_name: p265_292
    original_text: Last night was a key episode .
    edited_audio: Last night was undoubtedly a key episode .




    operation: Insert
    item_name: 405_130895_000037_000005
    original_text: Later, Sir Richard Hawkins called them the Maidenland, after the blessed virgin .
    edited_audio: Later, Sir Richard Hawkins called them the Maidenland, named after the Blessed Virgin .