FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

Rui Liu1, Jiatian Xi1, Ziyue Jiang2, Haizhou Li3,4
1Speech Understanding and Speech Generation (S2) Lab, Inner Mongolia University, Hohhot, China
2Zhejiang University, China
3Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
4National University of Singapore, Singapore

liurui_imu@163.com, x_jiatian@163.com, ziyuejiang@zju.edu.cn, haizhouli@cuhk.edu.cn
Abstract
Text-based speech editing (TSE) techniques let users edit the output audio by modifying the input text transcript instead of the audio itself. Despite much progress in neural network-based TSE, current techniques focus on reducing the difference between the generated speech segment and the reference target in the editing region, ignoring the segment's local and global fluency with respect to its context and the original utterance. To maintain speech fluency, we propose a fluency-aware speech editing model, termed FluentEditor, which adds fluency-aware training criteria to TSE training. Specifically, the acoustic consistency constraint aims to smooth the transition between the edited region and its neighboring acoustic segments so that it is consistent with the ground truth, while the prosody consistency constraint seeks to ensure that the prosody attributes within the edited region remain consistent with the overall style of the original utterance. Subjective and objective experimental results on VCTK demonstrate that FluentEditor outperforms all advanced baselines in terms of naturalness and fluency. The audio samples and code are available at: https://github.com/ai-s2-lab/fluenteditor
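To make the two constraints in the abstract more concrete, the following is a minimal sketch and not the released implementation. It assumes mel-spectrograms are stored as lists of frames, that the edited region is the frame range [start, end), and that a per-bin mean over time can stand in for a learned prosody descriptor. All function names are hypothetical.

```python
from statistics import fmean
from typing import List

Frame = List[float]  # one mel-spectrogram frame (assumed representation)


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two equal-length vectors."""
    return fmean((x - y) ** 2 for x, y in zip(a, b))


def boundary_delta(mel: List[Frame], idx: int) -> Frame:
    """Frame-to-frame change across the boundary between idx - 1 and idx."""
    return [cur - prev for prev, cur in zip(mel[idx - 1], mel[idx])]


def acoustic_consistency(pred: List[Frame], target: List[Frame],
                         start: int, end: int) -> float:
    """Penalise boundary transitions that differ from the ground truth.

    Assumes 0 < start < end < len(pred), so both boundaries have a
    neighbouring context frame.
    """
    left = mse(boundary_delta(pred, start), boundary_delta(target, start))
    right = mse(boundary_delta(pred, end), boundary_delta(target, end))
    return left + right


def style_vector(mel: List[Frame]) -> Frame:
    """Stand-in prosody descriptor: per-bin mean over time (an assumption)."""
    return [fmean(column) for column in zip(*mel)]


def prosody_consistency(pred: List[Frame], original: List[Frame],
                        start: int, end: int) -> float:
    """Distance between the edited region's style and the utterance's style."""
    return mse(style_vector(pred[start:end]), style_vector(original))


# Toy example: a smooth ground truth and a prediction that jumps in [1, 3).
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pred = [[0.0, 0.0], [5.0, 5.0], [6.0, 6.0], [3.0, 3.0]]
ac = acoustic_consistency(pred, target, 1, 3)
pc = prosody_consistency(pred, target, 1, 3)
```

Both terms are zero when the prediction equals the ground truth and grow as the edited frames break away from their context. A real system would presumably replace `style_vector` with a trained prosody encoder.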
Speech Demo
Dataset: VCTK
Operations: Insertion and Replacement
1. FluentEditor performance in terms of Insertion and Replacement
Insertion
Item_name GT FluentEditor
p308_100
Original_Text: We have no idea what caused the derailment.
Edited_Text: We have absolutely no idea what caused the derailment.
p272_017
Original_Text: Others have tried to explain the phenomenon physically.
Edited_Text: Others have tried to explain the rare phenomenon for them (phenomenon) physically.
Replacement
Item_name GT FluentEditor
p273_225
Original_Text: He is obviously very dangerous.
Edited_Text: He is distinctly (obviously) very dangerous.
p363_352
Original_Text: He was later deported.
Edited_Text: He was eventually and reluctantly (later) deported.
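The two operations demonstrated above can be told apart from the transcripts alone. The sketch below uses Python's standard `difflib` to label a word-level edit as an insertion or a replacement. It assumes the edited text is passed without the parenthesised original words that this page shows for reference, and the helper names are hypothetical.

```python
import difflib
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def classify_edit(original: str, edited: str) -> List[Tuple[str, str, str]]:
    """Return (operation, original_words, edited_words) for each changed span."""
    src, dst = tokenize(original), tokenize(edited)
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            edits.append(("insertion", "", " ".join(dst[j1:j2])))
        elif tag == "replace":
            edits.append(("replacement", " ".join(src[i1:i2]), " ".join(dst[j1:j2])))
        elif tag == "delete":
            # Deletion is not demonstrated on this page; reported for completeness.
            edits.append(("deletion", " ".join(src[i1:i2]), ""))
    return edits


insertion = classify_edit(
    "We have no idea what caused the derailment.",
    "We have absolutely no idea what caused the derailment.",
)
replacement = classify_edit(
    "He was later deported.",
    "He was eventually and reluctantly deported.",
)
```

Here `insertion` holds a single insertion of "absolutely", and `replacement` holds a single replacement of "later" by "eventually and reluctantly". Mapping these word spans to frame ranges would additionally need a forced alignment, which is outside this sketch.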
2. Comparison between FluentEditor and Other Systems
  • FluentEditor: FluentEditor is a fluency-aware speech editing model that maintains speech fluency by adding fluency-aware training criteria to TSE training; [paper] [code]
  • FluentSpeech: FluentSpeech uses a diffusion model as its backbone and predicts the masked features with the help of the context speech; [paper] [code]
  • \( A^3T \): \( A^3T \) proposes alignment-aware acoustic-text pre-training that takes both phonemes and partially masked spectrograms as inputs; [paper] [code]
  • CampNet: CampNet proposes a context-aware mask prediction network to simulate the process of text-based speech editing. [paper]
GT FluentEditor FluentSpeech \( A^3T \) CampNet
Operation: Insertion
Item_name: p256_053
Original_Text: I would love to have him home.
Edited_Text: I would absolutely love to have him home.
Operation: Insertion
Item_name: p294_237
Original_Text: He was overwhelmed by the response.
Edited_Text: He was overwhelmed by the incredible response.
Operation: Replacement
Item_name: p286_427
Original_Text: No date has been fixed for his return.
Edited_Text: No date has been appointed (fixed) for his return.
Operation: Replacement
Item_name: p245_345
Original_Text: He was then replaced by Ross.
Edited_Text: He was subsequently substituted (then replaced) by Ross.
3. Ablation Study of FluentEditor
To validate the individual contributions of \( {L}_{A C} \) and \( {L}_{P C} \), we design two ablation experiments:
  • \( \text { w/o } {L}_{A C} \): Remove the acoustic consistency training criterion
  • \( \text { w/o } {L}_{P C} \): Remove the prosody consistency training criterion
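One way to picture the two ablation settings is as switches on the training objective. The sketch below assumes the objective is a base reconstruction loss plus weighted \( {L}_{A C} \) and \( {L}_{P C} \) terms. The additive form, the weights, and all names are assumptions made for illustration and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossConfig:
    """Which fluency-aware terms are active (hypothetical configuration)."""
    use_ac: bool = True      # acoustic consistency term
    use_pc: bool = True      # prosody consistency term
    weight_ac: float = 1.0   # assumed weight, not from the source
    weight_pc: float = 1.0   # assumed weight, not from the source


def total_loss(base: float, l_ac: float, l_pc: float, cfg: LossConfig) -> float:
    """Combine the base reconstruction loss with the enabled consistency terms."""
    loss = base
    if cfg.use_ac:
        loss += cfg.weight_ac * l_ac
    if cfg.use_pc:
        loss += cfg.weight_pc * l_pc
    return loss


FULL = LossConfig()                    # full FluentEditor objective
WITHOUT_AC = LossConfig(use_ac=False)  # "w/o L_AC" ablation
WITHOUT_PC = LossConfig(use_pc=False)  # "w/o L_PC" ablation
```

Under this reading, each ablated system is trained with the same data and base loss, and only one consistency term is dropped at a time.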
GT FluentEditor \( \text { w/o } {L}_{A C} \) \( \text { w/o } {L}_{P C} \)
Operation: Insertion
Item_name: p267_058
Original_Text: It's just the timing of the game.
Edited_Text: It's just the timing of the exciting game.
Operation: Insertion
Item_name: p306_126
Original_Text: There is a lack of chemistry.
Edited_Text: There is a noticeable lack of chemistry.
Operation: Replacement
Item_name: p341_357
Original_Text: This is normal for him now.
Edited_Text: This is routine (normal) for him now.
Operation: Replacement
Item_name: p323_194
Original_Text: You could see he was thinking about it.
Edited_Text: You could see he was deep in thought (thinking) about it.