FluentEditor Demo

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

Rui Liu¹, Jiatian Xi¹, Ziyue Jiang², Haizhou Li³,⁴
¹Speech Understanding and Speech Generation (S2) Lab, Inner Mongolia University, Hohhot, China
²Zhejiang University, China
³Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
⁴National University of Singapore, Singapore

liurui_imu@163.com, x_jiatian@163.com, ziyuejiang@zju.edu.cn, haizhouli@cuhk.edu.cn

Abstract

Text-based speech editing (TSE) techniques let users edit the output audio by modifying the input text transcript instead of the audio itself. Despite much progress in neural network-based TSE, current techniques focus on reducing the difference between the generated speech segment and the reference target in the editing region, ignoring local and global fluency with respect to the surrounding context and the original utterance. To maintain speech fluency, we propose a fluency-aware speech editing model, termed FluentEditor, which adds fluency-aware training criteria to TSE training. Specifically, the acoustic consistency constraint aims to make the transition between the edited region and its neighboring acoustic segments as smooth as in the ground truth, while the prosody consistency constraint seeks to keep the prosody attributes within the edited region consistent with the overall style of the original utterance. Subjective and objective experimental results on VCTK demonstrate that FluentEditor outperforms all compared baselines in terms of naturalness and fluency. The audio samples and code are available at: https://github.com/ai-s2-lab/fluenteditor

Speech Demo

Dataset: VCTK
Operations: Insertion and Replacement

1. FluentEditor performance in terms of Insertion and Replacement

Insertion

Item_name	GT	FluentEditor
p308_100	Original_Text: We have no idea what caused the derailment.	Edited_Text: We have absolutely no idea what caused the derailment.
p272_017	Original_Text: Others have tried to explain the phenomenon physically.	Edited_Text: Others have tried to explain the rare phenomenon for them (~~phenomenon~~) physically.

Replacement

Item_name	GT	FluentEditor
p273_225	Original_Text: He is obviously very dangerous.	Edited_Text: He is distinctly (~~obviously~~) very dangerous.
p363_352	Original_Text: He was later deported.	Edited_Text: He was eventually and reluctantly (~~later~~) deported.

2. Comparison between FluentEditor and Other Systems

FluentEditor: FluentEditor proposes a fluency-aware speech editing model that maintains speech fluency by adding fluency-aware training criteria to TSE training; [paper] [code]

FluentSpeech: FluentSpeech takes a diffusion model as its backbone and predicts the masked features with the help of the context speech; [paper] [code]

\( A^3T \): \( A^3T \) proposes alignment-aware acoustic-text pre-training that takes both phonemes and partially masked spectrograms as inputs; [paper] [code]

CampNet: CampNet proposes a context-aware mask prediction network to simulate the process of text-based speech editing. [paper]

GT	FluentEditor	FluentSpeech	\( A^3T \)	CampNet
Operation: Insertion Item_name: p256_053	Original_Text: I would love to have him home. Edited_Text: I would absolutely love to have him home.

Operation: Insertion Item_name: p294_237	Original_Text: He was overwhelmed by the response. Edited_Text: He was overwhelmed by the incredible response.

Operation: Replacement Item_name: p286_427	Original_Text: No date has been fixed for his return. Edited_Text: No date has been appointed (~~fixed~~) for his return.

Operation: Replacement Item_name: p245_345	Original_Text: He was then replaced by ross. Edited_Text: He was subsequently substituted (~~then replaced~~) by ross.

3. Ablation Study of FluentEditor

To validate the individual contributions of \( L_{AC} \) and \( L_{PC} \), two ablation experiments are designed:

\( \text{w/o } L_{AC} \): removes the acoustic consistency training criterion

\( \text{w/o } L_{PC} \): removes the prosody consistency training criterion
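
The two ablation settings can be pictured as switching individual terms of the training objective on or off. The sketch below is illustrative only. It assumes the objective is a weighted sum of a reconstruction term and the two consistency terms, which this page does not state. `LossWeights`, `total_loss`, the `use_ac`/`use_pc` flags, and the weight values are hypothetical names chosen for the example, not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the page does not say how the terms are balanced.
    reconstruction: float = 1.0
    acoustic_consistency: float = 1.0
    prosody_consistency: float = 1.0


def total_loss(l_rec, l_ac, l_pc, weights=LossWeights(), use_ac=True, use_pc=True):
    """Combine already-computed loss values (assumed weighted-sum form).

    use_ac=False mimics the "w/o L_AC" setting,
    use_pc=False mimics the "w/o L_PC" setting.
    """
    total = weights.reconstruction * l_rec
    if use_ac:
        total += weights.acoustic_consistency * l_ac
    if use_pc:
        total += weights.prosody_consistency * l_pc
    return total


# Placeholder loss values, not results from the paper.
full = total_loss(0.5, 0.25, 0.125)
without_ac = total_loss(0.5, 0.25, 0.125, use_ac=False)
without_pc = total_loss(0.5, 0.25, 0.125, use_pc=False)
```

Under this assumed form, each ablated system is trained with the same pipeline and differs only in which consistency term contributes to the objective, so differences in the samples can be attributed to the removed term.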

GT	FluentEditor	\( \text{w/o } L_{AC} \)	\( \text{w/o } L_{PC} \)
Operation: Insertion Item_name: p267_058	Original_Text: It's just the timing of the game. Edited_Text: It's just the timing of the exciting game.

Operation: Insertion Item_name: p306_126	Original_Text: There is a lack of chemistry. Edited_Text: There is a noticeable lack of chemistry.

Operation: Replacement Item_name: p341_357	Original_Text: This is normal for him now. Edited_Text: This is routine (~~normal~~) for him now.

Operation: Replacement Item_name: p323_194	Original_Text: You could see he was thinking about it. Edited_Text: You could see he was deep in thought (~~thinking~~) about it.