FluentEditor2 Demo

FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency

Rui Liu¹, Jiatian Xi¹,*, Ziyue Jiang², Haizhou Li³,⁴
¹Speech Understanding and Speech Generation (S2) Lab, Inner Mongolia University, Hohhot, China
²Zhejiang University, China
³Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
⁴National University of Singapore, Singapore
liurui_imu@163.com, x_jiatian@163.com, ziyuejiang@zju.edu.cn, haizhouli@cuhk.edu.cn

ABSTRACT

Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording. Current TSE techniques often focus on minimizing discrepancies between the generated speech and the reference within edited regions during training to achieve fluent TSE performance. However, the generated speech in the edited region should also maintain acoustic and prosodic consistency with the unedited region and the original speech at both the local and global levels. To maintain speech fluency, we propose a new fluent speech editing scheme based on our previous FluentEditor model, termed FluentEditor2, which models multi-scale acoustic and prosody consistency training criteria in TSE training. Specifically, for local acoustic consistency, we propose the Hierarchical Local Acoustic Smoothness Constraint (\( \mathcal{L}_{HLAC} \)) to align the acoustic properties of speech frames, phonemes, and words at the boundary between the generated speech in the edited region and the speech in the unedited region. For global prosody consistency, we propose the Contrastive Global Prosody Consistency Constraint (\( \mathcal{L}_{CGPC} \)) to keep the speech in the edited region consistent with the prosody of the original utterance. Extensive experiments on the VCTK and LibriTTS datasets show that FluentEditor2 surpasses existing neural network-based TSE methods, including EditSpeech, CampNet, \( A^3T \), FluentSpeech, and our FluentEditor, in both subjective and objective evaluations. Ablation studies further highlight the contribution of each module to the overall effectiveness of the system. Speech demos are available at: https://github.com/Ai-S2-Lab/FluentEditor2

CONTENT

TSE Operation

Insertion

Replacement

Deletion

Comparison with Other Baselines

Ablation study

Speech Demo

Dataset: VCTK and LibriTTS
Operations: Insertion, Replacement and Deletion
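As a text-only illustration of the three operations, the sketch below applies an insertion, replacement, or deletion to a tokenized transcript and renders the result in the notation used on this page (`~~...~~` for removed words, with the replaced words shown in parentheses after the new ones). The function `apply_edit` and its word-index arguments are hypothetical helpers introduced for this example; they are not part of the FluentEditor2 code, and the sketch says nothing about how the edited audio itself is generated.

```python
def apply_edit(words, op, start, end=None, new_words=()):
    """Apply one text edit to a list of words (hypothetical helper).

    start/end are word indices (end exclusive) of the span to change.
    Returns (edited_text, markup), where markup follows the demo-page
    notation: deleted or replaced words are wrapped in ~~...~~.
    """
    end = start if end is None else end
    old = words[start:end]
    new = list(new_words)

    if op == "insert":
        if old or not new:
            raise ValueError("insert needs new words and an empty span")
        edited = words[:start] + new + words[end:]
        shown = new
    elif op == "replace":
        if not old or not new:
            raise ValueError("replace needs a non-empty span and new words")
        edited = words[:start] + new + words[end:]
        shown = new + ["(~~" + " ".join(old) + "~~)"]
    elif op == "delete":
        if not old:
            raise ValueError("delete needs a non-empty span")
        edited = words[:start] + words[end:]
        shown = ["~~" + " ".join(old) + "~~"]
    else:
        raise ValueError("unknown operation: " + op)

    markup = " ".join(words[:start] + shown + words[end:])
    return " ".join(edited), markup


if __name__ == "__main__":
    sentence = "It was his job to check .".split()
    print(apply_edit(sentence, "replace", 3, 4, ["responsibility"])[1])
```

In a TSE system, only the audio corresponding to the changed word span is regenerated; the surrounding words map to the unedited region of the original recording.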

1. Operation of text-based speech editing based on FluentEditor2

Insertion

item_name GT FluentEditor2

p274_339

original_text: There is a handful of rewarding paintings .

edited_text: There is a handful of rewarding but challenging paintings .

4640_19187_000026_000005

original_text: The mouse , plus the cat , is the proof of creation revised and corrected .

edited_text: The mouse , plus the cat , is the ultimate proof of creation revised and corrected .

Replacement

item_name GT FluentEditor2

p260_179

original_text: I saw military vehicles in the distance .

edited_audio: I observed (~~saw~~) military vehicles in the distance .

3983_5371_000008_000001

original_text: Much he knows of heaven !

edited_audio: Much he understands (~~knows~~) of heaven !

Deletion

item_name GT FluentEditor2

p293_073

original_text: It has been a lovely family occasion .

edited_text: It has been a ~~lovely~~ family occasion .

5808_54425_000010_000005

original_text: She kept her promise and threw him over .

edited_text: She kept her promise ~~and threw~~ him over .

2. Comparison between FluentEditor2 and Other Systems

FluentEditor2: FluentEditor2 improves TSE by enforcing local acoustic smoothness and global prosody consistency, ensuring fluency between edited and unedited regions; [paper] [code]

FluentEditor: FluentEditor proposes a fluent speech editing model that maintains speech fluency by introducing fluency-aware training criteria into TSE training; [paper] [code]

FluentSpeech: FluentSpeech takes a diffusion model as its backbone and predicts the masked features with the help of the context speech; [paper] [code]

\( A^3T \): \( A^3T \) proposes alignment-aware acoustic-text pre-training that takes both phonemes and partially masked spectrograms as inputs; [paper] [code]

CampNet: CampNet proposes a context-aware mask prediction network to simulate the process of text-based speech editing. [paper]

FluentEditor2 FluentEditor FluentSpeech \( A^3T \) CampNet EditSpeech

operation: Insert
item_name: p288_080 original_text: It's all going to clear the debt .
edited_audio: It's eventually all going to clear the debt .

operation: Insert
item_name: 8580_287364_000068_000001 original_text: But i hoped to accomplish it by other means .
edited_audio: But I sincerely hoped to accomplish it by other means .

operation: Replace
item_name: p278_145 original_text: It was his job to check .
edited_audio: It was his responsibility (~~job~~) to check .

operation: Replace
item_name: p4195_186238_000047_000001 original_text: We're dreadfully rich , uncle john ; so you neednt worry if you dont strike a job yourself all at once .
edited_audio: We're incredibly (~~dreadfully~~) rich , uncle john ; so you neednt worry if you dont strike a job yourself all at once .

operation: Deletion
item_name: p238_047 original_text: They have shown a great desire and attitude .
edited_audio: They have shown ~~a great~~ desire and attitude .

operation: Deletion
item_name: 1867_154071_000012_000003 original_text: There could be no shadow of a doubt about it .
edited_audio: There could be no ~~shadow of a~~ doubt about it .

3. Ablation study on FluentEditor2

To further validate the individual contributions of \( \mathcal{L}_{HLAC} \) and \( \mathcal{L}_{CGPC} \), two ablation experiments are designed:

\( \text{w/o } \mathcal{L}_{HLAC} \): Remove the acoustic consistency training criterion

\( \text{w/o } \mathcal{L}_{CGPC} \): Remove the prosody consistency training criterion
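The two ablation settings amount to dropping one term from the training objective. The sketch below shows that idea only: it assumes the objective is a sum of a reconstruction term, an acoustic consistency term, and a prosody consistency term. The function name, the flags, the unit weights, and the numeric values are all hypothetical and are not taken from the paper or its code.

```python
def training_objective(recon, acoustic_term, prosody_term,
                       use_acoustic=True, use_prosody=True,
                       w_acoustic=1.0, w_prosody=1.0):
    """Combine loss terms; a term is skipped when its flag is False.

    Assumption: a plain weighted sum. The weights are placeholders,
    not values reported for FluentEditor2.
    """
    loss = recon
    if use_acoustic:
        loss += w_acoustic * acoustic_term
    if use_prosody:
        loss += w_prosody * prosody_term
    return loss


# The three systems compared in the ablation, expressed as flag settings.
VARIANTS = {
    "full": dict(use_acoustic=True, use_prosody=True),
    "w/o acoustic consistency": dict(use_acoustic=False, use_prosody=True),
    "w/o prosody consistency": dict(use_acoustic=True, use_prosody=False),
}

if __name__ == "__main__":
    # Toy per-batch values, chosen only to make the arithmetic visible.
    for name, flags in VARIANTS.items():
        print(name, training_objective(1.0, 0.5, 0.25, **flags))
```

Each ablated model is trained from scratch with its own objective, so the audio samples in this section reflect the effect of the missing criterion rather than a post-hoc change to a single trained model.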

GT FluentEditor2 \( \text{w/o } \mathcal{L}_{HLAC} \) \( \text{w/o } \mathcal{L}_{CGPC} \)

operation: Insert
item_name: p265_292 original_text: Last night was a key episode .
edited_audio: Last night was undoubtedly a key episode .

operation: Insert
item_name: 405_130895_000037_000005 original_text: Later, Sir Richard Hawkins called them the Maidenland, after the blessed virgin .
edited_audio: Later, Sir Richard Hawkins called them the Maidenland, named after the Blessed Virgin .
