Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech
Rui Liu1, Bin Liu1, Haizhou Li2,3
1Inner Mongolia University, China
2Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
3National University of Singapore, Singapore
liurui_imu@163.com, iframe_liu@163.com, haizhouli@cuhk.edu.cn
ABSTRACT
Prosodic phrasing is crucial to the naturalness and intelligibility of end-to-end Text-to-Speech (TTS). Natural speech carries both linguistic and emotional prosody. Because the study of prosodic phrasing has been linguistically motivated, prosodic phrasing for expressive emotion rendering has not been well studied. In this paper, we propose an emotion-aware prosodic phrasing model, termed EmoPP, to accurately mine the emotional cues of an utterance and predict appropriate phrase breaks. We first conduct objective observations on the ESD dataset to validate the strong correlation between emotion and prosodic phrasing. Objective and subjective evaluations then show that EmoPP outperforms all baselines and achieves remarkable performance in terms of emotion expressiveness. The audio samples and the code are available at https://github.com/AI-S2-Lab/EmoPP.
SPEECH DEMO
To further validate EmoPP in terms of human perception, we build two emotional TTS systems that take both the input text and phrase break information as input. The phrase break information for the first system is predicted by the BiLSTM model, while that for the second is predicted by EmoPP. The emotional TTS model is trained on DailyTalk, an emotional conversational TTS dataset, following this project: https://github.com/keonlee9420/DailyTalk.
Note: We attempted to train the emotional TTS model using the IEMOCAP dataset. However, the synthesized speech contained significant noise. Since IEMOCAP was not originally designed for TTS, it is not well suited to our subjective test.
Each entry below lists four fields in order: Utterance, Emotion, BiLSTM, EmoPP.

The BiLSTM and EmoPP fields show the phrase breaks predicted by the BiLSTM model and the EmoPP model, respectively. If a word is followed by a break, it is marked with a "#".

oh my god, what are you going to do.

surprise

oh my god#, what# are you going to do.

oh my god# what are you going to do.

a rapper party, ho yeah, okay.

surprise

a rapper party ho yeah# okay.

a rapper# party# ho# yeah# okay.

i guess we don't need glasses.

happy

i guess# we don't need glasses.

i guess we don't need glasses.

nonsense, they have a bag of venom behind their fangs and they snap, they snap.

angry

nonsense# they have a bag of venom# behind their fangs# and they snap they snap.

nonsense# they have a bag of venom behind their fangs and they snap# they snap.

uh, huh, i didn't come here, get in yelling match either.

neutral

uh# huh# i didn't come here get in yelling match either.

uh# huh# i didn't come here# get in yelling match either.

oh, yeah, absolutely absolutely.

neutral

oh yeah# absolutely absolutely.

oh# yeah# absolutely absolutely.

my computer which has all of my data which i'm collecting right now.

angry

my computer which has all of my data which# i'm collecting right now.

my computer which has all of my data which i'm collecting right now.

just kind of feel numb, you know.

sad

just kind of feel numb# you know.

just kind of feel numb# you know.

you have a business here, i said, what the hell is this.

angry

you have a business here# i said what the hell is this.

you have a business here# i said# what the hell is this.

they don't know why, we don't know why, no one like sent them an invitation or gave them a map or direction.

surprise

they don't know why# we don't know why no one like sent them an invitation or gave them a map or direction.

they don't know why# we don't know why# no one like sent them an invitation or gave them a map or direction.

well, so what do you think.

surprise

well# so what do you think.

well# so what do you think.

are you cold, huh, do you want to go home.

neutral

are you cold huh# do you want to go home.

are you cold# huh# do you want to go home.

yeah, it's pretty good.

happy

yeah# it's pretty good.

yeah# it's pretty good.

yeah, i mean, candles wouldn't stay--no, i didn't--i didn't know anything about it.

happy

yeah# i mean candles wouldn't stay--no# i didn't--i didn't know# anything about it.

yeah# i mean# candles wouldn't stay--no# i didn't--i didn't know anything about it.

yea, i just want to get this done.

neutral

yea# i just want to get this done.

yea# i just want to get this done.

yea, i guess so, oh, my gosh, was she surprised.

surprise

yea# i guess so# oh my gosh was she surprised.

yea# i guess so# oh# my gosh# was she surprised.

oh, yes, they they, you know they love her, and so i mean.

happy

oh# yes# they they# you know# they love her and so i mean.

oh# yes# they they# you know they love her# and so i mean.

yes, i mean, she cared about all of us, she was great.

sad

yes# i mean she cared about all of us# she was great.

yes# i mean# she cared about all of us# she was great.

yeah, right a cult, i'm looking forward to being in the cult.

surprise

yeah# right a cult i'm looking forward to being in the cult.

yeah# right a cult# i'm looking forward to being in the cult.

no, i'm just making myself fascinating for you.

neutral

no i'm# just making myself fascinating for you.

no# i'm just making myself fascinating for you.
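
The "#" convention used in the entries above can be read programmatically. The following is a minimal Python sketch, not part of the EmoPP code release: the helper names `parse_breaks` and `differing_positions` are hypothetical, and the sketch assumes that words are separated by whitespace and that trailing commas and periods can be discarded. It converts a marked utterance into per-word break labels and lists the words at which the two systems disagree.

```python
from typing import List, Tuple


def parse_breaks(marked: str) -> Tuple[List[str], List[int]]:
    """Split a '#'-marked utterance into words and per-word break labels.

    labels[i] is 1 when words[i] is followed by a predicted break, else 0.
    (Hypothetical helper; whitespace tokenization is an assumption.)
    """
    words: List[str] = []
    labels: List[int] = []
    for token in marked.split():
        # The marker may be followed by punctuation, e.g. "god#,".
        labels.append(1 if "#" in token else 0)
        words.append(token.replace("#", "").strip(",."))
    return words, labels


def differing_positions(marked_a: str, marked_b: str) -> List[str]:
    """Return the words after which exactly one of the two systems predicts a break."""
    words_a, labels_a = parse_breaks(marked_a)
    words_b, labels_b = parse_breaks(marked_b)
    if words_a != words_b:
        raise ValueError("both inputs must contain the same word sequence")
    return [w for w, a, b in zip(words_a, labels_a, labels_b) if a != b]


# First entry above: BiLSTM output versus EmoPP output.
bilstm = "oh my god#, what# are you going to do."
emopp = "oh my god# what are you going to do."
print(differing_positions(bilstm, emopp))  # ['what']
```

For this entry both systems place a break after "god", and only the BiLSTM output adds a second break after "what".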