Audio samples from "Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS"

Abstract: Expressive text-to-speech has shown improved performance in recent years. However, style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still want to control speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that models the semantic relationship between the text description embedding and the speech style embedding with a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen styles.
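
The idea of relating a text description embedding to a speech style embedding can be illustrated with a toy example. The sketch below is not the model from the paper: the embeddings are made-up 3-dimensional vectors, the names `cosine_similarity`, `alignment_loss`, and `nearest_tag` are hypothetical, and the cosine-based loss is an assumption chosen for illustration (this page does not specify the actual training objective or the style loss). It only shows how a distance between the two embeddings could serve as a training signal and how a shared space could let a text tag stand in for reference audio.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def alignment_loss(tag_embedding, speech_style_embedding):
    """Hypothetical loss: 0 when the two embeddings point the same way,
    growing up to 2 when they point in opposite directions."""
    return 1.0 - cosine_similarity(tag_embedding, speech_style_embedding)


def nearest_tag(speech_style_embedding, tag_embeddings):
    """Return the tag whose embedding is closest to the speech style embedding."""
    return min(
        tag_embeddings,
        key=lambda tag: alignment_loss(tag_embeddings[tag], speech_style_embedding),
    )


# Made-up embeddings, for illustration only (assumption, not real model output).
tag_embeddings = {"calm": [1.0, 0.0, 0.0], "angry": [0.0, 1.0, 0.0]}
speech_embedding = [0.9, 0.1, 0.0]

print(nearest_tag(speech_embedding, tag_embeddings))
```

In a real system the two vectors would come from a text encoder built on a pretrained language model and from a speech reference encoder; here they are hard-coded so the example stays self-contained.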


1. Dataset

Source speaker dataset sample (a neutral, book-reading style)

Style Tag : # 차분하게(calm), # 진지하게(seriously), # 설명하듯(in a descriptive tone)

Target speaker dataset sample (a conversational style with 4 emotion categories)

Style Tag : # 궁금한듯(curious), # 평온하게(peacefully), # 나긋하게(mildly)

Style Tag : # 밝게(brightly), # 상냥하게(kindly)

Style Tag : # 화를 내며(in anger), # 비꼬듯(sarcastically), # 도발하듯(provocatively)

Style Tag : # 포기하듯(as if giving up), # 체념하듯(as if resigning), # 슬픈 목소리로(in a sad voice), # 불행하게(unhappily)

2. Seen and unseen style transfer using style tag:

Sample 1
Proposed(Seen) : # 차분하게 (# Calm), # 밝게 (# Bright), # 미안한 (# Feeling sorry), # 다급한 (# Urgent)
Proposed(Unseen) : # 차분하게 (# Calm), # 밝게 (# Bright), # 미안한 (# Feeling sorry), # 다급한 (# Urgent)

Sample 2
Proposed(Seen) : # 울먹이며 (# Weeping), # 힘없이 (# Helplessly), # 비난하듯 (# Accusing), # 다정하게 (# Kindly)
Proposed(Unseen) : # 울먹이며 (# Weeping), # 힘없이 (# Helplessly), # 비난하듯 (# Accusing), # 다정하게 (# Kindly)

Sample 3
Proposed(Seen) : # 황당한 (# Absurd), # 진지하게 (# Seriously), # 속삭이듯 (# Whispering), # 화가난듯 (# Mad)
Proposed(Unseen) : # 황당한 (# Absurd), # 진지하게 (# Seriously), # 속삭이듯 (# Whispering), # 화가난듯 (# Mad)

Sample 4
Proposed(Seen) : # 소리치며 (# Shouting), # 짜증난듯 (# Annoyed), # 해맑게 (# Cheerful), # 씁쓸하게 (# Bitterish)
Proposed(Unseen) : # 소리치며 (# Shouting), # 짜증난듯 (# Annoyed), # 해맑게 (# Cheerful), # 씁쓸하게 (# Bitterish)


3. Seen and unseen style transfer using reference audio:

Columns : Reference, Baseline(Tacotron2-GST), Proposed(Seen), Proposed(Unseen)
Samples : 1 to 7

4. Case studies:

4.1. Detailed control of emotion using multiple style tags:

# unsatisfied + # urgent = unsatisfied & urgent

4.2. Strength control of emotion using quantifiers:

# little angry < # more angry

4.3. Generalization to unseen style tags:

style tag in sentence form

# 울적한 마음을 감추지 못하고 눈물을 보였다

(# She couldn't hide her sad heart and showed tears)

style tag in noun form

# 높은 톤의 밝은 목소리

(# High-pitched, bright voice)