Audio Samples from "Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech"

ArXiv: arXiv:2011.01174

Authors

Abstract

Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation and efficiently without increasing the inference time or model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.
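To make the perceptual loss described in the abstract concrete, here is a minimal sketch in plain Python. It assumes an absolute-difference (L1) distance, a maximum score of 5.0 on the usual 1-to-5 MOS scale, and a simple weighted sum with the original TTS loss. None of these choices are stated on this page, and the names `perceptual_loss`, `total_loss`, `max_mos`, and `weight` are hypothetical.

```python
def perceptual_loss(predicted_mos, max_mos=5.0):
    """Mean gap between the maximum MOS and each predicted MOS.

    Assumptions (not from the source): L1 distance, max_mos of 5.0.
    """
    if not predicted_mos:
        raise ValueError("predicted_mos must contain at least one score")
    return sum(abs(max_mos - score) for score in predicted_mos) / len(predicted_mos)


def total_loss(tts_loss, predicted_mos, weight=1.0, max_mos=5.0):
    """Combine a TTS training loss with the perceptual term.

    The additive form and the weight are illustrative assumptions.
    """
    return tts_loss + weight * perceptual_loss(predicted_mos, max_mos)


if __name__ == "__main__":
    # Scores a frozen, pre-trained MOS predictor might assign to a batch
    # of synthesized utterances (made-up values for illustration).
    batch_scores = [5.0, 4.0, 3.0]
    print(perceptual_loss(batch_scores))
    print(total_loss(0.5, batch_scores, weight=2.0))
```

In an actual training setup the predicted scores would come from the pre-trained MOS prediction model applied to synthesized speech, with that model's weights kept fixed so that only the TTS model is updated.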

Datasets

We used two subsets of our Korean speech dataset: a 6-hour Small dataset and an 18-hour Large dataset. The datasets were built as part of work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10080667, Development of conversational speech synthesis technology to express emotion and personality of robots through sound source diversification). More details about the datasets are available at https://github.com/emotiontts/emotiontts_open_db.

Information about Korean language

Korean is a syllabic language, and a Korean syllable consists of an initial consonant, a vowel, and an optional final consonant. Here, we call each Korean consonant or vowel a "Korean character." There are 67 unique Korean characters in total: 19 initial consonants, 21 vowels, and 27 final consonants. However, three final consonants are rarely used and do not appear in the datasets we used. Therefore, we used 64 Korean characters, excluding punctuation.
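The syllable structure described above can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. This is a sketch only: it assumes the input uses precomposed syllables (U+AC00 to U+D7A3) and represents each character as a Unicode conjoining jamo. The page does not say how the authors actually split text into characters, and `decompose_syllable` is a hypothetical helper name.

```python
# Constants from the Unicode Hangul syllable composition scheme.
HANGUL_BASE = 0xAC00
INITIALS = [chr(0x1100 + i) for i in range(19)]           # 19 initial consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]             # 21 vowels
FINALS = [None] + [chr(0x11A8 + i) for i in range(27)]    # "no final" + 27 final consonants


def decompose_syllable(syllable):
    """Split one precomposed Hangul syllable into (initial, vowel, final).

    `final` is None when the syllable has no final consonant.
    """
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < len(INITIALS) * len(VOWELS) * len(FINALS):
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(index, len(VOWELS) * len(FINALS))
    vowel, final = divmod(rest, len(FINALS))
    return INITIALS[initial], VOWELS[vowel], FINALS[final]


def decompose_text(text):
    """Decompose every Hangul syllable in `text`; pass other characters through."""
    characters = []
    for ch in text:
        if HANGUL_BASE <= ord(ch) <= 0xD7A3:
            characters.extend(c for c in decompose_syllable(ch) if c is not None)
        else:
            characters.append(ch)
    return characters


if __name__ == "__main__":
    print(decompose_syllable("한"))
    print(decompose_text("오늘"))
```

The three lists account for the 19 + 21 + 27 = 67 characters mentioned above; the extra `None` entry in `FINALS` stands for a syllable without a final consonant.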

Notations

Audio samples

Sentence 1: 그럼, 본격적으로 미팅을 시작할까요? (Well then, shall we get the meeting started in earnest?)

GT GT (Mel) Transformer-L P-Transformer-L
FastSpeech-L P-FastSpeech-L Transformer-S P-Transformer-S

Sentence 2: 다른 운전자들은 피해가 더 심각해 보이더라고요. (The other drivers seemed to have suffered more serious damage.)

GT GT (Mel) Transformer-L P-Transformer-L
FastSpeech-L P-FastSpeech-L Transformer-S P-Transformer-S

Sentence 3: 어머니, 어디 아프신 곳은 없으세요? (Mother, are you hurting anywhere?)

GT GT (Mel) Transformer-L P-Transformer-L
FastSpeech-L P-FastSpeech-L Transformer-S P-Transformer-S

Sentence 4: 내가 여기서 친구를 만나기로 했는데 휴대폰을 두고 왔어요. (I was supposed to meet a friend here, but I left my phone behind.)

GT GT (Mel) Transformer-L P-Transformer-L
FastSpeech-L P-FastSpeech-L Transformer-S P-Transformer-S

Sentence 5: 오늘 정말 즐거웠습니다. (I really enjoyed today.)

GT GT (Mel) Transformer-L P-Transformer-L
FastSpeech-L P-FastSpeech-L Transformer-S P-Transformer-S