Audio Samples from "Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech"

ArXiv: arXiv:2011.01174

Authors

Yeunju Choi (wkadldppdy@kaist.ac.kr)
Youngmoon Jung (dudans@kaist.ac.kr)
Youngjoo Suh (youngjoo.suh@konantech.com)
Hoirin Kim (hoirkim@kaist.ac.kr)

Abstract

Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied independently regardless of the TTS model architecture or the cause of speech quality degradation and efficiently without increasing the inference time or model complexity. The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.
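As a rough illustration of the perceptual loss described in the abstract, the sketch below computes the distance between the maximum possible MOS and a batch of predicted MOS values. It is a minimal sketch under stated assumptions, not the authors' implementation: the name `perceptual_loss`, the 5-point maximum `MAX_MOS`, and the use of the mean absolute difference as the distance are all hypothetical choices made for this example.

```python
from typing import Sequence

# Assumption: MOS is rated on the common 1-to-5 scale, so 5.0 is the maximum.
MAX_MOS = 5.0


def perceptual_loss(predicted_mos: Sequence[float], max_mos: float = MAX_MOS) -> float:
    """Mean absolute gap between the best possible MOS and the predicted MOS.

    Assumption: the distance is an absolute difference averaged over the batch.
    """
    if len(predicted_mos) == 0:
        raise ValueError("predicted_mos must contain at least one score")
    return sum(abs(max_mos - score) for score in predicted_mos) / len(predicted_mos)


# Two synthesized utterances with predicted MOS 3.0 and 4.0.
loss = perceptual_loss([3.0, 4.0])
```

In actual training, the predicted scores would presumably be differentiable outputs of the pre-trained MOS prediction model, so that minimizing this quantity updates the TTS model; the plain-float version here only shows the arithmetic.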

Datasets

We used two subsets of our Korean speech dataset: a 6-hour-long Small dataset and an 18-hour-long Large dataset. The datasets were built as part of work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10080667, Development of conversational speech synthesis technology to express emotion and personality of robots through sound source diversification). More details about the datasets are available at https://github.com/emotiontts/emotiontts_open_db.

Information about the Korean language

Korean is a syllabic language, and a Korean syllable consists of an initial consonant, a vowel, and an optional final consonant. Here, we call each Korean consonant or vowel a "Korean character." There are a total of 67 unique Korean characters: 19 initial consonants, 21 vowels, and 27 final consonants. However, three final consonants are rarely used and do not appear in the datasets we used. Therefore, we used 64 Korean characters, excluding punctuation.
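The syllable structure described above can be made concrete with a short sketch that splits a syllable into its initial consonant, vowel, and optional final consonant. It assumes the input is a precomposed Hangul syllable in the Unicode block starting at U+AC00, which arranges syllables by exactly the 19 x 21 x (27 + 1) combinations counted above. The function name `decompose` is hypothetical, and this is not necessarily how the text was preprocessed for the experiments.

```python
from typing import Optional, Tuple

HANGUL_BASE = 0xAC00   # first precomposed Hangul syllable in Unicode
NUM_INITIALS = 19      # initial consonants
NUM_VOWELS = 21        # vowels
NUM_FINALS = 27        # final consonants (the "no final" case is extra)
NUM_SYLLABLES = NUM_INITIALS * NUM_VOWELS * (NUM_FINALS + 1)


def decompose(syllable: str) -> Tuple[str, str, Optional[str]]:
    """Split one precomposed Hangul syllable into (initial, vowel, final)."""
    if len(syllable) != 1:
        raise ValueError("expected a single character")
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < NUM_SYLLABLES:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(index, NUM_VOWELS * (NUM_FINALS + 1))
    vowel, final = divmod(rest, NUM_FINALS + 1)
    return (
        chr(0x1100 + initial),                  # Hangul Jamo initial consonant
        chr(0x1161 + vowel),                    # Hangul Jamo vowel
        chr(0x11A7 + final) if final else None  # final index 0 means no final
    )


# Total number of distinct characters, matching the count given above.
total_characters = NUM_INITIALS + NUM_VOWELS + NUM_FINALS
```

For example, `decompose("한")` yields the initial, vowel, and final of that syllable, while `decompose("가")` returns `None` as its third element because the syllable has no final consonant.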

Notations

GT: recorded file
GT (Mel): system where Parallel WaveGAN converts the ground truth Mel-spectrogram into a waveform
Transformer-L: Transformer trained on the Large dataset
P-Transformer-L: perceptually guided Transformer on the Large dataset
FastSpeech-L: FastSpeech trained on the Large dataset
P-FastSpeech-L: perceptually guided FastSpeech on the Large dataset
Transformer-S: Transformer trained on the Small dataset
P-Transformer-S: perceptually guided Transformer on the Small dataset