ITAcotron 2: the Power of Transfer Learning in Expressive TTS Synthesis
Licia Sbattella;Roberto Tedesco;Vincenzo Scotti
2023-01-01
Abstract
A Text-to-Speech (TTS) synthesizer has to generate intelligible and natural speech while modeling the linguistic and paralinguistic components that characterize the human voice. In this work, we present ITAcotron 2, an Italian TTS synthesizer able to generate speech in several voices. In its development, we explored the power of transfer learning by iteratively fine-tuning an English Tacotron 2 spectrogram predictor on different Italian data sets. Moreover, we introduced a conditioning strategy to enable ITAcotron 2 to generate new speech in the voice of a variety of speakers. To do so, we examined the zero-shot behaviour of a speaker encoder architecture, previously trained on a speaker verification task with English speakers, when used to represent Italian speakers’ voiceprints. We asked 70 volunteers to evaluate intelligibility, naturalness, and similarity between synthesized voices and real speech from target speakers. Our model achieved a Mean Opinion Score (MOS) of 4.15 in intelligibility, 3.32 in naturalness, and 3.45 in speaker similarity. These results showed the successful adaptation of the refined system to the new language and its ability to synthesize novel speech in the voice of several speakers.

File | Size | Format |
---|---|---|
paper_fst+.pdf (open access; Pre-Print, or Pre-Refereeing) | 661.28 kB | Adobe PDF |