ITAcotron 2: the Power of Transfer Learning in Expressive TTS Synthesis

Licia Sbattella; Roberto Tedesco; Vincenzo Scotti
2023-01-01

Abstract

A Text-to-Speech (TTS) synthesizer has to generate intelligible and natural speech while modeling the linguistic and paralinguistic components that characterize the human voice. In this work, we present ITAcotron 2, an Italian TTS synthesizer able to generate speech in several voices. In its development, we explored the power of transfer learning by iteratively fine-tuning an English Tacotron 2 spectrogram predictor on different Italian data sets. Moreover, we introduced a conditioning strategy that enables ITAcotron 2 to generate new speech in the voice of a variety of speakers. To do so, we examined the zero-shot behavior of a speaker encoder architecture, previously trained on a speaker verification task with English speakers, when representing Italian speakers’ voiceprints. We asked 70 volunteers to evaluate intelligibility, naturalness, and similarity between synthesized voices and real speech from target speakers. Our model achieved a Mean Opinion Score (MOS) of 4.15 in intelligibility, 3.32 in naturalness, and 3.45 in speaker similarity. These results show the successful adaptation of the refined system to the new language and its ability to synthesize novel speech in the voice of several speakers.
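
The scores reported in the abstract are Mean Opinion Scores, that is, averages of listener ratings. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes a 1 to 5 rating scale (the usual MOS convention, which the abstract does not state explicitly) and adds a normal-approximation 95% confidence interval as an illustrative extra. The function name `mean_opinion_score` and the sample ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of listener ratings.

    Assumes each rating lies on a 1-5 scale (an assumption, see lead-in).
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = mean(ratings)
    # Normal-approximation interval; width is zero with fewer than two ratings.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical ratings for illustration only, not data from the paper.
example_ratings = [5, 4, 4, 3, 5, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

In a study like the one described, one such score would be computed separately for each criterion (intelligibility, naturalness, speaker similarity).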
Year: 2023
Published in: Analysis and Application of Natural Language and Speech Processing
ISBN: 9783031110344
Keywords: Natural Language Processing; Speech Synthesis; Speaker Conditioning; Speaker Embeddings; Italian; Transfer Learning; Tacotron 2; Encoder-Decoder; Intelligibility; Naturalness; Speaker Similarity
Files in this item:
File: paper_fst+.pdf
Access: open access
Version: Pre-Print (or Pre-Refereeing)
Size: 661.28 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1223325