
FGSSAT: Unsupervised Fine-Grain Attribution of Unknown Speech Synthesizers Using Transformer Networks

Bestagini P.;
2023-01-01

Abstract

Synthetic speech generators can produce high-quality speech that humans find difficult to distinguish perceptually from authentic human speech. Identifying the synthesizer used to generate synthetic speech, known as synthetic speech attribution, is an important problem. An open problem in synthetic speech attribution is attributing speech to new, unknown synthesizers that are not present in the training set. Existing methods can identify known speech synthesizers, but they cannot differentiate one unknown synthesizer from another. In this paper, we describe a system for attribution of unknown synthesizers, i.e., assigning different labels to different unknown synthesizers. Our system is known as Fine-Grain Synthetic Speech Attribution Transformer (FGSSAT). FGSSAT is unsupervised and uses a transformer, dimensionality reduction, and clustering for attribution. Our experiments use the ASVspoof2019 dataset. We train on real speech and 6 synthesizers and evaluate on real speech and 17 synthesizers, 11 of which are unknown. FGSSAT identifies known synthesizers with 99.6% accuracy and classifies all speech generated from unknown synthesizers with 76.5% accuracy, which is an improvement on existing work.
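The abstract names three stages (transformer, dimensionality reduction, clustering) without saying which algorithms fill the last two. As a rough illustration only, the sketch below assumes PCA for the reduction step and plain k-means for the clustering step, and uses random vectors as a stand-in for transformer embeddings of speech from three hypothetical unknown synthesizers. None of these choices, names, or parameter values come from the paper; they are assumptions made to show how unsupervised clustering can assign different labels to different unseen sources.

```python
import numpy as np


def pca_reduce(x, n_components):
    """Project the rows of x onto their top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(x, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialisation.

    Returns one cluster index per row of x.
    """
    centers = [x[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        nearest = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(nearest)])
    centers = np.array(centers)

    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# ASSUMPTION: random vectors stand in for transformer embeddings; each of the
# three hypothetical unknown synthesizers gets its own well-separated mean.
rng = np.random.default_rng(0)
n_sources, per_source, dim = 3, 40, 64
means = rng.normal(scale=10.0, size=(n_sources, dim))
embeddings = np.vstack([m + rng.normal(size=(per_source, dim)) for m in means])
true_source = np.repeat(np.arange(n_sources), per_source)

# ASSUMPTION: PCA to 2 dimensions, then k-means with k equal to the number of
# sources. In practice the number of unknown synthesizers is not known ahead of time.
reduced = pca_reduce(embeddings, n_components=2)
cluster_ids = kmeans(reduced, k=n_sources)

for s in range(n_sources):
    print(f"source {s} -> clusters {sorted(set(cluster_ids[true_source == s].tolist()))}")
```

The cluster indices are arbitrary: the sketch only checks that samples from the same source share a cluster, which is the sense in which an unsupervised system can separate unknown synthesizers without ever naming them.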
2023
Conference Record - Asilomar Conference on Signals, Systems and Computers
deepfake speech
speech forensics
Synthetic speech attribution
transformer
unsupervised clustering

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1265885