RE.PUBLIC@POLIMI Research Publications of the Politecnico di Milano

FGSSAT: Unsupervised Fine-Grain Attribution of Unknown Speech Synthesizers Using Transformer Networks

Bhagtani K.;Yadav A. K. S.;Xiang Z.;Bestagini P.;Delp E. J.

2023-01-01

Abstract

Synthetic speech generators can produce high-quality speech that humans find difficult to distinguish perceptually from authentic human speech. Identifying the synthesizer used to generate a synthetic speech signal, known as synthetic speech attribution, is an important problem. An open problem in synthetic speech attribution is attributing speech to new, unknown synthesizers that are not present in the training set. Existing methods can identify known speech synthesizers, but they cannot differentiate one unknown synthesizer from another. In this paper, we describe a system for attribution of unknown synthesizers, i.e., assigning different labels to different unknown synthesizers. Our system is known as Fine-Grain Synthetic Speech Attribution Transformer (FGSSAT). FGSSAT is unsupervised and uses a transformer, dimensionality reduction, and clustering for attribution. Our experiments use the ASVspoof2019 dataset. We train on real speech and 6 synthesizers and evaluate on real speech and 17 synthesizers, 11 of which are unknown. FGSSAT identifies known synthesizers with 99.6% accuracy and classifies all speech generated from unknown synthesizers with 76.5% accuracy, which is an improvement on existing work.
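The abstract names three stages (transformer, dimensionality reduction, clustering) without specifying the algorithms used for the last two. The sketch below only illustrates the general shape of such a pipeline and is not the authors' implementation. It rests on several assumptions: the "embeddings" are synthetic 4-D points standing in for transformer outputs, PCA (via power iteration) stands in for the dimensionality reduction step, k-means stands in for the clustering step, and the number of clusters `k` is taken as known. The names `pca`, `kmeans`, and `dist2` are hypothetical helpers.

```python
import random


def pca(points, n_components, iters=100):
    """Project points onto their top principal components (power iteration)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Covariance matrix (d x d) of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    components = []
    for _comp in range(n_components):
        v = [float(j + 1) for j in range(d)]  # fixed start vector (deterministic)
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        # Deflate the covariance so the next pass finds the next component.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
        components.append(v)
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; farthest-point initialisation keeps it deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _step in range(iters):
        labels = [min(range(k), key=lambda i: dist2(p, centers[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# ASSUMPTION: synthetic stand-ins for transformer embeddings, 20 points for
# each of three hypothetical "unknown synthesizers".
rng = random.Random(0)
true_centers = [(0, 0, 0, 0), (10, 0, 0, 0), (0, 10, 0, 0)]
embeddings = [[c + rng.gauss(0, 0.3) for c in center]
              for center in true_centers for _ in range(20)]

reduced = pca(embeddings, n_components=2)   # dimensionality reduction stand-in
labels = kmeans(reduced, k=3)               # clustering stand-in
```

Each cluster label plays the role of a distinct label for one unknown synthesizer. In a real setting the number of unknown synthesizers would not be known in advance, so fixing `k` here is a simplification for illustration only.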


Year of publication: 2023

Book title: Conference Record - Asilomar Conference on Signals, Systems and Computers

Series title: CONFERENCE RECORD - ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS, & COMPUTERS

Keywords: deepfake speech; speech forensics; synthetic speech attribution; transformer; unsupervised clustering

Appears in types: 04.1 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1265885
