Tools to generate high quality synthetic speech that is perceptually indistinguishable from speech recorded from hu-man speakers are easily available. Many incidents report misuse of synthetic speech for spreading misinformation and committing financial fraud. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods without providing reasoning for the decisions they make. This limits the explainability of these approaches. In this paper, we use disentangled representation learning for developing a synthetic speech detector. We propose Disentangled Spectrogram Variational Auto Encoder (DSVAE) which is a two stage trained variational autoencoder that processes spectrograms of speech to generate features that disentangle synthetic and bona fide speech. We evaluated DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (> 98%) on detecting synthetic speech from 6 known and 10 unknown speech synthesizers. Further, the visualization of disentangled features obtained from DSVAE provides rea-soning behind the working principle of DSVAE, improving its explainability. DSVAE performs well compared to several existing methods. Additionally, DSVAE works in practical scenarios such as detecting synthetic speech uploaded on social platforms and against simple attacks such as removing silence regions.

DSVAE: Disentangled Representation Learning for Synthetic Speech Detection

Bestagini P.;Tubaro S.;
2023-01-01

Abstract

Tools to generate high quality synthetic speech that is perceptually indistinguishable from speech recorded from hu-man speakers are easily available. Many incidents report misuse of synthetic speech for spreading misinformation and committing financial fraud. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods without providing reasoning for the decisions they make. This limits the explainability of these approaches. In this paper, we use disentangled representation learning for developing a synthetic speech detector. We propose Disentangled Spectrogram Variational Auto Encoder (DSVAE) which is a two stage trained variational autoencoder that processes spectrograms of speech to generate features that disentangle synthetic and bona fide speech. We evaluated DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (> 98%) on detecting synthetic speech from 6 known and 10 unknown speech synthesizers. Further, the visualization of disentangled features obtained from DSVAE provides rea-soning behind the working principle of DSVAE, improving its explainability. DSVAE performs well compared to several existing methods. Additionally, DSVAE works in practical scenarios such as detecting synthetic speech uploaded on social platforms and against simple attacks such as removing silence regions.
2023
Proceedings - 22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023
autoencoder
disentangled representation learning
explainable AI
synthetic speech detection
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1265887
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact