DSVAE: Disentangled Representation Learning for Synthetic Speech Detection
Bestagini P.; Tubaro S.
2023-01-01
Abstract
Tools to generate high-quality synthetic speech that is perceptually indistinguishable from speech recorded from human speakers are easily available. Many incidents report misuse of synthetic speech for spreading misinformation and committing financial fraud. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods without providing reasoning for the decisions they make, which limits their explainability. In this paper, we use disentangled representation learning to develop a synthetic speech detector. We propose the Disentangled Spectrogram Variational Auto Encoder (DSVAE), a variational autoencoder trained in two stages that processes spectrograms of speech to generate features that disentangle synthetic and bona fide speech. We evaluated DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (> 98%) in detecting synthetic speech from 6 known and 10 unknown speech synthesizers. Further, the visualization of disentangled features obtained from DSVAE provides reasoning behind the working principle of DSVAE, improving its explainability. DSVAE performs well compared to several existing methods. Additionally, DSVAE works in practical scenarios such as detecting synthetic speech uploaded on social platforms and against simple attacks such as removing silence regions.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.