RE.PUBLIC@POLIMI Research publications of the Politecnico di Milano

Hello? Who Am I Talking to? A Shallow CNN Approach for Human vs. Bot Speech Classification

Lieto A.; Moro D.; Devoti F.; Parera C.; Lipari V.; Bestagini P.; Tubaro S.

2019-01-01

Abstract

Automatic speech generation algorithms, enhanced by deep learning techniques, enable increasingly seamless and immediate machine-to-human interaction. As a result, the latest generation of phone-calling bots sounds more convincingly human than previous generations. This technology has a strong social impact in terms of privacy issues (e.g., in customer-care services), fraudulent actions (e.g., social hacking) and erosion of trust (e.g., generation of fake conversations). For these reasons, it is crucial to identify the nature of a speaker as either a human or a bot. In this paper, we propose a speech classification algorithm based on Convolutional Neural Networks (CNNs), which automatically classifies human vs. non-human speakers from the analysis of short audio excerpts. We evaluate the effectiveness of the proposed solution on a database of real human speech, populated with audio recordings from various sources, and on speech generated automatically by state-of-the-art text-to-speech generators based on deep learning (e.g., Google WaveNet).

Publication year: 2019
Book title: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Series title: PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING
ISBN (International Standard Book Number): 978-1-4799-8131-1
Keywords: Audio forensics; convolutional neural network; speaker detection
Appears in types: 04.1 Contribution in conference proceedings

Files in this record:

File	Size	Format
post.pdf (Restricted access; Publisher’s version)	355.99 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1126641
