RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

Reconstructing speech from CNN embeddings

Comanducci L.; Bestagini P.; Tagliasacchi M.; Sarti A.; Tubaro S.

2021-01-01

Abstract

A complete understanding of the decision-making process of Convolutional Neural Networks (CNNs) is still far from being reached. Many researchers have proposed techniques to interpret what a network actually 'learns' from data. Nevertheless, many questions remain unanswered. In this work we study one aspect of this problem by reconstructing speech from the intermediate embeddings computed by a CNN. Specifically, we consider a pre-trained network that acts as a feature extractor for speech audio. We investigate the possibility of inverting these features to reconstruct the input signals in a black-box scenario, and we quantify the reconstruction quality through the word error rate (WER) of an off-the-shelf automatic speech recognition (ASR) model. Experiments performed using two different CNN architectures trained for six different classification tasks show that it is possible to reconstruct time-domain speech signals that preserve the semantic content, whenever the embeddings are extracted before the fully connected layers.
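As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of how a word error rate can be computed from two transcripts. It uses the standard definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate`, the lowercase and whitespace normalization, and the example sentences are assumptions made for illustration; the record does not state which ASR model, reference transcripts, or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript.

    Hypothetical helper: normalization here is only lowercasing and
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts: the first stands in for the reference text,
    # the second for an ASR transcript of a reconstructed signal.
    reference_text = "the cat sat on the mat"
    asr_output = "the bat sat on mat"
    print(f"WER = {word_error_rate(reference_text, asr_output):.3f}")
```

A lower value means the transcript of the reconstructed signal is closer to the reference; because insertions are counted, the value can exceed 1.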

Year of publication: 2021
Journal: IEEE Signal Processing Letters
Keywords: Audio processing; explainable deep learning; speech recognition
Appears in types: 01.1 Journal Article

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1183176
