
REAL-TIME MULTICHANNEL SPEECH SEPARATION AND ENHANCEMENT USING A BEAMSPACE-DOMAIN-BASED LIGHTWEIGHT CNN

Marco Olivieri, Luca Comanducci, Mirco Pezzoli, Michele Buccoli, Simone Pecorino, Fabio Antonacci, Augusto Sarti
2023-01-01

Abstract

The problems of speech separation and enhancement concern the extraction of the speech emitted by a target speaker in scenarios where multiple interfering speakers or noise, respectively, are present. Many practical applications, such as home assistants and teleconferencing, require speech separation and enhancement pre-processing before applying Automatic Speech Recognition (ASR) systems. In recent years, most techniques have applied deep learning to either time-frequency or time-domain representations of the input audio signals. In this paper we propose a real-time multichannel speech separation and enhancement technique based on the combination of a directional representation of the sound field, denoted as beamspace, with a lightweight Convolutional Neural Network (CNN). We consider the case where the Direction-Of-Arrival (DOA) of the target speaker is approximately known, a scenario in which the power of the beamspace representation can be fully exploited, while we make no assumption regarding the identity of the talker. We present experiments where the model is trained on simulated data and tested on real recordings, and we compare the proposed method with a similar state-of-the-art technique.
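The beamspace representation mentioned in the abstract can be understood as projecting the multichannel input onto a grid of fixed look directions before the CNN. The paper does not disclose its exact implementation; the sketch below is an illustrative, hypothetical version using far-field delay-and-sum beamformers on a uniform linear array (function name, array geometry, and parameters are all assumptions, not the authors' code).

```python
import numpy as np

def beamspace_transform(stft_frame, mic_positions, doas_deg, freq_hz, c=343.0):
    """Project one multichannel STFT frame onto a grid of fixed look directions.

    stft_frame:    complex array, shape (num_mics, num_freqs)
    mic_positions: (num_mics,) x-coordinates of a linear array, in meters
    doas_deg:      candidate directions of arrival, in degrees
    freq_hz:       (num_freqs,) center frequency of each STFT bin
    Returns a (num_doas, num_freqs) beamspace representation.
    """
    doas = np.deg2rad(np.asarray(doas_deg, dtype=float))
    # Far-field propagation delay for each (direction, mic) pair
    delays = np.outer(np.cos(doas), mic_positions) / c               # (D, M)
    # Delay-and-sum steering weights, one per (direction, mic, frequency)
    steer = np.exp(1j * 2 * np.pi * delays[:, :, None] * freq_hz[None, None, :])
    # Sum across microphones: each row is one fixed beamformer output
    return np.einsum('dmf,mf->df', steer, stft_frame) / len(mic_positions)
```

Under this interpretation, a plane wave arriving from one of the grid directions produces its largest magnitude in the corresponding beamspace channel, which is what makes the representation informative when the target DOA is approximately known.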
2023
2023 48th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Multichannel Speech Separation, Speech Enhancement, Neural Beamformer
Files in this item:

File: ICASSP_2023_separation_bdsound-2.pdf
Access: open access
Description: pre-print (Pre-Print or Pre-Refereeing)
Size: 1.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11311/1233357
Citations:
  • Scopus: 3