RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

Class imbalance is a common issue in many domain applications of learning algorithms. Oftentimes, in the same domains it is much more relevant to correctly classify and profile minority class observations. This need can be addressed by feature selection (FS), that offers several further advantages, such as decreasing computational costs, aiding inference and interpretability. However, traditional FS techniques may become suboptimal in the presence of strongly imbalanced data. To achieve FS advantages in this setting, we propose a filtering FS algorithm ranking feature importance on the basis of the reconstruction error of a deep sparse autoencoders ensemble (DSAEE). We use each DSAE trained only on majority class to reconstruct both classes. From the analysis of the aggregated reconstruction error, we determine the features where the minority class presents a different distribution of values w.r.t. the overrepresented one, thus identifying the most relevant features to discriminate between the two. We empirically demonstrate the efficacy of our algorithm in several experiments, both simulated and on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features to profile and classify minority class, outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.

Feature selection for imbalanced data with deep sparse autoencoders ensemble

Massi, Michela Carlotta;Gasperoni, Francesca;Ieva, Francesca;Paganoni, Anna Maria

2021-01-01

Abstract

Class imbalance is a common issue in many domain applications of learning algorithms. Oftentimes, in the same domains it is much more relevant to correctly classify and profile minority class observations. This need can be addressed by feature selection (FS), that offers several further advantages, such as decreasing computational costs, aiding inference and interpretability. However, traditional FS techniques may become suboptimal in the presence of strongly imbalanced data. To achieve FS advantages in this setting, we propose a filtering FS algorithm ranking feature importance on the basis of the reconstruction error of a deep sparse autoencoders ensemble (DSAEE). We use each DSAE trained only on majority class to reconstruct both classes. From the analysis of the aggregated reconstruction error, we determine the features where the minority class presents a different distribution of values w.r.t. the overrepresented one, thus identifying the most relevant features to discriminate between the two. We empirically demonstrate the efficacy of our algorithm in several experiments, both simulated and on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features to profile and classify minority class, outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2021
			
	Titolo della rivista
	
				STATISTICAL ANALYSIS AND DATA MINING
			
	Appare nelle tipologie:
	
				01.1 Articolo in Rivista

File in questo prodotto:

File	Dimensione	Formato
sam.11567.pdf Accesso riservato Descrizione: Articolo principale : Publisher’s version Dimensione 3.45 MB Formato Adobe PDF Visualizza/Apri	3.45 MB	Adobe PDF	Visualizza/Apri
11311-1192007_Massi.pdf Open Access dal 13/12/2022 : Post-Print (DRAFT o Author’s Accepted Manuscript-AAM) Dimensione 5.71 MB Formato Adobe PDF Visualizza/Apri	5.71 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1192007

Citazioni

ND

6

4

ND

social impact