A distributed feature selection algorithm based on distance correlation with an application to microarrays

Brankovic, Aida; Hosseini, Marjan; Piroddi, Luigi
2019-01-01

Abstract

DNA microarray datasets are characterized by a large number of features with very few samples, which is a typical cause of overfitting and poor generalization in the classification task. Here we introduce a novel feature selection (FS) approach which employs the distance correlation (dCor) as a criterion for evaluating the dependence of the class on a given feature subset. The dCor index provides a reliable dependence measure among random vectors of arbitrary dimension, without any assumption on their distribution. Moreover, it is sensitive to the presence of redundant terms. The proposed FS method is based on a probabilistic representation of the feature subset model, which is progressively refined by a repeated process of model extraction and evaluation. A key element of the approach is a distributed optimization scheme based on a vertical partitioning of the dataset, which alleviates the negative effects of its unbalanced dimensions. The proposed method has been tested on several microarray datasets, resulting in quite compact and accurate models obtained at a reasonable computational cost.
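To make the dependence criterion concrete, the following is a minimal sketch of the standard sample distance correlation between a feature subset and the class labels. It is not the authors' implementation: the function names `distance_correlation` and `_centered_distances`, and the toy data in the usage part, are assumptions made for illustration only. The sketch covers only the evaluation of a single candidate subset, not the probabilistic subset model or the distributed scheme.

```python
import numpy as np


def _centered_distances(x):
    """Return the double-centered pairwise Euclidean distance matrix of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of one variable
    # Pairwise Euclidean distances between the rows (samples) of x.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Subtract row and column means, then add back the grand mean.
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between x and y (rows are samples).

    x and y may have different numbers of columns but must share the
    same number of rows. The result lies in [0, 1].
    """
    a = _centered_distances(x)
    b = _centered_distances(y)
    if a.shape != b.shape:
        raise ValueError("x and y must have the same number of samples")
    dcov2 = (a * b).mean()   # squared sample distance covariance
    dvar_x = (a * a).mean()  # squared sample distance variance of x
    dvar_y = (b * b).mean()  # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # convention when one of the inputs is constant
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2, 0.0) / denom))


if __name__ == "__main__":
    # Hypothetical toy data: 30 samples, 5 features, binary class labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=30) > 0).astype(float)
    # Score two candidate feature subsets against the class.
    print("subset {0}:   ", distance_correlation(X[:, [0]], y))
    print("subset {3, 4}:", distance_correlation(X[:, [3, 4]], y))
```

Because the index is defined for vectors of arbitrary dimension, a whole candidate subset can be scored in a single call rather than feature by feature. The pairwise distance matrices cost memory quadratic in the number of samples, which is typically modest for microarray data where samples are few.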
2019
Classification; Complexity theory; Computational modeling; Correlation; Distance correlation; DNA microarrays; Feature extraction; Feature selection; Model selection; Optimization; Randomized methods; Redundancy; Task analysis; Biotechnology; Genetics; Applied Mathematics
Files in this item:

File: uArrays_bare_jrnl_compsoc.pdf
Access: open access
Description: Article
Post-Print (DRAFT or Author's Accepted Manuscript, AAM)
Size: 561.23 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1063182
Citations
  • PMC 2
  • Scopus 23
  • ISI 17