Evaluating Deep Semi-supervised Learning for Whole-Transcriptome Breast Cancer Subtyping

Cascianelli, Silvia; Cristovao, Francisco; Canakoglu, Arif; Carman, Mark; Nanni, Luca; Pinoli, Pietro; Masseroli, Marco
2020-01-01

Abstract

We investigate the important clinical problem of predicting prognosis-related breast cancer molecular subtypes using the whole-transcriptome information present in The Cancer Genome Atlas Project (TCGA) dataset. From a Machine Learning perspective, the data is both high-dimensional, with over nineteen thousand features, and extremely small, with only about one thousand labeled instances in total. To deal with this dearth of information, we compare classical, deep and semi-supervised learning approaches on the subtyping task. Specifically, we compare an L₁-regularized Logistic Regression, a Feed Forward Neural Network with two hidden layers, and a Variational Autoencoder based semi-supervised learner that makes use of pan-cancer TCGA data as well as normal breast tissue data from a second source. We find that the classical supervised technique performs at least as well as the deep and semi-supervised learning approaches, although learning curve analysis suggests that too little unlabeled data may have been available for the chosen semi-supervised learning technique to be effective.
2020
COMPUTATIONAL INTELLIGENCE METHODS FOR BIOINFORMATICS AND BIOSTATISTICS
978-3-030-63060-7
978-3-030-63061-4

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1159916