Automated Integration of Genomic Metadata with Sequence-to-Sequence Models
Leone, Michele; Bernasconi, Anna; Canakoglu, Arif; Carman, Mark J.
2021-01-01
Abstract
While exponential growth in public genomic data can afford great insights into biological processes underlying diseases, a lack of structured metadata often impedes its timely discovery for analysis. In the Gene Expression Omnibus, for example, descriptions of genomic samples lack structure, with different terminology (such as “breast cancer”, “breast tumor”, and “malignant neoplasm of breast”) used to express the same concept. To remedy this, we learn models to extract salient information from this textual metadata. Rather than treating the problem as classification or named entity recognition, we model it as machine translation, leveraging state-of-the-art sequence-to-sequence (seq2seq) models to directly map unstructured input into a structured text format. The application of such models greatly simplifies training and allows for imputation of output fields that are implied but never explicitly mentioned in the input text. We experiment with two types of seq2seq models: an LSTM with attention and a transformer (in particular GPT-2), noting that the latter outperforms both the former and a multi-label classification approach based on a similar transformer architecture (RoBERTa). The GPT-2 model showed a surprising ability to predict attributes with a large set of possible values, often inferring the correct value for unmentioned attributes. The models were evaluated in both homogeneous and heterogeneous training/testing environments, indicating the efficacy of the transformer-based seq2seq approach for real data integration applications.
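To make the seq2seq framing concrete, below is a minimal sketch (not the authors' released code) of how GPT-2 can be fine-tuned with the Hugging Face transformers library to "translate" an unstructured GEO sample description into a structured attribute string. The separator token, attribute names, and example records are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of the seq2seq framing described in the abstract:
# fine-tune GPT-2 to map an unstructured sample description to a
# structured "key: value" string. Separator, field names, and the
# example record below are hypothetical.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

SEP = " => "  # hypothetical separator between source and target text

# Training example: unstructured description (source) concatenated with
# structured metadata (target). GPT-2 is fine-tuned with the standard
# causal language-modelling loss over the whole sequence.
description = "Human breast tumor sample, ER+, untreated"
structured = "disease: breast carcinoma | tissue: breast | treatment: none"
example = description + SEP + structured + tokenizer.eos_token

inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # causal LM loss
loss = outputs.loss  # backpropagated during fine-tuning

# Inference: prompt the fine-tuned model with a new description plus the
# separator and let it generate the structured fields, including
# attributes that are implied but never explicitly mentioned.
prompt = tokenizer("Mammary gland carcinoma cell line" + SEP,
                   return_tensors="pt")
generated = model.generate(
    prompt["input_ids"],
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

Because the target is generated token by token rather than selected from a fixed label set, this framing handles attributes with large value vocabularies and can impute values absent from the input, which is what distinguishes it from the RoBERTa-based multi-label classification baseline.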
| File | Description | Size | Format |
|---|---|---|---|
| sub_929.pdf | Pre-print (pre-refereeing), open access | 2.44 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.