Automated Integration of Genomic Metadata with Sequence-to-Sequence Models
Leone, Michele; Bernasconi, Anna; Canakoglu, Arif; Carman, Mark J.
2021-01-01
Abstract
While exponential growth in public genomic data can afford great insights into biological processes underlying diseases, a lack of structured metadata often impedes its timely discovery for analysis. In the Gene Expression Omnibus, for example, descriptions of genomic samples lack structure, with different terminology (such as “breast cancer”, “breast tumor”, and “malignant neoplasm of breast”) used to express the same concept. To remedy this, we learn models to extract salient information from this textual metadata. Rather than treating the problem as classification or named entity recognition, we model it as machine translation, leveraging state-of-the-art sequence-to-sequence (seq2seq) models to directly map unstructured input into a structured text format. The application of such models greatly simplifies training and allows for imputation of output fields that are implied but never explicitly mentioned in the input text. We experiment with two types of seq2seq models: an LSTM with attention and a transformer (in particular GPT-2), noting that the latter outperforms both the former and a multi-label classification approach based on a similar transformer architecture (RoBERTa). The GPT-2 model showed a surprising ability to predict attributes with a large set of possible values, often inferring the correct value for unmentioned attributes. The models were evaluated in both homogeneous and heterogeneous training/testing environments, indicating the efficacy of the transformer-based seq2seq approach for real data integration applications.
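To make the seq2seq framing concrete, below is a minimal sketch (not the authors' released code) of how GPT-2 can be fine-tuned with the Hugging Face transformers library to "translate" an unstructured GEO sample description into a structured attribute string. The separator token, attribute names, and example records are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of the seq2seq framing described in the abstract:
# fine-tune GPT-2 to map an unstructured sample description to a
# structured "key: value" string. Separator, field names, and the
# example record below are hypothetical.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

SEP = " => "  # hypothetical separator between source and target text

# Training example: unstructured description (source) concatenated with
# structured metadata (target). GPT-2 is fine-tuned with the standard
# causal language-modelling loss over the whole sequence.
description = "Human breast tumor sample, ER+, untreated"
structured = "disease: breast carcinoma | tissue: breast | treatment: none"
example = description + SEP + structured + tokenizer.eos_token

inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # causal LM loss
loss = outputs.loss  # backpropagated during fine-tuning

# Inference: prompt the fine-tuned model with a new description plus the
# separator and let it generate the structured fields, including
# attributes that are implied but never explicitly mentioned.
prompt = tokenizer("Mammary gland carcinoma cell line" + SEP,
                   return_tensors="pt")
generated = model.generate(
    prompt["input_ids"],
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

Because the target is generated token by token rather than selected from a fixed label set, this framing handles attributes with large value vocabularies and can impute values absent from the input, which is what distinguishes it from the RoBERTa-based multi-label classification baseline.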
| File | Description | Size | Format |
|---|---|---|---|
| sub_929.pdf | Pre-print (pre-refereeing), open access | 2.44 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.