RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

META-BASE: a Novel Architecture for Large-Scale Genomic Metadata Integration

Bernasconi, Anna; Canakoglu, Arif; Masseroli, Marco; Ceri, Stefano

2022-01-01

Abstract

The integration of genomic metadata is at once an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, and learning their specific metadata definitions takes long and tedious effort and is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization, and enrichment. We propose a general, open, and extensible pipeline that can easily incorporate any number of new data sources, and we present the resulting repository, which already integrates several important sources and is exposed through practical user interfaces that respond to biological researchers' needs.

Year of publication
2022

Journal title
IEEE/ACM Transactions on Computational Biology and Bioinformatics

Keywords
Bioinformatics
Genomic Datasets
Metadata Management
Open Data
Rule-Based Languages
Data Integration

Appears in publication types:
01.1 Journal Article

Files in this item:

File	Access	Description	Size	Format
09104916.pdf	Restricted access	Postprint: Publisher's version	6.26 MB	Adobe PDF
META_BASE__a_Novel_Architecture_for_Large_Scale_Genomic_Metadata_Integration__TCBB.pdf	Open access	Post-print (draft or Author's Accepted Manuscript, AAM)	3.06 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1146003
