RE.PUBLIC@POLIMI Research publications of Politecnico di Milano

Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data

Marco Masseroli; Arif Canakoglu; Pietro Pinoli; Abdulrahman Kaitoua; Andrea Gulino; Olha Horlova; Luca Nanni; Anna Bernasconi; Stefano Perna; Eirini Stamoulakatou; Stefano Ceri

2019-01-01

Abstract

Motivation: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.

Results: The new system has a well-designed modular architecture featuring: i) an intermediate representation supporting many different implementations (including Spark, Flink, and SciDB); ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database, or others); iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.

Availability: The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/

Year of publication	2019
Journal title	Bioinformatics
Appears in types	01.1 Journal Article

Files in this record:

File	Size	Format
PDF30661153-579728432.pdf (open access; description: Main article, Post-Print (draft or Author's Accepted Manuscript, AAM))	309.02 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1058980
