RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

GMQL is a high-level query language for genomics, which operates on datasets described through GDM, a unifying data model for processed data formats. They are ingredients for the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate soon a shift of attention towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available. In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today’s big data are raw reads of the sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions. Coherently, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, by going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.

Experiences in the development of a data management system for genomics

Ceri, Stefano;Canakoglu, Arif;Kaitoua, Abdulrahman;Masseroli, Marco;Pinoli, Pietro

2018-01-01

Abstract

GMQL is a high-level query language for genomics, which operates on datasets described through GDM, a unifying data model for processed data formats. They are ingredients for the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate soon a shift of attention towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available. In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today’s big data are raw reads of the sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions. Coherently, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, by going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2018
			
	Titolo del libro
	
				Data Management Technologies and Applications. DATA 2017.
			
	ISBN (International Standard Book Number)
	
				9783319948089
			
	Parole chiave
	
				Cloud computing; Data translation and optimization; Genomic computing; Next generation sequencing; Open data; Computer Science (all); Mathematics (all)
			
	Appare nelle tipologie:
	
				02.1 Contributo in Volume

File in questo prodotto:

File	Dimensione	Formato
dataconference.pdf accesso aperto : Pre-Print (o Pre-Refereeing) Dimensione 1.64 MB Formato Adobe PDF Visualizza/Apri	1.64 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1068379

Citazioni

ND

0

ND

social impact