RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

Next Generation Sequencing (NGS) is a family of technologies for reading the DNA or RNA, capable of producing whole genome sequences at an impressive speed, and causing a revolution of both biological research and medical practice. In this exciting scenario, while a huge number of specialized bio-informatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information, providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific operations of GMQL used for the efficient processing of operations on genomic regions, and we describe their efficient implementation; the paper develops a theory of binning strategies as a generic approach to parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud.

Framework for Supporting Genomic Operations

Kaitoua, Abdulrahman;Pinoli, Pietro;Bertoni, Michele;Ceri, Stefano

2017-01-01

Abstract

Next Generation Sequencing (NGS) is a family of technologies for reading the DNA or RNA, capable of producing whole genome sequences at an impressive speed, and causing a revolution of both biological research and medical practice. In this exciting scenario, while a huge number of specialized bio-informatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information, providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific operations of GMQL used for the efficient processing of operations on genomic regions, and we describe their efficient implementation; the paper develops a theory of binning strategies as a generic approach to parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2017
			
	Titolo della rivista
	
				IEEE TRANSACTIONS ON COMPUTERS
			
	Parole chiave
	
				cloud-based systems; data modeling; Genomic data management; operations for genomics; query languages; Software; Theoretical Computer Science; Hardware and Architecture; Computational Theory and Mathematics
			
	Appare nelle tipologie:
	
				01.1 Articolo in Rivista

File in questo prodotto:

File	Dimensione	Formato
Framework.pdf accesso aperto : Post-Print (DRAFT o Author’s Accepted Manuscript-AAM) Dimensione 2.83 MB Formato Adobe PDF Visualizza/Apri	2.83 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1045267

Citazioni

ND

23

17

social impact