RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones.In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the Genometric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the arraybased implementation scales well upon Spark, also thanks to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly improves the performance of the row-based implementation.

Array-based data management for genomics

Horlova O.;Kaitoua A.;Ceri S.

2020-01-01

Abstract

With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones.In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the Genometric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the arraybased implementation scales well upon Spark, also thanks to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly improves the performance of the row-based implementation.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2020
			
	Titolo del libro
	
				2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020)
			
	Titolo della collana
	
				PROCEEDINGS - INTERNATIONAL CONFERENCE ON DATA ENGINEERING
			
	ISBN (International Standard Book Number)
	
				978-1-7281-2903-7
			
	Parole chiave
	
				Algbraic description of genomic operations
Parallel execution
			
	Parole chiave
	
				Genomic computing
			
	Appare nelle tipologie:
	
				04.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
290300a109.pdf accesso aperto : Publisher’s version Dimensione 759.35 kB Formato Adobe PDF Visualizza/Apri	759.35 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1146279

Citazioni

ND

4

4

social impact