Multi-dimensional genomic data management for region-preserving operations
Horlova O.; Kaitoua A.; Ceri S.
2019-01-01
Abstract
In previous work, we presented GenoMetric Query Language (GMQL), an algebraic language for querying genomic datasets, supported by Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e., intervals of the genome delimited by a start and a stop position), each with an associated value representing the signal for that region; the most typical signals are gene expressions, peaks of expression, and variants relative to a reference genome. GMQL can process queries over billions of regions, organized within distinct datasets. In this paper, we focus on the efficient execution of region-preserving GMQL operations, in which the regions of the result are a subset of the regions of one of the operands; most GMQL operations are region-preserving. Chains of region-preserving operations can be executed efficiently by taking advantage of an array-based data organization, in which region management is separated from value management. We discuss this optimization in the context of the current GDMS system, which has a row-based (relational) organization and therefore requires dynamic data transformations. A similar approach applies to other application domains with interval-based data organization.

File | Size | Format | |
---|---|---|---|
Publisher's version (open access) | 815.91 kB | Adobe PDF | |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.