We are developing a new, holistic data management system for genomics, which provides high-level abstractions for querying large genomic datasets. We designed our system so that it leverages on data management engines for low-level data access. Such design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). Trade-offs are not obvious; scientific databases are expected to outperform generic platforms when they use features which are embedded within their specialized design, but generic platforms are expected to outperform scientific databases on general purpose operations. In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. We use four typical genomic operations as benchmark, stemming from the concrete requirements of our project, and encoded using SciDB and Spark; we discuss their common aspects and differences, specifically discussing how genomic regions and operations can be expressed using SciDB arrays. We comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.
Evaluating genomic big data operations on SciDB and spark
Simone Cattani;Stefano Ceri;Abdulrahman Kaitoua;Pietro Pinoli
2017-01-01
Abstract
We are developing a new, holistic data management system for genomics, which provides high-level abstractions for querying large genomic datasets. We designed our system so that it leverages on data management engines for low-level data access. Such design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). Trade-offs are not obvious; scientific databases are expected to outperform generic platforms when they use features which are embedded within their specialized design, but generic platforms are expected to outperform scientific databases on general purpose operations. In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. We use four typical genomic operations as benchmark, stemming from the concrete requirements of our project, and encoded using SciDB and Spark; we discuss their common aspects and differences, specifically discussing how genomic regions and operations can be expressed using SciDB arrays. We comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.File | Dimensione | Formato | |
---|---|---|---|
icwe.pdf
Accesso riservato
:
Publisher’s version
Dimensione
485.75 kB
Formato
Adobe PDF
|
485.75 kB | Adobe PDF | Visualizza/Apri |
paper_111_icwe.pdf
accesso aperto
:
Post-Print (DRAFT o Author’s Accepted Manuscript-AAM)
Dimensione
557.55 kB
Formato
Adobe PDF
|
557.55 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.