Bi-Dimensional Binning for Big Genomic Datasets
Simone Cattani; Stefano Ceri; Abdulrahman Kaitoua; Pietro Pinoli
2017-01-01
Abstract
Binning the genome is used to parallelize big data operations on genomic regions. In this extended abstract, we compare the performance and scalability of Spark and SciDB implementations on datasets consisting of billions of genomic regions. In particular, we introduce an original method for binning the genome, i.e., partitioning it into small sections, and show that it outperforms the conventional binning used by SciDB and closes the gap between SciDB and a Spark-based implementation. The concept of bi-dimensional binning is new and can be extended to other systems and technologies.

File | Access | Version | Size | Format
---|---|---|---|---
beyondmr.pdf | Restricted access | Publisher’s version | 1.8 MB | Adobe PDF
beyondmr.pdf | Open access | Post-print (draft or Author’s Accepted Manuscript, AAM) | 1.62 MB | Adobe PDF
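
The abstract describes binning as partitioning the genome into small sections so that region operations can run in parallel. As a rough illustration only, the sketch below shows conventional one-dimensional, fixed-width binning; it is not the bi-dimensional method of the paper, which the abstract does not detail. The names `bins_for_region` and `bin_regions`, the half-open coordinate convention, and the bin size used in the example are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical region type: (chromosome, start, stop), half-open [start, stop).
Region = Tuple[str, int, int]


def bins_for_region(start: int, stop: int, bin_size: int) -> range:
    """Return the indices of every fixed-width bin overlapped by [start, stop)."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    if stop <= start:
        raise ValueError("region must have positive length")
    first = start // bin_size
    last = (stop - 1) // bin_size  # stop is exclusive
    return range(first, last + 1)


def bin_regions(
    regions: Iterable[Region], bin_size: int
) -> Dict[Tuple[str, int], List[Region]]:
    """Group regions by (chromosome, bin index).

    A region spanning several bins is replicated into each of them, so that
    every bin can be processed independently (for example, by one worker).
    """
    binned: Dict[Tuple[str, int], List[Region]] = defaultdict(list)
    for chrom, start, stop in regions:
        for b in bins_for_region(start, stop, bin_size):
            binned[(chrom, b)].append((chrom, start, stop))
    return dict(binned)


if __name__ == "__main__":
    example = [("chr1", 0, 100), ("chr1", 950, 1050), ("chr2", 2000, 2001)]
    for key, members in sorted(bin_regions(example, bin_size=1000).items()):
        print(key, members)
```

In this sketch, a region that crosses a bin boundary is copied into each bin it overlaps, which is the usual cost of one-dimensional binning: smaller bins give more parallelism but more replication.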
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.