Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data
Marco Masseroli; Arif Canakoglu; Pietro Pinoli; Abdulrahman Kaitoua; Andrea Gulino; Olha Horlova; Luca Nanni; Anna Bernasconi; Stefano Perna; Eirini Stamoulakatou; Stefano Ceri
2019-01-01
Abstract
Motivation: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.

Results: The new system has a well-designed modular architecture featuring: i) an intermediate representation supporting many different implementations (including Spark, Flink, and SciDB); ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database, or others); iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.

Availability: The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/
File: PDF30661153-579728432.pdf (main article; post-print, draft or Author's Accepted Manuscript (AAM); open access; 309.02 kB; Adobe PDF)