Cloud-based management and genome-wide processing of numerous and heterogeneous genomic feature datasets through GMQL (publicly available on CINECA cloud).

Kaitoua, Abdulrahman; Canakoglu, Arif; Pinoli, Pietro; Ceri, Stefano; Masseroli, Marco

Motivation: Next Generation Sequencing (NGS) and its data processing pipelines are providing, quickly and at low cost, an increasing amount of sequencing data and associated (epi)genomic features of numerous individual genomes in many biological and clinical conditions. These valuable data are mostly publicly available in well-curated repositories, and are thought to contain the information needed to answer fundamental biological and clinical questions, e.g. how protein-DNA interactions and DNA three-dimensional conformation affect gene activity, how driver mutations occur, how cancer develops, and to what extent complex diseases depend on personal genomic traits or environmental factors. Personalized and precision medicine based on genomic information is becoming a reality; yet, the efficient management and integrative processing of these data is becoming the biggest and most important “big data” problem of mankind. Processing multiple heterogeneous samples together can support data-driven biomedical discoveries, such as finding how diverse genomic, transcriptomic and epigenomic features contribute to characterizing biomolecular functions; however, it requires state-of-the-art “big data” computing strategies, with abstractions beyond the capabilities of commonly used tools. Recently, we launched a new approach to NGS data management and processing (http://www.bioinformatics.deib.polimi.it/genomic_computing/), based on a simple Genomic Data Model (GDM) and a high-level, declarative GenoMetric Query Language (GMQL) for genomics data (http://www.bioinformatics.deib.polimi.it/GMQL/). GDM uses a few general abstractions for genomic region data and associated experimental, biological and clinical metadata to ensure interoperability between existing data formats; GMQL works downstream of NGS raw data preprocessing pipelines and leverages GDM to support seamless processing of multiple heterogeneous datasets.
We demonstrated their usefulness, flexibility and ease of use through several biological query examples, which achieve computational efficiency and high scalability through parallel computing on clusters or public clouds.
Methods: We developed a software system for the easy execution of GMQL scripts on big genomic data in a cloud computing environment. It includes a repository organized in a Hadoop Distributed File System (HDFS), a processing engine and a GMQL layer, which consists of an orchestrator and a compiler. It is accessible through a RESTful web service Application Programming Interface (API) that uses the standard HTTP protocol and exchanges JSON and XML files; this allows GMQL to be used from within bioinformatics software and workflow engines, such as R/Bioconductor and Galaxy, and from multiple web interfaces. The system repository stores datasets of genomic data and their metadata, as well as system information (including the schema of each dataset, encoded in XML), which is used to guide the processing. All datasets are stored in their original format (with adapters to GDM schemas provided), since these files usually must remain concurrently available to users for other computations; only the datasets selected by a GMQL query are translated to the GDM format, on demand. In this way, we avoid replicating data in both the native and GDM formats, and minimize translations from native into GDM format. The system orchestrator controls the processing flow of GMQL scripts, including data selection from the repository, scheduling for efficient execution, and storage of the resulting datasets in the repository in a standard format.
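The on-demand translation described in Methods can be illustrated with a minimal Python sketch. It is not the GMQL implementation: the class and function names (Region, StoredSample, bed_adapter, select), the field layout of a region, the use of a BED-like tab-separated file as the native format, and the in-memory list standing in for the HDFS repository are all assumptions made for illustration. The sketch only shows the idea that samples stay in their native format and are translated through an adapter when, and only when, a query selects them by metadata.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical GDM-like region: coordinates plus schema-dependent values."""
    chrom: str
    start: int
    stop: int
    strand: str
    values: Tuple[str, ...] = ()


@dataclass
class StoredSample:
    """A sample kept in its native format until a query selects it."""
    name: str
    native_text: str                       # file content in the original format
    metadata: Dict[str, str]               # attribute-value pairs
    adapter: Callable[[str], List[Region]]  # native format -> regions
    _regions: Optional[List[Region]] = None
    translations: int = 0                  # how many times the adapter ran

    def regions(self) -> List[Region]:
        # Translate lazily, and at most once, on first access.
        if self._regions is None:
            self._regions = self.adapter(self.native_text)
            self.translations += 1
        return self._regions


def bed_adapter(text: str) -> List[Region]:
    """Assumed adapter for BED-like, tab-separated text."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.split("\t")
        strand = f[5] if len(f) > 5 else "*"
        # Columns 4 and 5 (name, score), when present, become region values.
        regions.append(Region(f[0], int(f[1]), int(f[2]), strand, tuple(f[3:5])))
    return regions


def select(repository: List[StoredSample],
           predicate: Callable[[Dict[str, str]], bool]) -> Dict[str, List[Region]]:
    """Pick samples by metadata; only the selected ones are translated."""
    selected = [s for s in repository if predicate(s.metadata)]
    return {s.name: s.regions() for s in selected}


# Toy repository with two samples; only the first matches the query below.
repository = [
    StoredSample("s1", "chr1\t100\t200\tpeak1\t50\t+\n", {"cell": "K562"}, bed_adapter),
    StoredSample("s2", "chr2\t10\t30\n", {"cell": "HeLa"}, bed_adapter),
]
result = select(repository, lambda md: md.get("cell") == "K562")
```

In this sketch the sample that the metadata predicate rejects is never parsed, and a selected sample is parsed once and then reused, which mirrors the stated goal of avoiding duplicate storage and minimizing translations.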
Results: We created an efficient implementation of our software system for the execution of GMQL processing on big genomic data in a cloud computing environment; a full evaluation of its performance shows that it scales linearly with the number of genomic samples/regions considered, and that performance varies linearly with the number of processing nodes in the cluster. A new, web-service-based system implementation of GMQL, in which GMQL is executed on the Apache Hadoop YARN framework, is freely available for download at http://www.bioinformatics.deib.polimi.it/GMQL/. We installed our GMQL system in the CINECA cloud and made it publicly accessible through both its web service API and two web interfaces that we specifically developed: an intuitive web application where biologists can use a set of predefined parametric GMQL queries on ENCODE and Roadmap Epigenomics data (http://www.bioinformatics.deib.polimi.it/GMQL/queries/), and a simple web interface (http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/) where bioinformaticians can browse the datasets of genomic features and biological/clinical metadata that we collected in our system repository from ENCODE, Roadmap Epigenomics and TCGA, build GMQL queries upon them, and efficiently run such queries on thousands of samples in several heterogeneous datasets in the CINECA cloud.