Next generation genomic computing

Ceri, Stefano; Kaitoua, Abdulrahman; Pinoli, Pietro; Canakoglu, Arif; Masseroli, Marco
2016

Abstract

Next-generation sequencing (NGS) technologies and data processing pipelines are rapidly and inexpensively producing ever-larger volumes of sequencing data and associated (epi)genomic features for many individual genomes across multiple biological and clinical conditions; these data are generally made publicly available in well-curated repositories. Answers to fundamental biomedical problems are hidden in these data, yet their efficient management and integrative processing are becoming the biggest and most important “big data” problem of mankind. Multi-sample processing of heterogeneous information can support data-driven discoveries and biomolecular sense-making, such as discovering how heterogeneous genomic, transcriptomic and epigenomic features cooperate to characterize biomolecular functions; however, it requires state-of-the-art “big data” computing strategies, with abstractions beyond the capabilities of commonly used tools. We recently proposed a new paradigm for NGS data management and processing by introducing an essential Genomic Data Model (GDM), which uses a few general abstractions for genomic region data and the associated experimental, biological and clinical metadata; these abstractions guarantee interoperability between existing data formats. Building on GDM, we developed a next-generation, high-level, declarative GenoMetric Query Language (GMQL) for genomic data; here, we demonstrate its usefulness, flexibility and simplicity of use through several biological query examples. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous samples; computational efficiency and high scalability are achieved through parallel computing on clusters or public clouds. GDM and GMQL are applicable to federated repositories and can be exploited to provide integrated access, through user-friendly search services, to curated data made available by large consortia such as ENCODE, Epigenomics Roadmap, or TCGA.
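
As a rough illustration of the idea described in the abstract (region data paired with metadata, queried across many samples), the following minimal Python sketch may help. It is not GDM or GMQL itself: the class names, field names, half-open coordinate convention, helper functions and example values are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One genomic region (hypothetical layout, not the actual GDM schema)."""
    chrom: str
    start: int          # assumed 0-based, inclusive
    stop: int           # assumed exclusive
    strand: str = "*"   # "*" means unstranded
    values: tuple = ()  # format-specific attributes, e.g. a peak score


@dataclass
class Sample:
    """A sample pairs its regions with free-form attribute/value metadata."""
    sample_id: str
    regions: list
    metadata: dict


def select_by_metadata(dataset, **conditions):
    """Keep the samples whose metadata match every given attribute/value pair."""
    return [
        s for s in dataset
        if all(s.metadata.get(key) == value for key, value in conditions.items())
    ]


def count_overlaps(reference, sample):
    """For each reference region, count the sample regions that overlap it."""
    counts = []
    for ref in reference:
        n = sum(
            1 for r in sample.regions
            if r.chrom == ref.chrom and r.start < ref.stop and ref.start < r.stop
        )
        counts.append(n)
    return counts


# Invented example data: two samples with different assays.
dataset = [
    Sample(
        "s1",
        [Region("chr1", 100, 200), Region("chr1", 150, 300), Region("chr2", 10, 20)],
        {"assay": "ChIP-seq", "cell": "K562"},
    ),
    Sample("s2", [Region("chr1", 500, 600)], {"assay": "RNA-seq", "cell": "K562"}),
]
reference = [Region("chr1", 180, 220), Region("chr2", 0, 5)]

selected = select_by_metadata(dataset, assay="ChIP-seq")
overlap_counts = count_overlaps(reference, selected[0])
print([s.sample_id for s in selected], overlap_counts)
```

The sketch filters samples on metadata and then computes a region-based quantity on the survivors, which is the kind of combined metadata and region processing the abstract refers to; a declarative language and a parallel execution engine would replace these in-memory loops.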
ISMB 2016: International Conference on Intelligent Systems for Molecular Biology
Genomic data modeling and management, Operations for genomics, Region-based query language, Cloud-based genomic computing system
INF; bioinformatics

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11311/1013822