Data management for next generation genomic computing

CERI, STEFANO;KAITOUA, ABDULRAHMAN;MASSEROLI, MARCO;PINOLI, PIETRO;VENCO, FRANCESCO
2016

Abstract

Next-generation sequencing (NGS) has dramatically reduced the cost and time of reading DNA. Huge investments are targeted at sequencing the DNA of large populations, and repositories of well-curated sequence data are being collected. Answers to fundamental biomedical problems are hidden in these data, e.g., how cancer arises, how driving mutations occur, and how much cancer depends on the environment. So far, the bioinformatics research community has been mostly challenged by primary analysis (production of sequences in the form of short DNA segments, or "reads") and secondary analysis (alignment of reads to a reference genome and search for specific features on the reads); yet the most important emerging problem is the so-called tertiary analysis, concerned with multi-sample processing of heterogeneous information. Tertiary analysis is responsible for sense-making, e.g., discovering how heterogeneous regions interact with each other. This new scenario creates an opportunity to rethink genomic computing through the lens of fundamental data management. We propose an essential data model, using a few general abstractions that guarantee interoperability between existing data formats, and a new-generation query language inspired by classic relational algebra and extended with orthogonal, domain-specific abstractions for genomics. Together, they open the door to the seamless integration of descriptive statistics and high-level data analysis (e.g., DNA region clustering and extraction of regulatory networks). In this vision, computational efficiency is achieved by using parallel computing on both clusters and public clouds; the technology is applicable to federated repositories, and can be exploited to provide integrated access to curated data, made available by large consortia, through user-friendly search services. Our most far-reaching vision is to move towards an Internet of Genomes that exploits data indexing and crawling.
Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016
978-3-89318-070-7
Genomic data management
INF; bioinformatics
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11311/1013791