TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas

Cumbo, F; Fiscon, G; Masseroli, Marco; Ceri, Stefano; Weitschek, E.

Background: Data extraction and integration methods are becoming essential for effectively accessing huge amounts of genomic and clinical data. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumoral data containing Next Generation Sequencing experiments of more than 30 cancer types. Results: We propose TCGA2BED, a software tool to download TCGA data and convert them into the structured BED format. Additionally, we extend TCGA data with annotations from several other genomic databases (i.e., NCBI Entrez, HGNC, UCSC). Finally, we provide and maintain an automatically updated data repository with all publicly available CNV, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1, V2) experimental data and metadata converted into the BED format. Conclusions: The use of our proposed BED format reduces the time spent managing TCGA data: it becomes possible to deal efficiently with huge amounts of cancer data, and to search, query, and extend them. The format also supports investigators by enabling several knowledge discovery analyses on all currently known tumor types, with the final aim of aiding cancer treatments.
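
To make the conversion step concrete, the following is a minimal sketch of how one genomic region might be written as a tab-separated BED-style line. It is not the TCGA2BED implementation: the function name `to_bed_line`, the column order, the field names, and the example values are all hypothetical, and the coordinate shift assumes the input start is 1-based and inclusive, which may not hold for every TCGA data type.

```python
from typing import Dict, List


def to_bed_line(record: Dict[str, str], extra_fields: List[str]) -> str:
    """Format one genomic region as a tab-separated BED-style line.

    Hypothetical layout: chrom, start, end, strand, then any extra
    attribute columns in the order given by `extra_fields`.
    """
    # Standard BED intervals are 0-based and half-open. This sketch ASSUMES
    # the input start is 1-based inclusive, so it is shifted down by one.
    start = int(record["start"]) - 1
    end = int(record["end"])
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval: {record['start']}-{record['end']}")

    # "*" marks an unknown strand in this sketch (an assumption).
    columns = [record["chrom"], str(start), str(end), record.get("strand", "*")]
    # Missing attributes become empty columns so every line has the same width.
    columns += [record.get(name, "") for name in extra_fields]
    return "\t".join(columns)


# Illustrative values only; they do not come from TCGA.
example = {
    "chrom": "chr1",
    "start": "1000",
    "end": "2000",
    "strand": "+",
    "gene_symbol": "GENE_A",
    "value": "0.5",
}
print(to_bed_line(example, ["gene_symbol", "value"]))
```

Keeping the region coordinates in the first columns and appending the attribute values after them means that generic interval tools can read the same file that carries the experiment-specific fields.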