Fabio Cumbo, Giulia Fiscon, Stefano Ceri, Marco Masseroli, Emanuel Weitschek.
Abstract
BACKGROUND: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomic and clinical data that are increasingly available. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types.
Keywords: Cancer; Data extraction; Data integration; Knowledge extraction
Year: 2017 PMID: 28049410 PMCID: PMC5210259 DOI: 10.1186/s12859-016-1419-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Example of TCGA data belonging to the Kidney Renal Papillary Cell Carcinoma RNA-seq gene quantification experiment. Panel (a) reports the original TCGA data [14] and panel (b) its converted BED format version, which has been extended with genomic coordinates
Fig. 2 Interaction diagram of the TCGA2BED software architecture. It is composed of: (a) the controller, which executes the operations (e.g., download, conversion) specified either with an XML input configuration file or through the user interface; (b) the TCGA retrieval system, which searches and retrieves TCGA genomic data of multiple types (i.e., CNV, DNA-seq, DNA-methylation, miRNA-seq, and RNA-seq V1, V2) and their associated clinical and biospecimen metadata; (c) the BioParser, which converts the genomic data into the tab-delimited BED format, and all their corresponding clinical and biospecimen metadata into a tab-delimited attribute-value text format. Dashed blue and full green arrowed lines correspond to the paths of data download and conversion, respectively; from left to right, blue thick-line rectangles refer to software components, and green thin-line ones represent the BioParser extensions with the links to the four external databases for additional genomic data retrieval (i.e., UCSC, HGNC, NCBI Entrez Gene, and miRBase). The Roman (Arabic) numerals refer to the sequence of download (conversion) operations that a user can perform
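The conversion step described in the Fig. 2 caption can be illustrated with a minimal Python sketch. It is not the TCGA2BED implementation: the function names, the `GENE_COORDS` lookup, the gene symbol, the coordinates, and the column order are all hypothetical and chosen only to show the idea of writing one tab-delimited BED-like line per genomic record (extended with coordinates) and one attribute-value line per metadata field.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: gene symbol -> (chromosome, start, end, strand).
# In the architecture described above, coordinates come from external
# databases; the entry below is a placeholder, not real annotation.
GENE_COORDS: Dict[str, Tuple[str, int, int, str]] = {
    "GENE_A": ("chr1", 1000, 5000, "+"),
}


def to_bed_line(gene: str, values: Dict[str, float]) -> str:
    """Build one tab-delimited BED-like line for a gene record.

    The column order (coordinates, strand, symbol, then the measured
    values sorted by attribute name) is an assumption for illustration.
    """
    chrom, start, end, strand = GENE_COORDS[gene]
    fields = [chrom, str(start), str(end), strand, gene]
    fields += [str(values[name]) for name in sorted(values)]
    return "\t".join(fields)


def to_meta_lines(meta: Dict[str, str]) -> List[str]:
    """Build tab-delimited attribute-value lines, one per metadata field."""
    return [f"{key}\t{value}" for key, value in sorted(meta.items())]


if __name__ == "__main__":
    print(to_bed_line("GENE_A", {"raw_count": 12.0}))
    for line in to_meta_lines({"tumor_tag": "kirp", "gender": "female"}):
        print(line)
```

Keeping genomic records and metadata in two separate plain-text outputs mirrors the split described in the caption, so that region data and sample attributes can be read independently.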
Number (#) of considered data for each type of experiment, across all TCGA tumors
| Experiment type | # Aliquots | # Samples | # Patients | # Tumors |
|---|---|---|---|---|
| CNV | 22,632 | 22,409 | 11,162 | 33 |
| DNA-seq | 6,914 | 6,884 | 6,852 | 30 |
| DNA-methylation | 12,841 | 12,508 | 11,26 | 33 |
| miRNA-seq | 9,909 | 9,763 | 9,031 | 32 |
| RNA-seq V1 | 3,675 | 3,674 | 3,393 | 15 |
| RNA-seq V2 | 9,825 | 9,823 | 9,107 | 31 |
| All | 62,335 | 22,840 | 11,317 | 33 |
Fig. 3 Example of a GMQL query on DNA-seq data of TCGA patients that groups samples by tumor type and patient ethnicity, and counts the distinct DNA somatic mutations in each group
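The grouping and counting logic that the Fig. 3 caption describes can be sketched in plain Python. This is not GMQL and does not reproduce the query in the figure; the sample layout (dictionaries with hypothetical `tumor`, `ethnicity`, and `mutations` keys) and the definition of a distinct mutation as a unique `(chromosome, start, end)` tuple are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Mutation = Tuple[str, int, int]  # (chromosome, start, end); assumed identity


def count_distinct_mutations(samples: Iterable[dict]) -> Dict[Tuple[str, str], int]:
    """Group samples by (tumor, ethnicity) and count distinct mutations.

    A mutation seen in several samples of the same group is counted once,
    because each group accumulates its mutations in a set.
    """
    groups: Dict[Tuple[str, str], Set[Mutation]] = defaultdict(set)
    for sample in samples:
        key = (sample["tumor"], sample["ethnicity"])
        groups[key].update(sample["mutations"])
    return {key: len(mutations) for key, mutations in groups.items()}


if __name__ == "__main__":
    demo = [
        {"tumor": "X", "ethnicity": "e1", "mutations": [("chr1", 10, 11), ("chr2", 5, 6)]},
        {"tumor": "X", "ethnicity": "e1", "mutations": [("chr1", 10, 11)]},
        {"tumor": "Y", "ethnicity": "e2", "mutations": [("chr3", 1, 2)]},
    ]
    print(count_distinct_mutations(demo))
```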
Fig. 4 Example of a GMQL query on TCGA CNV and miRNA-seq data, which matches samples regarding the same biospecimen and extracts the DNA copy number variations occurring within expressed miRNA genes in the paired samples
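The pairing and region-matching idea in the Fig. 4 caption can be sketched as follows. This is a plain Python illustration, not the GMQL query itself. It assumes that samples are keyed by a hypothetical biospecimen identifier, that a miRNA gene counts as expressed when its hypothetical `reads` value is greater than zero, and that "occurring within" can be approximated by any coordinate overlap on the same chromosome (half-open intervals); strict containment would need a different predicate.

```python
from typing import Dict, List


def overlaps(a: dict, b: dict) -> bool:
    """True if two regions share a chromosome and overlap (half-open)."""
    return (
        a["chrom"] == b["chrom"]
        and a["start"] < b["end"]
        and b["start"] < a["end"]
    )


def cnvs_in_expressed_mirna(
    cnv_by_biospecimen: Dict[str, List[dict]],
    mirna_by_biospecimen: Dict[str, List[dict]],
) -> Dict[str, List[dict]]:
    """Pair CNV and miRNA-seq samples by biospecimen and keep the CNV
    regions that overlap at least one expressed miRNA gene."""
    result: Dict[str, List[dict]] = {}
    # Only biospecimens present in both data sets form a pair.
    shared = cnv_by_biospecimen.keys() & mirna_by_biospecimen.keys()
    for bio_id in sorted(shared):
        expressed = [g for g in mirna_by_biospecimen[bio_id] if g["reads"] > 0]
        hits = [
            cnv
            for cnv in cnv_by_biospecimen[bio_id]
            if any(overlaps(cnv, gene) for gene in expressed)
        ]
        if hits:
            result[bio_id] = hits
    return result
```

A quadratic scan per biospecimen is enough for a sketch; for data at the scale reported in the table above, an interval index or a sorted sweep would be the more realistic choice.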
Fig. 5 Example of a GMQL query on RNA-seq, DNA-methylation and DNA-seq data that finds the DNA somatic mutations within the first 2000 bp outside of the genes that are both expressed and methylated in at least one TCGA HNSC biospecimen, and extracts these somatic mutations for the three samples with the highest number of such mutations