Pawan Upadhyay1, Nilesh Gardi1, Sanket Desai1, Bikram Sahoo1, Ankita Singh1, Trupti Togar1, Prajish Iyer1, Ratnam Prasad1, Pratik Chandrani1, Sudeep Gupta2, Amit Dutt3.
Abstract
Cancer is predominantly a somatic disease. A mutant allele present in a cancer cell genome is considered somatic when it is absent from both the paired normal genome and public SNP databases. However, the current build of dbSNP, the most comprehensive public SNP database, inadequately represents several non-European Caucasian populations, which limits cancer genomic analyses of data from these populations. We present the Tata Memorial Centre-SNP DataBase (TMC-SNPdb) as the first open source, flexible, upgradable, and freely available SNP database (accessible through dbSNP build 149 and ANNOVAR), representing 114 309 unique germline variants generated from whole exome data of 62 normal samples derived from cancer patients of Indian origin. The TMC-SNPdb is presented with a companion subtraction tool that can be run from the command line or through an easy-to-use graphical user interface, and that can deplete additional Indian population-specific SNPs over and above the dbSNP and 1000 Genomes databases. Using an institutionally generated whole exome data set of 132 samples of Indian origin, we demonstrate that TMC-SNPdb could deplete 42, 33 and 28% of false positive somatic events post dbSNP depletion in Indian origin tongue, gallbladder, and cervical cancer samples, respectively. Beyond cancer somatic analyses, we anticipate utility of the TMC-SNPdb in several Mendelian germline diseases. In addition to dbSNP build 149 and ANNOVAR, the TMC-SNPdb along with the subtraction tool is available for download in the public domain at the following Database URL: http://www.actrec.gov.in/pi-webpages/AmitDutt/TMCSNP/TMCSNPdp.html
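The depletion step described in the abstract can be illustrated with a minimal Python sketch. This is not the TMC-SNPdb subtraction tool itself, whose interface is not described here; it only shows the general idea of removing candidate variants that also appear in one or more germline databases. It assumes VCF-style tab-separated input, matches variants on chromosome, position, reference allele and alternate allele, and uses invented toy records. The names `Variant`, `load_variants` and `subtract` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical matching key: one record per alternate allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_line(line: str) -> List[Variant]:
    """Split one VCF data line into one Variant per ALT allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
    # Assumption: "chr1" and "1" name the same contig, so drop the prefix.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return [Variant(chrom, pos, ref, alt) for alt in alts.split(",")]


def load_variants(lines: Iterable[str]) -> Set[Variant]:
    """Collect variants from VCF lines, skipping headers and blank lines."""
    variants: Set[Variant] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        variants.update(parse_vcf_line(line))
    return variants


def subtract(candidates: Set[Variant], *databases: Set[Variant]) -> Set[Variant]:
    """Return the candidates that are absent from every supplied database."""
    known: Set[Variant] = set().union(*databases)
    return candidates - known


# Toy records, invented for illustration only (not real variants).
tumor_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT,G",
    "chr2\t300\t.\tG\tA",
]
dbsnp_lines = ["1\t100\trs_toy\tA\tG"]
population_db_lines = ["1\t200\t.\tC\tT"]

candidates = load_variants(tumor_lines)
remaining = subtract(
    candidates,
    load_variants(dbsnp_lines),
    load_variants(population_db_lines),
)
for variant in sorted(remaining):
    print(variant)
```

Matching on the exact allele, rather than on position alone, keeps a candidate whose alternate allele differs from the database entry at the same site. Whether the published tool matches this way is not stated in the abstract.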
Year: 2016 PMID: 27402678 PMCID: PMC4940432 DOI: 10.1093/database/baw104
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Development of TMC-SNPdb using whole exome sequencing. Schematic flow of the steps followed during development of the TMC-SNP database. Whole exome sequencing of 62 normal tissue samples, obtained from three different tissues of cancer patients, was performed and analysed using GATK (Genome Analysis Toolkit) to generate VCF files. The raw variants were further filtered using the criteria shown to obtain a list of variants absent in dbSNP v142 and COSMICdb v68. The remaining variants constitute the ‘TMC-SNPdb’, shown at the end of the funnel.
Figure 2. Overview of characteristic features of the TMC-SNP database. (A) Circle plot of coding and non-coding variants obtained in the dataset. (B) Percent minor allele frequency distribution of variants in ‘TMC-SNPdb’ across 62 normal samples. Percentage frequencies are presented on the top of each bar. (C) Genome-wide distribution of the percent frequency of variants obtained in each chromosome, as compared with the dbSNP database.
Application of TMC-SNPdb across cancer types to filter germline variants in Indian population
| S.No. | Cancer type | Total variants | Number of samples | Novel variants post dbSNP depletion, number (% of row total) | Novel variants post TMC-SNPdb depletion, number (% of row total) | Overall reduction by TMC-SNPdb post dbSNP depletion |
|---|---|---|---|---|---|---|
| 1 | Tongue cancer | 613 055 | 24 | 84 001 (13.7%) | 48 182 (7.8%) | 42.6% |
| 2 | Cervical cancer | 923 547 | 34 | 99 032 (10.7%) | 71 594 (7.7%) | 27.7% |
| 3 | Gall-bladder | 328 245 | 17 | 26 530 (8%) | 17 682 (5.3%) | 33.3% |
The total number of variants observed for each cancer type, and the reduction in the number and percentage of variants post dbSNP and post TMC-SNPdb subtraction, are tabulated for three cancer types. The number of samples analysed for each cancer type is also given.
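The last column of the table appears to be the fraction of post-dbSNP novel variants that TMC-SNPdb removes. The short Python sketch below recomputes it from the tabulated counts under that assumption; the formula is inferred from the numbers and not stated in the record, and the function name `percent_reduction` is hypothetical.

```python
def percent_reduction(post_dbsnp: int, post_tmc: int) -> float:
    """Percentage of post-dbSNP novel variants removed by the second filter."""
    return 100.0 * (post_dbsnp - post_tmc) / post_dbsnp


# (novel variants post dbSNP, novel variants post TMC-SNPdb), from the table.
counts = {
    "Tongue cancer": (84001, 48182),
    "Cervical cancer": (99032, 71594),
    "Gall-bladder": (26530, 17682),
}

reductions = {
    cancer: percent_reduction(post_dbsnp, post_tmc)
    for cancer, (post_dbsnp, post_tmc) in counts.items()
}
for cancer, value in reductions.items():
    print(f"{cancer}: {value:.2f}% reduction")
```

The recomputed values agree with the tabulated 42.6%, 27.7% and 33.3% to within 0.1 percentage point; the small differences are consistent with rounding or truncation in the table.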