Mengmeng Wu, Ting Chen, Rui Jiang.
Abstract
BACKGROUND: Whole exome sequencing (WES) has recently emerged as an effective approach for identifying genetic variants underlying human diseases. However, considerable time and labour are needed to investigate candidate variants carefully. Although filtration based on population frequencies and functional prediction scores can effectively remove common and neutral variants, hundreds or even thousands of rare deleterious variants still remain. In addition, current WES platforms also provide variant information in flanking noncoding regions, such as promoters, introns and splice sites. Despite being recognized to harbour causal variants, these regions are usually ignored by current analysis pipelines.
Year: 2016 PMID: 28155632 PMCID: PMC5260102 DOI: 10.1186/s12859-016-1325-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic overview of Glints. Glints requires candidate SNVs as input (e.g. a VCF file) and a query disease of interest. The process consists of four parts: 1) annotate each SNV into one of four regions: Exon, Promoter, Intron or Splice site; 2) select and extract functional scores for each candidate SNV according to its region; 3) infer the association between genes hosting candidate SNVs and the query disease via multivariate regression; 4) integrate variant-level and gene-level information via Fisher’s method and produce a statistical significance (q-value) for each SNV
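The integration step in part 4 can be illustrated with a minimal sketch of Fisher's method for combining independent p-values. This is an illustration under stated assumptions, not the Glints implementation: the function name `fisher_combine` and the example p-values are hypothetical, the two inputs are assumed to be a variant-level and a gene-level p-value, and the conversion of combined p-values into q-values is not shown because the record does not describe it.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the survival function has a closed form,
    so no external statistics library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # P(chi2_{2k} > X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical variant-level and gene-level p-values for one SNV.
combined = fisher_combine([0.05, 0.05])
```

For two inputs the formula reduces to `p1 * p2 * (1 - ln(p1 * p2))`, so two p-values of 0.05 combine to a value smaller than either of them.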
Summary statistics for data used in simulated experiment across different regions
| | | Exon | Promoter | Intron | Splice site |
|---|---|---|---|---|---|
| Causal | Variant | 8350 | 114 | 303 | 1105 |
| Causal | Gene | 1063 | 34 | 132 | 280 |
| Control (average) | Variant | 9512 | 18,181 | 2532 | 78 |
| Control (average) | Gene | 5336 | 8486 | 2102 | 77 |
For the control, the numbers of neutral variants across different regions are the average numbers of corresponding neutral variants in 1092 individuals from the 1000 Genomes Project Phase I
The prioritization performance of Glints and individual scores on 1000 Genomes Project based simulated data
| Method | Exon TOP | Exon MRR | Exon AUC | Promoter TOP | Promoter MRR | Promoter AUC | Intron TOP | Intron MRR | Intron AUC | Splice site TOP | Splice site MRR | Splice site AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CADD | 171 | 12.86% | 87.13% | 0 | 14.66% | 85.33% | 7 | 20.29% | 79.71% | 776 | 13.33% | 87.14% |
| DANN | 108 | 10.97% | 89.03% | 0 | 18.99% | 81.02% | 35 | 18.60% | 81.41% | 844 | 9.21% | 91.29% |
| FATHMM-MKL | 127 | 11.80% | 88.19% | 0 | 15.20% | 84.80% | 74 | 11.25% | 88.70% | 850 | 8.67% | 91.84% |
| Eigen | 95 | 5.47% | 94.50% | 29 | 21.97% | 78.05% | 4 | 14.95% | 85.06% | 218 | 19.89% | 80.52% |
| LRT | 0 | 13.17% | 86.95% | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| MSRV | 1872 | 7.53% | 92.38% | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| MutationAssessor | 583 | 9.81% | 90.16% | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| PolyPhen2 | 0 | 8.25% | 91.77% | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SinBaD | 150 | 7.62% | 92.36% | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SIFT | 0 | 13.64% | 86.29% | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| GERP | 58 | 16.49% | 83.51% | 2 | 24.29% | 75.70% | 50 | 17.01% | 82.96% | 736 | 12.43% | 88.06% |
| Siphy | 60 | 36.37% | 63.63% | 8 | 23.70% | 76.30% | 5 | 47.26% | 52.75% | 391 | 35.16% | 65.06% |
| Phylop | 119 | 12.96% | 87.02% | 19 | 26.24% | 73.73% | 79 | 15.87% | 84.06% | 863 | 9.03% | 91.46% |
| PhastCons | 0 | 14.12% | 85.78% | 0 | 28.52% | 71.48% | 0 | 15.88% | 84.06% | 585 | 16.22% | 84.21% |
| gexp | 1330 | 17.36% | 82.61% | 55 | 21.18% | 78.77% | 101 | 21.25% | 78.74% | 673 | 20.59% | 79.80% |
| gobp | 3006 | 10.44% | 89.40% | 56 | 11.47% | 88.33% | 173 | 10.09% | 89.78% | 937 | 11.03% | 89.39% |
| kegg | 2321 | 20.21% | 79.85% | 10 | 19.53% | 80.44% | 143 | 20.30% | 79.77% | 737 | 22.16% | 78.34% |
| mrna | 1462 | 24.25% | 75.73% | 31 | 30.62% | 69.52% | 96 | 32.99% | 67.05% | 437 | 29.47% | 70.86% |
| pfam | 2297 | 17.69% | 82.30% | 53 | 20.93% | 79.13% | 137 | 20.49% | 79.54% | 717 | 20.02% | 80.42% |
| pseq | 1194 | 22.21% | 77.87% | 7 | 25.21% | 74.91% | 55 | 23.15% | 76.96% | 697 | 23.78% | 76.70% |
| sign | 1447 | 28.10% | 72.07% | 54 | 25.19% | 74.91% | 111 | 32.49% | 67.75% | 507 | 31.01% | 69.49% |
| strg | 3086 | 10.96% | 88.92% | 57 | 6.31% | 93.46% | 140 | 11.51% | 88.39% | 922 | 11.28% | 89.17% |
| tsfc | 1248 | 30.07% | 69.83% | 37 | 35.64% | 64.28% | 103 | 33.45% | 66.47% | 490 | 33.58% | 66.59% |
| Glints<sup>a</sup> | 4646 | 2.12% | 97.61% | 82 | 4.51% | 95.26% | 209 | 4.12% | 95.68% | 1012 | 5.20% | 95.29% |
| Glints | 4736 | 2.12% | 97.62% | 82 | 3.63% | 96.20% | 219 | 3.65% | 96.13% | 1047 | 4.06% | 96.43% |
NA denotes unavailability of the individual score in the corresponding region. Glints<sup>a</sup> denotes conservative results of Glints after excluding CADD, DANN, FATHMM-MKL, MSRV and SinBaD. TOP denotes the number of causal variants ranked in the top 10, MRR denotes the mean rank ratio and AUC denotes the area under the rank ROC. Abbreviations for score names: gexp, gene expression; gobp, gene ontology; kegg, KEGG pathway; mrna, microRNA regulation; pfam, protein families; pseq, protein sequence; sign, signaling pathway; strg, protein-protein interaction; tsfc, transcriptional regulation
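To make the TOP and MRR columns concrete, the following sketch computes both from a set of simulated cases. Several details are assumptions rather than facts from the record: each case is assumed to consist of one causal variant ranked against a set of control variants, higher scores are assumed to rank first, ties are counted against the causal variant, and the names `rank_of_causal` and `summarize` are hypothetical. The rank ROC behind the AUC column is not reproduced because its exact construction is not given here.

```python
def rank_of_causal(causal_score, control_scores):
    """Return the 1-based rank of the causal variant among all variants.

    Higher scores rank first. Controls that tie with the causal variant
    are placed ahead of it (a pessimistic tie rule, chosen as an assumption).
    """
    return 1 + sum(1 for s in control_scores if s >= causal_score)


def summarize(cases, top_k=10):
    """Compute (TOP, MRR) over simulated cases.

    cases: list of (causal_score, control_scores) pairs, one per case.
    TOP is the number of cases whose causal variant ranks within top_k.
    MRR is the mean of rank / total_variants across cases (lower is better).
    """
    if not cases:
        raise ValueError("need at least one case")
    top = 0
    ratios = []
    for causal_score, control_scores in cases:
        rank = rank_of_causal(causal_score, control_scores)
        total = len(control_scores) + 1  # controls plus the causal variant
        if rank <= top_k:
            top += 1
        ratios.append(rank / total)
    return top, sum(ratios) / len(ratios)


# Two hypothetical cases: one well-ranked causal variant, one poorly ranked.
example_cases = [
    (0.9, [0.1, 0.2, 0.3]),  # causal variant ranks 1 of 4
    (0.1, [0.5] * 19),       # causal variant ranks 20 of 20
]
top_count, mean_rank_ratio = summarize(example_cases)
```

A lower MRR means causal variants sit closer to the top of the ranking, which is why the MRR and AUC columns in the table move in opposite directions.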
Fig. 2 Results of simulation studies based on sequencing data from the 1000 Genomes Project. For each region, results are summarized as four metrics: (1) the number of neutral variants; (2) TOP, the number of causal variants ranked in the top 10; (3) MRR, the average rank ratio of causal variants; (4) AUC, the average area under the rank ROC. Regions are shown in panels: a Exon; b Promoter; c Intron; d Splice site. The x axis denotes different populations. Population abbreviations: ASW, people with African ancestry in Southwest United States; CEU, Utah residents with ancestry from Northern and Western Europe; CHB, Han Chinese in Beijing, China; CHS, Han Chinese South, China; CLM, Colombians in Medellin, Colombia; FIN, Finnish in Finland; GBR, British from England and Scotland, UK; IBS, Iberian populations in Spain; LWK, Luhya in Webuye, Kenya; JPT, Japanese in Tokyo, Japan; MXL, people with Mexican ancestry in Los Angeles, California; PUR, Puerto Ricans in Puerto Rico; TSI, Toscani in Italia; YRI, Yoruba in Ibadan, Nigeria. Ancestry-based groups: AFR, African; AMR, Americas; EAS, East Asian; EUR, European
Fig. 3 Comparison of Glints with existing methods on prioritization of nsSNVs. The comparison is performed on disease variants and neutral variants from the Swiss-Prot database. Both the partial rank ROC (a) and the boxplot (b) indicate the superior performance of Glints
Fig. 4 Correlations between functional prediction scores across different regions. Pearson’s correlation coefficients are calculated from causal variants across different regions: a Exon, b Promoter, c Intron, d Splice site
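The pairwise correlations summarized in Fig. 4 can be sketched as follows. This is a minimal illustration of Pearson's correlation coefficient, not the authors' code: the function names and the toy score vectors are hypothetical, and each vector is assumed to hold one prediction score per causal variant, in the same variant order.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(scores):
    """Pairwise Pearson correlations for a dict of {score_name: values}."""
    names = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}


# Toy values only; these are not scores from the paper.
toy_scores = {
    "score_a": [1.0, 2.0, 3.0, 4.0],
    "score_b": [2.0, 4.0, 6.0, 8.0],
    "score_c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy_scores)
```

Computing one such matrix per region (exon, promoter, intron, splice site) would yield the four panels the caption describes.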