Jiarui Ding, Ali Bashashati, Andrew Roth, Arusha Oloumi, Kane Tse, Thomas Zeng, Gholamreza Haffari, Martin Hirst, Marco A Marra, Anne Condon, Samuel Aparicio, Sohrab P Shah.
Abstract
MOTIVATION: The study of cancer genomes now routinely involves using next-generation sequencing (NGS) technology to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data, and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge.
Year: 2011 PMID: 22084253 PMCID: PMC3259434 DOI: 10.1093/bioinformatics/btr629
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
The definitions of features x1 to x20
| (1) Number of reads covering or bridging the site | (11) Sum of squares of reference mapping qualities |
| (2) Number of reference Q13 bases on the forward strand | (12) Sum of non-reference mapping qualities |
| (3) Number of reference Q13 bases on the reverse strand | (13) Sum of squares of non-reference mapping qualities |
| (4) Number of non-reference Q13 bases on the forward strand | (14) Sum of tail distances for reference bases |
| (5) Number of non-reference Q13 bases on the reverse strand | (15) Sum of squares of tail distance for reference bases |
| (6) Sum of reference base qualities | (16) Sum of tail distances for non-reference bases |
| (7) Sum of squares of reference base qualities | (17) Sum of squares of tail distance for non-reference bases |
| (8) Sum of non-reference base qualities | (18) |
| (9) Sum of squares of non-reference base qualities | (19) max |
| (10) Sum of reference mapping qualities | (20) ∑ |
Q13 means a base quality greater than or equal to Phred score 13; D represents the three-dimensional vector (depth, number of reference bases, number of non-reference bases) at the current site; G∈{aa, ab, bb} denotes the genotype at site i, where a, b∈{A, C, T, G}, a is the reference allele and b is the non-reference allele. These features are constructed from Samtools.
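As a minimal sketch of how features (1) to (9) could be tallied at a single site, assume each overlapping read has already been reduced to a small record. The `PileupBase` type and its fields are hypothetical and not part of Samtools, and the sketch assumes the quality sums run over all bases rather than only those passing Q13, which the table does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List

Q_MIN = 13  # "Q13": base quality >= Phred score 13


@dataclass
class PileupBase:
    """Hypothetical per-read record at one site (not a Samtools type)."""
    is_ref: bool      # base matches the reference allele
    is_forward: bool  # read maps to the forward strand
    base_qual: int    # Phred-scaled base quality


def site_features(bases: List[PileupBase]) -> Dict[int, int]:
    """Tally features (1)-(9) for one site, keyed by feature number."""
    f = {i: 0 for i in range(1, 10)}
    f[1] = len(bases)  # (1) depth; assumes every record covers the site
    for b in bases:
        if b.base_qual >= Q_MIN:
            # (2)-(5): Q13 counts split by allele and strand
            idx = (2 if b.is_ref else 4) + (0 if b.is_forward else 1)
            f[idx] += 1
        # (6)-(9): quality sums and sums of squares.
        # Assumption: summed over all bases, not only Q13 bases.
        first = 6 if b.is_ref else 8
        f[first] += b.base_qual
        f[first + 1] += b.base_qual ** 2
    return f


# Toy input: two reference and two non-reference bases
example = [
    PileupBase(is_ref=True, is_forward=True, base_qual=30),
    PileupBase(is_ref=True, is_forward=False, base_qual=10),
    PileupBase(is_ref=False, is_forward=True, base_qual=20),
    PileupBase(is_ref=False, is_forward=False, base_qual=40),
]
features = site_features(example)
```

Features (10) to (17) follow the same sum and sum-of-squares pattern, applied to mapping qualities and tail distances instead of base qualities.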
The definitions of features x41 to x60
| (41) QUAL: phred-scaled probability of the call given data | (51) QD: variant confidence/unfiltered depth |
| (42) Allele count for non-ref allele in genotypes | (52) SB: strand bias (the variation being seen on only the forward or only the reverse strand) |
| (43) AF: allele frequency for each non-ref allele | (53) SumGLbyD |
| (44) Total number of alleles in called genotypes | (54) Allelic depths for the ref-allele |
| (45) Total (unfiltered) depth over all samples | (55) Allelic depths for the non-ref allele |
| (46) Fraction of reads containing spanning deletions | (56) DP: read depth (only filtered reads used for calling) |
| (47) HRun: largest contiguous homopolymer run of variant allele in either direction | (57) GQ: genotype quality computed based on the genotype likelihood |
| (48) HaplotypeScore: estimate the probability that the reads at this locus are coming from no more than 2 local haplotypes | (58) |
| (49) MQ: root mean square mapping quality | (59) |
| (50) MQ0: total number of reads with mapping quality zero | (60) |
These features are constructed from GATK.
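Several of these annotations (for example QD, MQ, MQ0 and HRun) appear in the INFO column of a VCF record as semicolon-separated `KEY=value` pairs. A minimal sketch of pulling them out is shown here; the INFO string is made up for illustration and is not taken from the paper's data, and `parse_info` is a hypothetical helper rather than a GATK function.

```python
from typing import Dict, Union

Value = Union[float, str, bool]


def parse_info(info: str) -> Dict[str, Value]:
    """Parse a VCF INFO column; flag keys without '=' map to True."""
    out: Dict[str, Value] = {}
    if info in ("", "."):
        return out  # "." marks an empty INFO column
    for field in info.split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        if "=" not in field:
            out[field] = True
            continue
        key, raw = field.split("=", 1)
        try:
            out[key] = float(raw)
        except ValueError:
            out[key] = raw  # keep non-numeric values (e.g. comma lists) as text
    return out


# Illustrative INFO string (values are invented)
info = "AC=1;AF=0.50;AN=2;DP=48;HRun=1;HaplotypeScore=0.98;MQ=59.2;MQ0=0;QD=12.4;SB=-25.1"
ann = parse_info(info)
gatk_features = [ann.get(k) for k in ("AF", "DP", "HRun", "MQ", "MQ0", "QD", "SB")]
```

Per-sample values such as the allelic depths and GQ live in the FORMAT and sample columns rather than INFO, so they would need separate handling.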
The definitions of features x98 to x106
| (98) Forward strand non-reference base ratio | (103) Sum of squares of non-reference mapping quality ratio |
| (99) Reverse strand non-reference base ratio | (104) Sum of non-reference tail distance ratio |
| (100) Sum of non-reference base quality ratio | (105) Sum of squares of non-reference tail distance ratio |
| (101) Sum of squares of non-reference base quality ratio | (106) Non-reference allele depth ratio |
| (102) Sum of non-reference mapping quality ratio | |
These features are used to boost weak mutation signals in the tumour and to decrease the influence of germline polymorphisms. In this table, F_i means the normalized version of the i-th feature.
Fig. 1. (a) Accuracy results from cross-validation experiments on all the exome capture data (SeqVal1+2). All classifiers outperformed the Samtools and GATK predictions in the ROC comparison. The numbers in parentheses are the prediction accuracies obtained by fixing the sensitivity at 0.99, except for the Samtools and GATK predictions, whose outputs are deterministic. (b) Accuracy results from cross-validation experiments on the exome capture data of SeqVal1. (c) Accuracy results from cross-validation experiments on the exome capture data of SeqVal1 after GATK's local realignment around indels and base quality recalibration. (d) Accuracy results from cross-validation experiments on the exome capture data of SeqVal2. (e) Comparison of the classifiers' and Samtools's (ST) performance at the specificity and sensitivity level given by Samtools. (f) Comparison of the classifiers' and GATK's performance at the specificity and sensitivity level given by GATK.
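Reporting accuracy at a fixed sensitivity can be illustrated with a small sketch: pick the highest score threshold whose sensitivity reaches the target, then measure accuracy at that threshold. The function name, the toy scores and the convention that a site is called somatic when its score is at least the threshold are assumptions for illustration, not the paper's code.

```python
import math
from typing import List, Tuple


def accuracy_at_sensitivity(scores: List[float], labels: List[int],
                            target: float = 0.99) -> Tuple[float, float]:
    """Return (threshold, accuracy) such that sensitivity >= target.

    labels: 1 = true somatic mutation, 0 = not somatic.
    A site is called positive when score >= threshold.
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("need at least one positive example")
    # Smallest number of positives that must be recovered to hit the target
    need = min(len(pos), max(1, math.ceil(target * len(pos))))
    threshold = pos[need - 1]
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return threshold, correct / len(labels)


# Toy data: classifier scores with ground-truth labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
thr, acc = accuracy_at_sensitivity(scores, labels, target=0.75)
```

Fixing the sensitivity puts probabilistic classifiers on a common operating point, which is not possible for tools that emit only a single deterministic call set.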
Fig. 2. ROC curves derived from the held-out whole genome shotgun independent test data from four cases show different classifiers' prediction results as well as Samtools and GATK's prediction results. The numbers in the parentheses are the prediction accuracy by using the same threshold as for the exome capture data (except for Samtools and GATK's prediction results).
The classification accuracy of classifiers by using different feature sets
| Model/Feature | RF_F (18) | BART_F (23) | SVM_F (17) | Logit_F (17) |
|---|---|---|---|---|
| RF | 0.9369 | 0.9487 | 0.9448 | 0.9329 |
| BART | 0.9369 | 0.9428 | 0.9369 | 0.9310 |
| SVM | 0.9034 | 0.9408 | 0.9369 | 0.9408 |
| Logit | 0.8856 | 0.9487 | 0.9250 | 0.9310 |
| Mean | 0.9157 | 0.9453 | 0.9359 | 0.9339 |
Here RF_F means the features selected by the RF classifier. BART_F, SVM_F and Logit_F are similarly defined. The numbers in parentheses are the numbers of features selected.
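The Mean row is the column-wise average of the four classifiers' accuracies for each feature set. A short sketch that recomputes it from the per-classifier values in the table follows; the variable names are illustrative.

```python
from statistics import mean

# Accuracies copied from the table: rows are models, columns are feature sets
feature_sets = ["RF_F", "BART_F", "SVM_F", "Logit_F"]
accuracy = {
    "RF":    [0.9369, 0.9487, 0.9448, 0.9329],
    "BART":  [0.9369, 0.9428, 0.9369, 0.9310],
    "SVM":   [0.9034, 0.9408, 0.9369, 0.9408],
    "Logit": [0.8856, 0.9487, 0.9250, 0.9310],
}

# Average each column (feature set) across the four classifiers
column_means = {
    name: mean(row[j] for row in accuracy.values())
    for j, name in enumerate(feature_sets)
}

# Feature set with the highest mean accuracy
best = max(column_means, key=column_means.get)

for name, value in column_means.items():
    print(f"{name}: {value:.4f}")
```

Averaging over classifiers gives a rough, classifier-independent way to compare the selected feature sets.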