Rodrigo Goya, Mark G F Sun, Ryan D Morin, Gillian Leung, Gavin Ha, Kimberley C Wiegand, Janine Senz, Anamaria Crisan, Marco A Marra, Martin Hirst, David Huntsman, Kevin P Murphy, Sam Aparicio, Sohrab P Shah.
Abstract
MOTIVATION: Next-generation sequencing (NGS) has enabled whole genome and transcriptome single nucleotide variant (SNV) discovery in cancer. NGS produces millions of short sequence reads that, once aligned to a reference genome sequence, can be interpreted for the presence of SNVs. Although tools exist for SNV discovery from NGS data, none are specifically suited to work with data from tumors, where altered ploidy and tumor cellularity impact the statistical expectations of SNV discovery.
Year: 2010 PMID: 20130035 PMCID: PMC2832826 DOI: 10.1093/bioinformatics/btq040
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Schematic diagram of the input data to SNVMix1, showing how allelic counts (bottom) are derived from aligned reads (top). The reference sequence is shown in blue. The arrows indicate positions representing SNVs. Non-reference bases are shown in red. (B) Input data for SNVMix2, which consists of the mapping and base qualities. A darker read background represents a higher quality alignment, and brighter colored nucleotides represent higher quality base calls. Therefore, high contrast nucleotides are more trustworthy than lower contrast nucleotides. (C) SNVMix1 shown as a probabilistic graphical model. Circles represent random variables, and rounded squares represent fixed constants. Shaded nodes indicate observed data [the allelic counts and the read depth from (A)]. Unshaded nodes indicate quantities that are inferred during EM. G∈{aa, ab, bb} represents the genotype, N∈{0, 1, …} is the number of reads and a∈{0, 1, …, N} is the number of reference reads. π is the prior over genotypes and μ is the genotype-specific Binomial parameter for genotype k. (D) SNVMix2 shown as a probabilistic graphical model. In comparison to SNVMix1, a is unobserved and the input is expanded to consider read-specific information indexed by j, where z=1 indicates that read j is correctly aligned, q is the base quality and r is the mapping quality.
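The model in panel (C) can be illustrated with a minimal Python sketch, assuming the standard mixture-of-Binomials reading of the caption: the number of reference reads a at depth N is Binomial(N, μ_k) given genotype k, and the posterior over genotypes follows from Bayes' rule with prior π. The function name `genotype_posterior` and the numeric values of `mu` and `pi` are illustrative assumptions, not fitted parameters from the paper.

```python
from math import comb

GENOTYPES = ("aa", "ab", "bb")

def genotype_posterior(a, n, mu, pi):
    """Posterior P(G = k | a, N = n) under a Binomial mixture.

    a  : number of reads matching the reference
    n  : read depth
    mu : per-genotype Binomial parameters (probability of a reference read)
    pi : prior probabilities over genotypes
    """
    # Prior times Binomial likelihood for each genotype.
    weighted = [
        p * comb(n, a) * m**a * (1.0 - m) ** (n - a)
        for m, p in zip(mu, pi)
    ]
    total = sum(weighted)
    # Normalise so the three posteriors sum to one.
    return [w / total for w in weighted]

# Illustrative (assumed) parameters, ordered as (aa, ab, bb).
mu = (0.999, 0.5, 0.001)
pi = (0.8, 0.1, 0.1)

post = genotype_posterior(a=5, n=10, mu=mu, pi=pi)
for g, p in zip(GENOTYPES, post):
    print(f"P({g} | a=5, N=10) = {p:.4f}")
```

With these assumed values, a position where half of the ten reads match the reference is assigned almost entirely to the heterozygous genotype ab, while a position where all ten reads match the reference is assigned to aa.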
Fig. 2. (A) Theoretical behavior of SNVMix at depths of 2, 3, 5, 10, 15, 20, 35, 50 and 100. The plots show how the distribution of marginal probabilities changes with the number of reference alleles, given the model parameters fit to a 10× breast cancer genome dataset. (B) ROC curves from fitting SNVMix2 to synthetic data with increasing levels of certainty in the base call.
Fig. 3. Conditional probability distributions of the SNVMix model.
Description of random variables in SNVMix1 and SNVMix2
| Parameter | Description | Value |
|---|---|---|
| δ | Dirichlet prior on π | (1000,100,100) |
| π | Multinomial distribution over genotypes | Estimated by EM (M-step) |
| G | Genotype at a given position | Estimated by EM (E-step) |
| a | Indicates whether read j matches the reference | Observed in SNVMix1, latent in SNVMix2 |
| z | Indicates whether read j is correctly aligned | Latent |
| q | Probability that base call is correct | Observed (SNVMix2 only) |
| r | Probability that alignment is correct | Observed (SNVMix2 only) |
| μ | Parameter of the Binomial for genotype k | Estimated by EM (M-step) |
| α | Shape parameter of Beta prior on μ | (1000,500,1) |
| β | Scale parameter of Beta prior on μ | (1,500,1000) |
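To make the prior hyperparameters δ, α and β in the table more concrete, the sketch below computes the prior means they imply. It assumes that each triple is ordered as the genotypes (aa, ab, bb) and that α and β are the two parameters of a standard Beta(α, β) distribution; both are assumptions about the parameterisation, and the helper names are hypothetical.

```python
def dirichlet_mean(delta):
    """Mean of a Dirichlet distribution with concentration parameters delta."""
    total = sum(delta)
    return [d / total for d in delta]

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hyperparameters from the table, assumed to be ordered (aa, ab, bb).
delta = (1000, 100, 100)
alpha = (1000, 500, 1)
beta = (1, 500, 1000)

pi_prior_mean = dirichlet_mean(delta)
mu_prior_mean = [beta_mean(a, b) for a, b in zip(alpha, beta)]

for g, p, m in zip(("aa", "ab", "bb"), pi_prior_mean, mu_prior_mean):
    print(f"{g}: prior mean of pi = {p:.4f}, prior mean of mu = {m:.4f}")
```

Under these assumptions the priors encode that most positions are homozygous reference (prior mean of π for aa is 1000/1200), and that the expected fraction of reference reads is close to 1 for aa, 0.5 for ab and close to 0 for bb.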
Fig. 4. Distribution of AUC over 16 ovarian cancer transcriptomes comparing accuracy of SNV detection for two Maq runs, the best and worst SNVMix1 runs in the cross-validation experiment (middle) and the best and worst runs for SNVMix2 (mbQ0 = no quality thresholding, MbQ30 = keeping only reads with mapping qualities > Q30). SNVMix1 and SNVMix2 runs were significantly more accurate than both Maq runs (ANOVA, P < 0.0001). SNVMix2 runs were better than SNVMix1 runs, but the difference was not statistically significant.
Comparison of accuracy of SNVMix1, SNVMix2 and SNVMix combined with base and mapping quality thresholding
| Model | Run | Train AUC | TP | FP | TN | FN | Sensitivity | Precision | F-measure |
|---|---|---|---|---|---|---|---|---|---|
| SNVMix1 | 10× | 0.9880 | 305 | 192 | 0 | 0 | 1.0000 | 0.6137 | 0.7606 |
| | 40× | 0.9924 | 293 | 107 | 85 | 12 | 0.9607 | 0.7325 | 0.8312 |
| SNVMix2 | 10× | 0.9905 | 299 | 162 | 30 | 6 | 0.9803 | 0.6486 | 0.7807 |
| | 40× | 0.9929 | 290 | 107 | 85 | 15 | 0.9508 | 0.7305 | 0.8262 |
| SNVMix2+thresholding | mQ50_bQ20 (10×) | 0.9882 | 287 | 88 | 104 | 18 | 0.9410 | 0.7653 | 0.8441 |
| | mQ50_bQ15 (40×) | 0.9928 | 287 | 71 | 121 | 18 | 0.9410 | 0.8017 | 0.8658 |
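The sensitivity, precision and final-column values in the table above can be reproduced from the TP, FP and FN counts; the final column matches the harmonic mean of sensitivity and precision (the F-measure). The following is a minimal sketch of that arithmetic, with the function name `classification_metrics` chosen here for illustration.

```python
def classification_metrics(tp, fp, fn):
    """Return (sensitivity, precision, F-measure) from confusion counts."""
    sensitivity = tp / (tp + fn)   # fraction of true SNVs that were called
    precision = tp / (tp + fp)     # fraction of calls that are true SNVs
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)
    return sensitivity, precision, f_measure

# (label, TP, FP, FN) taken from three rows of the table.
rows = [
    ("SNVMix1 40x", 293, 107, 12),
    ("SNVMix2 10x", 299, 162, 6),
    ("SNVMix2+thresholding mQ50_bQ15 (40x)", 287, 71, 18),
]

for label, tp, fp, fn in rows:
    sens, prec, f = classification_metrics(tp, fp, fn)
    print(f"{label}: sens={sens:.4f} prec={prec:.4f} F={f:.4f}")
```

The sketch does not guard against rows with no positive calls or no true positives, where the denominators would be zero.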