Steven H Wu, Rachel S Schwartz, David J Winter, Donald F Conrad, Reed A Cartwright.
Abstract
MOTIVATION: Accurate identification of genotypes is an essential part of the analysis of genomic data, including the identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others.
Year: 2017 PMID: 28334373 PMCID: PMC5860108 DOI: 10.1093/bioinformatics/btx133
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Mixtures of Dirichlet-multinomials provide the best fits to genomic datasets. QQ plots evaluate the fit of three different models to (a) the CHM1 chromosome 21 RD dataset and (b) the CEU 2013 chromosome 21 TH dataset. The quantiles of the observed read count frequencies are calculated from the datasets, and the quantiles of the expected read count frequencies are estimated from the fitted model. A model that fits the data well produces points that fall along the diagonal.
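The comparison behind a QQ plot can be sketched in a few lines. This is a minimal illustration, not the authors' plotting code: the function names (`quantile`, `qq_points`, `max_diagonal_deviation`) and the choice of linear interpolation are assumptions made here for the example.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an ascending-sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def qq_points(observed, expected, n_quantiles=5):
    """Pair expected and observed quantiles at evenly spaced probabilities.

    `observed` would hold read count frequencies from the data and
    `expected` frequencies simulated from a fitted model (both hypothetical
    inputs here). Requires n_quantiles >= 2.
    """
    obs = sorted(observed)
    exp = sorted(expected)
    probs = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    return [(quantile(exp, p), quantile(obs, p)) for p in probs]


def max_diagonal_deviation(points):
    """Largest vertical distance of any QQ point from the line y = x."""
    return max(abs(y - x) for x, y in points)
```

A well-fitting model gives a deviation near zero; points that bend away from the diagonal indicate quantiles where the model over- or under-predicts.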
The number of components (c) in the best MDM model according to BIC values for CHM1 and CEU datasets, and parameters estimated for these models
| Dataset | c | π_ref | π_alt | π_err | φ | ρ |
|---|---|---|---|---|---|---|
| CHM1 RD Chr21 | 2 | 1.00 | NA | 2.20e−4 | 0.000 | 0.751 |
| | | 1.00 | NA | 3.61e−4 | 1.53e−2 | 0.249 |
| CHM1 RD Chr10 | 2 | 1.00 | NA | 2.60e−4 | 2.69e−3 | 0.942 |
| | | 0.999 | NA | 8.33e−4 | 4.75e−2 | 5.85e−2 |
| CHM1 FD Chr21 | 2 | 1.00 | NA | 2.49e−4 | 2.52e−3 | 0.972 |
| | | 0.982 | NA | 1.76e−2 | 0.892 | 2.78e−2 |
| CHM1 FD Chr10 | 2 | 1.00 | NA | 2.82e−4 | 4.15e−3 | 0.984 |
| | | 0.975 | NA | 2.54e−2 | 0.948 | 1.64e−2 |
| CEU13 TH Chr21 | 3 | 0.504 | 0.496 | 3.53e−04 | 2.46e−04 | 0.939 |
| | | 0.508 | 0.491 | 5.26e−04 | 6.89e−02 | 6.04e−02 |
| | | 0.239 | 0.483 | 0.278 | 6.56e−02 | 5.87e−04 |
| CEU12 TH Chr21 | 2 | 0.509 | 0.491 | 3.15e−04 | 1.29e−04 | 0.961 |
| | | 0.541 | 0.457 | 2.04e−03 | 7.73e−02 | 3.90e−02 |
| CEU11 TH Chr21 | 2 | 0.509 | 0.491 | 3.15e−04 | 1.31e−04 | 0.961 |
| | | 0.541 | 0.457 | 1.98e−03 | 7.82e−02 | 3.91e−02 |
| CEU10 TH Chr21 | 2 | 0.533 | 0.465 | 2.25e−03 | 1.52e−03 | 0.922 |
| | | 0.670 | 0.327 | 2.70e−03 | 0.000 | 7.83e−02 |
| CEU13 TH Chr10 | 2 | 0.502 | 0.498 | 3.36e−04 | 4.28e−04 | 0.922 |
| | | 0.504 | 0.490 | 6.21e−03 | 1.24e−01 | 7.54e−03 |
| CEU12 TH Chr10 | 2 | 0.508 | 0.491 | 3.19e−04 | 5.53e−04 | 0.986 |
| | | 0.540 | 0.457 | 3.05e−03 | 7.60e−02 | 1.38e−02 |
| CEU11 TH Chr10 | 2 | 0.508 | 0.491 | 3.19e−04 | 5.56e−04 | 0.986 |
| | | 0.542 | 0.455 | 3.01e−03 | 7.37e−02 | 1.39e−02 |
| CEU10 TH Chr10 | 2 | 0.534 | 0.463 | 3.05e−03 | 0.000 | 0.684 |
| | | 0.550 | 0.449 | 6.45e−04 | 1.29e−02 | 0.316 |

Note: Each row represents a different component in the model. π_ref, π_alt and π_err are the proportions of the reference, alternative and error terms, respectively. φ is the overdispersion parameter. When φ approaches 0, the distribution approaches a multinomial, and when φ approaches 1, the distribution is nearly completely overdispersed. ρ is the proportion of sites in each component.
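The role of φ can be made concrete with a minimal sketch of a Dirichlet-multinomial log-probability and its mixture. It assumes the common parameterization α_i = π_i(1 − φ)/φ, which reproduces the two limits described in the note but is not confirmed by this record as the authors' exact definition. The function names and the example read counts are hypothetical.

```python
from math import exp, lgamma, log


def dm_log_pmf(counts, pi, phi):
    """Log-probability of read counts under a Dirichlet-multinomial.

    Assumed parameterization: alpha_i = pi_i * (1 - phi) / phi, 0 < phi < 1,
    with every pi_i > 0. Small phi behaves like a multinomial; phi near 1
    concentrates all reads on a single category.
    """
    total_pi = sum(pi)
    scale = (1.0 - phi) / phi
    # Normalize pi so that rounded table values still give a proper distribution.
    alphas = [p / total_pi * scale for p in pi]
    a0 = sum(alphas)
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)
    log_ratio = sum(lgamma(k + a) - lgamma(a) for k, a in zip(counts, alphas))
    return log_coef + lgamma(a0) - lgamma(n + a0) + log_ratio


def mdm_log_pmf(counts, components):
    """Log-probability under a mixture; components are (rho, pi, phi) tuples."""
    terms = [log(rho) + dm_log_pmf(counts, pi, phi) for rho, pi, phi in components]
    m = max(terms)
    # Log-sum-exp keeps the sum stable when individual terms are very negative.
    return m + log(sum(exp(t - m) for t in terms))


# Components (rho, pi, phi) copied from the CEU13 TH Chr21 rows of the table,
# with pi ordered as (reference, alternative, error).
ceu13_chr21 = [
    (0.939, (0.504, 0.496, 3.53e-4), 2.46e-4),
    (6.04e-2, (0.508, 0.491, 5.26e-4), 6.89e-2),
    (5.87e-4, (0.239, 0.483, 0.278), 6.56e-2),
]
# Hypothetical site with 14 reference, 16 alternative and 0 error reads.
print(mdm_log_pmf((14, 16, 0), ceu13_chr21))
```

Components with φ exactly 0 or a π entry of NA (as in the haploid CHM1 rows) fall outside this sketch and would need a plain multinomial or a two-category model instead.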
The minor components have a higher percentage of copy number variable regions (CNVs) and repetitive/low complexity regions (LCRs)
| Dataset | Component | Non-CNV/CNV | CNV % | P value | Non-LCR/LCR | LCR % | P value |
|---|---|---|---|---|---|---|---|
| CEU13 Chr21 | Major | 26 924/415 | 1.5 | 4.11e−143 | 13 362/13 977 | 51.1 | 2.91e−48 |
| | Minor | 10 851/773 | 6.7 | | 4746/6878 | 59.2 | |
| CEU13 Chr10 | Major | 32 012/94 | 0.3 | 4.19e−3 | 16 043/16 063 | 50.0 | 5.39e−06 |
| | Minor | 5864/32 | 0.5 | | 2756/3140 | 53.3 | |
| CEU12 Chr21 | Major | 26 891/193 | 0.7 | 4.54e−87 | 13 362/13 722 | 50.7 | 2.67e−57 |
| | Minor | 7760/331 | 4.1 | | 3178/4913 | 60.7 | |
| CEU12 Chr10 | Major | 32 560/90 | 0.3 | 1.58e−3 | 16 204/16 446 | 50.4 | 2.08e−16 |
| | Minor | 7035/37 | 0.5 | | 3129/3943 | 55.8 | |
| CEU11 Chr21 | Major | 26 858/191 | 0.7 | 1.11e−88 | 13 344/13 705 | 50.7 | 3.11e−54 |
| | Minor | 7743/333 | 4.1 | | 3194/4882 | 60.5 | |
| CEU11 Chr10 | Major | 32 539/89 | 0.3 | 1.01e−3 | 16 195/16 433 | 50.4 | 3.19e−16 |
| | Minor | 7060/38 | 0.5 | | 3144/3954 | 55.7 | |
| CEU10 Chr21 | Major | 21 968/109 | 0.5 | 3e−90 | 11 936/10 141 | 45.9 | 3.62e−71 |
| | Minor | 8518/326 | 3.7 | | 3790/5054 | 57.1 | |
| CEU10 Chr10 | Major | 26 380/49 | 0.2 | 0.198 | 14 305/12 124 | 45.9 | 1.68e−53 |
| | Minor | 10 197/26 | 0.3 | | 4617/5606 | 54.8 | |
Note: The number and percentage (%) of CNV regions and of LCRs are shown for the major and minor components (combined) for the best-fit model for each CEU PH dataset. P values were calculated for the difference between components in the proportion of CNVs or LCRs.
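This record does not say which statistical test produced the P values. As one common way to compare two proportions, the sketch below uses a pooled two-proportion z-test with a two-sided normal P value. That choice is an assumption made for illustration, the function name is hypothetical, and the result is not expected to reproduce the tabulated values exactly.

```python
from math import erfc, sqrt


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Pooled two-proportion z-test; returns (z, two-sided P value)."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal at |z|.
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, p_value


# CNV counts from the CEU10 Chr10 rows: major 26 380 non-CNV / 49 CNV,
# minor 10 197 non-CNV / 26 CNV.
z, p = two_proportion_z_test(49, 26380 + 49, 26, 10197 + 26)
print(z, p)
```

An exact test or a continuity-corrected chi-squared test would give somewhat different values, particularly for the rows with small CNV counts.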
Fig. 2. The receiver operating characteristic (ROC) curve for the CEU13 chromosome 21 dataset demonstrates better classification of heterozygous sites using our approach. This dataset is shown as an example of how the model can be used to classify sites. Sensitivity and specificity are calculated for three possible optimizing criteria from the MDM heterozygous-site classifier: cost benefit (CB), closest point to (0,1) (ROC01), and Youden’s Index (Youden). For comparison, the output from the GATK recommended workflow is also shown.
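Two of the three criteria named in the caption have simple standard definitions: Youden's Index maximizes sensitivity + specificity − 1, and ROC01 minimizes the distance from the ROC curve to the corner (0,1). The sketch below illustrates both on generic classifier scores. The function names and toy data are hypothetical, and the cost-benefit criterion is omitted because it depends on a cost ratio and prevalence that this record does not give.

```python
from math import hypot


def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A site is called positive when its score is >= the threshold.
    `labels` holds 1 for true positives (e.g. heterozygous sites) and 0 otherwise.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, tp / n_pos, 1.0 - fp / n_neg))
    return points


def best_youden(points):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1.0)


def best_roc01(points):
    """Threshold closest to (0, 1) in (1 - specificity, sensitivity) space."""
    return min(points, key=lambda p: hypot(1.0 - p[2], 1.0 - p[1]))


# Toy scores and labels, invented for illustration only.
pts = roc_points([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1])
print(best_youden(pts), best_roc01(pts))
```

The two criteria often select the same threshold, but they can differ when the curve is asymmetric, which is one reason to report more than one.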