Matthew Jobin, Haiko Schurz, Brenna M Henn.
Abstract
We introduce IMPUTOR, software for phylogenetically aware imputation of missing haploid nonrecombining genomic data. Targeted for next-generation sequencing data, IMPUTOR uses the principle of parsimony to impute data marked as missing due to low coverage. Along with efficiently imputing missing variant genotypes, IMPUTOR is capable of reliably and accurately correcting many nonmissing sites that represent spurious sequencing errors. Tests on simulated data show that IMPUTOR is capable of detecting many induced mutations without making erroneous imputations/corrections, with as many as 95% of missing sites imputed and 81% of errors corrected under optimal conditions. We tested IMPUTOR with human Y-chromosomes from pairs of close relatives and demonstrate IMPUTOR's efficacy in imputing missing and correcting erroneous calls.
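To make the idea of parsimony-based imputation concrete, the following is a minimal sketch using the classic Fitch algorithm on a small rooted binary tree. It is not IMPUTOR's implementation: the tree encoding, the `N` missing code, the four-letter alphabet, and the function names are all assumptions made for illustration. A missing leaf is filled only when parsimony leaves a single candidate state, and observed calls are never changed (so this sketch does not attempt error correction).

```python
from typing import Dict, Tuple, Union

MISSING = "N"                 # assumed missing-data code
ALPHABET = frozenset("ACGT")  # assumed state space

# A tree is a leaf name (str) or a pair of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]


def _annotate(node, states):
    """Bottom-up Fitch pass: attach a candidate state set to every node."""
    if isinstance(node, str):  # leaf
        observed = states[node]
        candidates = set(ALPHABET) if observed == MISSING else {observed}
        return (candidates, node, None)
    left = _annotate(node[0], states)
    right = _annotate(node[1], states)
    shared = left[0] & right[0]
    # Intersection if the children agree on something, otherwise union.
    return (shared if shared else left[0] | right[0], left, right)


def _assign(annotated, parent_state, states, out):
    """Top-down pass: resolve states, filling missing leaves when unambiguous."""
    candidates, first, second = annotated
    if parent_state in candidates:
        state = parent_state             # inherit the parent's state
    elif len(candidates) == 1:
        state = next(iter(candidates))   # only one parsimonious choice
    else:
        state = None                     # ambiguous: make no call
    if second is None:                   # leaf: `first` holds its name
        if states[first] == MISSING and state is not None:
            out[first] = state
        return
    _assign(first, state, states, out)
    _assign(second, state, states, out)


def fitch_impute(tree: Tree, states: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `states` with unambiguous missing leaves imputed."""
    out = dict(states)
    _assign(_annotate(tree, states), None, states, out)
    return out


tree = (("A", "B"), ("C", "D"))
calls = {"A": "G", "B": "N", "C": "G", "D": "T"}
imputed = fitch_impute(tree, calls)  # B is filled with G; D keeps its observed T
```

Leaving ambiguous sites uncalled is a deliberately conservative choice in this sketch; a real tool would also need rules for how many neighbors must agree before a call is made.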
Year: 2018 PMID: 29722813 PMCID: PMC5961346 DOI: 10.1093/gbe/evy088
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Fig. 1.—Rootward Case 1. A site found within a clearly defined clade of sufficient size to contain the threshold number of neighbors for both missing and nonmissing data.
Fig. 2.—Rootward Case 2. A site with insufficient near neighbors to reach the threshold number for either missing or nonmissing data.
Fig. 3.—Proportion of corrected errors as a function of the maxdepth parameter for missing sites and rootward neighbor collection method.
Effect of the Reversion Check Feature of IMPUTOR on Accuracy
| Method | Reversion Check | Mean Imputed Distance | S.D. Imputed Distance | Prop. Corrected Errors |
|---|---|---|---|---|
| | Y | 6.10 | 2.77 | |
| | Y | 3.90 | 2.55 | |
| | Y | 5.00 | 2.21 | |
| | N | 13.3 | 1.57 | |
| | N | 12.1 | 2.28 | |
| | N | 13.0 | 2.36 | |
NOTE.—Simulated data generated in SFS_CODE were randomly altered to create ten new files by replacing bases with missing data. These altered files, which had a mean number of pairwise differences of 73.7 from the original file (S.D. 10.15), were then run in IMPUTOR. The “Prop. Corrected Errors” column above is a metric of accuracy in recovering the original sequence.
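The distance summary in the note above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: each file is treated as a single aligned sequence, a missing code counts as a difference from the original base, and the S.D. is taken to be the sample standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def pairwise_differences(a: str, b: str) -> int:
    """Count aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def summarize_distances(original: str, replicates: Sequence[str]) -> Tuple[float, float]:
    """Mean and sample S.D. of each replicate's distance to the original.

    Needs at least two replicates, because a sample S.D. is undefined for one.
    """
    distances = [pairwise_differences(original, r) for r in replicates]
    return mean(distances), stdev(distances)


# Toy example: three altered copies of a four-base sequence (N = assumed missing code).
avg, sd = summarize_distances("AAAA", ["AAAN", "AANN", "ANNN"])
```

Running the same summary on the imputed files instead of the altered ones would give values comparable to the "Mean Imputed Distance" and "S.D. Imputed Distance" columns, provided the assumptions above match how those columns were computed.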
Fig. 4.—Proportion of corrected errors as a function of the maxhops parameter for nonmissing sites and hops neighbor collection method.
Fig. 5.—Proportion of corrected errors as a function of the number of sequences, for a missingness of 0.01 and Θ = 0.01, for two software programs, SHAPEIT and IMPUTOR.
Effect of Θ on Ratio of Imputed to Unimputed Pairwise Distances to an Original SFS_CODE-generated File
| Θ | Prop. Corrected Err. | Variance | Failed/10 |
|---|---|---|---|
| | 0.93 | 0.00195 | 0 |
| | 0.89 | 0.0148 | 2 |
| | 0.82 | n/a | 9 |
NOTE.—The unimputed file was created by randomly replacing bases with missing codes at a frequency of 0.001, simulating damage. Ten iterations of simulation were run for each value of Θ, with mean and variance shown. The Failed/10 column gives the number of iterations, out of ten, on which RAxML would not run.
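The masking step and the distance ratio named in the table title can be sketched as below. The per-site independent masking, the `N` missing code, and both function names are assumptions for illustration; the source does not specify how the replacement was drawn or exactly how the ratio was normalized.

```python
import random

MISSING = "N"  # assumed missing-data code


def mask_bases(seq: str, frequency: float, rng: random.Random) -> str:
    """Replace each base with the missing code independently with probability `frequency`."""
    return "".join(MISSING if rng.random() < frequency else base for base in seq)


def distance_ratio(original: str, unimputed: str, imputed: str) -> float:
    """Ratio of the imputed file's distance to the unimputed file's distance,
    both measured against the original sequence."""
    def diff(a: str, b: str) -> int:
        return sum(1 for x, y in zip(a, b) if x != y)

    before = diff(original, unimputed)
    if before == 0:
        raise ValueError("unimputed sequence is identical to the original")
    return diff(original, imputed) / before


rng = random.Random(0)                      # fixed seed so a run is repeatable
damaged = mask_bases("ACGT" * 250, 0.001, rng)
```

A ratio below 1 means the imputed sequence is closer to the original than the damaged one was; repeating the masking with different seeds and averaging would mirror the ten-iteration design described in the note.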