Ericka A Becker, Charles M Burns, Enrique J León, Saravanan Rajabojan, Robert Friedman, Thomas C Friedrich, Shelby L O'Connor, Austin L Hughes.
Abstract
Factors affecting the reliability of Roche/454 pyrosequencing for analyzing sequence polymorphism in within-host viral populations were assessed by two experiments: 1) sequencing four clonal simian immunodeficiency virus (SIV) stocks and 2) sequencing mixtures, in different proportions, of two SIV strains with known fixed nucleotide differences. Observed nucleotide diversity and the frequency of undetermined nucleotides were increased at sites in homopolymer runs of four or more identical nucleotides, particularly at AT sites. However, in the mixed-strain experiments, the effects of such errors on estimated nucleotide diversity were small in comparison to known strain differences. The results suggest that biologically meaningful variants present at a frequency of around 10%, and possibly much lower, are easily distinguished from artifacts of the sequencing process. Analysis of the clonal stocks revealed numerous rare variants that showed the signature of purifying selection, and showed that eliminating variants at frequencies of less than 1% reduced estimates of nucleotide diversity by about an order of magnitude. Thus, using a 1% frequency cutoff for accepting a variant as real represents a conservative standard, which may be useful in studies that are focused on the discovery of specific mutations (such as those conferring immune escape or drug resistance). On the other hand, if the goal is to estimate nucleotide diversity, an optimal strategy might be to include all observed variants (even those at less than 1% frequency), while masking out homopolymer runs of four or more nucleotides.
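The two filtering strategies described in the abstract (a minimum variant frequency, and masking of homopolymer runs of four or more nucleotides) can be sketched in Python. This is a minimal illustration and not the authors' pipeline: the function names, the input layout (one dictionary of base counts per reference position), the choice to drop reads carrying sub-threshold variants, and the per-site estimator π = n/(n−1) · (1 − Σp²) are all assumptions made for the example.

```python
def homopolymer_run_lengths(ref):
    """For each reference position, the length of the run of identical bases containing it."""
    lengths = [0] * len(ref)
    i = 0
    while i < len(ref):
        j = i
        while j < len(ref) and ref[j] == ref[i]:
            j += 1
        for k in range(i, j):
            lengths[k] = j - i
        i = j
    return lengths


def site_diversity(counts, min_freq=0.0):
    """Per-site nucleotide diversity from base counts (assumed estimator).

    Minor alleles whose frequency is below min_freq are discarded before
    the estimate is computed; the most common base is always kept.
    """
    total = sum(counts.values())
    if total < 2:
        return 0.0
    major = max(counts, key=counts.get)
    kept = {b: c for b, c in counts.items() if b == major or c / total >= min_freq}
    n = sum(kept.values())
    if n < 2:
        return 0.0
    return (n / (n - 1)) * (1.0 - sum((c / n) ** 2 for c in kept.values()))


def mean_diversity(ref, site_counts, min_freq=0.0, mask_run_len=None):
    """Mean per-site diversity, optionally skipping sites in runs >= mask_run_len."""
    runs = homopolymer_run_lengths(ref)
    values = [
        site_diversity(counts, min_freq)
        for counts, run in zip(site_counts, runs)
        if mask_run_len is None or run < mask_run_len
    ]
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: noise concentrated in an AAAA run.
ref = "ACGTAAAAC"
site_counts = [{b: 100} for b in "ACGT"] + [{"A": 90, "T": 10}] * 4 + [{"C": 100}]

print(mean_diversity(ref, site_counts))                   # all variants, no masking
print(mean_diversity(ref, site_counts, mask_run_len=4))   # mask runs of four or more
print(mean_diversity(ref, site_counts, min_freq=0.01))    # 1% frequency cutoff
```

Setting `min_freq=0.01` corresponds to the conservative variant-discovery standard, while `min_freq=0.0` with `mask_run_len=4` corresponds to the strategy suggested for estimating diversity.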
Year: 2012 PMID: 22436995 PMCID: PMC3342875 DOI: 10.1093/gbe/evs029
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Figure. Mean nucleotide diversity (π) of four SIV stocks at sites categorized by length of homopolymer runs (Run Length) and the reference sequence base pair (AT or GC; Reference bp): (A) including all variable sites and (B) excluding sites with % variant < 1%. The figure illustrates the significant interactions in factorial ANOVA between the variables Run Length and Reference bp.
Figure. Mean proportion of undetermined nucleotides (Prop. N) of four SIV stocks at sites categorized by length of homopolymer runs (Run Length) and the reference sequence base pair (Reference bp). The figure illustrates the significant interaction in factorial ANOVA between the variables Run Length and Reference bp.
Mean Synonymous (πS) and Nonsynonymous (πN) Nucleotide Diversity in Clonal SIVmac239 Stocks
| | πS ± SE | πN ± SE | πN:πS |
| --- | --- | --- | --- |
| All observed variants | | | |
| All substitutions | 0.00366 ± 0.00013 | 0.00236 ± 0.00007*** | 0.645 |
| Excluding H ≥ 4 | 0.00349 ± 0.00013 | 0.00202 ± 0.00005***,† | 0.579 |
| % Variant ≥ 1% only | | | |
| All substitutions | 0.00063 ± 0.00006 | 0.00046 ± 0.00002** | 0.730 |
| Excluding H ≥ 4 | 0.00041 ± 0.00006† | 0.00021 ± 0.00002**,† | 0.512 |
NOTE.—SE, standard error.
H ≥ 4, homopolymer runs of four or more nucleotides.
Paired t-test of the hypothesis that πS = πN: *P < 0.05, **P < 0.01, ***P < 0.001.
Paired t-test of the hypothesis that πS or πN with H ≥ 4 excluded equals the corresponding value for all substitutions: †P < 0.001.
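The πN:πS column is the ratio of the two tabulated means, and a ratio below 1 is the pattern the abstract describes as the signature of purifying selection. The short Python check below recomputes the ratios from the rounded means given in the table; the row labels and the helper name `pin_pis_ratio` are introduced here for illustration only, and ratios published from unrounded means could in principle differ in the last digit.

```python
def pin_pis_ratio(pi_s, pi_n):
    """Ratio of nonsynonymous to synonymous nucleotide diversity."""
    return pi_n / pi_s


# (pi_S, pi_N) means copied from the table above.
rows = {
    "all variants, all substitutions": (0.00366, 0.00236),
    "all variants, excluding H >= 4": (0.00349, 0.00202),
    ">= 1% only, all substitutions": (0.00063, 0.00046),
    ">= 1% only, excluding H >= 4": (0.00041, 0.00021),
}

for label, (pi_s, pi_n) in rows.items():
    print(f"{label}: piN:piS = {pin_pis_ratio(pi_s, pi_n):.3f}")
```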
Figure. Nucleotide diversity (π) at individual variable sites plotted against homopolymer run length in four strain combinations: (A) 1:1, (B) 1:2, (C) 1:4, and (D) 1:19. In each case, open circles indicate sites (N = 17) with known strain differences and closed circles indicate all other sites. Separate linear regression lines are drawn for the sites with known strain differences (dotted lines) and all other sites (solid lines).
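As a rough guide to the π values one would expect at the sites with known strain differences, the Python sketch below computes the idealized diversity 2p(1 − p) for a biallelic site, where p is the minor strain's proportion in the mixture. It assumes the nominal mixing ratios are exact, read depth is large, and there is no sequencing error; the function name is hypothetical and the values are not taken from the paper.

```python
def expected_site_diversity(parts_a, parts_b):
    """Idealized pi at a site fixed for different bases in two strains mixed parts_a:parts_b."""
    p = parts_a / (parts_a + parts_b)
    return 2 * p * (1 - p)


# The four strain combinations named in the figure caption.
for a, b in [(1, 1), (1, 2), (1, 4), (1, 19)]:
    print(f"{a}:{b} -> expected pi = {expected_site_diversity(a, b):.3f}")
```

Under these assumptions the expected signal shrinks as the mixture becomes more uneven, which is why the 1:19 combination is the most demanding test of separating real variants from sequencing artifacts.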
Figure. Regression of the percentage of the overall variance in π that is accounted for by fixed strain differences against the observed mean π at sites with known strain differences. The expected strain ratio is indicated for each point.