Shea N Gardner, Marisa W Lam, Jason R Smith, Clinton L Torres, Tom R Slezak.
Abstract
Sequencing pathogen genomes is costly, demanding careful allocation of limited sequencing resources. We built a computational Sequencing Analysis Pipeline (SAP) to guide decisions regarding the amount of genomic sequencing necessary to develop high-quality diagnostic DNA and protein signatures. SAP uses simulations to estimate the number of target genomes and close phylogenetic relatives (near neighbors or NNs) to sequence. Using Marburg and variola virus sequences, we apply SAP to assess whether draft data are sufficient or finished sequencing is required. Simulations indicate that intermediate to high-quality draft with error rates of 10⁻³ to 10⁻⁵ (approximately 8× coverage) of target organisms is suitable for DNA signature prediction. Low-quality draft with error rates of approximately 1% (3× to 6× coverage) of target isolates is inadequate for DNA signature prediction, although low-quality draft of NNs is sufficient, as long as the target genomes are of high quality. For protein signature prediction, sequencing errors in target genomes substantially reduce the detection of amino acid sequence conservation, even if the draft is of high quality. In summary, high-quality draft of target and low-quality draft of NNs appears to be a cost-effective investment for DNA signature prediction, but may lead to underestimation of predicted protein signatures.
Year: 2005 PMID: 16243783 PMCID: PMC1266063 DOI: 10.1093/nar/gki896
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Diagram of the SAP. For an SAP run, a pool of target genomes and a pool of NN genomes are first collected. Many random subsamples of target and NN genomes are then selected from the pools, and each subsample is run through either the DNA signature pipeline or the protein signature pipeline. These pipelines identify regions that are conserved among target genomes and unique relative to non-target genomes. Uniqueness is evaluated by comparison with a large sequence database of all currently available complete bacterial and viral genomes, or with the non-redundant protein database, excluding NNs from the NN pool that are not in that random subsample. Thus, each run of the SAP requires many runs of the DNA or protein signature pipelines with different random samples, generating a range of outcomes that are plotted on range plots.
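The subsampling loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `run_sap`, its parameters, and the placeholder `toy_pipeline` are hypothetical names, and the real DNA and protein signature pipelines are far more involved than the stand-in used here.

```python
import random

def run_sap(target_pool, nn_pool, pipeline, n_targets, n_nns, n_samples=10, seed=0):
    """Run `pipeline` on many random subsamples of the target and NN pools.

    Returns one outcome per random subsample (hypothetical interface).
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_samples):
        targets = rng.sample(target_pool, n_targets)
        nns = rng.sample(nn_pool, n_nns)
        # NNs in the pool but not in this subsample are to be excluded
        # from the background database used to judge uniqueness.
        excluded = [g for g in nn_pool if g not in nns]
        outcomes.append(pipeline(targets, nns, excluded))
    return outcomes

def toy_pipeline(targets, nns, excluded):
    # Placeholder for a signature pipeline: it only counts the genomes used.
    return len(targets) + len(nns)

outcomes = run_sap(["t1", "t2", "t3"], ["n1", "n2"], toy_pipeline,
                   n_targets=2, n_nns=1, n_samples=5)
```

Repeating this for each combination of sample sizes yields the set of outcomes that a range plot summarizes.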
Figure 2. Range plots of the conserved fraction of the target genome for Marburg virus (A) finished sequences and (B) draft sequences. The range of values from different random samples for a given sample size (number of target sequences) is drawn as a horizontal line. The 75th quantile of each range is marked with a short vertical tick. The conserved fraction using all of the target sequences is given in the box labeled ANS (for ‘answer’) and marked with a vertical line along this value on the x-axis.
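As a rough illustration of what each horizontal line and tick in a range plot encodes, the summary for one sample size could be computed as below. The function name `range_summary` and the example values are hypothetical, and the choice of quantile interpolation (`method="inclusive"`) is an assumption; the caption does not say how the quantile was calculated.

```python
import statistics

def range_summary(outcomes):
    """Return (minimum, maximum, 75th quantile) of the outcomes for one sample size.

    Needs at least two outcomes, as required by statistics.quantiles.
    """
    lo, hi = min(outcomes), max(outcomes)
    # n=4 gives the three quartile cut points; index 2 is the 75th quantile.
    q75 = statistics.quantiles(outcomes, n=4, method="inclusive")[2]
    return lo, hi, q75

# Illustrative values only, not data from the paper.
summary = range_summary([1, 2, 3, 4, 5])
```

The horizontal line would span `lo` to `hi`, and the short vertical tick would sit at `q75`.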
Figure 8. Range plots for the number of protein signature candidates for variola virus (A) finished sequence, (B) simulated draft target sequence with a low error rate, (C) simulated draft target sequence with an intermediate error rate and (D) simulated draft target sequence with a high error rate.
Figure 3. Range plots, as described in the methods, for the number of TaqMan signature candidates for Marburg virus for (A) finished and (B) draft sequences. To distinguish samples in which zero NNs were used, their range is drawn as a horizontal gray line; when the number of NNs n > 0, the range is drawn as a black line. The best estimate of the true value is the quality measure determined using the entire target and NN pools, and is represented by a vertical black line. This best estimate plus a constant c = 20 is at the location of the vertical dashed line and was selected to indicate a reasonable distance from the true answer. The 75% quantile for each range is shown with a black, vertical tick mark.
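One possible way to read the dashed line at the best estimate plus c is as a cutoff for deciding when a sample size is adequate. The sketch below finds the smallest sample size whose 75% quantile falls at or below that cutoff. This decision rule is an assumption made for illustration; the caption only says that c = 20 marks a reasonable distance from the true answer. The function name and the example values are hypothetical.

```python
def smallest_adequate_size(q75_by_size, best_estimate, c=20):
    """Return the smallest sample size whose 75% quantile is <= best_estimate + c.

    `q75_by_size` maps sample size -> 75% quantile of the signature counts.
    Returns None if no sample size meets the cutoff.
    """
    for size in sorted(q75_by_size):
        if q75_by_size[size] <= best_estimate + c:
            return size
    return None

# Illustrative values only, not data from the paper.
size = smallest_adequate_size({1: 90, 2: 45, 3: 15, 4: 3}, best_estimate=1)
```

With these made-up numbers the cutoff is 21, so the first sample size that qualifies is 3.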
Figure 5. Range plots for the number of TaqMan signature candidates for variola virus (A) finished target and NN sequences, (B) simulated draft target and NN sequences with a low error rate, (C) simulated draft target sequences with a low error rate and draft NN sequences with a high error rate and (D) simulated draft target sequences with an intermediate error rate and draft NN sequences with a high error rate.
Figure 6. Range plots for the number of TaqMan signature candidates for variola virus (A) simulated draft target with a high error rate and finished NNs, (B) finished target and draft NNs with a high error rate and (C) draft target and draft NNs with a high error rate.
Summary of results using 28 variola genomes (finished or simulated draft as indicated) and 22 NN genomes from the Orthopox family, as well as the finished and draft Marburg results
| Species | Simulated draft or finished target | Draft or finished NNs | Percent conserved sequence (%) | Percent conserved and unique sequence (%) | Number TaqMan DNA signature candidates | Number conserved and unique regions | Longest conserved and unique region |
|---|---|---|---|---|---|---|---|
| Variola major virus | Simulated draft, high error rate | Finished | 58.30 | 57.79 | 0 | 4 | 23 |
| Variola major virus | Simulated draft, high error rate | Simulated draft, high error rate | 58.68 | 58.36 | 0 | 8 | 23 |
| Variola major virus | Finished | Simulated draft, high error rate | 98.90 | 3.91 | 1 | 71 | 49 |
| Variola major virus | Simulated draft, low error rate | Finished | 98.61 | 4.60 | 1 | 89 | 49 |
| Variola major virus | Simulated draft, low error rate | Simulated draft, low error rate | 98.65 | 4.43 | 0 | 86 | 49 |
| Variola major virus | Simulated draft, low error rate | Simulated draft, high error rate | 98.76 | 4.42 | 0 | 88 | 49 |
| Variola major virus | Simulated draft, intermediate error rate | Simulated draft, high error rate | 96.67 | 14.06 | 0 | 191 | 52 |
| Variola major virus | Finished | Simulated draft, low error rate | 98.84 | 4.05 | 1 | 80 | 49 |
| Variola major virus | Finished | Finished | 98.90 | 3.99 | 1 | 76 | 49 |
| Marburg virus | Real draft, 3× to 6× coverage | Finished | 92.60 | 83.31 | 43 | 250 | 198 |
| Marburg virus | Finished | Finished | 75.19 | 74.36 | 0 | 38 | 41 |
The percent of the target genome that is conserved varies slightly among the runs using finished target sequences because different genomes were randomly selected to be the reference strain in each multiple sequence alignment.
Figure 4. Range plots for the conserved fraction of the target genome for variola virus for (A) finished sequence, (B) simulated draft sequence with a low error rate, (C) simulated draft with an intermediate error rate and (D) simulated draft with a high error rate.
Figure 7. Range plots for the number of protein signature candidates for Marburg virus (A) finished and (B) draft sequence data.
Marburg finished genomes used in these analyses
| Fasta header | Sequence length | Sequence description |
|---|---|---|
| gi\|13489275\|ref\|NC_001608.2\| Marburg virus, complete genome | 19 112 | Marburg virus, complete genome |
| Unpublished strain 1 | 19 113 | Unpublished sequence of Marburg virus |
| PP3 | 19 113 | AY430365 |
| PP4 | 19 112 | AY430366 |
| Ozolin | 19 151 | AY358025 |
| Unpublished strain 2 | 19 083 | Unpublished sequence of Marburg virus |
Draft Marburg sequence data (intermediate versions of the same finished sequences given above): unpublished strain 1, 17 contigs ranging from 859 bases to 29 302 bases; PP3, 15 contigs ranging from 778 bases to 24 884 bases; PP4, 15 contigs ranging from 779 bases to 19 728 bases; unpublished strain 2, 42 contigs ranging from 818 bases to 40 767 bases.
Near neighbors for Marburg virus
| Fasta header | Sequence length |
|---|---|
| gi\|23630482\|gb\|AY142960.1\| Zaire Ebola virus strain Mayinga subtype Zaire, complete genome | 18 959 |
| gi\|21702647\|gb\|AF499101.1\| Zaire Ebola virus strain Mayinga, complete genome | 18 960 |
| gi\|11761745\|gb\|AF272001.1\| Zaire Ebola virus strain Mayinga, complete genome | 18 959 |
| gi\|10313991\|ref\|NC_002549.1\| Zaire Ebola virus, complete genome | 18 959 |
| gi\|33860540\|gb\|AY354458.1\| Zaire Ebola virus strain Zaire 1995, complete genome | 18 961 |
| Raw sequence of Ebola virus strain Zaire-95 from LLNL on Aug 29 2003 1:13PM | 18 961 |
| gi\|15823608\|dbj\|AB050936.1\| Reston Ebola virus genomic RNA, complete genome | 18 890 |
| gi\|22789222\|ref\|NC_004161.1\| Reston Ebola virus, complete genome | 18 891 |
Variola major sequence data
| Fasta header | Sequence length | Sequence description |
|---|---|---|
| gi\|9627521\|ref\|NC_001611.1\| Variola virus, complete genome | 185 578 | Variola virus, complete genome |
| gi\|623595\|gb\|L22579.1\| VARCG Variola major virus (strain Bangladesh-1975) complete genome | 186 103 | Variola major virus (strain Bangladesh-1975) complete genome |
| 26 Unpublished CDC sequences | Various | Variola virus, complete genome |
Near neighbors for variola
| Fasta header | Sequence length | Sequence description |
|---|---|---|
| gi\|20152989\|gb\|AF482758.1\| Cowpox virus strain Brighton Red, complete genome | 224 501 | Cowpox virus strain Brighton Red, complete genome |
| gi\|30519405\|emb\|X94355.2\| CV41KBPL Cowpox virus strain GRI-90, complete genome | 223 666 | Cowpox virus strain GRI-90, complete genome |
| gi\|30844336\|ref\|NC_003663.2\| Cowpox virus, complete genome | 224 499 | Cowpox virus, complete genome |
| 1 Cowpox genome | – | Unpublished CDC sequence |
| gi\|17974913\|ref\|NC_003310.1\| Monkeypox virus, complete genome | 196 858 | Monkeypox virus, complete genome |
| 3 Monkeypox genomes | Various | Unpublished sequence from the CDC |
| gi\|29692106\|gb\|AY243312.1\| Vaccinia virus strain WR, complete genome | 194 711 | Vaccinia virus strain WR, complete genome |
| gi\|9790357\|ref\|NC_001559.1\| Vaccinia virus, complete genome | 191 737 | Vaccinia virus, complete genome |
| gi\|47088326\|gb\|AY603355.1\| Vaccinia virus strain Acambis 3000 Modified Virus Ankara (MVA), complete genome | 166 722 | Vaccinia virus strain Acambis 3000 Modified Virus Ankara (MVA), complete genome |
| 3 Vaccinia genomes | Various | Unpublished CDC sequence |
| gi\|22164589\|ref\|NC_004105.1\| Ectromelia virus, complete genome | 209 771 | Ectromelia virus, complete genome |
| Ectromelia virus (Naval) | 207 620 | |
| 1 Taterapox genome | – | Unpublished CDC sequence |
| gi\|18640237\|ref\|NC_003391.1\| Camelpox virus, complete genome | 205 719 | Camelpox virus, complete genome |
| gi\|19717929\|gb\|AY009089.1\| Camelpox virus CMS, complete genome | 202 205 | Camelpox virus CMS, complete genome |
| 1 Buffalopox genome | – | Unpublished CDC sequence |
| gi\|46401901\|ref\|NC_005858.1\| Rabbitpox virus, complete genome | 197 731 | Rabbitpox virus, complete genome |
| RPXV-UTR_forward_1-197731 | 197 731 | Raw sequence of Rabbitpox virus (strain Utrecht) from poxvirus.org on Aug 05 2003 1:59PM |