Osvaldo Zagordi, Martin Däumer, Christian Beisel, Niko Beerenwinkel.
Abstract
Recent advancements of sequencing technology have opened up unprecedented opportunities in many application areas. Virus samples can now be sequenced efficiently with very deep coverage to infer the genetic diversity of the underlying virus populations. Several sequencing platforms with different underlying technologies and performance characteristics are available for viral diversity studies. Here, we investigate how the differences between two common platforms provided by 454/Roche and Illumina affect viral diversity estimation and the reconstruction of viral haplotypes. Using a mixture of ten HIV clones sequenced with both platforms and additional simulation experiments, we assessed the trade-off between sequencing coverage, read length, and error rate. For fixed costs, short Illumina reads can be generated at higher coverage and allow for detecting variants at lower frequencies. They can also be sufficient to assess the diversity of the sample if sequences are dissimilar enough, but, in general, assembly of full-length haplotypes is feasible only with the longer 454/Roche reads. The quantitative comparison highlights the advantages and disadvantages of both platforms and provides guidance for the design of viral diversity studies.
Year: 2012 PMID: 23056573 PMCID: PMC3463535 DOI: 10.1371/journal.pone.0047046
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary statistics of sequencing experiments, read mapping, and error rates.
| Platform | PCR amplification | Total reads | Reads mapped to protease (10–93) | Mapped read length (mean ± sd) | Reads included in the analysis | Error rate [%] (mean ± sd) |
| 454/Roche | No | 16,540 | 668 | 232±18 | 668 | 0.59±0.02 |
| 454/Roche | Yes | 45,973 | 4,331 | 236±18 | 4,331 | 1.09±0.01 |
| Illumina GA | No | 12,559,696 | 1,505,619 | 36 | 11,835 | 0.17±0.01 |
| Illumina GA | Yes | 12,242,508 | 1,346,481 | 36 | 8,904 | 0.38±0.01 |
For all four experiments, the total number of reads obtained and those overlapping amino acids 10 to 93 of the protease are reported. All 454/Roche reads mapping to this region were used in the haplotype reconstruction. For the Illumina Genome Analyzer, only those mapping to the region of highest entropy were considered. The last column reports mean and standard deviation of the sequencing error rate (1 – θ, where the parameter θ is estimated during haplotype reconstruction).
Figure 1. Diversity of the protease region measured on the multiple sequence alignments.
The plot shows the Shannon entropy of each column of the multiple sequence alignment of all mapped reads (orange bars) and its moving average in a window of 35 bp (blue lines). Numbering of bases follows the nucleotide position on the protease, i.e., position 1 corresponds to position 2253 on HXB2. As a reference, the top subfigure shows the diversity of the mixture of the original ten clones assuming equal frequencies. The remaining subfigures refer to the four sequencing experiments using either 454/Roche or Illumina GA and PCR amplification or not.
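The two quantities plotted in Figure 1 can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and the logarithm base (bits), the treatment of every character (including gaps) as a symbol, and the non-centered sliding window are assumptions that the caption does not specify.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Shannon entropy of each alignment column.

    Assumptions (not stated in the caption): entropy is in bits, all rows
    have equal length, and every character counts as a symbol.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        # H = sum over symbols of p * log2(1 / p)
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


def moving_average(values, window=35):
    """Mean of each full window of `window` consecutive values.

    Returns len(values) - window + 1 averages; how the figure aligns the
    window with the x-axis is not specified, so no centering is applied.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages
```

For a real alignment, `moving_average(column_entropies(reads), window=35)` would give the smoothed curve, and the window with the largest average would be one way to pick a region of highest entropy.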
Performance of local haplotype reconstruction.
| Platform | PCR amplification | Reconstructed | TP | FP | FN | Sensitivity [%] | Specificity [%] |
| 454/Roche | No | 13 | 5 | 8 | 5 | 50 | 38 |
| 454/Roche | Yes | 30 | 6 | 24 | 4 | 60 | 20 |
| Illumina GA | No | 10 | 9 | 1 | 1 | 90 | 90 |
| Illumina GA | Yes | 10 | 6 | 4 | 4 | 60 | 60 |
For all four experiments, we report the total number of predicted haplotypes (column Reconstructed), the number of correct haplotypes (true positives, TP), the number of reconstructed haplotypes that do not match any of the original clones (false positives, FP), and the number of missed haplotypes (false negatives, FN). The number of false negatives equals 10 – TP, because the sample contains ten haplotypes in total. Sensitivity is defined as TP/(TP+FN) and specificity as TP/(TP+FP). Local haplotype reconstruction was performed on the 252 bp region of the HIV pol gene coding for protease amino acids 10 to 93 for the 454/Roche data, and on the 35 bp subregion of highest entropy for the Illumina reads.
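The definitions above can be checked against the table with a short sketch. The function name and the `n_true` parameter are hypothetical. Note that the ratio TP/(TP+FP), called specificity here, is the quantity often referred to elsewhere as precision or positive predictive value; the code keeps the table's naming.

```python
def reconstruction_metrics(tp, fp, n_true=10):
    """Return (FN, sensitivity, specificity) as defined in the table legend.

    n_true is the number of haplotypes actually present in the sample
    (ten clones in this study), so FN = n_true - TP.
    """
    if not 0 <= tp <= n_true:
        raise ValueError("tp must lie between 0 and n_true")
    fn = n_true - tp
    sensitivity = tp / (tp + fn)
    # Guard against an empty reconstruction (no haplotypes predicted).
    specificity = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return fn, sensitivity, specificity


# TP and FP values taken from the table rows above.
rows = {
    ("454/Roche", "No"): (5, 8),
    ("454/Roche", "Yes"): (6, 24),
    ("Illumina GA", "No"): (9, 1),
    ("Illumina GA", "Yes"): (6, 4),
}
for (platform, pcr), (tp, fp) in rows.items():
    fn, sens, spec = reconstruction_metrics(tp, fp)
    print(platform, pcr, fn, round(100 * sens), round(100 * spec))
```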
Frequencies of all perfectly reconstructed haplotypes.
| Platform | PCR amplification | Method | 07-56681 | 07-54825 | 07-56951 | 08-59712 | 08-04134 | 08-01315 | 08-02659 | 08-57881 | 08-04512 | Total |
| 454/Roche | No | ShoRAH | 10.6 | 14.1 | 14.1 | 13.9 | 4.9 | — | — | — | — | 57.6 |
| 454/Roche | No | Direct mapping | 27.3 | 21.2 | 30.0 | 11.0 | 7.1 | 2.1 | 0.3 | 0.3 | 0.1 | 99.4 |
| 454/Roche | Yes | ShoRAH | 3.6 | 15.7 | 22.0 | 11.4 | 7.0 | 0.3 | — | — | — | 60.0 |
| 454/Roche | Yes | Direct mapping | 6.0 | 34.3 | 37.2 | 9.6 | 11.7 | 0.4 | 0.4 | 0.1 | 0.2 | 99.9 |
| Illumina GA | No | ShoRAH | 53.1 | 19.5 | 15.1 | 7.2 | 2.7 | 1.6 | 0.2 | 0.2 | 0.2 | 99.8 |
| Illumina GA | No | Direct mapping | 41.7 | 15.4 | 24.8 | 10.3 | 4.5 | 1.5 | 0.3 | 0.3 | 0.1 | 98.9 |
| Illumina GA | Yes | ShoRAH | 7.6 | 46.8 | 27.1 | 7.3 | 5.3 | 1.9 | — | — | — | 96.0 |
| Illumina GA | Yes | Direct mapping | 5.9 | 34.7 | 36.6 | 10.4 | 10.3 | 0.7 | 0.6 | 0.2 | 0.3 | 99.7 |
Reported are, for all four experiments, the relative frequencies in percent of the reconstructed haplotypes matching exactly one of the original clones (named 07-56681, …, 08-04512) as estimated by direct mapping and by ShoRAH. Undetected haplotypes are indicated by a dash (‘—’).
Figure 2. Global haplotype reconstruction at high diversity.
The mean distance between clones of the underlying population was 7.5%. For global haplotype reconstruction, different conditions were tested, including varying read lengths (1st column: 36 bases, 2nd column: 75 bases, 3rd column: 150 bases), numbers of reads (1st row: 10,000, 2nd row: 20,000, 3rd row: 50,000), and sequencing error rates (grey: 0.01%, orange: 0.05%, blue: 0.1% per base). The y-axis reports reconstruction performance as the proportion close (φ), defined as the fraction of reconstructed haplotypes that have at most q mismatches with respect to the original clones. The genomic region considered codes for amino acids 10 to 93 of the HIV protease.
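As an illustration of the proportion close (φ) described in the caption, the following sketch counts reconstructed haplotypes that lie within q mismatches of at least one original clone. It is an assumption-laden example rather than the authors' implementation: the function names are hypothetical, sequences are assumed to have equal length, and mismatches are counted as Hamming distance to the nearest clone.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def proportion_close(reconstructed, clones, q):
    """Fraction of reconstructed haplotypes within q mismatches of a clone.

    Assumption: a haplotype counts as close if its smallest Hamming
    distance to any original clone is at most q.
    """
    if not clones:
        raise ValueError("at least one original clone is required")
    if not reconstructed:
        return 0.0
    close = sum(
        1
        for hap in reconstructed
        if min(hamming(hap, clone) for clone in clones) <= q
    )
    return close / len(reconstructed)


# Toy sequences for illustration only (not data from the study).
clones = ["ACGTACGT", "TTTTACGT"]
reconstructed = ["ACGTACGT", "ACGTACGA", "GGGGGGGG"]
for q in (0, 1, 6):
    print(q, proportion_close(reconstructed, clones, q))
```

Evaluating φ over a range of q values, as on the figure's axes, shows how quickly the reconstructed set approaches the true clones as the mismatch tolerance grows.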
Figure 3. Global haplotype reconstruction at low diversity.
Same as Figure 2, but the mean distance between clones is 1.9%.