Irina Astrovskaya, Bassam Tork, Serghei Mangul, Kelly Westbrooks, Ion Măndoiu, Peter Balfe, Alex Zelikovsky.
Abstract
BACKGROUND: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences.
Year: 2011 PMID: 21989211 PMCID: PMC3194189 DOI: 10.1186/1471-2105-12-S6-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. ViSpA’s flowchart.
Figure 2. Statistical validation on error-free reads from known HCV quasispecies. Left: PPV and sensitivity as a function of the number of quasispecies in the original population (40K reads with average read length 300). Right: the relative entropy as a function of the average read length (40K reads from 10 quasispecies).
Comparison of three methods – ViSpA, ShoRAH, and ShoRAHreads+ViSpA – on the read data simulated by FlowSim.
| | ShoRAH, PPV | ShoRAH, Sensitivity | ShoRAH, Reproducibility: Max | ShoRAH, Reproducibility: Average | ViSpA, PPV | ViSpA, Sensitivity | ViSpA, Reproducibility: Max | ViSpA, Reproducibility: Average | ShoRAHreads+ViSpA, PPV | ShoRAHreads+ViSpA, Sensitivity | ShoRAHreads+ViSpA, Reproducibility: Max | ShoRAHreads+ViSpA, Reproducibility: Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k=0 | 0.0097 | 0.3 | 0.45 | 0.11 | 0.0008 | 0.1 | 0.1 | 0.1 | 0.5 | 0.5 | 0.95 | 0.95 |
| k=1 | 0.0129 | 0.4 | 0.6 | 0.32 | 0.0008 | 0.1 | 0.1 | 0.1 | 0.5 | 0.5 | 0.95 | 0.95 |
| k=9 | 0.0162 | 0.5 | 0.95 | 0.64 | 0.0015 | 0.2 | 0.1 | 0.1 | 0.5 | 1 | 0.95 | 0.95 |
A quasispecies sequence is considered found if one of the candidate sequences matches it exactly (k = 0) or with at most k (1 or 9) mismatches. All methods are run 100 times on 10%-reduced data. For the i-th (i = 1, ..., 10) most frequent sequence assembled on the whole data, we record its reproducibility, i.e., the percentage of runs in which there is a match (exact or with at most k mismatches) among the 10 most frequent sequences found on the reduced data. "Reproducibility: Max" and "Reproducibility: Average" report, respectively, the maximum and the average of those percentages.
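The "found with at most k mismatches" criterion can be illustrated with a minimal Python sketch. It assumes that sequences are already aligned and of equal length, that sensitivity is the fraction of true quasispecies matched by some candidate, and that PPV is the fraction of candidates matching some true quasispecies. These definitions are the standard ones and are assumptions here, since the note above does not spell them out. The function names are hypothetical and are not part of ViSpA or ShoRAH.

```python
def mismatches(a: str, b: str) -> int:
    """Count mismatching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def is_found(target: str, pool: list[str], k: int) -> bool:
    """True if some sequence in pool matches target with at most k mismatches."""
    return any(mismatches(target, s) <= k for s in pool)


def sensitivity(true_seqs: list[str], candidates: list[str], k: int) -> float:
    """Assumed definition: fraction of true quasispecies that are found."""
    return sum(is_found(t, candidates, k) for t in true_seqs) / len(true_seqs)


def ppv(true_seqs: list[str], candidates: list[str], k: int) -> float:
    """Assumed definition: fraction of candidates that match a true quasispecies."""
    return sum(is_found(c, true_seqs, k) for c in candidates) / len(candidates)


# Toy data, invented purely for illustration.
true_seqs = ["ACGT", "AAAA"]
candidates = ["ACGT", "AAAT", "GGGG", "CCCC"]
for k in (0, 1):
    print(k, sensitivity(true_seqs, candidates, k), ppv(true_seqs, candidates, k))
```

Raising k can only keep or increase both quantities for a fixed candidate set, which is consistent with the trend across the k=0, k=1, and k=9 rows of the table.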
Figure 3. Percentage of candidate sequences whose cumulative frequency is 85%, 90%, and 95%. The values on the x-axis correspond to the number of mismatches allowed during read graph construction; n_m means that up to n mismatches are allowed in superreads and up to m mismatches are allowed in edges.
Figure 4. The neighbor-joining phylogenetic tree for the 10 most frequent HCV quasispecies variants on a 5,205 bp-long fragment obtained by ViSpA and ShoRAH. Sequences are labeled with the software name and their rank among the 10 most frequent assembled sequences.
Figure 5. Percentage of runs in which the i-th most frequent sequence is reproduced among the 10 most frequent quasispecies assembled on the 10%-reduced set of reads. The i-th point on the x-axis corresponds to the i-th most frequent sequence assembled on 100% of the reads. No data are shown for sequences that are reproduced in fewer than 5% of runs.
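The reproducibility measure plotted in Figure 5 could be computed along the following lines. This is a minimal sketch, not the authors' code: it assumes aligned, equal-length sequences, takes the per-run lists of most frequent sequences as given, and treats a sequence as reproduced when a run contains a sequence within k mismatches of it. The function names and the toy data are hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Count mismatching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def reproducibility(full_top: list[str],
                    reduced_runs: list[list[str]],
                    k: int = 0) -> list[float]:
    """For each sequence ranked on the full data, return the percentage of
    reduced-data runs whose top sequences contain a match within k mismatches."""
    percentages = []
    for seq in full_top:
        hits = sum(
            any(mismatches(seq, cand) <= k for cand in run)
            for run in reduced_runs
        )
        percentages.append(100.0 * hits / len(reduced_runs))
    return percentages


# Toy data, invented for illustration: 2 ranked sequences and 4 reduced runs.
full_top = ["ACGT", "AAAA"]
reduced_runs = [["ACGT", "TTTT"], ["ACGT", "AAAT"], ["CCCC"], ["ACGT"]]
per_rank = reproducibility(full_top, reduced_runs, k=1)
print(per_rank, max(per_rank), sum(per_rank) / len(per_rank))
```

In the study's setting, `full_top` would hold the 10 most frequent sequences assembled from all reads and `reduced_runs` would hold 100 lists of the 10 most frequent sequences, one per run on the reduced read set.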