Gonzalo Yebra, Emma B Hodcroft, Manon L Ragonnet-Cronin, Deenan Pillay, Andrew J Leigh Brown.
Abstract
HIV molecular epidemiology studies analyse viral pol gene sequences because they are widely available, but whole-genome sequencing allows other genes to be used. We aimed to determine which gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA_HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to that of the corresponding true tree using CompareTree. The accuracy of the trees increased significantly with the length of the sequences used, with the gag-pol-env datasets showing the best performance and the gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.
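The accuracy measure used here, the proportion of splits shared between an inferred tree and the true tree, can be illustrated with a minimal sketch. This is not the CompareTree implementation: the nested-tuple tree encoding, the function names and the choice to divide by the number of splits in the true tree are all assumptions made for illustration. For two fully resolved trees on the same taxa, both trees have the same number of non-trivial splits, so the choice of denominator does not matter.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the non-trivial bipartitions (splits) of the unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation
    regardless of where the tree happens to be rooted.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single leaf versus the rest) and the root.
        if 2 <= len(clade) <= len(all_leaves) - 2:
            found.add(clade if reference not in clade else all_leaves - clade)
        return clade

    visit(tree)
    return found


def shared_split_proportion(true_tree: Tree, inferred_tree: Tree) -> float:
    """Fraction of the true tree's non-trivial splits recovered by the inferred tree."""
    true_splits = nontrivial_splits(true_tree)
    if not true_splits:
        raise ValueError("the true tree has no non-trivial splits")
    return len(true_splits & nontrivial_splits(inferred_tree)) / len(true_splits)
```

A value of 1 means every split of the true tree was recovered. Values such as 0.965 in the table below would correspond to about 96.5% of splits being shared.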
Year: 2016 PMID: 28008945 PMCID: PMC5180198 DOI: 10.1038/srep39489
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. (A) Proportion of maximum-likelihood tree splits shared with the true tree for each gene and sampling coverage level. Genes are sorted according to length. The bottom and top limits of the boxes represent, respectively, the first and third quartiles (the distance between them is the inter-quartile range, IQR). The lines (whiskers) extend to the lowest and highest values that lie within 1.5 × IQR of the first and third quartiles, respectively. Data points outside this range are outliers. (B) Proportion of maximum-likelihood tree splits shared with the true tree according to gene length. All sampling coverage levels were considered together (see Supplementary Figure 1 for an analysis broken down by sampling coverage level). The regression line is shown in blue, together with its formula, the correlation coefficient (R²) and the p-value. The shaded area shows the confidence interval of the regression line. The grey, dotted vertical lines show the length of each gene considered.
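The box-plot convention described in panel (A) can be sketched as follows. This is an illustrative reconstruction, not the plotting code used for the figure: the function name is hypothetical, and the quartile interpolation method (`method="inclusive"`) is an assumption, since plotting packages differ slightly in how they compute quartiles.

```python
import statistics
from typing import Dict, List, Sequence, Union


def box_plot_summary(values: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Summarise values using the 1.5 x IQR whisker rule.

    Returns the quartiles, the IQR, the whisker ends (the most extreme
    observations still inside the 1.5 x IQR fences) and the outliers.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    # Assumption: linear interpolation between data points ("inclusive" method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }
```

Applied to the 100 replicate accuracies for one gene and coverage level, such a summary gives the box limits, whisker ends and outlying replicates drawn in the panel.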
Proportion of maximum-likelihood tree splits shared with the true tree, by gene and sampling coverage level.
| Gene | Length (bp) | All coverage levels | 100% coverage | 60% coverage | 20% coverage | 5% coverage |
|---|---|---|---|---|---|---|
| gag-pol-env | 6987 | 0.965 (0.964–0.966) | 0.967 | 0.971 (0.970–0.971) | 0.965 (0.964–0.966) | 0.959 (0.957–0.961) |
| gag-pol | 4479 | 0.951 (0.950–0.952) | 0.954 | 0.953 (0.953–0.954) | 0.950 (0.948–0.951) | 0.950 (0.948–0.953) |
| pol | 3000 | 0.934 (0.933–0.935) | 0.936 | 0.935 (0.934–0.935) | 0.933 (0.931–0.934) | 0.936 (0.933–0.938) |
| env | 2508 | 0.932 (0.930–0.934) | 0.947 | 0.946 (0.945–0.946) | 0.935 (0.934–0.936) | 0.915 (0.912–0.918) |
| gag | 1479 | 0.879 (0.877–0.880) | 0.879 | 0.880 (0.879–0.881) | 0.880 (0.878–0.881) | 0.877 (0.873–0.880) |
| Partial pol | 1302 | 0.867 (0.866–0.869) | 0.868 | 0.870 (0.869–0.871) | 0.875 (0.873–0.877) | 0.857 (0.853–0.861) |
The table shows the mean value and its 95% confidence intervals for the 100 replicates performed in each case. Note that for the full dataset (100% sampling coverage) only one estimation is shown because no replicates can be performed. The genes are ordered in descending order of sequence length.
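As a rough illustration of how a mean and 95% confidence interval can be obtained from a set of replicate accuracies, the sketch below uses a normal approximation (mean ± 1.96 standard errors). The source does not state which interval method was used, so this choice, like the function name, is an assumption.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci95(replicates: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a 95% confidence interval of the mean.

    Assumption: normal approximation, i.e. mean +/- 1.96 * standard error.
    At least two replicates are required, which is why a single full
    dataset (100% coverage) yields a point estimate only.
    """
    if len(replicates) < 2:
        raise ValueError("a confidence interval needs at least two replicates")
    mean = statistics.mean(replicates)
    standard_error = statistics.stdev(replicates) / math.sqrt(len(replicates))
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width
```

With 100 replicates per gene and coverage level, the standard error is small, which is consistent with the narrow intervals reported in the table.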