Therese A Catanach, Andrew D Sweet, Nam-Phuong D Nguyen, Rhiannon M Peery, Andrew H Debevec, Andrea K Thomer, Amanda C Owings, Bret M Boyd, Aron D Katz, Felipe N Soto-Adames, Julie M Allen.
Abstract
Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step given the recent surge in DNA sequence data. Much of this sequence data is publicly available but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected "by eye" prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear whether these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach with by-eye manual editing, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared the tree topologies resulting from these alignment methods using multiple metrics. We further determined whether the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies.
Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in extremely difficult-to-align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.
Keywords: Automated alignment; Genome; HBV; Manual alignment; Virus; s-region
Year: 2019 PMID: 30627489 PMCID: PMC6321758 DOI: 10.7717/peerj.6142
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Genetic map of the hepatitis B virus genome.
Arrows indicate the reading frames of genes.
Comparison of different alignment approaches used for the genome and total (genomes + fragments) data sets from HBV.
| Approach | Genome data set | Total (genomes + fragments) data set |
|---|---|---|
| Traditional | MUSCLE + manual | MUSCLE + manual |
| Automated | Multiple methods (See | UPP: Genomes-manual backbone |
| Automated | PASTA | UPP: PASTA backbone |
Figure 2. Workflow outline for the alignment of hepatitis B virus genome and total (genomes + fragments) data set.
Additional details for each step are illustrated in Fig. S1.
Comparison of alignment methods.
All alignments were attempted using Genomes_Manual_degapped.fasta as the input file.
| Method | Alignment length | Wall clock time | Method reference |
|---|---|---|---|
| Manual | 4,269 | NA | |
| PASTA | 3,423 | 3.5 h | |
| MAFFT | 4,578 | 4 min | Mafft 7.305b; |
| MUSCLE | 4,938 | 32 h | Muscle v3.8.31; |
| Clustal Omega | 3,846 | 2.25 h | Clustal Omega 1.2.4; |
Figure 3. Heatmaps representing the pairwise comparisons of hepatitis B virus phylogenies.
(A) Ratio of incompatible edges to total edges in a comparison. (B) Robinson–Foulds distances between two phylogenies. Darker cells indicate two trees with greater differences. Alignments used to estimate the phylogenies are indicated on the x- and y-axes.
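The two comparison metrics named in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: trees are written as nested tuples of tip names, the function names are hypothetical, and the incompatible-edge ratio follows one plausible reading of the caption (an internal edge counts as incompatible when its bipartition conflicts with at least one bipartition in the other tree, and the total is the number of internal edges in both trees).

```python
def leaf_set(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (internal edges) of a tree.

    Each split is stored as the side that does NOT contain the
    alphabetically first tip, so the same edge always gets the same key.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Count splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def compatible(s1, s2, all_leaves):
    """Two splits are compatible if one of the four intersections is empty."""
    c1, c2 = all_leaves - s1, all_leaves - s2
    return not (s1 & s2) or not (s1 & c2) or not (c1 & s2) or not (c1 & c2)


def incompatible_ratio(tree_a, tree_b):
    """Assumed definition: conflicting internal edges / all internal edges."""
    leaves = leaf_set(tree_a)
    sa, sb = splits(tree_a), splits(tree_b)
    total = len(sa) + len(sb)
    if total == 0:
        return 0.0
    bad = sum(1 for s in sa if any(not compatible(s, t, leaves) for t in sb))
    bad += sum(1 for t in sb if any(not compatible(t, s, leaves) for s in sa))
    return bad / total


# Toy trees (hypothetical tip names, same five tips in each).
tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
tree_3 = ("A", "B", "C", ("D", "E"))  # tree_1 with one edge collapsed

print(robinson_foulds(tree_1, tree_2), incompatible_ratio(tree_1, tree_2))
print(robinson_foulds(tree_1, tree_3), incompatible_ratio(tree_1, tree_3))
```

In the second comparison the Robinson–Foulds distance is non-zero while the incompatible-edge ratio is zero: a collapsed edge is missing from one tree but contradicts nothing in the other. This is why the two metrics can diverge when weakly supported branches are collapsed.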
Figure 4. Box-and-whisker plots showing the distributions of pairwise comparison values for hepatitis B virus phylogenies.
(A) and (B) show the ratio of incompatible edges to total edges in a comparison. (C) and (D) show Robinson–Foulds distances. Alignments used to estimate the phylogenies are indicated on the x-axis. (A) and (C) are colored according to the edge collapse threshold (bootstrap value). (B) and (D) are colored according to the second tree in a comparison.
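The edge collapse threshold mentioned in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each internal node is a hypothetical `(support, children)` pair, tips are plain strings, and the support values and thresholds are invented for the example.

```python
def collapse(node, threshold):
    """Return a copy of the tree with weakly supported internal edges removed.

    An internal node whose bootstrap support is below `threshold` is
    dissolved, and its children are attached directly to its parent,
    which creates a polytomy. The root's own support value is ignored.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            new_children.extend(child[1])  # splice grandchildren into this node
        else:
            new_children.append(child)
    return (support, new_children)


# Toy tree with invented bootstrap values on internal nodes.
toy_tree = (100, [(95, ["A", "B"]), (40, ["C", (80, ["D", "E"])])])

print(collapse(toy_tree, 70))  # only the edge with support 40 is removed
print(collapse(toy_tree, 90))  # the edges with support 40 and 80 are removed
```

Raising the threshold removes more edges, so trees compared after collapsing share fewer resolved branches that could disagree.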
Figure 5. Hepatitis B virus genotype occupancy.
(A) Histograms showing the proportion of each genotype that makes up the minimum clade including all individuals of that genotype. Genotypes are indicated on the x-axis, and genotype occupancy is shown as percentages along the y-axis. Bootstrap support collapse thresholds are indicated above each panel. (B) Fully bifurcating (i.e., no branches collapsed), midpoint-rooted cladogram of HBV sequences from the PASTA genome alignment. Tips are colored according to genotype.
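Genotype occupancy as described in panel (A) can be read as: find the smallest clade that contains every tip of a genotype, then report what percentage of that clade's tips belong to the genotype. The sketch below follows that reading on a rooted nested-tuple tree. The function names, tip names, and genotype labels are hypothetical, and the authors' actual implementation is not shown in this record.

```python
def tips(node):
    """List the tip names under a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]


def smallest_clade(node, targets):
    """Return the tips of the smallest clade containing all `targets`."""
    if not isinstance(node, str):
        for child in node:
            if targets <= set(tips(child)):
                return smallest_clade(child, targets)
    return tips(node)


def genotype_occupancy(tree, genotype_of):
    """Percentage of each genotype's minimum clade made up of that genotype."""
    result = {}
    for genotype in sorted(set(genotype_of.values())):
        members = {t for t, g in genotype_of.items() if g == genotype}
        clade = smallest_clade(tree, members)
        result[genotype] = 100.0 * len(members) / len(clade)
    return result


# Toy rooted tree: genotype A is monophyletic, genotype B is not.
toy_tree = ((("a1", "a2"), "b1"), ("b2", "b3"))
toy_genotypes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B"}

print(genotype_occupancy(toy_tree, toy_genotypes))
```

A value of 100% means the genotype forms a monophyletic group. Lower values mean tips from other genotypes fall inside its minimum clade. The result depends on where the tree is rooted, which is why the rooting stated in panel (B) matters.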