Edgar Garriga, Paolo Di Tommaso, Cedrik Magis, Ionas Erb, Leila Mansouri, Athanasios Baltzis, Hafid Laayouni, Fyodor Kondrashov, Evan Floden, Cedric Notredame.
Abstract
Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works in the opposite direction to the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
Year: 2019 PMID: 31792410 PMCID: PMC6894943 DOI: 10.1038/s41587-019-0333-6
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Regressive algorithm overview
(A) Parent and children sub-MSAs are merged via their common sequence (blue) whose indels are projected from child to parent (green) and parent to child (red). (B) The sub-MSAs are produced after collecting sequences from a binary guide tree with each node labelled with the name of its longest descendant sequence. Sequences are collected by traversing the tree in a breadth-first fashion. Pale red colour blocks indicate how the N parent sequences (N=3) are collected by recursively expanding nodes. The same process is then applied to gather the children (green) and the grandchildren (blue). (C) In the nine resulting sub-MSAs that are displayed, one should note the presence of a common representative sequence between each child and its parent.
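The two steps described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the helpers `join`, `collect`, `sub_msa_groups` and `merge`, and the toy sequences are all hypothetical names and data introduced here for illustration. The sketch assumes a binary guide tree whose internal nodes are labelled with their longest descendant, collects N nodes per sub-MSA by breadth-first expansion (panel B), and merges a parent and a child sub-MSA by projecting the gaps of their common sequence in both directions (panel A). The sub-MSAs themselves are assumed to come from some external aligner.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str                      # label: name of the longest descendant sequence
    length: int                    # length of that sequence
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def join(a: Node, b: Node) -> Node:
    """Build an internal node labelled with the longer of its children's labels."""
    rep = a if a.length >= b.length else b
    return Node(rep.name, rep.length, a, b)


def collect(root: Node, n: int) -> List[Node]:
    """Expand nodes breadth-first until n nodes are held or only leaves remain."""
    frontier, leaves = deque([root]), []
    while frontier and len(frontier) + len(leaves) < n:
        node = frontier.popleft()
        if node.is_leaf():
            leaves.append(node)
        else:
            frontier.extend([node.left, node.right])
    return leaves + list(frontier)


def sub_msa_groups(root: Node, n: int) -> List[List[str]]:
    """List the sequence names of every sub-MSA, parent first, then descendants."""
    if n < 2:
        raise ValueError("n must be at least 2")
    groups, queue = [], deque([root])
    while queue:
        members = collect(queue.popleft(), n)
        groups.append([m.name for m in members])
        # Each collected internal node seeds a child sub-MSA that contains its label.
        queue.extend(m for m in members if not m.is_leaf())
    return groups


def merge(parent: Dict[str, str], child: Dict[str, str], common: str) -> Dict[str, str]:
    """Merge two aligned sub-MSAs (name -> gapped row) through their common sequence."""
    p, c = parent[common], child[common]
    if p.replace("-", "") != c.replace("-", ""):
        raise ValueError("common sequence differs between the two sub-MSAs")
    out: Dict[str, List[str]] = {name: [] for name in {**parent, **child}}
    i = j = 0
    while i < len(p) or j < len(c):
        if i < len(p) and p[i] == "-":
            use_p, use_c = True, False    # parent-only column: gap projected onto child rows
        elif j < len(c) and c[j] == "-":
            use_p, use_c = False, True    # child-only column: gap projected onto parent rows
        else:
            use_p = use_c = True          # same residue of the common sequence in both
        for name, row in out.items():
            if use_p and name in parent:
                row.append(parent[name][i])
            elif use_c and name in child:
                row.append(child[name][j])
            else:
                row.append("-")
        i += int(use_p)
        j += int(use_c)
    return {name: "".join(row) for name, row in out.items()}


# Toy guide tree: four leaves with made-up lengths; "c" is the longest sequence.
tree = join(join(Node("a", 10), Node("b", 8)), join(Node("c", 12), Node("d", 5)))
groups = sub_msa_groups(tree, 3)

# Toy sub-MSAs sharing the representative "c" (hand-written, not aligner output).
merged = merge({"c": "AC-GT", "a": "ACCGT"}, {"c": "A-CGT", "d": "ATCG-"}, "c")
```

With this toy tree, the parent group is `["c", "a", "b"]` and the single child group is `["c", "d"]`, so the two sub-MSAs share the representative `c`, which is what makes the merge possible. Because every sub-MSA holds at most N sequences, the cost of each call to the underlying aligner stays bounded however large the full dataset is.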
Total Column Score and average CPU time (s) on the 20 HomFam datasets containing over 10,000 sequences.
| Tree method | MSA algorithm | Total column score (%), non-regressive | Total column score (%), regressive | Total column score (%), reference | CPU time (s), non-regressive | CPU time (s), regressive |
|---|---|---|---|---|---|---|
| PartTree | Fftns1 | 29.64 | 35.16 | 47.84 | 334 | 118 |
| mBed | Fftns1 | 41.33 | 37.94 | 52.03 | 277 | 156 |
| PartTree | ClustalO | 26.94 | 42.21 | 50.54 | 3,017 | 377 |
| mBed | ClustalO | 39.03 | 41.91 | 53.71 | 570 | 338 |
| default/mBed | UPP | 44.93 | 47.15 | 49.78 | 8,354 | 7,186 |
| default/mBed | Sparsecore | 44.98 | 51.06 | 53.50 | 2,313 | 3,184 |
| PartTree | Gins1 | - | 47.54 | 49.46 | - | 12,478 |
| mBed | Gins1 | - | 50.20 | 53.07 | - | 10,834 |
Figure 2. Relative performances of alternative MSA algorithm combinations.
(A) Average differential accuracy on datasets containing more sequences than the number given on the horizontal axis. The differences in accuracy are measured between the reference sequence MSAs and their embedded projections in the large datasets. For each combination, n=75 independent MSA samples. The envelope is the standard deviation. (B) In this constrained correspondence analysis (CCA), the first component (horizontal axis, 14.1% of the variance) is constrained to be the total column score accuracy as measured on datasets larger than 10,000 sequences. The best unconstrained component (vertical axis) explains 20.8% of the remaining variance. Combinations (dots with their accuracy on the lower horizontal axis) are categorized by their guide tree (blue), MSA algorithm (grey) and regressive/non-regressive procedure (red). Vectors indicate the contributions to variance of each category from the three variables. Their projection onto the upper horizontal axis quantifies the contribution to variance of overall accuracy. For each combination, represented as a dot, n=20 independent MSA samples.
Figure 3. CPU requirements of the regressive algorithm on HomFam datasets containing more than 10,000 sequences.
(A) The total CPU requirements (horizontal axis) and average total column score accuracies (vertical axis). The corresponding non-regressive (blue squares) and regressive (red circles) combinations are connected by a dashed line, with the exception of Gins1, for which the non-regressive computation costs are prohibitive. For each combination, represented as a circle or a square, n=20 independent MSA samples. (B) Comparison of CPU time requirements for ClustalO with mBed trees, using a regressive and a non-regressive procedure, on HomFam datasets containing more than 10,000 sequences. Each point represents an independent MSA. n=20 independent MSA samples. A linear regression (grey) was fitted on the resulting graph (R² = 0.89, p-value = 6.9 × 10⁻¹⁰).