Bryce Kille, Advait Balaji, Fritz J Sedlazeck, Michael Nute, Todd J Treangen.
Abstract
With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically relevant insights.
Keywords: Comparative genomics; Homology; Multiple genome alignment; Synteny
Year: 2022 PMID: 36038949 PMCID: PMC9421119 DOI: 10.1186/s13059-022-02735-6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 An example of evolution resulting in different classes of homology. In (a), deletion and speciation events are depicted, resulting in both orthologs and paralogs. The resulting relationships between all pairs of X segments as well as between all pairs of Y segments are depicted in (b). Notably, segments and participate in three important homology relationship types. As the most recent common ancestor of and is at a speciation event, they are orthologous to each other. Segments and on the other hand are paralogs, as their homology is a result of duplication. Finally, and are special types of paralogs known as “false orthologs”, i.e., paralogous segments all of whose other copies in their respective genomes have since been deleted, resulting in two segments which appear to never have been duplicated. The presence of such false orthologies complicates the problem of core-genome alignment particularly, but also the problem of further categorizing homologies identified through genome alignment. It is worth noting that homology relationships within a single species’ genome do exist, but are not depicted here. For example, and are all paralogous to each other
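The rule the caption applies (a pair is orthologous when its most recent common ancestor is a speciation event, and paralogous when it is a duplication) can be illustrated with a minimal sketch. The tree, the segment names (`A_X1`, `B_X2`, ...) and the helpers `path_to` and `classify` below are hypothetical and do not correspond to the labels in Fig. 1; the sketch also assumes that every internal node of the tree has already been labeled with its event type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    # "speciation" or "duplication" for internal nodes; None for leaves
    event: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the nodes on the path from root to the named leaf, or None."""
    if root.name == leaf_name:
        return [root]
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [root] + sub
    return None


def classify(root: Node, a: str, b: str) -> str:
    """Classify two leaves by the event at their most recent common ancestor."""
    path_a, path_b = path_to(root, a), path_to(root, b)
    if path_a is None or path_b is None:
        raise ValueError("both segments must be leaves of the tree")
    mrca = None
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        mrca = x  # deepest node shared by both root-to-leaf paths so far
    return "orthologs" if mrca.event == "speciation" else "paralogs"


# Hypothetical history: segment X duplicates into X1 and X2 in the ancestor,
# then a speciation event splits each copy between species A and species B.
tree = Node("root", "duplication", [
    Node("X1", "speciation", [Node("A_X1"), Node("B_X1")]),
    Node("X2", "speciation", [Node("A_X2"), Node("B_X2")]),
])

print(classify(tree, "A_X1", "B_X1"))  # MRCA is a speciation node
print(classify(tree, "A_X1", "B_X2"))  # MRCA is the duplication at the root
```

In this toy history, if `A_X2` and `B_X1` were later deleted, `A_X1` and `B_X2` would each be the only copy left in their genome and would look like orthologs, even though `classify` reports them as paralogs. This is the "false ortholog" situation the caption describes.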
Fig. 2 Timeline of MGA tools. The original sequencing of human and mouse genomes spurred the development of a number of multiple genome alignment tools. Following this initial spurt, the next generation of genome aligners (starting with Enredo-Pecan and ending with Cactus) were developed, followed by a 6-year period of silence, with Parsnp being one of the few tools released between 2012 and 2019
Fig. 3 a-c The number of assemblies available for different groups of eukaryotic species in NCBI. d-f The number of eukaryotic species with available genome assemblies in NCBI. The emergence of third-generation sequencing technologies and their accompanying assembly algorithms in the early 2010s is largely responsible for the increasing rate of novel eukaryotic genome deposits
The many challenges of MGA and potential solutions. As MGA incorporates both local alignment and global MSA, it shares many of their challenges. While some ideas from MSA have improved the capabilities of MGA, the challenges are by no means solved. Improving runtime, performance across divergent genomes, and accuracy in the presence of repeats remain important challenges for MGA
| Challenge | Solution |
|---|---|
| Evolutionary distance between genomes increases alignment difficulty [ | Adding closely related species to the input dataset bridges the evolutionary gap between genomes in the input and can increase alignment sizes [ |
| Computational costs are currently prohibitive for many large-scale MGAs. | Progressive methods eliminate the need for |
| Low complexity repeats often cause spurious alignments and/or dramatically increase computational cost. | Repeat-masking greatly simplifies the problem of alignment. Indeed, the Cactus GitHub repository states that “genomes that aren’t properly masked can still take tens of times longer to align that those that are masked.” [ |
| Progressive alignment of whole genomes does not account for incomplete lineage sorting or horizontal gene transfer, making it an incomplete model for both eukaryotes and bacteria. | Divide-and-conquer approaches, similar to those recently used for MSA, could potentially be used to allow sections of genomes to be treated as arising from different phylogenies. |
| Sequencing errors and micro-rearrangements can mask the existence of longer homologous regions. | Modifications to the genome graphs, similar to those for de Bruijn graph cleaning, can result in longer, more inclusive LCBs. |
| The number of pairwise anchors will grow significantly with the number of genomes being compared and the number of anchors present in all genomes will decrease. | Anchoring methods that aim to identify multiple local alignments may bridge the gap between all-pairs anchors and anchors present in all genomes. |
| While MSA allows for alignment of regions within an LCB, there are limited methods for extending the borders of LCBs. | Multiple alignment extension algorithms, such as the one used in procrastAligner [ |
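To make the repeat-masking row above concrete, here is a minimal sketch of soft-masking low-complexity sequence. It is an illustrative stand-in, not the procedure used by Cactus or by any dedicated repeat masker: the function names, the sliding-window Shannon-entropy criterion, and the `window` and `min_entropy` defaults are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the base composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def soft_mask(seq: str, window: int = 12, min_entropy: float = 1.0) -> str:
    """Lowercase every base covered by at least one low-entropy window.

    Soft-masking keeps the sequence intact, so a downstream aligner can
    skip lowercase regions when seeding but still extend through them.
    """
    seq = seq.upper()
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(ch.lower() if m else ch for ch, m in zip(seq, masked))


example = "ACGTACGTACGT" + "A" * 20 + "CGTACGTACGTA"
print(soft_mask(example))
```

A homopolymer run such as the 20 `A`s in `example` has zero entropy and is lowercased, while the balanced flanks mostly stay uppercase. Windows that straddle the boundary can also fall below the threshold, so a few flanking bases may be masked as well.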
Multiple genome alignment methods. Tools which only perform part of MGA or are only suited for pairwise genome alignment are excluded. Progressive aligners are italicized. See [24] for a comprehensive list of tools for all subproblems of MGA
| Method | Anchor type | Anchor discovery method | Anchor chaining | MSA method |
|---|---|---|---|---|
| A-Bruijn Aligner [ | Pairwise alignments | BlastZ [ | Maximum Subgraph with Large Girth problem [ | - |
| Cactus [ | Pairwise alignments | LastZ [ | Cactus alignment filter | Pecan [ |
| M-GCAT [ | MUMs | Compressed suffix graph | Recursive match chaining | MUSCLE [ |
| Mauve [ | MUMs | GRIL [ | Greedily removes low-scoring LCBs | ClustalW [ |
| Mugsy [ | Pairwise alignments | MUMmer [ | Min-Cut in alignment graph | T-Coffee [ |
| Multiple Genome Aligner [ | MEMs | Suffix array | Maximum-weight path in an acyclic graph [ | ClustalW [ |
| Parsnp [ | MUMs | Compressed suffix graph | Weighted recursive match chaining | libMUSCLE [ |
| SibeliaZ [ | | TwoPaCo [ | Carrying paths | SIMD partial order alignment [ |
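Several tools in the table use maximal unique matches (MUMs) as anchors. As a rough illustration, the sketch below finds pairwise MUMs by brute force: exact matches that cannot be extended in either direction and that occur exactly once in each sequence. It is a quadratic-time toy for short strings, not the suffix-based indexing the listed tools rely on; the function names and the `min_len` parameter are hypothetical, and reverse-complement matches are ignored.

```python
from typing import List, Tuple


def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a: str, b: str, min_len: int = 4) -> List[Tuple[int, int, int]]:
    """Return (start_in_a, start_in_b, length) for each MUM of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            if (k >= min_len
                    and count_overlapping(a, match) == 1
                    and count_overlapping(b, match) == 1):
                mums.append((i, j, k))
    return mums


print(find_mums("AAACGTGCATTT", "CCACGTGCAGGG"))  # [(2, 2, 7)]
print(find_mums("ACGTACGT", "ACGT"))              # [] (match repeats in a)
```

The second call shows why uniqueness matters for anchoring: `ACGT` matches, but it occurs twice in the first sequence, so it cannot unambiguously pin one position to another.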
Fig. 4 The pipeline of MGA. a An example of large-scale genomic rearrangement, insertion, and deletion occurring across a set of genomes. The lines denote bounds of homologous segments, and inversions are denoted by crossing lines. b Anchors from the section of 3 genomes surrounded by the dotted box in a. Once again, the lines between genomes represent homology. The labeled blocks on each genome correspond to anchors, where two blocks with the same label are inferred to be potentially homologous sites. c The alignment graph obtained by merging the 3 linear genome graphs in b. At this step, the aim of MGA is to find colinear paths in the graph, i.e., sequences of anchors which are traversed by a group of genomes in the same order. d Often, the initial set of anchors will be too noisy, containing spurious alignments which prevent the formation of longer, more reliable colinear paths. By removing anchor F, the alignment graph becomes much simpler and yields longer colinear paths, where each set of colinear paths is denoted by the color. e For each colinear path, MGA tools perform an MSA, yielding a set of sequence alignments which together make up the genome alignment
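Steps c and d of the caption can be sketched with a simplified model in which each genome is an ordered list of anchor labels. The example below is an assumption-laden toy, not the chaining method of any tool in the table: it assumes every label occurs at most once per genome, ignores strand and inversions, and uses a quadratic dynamic program to find the longest run of anchors that all genomes traverse in the same order. The genomes and the function name `longest_colinear_chain` are hypothetical, and the anchor labels only loosely mirror the figure.

```python
from typing import Dict, List


def longest_colinear_chain(genomes: List[List[str]]) -> List[str]:
    """Longest sequence of anchors visited in the same order by all genomes."""
    positions: List[Dict[str, int]] = [
        {label: idx for idx, label in enumerate(g)} for g in genomes
    ]
    # Anchors present in every genome, listed in the order of the first genome.
    shared = [lab for lab in genomes[0] if all(lab in p for p in positions)]
    if not shared:
        return []
    best = [1] * len(shared)   # best[i]: length of the longest chain ending at i
    prev = [-1] * len(shared)  # back-pointer used to recover that chain
    for i, cur in enumerate(shared):
        for j in range(i):
            before = shared[j]
            in_order = all(p[before] < p[cur] for p in positions)
            if in_order and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(len(shared)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(shared[end])
        end = prev[end]
    return chain[::-1]


genomes = [list("ABFCDE"), list("ABCDEF"), list("FABCDE")]
chain = longest_colinear_chain(genomes)
dropped = [lab for lab in genomes[0] if lab not in chain]
print(chain)    # ['A', 'B', 'C', 'D', 'E']
print(dropped)  # ['F']
```

Here anchor `F` sits at a different relative position in each genome, so no chain longer than one anchor can include it. Leaving it out yields a single colinear path through `A` to `E`, which would then be handed to an MSA method as in step e.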