Anton E Shikov1,2, Yury V Malovichko1,2, Anton A Nizhnikov1,2, Kirill S Antonets1,2.
Abstract
Genetic exchange in bacteria, i.e., homologous recombination (HR) and horizontal gene transfer (HGT), is a pivotal mechanism driving their evolution and adaptation, so tracking the signs of recombination and HGT events is important for both fundamental and applied science. To date, dozens of bioinformatics tools for revealing recombination signals are available; however, their pros and cons, as well as the range of tasks they can solve, have not yet been systematically reviewed. Moreover, the software falls into two major groups: one aims to infer evidence of HR, while the other deals only with HGT. Despite these seemingly different goals, all the methods use similar algorithmic approaches, and the two processes are interconnected, influencing each other in the course of genomic evolution. In this review, we propose a classification of novel instruments for both HR and HGT detection based on the genomic consequences of recombination. In this context, we summarize the available methodologies, paying particular attention to the type of traceable events for which each program was designed.
Keywords: HGT detection; homologous recombination (HR); horizontal gene transfer (HGT); phylogenetic methods; recombination detection; synteny
Year: 2022 PMID: 35682936 PMCID: PMC9181119 DOI: 10.3390/ijms23116257
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. A combined classification of methods for detecting homologous recombination and horizontal gene transfer depending on the genomic consequences of the events. HR, homologous recombination; HGT, horizontal gene transfer; ARGs, ancestral recombination graphs.
Current bioinformatics tools for detecting homologous recombination and horizontal gene transfer in genetic data. The table summarizes tools’ properties in terms of algorithms applied, input files and output results, type of detected events, advantages, and limitations.
| Tool | Applied Approach | Method’s Class | Input | Output | Detected Events | Advantages | Limitations | References |
|---|---|---|---|---|---|---|---|---|
| RDP4/RDP5 | Combination of phylogenetic and distance methods | Phylogenetic and distance-based | Alignments in FASTA format | Recombination events with phylogenetic relationships and breakpoint coordinates for chimeric sequences in tabular format | Recent | Robustness and information on the direction of exchanges | Inability to reveal distant events and high computational costs | [ |
| Gubbins | Revealing increased substitution rate among ML-tree branches | Phylogenetic | Alignments in FASTA format | Coordinates of recombination events in tabular format and their visualization on the genome alignment | Recent and ancestral | Precise reconstruction of ancestral state | High computational costs and possible false-positive results when analyzing trees with short branches (theoretically) | [ |
| ClonalFrameML | Maximum likelihood-based clonal model | Phylogenetic | Alignments in FASTA format and guiding tree | Phylogeny regarding recombination and visualization of events’ coordinates on the genome alignment in tabular format | Recent and ancestral | Computational effectiveness | Underestimation of recombination rate in datasets with intensive recombination | [ |
| BratNextGen | Bayesian modeling | Substitution distribution | Alignments in FASTA format | Coordinates of the events in tabular format and visualization of transmitted regions on the genome alignment | Recent and ancestral | Computational effectiveness | False-negative results in the case of mosaic sequences with multiple recombination events | [ |
| fastGEAR | HMM algorithms coupled with Bayesian clustering | Substitution distribution | Alignments in FASTA format | Coordinates of ancestral and recent recombination events in tabular format | Recent and ancestral | Computational effectiveness, high sensitivity, and handling of missing data | Missing events between closely related species | [ |
| ptACR | Genome-wide average SNP compatibility calculation | Compatibility | Gap-free alignments in PHYLIP format | Genomic coordinates of recombination events in tabular format | Recent | High accuracy and robustness to false-positive results | Inability to process alignments with gaps and high false-negative rate when processing divergent sequences | [ |
| HREfinder | Genome partitioning into SNP-flanked blocks | Substitution distribution | Genomes in FASTA format, tree in Newick format, and SNP list in tabular format | List of sequences subjected to recombination in tabular format | Recent | High accuracy | High false-negative rate when processing divergent sequences | [ |
| mcorr | Building a correlation profile of synonymous substitutions | Parametric | Alignments in XMFA or BAM formats | Tables and figures depicting the average recombination rate | The total rate of recent/ancient events | The ability to process raw reads and metagenomic data | Has not been compared to conventional r/m rate calculating tools | [ |
| Bacter | Markov chain Monte Carlo (MCMC) | ARG | Alignments in FASTA format | Ancestral recombination graph (ARG) in Newick format | Recent | Improved detection of the events in the case of poor phylogenetic signal | Dependence on predetermined parameters and high computational costs | [ |
| TARGet | Topological data analysis (TDA) | ARG | Alignments in FASTA format without gaps or segregating sites denoted by 1 and 0 | Ancestral recombination graph (ARG) in XML format and positions of reticulate events | Recent | Computational effectiveness | Inability to process alignments with gaps | [ |
| Clusterflock | Self-organizing flock algorithm | Parametric | Sequences and a distance matrix | Clusters of sequences in tabular format | Recent | Applicability to any distance metrics and resilience to missing data | Has not been compared to the existing tools | [ |
| gmos | Pairwise local alignments with subsequent regions overlapping | Parametric | Query and subject genomes in FASTA format | Structural variants in FASTA format | Recent | Computational effectiveness and the ability to reveal both HR and HGT | Depends heavily on the high similarity between transferred regions | [ |
| GeneMates | Association tests with the linear-mixed model accounting for population structure | Parametric | Genome assemblies in FASTA format and raw reads in FASTQ format | The linkage network of horizontally co-transferred alleles in tabular format | Recent | Resolving co-occurring HGT events | Reduced sensitivity due to the dependence on core SNPs | [ |
| ShadowCaster | Support vector machine-based hybrid approach | Implicit phylogenetic and parametric | A query genome and proteome and list of related proteomes in FASTA format | The list of HGT candidates with corresponding likelihood calculations in tabular format | Recent and ancestral | High sensitivity when revealing both recent and ancient events and reduced false-positive rate | Does not determine the directions of transfers and processes only a single genome | [ |
| nearHGT | Calculating synteny index (SI) followed by constant relative mutability (CRM) measurement | Synteny-based and parametric | Reference and putatively transferred sequences in FASTA format | Chi-square-based | Recent | High sensitivity | No ready-made application is available | [ |
| HGT-Finder | Similarity ratio evaluation for proteins according to BLAST hits and taxonomic distance calculation based on the NCBI Taxonomy annotation | Implicit phylogenetic | The BLAST search result and the NCBI Taxonomy database | Tabular format file with the transfer index value for a protein | Recent | Detecting mostly true events | High reliance on the taxonomic nomenclature and low sensitivity | [ |
| HGTector | Analyzing BLAST hit distribution patterns according to predefined evolutionary categories | Implicit phylogenetic | FASTA files of amino acid sequences for each analyzed genome | List of candidate HGT-derived genes with the respective silhouette scores in tabular format | Recent | Insensitive to gene loss, rate variations, and database errors | High reliance on the taxonomic nomenclature and low sensitivity | [ |
| RecentHGT | The expectation-maximization algorithm based on the sequence-similarity distribution of orthologous genes | Implicit phylogenetic | Tabular file with strains information and RAST-annotated GenBank file | Putative HGT events in chromosomal and plasmid regions in tabular format | Recent | Reduced false-positive rate when processing conserved genes | Missing events when analyzing divergent sequences | [ |
| Daisy | Mapping-based detection relying on short read pairs and coverage information | Parametric | Reads from the analyzed organism and proposed acceptor and donor genomes in FASTA format | A variant call format (VCF) file reporting HGT candidates meeting the predefined threshold and tabular format file with all potential events | Recent | Outperforms reference genome-based approaches if short reads are available | Works only with short reads and requires explicit specification of recipient and donor genomes | [ |
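
The "Compatibility" method class in the table rests on the idea that pairs of SNPs can be checked for consistency with a single tree. A minimal sketch of one common form of this check, the four-gamete test, is given below. It is an illustration only and is not the implementation used by ptACR or any other listed tool; the function names and toy alignments are hypothetical. Under an infinite-sites assumption, observing all four allele combinations at two biallelic sites suggests recombination (or recurrent mutation) between them.

```python
from itertools import combinations


def sites_compatible(col_a, col_b):
    """Four-gamete test: two biallelic sites are compatible when fewer
    than four distinct allele pairs are observed across the sequences."""
    return len(set(zip(col_a, col_b))) < 4


def pairwise_compatibility(alignment):
    """Fraction of compatible site pairs among the biallelic columns of a
    gap-free alignment given as a list of equal-length strings."""
    # Keep only columns with exactly two alleles (biallelic SNPs).
    columns = [col for col in zip(*alignment) if len(set(col)) == 2]
    pairs = list(combinations(columns, 2))
    if not pairs:
        return 1.0  # nothing to contradict a single tree
    compatible = sum(sites_compatible(a, b) for a, b in pairs)
    return compatible / len(pairs)


# Hypothetical toy alignments (rows are sequences, columns are sites).
clonal = ["ACGT", "ACGA", "TCGA", "TGGA"]   # every SNP pair fits one tree
mosaic = ["AA", "AC", "CA", "CC"]           # all four gametes are present

print(pairwise_compatibility(clonal))
print(pairwise_compatibility(mosaic))
```

A score near 1 indicates that the SNPs are mutually consistent with clonal descent, whereas lower values flag regions where recombination may have occurred. Real tools typically compute such scores in windows along the genome and apply statistical thresholds, which this sketch omits.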