Guillermin Agüero-Chapin1,2, Deborah Galpert3, Reinaldo Molina-Ruiz4, Evys Ancede-Gallardo5, Gisselle Pérez-Machado6, Gustavo A de la Riva7,8, Agostinho Antunes1,2.
Abstract
Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Although a new set of graph theory-derived sequence/structural descriptors has been gaining relevance in the detection of remote homology, these descriptors have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success in identifying remote homologs. We also highlight the trend toward integrating AF features/measures with the AB ones, either into the same prediction model or by assembling the predictions from different algorithms using voting/weighting strategies, for improving the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the experience gained in remote homology detection with both the most popular AF tools and other less known ones, we provide our own using the graphical-numerical methodologies MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone by using several similarity criteria.
Keywords: QSAR; alignment-free; big data; bioinformatics; topological indices
Year: 2019 PMID: 31878100 PMCID: PMC7022958 DOI: 10.3390/biom10010026
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Workflow used for homology detection within the twilight zone according to the selected alignment-free (AF) methodology. This selection is in turn conditioned by the input data and the availability of scalable solutions.
Figure 2. Screenshot of the SeqDivA GUI, showing the input FASTA file made up of 10 hypothetical protein sequences and the main outputs: the all-vs.-all identity matrix and the dot plot representing the identity/similarity/bit-score variation among the sequence pairs.
Summary of the most popular AF features applied to detect remote homology.
| AF Feature | Low-Similarity Dataset | Web-Implementation | Ref. |
|---|---|---|---|
| **Word-Frequency Methods** | | | |
| Amino Acid Composition (ACC) | G-protein coupled receptor superfamily | COPid | [ |
| Pseudo Amino Acid (PseACC) | G-protein coupled receptor superfamily | | [ |
| PseACC | Designed dataset identity from ENZYME Swiss-Prot database in [ | | [ |
| PseACC | Chou’s designed dataset [ | | [ |
| k-mers | Benchmark structural data designed based on [ | Not publicly available for proteins | [ |
| k-mers | Benchmark structural data designed in [ | Not publicly available for proteins | [ |
| **Information Theory-Based Methods** | | | |
| Lempel-Ziv complexity | Subset of SCOP designed by [ | Not publicly available | [ |
| Kolmogorov complexity | Subset of SCOP designed by [ | Not publicly available | [ |
| Kolmogorov complexity (Universal Similarity Metric) | Benchmark structural data < 25% designed based on [ | Not publicly available | [ |
| Kolmogorov complexity (Universal Similarity Metric) | Clustering protein structures using at low sequence similarity | | [ |
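As a rough illustration of the word-frequency features listed in the table, the following minimal sketch builds k-mer frequency profiles for two sequences and compares them without any alignment. The function names, the example peptides, and the choice of Euclidean distance are assumptions made for illustration only; they are not taken from the reviewed tools. With k = 1 the profile reduces to a plain amino acid composition.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k):
    """Return the relative frequency of each overlapping k-mer in seq."""
    if k < 1 or len(seq) < k:
        raise ValueError("sequence must contain at least one k-mer")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two sparse k-mer profiles."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


# Hypothetical peptides, used only to exercise the functions.
profile_a = kmer_profile("MKVLAAGIVG", 2)
profile_b = kmer_profile("MKVLSAGLVG", 2)
print(euclidean_distance(profile_a, profile_b))
```

A smaller distance suggests more similar word usage. Classifiers built on such features would normally take the full profile vector as input rather than a single pairwise distance.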
Figure 3. (A) The internal transcribed spacer (ITS2) sequence from the endophytic fungus Petrakia sp. pseudo-folded into the 2D-Cartesian system. (B) RNase III protein sequence from Escherichia coli BL 21 pseudo-folded into the 2D-Cartesian system extended to amino acid clustering into the four main physicochemical properties (acid, basic, polar, and nonpolar). (C) Representation of the human coding region of the β-globin gene as a spiral of square cells and four-color maps [99]. (D) Four-color DNA maps are extended to the β-globin protein applying the same amino acid clustering of 2D-Cartesian systems. (E) Spectral representation of the human ND6 protein based on the assignment of y-axis values (1–20) to the 20 amino acids. The x-axis represents the length of the sequence (174 aa) [94]. (F) The star graph for the human insulin (21 aa long) [94,100].
Figure 4. Workflow for the calculation of spectral moments as graph theory-based sequence descriptors. The protein fragment “IGIHVGR” is pseudo-folded into the 2D-Cartesian system of hydrophobicity (H) and polarity (P). The seven amino acids of the fragment are distributed according to their physicochemical nature in the 2D-Cartesian system, starting from the (0,0) coordinates. The resulting 2D-Cartesian (HP) map is used to derive an edge adjacency matrix, which is raised to different powers k. The trace operator (Tr) is applied to each matrix power, (matrix)^k, to finally estimate the spectral moments as protein TIs.
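The workflow in the Figure 4 caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the TI2BioP implementation. The four-class amino acid grouping follows the acid/basic/polar/nonpolar clustering mentioned for Figure 3, but the exact class membership, the direction assigned to each class, and the use of an unweighted edge adjacency matrix are all hypothetical choices made here.

```python
# Hypothetical grouping of residues into four physicochemical classes.
CLASS_OF = {}
for aa in "AVLIMFWPG":
    CLASS_OF[aa] = "nonpolar"
for aa in "STCYNQ":
    CLASS_OF[aa] = "polar"
for aa in "DE":
    CLASS_OF[aa] = "acid"
for aa in "KRH":
    CLASS_OF[aa] = "basic"

# Hypothetical unit step on the 2D-Cartesian map for each class.
STEP = {"nonpolar": (0, 1), "polar": (0, -1), "acid": (-1, 0), "basic": (1, 0)}


def pseudo_fold(seq):
    """Walk from (0, 0), moving one unit per residue according to its class."""
    x, y = 0, 0
    path = [(x, y)]
    for aa in seq:
        dx, dy = STEP[CLASS_OF[aa]]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(seq, max_k=3):
    """Return [Tr(E^0), ..., Tr(E^max_k)] for the edge adjacency matrix E."""
    path = pseudo_fold(seq)
    edges = []
    for u, v in zip(path, path[1:]):
        edge = tuple(sorted((u, v)))
        if edge not in edges:  # keep each undirected edge once
            edges.append(edge)
    n = len(edges)
    # Two distinct edges are adjacent when they share a node of the map.
    E = [[1 if i != j and set(edges[i]) & set(edges[j]) else 0
          for j in range(n)] for i in range(n)]
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(max_k + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, E)
    return moments


print(spectral_moments("IGIHVGR"))
```

Under these assumptions the fragment folds into a non-self-intersecting path of seven edges. The zeroth moment therefore counts the edges, and the second moment equals twice the number of adjacent edge pairs.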
Figure 5. 2D-Cartesian maps for several Ribonuclease (RNase) III sequences from prokaryotes (dark grey), eukaryotes (light grey), and rPac1 [DQ647826] from Schizosaccharomyces pombe strain 428-4-1 (black). Thin white lines represent the beginning of all RNases III (5′ region) and the terminal 3′ region of the Pac1 protein. The last amino acid from each sequence is represented as a black square dot. This figure was taken from [122].
Figure 6. Pseudo-folding of the Cry 1Ab C-terminal domain sequence (in black) into the bacteriocins 2D-HP space (in red). This figure was taken from reference [134].
Figure 7. Re-annotation of the A-domains in the proteome of Microcystis aeruginosa by using an ensemble of algorithms. Five putative A-domain remote homologs were detected by consensus between the Decision Tree Model (DTM) and the profile Hidden Markov Model (HMM) among the five hypothetical proteins. This figure was taken from reference [38].
Summary of the graphical–numerical features applied to detect remote homology.
| AF Feature | Low-Similarity Dataset | Graphical Representation | New Member Detected | Ref. |
|---|---|---|---|---|
| **Graph-Theory-Based Sequence Descriptors** | | | | |
| Stochastic spectral moments ( | RNase III family | 2D Cartesian protein maps | Pac1 | [ |
| Markovian entropies ( | Cellulase complex | 2D Cartesian protein maps | - | [ |
| Markovian entropies, spectral moments and electrostatic potentials ( | Mycobacterial promoters | 2D Cartesian DNA maps | - | [ |
| 3D-Markovian descriptors | D&D benchmark dataset [ | 3D protein representation from PDB files considering distances between Cα of aa | - | [ |
| Set of TIs for | Natural and unnatural proteins | 2D star protein graphs | | [ |
| Set of TIs for | D&D benchmark dataset [ | 2D star protein graphs | | [ |
| Spectral moments ( | Bacteriocin proteins | 2D Cartesian protein maps | Bacteriocin-like protein in the Cry 1Ab C-terminal domain | [ |
| Spectral moments ( | RNase III family | 2D Cartesian protein maps | RNase III GU190214 | [ |
| Spectral moments ( | ITS2 family | 2D Cartesian DNA maps | ITS2 from | [ |
| Spectral moments ( | A-domains from NRPSs | Four-colour maps | Remote homologs in the proteome of | [ |
| 3D protein bilinear indices | Chou’s designed dataset [ | 3D PDB graphical information considering Cα and non-covalent interactions | - | [ |
| 3D protein three-linear indices | Chou’s designed dataset [ | 3D PDB graphical information considering Cα, Cβ and average of the coordinates of all atoms in the amino acid | - | [ |
| 3D and 1D descriptors ( | D&D benchmark dataset [ | 1D Sequence information | | [ |
Summary of the strategies combining AF and alignment-based (AB) features/measures applied to detect remote homology.
| AB/AF Features-Methods | Low-Similarity Dataset | Integrative Algorithm | Ref. |
|---|---|---|---|
| **AB and AF Features/Measures Integrated under the Same Model/Algorithm** | | | |
| BLAST-bitscores (AB) | Complete viral genomes | k-NN algorithm provides a combined score obtained by combining/weighting the individual scores from the AB- and AF-based classifications | [ |
| Profile-based sequence representation based on PSI-BLAST alignments | Benchmark dataset - SCOP structural classes [ | Original sequences are replaced by their profile-based representation containing evolutionary information of the family, then the PseACC concept is applied to generate AF predictors | [ |
| Smith-Waterman (AB) | Benchmark dataset reported in [ | Decision Tree Models (DTM) implemented in the Big Data Spark platform | [ |
| | | | |
| Multi-template BLASTp (AB) | Real dataset made up of NRPS’s A-domains (10–40% of identity) and CATH domains | Assembling the predictions from AB and AF sequence similarity searches. The consensus prediction is more sensitive and reliable for detecting A-domain remote homologs. | [ |
| Support Vector Machines (SVM) | Subset of SCOP structural classes designed by [ | SVM-Ensemble weighted voting strategy | [ |
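The voting/weighting strategies listed in the table can be illustrated with a minimal sketch that combines the labels returned by several independent predictors. The predictor names, labels, and weights below are hypothetical placeholders; the reviewed studies define their own ensembles and weighting schemes.

```python
def weighted_vote(predictions, weights):
    """Return the label with the largest summed weight across predictors."""
    totals = {}
    for name, label in predictions.items():
        totals[label] = totals.get(label, 0.0) + weights[name]
    return max(totals, key=totals.get)


def consensus(predictions):
    """Return the shared label if all predictors agree, otherwise None."""
    labels = set(predictions.values())
    return labels.pop() if len(labels) == 1 else None


# Hypothetical outputs of one AB search and two AF models for one query.
predictions = {"blastp": "non-homolog", "hmm": "homolog", "dtm": "homolog"}
weights = {"blastp": 0.4, "hmm": 0.3, "dtm": 0.3}

print(weighted_vote(predictions, weights))
print(consensus(predictions))
```

Weighted voting always returns a label, whereas the strict consensus returns nothing when the predictors disagree. The latter is the more conservative choice when false positives are costly.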