Julio E Terán1,2, Yovani Marrero-Ponce3,4, Ernesto Contreras-Torres1, César R García-Jacas5, Ricardo Vivas-Reyes6,7, Enrique Terán1, F Javier Torres2.
Abstract
In this report, a new type of three-dimensional (3D) biomacro-molecular descriptor for proteins is proposed. These descriptors use multi-linear algebra concepts based on the application of 3-linear forms (i.e., Canonical Trilinear (Tr), Trilinear Cubic (TrC), Trilinear-Quadratic-Bilinear (TrQB) and so on) as a specific case of the N-linear algebraic forms. The kth 3-tuple similarity-dissimilarity spatial matrices (tensor form) are used to transform and represent the chemical information contained in the relationships between three amino acids of a protein. Several metrics (Minkowski-type, wave-edge, etc.) and multi-metrics (triangle area, bond angle, etc.) are proposed for extracting interaction information, as well as probabilistic transformations (e.g., simple stochastic and mutual probability) to normalize the matrices. A generalized procedure for descriptor calculation is proposed in which amino acid-level indices are fused by aggregation operators. The results show that the proposed 3D biomacro-molecular indices perform better than other approaches in SCOP-based discrimination and in the prediction of protein folding rates using simple linear parametric models. It can be concluded that the proposed method allows the definition of 3D biomacro-molecular descriptors that contain orthogonal information capable of providing better models for applications in protein science.
Year: 2019 PMID: 31388082 PMCID: PMC6684663 DOI: 10.1038/s41598-019-47858-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Schematic of the transformation of the information contained in macro-molecular vectors using spatial information of the protein (Three-Tuple-(Dis)Similarity Matrices, TDSM) and algebraic forms. Here n is the number of amino acids in the protein; [X], [Y] and [P] are macro-molecular vectors; z are the elements of the TDSM; and L is the resulting molecular descriptor (MD). The algebraic forms are defined by the physicochemical nature of the macro-molecular vectors.
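The scheme in Figure 1 can be read as a contraction of the TDSM with three macro-molecular vectors. The following is a minimal sketch, assuming the canonical trilinear form is the triple sum of z[i, j, l] · x[i] · y[j] · p[l]; the function name `trilinear_form` and the sample values are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def trilinear_form(z, x, y, p):
    """Contract an n x n x n tensor z with three length-n vectors.

    Assumed form: L = sum_i sum_j sum_l z[i, j, l] * x[i] * y[j] * p[l].
    """
    return float(np.einsum("ijl,i,j,l->", z, x, y, p))

# Hypothetical toy data for a 3-residue peptide.
n = 3
z = np.ones((n, n, n))           # stand-in for a TDSM
x = np.array([1.0, 2.0, 3.0])    # stand-ins for property vectors [X], [Y], [P]
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.5, 0.5, 0.5])

L = trilinear_form(z, x, y, p)
```

With an all-ones tensor the result reduces to the product of the three vector sums, which gives a quick sanity check of the contraction.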
Figure 2. Graphical representation of the differences in computation between (a) total and (b) amino acid-based tensors for the novel 3D algebraic MDs, using a simple example: the truncated peptide from PDB file 5WRX.
Figure 3. Illustration of the calculation of the novel molecular descriptors. (A) The protein structure is filtered according to a protein representation (Section 2.4) to calculate the relationships between two amino acids (metrics, SMI-C) and three amino acids (multi-metrics, Table 2). (B,C) The macromolecular vectors are computed from a group of physicochemical properties and the sequence of the structure (Section 2.1). (D) The T-TDSM can be filtered by several groups of amino acids to evaluate their role in a given application (Section 2.2). (E) The non-stochastic tensor is raised to the kth power (−12 to 12) by applying a Hadamard matrix product, to evaluate the interactions between amino acids (Section 2.4). (F,G) The non-stochastic tensor can be normalized using the simple stochastic and the mutual probability methods, respectively (Section 2.5). (H) The total tensor can be split into amino acid-based tensors (Section 2.1). (I) The application of N-algebraic forms transforms the information extracted from the macromolecular vectors and the tensors (Section 2.1). (J) The resulting amino acid-based indices are stored in a Local Amino Acidic Invariant (LAI) (Section 2.1). (K) Aggregation operators are proposed as a fusion operation for the LAI (Section 2.3).
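Step (K) fuses the per-residue values of the LAI into global descriptors. The sketch below illustrates that idea only; the specific operators used by the authors are not listed in this record, so the sum, mean, Euclidean norm and standard deviation chosen here are assumptions, as are the name `aggregate_lai` and the sample values.

```python
import numpy as np

def aggregate_lai(lai):
    """Fuse a vector of amino acid-level indices into scalar descriptors.

    The four operators below are illustrative choices, not the authors' list.
    """
    lai = np.asarray(lai, dtype=float)
    return {
        "sum": float(lai.sum()),
        "mean": float(lai.mean()),
        "euclidean_norm": float(np.linalg.norm(lai)),
        "std": float(lai.std()),  # population standard deviation
    }

# Hypothetical LAI for a two-residue fragment.
descriptors = aggregate_lai([3.0, 4.0])
```

Each operator yields one descriptor, so a single LAI vector produces as many global descriptors as there are operators.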
Multi-metrics available for the calculation of the novel 3D algebraic MDs for proteins. In bold, the software ID number of the multi-metric is indicated.
| Measure | Formula | Symmetry |
|---|---|---|
| Triangle Area | | S |
| Triangle's Incircle Area | | S |
| Summation Sides | | A |
| Bond angle (Angle between sides) | | A |
| MIN-RULE [1-Nearest neighbor (NN)] | | A |
| JOIN-RULE (2-NN) | | S |
| MAX-RULE (Furthest neighbor) | | A |
| AVE-RULE (Average-link) | | A |
| MED-RULE | | A |
| WAR-RULE | | A |
| ADJ-RULE | | A |
| MAH-RULE (Similarity with Ward's method) | | |
| ADD-RULE (Average D/D degree) | | S |
| SUM-RULE (Wiener index) | | S |
| PRO-RULE | | S |
| QUA-RULE | | S |
| GEO-RULE | | S |
| RAN-RULE | | S |
| IC-RULE Additivity-Corrected | | A |
| AC-RULE Additivity-Corrected | | S |
| PC-RULE Proportionality-Corrected | | S |
| LC-RULE Linearity-Corrected (mean pair-wise Pearson correlation) | | S |
x are the mean centroids for the atoms X, Y, Z (XY) in the protein, respectively; d is the Mahalanobis distance; n is the dimension (3); k is the number of combinations (i, j) with i < j, namely (1, 2), (1, 3) and (2, 3); and the arithmetic mean is taken over the variable U. The subscript values (1, 2, 3) stand for the atoms (X, Y, Z), respectively (e.g., for the combination (1, 2), U1 and U2 represent the atoms X and Y); r is the Pearson correlation between the variables X and Y; and pXY is the topological distance between the amino acids containing atoms X and Y.
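The formulas for the geometric multi-metrics did not survive in this record. As a hedged illustration, the sketch below computes three of the named measures from standard Euclidean geometry: Heron's formula for the triangle area, the inradius relation r = area / s for the incircle area, and the plain sum of side lengths. These are assumptions about what the measures mean, not the authors' exact definitions, and all function names are hypothetical.

```python
import math

def side_lengths(a, b, c):
    """Euclidean side lengths of the triangle with vertices a, b, c."""
    return math.dist(a, b), math.dist(b, c), math.dist(a, c)

def triangle_area(a, b, c):
    """Area from Heron's formula; clamped at zero against rounding error."""
    ab, bc, ac = side_lengths(a, b, c)
    s = (ab + bc + ac) / 2.0  # semi-perimeter
    return math.sqrt(max(s * (s - ab) * (s - bc) * (s - ac), 0.0))

def incircle_area(a, b, c):
    """Area of the inscribed circle, using inradius r = area / s."""
    s = sum(side_lengths(a, b, c)) / 2.0
    if s == 0.0:
        return 0.0  # all three points coincide
    r = triangle_area(a, b, c) / s
    return math.pi * r ** 2

def summation_sides(a, b, c):
    """Sum of the three side lengths (triangle perimeter)."""
    return sum(side_lengths(a, b, c))

# Hypothetical coordinates forming a 3-4-5 right triangle.
A, B, C = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)
area = triangle_area(A, B, C)
incircle = incircle_area(A, B, C)
perimeter = summation_sides(A, B, C)
```

In practice the three points would be the coordinates chosen to represent three amino acids under a given protein representation.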
Amino acid groups considered for the computation of the novel 3D algebraic biomacro-molecular descriptors for proteins.
| Group | Description | Amino acids |
|---|---|---|
| FAH | Alpha-helix-favoring amino acids | ALA, CYS, LEU, MET, GLU, GLN, HIS, LYS |
| FBS | Beta-sheet-favoring amino acids | VAL, ILE, PHE, TYR, TRP, THR |
| UFG | Unfolding amino acids | GLY, PRO |
| AFT | Beta-turn-favoring amino acids | GLY, SER, ASP, ASN, PRO |
| ALG | Aliphatic | GLY, ALA, PRO, VAL, LEU, ILE, MET |
| ARO | Aromatic | PHE, TYR, TRP |
| RPC | Polar, positively charged | LYS, HIS, ARG |
| RNC | Polar, negatively charged | ASP, GLU |
| RAP | Apolar | PRO, ILE, ALA, VAL, LEU, PHE, TRP, MET |
| RPU | Polar, uncharged | ASN, CYS, GLY, SER, THR, TYR, GLN |
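Selecting the residues that belong to one of these groups can be sketched as a simple lookup. The group memberships below are copied from the table; the function name `residues_in_group` and the example sequence are hypothetical, and how the selected positions are then applied to the tensor is not specified in this record.

```python
# Three of the groups from the table above (three-letter residue codes).
AA_GROUPS = {
    "ARO": {"PHE", "TYR", "TRP"},   # aromatic
    "RPC": {"LYS", "HIS", "ARG"},   # polar, positively charged
    "RNC": {"ASP", "GLU"},          # polar, negatively charged
}

def residues_in_group(sequence, group):
    """Return the 0-based positions of residues that belong to `group`."""
    members = AA_GROUPS[group]
    return [i for i, aa in enumerate(sequence) if aa in members]

# Hypothetical five-residue sequence.
sequence = ["MET", "LYS", "PHE", "ASP", "TRP"]
aromatic_idx = residues_in_group(sequence, "ARO")
charged_pos_idx = residues_in_group(sequence, "RPC")
```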
Figure 4. Computation of the Three-Tuple-(Dis)Similarity Matrix (TDSM) for an example truncated peptide (5WRX). Zijl is the value obtained by applying a multi-metric (e.g., Bond Angle, Triangle Perimeter; see Table 2). The resulting tensor has dimensions n × n × n, where n is the number of amino acids in the protein.
Figure 5. Selection of multi-metrics or metrics for the definition of the Three-Tuple-(Dis)Similarity Matrix (TDSM) of the truncated peptide 5WRX using the AB representation. A multi-metric is considered (a) Complete when it accounts not only for the relationships between three amino acids (multi-metrics, here Triangle Perimeter) but also for the relationships between two amino acids (metrics, here Euclidean Distance). A multi-metric is considered (b) Non-Complete when it accounts only for the relationships between three amino acids (relationships between two amino acids are set to zero in the TDSM). Moreover, the diagonal of the tensor (formed by all tensor elements where i = j = l) can hold zero values if the measure is applied taking every amino acid as a reference, or non-zero values if the measure is applied taking the center of mass of the protein as a reference.
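The Complete and Non-Complete variants can be sketched for the Triangle Perimeter multi-metric as follows. This is an illustration under stated assumptions: one coordinate per residue, Euclidean distance as the pairwise metric, and a zero diagonal (every amino acid taken as its own reference). The function name `perimeter_tdsm` and the coordinates are hypothetical.

```python
import numpy as np

def perimeter_tdsm(coords, complete=True):
    """Build an n x n x n tensor whose entry (i, j, l) is the triangle perimeter.

    Entries with exactly two equal indices hold the Euclidean distance between
    the two distinct residues when `complete` is True, and zero otherwise.
    Entries with i == j == l are zero in both cases (assumed reference choice).
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distance matrix d[i, j].
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Perimeter d_ij + d_jl + d_il via broadcasting.
    z = d[:, :, None] + d[None, :, :] + d[:, None, :]
    i, j, l = np.indices(z.shape)
    distinct = (i != j) & (j != l) & (i != l)
    # With two equal indices the "perimeter" is twice the pair distance,
    # so z / 2 recovers the Euclidean distance for those entries.
    pair_value = z / 2.0 if complete else np.zeros_like(z)
    return np.where(distinct, z, pair_value)

# Hypothetical coordinates forming a 3-4-5 right triangle.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
z_complete = perimeter_tdsm(coords, complete=True)
z_noncomplete = perimeter_tdsm(coords, complete=False)
```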
Figure 6. Application of the Hadamard matrix product to the Three-Tuple-(Dis)Similarity Matrix (TDSM) for the example truncated peptide 5WRX.
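Raising the tensor to the kth power with a Hadamard product means raising each element to that power independently. A minimal sketch follows; leaving zero entries at zero for negative k is an assumption made here to avoid division by zero, and the name `hadamard_power` and the sample tensor are hypothetical.

```python
import numpy as np

def hadamard_power(z, k):
    """Element-wise kth power of a tensor.

    For negative k, zero entries are left at zero (an assumption of this
    sketch) rather than producing infinities.
    """
    z = np.asarray(z, dtype=float)
    if k >= 0:
        return z ** k
    out = np.zeros_like(z)
    nonzero = z != 0
    out[nonzero] = z[nonzero] ** k
    return out

# Hypothetical 2 x 2 x 2 tensor.
z = np.array([[[0.0, 2.0], [4.0, 1.0]],
              [[2.0, 0.0], [1.0, 4.0]]])
z_squared = hadamard_power(z, 2)
z_inverse = hadamard_power(z, -1)
```

Negative powers emphasize short-range relationships (small values become large), while positive powers emphasize long-range ones.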
Figure 7. Application of probabilistic transformations to the Three-Tuple-(Dis)Similarity Matrix (TDSM). The simple stochastic (SS) transformation consists of dividing every element of a 2D matrix by the sum of all elements in that 2D matrix. The mutual probability procedure consists of dividing every element of a 2D matrix by the sum of all elements in the tensor (3D matrix).
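Both transformations described in Figure 7 are simple divisions by a sum and can be sketched as below. Taking the 2D matrices to be the slices along the first tensor index is an assumption of this sketch, as are the function names and the sample tensor.

```python
import numpy as np

def simple_stochastic(z):
    """Divide each 2D slice z[i] by the sum of its own elements."""
    z = np.asarray(z, dtype=float)
    slice_sums = z.sum(axis=(1, 2), keepdims=True)
    # Slices that sum to zero are left as zeros.
    return np.divide(z, slice_sums, out=np.zeros_like(z), where=slice_sums != 0)

def mutual_probability(z):
    """Divide every element by the sum of all elements in the tensor."""
    z = np.asarray(z, dtype=float)
    total = z.sum()
    return z / total if total != 0 else np.zeros_like(z)

# Hypothetical 2 x 2 x 2 tensor with entries 1..8.
z = np.arange(1.0, 9.0).reshape(2, 2, 2)
ss = simple_stochastic(z)
mp = mutual_probability(z)
```

After the simple stochastic transformation each slice sums to one; after the mutual probability transformation the whole tensor sums to one.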
Best models obtained for the folding rate prediction of 96 proteins using these novel molecular descriptors.
| Model | Q²(LOO) | Q²(BOOT) | SDEP | Q²(EXT), with outliers | SDEP(EXT), with outliers | Q²(EXT), without outliers | SDEP(EXT), without outliers |
|---|---|---|---|---|---|---|---|
| | 77.79 | 76.57 | 2.035 | 34.16 | 3.180 | 82.37 | 2.938 |
| | 74.80 | 73.83 | 2.167 | 32.28 | 3.170 | 85.75 | 2.786 |
| | 77.69 | 77.62 | 2.0392 | 60.87 | 2.387 | 79.57 | 2.964 |
| | 79.70 | 79.26 | 1.9454 | 55.57 | 2.556 | 78.19 | 2.606 |
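The leave-one-out statistics reported in this table can be illustrated for an ordinary least-squares model. The sketch assumes the commonly used definitions Q² = 1 − PRESS/TSS (expressed as a percentage) and SDEP = sqrt(PRESS/n); the record does not state the exact definitions used by the authors, and the name `loo_q2_sdep` and the toy data are hypothetical.

```python
import numpy as np

def loo_q2_sdep(X, y):
    """Leave-one-out Q2 (%) and SDEP for a linear model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        # Fit on all samples except i, then predict sample i.
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    tss = np.sum((y - y.mean()) ** 2)
    q2 = 100.0 * (1.0 - press / tss)
    sdep = np.sqrt(press / n)
    return float(q2), float(sdep)

# Hypothetical, perfectly linear toy data: y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
q2, sdep = loo_q2_sdep(x, 2.0 * x + 1.0)
```

For perfectly linear data every held-out sample is predicted exactly, so Q² approaches 100% and SDEP approaches zero.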
Comparison of the training and test sets' folding rate statistical parameters for several existing molecular descriptors for proteins against this approach.
| Descriptors/Models | Descriptor Dimension | Cutoff Length | Q² (%) | SDEP | Q² (%) | SDEP |
|---|---|---|---|---|---|---|
| Folding degree | 3D | — | 73.96 | 2.20 | 54.76 | 2.03 |
| Long Range Order | 3D | 4 | 72.25 | 2.28 | — | — |
| Contact order | 3D | 2 | 73.96 | 2.19 | — | — |
| Total Contact Distance | 3D | 2 | 73.96 | 2.21 | — | — |
| FoldRate web server | 1D | * | 77.44 | 2.03 | — | — |
| | 3D | — | 79.70 | 1.95 | 78.19 | 2.60 |
| | 3D | — | 74.80 | 2.17 | 87.52 | 2.06 |
*Model constructed with an ensemble of mathematical equations.
Best models obtained for the protein secondary structural classification of 204 proteins using these novel MDs.
| Model | Representation | Number of Variables | Correct Classification (%), Training (149) | MCC, Training | Correct Classification (%), Test (55) | MCC, Test |
|---|---|---|---|---|---|---|
| | Cβ | 16 | 98.65 | 0.962 | 92.59 | 0.777 |
| | AVG, Cβ | 19 | 95.97 | 0.884 | 89.09 | 0.718 |
| | AB, Cβ, AVG | 13 | 99.33 | 0.981 | 96.36 | 0.893 |
| | AB, Cβ, AVG | 9 | 99.33 | 0.981 | 98.18 | 0.943 |
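The MCC columns report the Matthews correlation coefficient. As a hedged illustration, the sketch below computes the standard binary form from confusion-matrix counts; whether the authors used this binary form or a multi-class generalization is not stated in this record, and the function name and counts are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (undefined denominator).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts: perfect, chance-level and fully inverted classifiers.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=5, fn=5)
```

Unlike the correct classification percentage, MCC stays informative when the classes are unbalanced, which is why both are usually reported together.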
Comparison of the correct classification percentages for protein structural classification obtained with several existing molecular descriptors against this approach.
| Descriptors/Models | Correct Classification (%), Training | Correct Classification (%), Test |
|---|---|---|
| AA composition | 83.80 | — |
| Pseudo AA composition | 91.20 | — |
| Pair coupled AA composition | 74.50 | — |
| PSI-BLAST | 94.10 | — |
| Bilinear descriptors | 92.60 | 92.70 |
| | 99.33 | 96.36 |
| | 99.33 | 98.18 |