David Medina-Ortiz, Sebastian Contreras, Juan Amado-Hinojosa, Jorge Torres-Almonacid, Juan A Asenjo, Marcelo Navarrete, Álvaro Olivera-Nappa.
Abstract
Computational methods in protein engineering often require encoding amino acid sequences, i.e., converting them into numeric arrays. Physicochemical properties are a typical choice to define encoders, where we replace each amino acid by its value for a given property. However, which property (or group thereof) is best for a given predictive task remains an open problem. In this work, we generalize property-based encoding strategies to maximize the performance of predictive models in protein engineering. First, combining text mining and unsupervised learning, we partitioned the AAIndex database into eight semantically consistent groups of properties. We then applied a non-linear PCA within each group to define a single encoder to represent it. Then, in several case studies, we assessed the performance of predictive models for protein and peptide function, folding, and biological activity, trained using the proposed encoders and classical methods (One Hot Encoder and TAPE embeddings). Models trained on datasets encoded with our encoders and converted to signals through the Fast Fourier Transform (FFT) increased their precision and reduced their overfitting substantially, outperforming classical approaches in most cases. Finally, we propose a preliminary methodology to create de novo sequences with desired properties. These results offer simple ways to increase the performance of general and complex predictive tasks in protein engineering without increasing their complexity.
Keywords: digital signal processing; fourier transform; machine learning; numerical representation strategies; predictive models; protein engineering
Year: 2022 PMID: 35911960 PMCID: PMC9329607 DOI: 10.3389/fmolb.2022.898627
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
FIGURE 1. The AAIndex database of amino acid physicochemical properties can be split into eight semantically consistent groups. (A) Combining doc2vec strategies with unsupervised learning algorithms, we proposed a methodology to generate groups that preserve semantic consistency within the partition. Applying an RBF kernel PCA to the whole dataset, we observe that the groups are linearly separable in the PCA1/PCA2 space, as their convex hulls are disjoint. (B,C) Combining our encoders with FFT improves model performance and helps reduce overfitting in several predictive tasks. Here, boxplots summarize the distribution of performances reached in each experiment across the 1,000 independent realizations of the 80/20 split of the input dataset for the task. Central circles represent medians, bars the interquartile range, and whiskers the 95% CI. Complementary analyses of model performance, including other metrics (such as recall, F-score, and area under the receiver operating characteristic curve, AUC), are presented in Supplementary Section S3 and summarized in Supplementary Tables S3–S6.
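The evaluation protocol described in the caption (many independent 80/20 splits, summarized by median, interquartile range, and a 95% interval) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline. The function names `repeated_splits` and `summarize` are hypothetical, and reading the "95% CI" as the 2.5th to 97.5th percentile range of the scores is an assumption.

```python
import random
import statistics


def repeated_splits(n_samples, n_repeats=1000, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for independent random splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set, the rest the training set.
        yield shuffled[n_test:], shuffled[:n_test]


def summarize(scores):
    """Summarize a list of per-split scores (at least two values)."""
    quartiles = statistics.quantiles(scores, n=4, method="inclusive")
    # n=40 gives cut points every 2.5%, so the first and last are the
    # 2.5th and 97.5th percentiles (assumed meaning of the 95% interval).
    steps = statistics.quantiles(scores, n=40, method="inclusive")
    return {
        "median": quartiles[1],
        "iqr": (quartiles[0], quartiles[2]),
        "interval_95": (steps[0], steps[-1]),
    }
```

In use, one would train and score a model on each `(train, test)` pair, collect the scores in a list, and pass that list to `summarize`; the model and metric are left open here because the caption does not specify them.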
Generalized property-based encoders for amino acids.
| Amino acid |
|
| Hydrophobicity | Volume | Energy | Hydropathy | Secondary structure | Other indexes |
|---|---|---|---|---|---|---|---|---|
| A | 290.41 | 71.85 | 6.25 | 44.65 | −107.79 | 15.33 | 56.16 | 92.92 |
| R | 172.57 | −6.96 | 84.09 | 200.15 | 51.15 | 172.36 | 1.44 | −37.39 |
| N | −38.37 | −90.14 | −21.73 | −191.18 | 73.94 | −259.13 | −54.69 | −77.74 |
| D | 159.43 | −56.58 | −28.96 | −232.26 | 55.36 | −216.01 | −29.38 | −7.42 |
| C | −4.24 | 15.67 | −34.88 | −156.21 | −54.19 | −242.01 | 10.07 | 40.04 |
| Q | −268.55 | −32.61 | 38.46 | 179.88 | 31.44 | 145.73 | −15.43 | −45.52 |
| E | −0.02 | 21.03 | −21.48 | −170.44 | −49.97 | 8.11 | 20.20 | 50.74 |
| G | −104.49 | −62.33 | 53.16 | 250.66 | 92.25 | 256.52 | −39.89 | −95.41 |
| H | −159.87 | 31.27 | −69.67 | 194.47 | −39.54 | 455.61 | 34.12 | 43.37 |
| I | −34.08 | 164.64 | −54.85 | −88.56 | −48.44 | −274.76 | 25.05 | 52.40 |
| L | −91.11 | −16.38 | −64.98 | −201.08 | 7.56 | −257.27 | −10.20 | 4.27 |
| K | 195.59 | 54.45 | −52.92 | −118.84 | −109.99 | −136.28 | 55.31 | 85.66 |
| M | 21.94 | −18.77 | −26.70 | −227.61 | −7.39 | −139.71 | −19.45 | 16.04 |
| F | 88.02 | 21.61 | −21.46 | −78.96 | −56.97 | 80.68 | 30.31 | 46.42 |
| P | 317.10 | 115.37 | −22.23 | −44.80 | −157.63 | −126.45 | 95.69 | 136.09 |
| S | −314.20 | −106.56 | 61.31 | 221.12 | 174.08 | 248.05 | −85.57 | −122.66 |
| T | −252.51 | −23.99 | 13.72 | −3.30 | 17.50 | −153.13 | −25.56 | −31.46 |
| W | −118.15 | −76.02 | 88.28 | 34.80 | 105.47 | 19.24 | −59.91 | −124.49 |
| Y | −10.20 | −15.49 | 40.85 | 203.07 | 36.61 | 171.61 | −4.25 | −33.07 |
| V | 150.75 | 9.929 | 33.77 | 184.45 | −13.45 | 231.50 | 15.99 | 7.21 |
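As a rough illustration of how one encoder column and the FFT combine, the sketch below maps a sequence to numbers, zero-pads it to a fixed length N, and computes a one-sided magnitude spectrum. The values are copied from the first numeric column of the table above; no assumption is made about which property group that column represents. The function names, the zero-padding strategy, and the unscaled magnitudes are assumptions made for illustration, and the small radix-2 FFT is written out only to keep the example dependency-free (a numerical library would normally be used).

```python
import cmath

# First numeric column of the table above (property group not assumed).
ENCODER = {
    "A": 290.41, "R": 172.57, "N": -38.37, "D": 159.43, "C": -4.24,
    "Q": -268.55, "E": -0.02, "G": -104.49, "H": -159.87, "I": -34.08,
    "L": -91.11, "K": 195.59, "M": 21.94, "F": 88.02, "P": 317.10,
    "S": -314.20, "T": -252.51, "W": -118.15, "Y": -10.20, "V": 150.75,
}


def encode(sequence, encoder=ENCODER, n=128):
    """Replace each residue by its encoder value and zero-pad to length n."""
    values = [encoder[aa] for aa in sequence.upper()]
    if len(values) > n:
        raise ValueError("sequence is longer than the padded length n")
    return values + [0.0] * (n - len(values))


def fft(signal):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(signal)
    if n == 1:
        return [complex(signal[0])]
    if n % 2:
        raise ValueError("signal length must be a power of two")
    even = fft(signal[0::2])
    odd = fft(signal[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrum(sequence, n=128):
    """Return (frequencies, magnitudes) for the first n // 2 Fourier bins."""
    coeffs = fft(encode(sequence, n=n))
    half = n // 2
    freqs = [k / n for k in range(half)]
    mags = [abs(c) for c in coeffs[:half]]
    return freqs, mags
```

For example, `spectrum("ACDKLV", n=128)` returns 64 frequencies and 64 magnitudes; sequences of different lengths padded to the same `n` yield spectra of equal size, which is what makes them usable as fixed-length model inputs.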
FIGURE 2. The combination of our encoders with FFT unveils frequency profiles associated with specific protein folding and functions. We used the secondary structure encoder combined with FFT to create profiles related to folding and protein function. (A) Fourier spectra for two enzyme families (hydrolases and ligases) in a dataset of enzyme families. (B,C) Fourier spectra of the same family separated by folding, showing that our methodology is sensitive to apparent differences between alpha and beta folding types. (D) Fourier spectra for alpha and beta folding in a dataset of different protein families. (E,F) Fourier spectra of the same folding separated by protein family, showing that our methodology is sensitive to proteins with the same folding but belonging to different families. N for frequency normalization = 1,024.
FIGURE 3. Fourier spectra of encoded amino acid sequences with different activities are visually separated. Subfigures show the Fourier spectra of different peptide sequences, encoded according to the groups of properties proposed in this article and represented in panels (A–I). We analyze two types of peptides: antimicrobial (AMPs) and non-antimicrobial (nonAMPs). AMPs are further divided into five categories: Antibacterial Peptides (AB), Anticancer Peptides (AC), Antifungal Peptides (AF), Anti-HIV Peptides (AHIV), and Antiviral Peptides (AV). The signals analyzed show a clear differentiation between AMPs and nonAMPs. N for frequency normalization = 128.