Alla P Toropova, Maria Raškova, Ivan Raška, Andrey A Toropov.
Abstract
An algorithm is proposed for building a model of the biological activity of peptides as a mathematical function of their amino acid sequence. The general scheme is as follows. The available data are distributed into an active training set, a passive training set, a calibration set, and a validation set. The training sets (active and passive) and the calibration set are used together to build the model of biological activity, in which each amino acid receives a correlation weight. The correlation weights are calculated by the Monte Carlo method using the CORAL software (http://www.insilico.eu/coral). The target function is designed to give the best result for the calibration set, not for the training set. The final check of the model is carried out on the validation set, whose peptides are not visible while the model is built. The computational experiments described confirm that the approach can serve as a tool for designing predictive models of the biological activity of peptides (expressed as pIC50).
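The distribution of the data into four sets can be sketched as follows. This is a minimal illustration, assuming a uniform random shuffle over the 40 peptide IDs (P01 to P40) and ten peptides per set, as in the tables of this article; the function name `split_four_ways` and the seeds are hypothetical, and the splitting procedure actually used with CORAL may differ.

```python
import random

SET_NAMES = ("active_training", "passive_training", "calibration", "validation")


def split_four_ways(peptide_ids, seed):
    """Shuffle the IDs and cut them into four equally sized, disjoint sets.

    Assumption: a plain uniform shuffle; the source only says the splits are random.
    """
    ids = list(peptide_ids)
    if len(ids) % len(SET_NAMES) != 0:
        raise ValueError("number of peptides must be divisible by four")
    rng = random.Random(seed)  # seeded so that a split can be reproduced
    rng.shuffle(ids)
    size = len(ids) // len(SET_NAMES)
    return {
        name: sorted(ids[i * size:(i + 1) * size])
        for i, name in enumerate(SET_NAMES)
    }


# Forty peptide IDs, matching the P01..P40 labels used in the results table.
peptides = [f"P{i:02d}" for i in range(1, 41)]

# Three random splits; the seeds 1, 2, 3 are placeholders, not the authors' values.
splits = [split_four_ways(peptides, seed) for seed in (1, 2, 3)]
```

Only the first three sets take part in building a model; the validation IDs are held back for the final check.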
Keywords: Amino acid; Index of ideality of correlation; Monte Carlo method; Peptide; QSAR
Year: 2021 PMID: 33500680 PMCID: PMC7820519 DOI: 10.1007/s00214-020-02707-8
Source DB: PubMed Journal: Theor Chem Acc ISSN: 1432-2234 Impact factor: 1.702
1-letter codes for amino acids
| Amino acid | 1-letter code |
|---|---|
| Alanine | A |
| Arginine | R |
| Asparagine | N |
| Aspartic acid | D |
| Cysteine | C |
| Glutamic acid | E |
| Glutamine | Q |
| Glycine | G |
| Histidine | H |
| Isoleucine | I |
| Leucine | L |
| Lysine | K |
| Methionine | M |
| Phenylalanine | F |
| Proline | P |
| Serine | S |
| Threonine | T |
| Tryptophan | W |
| Tyrosine | Y |
| Valine | V |
Fig. 1 Histories of the Monte Carlo optimization (Split 1) with target functions TF and TF
Statistical quality of models for three random splits
| Split | Set | R² | | IIC | RMSE |
|---|---|---|---|---|---|
| Optimization with | | | | | |
| 1 | Active training | 0.7625 | 0.5558 | 0.8732 | 0.360 |
| | Passive training | 0.8250 | 0.7065 | 0.6739 | 0.395 |
| | Calibration | 0.6012 | 0.4017 | 0.3695 | 0.506 |
| | Validation | 0.6220 | 0.4816 | | 0.490 |
| 2 | Active training | 0.8205 | 0.7052 | 0.9058 | 0.333 |
| | Passive training | 0.9165 | 0.8301 | 0.4709 | 0.374 |
| | Calibration | 0.5223 | 0.2836 | 0.4258 | 0.592 |
| | Validation | 0.5481 | 0.3476 | | 0.515 |
| 3 | Active training | 0.8846 | 0.8229 | 0.9406 | 0.265 |
| | Passive training | 0.7283 | 0.5982 | 0.8264 | 0.599 |
| | Calibration | 0.5053 | 0.2612 | 0.3745 | 0.927 |
| | Validation | 0.5900 | 0.3277 | | 0.700 |
| Optimization with | | | | | |
| 1 | Active training | 0.6416 | 0.3506 | 0.5340 | 0.442 |
| | Passive training | 0.7231 | 0.5868 | 0.4120 | 0.507 |
| | Calibration | 0.9486 | 0.9157 | 0.9679 | 0.142 |
| | Validation | 0.7766 | 0.6298 | | 0.306 |
| 2 | Active training | 0.6976 | 0.4905 | 0.5568 | 0.432 |
| | Passive training | 0.9543 | 0.9192 | 0.8516 | 0.332 |
| | Calibration | 0.7102 | 0.5447 | 0.8406 | 0.337 |
| | Validation | 0.7856 | 0.6596 | | 0.270 |
| 3 | Active training | 0.5326 | 0.1846 | 0.7298 | 0.533 |
| | Passive training | 0.8128 | 0.6796 | 0.6251 | 0.562 |
| | Calibration | 0.8743 | 0.8139 | 0.8827 | 0.214 |
| | Validation | | | | |
Each set contains ten peptides
The best model is indicated by bold
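The index of ideality of correlation (IIC) named in the keywords can be illustrated with a short sketch. It assumes the definition commonly given for the IIC in CORAL-based studies, namely the correlation coefficient multiplied by the ratio of the smaller to the larger of the two mean absolute errors taken separately over negative and non-negative residuals; this page does not state the formula, so treat it as an assumption. The data are the experimental and calculated pIC50 values of the ten calibration ("#") peptides of split 3 from the table of experimental and calculated values; the function names are hypothetical.

```python
from math import sqrt

# Calibration ("#") peptides of split 3: P03, P04, P06, P09, P10, P13, P16, P20, P32, P37.
observed = [6.8980, 7.0200, 7.0660, 7.3830, 7.3980, 7.6990, 7.9390, 8.3100, 7.6420, 8.2370]
calculated = [7.0501, 7.0763, 7.2142, 7.7299, 7.7693, 7.7162, 8.1381, 8.1109, 7.7330, 8.1099]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def index_of_ideality(obs, calc):
    """IIC = r * min(MAE-, MAE+) / max(MAE-, MAE+)  (assumed definition).

    MAE- is the mean absolute error over residuals (obs - calc) below zero,
    MAE+ the mean absolute error over residuals that are zero or positive.
    """
    residuals = [o - c for o, c in zip(obs, calc)]
    negative = [abs(d) for d in residuals if d < 0]
    non_negative = [abs(d) for d in residuals if d >= 0]
    if not negative or not non_negative:
        return 0.0  # one-sided residuals: treated here as zero ideality
    mae_neg = sum(negative) / len(negative)
    mae_pos = sum(non_negative) / len(non_negative)
    if max(mae_neg, mae_pos) == 0:
        return 0.0
    return pearson_r(obs, calc) * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)


iic_calibration = index_of_ideality(observed, calculated)
r2_calibration = pearson_r(observed, calculated) ** 2
```

Under this assumed definition the two quantities come out close to the 0.8827 and 0.8743 listed for the calibration set of split 3 in the second optimization block of the table above.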
Amino acids which are promoters of increase / decrease for pIC50 for examined peptides
| Comment | Amino acid | | | | NAT | NPT | NC | |
|---|---|---|---|---|---|---|---|---|
| Increase | V | 0.47695 | 0.30991 | 0.26611 | 10 | 10 | 10 | 0.0000 |
| | L | 1.29542 | 0.73164 | 0.31587 | 8 | 5 | 7 | 0.0067 |
| | F | 1.07326 | 0.70770 | 0.37614 | 6 | 6 | 7 | 0.0077 |
| | I | 0.76211 | 0.16717 | 0.34684 | 6 | 3 | 4 | 0.0200 |
| | A | 0.54686 | 0.01821 | 0.06304 | 4 | 3 | 2 | 0.0333 |
| | G | 0.44966 | 0.52819 | 0.73395 | 4 | 5 | 4 | 0.0000 |
| | Y | 1.46411 | 0.65332 | 0.40546 | 4 | 5 | 5 | 0.0111 |
| | M | 1.29967 | 0.55126 | 0.39601 | 2 | 0 | 3 | 0.0200 |
| Decrease | T | −0.26044 | −0.28480 | −0.34702 | 6 | 9 | 6 | 0.0000 |
| | E | −0.62472 | −0.62778 | −0.55954 | 1 | 3 | 1 | 0.0000 |
NAT, NPT, and NC are the frequencies of an amino acid in the active training set, passive training set, and the calibration set, respectively
Experimental pIC50 and pIC50 calculated with Eq. 17 for the model obtained with split 3 (the best model). Set indicators: "+" is the active training set; "–" is the passive training set; "#" is the calibration set; and "*" is the validation set
| Set | ID | Sequence of amino acids | DCW(1,15) | pIC50 Expr | pIC50 Calc | | Applicability |
|---|---|---|---|---|---|---|---|
| – | P01 | WLEPGPVTA | 1.98966 | 6.0820 | 6.9048 | 0.0754 | YES |
| – | P02 | ITSQVPFSV | 1.62921 | 6.1960 | 6.6291 | 0.1259 | YES |
| # | P03 | FLEPGPVTA | 2.17966 | 6.8980 | 7.0501 | 0.0485 | YES |
| # | P04 | ITAQVPFSV | 2.21389 | 7.0200 | 7.0763 | 0.1029 | YES |
| + | P05 | YLEPGPVTL | 2.98174 | 7.0580 | 7.6637 | 0.0421 | YES |
| # | P06 | YTDQVPFSV | 2.39417 | 7.0660 | 7.2142 | 0.0862 | YES |
| – | P07 | YLEPGPVTI | 2.21031 | 7.1870 | 7.0736 | 0.0754 | YES |
| * | P08 | YLEPGPVTV | 2.20698 | 7.3420 | 7.0710 | 0.0421 | YES |
| # | P09 | YLSPGPVTA | 3.06834 | 7.3830 | 7.7299 | 0.0651 | YES |
| # | P10 | IIDQVPFSV | 3.11987 | 7.3980 | 7.7693 | 0.1219 | YES |
| + | P11 | ITWQVPFSV | 1.93195 | 7.4630 | 6.8607 | 0.1529 | YES |
| + | P12 | ITYQVPFSV | 2.14528 | 7.4800 | 7.0238 | 0.1195 | YES |
| # | P13 | ILSQVPFSV | 3.05039 | 7.6990 | 7.7162 | 0.1117 | YES |
| – | P14 | IMDQVPFSV | 2.69191 | 7.7190 | 7.4420 | 0.0886 | YES |
| * | P15 | YLMPGPVTV | 3.23638 | 7.9320 | 7.8584 | 0.0421 | YES |
| # | P16 | WLDQVPFSV | 3.60203 | 7.9390 | 8.1381 | 0.1052 | YES |
| * | P17 | YLAPGPVTA | 3.65302 | 8.0320 | 8.1771 | 0.0421 | YES |
| + | P18 | YLYPGPVTV | 3.58840 | 8.0510 | 8.1277 | 0.0587 | YES |
| * | P19 | YLWPGPVTV | 3.37507 | 8.1250 | 7.9645 | 0.0921 | YES |
| # | P20 | ILYQVPFSV | 3.56646 | 8.3100 | 8.1109 | 0.1052 | YES |
| – | P21 | ILDQVPFSV | 3.89130 | 8.4810 | 8.3594 | 0.0886 | YES |
| – | P22 | YLFPGPVTA | 3.56108 | 8.4950 | 8.1068 | 0.0651 | YES |
| + | P23 | YLDQVPFSV | 3.81535 | 8.6380 | 8.3013 | 0.0719 | YES |
| – | P24 | ILFQVPFSV | 3.54314 | 8.6990 | 8.0931 | 0.1117 | YES |
| – | P25 | ILWQVPFSV | 3.35313 | 8.7700 | 7.9477 | 0.1386 | YES |
| + | P26 | WTDQVPFSV | 2.18084 | 6.1450 | 7.0510 | 0.1195 | YES |
| * | P27 | YLEPGPVTA | 2.20298 | 6.6680 | 7.0680 | 0.0421 | YES |
| * | P28 | ITDQVPFSV | 2.47011 | 6.9470 | 7.2723 | 0.1029 | YES |
| * | P29 | ITFQVPFSV | 2.12196 | 7.1790 | 7.0060 | 0.1259 | YES |
| * | P30 | FTDQVPFSV | 2.37085 | 7.2120 | 7.1964 | 0.0926 | YES |
| – | P31 | ITMQVPFSV | 1.79326 | 7.3980 | 6.7546 | 0.1029 | YES |
| # | P32 | YLSPGPVTV | 3.07233 | 7.6420 | 7.7330 | 0.0651 | YES |
| + | P33 | YLYPGPVTA | 3.58440 | 7.7720 | 8.1246 | 0.0587 | YES |
| + | P34 | YLAPGPVTV | 3.65702 | 7.8180 | 8.1802 | 0.0421 | YES |
| * | P35 | ILAQVPFSV | 3.63508 | 7.9390 | 8.1634 | 0.0886 | YES |
| * | P36 | ILMQVPFSV | 3.21445 | 8.1250 | 7.8417 | 0.0886 | YES |
| # | P37 | YLFPGPVTV | 3.56508 | 8.2370 | 8.1099 | 0.0651 | YES |
| – | P38 | YLMPGPVTA | 3.23239 | 8.3670 | 7.8554 | 0.0421 | YES |
| + | P39 | YLWPGPVTA | 3.37107 | 8.4950 | 7.9615 | 0.0921 | YES |
| + | P40 | FLDQVPFSV | 3.79203 | 8.6580 | 8.2835 | 0.0783 | YES |
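The statistical characteristics of a set can be recomputed from the rows of this table. The sketch below does so for the ten active training ("+") peptides, assuming that the determination coefficient is the squared Pearson correlation between experimental and calculated pIC50, and that the RMSE uses an n − 1 denominator. Neither formula is stated on this page; the n − 1 choice is an assumption adopted because it reproduces the value listed for this set in the statistics table. Function names are hypothetical.

```python
from math import sqrt

# Active training ("+") peptides: P05, P11, P12, P18, P23, P26, P33, P34, P39, P40.
observed = [7.0580, 7.4630, 7.4800, 8.0510, 8.6380, 6.1450, 7.7720, 7.8180, 8.4950, 8.6580]
calculated = [7.6637, 6.8607, 7.0238, 8.1277, 8.3013, 7.0510, 8.1246, 8.1802, 7.9615, 8.2835]


def r_squared(obs, calc):
    """Squared Pearson correlation between observed and calculated values."""
    n = len(obs)
    mean_o, mean_c = sum(obs) / n, sum(calc) / n
    soc = sum((o - mean_o) * (c - mean_c) for o, c in zip(obs, calc))
    soo = sum((o - mean_o) ** 2 for o in obs)
    scc = sum((c - mean_c) ** 2 for c in calc)
    return soc * soc / (soo * scc)


def rmse(obs, calc, ddof=1):
    """Root mean squared error; ddof=1 divides by n - 1 (assumed convention)."""
    squared_errors = [(o - c) ** 2 for o, c in zip(obs, calc)]
    return sqrt(sum(squared_errors) / (len(squared_errors) - ddof))


r2_active = r_squared(observed, calculated)
rmse_active = rmse(observed, calculated)
```

With these inputs the results agree, to rounding, with the 0.5326 and 0.533 given for the active training set of split 3 in the second optimization block of the statistics table.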
Numerical data on the correlation weights to calculate model with Eq. 17
| Amino acid | Correlation weight | NAT | NPT | NC | |
|---|---|---|---|---|---|
| A | 0.42063 | 3 | 3 | 3 | 0.0000 |
| D | 0.67685 | 3 | 2 | 3 | 0.0000 |
| E | −1.02940 | 1 | 2 | 1 | 0.0000 |
| F | 0.32870 | 5 | 7 | 8 | 0.0231 |
| G | 0.17820 | 5 | 4 | 4 | 0.0111 |
| I | 0.42796 | 2 | 7 | 4 | 0.0333 |
| L | 1.19938 | 7 | 7 | 7 | 0.0000 |
| M | 0.0 | 0 | 3 | 0 | 0.0000 |
| P | 0.43966 | 10 | 10 | 10 | 0.0000 |
| Q | 0.13354 | 5 | 6 | 6 | 0.0091 |
| S | −0.16405 | 5 | 6 | 8 | 0.0231 |
| T | −0.22180 | 8 | 6 | 6 | 0.0143 |
| V | 0.42463 | 10 | 10 | 10 | 0.0000 |
| W | 0.13869 | 3 | 2 | 1 | 0.0500 |
| Y | 0.35202 | 7 | 3 | 5 | 0.0167 |
NAT, NPT, and NC are the frequencies of an amino acid in the active training set, passive training set, and the calibration set, respectively
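The descriptor DCW(1,15) of a peptide can be obtained from these weights by summing the correlation weight of every residue in the sequence, which is what the worked example for WLEPGPVTA does. A minimal sketch, in which the dictionary and function names are hypothetical and the weights are copied from the table above:

```python
# Correlation weights of the amino acids for the split 3 model (from the table above).
CORRELATION_WEIGHTS = {
    "A": 0.42063, "D": 0.67685, "E": -1.02940, "F": 0.32870, "G": 0.17820,
    "I": 0.42796, "L": 1.19938, "M": 0.0, "P": 0.43966, "Q": 0.13354,
    "S": -0.16405, "T": -0.22180, "V": 0.42463, "W": 0.13869, "Y": 0.35202,
}


def dcw(sequence, weights=CORRELATION_WEIGHTS):
    """Sum the correlation weights over all residues of a 1-letter-code sequence.

    Residues without a tabulated weight (for example K or H) raise KeyError,
    because the table gives no value for them.
    """
    return sum(weights[residue] for residue in sequence)


dcw_p01 = dcw("WLEPGPVTA")  # peptide P01
dcw_p21 = dcw("ILDQVPFSV")  # peptide P21
```

Both sums reproduce the descriptor values listed for P01 and P21 in the table of experimental and calculated pIC50 (1.98966 and 3.89130) to within rounding of the tabulated weights.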
Calculation of DCW(1,15) and pIC50 for epitope-peptide = WLEPGPVTA
| Amino acid | Correlation weight |
|---|---|
| W | 0.13869 |
| L | 1.19938 |
| E | −1.02940 |
| P | 0.43966 |
| G | 0.17820 |
| P | 0.43966 |
| V | 0.42463 |
| T | −0.22180 |
| A | 0.42063 |

DCW(1,15) = 1.98966; pIC50 = 6.9048
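Eq. 17 itself is not reproduced on this page, but the tabulated values are consistent with a linear dependence of the calculated pIC50 on DCW(1,15). The sketch below recovers an approximate line from two rows of the results table (P01 and P21) and checks it against a third (P05). The coefficients obtained this way are estimates from rounded table entries, not the published coefficients of Eq. 17, and the function name is hypothetical.

```python
def line_through(point_a, point_b):
    """Return (intercept, slope) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


# (DCW(1,15), calculated pIC50) pairs taken from the results table.
p01 = (1.98966, 6.9048)  # WLEPGPVTA
p21 = (3.89130, 8.3594)  # ILDQVPFSV

# Estimated coefficients; an assumption-based reconstruction, not Eq. 17 itself.
intercept, slope = line_through(p01, p21)

# Cross-check on a row not used for the estimate: P05 (YLEPGPVTL), DCW = 2.98174.
predicted_p05 = intercept + slope * 2.98174
```

The prediction for P05 agrees with the tabulated calculated pIC50 of 7.6637 to about three decimal places, which supports reading the model as a one-variable linear equation in DCW(1,15).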