| Literature DB >> 33350829 |
Rainier Barrett, Andrew D. White.
Abstract
Often the development of novel functional peptides is not amenable to high throughput or purely computational screening methods. Peptides must be synthesized one at a time in a process that does not generate large amounts of data. One way this method can be improved is by ensuring that each experiment provides the best improvement in both peptide properties and predictive modeling accuracy. Here, we study the effectiveness of active learning, optimizing experiment order, and meta-learning, transferring knowledge between contexts, to reduce the number of experiments necessary to build a predictive model. We present a multitask benchmark database of peptides designed to advance these methods for experimental design. Each task is a binary classification of peptides represented as a sequence string. We find neither active learning method tested to be better than random choice. The meta-learning method Reptile was found to improve the average accuracy across data sets. Combining meta-learning with active learning offers inconsistent benefits.
Year: 2020 PMID: 33350829 PMCID: PMC7842147 DOI: 10.1021/acs.jcim.0c00946
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
Positive and Negative Examples Chosen for Training the Classifiers
| Positive data set | Size | Negative data sets |
|---|---|---|
| antibacterial | 2079 | shp2 |
| anticancer | 183 | shp2, tula2, insoluble, antifungal, anti-HIV, antiparasital, antibacterial |
| antifungal | 891 | shp2, tula2, insoluble, anti-HIV, anticancer, antibacterial, scrambled |
| anti-HIV | 87 | shp2, tula2, insoluble, antifungal, anticancer, antiparasital, antibacterial, scrambled |
| anti-MRSA | 119 | shp2, tula2, insoluble, anti-HIV, anticancer, antiparasital, scrambled |
| antiparasital | 90 | shp2, tula2, insoluble, anti-HIV, anticancer, scrambled |
| antiviral | 150 | shp2, tula2, insoluble, antifungal, anticancer, antiparasital, antibacterial, scrambled |
| hemolytic | 253 | shp2, tula2, insoluble, human |
| soluble | 7028 | insoluble |
| shp2 | 120 | scrambled |
| tula2 | 65 | scrambled |
| human | 2880 | insoluble, hemolytic, scrambled |
Negative data sets were sampled to be the same size as the positive data sets.
Antibacterial peptides are exceedingly rare among arbitrary sequences, so the SHP-2/Tula-2 binding peptides are assumed not to be antibacterial and can serve as negative examples.
Insoluble peptides cannot be successful antibacterial peptides.
Antifungal and antibacterial activity are different mechanisms, so it is unlikely that a given peptide is both antifungal and antibacterial.
It is known that antimicrobial and anticancer peptides often have a similar method of action, and there is significant overlap between the two data sets. Including these data sets in one another’s negative example sets might be expected to reduce overall model accuracy. This conservative choice of data set still resulted in accuracy near baseline on these two tasks in the context of meta-learning and active learning.
It is assumed that peptide fragments found on the surfaces of human proteins are not hemolytic (i.e., do not kill red blood cells).
No scrambled data set is necessary because there are known negative examples.
SHP-2 and Tula-2 binding is classified in a fixed-length peptide context, so only scrambled peptides of the same length are used for classification.
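The table's sampling scheme (negative examples drawn from the pooled negative data sets to match the positive set size) can be sketched as below. The function and variable names are illustrative, not taken from the paper's code.

```python
import random

def build_training_set(positives, negative_pools, seed=0):
    """Sample negatives from the pooled negative data sets so the
    negative class has the same size as the positive class."""
    rng = random.Random(seed)
    pool = [p for ds in negative_pools for p in ds]
    negatives = rng.sample(pool, k=len(positives))
    X = positives + negatives
    y = [1] * len(positives) + [0] * len(negatives)
    return X, y

# Toy example: three "positive" peptides and two negative pools.
X, y = build_training_set(["AKW", "GGR", "LLK"],
                          [["AAA", "CCC"], ["DDD", "EEE", "FFF"]])
```

Balancing the classes this way keeps a 50% accuracy baseline, which makes the accuracy curves in the figures directly comparable across tasks of very different sizes.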
Figure 1. Neural network structure for active learning. Here, Lmax is the maximum width of a peptide in the data set (although a convolution can use any length), K is the number of motif classes, and A is the length of the amino acid alphabet. Peptides are first translated to a one-hot encoding (Lmax × A) and a vector of normalized amino acid counts (1 × A). The output of the max pool layer is passed through one fully connected layer with ReLU activation; then, amino acid counts are appended to the output. This is then passed into two more fully connected layers with a final output dimension of 2 for positive and negative class labels. Labels below neural network layers indicate the dimensionality of the data as it passes through the layer.
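The input featurization described in Figure 1 (one-hot encoding plus normalized amino acid counts) can be sketched as follows, assuming the standard 20-letter amino acid alphabet (A = 20); the paper's alphabet or padding conventions may differ in detail.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, so A = 20
A = len(ALPHABET)

def featurize(seq, l_max):
    """One-hot encode a peptide (l_max x A, zero-padded) and compute
    its normalized amino-acid count vector (1 x A), as in Figure 1."""
    idx = {aa: i for i, aa in enumerate(ALPHABET)}
    one_hot = [[0] * A for _ in range(l_max)]
    counts = [0.0] * A
    for pos, aa in enumerate(seq):
        one_hot[pos][idx[aa]] = 1
        counts[idx[aa]] += 1
    counts = [c / len(seq) for c in counts]
    return one_hot, counts

one_hot, counts = featurize("ACA", l_max=5)
```

Appending the count vector after the convolutional/max-pool stage lets the fully connected layers use global composition information alongside the motif features.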
Figure 2. Training curves of uncertainty minimization active learning compared with baseline (gray) trained across all data points and randomly choosing examples. The y-axis is accuracy on withheld data. Light traces are the 30 individual runs, and the dark trace is the median (only one set of traces is shown). Each run has a different train/withheld split and random number generator seeds. Each subplot is a different task, arranged in increasing order of number of training points from left to right, top to bottom.
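For binary classification, uncertainty-based selection amounts to querying the unlabeled example whose predicted probability is closest to 0.5 (equivalently, maximum prediction entropy). A minimal sketch, with an illustrative toy model rather than the paper's network:

```python
def pick_most_uncertain(unlabeled, predict_proba):
    """One active-learning step: choose the unlabeled peptide whose
    predicted probability of activity is closest to 0.5, i.e. where
    the model is most uncertain."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))

# Toy model: hypothetical predicted probabilities for three peptides.
proba = {"AK": 0.9, "AKWL": 0.48, "G": 0.1}
chosen = pick_most_uncertain(list(proba), proba.get)
# chosen is "AKWL", the example nearest p = 0.5
```

The queried example is then labeled (here, synthesized and assayed), added to the training set, and the model is retrained before the next query.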
Figure 4. Training curves of uncertainty minimization active learning with and without Reptile meta-learning compared with baseline (gray) trained across all data points and training with randomly chosen examples with and without meta-learning. The y-axis is accuracy on withheld data. Light traces are the 50 individual runs, and the dark trace is the median. Each run has a different train/withheld split and random number generator seeds. Each subplot is a different task which was withheld during meta-learning. Meta-learning offers inconsistent improvements, while active learning consistently offers no improvements over random, unless paired with meta-learning (although still not consistently).
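The Reptile meta-update used here is simple: train a copy of the parameters on one sampled task, then move the meta-parameters a fraction of the way toward the adapted parameters. A minimal sketch with an illustrative inner loop (the paper's inner optimizer and step sizes are not reproduced here):

```python
def reptile_step(theta, task_train_fn, outer_lr=0.1):
    """One Reptile meta-update: adapt a copy of the parameters on a
    sampled task to get phi, then take
        theta <- theta + outer_lr * (phi - theta)."""
    phi = task_train_fn(list(theta))  # inner-loop training on one task
    return [t + outer_lr * (p - t) for t, p in zip(theta, phi)]

# Toy inner loop that pulls parameters toward a task optimum (1, -1).
theta = [0.0, 0.0]
theta = reptile_step(theta, lambda w: [w[0] + 1.0, w[1] - 1.0])
# theta is now [0.1, -0.1]
```

Repeating this over many tasks yields an initialization from which few gradient steps suffice on a new, withheld task, which is why the figure evaluates tasks held out during meta-learning.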
Figure 3. Box-and-whisker plot comparing average accuracy values after 10 and 50 training examples across all 12 data sets for five different methods explored in this work. Asterisks indicate statistically significant differences (p ≤ 0.05) in mean accuracy, calculated using the Wilcoxon signed-rank test among the 12 average accuracy values compared between two methods. Meta-learning significantly improves few-shot performance over uncertainty minimization or QBC alone when combined with these active learning methods, but only ML+QBC shows significant performance improvement after 50 training examples. This indicates that meta-learning can be a good tool for increasing few-shot learning in settings where data is scarce. See Tables S7 and S8 for all possible p-value pairs.
Figure 5. Features across tasks for the baseline model. The bar plot shows the partial derivative of the probability of activity with respect to amino acid count, averaged across training data. This gives the importance of each amino acid for assigning the class label. The right side of the plots shows the maximum magnitude weight in the convolution, which roughly corresponds to the most attended motif in the sequence. The y-axis label on the right side shows the frequency of this motif across 50 training iterations. "Z" is the normalized length of the peptide.
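The importance score plotted in Figure 5 is a derivative of the predicted probability with respect to each normalized amino acid count. Numerically, it can be estimated by finite differences; the sketch below uses a hypothetical linear toy model in place of the paper's network.

```python
def count_importance(predict_proba, counts, eps=1e-4):
    """Finite-difference estimate of the partial derivative of the
    predicted probability of activity with respect to each
    normalized amino-acid count (the quantity shown in Figure 5)."""
    grads = []
    for i in range(len(counts)):
        bumped = list(counts)
        bumped[i] += eps
        grads.append((predict_proba(bumped) - predict_proba(counts)) / eps)
    return grads

# Toy linear model over two count features: p = 0.2*c0 + 0.5*c1.
model = lambda c: 0.2 * c[0] + 0.5 * c[1]
grads = count_importance(model, [0.3, 0.1])
```

Averaging these per-example gradients over the training set, as in the figure, gives a single importance value per amino acid.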
Figure 6. Training curves for different model choices, using the antibacterial data set. Baseline is the same in each panel, and the baseline subplot is the model as presented in the text. "No Motifs" has the convolution layers removed. "No Counts" has amino acid counts removed. "Abs Loss" uses an absolute difference loss instead of cross-entropy. "No Label Swap" means that labels were not swapped during meta-learning so that zero-shot accuracy is maximized. These results show that meta-learning is not always better but is consistently a good choice for few-shot learning.