Michał Burdukiewicz, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, Małgorzata Kotulska.
Abstract
Amyloids are proteins associated with several clinical disorders, including Alzheimer's disease and Creutzfeldt-Jakob disease. Despite their diversity, all amyloid proteins can undergo aggregation initiated by short segments called hot spots. To find the patterns defining the hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers. Since amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked against the most popular tools for the detection of amyloid peptides using an external data set and obtained the highest values of performance measures (AUC: 0.90, MCC: 0.63). Our results showed sequential patterns in the amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets, and lower flexibility of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that were previously confirmed experimentally. AmyloGram is available as a web server (http://smorfland.uni.wroc.pl/shiny/AmyloGram/) and as the R package AmyloGram. R scripts and data used to produce the results of this manuscript are available at http://github.com/michbur/AmyloGramAnalysis.
Year: 2017 PMID: 29021608 PMCID: PMC5636826 DOI: 10.1038/s41598-017-13210-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Characteristics of training and test data sets used in the cross-validation.
| Set | Sequence length (residues) | Status | Sequences | Hexapeptides |
|---|---|---|---|---|
| Training | 6 | Non-amyloid | 841 | 841 |
| Training | 6 | Amyloid | 247 | 247 |
| Training | 6–10 | Non-amyloid | 964 | 1412 |
| Training | 6–10 | Amyloid | 312 | 475 |
| Training | 6–15 | Non-amyloid | 992 | 1653 |
| Training | 6–15 | Amyloid | 342 | 720 |
| Test | 6 | Non-amyloid | 841 | 841 |
| Test | 6 | Amyloid | 247 | 247 |
| Test | 7–10 | Non-amyloid | 123 | 571 |
| Test | 7–10 | Amyloid | 65 | 228 |
| Test | 11–15 | Non-amyloid | 28 | 241 |
| Test | 11–15 | Amyloid | 30 | 245 |
| Test | 16–25 | Non-amyloid | 41 | 571 |
| Test | 16–25 | Amyloid | 55 | 778 |
We derived sequences of different lengths from the AmyLoad database (column ‘Sequences’) and extracted all possible overlapping hexapeptides from them (column ‘Hexapeptides’). Training data sets partially overlap (e.g. the set 6–10 also contains the sequences from the set 6). Test data sets never overlap.
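The hexapeptide extraction described above can be illustrated with a sliding window. This is a minimal Python sketch, not the authors' R code: the function name `overlapping_hexapeptides` and the example peptide are hypothetical, and the sketch does not address how the amyloid status is assigned to each hexapeptide.

```python
def overlapping_hexapeptides(sequence, width=6):
    """Return every overlapping window of `width` residues from a peptide.

    A peptide of length n yields n - width + 1 windows, so a hexapeptide
    yields exactly one window (itself), matching the rows of the table
    where 'Sequences' equals 'Hexapeptides'.
    """
    if len(sequence) < width:
        return []  # too short to contain a full window
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


# Hypothetical 8-residue peptide, used only for illustration.
peptide = "STVIIEGG"
print(overlapping_hexapeptides(peptide))
# ['STVIIE', 'TVIIEG', 'VIIEGG']
```

Because the windows overlap, longer peptides contribute several hexapeptides each, which is why the hexapeptide counts exceed the sequence counts for every set except length 6.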
Figure 1. The scheme of reduced alphabet generation and n-gram extraction from the studied peptide sequences. (A) Generation of 18,535 unique amino acid encodings using all possible combinations of 17 selected physicochemical properties. Amino acids (AA) are clustered into groups (ID) using a combination of various physicochemical properties (P1, P2, P3, P4, …). (B) Extraction of n-grams. (1) Extraction of overlapping hexapeptides from peptides with known amyloidogenicity status. (2) Encoding of the amino acids of the hexapeptides into the corresponding groups (reduced alphabet) using the alphabets generated in (A). (3) Extraction of encoded n-grams of different types: continuous n-grams with a length of 1 to 3 residues; gapped 2-grams with a gap of 1 to 3 residues; gapped 3-grams with a single gap between residues (not all possibilities are shown). (4) Selection of informative n-grams using the Quick Permutation Test (QuiPT). (5) Cross-validation of encodings using a random forest classifier trained on the informative n-grams.
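Step (3) of the scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `count_ngrams`, the `"_"` notation for gap positions, and the reading of "a single gap between residues" as one skipped position on either side of the middle element are all assumptions. The example string uses hypothetical digit labels for the groups of a reduced alphabet.

```python
from collections import Counter


def count_ngrams(encoded, gaps=()):
    """Count n-grams in an encoded sequence.

    `gaps` lists how many positions are skipped between consecutive
    elements, so n = len(gaps) + 1. An empty tuple gives 1-grams and
    (0,) gives continuous 2-grams.
    """
    # Offsets of the n-gram elements relative to the first element.
    offsets = [0]
    for gap in gaps:
        offsets.append(offsets[-1] + gap + 1)
    span = offsets[-1] + 1  # total number of positions the n-gram covers

    counts = Counter()
    for start in range(len(encoded) - span + 1):
        elements = [encoded[start + o] for o in offsets]
        # Render each skipped position as "_", e.g. "3_6" for a gap of 1.
        parts = [elements[0]]
        for gap, element in zip(gaps, elements[1:]):
            parts.append("_" * gap + element)
        counts["".join(parts)] += 1
    return counts


# n-gram types named in the caption, written as gap tuples (assumed reading).
NGRAM_TYPES = [
    (), (0,), (0, 0),    # continuous 1-, 2- and 3-grams
    (1,), (2,), (3,),    # 2-grams with a gap of 1 to 3 residues
    (1, 0), (0, 1),      # 3-grams with a single one-residue gap
]

encoded = "364436"  # hypothetical encoded hexapeptide
for gaps in NGRAM_TYPES:
    print(gaps, dict(count_ngrams(encoded, gaps)))
```

The resulting counts (or their binarized presence/absence) would form the feature vector passed to feature selection and the classifier; which of the two the authors used is not stated in this section.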
Figure 2. Distribution of mean AUC values of classifiers with various encodings for every possible combination of training and testing data sets with different sequence lengths. The left and right ends of the boxes correspond to the first and third quartiles. The bar inside each box represents the median. The gray circles correspond to the encodings with an AUC outside the 0.95 confidence interval.
The best-performing encoding.
| Subgroup ID | Amino acids |
|---|---|
| I | G |
| II | K, P, R |
| III | I, L, V |
| IV | F, W, Y |
| V | A, C, H, M |
| VI | D, E, N, Q, S, T |
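As an illustration of how the encoding in the table above maps a peptide onto the reduced alphabet, here is a minimal Python sketch. The group memberships are taken from the table; the digit labels 1 to 6 (standing in for subgroups I to VI), the names `BEST_ENCODING` and `encode`, and the example peptide are assumptions made for this example.

```python
# Subgroups I-VI from the table, labelled 1-6 here for compactness (assumed labels).
BEST_ENCODING = {
    "1": "G",
    "2": "KPR",
    "3": "ILV",
    "4": "FWY",
    "5": "ACHM",
    "6": "DENQST",
}

# Invert the table: one lookup entry per amino acid.
AA_TO_GROUP = {aa: group for group, aas in BEST_ENCODING.items() for aa in aas}


def encode(sequence):
    """Replace each amino acid by the label of its subgroup.

    Raises KeyError for symbols outside the 20 standard amino acids.
    """
    return "".join(AA_TO_GROUP[aa] for aa in sequence.upper())


print(encode("STVIIE"))  # hypothetical hexapeptide -> 663336
```

After encoding, peptides that differ only by substitutions within a subgroup (for example I, L, or V) become identical, which is what lets the classifier learn patterns of general properties rather than exact sequences.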
Figure 3. The frequency of important n-grams used by the best-performing classifier in amyloid and non-amyloid sequences. The amino acids possible at a given position of an n-gram are specified inside the brackets. X denotes any amino acid. The frequency was computed as the total number of occurrences divided by the number of possible n-grams of the same length. Open and closed circles denote experimentally validated n-grams occurring in motifs found in amyloidogenic and non-amyloidogenic sequences, respectively [30].
Figure 4. Similarity and AUC of the reduced alphabets studied in the cross-validation. The classifiers most similar to the best-performing classifier have the highest AUC values. The color of each square corresponds to the number of alphabets in its area.
Results of the benchmark on the pep424 data set for PASTA 2.0, FoldAmyloid, APPNN, AmyloGram, and classifiers trained on n-grams extracted using the full amino acid alphabet. The lengths of the training sequences are specified in the brackets.
| Classifier | AUC | MCC | Sensitivity | Specificity |
|---|---|---|---|---|
| AmyloGram (6) | 0.8856 | 0.6057 | 0.6779 | 0.9037 |
| full alphabet (6) | 0.8411 | 0.5427 | 0.4966 | |
| AmyloGram (6–10) | | | 0.8658 | 0.7889 |
| full alphabet (6–10) | 0.8581 | 0.5698 | 0.7517 | 0.8259 |
| AmyloGram (6–15) | 0.8728 | 0.5420 | | 0.6111 |
| full alphabet (6–15) | 0.8610 | 0.5490 | 0.8188 | 0.7519 |
| PASTA 2.0 | 0.8550 | 0.4291 | 0.3826 | 0.9519 |
| FoldAmyloid | 0.7351 | 0.4526 | 0.7517 | 0.7185 |
| APPNN | 0.8343 | 0.5823 | 0.8859 | 0.7222 |
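For readers unfamiliar with the threshold-based measures in the table, the sketch below shows the standard definitions of sensitivity, specificity, and the Matthews correlation coefficient (MCC) computed from a confusion matrix. The function name and the counts in the example are hypothetical and are not taken from the pep424 benchmark. AUC is omitted because it requires the continuous prediction scores rather than the confusion counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and MCC from confusion-matrix counts.

    tp/fn: amyloid peptides predicted as amyloid / non-amyloid.
    tn/fp: non-amyloid peptides predicted as non-amyloid / amyloid.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(metrics)
```

MCC summarizes all four confusion-matrix cells in one value, which makes it useful for comparing classifiers that trade sensitivity against specificity differently, as PASTA 2.0 and APPNN do in the table.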