Yoshinori Fukasawa, Kentaro Tomii.
Abstract
Proteins often work as oligomers or multimers in vivo. Therefore, elucidating their oligomeric or multimeric form (quaternary structure) is crucial for ascertaining their function. X-ray crystal structures of numerous proteins have accumulated, providing information related to their biological units. Extracting biological-unit information from protein crystal structures is a meaningful task for modern biology. Nevertheless, although many methods have been proposed for identifying biological units in protein crystal structures, it remains difficult to distinguish biological protein-protein interfaces from crystallographic ones. We therefore developed a simple but highly accurate classifier that infers biological units in protein crystal structures using large amounts of protein sequence information and a modern contact prediction method to exploit covariation signals (CSs) in proteins. We demonstrate that the proposed method is promising even for weak signals of biological interfaces. We also discuss the relation between classification accuracy and conservation of biological units, and illustrate how the selection of sequences included in multiple sequence alignments (MSAs) as sources for obtaining CSs affects the results. With increased amounts of sequence data, the proposed method is expected to become increasingly useful.
Year: 2019 PMID: 31471543 PMCID: PMC6717244 DOI: 10.1038/s41598-019-48913-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Comparison of the numbers of sequences included in MSAs. The X-axis shows the numbers of sequences in MSAs generated using the sequence database of 2011; the Y-axis shows those generated using the sequence database of 2016. Monomers in the Duarte dataset were compared. Each dot represents one monomer. Blue dots represent monomers for which both the 2011 and 2016 MSAs satisfy the diversity requirement of PSICOV. Green dots represent monomers for which only the 2016 MSA satisfies the requirement. Pink dots represent monomers for which neither the 2011 nor the 2016 MSA satisfies the requirement.
Table 1. Number of interfaces in each dataset and PSICOV criterion passing rates.
| Data source | Biological | Crystallographic | Applicable portion |
|---|---|---|---|
| Duarte | 72 (83) | 76 (82) | 89.7% |
| Bahadur | 105 (121) | 170 (185) | 89.9% |
| Zhu | 59 (74) | 93 (106) | 84.4% |
Numbers in cells represent the numbers of instances that passed the PSICOV criterion. Numbers in parentheses in cells are total numbers of respective classes in the datasets.
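The "Applicable portion" column appears to be the pooled fraction of interfaces (biological plus crystallographic) that passed the PSICOV criterion. The following minimal sketch recomputes it from the counts in the table under that assumption; the pooling formula and the name `applicable_portion` are inferred for illustration and are not stated in the source.

```python
# Counts copied from the table above as (passed, total) per class.
# Assumption: the applicable portion pools both classes before dividing.
DATASETS = {
    "Duarte": {"biological": (72, 83), "crystallographic": (76, 82)},
    "Bahadur": {"biological": (105, 121), "crystallographic": (170, 185)},
    "Zhu": {"biological": (59, 74), "crystallographic": (93, 106)},
}


def applicable_portion(counts):
    """Return the percentage of interfaces, pooled over classes, that passed."""
    passed = sum(p for p, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * passed / total


for name, counts in DATASETS.items():
    print(f"{name}: {applicable_portion(counts):.1f}%")
```

Under this assumption the computed values agree with the tabulated 89.7%, 89.9%, and 84.4%.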
Figure 2. Differences in CSs between biological (blue) and crystal (red) contacts. (a) Distributions of PSICOV scores. Filtered pairs (indirect coupling sites) are omitted. (b) Whisker plot of interface areas for the crystal and the biological contacts in the Duarte dataset. (c) Whisker plot of the number of interface contact pairs with PSICOV scores higher than the threshold of 0.4.
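The quantity plotted in panel (c), the number of interface contact pairs whose covariation score exceeds 0.4, can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout of the scores, and the choice to treat missing (for example, filtered) pairs as having no signal are assumptions, and both residues of a pair are assumed to be indexed in the same alignment.

```python
def count_strong_interface_pairs(interface_pairs, cs_scores, threshold=0.4):
    """Count interface residue pairs whose covariation score exceeds threshold.

    interface_pairs: iterable of (i, j) residue indices in contact across
        the interface.
    cs_scores: dict mapping (min(i, j), max(i, j)) to a score such as a
        PSICOV score (hypothetical layout).
    Pairs missing from cs_scores are counted as having no signal.
    """
    count = 0
    for i, j in interface_pairs:
        key = (min(i, j), max(i, j))  # scores are stored order-independently
        if cs_scores.get(key, 0.0) > threshold:
            count += 1
    return count


# Toy data, not taken from the paper.
toy_scores = {(10, 85): 0.72, (12, 90): 0.35, (15, 88): 0.41}
toy_pairs = [(85, 10), (12, 90), (15, 88), (20, 95)]
print(count_strong_interface_pairs(toy_pairs, toy_scores))
```

Such a count could then serve as a single CS-derived feature for an interface, alongside features such as the interface area.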
Figure 3. Feature and performance comparisons. (a) Feature importance quantified and ranked by F-score. (b) Receiver operating characteristic (ROC) curves of classifiers. Blue lines show the performance of classifiers using the selected known features without the CS feature. Orange lines show that of classifiers using both the selected features and the CS feature. Solid lines show the performance of RF models, whereas dashed lines show that of SVM models.
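For readers unfamiliar with how ROC curves such as those in panel (b) are obtained, the sketch below builds the curve by sweeping a decision threshold over classifier scores and integrates it with the trapezoidal rule. It is a generic illustration, not the authors' evaluation code; the label encoding (1 for biological, 0 for crystallographic) and the function names are assumptions, and both classes must be present in the input.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores; higher means more likely positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    idx = 0
    while idx < len(ranked):
        current = ranked[idx][0]
        # Consume every item tied at this score before emitting a point.
        while idx < len(ranked) and ranked[idx][0] == current:
            if ranked[idx][1] == 1:
                tp += 1
            else:
                fp += 1
            idx += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data, not taken from the paper.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_curve_points(toy_labels, toy_scores)))
```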
Table 2. Classification performance of the RF model using CS features.
| RF model using CS features | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Duarte (5-fold c.v.) | 82% | 89% | 85% | 0.7 |
| Bahadur | 83% | 95% | 90% | 0.79 |
| Zhu | 88% | 98% | 94% | 0.87 |
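The four reported measures follow the standard confusion-matrix definitions. The sketch below computes them for a binary classifier, assuming that biological interfaces are the positive class and crystallographic ones the negative class; that assignment and the function name are assumptions, and the counts in the usage lines are toy values rather than results from the paper.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy, and MCC from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Toy counts: 10 positives (8 found) and 10 negatives (9 rejected).
metrics = classification_metrics(tp=8, fn=2, tn=9, fp=1)
print({name: round(value, 2) for name, value in metrics.items()})
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a useful summary when the two classes differ in size, as they do in the Bahadur and Zhu datasets.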
Table 3. Classification performance of other representative classifiers.
| PRODIGY-CRYSTAL | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Duarte | 94% | 55% | 74% | 0.53 |
| Bahadur | 93% | 85% | 88% | 0.77 |
| Zhu | 93% | 91% | 92% | 0.83 |

| | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Duarte (training) | 90% | 71% | 80% | 0.59 |
| Bahadur | 89% | 86% | 87% | 0.74 |
| Zhu | 86% | 97% | 93% | 0.84 |

| | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Duarte | 91% | 59% | 74% | 0.52 |
| Bahadur | 89% | 73% | 80% | 0.62 |
| Zhu | 84% | 90% | 88% | 0.74 |

| | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Duarte | 69% | 62% | 65% | 0.3 |
| Bahadur | 86% | 71% | 78% | 0.57 |
| Zhu (training) | 97% | 98% | 97% | 0.94 |
Table 4. Classification performance of our RF models, PRODIGY-CRYSTAL, and EPPIC based on Uniref100 (2016_07).
| Performance on the large-scale dataset | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| RF model*1 | 90% | 93% | 91% | 0.82 |
| RF model*2 | 92% (92%) | 94% (94%) | 93% (93%) | 0.86 (0.86) |
| RF model*3 | 95% | 96% | 95% | 0.91 |
| PRODIGY-CRYSTAL*4 | 91% | 94% | 92% | 0.85 |
| EPPIC | 90% | 88% | 89% | 0.78 |
*1: The model was trained on the Duarte dataset; it is the same model described in Table 2 that exploits CS features. *2: The model was trained on the merged Duarte, Bahadur, and Zhu datasets. Numbers in parentheses represent performance after removing the 35 entries that overlap between the merged and the large-scale datasets. *3: The model was trained and evaluated on the large-scale dataset using 10-fold cross-validation. *4: Because the classifier was trained on the same dataset, 10-fold cross-validation was conducted. For our RF model and PRODIGY-CRYSTAL, the same partitioning was applied for each fold.
Figure 4. A practical example of our method and enhanced CSs in contact pairs of homodimers in reduced MSAs. (a) Enhanced CSs in contact pairs of homodimeric hemoglobin (3SDH). Red bars show contact pairs with CS higher than 0.6 when using the reduced alignment. Blue bars show those when using the original alignment. Green bars show contact pairs having CS higher than 0.6 in both alignments. (b) Number of sequences included in MSAs of 3SDH with the default (green) and the conservative (red) thresholds. (c) Enhanced CSs in contact pairs of human coagulation factor XIII (1F13). The color scheme is the same as for 3SDH. (d) Number of sequences included in MSAs of 1F13. The color scheme is the same as for 3SDH. (e) Schematic diagram of differences in contact prediction between interchain and intrachain interactions. The sequences required for estimating CSs depend on the degree of conservation of the target contacts. In general, CS estimation for interchain contacts requires sequences more similar to the target, because oligomeric states are less conserved than folds. Intrachain contacts are more conserved (i.e., even quite diverged sequences are still informative for intrachain contact estimation), whereas interchain contacts are less conserved because oligomeric states can vary more than folds. This is at least true for our manually confirmed samples. In such cases, greatly diverged sequences, which may adopt oligomeric states different from that of the target, can be a source of noise in the estimation. For these samples, a sequence threshold of 20–40% appears appropriate. Family A in the figure illustrates a protein family in which diverged sequences can be noise. In contrast, even diverged sequences remain informative for interchain contacts if the oligomeric state is highly conserved; Family B is an example of this case.
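The idea of a reduced MSA can be sketched as a filter that keeps only sequences sufficiently similar to the target before CSs are estimated. The sketch below is a hypothetical illustration, not the authors' pipeline: it assumes the threshold is a fractional sequence identity to the target, computed over the target's non-gap columns, and the function names and the default of 0.3 (the midpoint of the 20–40% range in the caption) are chosen here for illustration.

```python
def sequence_identity(target, other):
    """Fraction of the target's non-gap columns where both rows match."""
    aligned = [(a, b) for a, b in zip(target, other) if a != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return matches / len(aligned)


def reduce_msa(msa, min_identity=0.3):
    """Keep the target (first row) plus rows at least min_identity identical.

    msa: list of equal-length aligned sequences; msa[0] is the target.
    """
    target = msa[0]
    kept = [s for s in msa[1:] if sequence_identity(target, s) >= min_identity]
    return [target] + kept


# Toy alignment, not taken from the paper.
toy_msa = [
    "ACDEFGHIKL",  # target
    "ACDEFGHIKV",  # 90% identical to the target
    "ACDEQQQQQQ",  # 40% identical
    "WWWWWWWWWL",  # 10% identical
]
print(len(reduce_msa(toy_msa)))
```

Raising `min_identity` trades alignment depth for homogeneity of oligomeric state, which is the trade-off that Family A and Family B in panel (e) illustrate.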