Varsha D Badal, Petras J Kundrotas, Ilya A Vakser.
Abstract
The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, that can still be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound benchmark set, significantly increasing the docking success rate.
Year: 2015 PMID: 26650466 PMCID: PMC4674139 DOI: 10.1371/journal.pcbi.1004630
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Flowchart of the text mining protocol.
Table 1. Regular expressions for amino acids in the information extraction part of the text mining protocol.
| Element | Regular expression |
|---|---|
| Number (a) | [1–9][0–9] |
| AA (b) | [Ala,…, Val] OR [ala,…, val] OR [ALA,…, VAL] |
| | AA(no space)Number OR AA(space)Number OR AA–Number |
| Full_AA (c) | [Alanine,…, Valine] OR [alanine,…, valine] |
| | Full_AA(no space)Number OR |
| Single_AA (d) | [A,…, V] |
| | Single_AA(no space)Number(no space)Single_AA |
| | AA(no space)Number(no space)AA OR AA-Number(no space)AA |

a Non-zero digit followed by any number of digits
b Three-letter abbreviation for standard amino acids
c Full name of amino acid
d One-letter abbreviation for amino acid
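
As an illustration of how the patterns in the table above could be turned into working regular expressions, the following Python sketch covers the three-letter and one-letter forms. It is not the authors' implementation: the names `extract_residues`, `AA_NUMBER`, `AA_SUBST` and `SINGLE_SUBST` are hypothetical, full amino acid names are omitted for brevity, and `NUMBER` follows footnote a (a non-zero digit followed by any number of digits).

```python
import re

# One-letter codes keyed by three-letter abbreviation (20 standard amino acids).
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

NUMBER = r"[1-9][0-9]*"  # non-zero digit followed by any number of digits

# Accept Ala, ala and ALA, as in the table; other casings are not matched.
_variants = [v for aa in AA3_TO_1 for v in (aa, aa.lower(), aa.upper())]
AA = "(?:" + "|".join(_variants) + ")"
SINGLE_AA = "[" + "".join(sorted(AA3_TO_1.values())) + "]"

# AA + Number with no separator, a space, or a hyphen/en dash ("Arg78", "Arg 78", "Arg-78").
AA_NUMBER = re.compile(rf"\b({AA})[ \-–]?({NUMBER})\b")
# AA + Number + AA, optionally with a hyphen before the number ("Lys45Glu").
AA_SUBST = re.compile(rf"\b({AA})[\-–]?({NUMBER})({AA})\b")
# Single_AA + Number + Single_AA with no spaces ("R78A").
SINGLE_SUBST = re.compile(rf"\b({SINGLE_AA})({NUMBER})({SINGLE_AA})\b")


def extract_residues(text):
    """Return (one_letter_code, residue_number) mentions, sorted by number."""
    found = set()
    for aa, num in AA_NUMBER.findall(text):
        found.add((AA3_TO_1[aa.capitalize()], int(num)))
    for wild_type, num, _mutant in AA_SUBST.findall(text):
        found.add((AA3_TO_1[wild_type.capitalize()], int(num)))
    for wild_type, num, _mutant in SINGLE_SUBST.findall(text):
        found.add((wild_type, int(num)))
    return sorted(found, key=lambda r: (r[1], r[0]))


sentence = ("Mutations R78A and Arg-32 abolished binding, "
            "whereas tyr 101 and Lys45Glu had no effect.")
print(extract_residues(sentence))
```

For substitutions, the sketch records the wild-type residue and its number, on the assumption that this is what would serve as a docking constraint.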
Table 2. Sets of features (stems) for SVM models.
Manually selected features are sorted alphabetically, and automatically selected features are sorted by the ratio δ (Eq 5), from large to small. PPI-relevant features are in bold.
| Number of words in a bag | Bag of words |
|---|---|
| Manually selected features | |
| | activ, |
| 50 | affin, alloster, associ, attach, bind, bond, bound, catalyt, chang, cleavag, complex, conform, conserv, cooper, contact, cycliz, delet, diminish, direct, domain, downstream, enhanc, enzym, facilit, growth, increas, induc, inhibit, interact, interfac, involv, linkag, mechan, metabol, modifi, modul, preferenti, reassoci, recognit, regulatori, signal, specif, stabil, stimul, substrat, suppress, surfac, target, transform, trigger. |
| 40 | affin, alloster, associ, attach, bind, bond, bound, catalyt, cleavag, complex, conform, conserv, cooper, contact, cycliz, delet, diminish, domain, enhanc, enzym, facilit, increas, induc, inhibit, interact, interfac, linkag, mechan, modifi, modul, preferenti, recognit, regulatori, specif, stabil, substrat, surfac, target, transform, trigger. |
| 30 | affin, alloster, associ, attach, bind, bond, bound, cleavag, complex, conform, conserv, contact, cooper, domain, induc, interfac, interact, linkag, mechan, modifi, modul, preferenti, recognit, regulatori, specif, stabil, surfac, substrat, target, transform. |
| 20 | alloster, bind, bond, bound, cleavag, complex, conform, contact, conserv, domain, induc, interfac, interact, mechan, modul, preferenti, recognit, specif, stabil, surfac. |
| 10 | alloster, bind, complex, conform, conserv, contact, induc, interfac, interact, recognit. |
| Automatically selected features | |
| | polymorph, |
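
To make the bag-of-words representation concrete, the sketch below turns an abstract into a feature vector over the ten-stem bag listed in the table. It is only an illustration under stated assumptions: the function name `featurize` is hypothetical, and simple prefix matching stands in for whatever stemming procedure the original protocol used. A vector of this kind is what would be passed to an SVM classifier.

```python
import re

# The ten-stem bag from the table, with stems written in truncated form.
STEMS = ["alloster", "bind", "complex", "conform", "conserv",
         "contact", "induc", "interfac", "interact", "recognit"]


def featurize(abstract, stems=STEMS):
    """Count, for each stem, the words in the abstract that start with it.

    Prefix matching is an assumption used here in place of a real stemmer.
    """
    words = re.findall(r"[a-z]+", abstract.lower())
    return [sum(word.startswith(stem) for word in words) for stem in stems]


text = ("The binding interface is conserved; induced conformational "
        "changes allow recognition.")
print(dict(zip(STEMS, featurize(text))))
```

Prefix matching misses irregular forms (for example, "bound" does not start with "bind"), which is one reason the larger bags in the table list such forms as separate features.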
Table 3. Performance of basic and SVM-enhanced TM protocols.
The SVM models were trained and tested on abstracts retrieved by the AND-queries. The best models were applied to abstracts retrieved by the OR-queries (see Methods). The total number of complexes in the dataset is 579, unless specified otherwise.
| Query type | SVM model | Complexes with residues found (a) | Complexes with interface residues found (b) | Coverage (%) (c) | Success (%) (d) | Accuracy (%) (e) |
|---|---|---|---|---|---|---|
| | | | | | | |
| AND | | 128 | 108 | 22.1 | 18.7 | 84.4 |
| OR | | 328 | 273 | 56.6 | 47.2 | 83.2 |
| | | | | | | |
| AND | | 142 | 118 | 24.5 | 20.4 | 83.1 |
| OR | | 342 | 283 | 59.1 | 48.9 | 82.7 |
| | | | | | | |
| AND | | 96 | 75 | 16.6 | 13.0 | 78.1 |
| OR | | 268 | 202 | 46.3 | 34.9 | 75.4 |
| | | | | | | |
| OR | MF50L | 266 | 211 | 45.9 | 36.4 | 79.3 |
| OR | AF138L | 269 | 213 | 46.5 | 36.9 | 79.2 |
| OR | AF24L | 253 | 193 | 43.7 | 33.3 | 76.3 |
| | | | | | | |
| OR | | 93 | 82 | 93.9 | 82.8 | 88.2 |

a Number of complexes for which the TM protocol found at least one abstract with residues
b Number of complexes with at least one interface residue found in abstracts
c Ratio of (a) to the total number of complexes
d Ratio of (b) to the total number of complexes
e Ratio of (b) to (a)
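
The three percentage columns follow directly from the two counts and the dataset size, as footnotes c to e describe. The short Python sketch below recomputes them for the first two rows of the table; the function name `tm_metrics` is hypothetical, and the default total of 579 is taken from the table caption.

```python
def tm_metrics(n_with_residues, n_with_interface, n_total=579):
    """Return (coverage, success, accuracy) in percent, rounded to 0.1."""
    coverage = 100.0 * n_with_residues / n_total           # footnote c
    success = 100.0 * n_with_interface / n_total           # footnote d
    accuracy = 100.0 * n_with_interface / n_with_residues  # footnote e
    return round(coverage, 1), round(success, 1), round(accuracy, 1)


print(tm_metrics(128, 108))  # first AND row
print(tm_metrics(328, 273))  # first OR row
```

Accuracy is normalized by the number of complexes with any residues found, not by the dataset size, so it can stay high even when coverage is low.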
Fig 2. Distribution of complexes according to the quality of the basic TM.
The TM performance is according to P_TM (Eq 1). The distribution is normalized to the total number of complexes for which residues were identified (column 3 in Table 3).
Fig 3. Examples of residues extracted from an abstract retrieved by an OR-query.
The structure, chain ID, and residue numbers are from 1m27. Interface and non-interface residues are in brown and magenta, respectively.
Fig 4. Matthews correlation coefficient vs. number of features in SVM model.
The Matthews correlation coefficient (MCC) is calculated according to Eq 6. The features were selected manually (A) and in automated mode (B), for linear and RBF SVM kernels. The data was obtained on the validation set of 261 abstracts. The SVM models were trained on 1,044 abstracts (see Methods).
Table 4. Classification of abstracts in the test set by the three optimal SVM models.
The total number of abstracts is 261 (90 PPI-relevant and 171 non-PPI).
| SVM model | TP | FN | TN | FP |
|---|---|---|---|---|
| MF50L | 48 | 42 | 123 | 48 |
| AF138L | 52 | 38 | 115 | 56 |
| AF24L | 46 | 44 | 128 | 43 |
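
The confusion-matrix counts above are enough to compute a Matthews correlation coefficient for each model. Eq 6 is not reproduced in this section, so the Python sketch below assumes the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the function name `mcc` and the dictionary `CONFUSION` are hypothetical.

```python
import math


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient (standard definition, assumed here)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# (TP, FN, TN, FP) for each model, copied from the table above.
CONFUSION = {
    "MF50L": (48, 42, 123, 48),
    "AF138L": (52, 38, 115, 56),
    "AF24L": (46, 44, 128, 43),
}

for model, counts in CONFUSION.items():
    print(f"{model}: MCC = {mcc(*counts):.3f}")
```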
Fig 5. Performance of the best SVM models.
The abstracts were retrieved by the OR-queries. The distribution of complexes (A) is shown according to the TM performance, P_TM (Eq 1). The distribution is normalized by the total number of complexes for which residues were identified (column 2 in Table 3). Panel (B) shows the number of complexes for which, after filtering of abstracts by the optimal models, P_TM improves (ΔP_TM > 0), does not change (ΔP_TM = 0), or gets worse (ΔP_TM < 0). Hatched areas show the number of complexes for which the optimal models removed all abstracts.
Fig 6. Docking with TM constraints.
The results of benchmarking on the unbound X-ray set from Dockground. A complex was considered successfully predicted if at least one of the top ten matches had ligand Cα interface RMSD ≤ 5 Å (A), or at least one of the top hundred matches had RMSD ≤ 8 Å (B). The success rate is the percentage of successfully predicted complexes in the set. The low-resolution geometric scan output (20,000 matches) from GRAMM docking, with no post-processing except removal of redundant matches, was scored by TM results. The reference bars show scoring by the actual interface residues (see text).
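
The two success criteria in the caption can be expressed as a small check over ranked docking matches. The Python sketch below is a minimal illustration, assuming each complex is represented by a list of interface RMSD values (in Å) ordered by score rank; the function names and the complex identifiers `c1` to `c3` are hypothetical, and the RMSD values are made up for the example.

```python
def is_success(ranked_rmsds, top_n, rmsd_cutoff):
    """True if any of the top_n ranked matches has RMSD <= rmsd_cutoff."""
    return any(rmsd <= rmsd_cutoff for rmsd in ranked_rmsds[:top_n])


def success_rate(benchmark, top_n, rmsd_cutoff):
    """Percentage of complexes in the benchmark that are predicted successfully."""
    hits = sum(is_success(rmsds, top_n, rmsd_cutoff)
               for rmsds in benchmark.values())
    return 100.0 * hits / len(benchmark)


# Hypothetical ranked RMSD lists, best-scored match first.
benchmark = {
    "c1": [12.0, 4.5, 9.0],
    "c2": [7.5] * 10 + [3.0],
    "c3": [20.0, 15.0],
}

print(success_rate(benchmark, top_n=10, rmsd_cutoff=5.0))   # criterion (A)
print(success_rate(benchmark, top_n=100, rmsd_cutoff=8.0))  # criterion (B)
```

In this toy data, `c2` fails criterion (A) because its only match below 5 Å is ranked eleventh, but passes the more permissive criterion (B).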