Varsha D Badal, Petras J Kundrotas, Ilya A Vakser.
Abstract
BACKGROUND: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identifying the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking.
Keywords: Binding site prediction; Dependency parser; Protein docking; Protein interactions; Rule-based system; Supervised learning
Year: 2018 PMID: 29506465 PMCID: PMC5838950 DOI: 10.1186/s12859-018-2079-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Flowchart of the NLP-enhanced text mining system. Scoring of surrounding sentences is shown for Method 3 (see text)
Table 3. Manually generated dictionary used to distinguish relevant (PPI + ive) and irrelevant (PPI-ive) information on protein-protein binding sites. Only lemmas (stem words) are shown
| Category | Words |
|---|---|
| PPI + ive | bind, interfac, complex, hydrophob, recept, ligand, contact, recog, dock, groove, pocket, pouch, interact, crystal, latch, catal |
| PPI-ive | deamidation, IgM, IgG, dissociat, antibo, alloster, phosphory, nucleotide, polar, dCTP, dATP, dTTP, dUTP, dGTP, IgG1, IgG2, IgG3, IgG4, Fc, ubiquitin, neddylat, sumoyla, glycosylation, lipidation, carbonylation, nitrosylation, epitope, paratope, purine, pyrimidine, isomeriz, non-conserved, fucosylated, nonfucosylated, sialylation, galactosylation |
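Keyword spotting with a stem dictionary like the one above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the tokenizer, the prefix-matching rule, and the decision rule (keep a sentence if it has at least one PPI + ive stem and no PPI-ive stem) are assumptions, and the function names are hypothetical.

```python
import re

# Stems copied from the dictionary table above.
PPI_POSITIVE = (
    "bind", "interfac", "complex", "hydrophob", "recept", "ligand", "contact",
    "recog", "dock", "groove", "pocket", "pouch", "interact", "crystal",
    "latch", "catal",
)
PPI_NEGATIVE = (
    "deamidation", "IgM", "IgG", "dissociat", "antibo", "alloster",
    "phosphory", "nucleotide", "polar", "dCTP", "dATP", "dTTP", "dUTP",
    "dGTP", "IgG1", "IgG2", "IgG3", "IgG4", "Fc", "ubiquitin", "neddylat",
    "sumoyla", "glycosylation", "lipidation", "carbonylation",
    "nitrosylation", "epitope", "paratope", "purine", "pyrimidine",
    "isomeriz", "non-conserved", "fucosylated", "nonfucosylated",
    "sialylation", "galactosylation",
)


def stems_found(sentence, stems):
    """Return the sorted stems that prefix at least one token (case-insensitive).

    Prefix matching is a crude stand-in for real lemmatization (assumption).
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    return sorted({s for s in stems for t in tokens if t.startswith(s.lower())})


def is_ppi_relevant(sentence):
    """Hypothetical rule: at least one PPI + ive stem and no PPI-ive stem."""
    has_positive = bool(stems_found(sentence, PPI_POSITIVE))
    has_negative = bool(stems_found(sentence, PPI_NEGATIVE))
    return has_positive and not has_negative


if __name__ == "__main__":
    for text in (
        "Residues Arg45 and Tyr102 form the binding interface of the complex.",
        "Phosphorylation of Ser10 was detected.",
    ):
        print(is_ppi_relevant(text), "-", text)
```

Short stems such as "Fc" or "polar" will match unrelated tokens under prefix matching, so a production filter would likely need a proper lemmatizer or word-boundary rules.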
Table 1. Overall text-mining performance with the residue filtering using semantic similarity of words in a residue-containing sentence to a generic concept in the WordNet vocabulary. For comparison, the results with basic residue filtering are also shown
| Query | Similarity measure | L^a | L^b | Coverage (%)^c | Success (%)^d | Accuracy (%)^e | Δ^f | Δ^f |
|---|---|---|---|---|---|---|---|---|
| AND | – | 128 | 108 | 22.1 | 18.7 | 84.4 | | |
| OR | – | 328 | 273 | 56.6 | 47.2 | 83.2 | | |
| OR | Lesk [ | 319 | 267 | 55.1 | 46.1 | 83.7 | −3 | −1 |
| OR | Lin [ | 251 | 184 | 43.4 | 31.8 | 73.3 | +8 | −8 |
| OR | Path [ | 316 | 265 | 54.6 | 45.8 | 83.9 | −3 | +1 |
^a Number of complexes for which the TM protocol found at least one abstract with residues
^b Number of complexes with at least one interface residue found in abstracts
^c Ratio of L^a to the total number of complexes
^d Ratio of L^b to the total number of complexes
^e Ratio of L^b to L^a
^f Calculated by Eq. (2)
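The coverage, success, and accuracy columns follow directly from the two counts defined in footnotes a and b. A minimal sketch, assuming a benchmark of 579 complexes: that total is not stated in this record and is inferred here from the table values (for example, 128 complexes correspond to 22.1% coverage). The function name is hypothetical.

```python
def tm_metrics(n_with_residues, n_with_interface, n_total):
    """Return (coverage, success, accuracy) in percent.

    n_with_residues:  complexes with at least one abstract containing residues
    n_with_interface: complexes with at least one interface residue found
    n_total:          total number of complexes in the benchmark
    """
    if n_total <= 0 or n_with_residues <= 0:
        raise ValueError("counts must be positive")
    coverage = 100.0 * n_with_residues / n_total
    success = 100.0 * n_with_interface / n_total
    accuracy = 100.0 * n_with_interface / n_with_residues
    return coverage, success, accuracy


if __name__ == "__main__":
    # AND-query row of the table; 579 is an inferred total (assumption).
    cov, suc, acc = tm_metrics(128, 108, 579)
    print(f"coverage={cov:.1f}%  success={suc:.1f}%  accuracy={acc:.1f}%")
```

With the inferred total, the AND-query and OR-query rows of the table are reproduced to one decimal place.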
Fig. 2. Performance of basic and advanced text mining protocols. Advanced filtering of the residues in the abstracts retrieved by the OR-queries was performed by calculating various similarity scores (see legend) between the words of residue-containing sentences and generic concept words from WordNet. The TM performance is calculated using Eq. (1). The distribution is normalized to the total number of complexes for which residues were extracted (third column in Table 1)
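To make the idea of a path-based similarity score concrete, the sketch below computes 1 / (1 + d), where d is the shortest path between two words in an is-a taxonomy, which is how the WordNet path measure is usually defined. Everything else is an assumption for illustration: the tiny taxonomy is invented (it is not WordNet), the concept word "interaction" is hypothetical, and scoring a sentence by its best-matching word is only one possible choice.

```python
from collections import deque

# Toy is-a taxonomy (child -> parent). Invented for illustration, not WordNet.
TOY_PARENT = {
    "process": "entity",
    "object": "entity",
    "interaction": "process",
    "modification": "process",
    "binding": "interaction",
    "contact": "interaction",
    "phosphorylation": "modification",
    "molecule": "object",
}


def build_graph(parent_of):
    """Turn the child -> parent map into an undirected adjacency map."""
    graph = {}
    for child, parent in parent_of.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_similarity(a, b, graph):
    """Return 1 / (1 + shortest path length), or 0.0 if a word is unknown."""
    if a not in graph or b not in graph:
        return 0.0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:  # breadth-first search gives the shortest path
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1 + dist)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0


def sentence_score(words, concept, graph):
    """Score a sentence by its word most similar to the concept (assumption)."""
    return max((path_similarity(w, concept, graph) for w in words), default=0.0)


if __name__ == "__main__":
    graph = build_graph(TOY_PARENT)
    print(sentence_score(["residue", "binding"], "interaction", graph))
    print(sentence_score(["phosphorylation"], "interaction", graph))
```

A residue-containing sentence would then be kept when its score exceeds some threshold; the threshold and the concept words actually used are not given in this record.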
Table 2. Overall text-mining performance with the residue filtering based on spotting keyword(s) from specialized dictionaries in the residue-containing sentences
| Dictionary and reference | Number of PPI keywords | L^a | L^b | Coverage (%)^c | Success (%)^d | Accuracy (%)^e | Δ^f | Δ^f |
|---|---|---|---|---|---|---|---|---|
| Blaschke et al., [ | 43 | 265 | 205 | 45.8 | 35.4 | 77.4 | 0 | −8 |
| Chowdhary et al., [ | 191 | 284 | 233 | 49.1 | 40.2 | 82.0 | −7 | −4 |
| Hakenberg et al. [ | 234 | 297 | 232 | 51.3 | 40.1 | 78.1 | 6 | −7 |
| Plake et al. [ | 73 | 291 | 230 | 50.3 | 39.7 | 79.0 | 1 | −1 |
| Raja et al. [ | 412 | 302 | 247 | 52.2 | 42.7 | 81.8 | 0 | −5 |
| Schuhmann et al. [ | 64 | 212 | 152 | 36.6 | 26.3 | 71.7 | −1 | 5 |
| Temkin et al. [ | 174 | 283 | 223 | 48.9 | 38.5 | 78.8 | 0 | −9 |
| Own dictionary | 16 | 224 | 169 | 38.7 | 29.2 | 75.4 | −6 | 8 |
For definitions of columns 3–9, see footnotes to Table 1. The full content of the in-house dictionary is in Table 3, but only the PPI + ive part was used to calculate the data in this table
Fig. 3. Performance of basic and advanced text mining protocols. Advanced filtering of the residues in the abstracts retrieved by the OR-queries was performed by spotting PPI-relevant keywords from various specialized dictionaries (see legend). The TM performance is calculated using Eq. (1). The distribution is normalized to the total number of complexes for which residues were extracted (third column in Table 2). The full content of the in-house dictionary is in Table 3, but only the PPI + ive part was used to obtain the results presented in this figure. The data are shown in two panels for clarity
Table 4. Overall text-mining performance with the residue filtering based on analysis of the sentence parse tree
| Method of parse tree analysis | L^a | L^b | Coverage (%) | Success (%) | Accuracy (%) | Δ | Δ |
|---|---|---|---|---|---|---|---|
| Method 1. Scoring of the residue-containing sentence only | 222 | 173 | 38.3 | 29.9 | 77.9 | −13 | +10 |
| Method 2. Scoring of the residue-containing sentence and keyword spotting in the context sentences | 208 | 154 | 35.9 | 26.6 | 74.0 | −7 | +3 |
| Method 3. SVM model with scores of the residue-containing and context sentences | 182 | 146 | 31.4 | 25.2 | 80.2 | −27 | +21 |
Keywords used in the analysis were taken from our dictionary (Table 3). For definitions of columns 2–8, see footnotes to Table 1
Fig. 4. Performance of basic and advanced text mining protocols. Advanced filtering of the residues in the abstracts retrieved by the OR-queries was performed by different methods of analysis of the sentence parse trees (for method descriptions, see the first column in Table 4). The TM performance was calculated using Eq. (1). The distribution is normalized to the total number of complexes for which residues were extracted (second column in Table 4)
Fig. 5. Successful filtering of mined residues by the SVM-based approach of the parse-tree analysis (Method 3 in Table 4). The structure is 2uyz, chains A (wheat) and B (cyan). Residues mined by the basic TM protocol are highlighted. The ones filtered out by the advanced TM protocol are in orange
Fig. 6. TM contribution to docking. The success rate increase of the rigid-body global docking scan by GRAMM using constraints generated by basic TM and the advanced TM with NLP