Anália Lourenço, Michael Conover, Andrew Wong, Azadeh Nematzadeh, Fengxia Pan, Hagit Shatkay, Luis M Rocha.
Abstract
BACKGROUND: We participated, as Team 81, in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we extensively tested available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. Our main goal was to exploit the power of these tools to aid in the classification of documents relevant to Protein-Protein Interaction (PPI). For the IMT, we focused on obtaining evidence in support of the interaction methods used, rather than on tagging the document with the method identifiers. We experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. In a nutshell, we exploited classifiers, simple pattern matching for potential PPI methods within sentences, and ranking of candidate matches using statistical considerations. Finally, we also studied the benefits of integrating the method extraction approach that we used for the IMT into the ACT pipeline.
Year: 2011 PMID: 22151823 PMCID: PMC3269935 DOI: 10.1186/1471-2105-12-S8-S12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Top 1000 SP features. Features are colored according to the value of S (darker indicating higher rank).
Top 10 SP features ranked with the S score.
| Feature | | | S |
|---|---|---|---|
| interact--with | 0.3220 | 0.0442 | 0.279 |
| interact--between | 0.1071 | 0.026 | 0.081 |
| complex--with | 0.0920 | 0.0153 | 0.0768 |
| protein--interact | 0.0666 | 0.006 | 0.0606 |
| crystal--structur | 0.0804 | 0.022 | 0.0584 |
| yeast--two-hybrid | 0.0542 | 0.0 | 0.0542 |
| with--protein | 0.0619 | 0.0123 | 0.0496 |
| protein--kinas | 0.0705 | 0.0233 | 0.0472 |
| here--report | 0.086 | 0.039 | 0.047 |
| transcript--factor | 0.0856 | 0.0417 | 0.0438 |
Top 10 bigram features ranked with the S score.
| Feature | | | S |
|---|---|---|---|
| interact--with | 0.3001 | 0.0397 | 0.2604 |
| interact--between | 0.1062 | 0.026 | 0.0802 |
| complex--with | 0.089 | 0.013 | 0.076 |
| crystal--structur | 0.0804 | 0.0218 | 0.0586 |
| yeast--two-hybrid | 0.0542 | 0.0 | 0.0542 |
| protein--interact | 0.052 | 0.0045 | 0.0475 |
| here--report | 0.0856 | 0.0384 | 0.0472 |
| protein--kinas | 0.0679 | 0.0224 | 0.0455 |
| transcript--factor | 0.0851 | 0.0415 | 0.0436 |
| ubiquitin--ligas | 0.0396 | 0.0031 | 0.0364 |
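In both tables the last column appears to equal the absolute difference of the two columns before it, up to rounding. The following is a minimal Python sketch of ranking features by such a score. The names `s_score` and `rank_by_s`, and the reading of the two middle columns as per-class occurrence probabilities, are assumptions for illustration; the column labels are not preserved here.

```python
# Rows copied from the bigram table: (feature, first column, second column, reported S).
# What the two middle columns measure is an assumption, not stated in the table.
ROWS = [
    ("interact--with", 0.3001, 0.0397, 0.2604),
    ("interact--between", 0.1062, 0.026, 0.0802),
    ("complex--with", 0.089, 0.013, 0.076),
    ("crystal--structur", 0.0804, 0.0218, 0.0586),
    ("yeast--two-hybrid", 0.0542, 0.0, 0.0542),
]


def s_score(p_a: float, p_b: float) -> float:
    """Assumed form of the score: absolute difference of the two columns."""
    return abs(p_a - p_b)


def rank_by_s(rows):
    """Return feature names sorted by descending assumed S score."""
    ordered = sorted(rows, key=lambda r: s_score(r[1], r[2]), reverse=True)
    return [name for name, _, _, _ in ordered]


ranked = rank_by_s(ROWS)
# Largest disagreement between the assumed formula and the reported S values.
max_gap = max(abs(s_score(a, b) - s) for _, a, b, s in ROWS)
```

Under this assumption, a feature such as yeast--two-hybrid, whose second column is 0.0, keeps its full first-column value as its score.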
Figure 2. Comparison of entity count features. The horizontal axis represents the number of mentions x, and the vertical axis the probability of documents with at least x mentions. The green lines denote probabilities p(n ≥ x) for documents labeled relevant, while the red lines denote probabilities p(n ≥ x) for documents labeled irrelevant; the blue lines denote the absolute difference between the green and red lines.
Figure 3. Comparison of entity count features. The horizontal axis represents the number of mentions x, and the vertical axis the probability of documents with at least x mentions. The green line denotes probabilities p(n ≥ x) for documents labeled relevant, while the red line denotes probabilities p(n ≥ x) for documents labeled irrelevant; the blue line denotes the absolute difference between the green and red lines.
Figure 4. The normalized plane for plotting the VTT decision surface. The coordinates x(d) and y(d) are computed according to Eq. (3) for every document d. The decision surface is computed with Eq. (4). On the left-hand side the threshold for the classification decision is shown (see text for description). On the right-hand side, the point of no threshold adjustment is shown (see text for description).
Parameter values for submitted classifiers after parameter search.
| Classifier | Features | ||||||
|---|---|---|---|---|---|---|---|
| VTT0 | SP | 1.1 | - | - | - | - | - |
| VTT0 | Bigrams | 1.1 | - | - | - | - | - |
| VTT1 | SP | 1.3 | 40 | - | - | - | - |
| VTT1 | Bigrams | 1.5 | 20 | - | - | - | - |
| VTT5 | SP | 2.2 | 6 | 50 | 70 | 4 | 40 |
| VTT5 | Bigrams | 2.1 | 6 | 50 | 60 | 5 | 30 |
| VTT3 | SP | 1.4 | 17 | 115 | 115 | - | - |
Performance of submitted classifiers on training data.
| Classifier | Features | F-Score | Accuracy | MCC |
|---|---|---|---|---|
| VTT0 | SP | 0.7637 | 0.8308 | 0.6325 |
| VTT0 | Bigrams | 0.7541 | 0.832 | 0.6269 |
| VTT1 | SP | 0.7755 | 0.8386 | 0.6502 |
| VTT1 | Bigrams | 0.7568 | 0.8302 | 0.6265 |
| VTT5 | SP | |||
| VTT5 | Bigrams | 0.7751 | 0.842 | 0.6533 |
| VTT3 | SP | 0.771 | 0.8387 | 0.6466 |
Shown are the mean values obtained in cross-validation for the F-Score, Accuracy, and Matthews Correlation Coefficient (MCC). Boldfaced values represent the best performance in the table.
Performance of submitted classifiers on test data.
| Classifier | Features | F-Score | Accuracy | MCC | AUC iP/R |
|---|---|---|---|---|---|
| VTT0 | SP | 0.5399 | 0.8097 | 0.456 | 0.4935 |
| VTT0 | Bigrams | 0.5243 | 0.8382 | 0.4318 | 0.4287 |
| VTT1 | SP | 0.5667 | 0.8213 | 0.4909 | 0.5402 |
| VTT1 | Bigrams | 0.5575 | 0.8402 | 0.472 | 0.5015 |
| VTT5 | SP | ||||
| VTT5 | Bigrams | 0.6366 | 0.859 | 0.5752 | 0.7127 |
| VTT3 | SP | 0.628 | 0.8387 | 0.5735 | 0.7143 |
Shown are the values obtained on the official BC3 gold standard for the F-Score, Accuracy, Matthews Correlation Coefficient (MCC), and Area Under the interpolated Precision and Recall Curve (AUC iP/R; computed with the official script, and adding F-Score). Boldfaced values represent the best performance in the table.
Figure 5. Decision surfaces of the VTT. The decision surfaces are plotted with the parameters in Table 3, and x(d) and y(d) are computed according to Eq. (7) for every document d. The plots for VTT1 surfaces display many documents d with the same values of y(d), plotted in horizontal rows, while VTT5 displays a smoother ranking of documents. This happens because VTT1 uses information from a single NER tool (ABNER protein mentions), while VTT5 uses information from five such tools; thus, while in the VTT1 plot many documents have the same value of ABNER protein mentions, in the VTT5 plot the various NER measurements lead to a finer distinction between documents.
Central tendency and variation of the performance of all runs submitted to ACT on the official BC3 gold standard, including our original and our corrected runs.
| Statistic | Accuracy | F-Score | MCC | AUC iP/R |
|---|---|---|---|---|
| Mean | 0.7909 | 0.4624 | 0.3885 | 0.5048 |
| Median | 0.8452 | 0.5399 | 0.4608 | 0.5367 |
| Std. dev. | 0.1324 | 0.1732 | 0.1740 | 0.1505 |
| Mean + 95% CI | 0.8257 | 0.5079 | 0.4343 | 0.5444 |
| Std. error | 0.0174 | 0.0227 | 0.0229 | 0.0198 |
Shown are the values obtained for the F-Score, Accuracy, Matthews Correlation Coefficient (MCC), and Area Under the interpolated Precision and Recall Curve (AUC iP/R; computed with the official script, adding F-Score).
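The rows of this table can be related to one another with standard formulas. The sketch below assumes that the standard error is the standard deviation divided by the square root of the number of runs, and that the "Mean + 95% CI" row is the mean plus a multiple of the standard error. The number of runs is not stated here, so the value printed by this check is only what the rounded figures imply. The names `implied_n` and `ci_multiplier` are illustrative.

```python
# One tuple per numeric column, left to right:
# (mean, std_dev, mean_plus_ci, std_error), copied from the table.
COLUMNS = [
    (0.7909, 0.1324, 0.8257, 0.0174),
    (0.4624, 0.1732, 0.5079, 0.0227),
    (0.3885, 0.1740, 0.4343, 0.0229),
    (0.5048, 0.1505, 0.5444, 0.0198),
]


def implied_n(std_dev: float, std_error: float) -> float:
    """Sample size implied by the assumption std_error = std_dev / sqrt(n)."""
    return (std_dev / std_error) ** 2


def ci_multiplier(mean: float, upper: float, std_error: float) -> float:
    """How many standard errors separate the mean from the upper bound."""
    return (upper - mean) / std_error


sizes = [round(implied_n(sd, se)) for _, sd, _, se in COLUMNS]
multipliers = [ci_multiplier(m, up, se) for m, _, up, se in COLUMNS]
print(sizes, [round(k, 2) for k in multipliers])
```

If the assumptions hold, every column should imply the same number of runs and a multiplier close to 2, as expected for a 95% interval.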
Performance of top 10 reported runs to ACT in BC3.
| Team | Run | Accuracy | Rank | F-Score | Rank | MCC | Rank | AUC iP/R | Rank | RP4 |
|---|---|---|---|---|---|---|---|---|---|---|
| | | 0.864 | 21 | | | | | | | |
| | | | | 0.6132 | 5 | 0.55306 | 4 | 0.6796 | 5 | |
| T73 | RUN_4 | 0.8888 | 3 | 0.6142 | 4 | 0.55054 | 5 | 0.6798 | 4 | 240 |
| | | 0.859 | 25 | | | | | 0.7127 | 3 | 300 |
| | | 0.844 | 30 | 0.6280 | 3 | 0.57345 | 3 | | | 540 |
| T73 | RUN-1 | 0.8755 | 16 | 0.6083 | 6 | 0.53524 | 6 | 0.6591 | 6 | 3456 |
| T73 | RUN_3 | 0.8778 | 13 | 0.6014 | 9 | 0.52932 | 8 | 0.6589 | 7 | 6552 |
| T73 | RUN_5 | 0.8762 | 15 | 0.6033 | 8 | 0.53031 | 7 | 0.6537 | 8 | 6720 |
| T90 | RUN_3 | 0.8832 | 9 | 0.5964 | 11 | 0.52914 | 9 | 0.6524 | 9 | 8019 |
| T65 | RUN_2 | 0.8793 | 12 | 0.5982 | 10 | 0.52727 | 10 | 0.6389 | 10 | 12000 |
Shown are the values obtained on the official BC3 gold standard for the F-Score, Accuracy, Matthews Correlation Coefficient (MCC), and Area Under the interpolated Precision and Recall Curve (AUC iP/R; computed with the official script, adding F-Score), as well as their ranks. RP4 denotes the rank product of these 4 measures. Boldfaced values represent the best and second-best performance for the respective measure.
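The rank product can be checked directly against the complete rows of the table. A minimal Python sketch, assuming RP4 is the plain product of the four per-measure ranks; the function name `rank_product` is illustrative:

```python
from math import prod

# (team, run, the four ranks as they appear in the row, reported RP4)
ROWS = [
    ("T73", "RUN_4", [3, 4, 5, 4], 240),
    ("T73", "RUN-1", [16, 6, 6, 6], 3456),
    ("T73", "RUN_3", [13, 9, 8, 7], 6552),
    ("T73", "RUN_5", [15, 8, 7, 8], 6720),
    ("T90", "RUN_3", [9, 11, 9, 9], 8019),
    ("T65", "RUN_2", [12, 10, 10, 10], 12000),
]


def rank_product(ranks):
    """Product of the per-measure ranks; lower means better overall."""
    return prod(ranks)


computed = [rank_product(ranks) for _, _, ranks, _ in ROWS]
reported = [rp4 for _, _, _, rp4 in ROWS]
```

Because the ranks are multiplied, one poor rank on a single measure can outweigh strong ranks on the other three, which is why runs with modest Accuracy ranks can still sit near the top of the table.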
Figure 6. Decision surfaces of the VTT. The decision surface and x(d) and y(d) are computed according to Eq. (7) for VTT1 (top) and Eq. (8) for VTT5 (bottom), for every document d in the test set. The plots for VTT1 surfaces display many documents d with the same values of y(d), plotted in horizontal rows, while VTT5 displays a smoother ranking of documents. This happens because VTT1 uses information from a single NER tool (ABNER protein mentions), while VTT5 uses information from five such tools; thus, while in the VTT1 plot many documents have the same value of ABNER protein mentions, in the VTT5 plot the various NER measurements lead to a finer distinction between documents.
IMT Runs on the test set (after code correction)
| Run | Precision | Recall | F-Score | MCC | AUC iP/R | Total Docs Evaluated |
|---|---|---|---|---|---|---|
| All | 2.50% | 93.17% | 0.0487 | 0.0908 | 0.1852 | 222 |
| Top 40 | 4.83% | 82.92% | 0.0913 | 0.1604 | 0.1583 | 222 |
| RScore ≥6 | 26.61% | 50.58% | 0.3488 | 0.3535 | 0.1522 | 214 |
| RScore ≥7 | 28.44% | 48.62% | 0.3589 | 0.3591 | 0.1524 | 210 |
The table shows the results of running our (corrected) program on the BC3 test set. The measurements shown are precision, recall, F-score, Matthews Correlation Coefficient (MCC), area under the interpolated precision/recall curve (AUC iP/R), and the total number of articles evaluated by our program.
The rows reflect four different runs: the first is based on pattern-matching of methods to the text alone (All); the second scores the sentence-method associations and reports the top 40 scoring methods; the third reports the top-scoring methods whose raw score was at least 6; and the last reports the top-scoring methods whose raw score was at least 7.
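The F-Score column appears to follow the balanced F-score, the harmonic mean of precision and recall, up to rounding of the reported percentages. A minimal Python sketch of that check; the function name `f_score` is illustrative:

```python
# (run, precision, recall, reported F-Score) copied from the test-set table,
# with percentages converted to fractions.
ROWS = [
    ("All", 0.0250, 0.9317, 0.0487),
    ("Top 40", 0.0483, 0.8292, 0.0913),
    ("RScore >=6", 0.2661, 0.5058, 0.3488),
    ("RScore >=7", 0.2844, 0.4862, 0.3589),
]


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Largest difference between the recomputed and the reported F-Score.
max_gap = max(abs(f_score(p, r) - f) for _, p, r, f in ROWS)
```

The rows show the trade-off the thresholds make: raising the raw-score cutoff gives up recall but raises precision by roughly a factor of ten, which lifts the F-score from about 0.05 to about 0.36.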
Summary of evaluation by three human annotators, over 1049 evidence sentences for PPI methods.
| Label | # of sentences tagged by the Majority as Label | % of sentences tagged by the Majority as Label |
|---|---|---|
| Y | 755 | 72% |
| M | 112 | 11% |
| N | 165 | 16% |
The table shows the majority labels for 1049 sentences, each labelled by three independent annotators. For each label, shown in the left column, we list how many sentences received that label from at least two of the three annotators.
The possible labels are: Y - if the sentence discusses a method that can potentially be applied for detecting protein-protein interaction; M - if the sentence discusses a method, but the method is not a protein-protein interaction detection method; N - if the sentence does not discuss a method.
Note that the total number of majority-vote sentences is 1032 rather than 1049, because on 17 sentences (about 1.6% of the total) the three annotators disagreed three ways; this is why the percentages sum to 99%.
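The majority rule described above can be sketched in a few lines of Python. The example annotations are hypothetical and only illustrate the rule; the counts are copied from the table, and the function name `majority_label` is illustrative.

```python
from collections import Counter


def majority_label(labels):
    """Return the label given by at least two of three annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None


# Hypothetical annotations for four sentences (not taken from the study data).
annotations = [("Y", "Y", "M"), ("N", "N", "N"), ("Y", "M", "N"), ("M", "Y", "M")]
majority = [majority_label(a) for a in annotations]

# Majority counts copied from the table.
COUNTS = {"Y": 755, "M": 112, "N": 165}
TOTAL = 1049

# Sentences with no majority label (three-way disagreement).
unresolved = TOTAL - sum(COUNTS.values())
percentages = {k: round(100 * v / TOTAL) for k, v in COUNTS.items()}
```

A three-way disagreement yields no majority label, which accounts for the sentences missing from the table totals.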
IMT Runs on the training set (after code correction)
| Run | Precision | Recall | F-Score | MCC | AUC iP/R | Total Docs Evaluated |
|---|---|---|---|---|---|---|
| All | 2.38% | 94.80% | 0.0465 | 0.0937 | 0.2032 | 2002 |
| Top 40 | 4.54% | 85.16% | 0.0864 | 0.1598 | 0.2063 | 2002 |
| RScore ≥6 | 26.30% | 58.72% | 0.3633 | 0.3806 | 0.1997 | 1947 |
| RScore ≥7 | 29.14% | 50.25% | 0.3689 | 0.3711 | 0.1816 | 1871 |
The table shows the results of running our (corrected) program on the BC3 training set. The measurements shown are precision, recall, F-score, Matthews Correlation Coefficient (MCC), area under the interpolated precision/recall curve (AUC iP/R), and the total number of articles evaluated by our program.
The rows reflect four different runs: the first is based on pattern-matching of methods to the text alone (All); the second scores the sentence-method associations and reports the top 40 scoring methods; the third reports the top-scoring methods whose raw score was at least 6; and the last reports the top-scoring methods whose raw score was at least 7.
The distribution of the secondary labels for sentences tagged as Y by majority of annotators
| Label | # of sentences tagged by the Majority as Label | % with respect to all Y-tagged sentences (755) | % with respect to all sentences (1049) |
|---|---|---|---|
| Y2 | 199 | 26% | 19% |
| Y1 | 172 | 23% | 16% |
| Y0 | 297 | 39% | 28% |
Annotators assigning a "Y" to a sentence were further asked to assign a numeric label, indicating the actual protein-protein interaction content of the sentence, as follows: 2 - if protein-protein interaction (PPI) is directly and explicitly mentioned within the sentence (along with the method of detection); 1 - if PPI is implied in the sentence (along with the method of detection), but not explicitly stated; 0 - if PPI is neither implied nor mentioned in the sentence.
The table shows the number of sentences labelled as Y2, Y1 and Y0 by a majority of the annotators, as well as the percentage with respect to the total number of sentences labelled as Y, and with respect to the whole collection of labelled sentences.
Note that the majority Y2, Y1 and Y0 counts in the second column do not sum to 755 (and the respective percentages do not sum to 100%), because for some sentences on which two or more annotators agreed on the "Y" tag, there was no such agreement on the additional numeric label (0, 1 or 2).