Abstract
BACKGROUND: Identifying protein-protein interactions (PPIs) from literature is an important step in mining the function of individual proteins as well as their biological network. Since it is known that PPIs have distinctive patterns in text, machine learning approaches have been successfully applied to mine these patterns. However, the complex nature of PPI descriptions makes the extraction process difficult.
Year: 2011 PMID: 22151252 PMCID: PMC3269944 DOI: 10.1186/1471-2105-12-S8-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1: Overview of the proposed PPI article classification approach. Input articles are first checked for the presence of gene/protein names in the text. After gene name detection, feature generation is performed in three ways: word features, including multi-words, sub-strings, and MeSH terms; syntactic features, involving grammar relations between words; and higher-order features, obtained by evaluating combinations of different features.
The corpus information used in our experiments.
| Corpus Name | Positive Examples | Negative Examples | Total Examples |
|---|---|---|---|
| BioCreative II | 3874 | 2298 | 6172 |
| BioCreative II.5 | 124 | 1066 | 1190 |
| BioCreative III Training Set | 1140 | 1140 | 2280 |
| BioCreative III Development Set | 682 | 3318 | 4000 |
| Total Training Set | 5820 | 7822 | 13642 |
| BioCreative III Test Set | 910 | 5090 | 6000 |
BioCreative II, BioCreative II.5, and the BioCreative III training and development sets were used as the training corpus for the ACT competition. While the training corpus is balanced, the BioCreative III test set is imbalanced, with roughly six times as many negative examples as positive ones. Hence, for the official submission, system parameters were tuned on the BioCreative III development set.
The feature combinations used for submitted runs on the article classification task
| | BC3 Dev Set | Multi-word: UNI | Multi-word: BI | Multi-word: TRI | MeSH Term | Stemmed GRs | Feature Cut | Higher Order |
|---|---|---|---|---|---|---|---|---|
| Run 1 | X | X | X | | | | | |
| Run 2 | X | X | X | X | | | | |
| Run 3 | X | X | X | X | X | X | | |
| Run 4 | X | X | X | X | X | X | X | |
| Run 5 | X | X | X | X | X | X | X | |
The training data used in the official submissions includes all examples from previous BioCreative PPI article tasks; the BioCreative III development set was selectively added to the training data in different runs. Unigrams (UNI), bigrams (BI), and trigrams (TRI) were used as multi-word features. The MeSH feature consists of unigrams and bigrams drawn from MeSH terms. For grammar relations (GRs), stemming was applied in Runs 3 through 5. The feature cut was performed using a frequency threshold of four.
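A frequency cut at threshold four can be sketched as follows (a minimal illustration; the function names and the exact counting scheme are assumptions — the text only states the threshold):

```python
from collections import Counter

def frequency_cut(articles_features, threshold=4):
    """Drop features occurring fewer than `threshold` times across the corpus.

    articles_features: one list of string features per article.
    """
    counts = Counter(f for feats in articles_features for f in feats)
    kept = {f for f, c in counts.items() if c >= threshold}
    # Filter each article's features down to the retained vocabulary
    return [[f for f in feats if f in kept] for feats in articles_features]

corpus = [["bind", "rare"], ["bind"], ["bind"], ["bind"]]
print(frequency_cut(corpus))  # "bind" occurs 4 times and survives; "rare" is cut
```

In practice such a cut shrinks the feature space substantially while removing mostly noise, since very rare features carry little statistical signal for the classifier.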
Official scores for the ACT competition.
| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| TP | 580 | 516 | 553 | 531 | 565 |
| FP | 417 | 257 | 376 | 288 | 398 |
| FN | 330 | 394 | 357 | 379 | 345 |
| TN | 4673 | 4833 | 4714 | 4802 | 4692 |
| Accuracy | 0.8755 | 0.8915 | 0.8778 | 0.8888 | 0.8762 |
| Specificity | 0.9181 | 0.9495 | 0.9261 | 0.9434 | 0.9218 |
| Sensitivity | 0.6374 | 0.5670 | 0.6077 | 0.5835 | 0.6209 |
| F1 score | 0.6083 | 0.6132 | 0.6014 | 0.6142 | 0.6033 |
| MCC | 0.53524 | 0.55306 | 0.52932 | 0.55054 | 0.53031 |
| AUC iP/R | 0.6591 | 0.6796 | 0.6589 | 0.6798 | 0.6537 |
TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively. MCC is the Matthews correlation coefficient. AUC iP/R is the area under the interpolated precision/recall curve. F1 score and MCC evaluate binary classification performance, while AUC iP/R evaluates system performance on ranked results.
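All of these threshold-based metrics follow directly from the confusion-matrix counts. A minimal sketch (function and variable names are my own, not from the paper):

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)  # a.k.a. recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "specificity": specificity,
            "sensitivity": sensitivity, "f1": f1, "mcc": mcc}

# Run 1 counts from the official-scores table
print(classification_metrics(580, 417, 330, 4673))
```

Applied to the Run 1 column, this reproduces the tabulated values (accuracy 0.8755, specificity 0.9181, F1 0.6083, MCC 0.53524) up to rounding; AUC iP/R is the only metric that additionally requires the ranked output.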
Performance results for corrected PPI classification on the ACT test set.
| | Run 2 | Run 2’ | Run 4 | Run 4’ |
|---|---|---|---|---|
| TP | 516 | 529 | 531 | 556 |
| FP | 257 | 271 | 288 | 311 |
| FN | 394 | 381 | 379 | 354 |
| TN | 4833 | 4819 | 4802 | 4779 |
| Accuracy | 0.8915 | 0.8913 | 0.8888 | 0.8892 |
| Specificity | 0.9495 | 0.9468 | 0.9434 | 0.9389 |
| Sensitivity | 0.5670 | 0.5813 | 0.5835 | 0.6110 |
| F1 score | 0.6132 | 0.6187 | 0.6142 | 0.6258 |
| MCC | 0.55306 | 0.55722 | 0.55054 | 0.56100 |
| AUC iP/R | 0.6796 | 0.6806 | 0.6798 | |
Run 2’ and Run 4’ are the corrected performance results for Run 2 and Run 4, respectively. In the official runs, gene names consisting of more than one word were not treated as single entities; only this issue was fixed for Run 2’ and Run 4’.
Average precision rates when adding grammar relations to single words.
| Feature Set | Naïve Bayes | SVM | Huber |
|---|---|---|---|
| Single Words (SW) | 0.6169 | 0.6600 | 0.6646 |
| Grammar Relations (GR) | 0.6281 | 0.6391 | 0.6417 |
| SW + GR | | | |
The best score is obtained by using both single words and grammar relations for all classifiers. The training data comprised the BioCreative II, BioCreative II.5, and BioCreative III training corpora; performance was measured on the BioCreative III development set.
Performance changes on the ACT development set by varying feature types.
| Used Features | Avg Prec | Precision | Recall | F1 score |
|---|---|---|---|---|
| Baseline | 0.7073 | 0.6403 | 0.6290 | 0.6346 |
| –Gene Anonymization | 0.7017 | 0.6166 | 0.6320 | 0.6242 |
| –Multi-words | 0.7035 | 0.6358 | 0.6349 | 0.6354 |
| –Sub-strings | 0.7019 | 0.6329 | 0.6320 | 0.6324 |
| –MeSH Terms | 0.7009 | 0.6334 | 0.6372 | 0.6353 |
| Baseline+Higher Order | 0.6311 | | | |
The baseline performance is the result obtained from our system pipeline with the same settings used for Run 4. Each row prefixed with “–” shows the evaluation results when that feature type is excluded. The last row shows the results when higher-order features are added to the baseline.
Figure 2: The non-interpolated precision-recall curve on the BioCreative III test set. The curves show Run 4 and the result obtained with single-word features alone in the same classification pipeline. The points are the non-interpolated precision/recall value pairs produced by the official BioCreative III evaluation script.
Figure 3: An example of word feature extraction. Unigrams, bigrams, and trigrams of words are selected as multi-word features. Sub-string features are sequences of six consecutive characters. MeSH terms are extracted from the MeSH field of each article, and their unigram and bigram subphrases are used.
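The word-level features described here can be sketched in code (a simplified illustration; the tokenization and the example sentence are assumptions, not the paper's exact implementation):

```python
def ngram_features(tokens, n_max=3):
    """Unigrams, bigrams, and trigrams of words (multi-word features)."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

def substring_features(token, k=6):
    """All runs of six consecutive characters within a token."""
    return [token[i:i + k] for i in range(len(token) - k + 1)]

tokens = ["p53", "interacts", "with", "MDM2"]  # hypothetical sentence
print(ngram_features(tokens))
print(substring_features("interacts"))  # ['intera', 'nterac', 'teract', 'eracts']
```

Sub-string features of this kind help match morphological variants (e.g. "interacts", "interaction", "interacting" share the sub-string "intera") without a full stemmer.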
Figure 4: An example of syntactic feature extraction. Syntactic features analyze word-word relationships grammatically; the words in each relation play different roles, as head word and dependent word. To capture more general patterns, an anonymization technique is applied to gene names in the dependent position.