Shashank Agarwal, Feifan Liu, Hong Yu.
Abstract
BACKGROUND: Protein-protein interaction (PPI) is an important biomedical phenomenon. Automatically detecting PPI-relevant articles and identifying the methods used to study PPI are important text mining tasks. In this study, we explored domain-independent features to develop two open-source machine learning frameworks. The first, "Simple Classifier", performs binary classification to determine whether a given article is PPI-relevant. The second, "OntoNorm", maps PPI-relevant articles to the corresponding interaction method nodes in the standardized PSI-MI (Proteomics Standards Initiative-Molecular Interactions) ontology.
Year: 2011 PMID: 22151701 PMCID: PMC3269933 DOI: 10.1186/1471-2105-12-S8-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
ACT Data
| ACT Data (article abstracts) | Count |
|---|---|
| Training Data Total | 2280 |
| Training Data Positive | 1140 |
| Training Data Negative | 1140 |
| Development Data Total | 4000 |
| Development Data Positive | 682 |
| Development Data Negative | 3318 |
| Test Data Total | 6000 |
The organizers of BioCreative III provided training, development and test data for the article classification task (ACT). This table shows the size of each data set.
IMT Data
| IMT Data (article full-texts) | Value |
|---|---|
| Training Number of Articles | 2035 |
| Training Number of Annotations | 4347 |
| Training Annotations per Article | 2.14 |
| Development Number of Articles | 587 |
| Development Number of Annotations | 1379 |
| Development Annotations per Article | 2.35 |
| Test Number of Articles | 305 |
The organizers of BioCreative III provided training, development and test data for the interaction method task (IMT). This table shows the size of each data set.
Runs submitted for ACT
| Run number | Label | Classifier Algorithm | Type of features used | Number of features | Training data |
|---|---|---|---|---|---|
| 1 | NBM-12-1k-td | NBM | Unigrams and Bigrams | 1000 | Training+Development |
| 2 | NBM-12-400-td | NBM | Unigrams and Bigrams | 400 | Training+Development |
| 3 | NBM-12-1k-d | NBM | Unigrams and Bigrams | 1000 | Development |
| 4 | SVM-12-400-d | SVM | Unigrams and Bigrams | 400 | Development |
| 5 | SVM-12-400-td | SVM | Unigrams and Bigrams | 400 | Training+Development |
| 6 | NBM-1-1k-td | NBM | Unigrams | 1000 | Training+Development |
| 7 | NBM-1-400-td | NBM | Unigrams | 400 | Training+Development |
| 8 | NBM-1-1k-d | NBM | Unigrams | 1000 | Development |
| 9 | SVM-1-400-d | SVM | Unigrams | 400 | Development |
| 10 | SVM-1-400-td | SVM | Unigrams | 400 | Training+Development |
For the BioCreative III challenge, each participating team could submit 10 runs for ACT: five offline and five online, using XML-RPC. We submitted runs 1-5 offline and runs 6-10 online. For all runs, we used the mutual information feature selection algorithm, as it gave better performance than the chi-square score. The table lists the 10 runs we submitted.
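The "Unigrams and Bigrams" feature type and the feature counts (400 or 1000) in the table can be illustrated with a minimal sketch. It assumes a simple lowercase alphanumeric tokenizer and a precomputed selection score per term; the function names `extract_ngrams` and `top_k_terms` and the example scores are hypothetical and do not come from the authors' code.

```python
import re

def extract_ngrams(text, use_bigrams=True):
    """Return unigrams (and optionally bigrams) from lowercased text.

    Assumption: a token is a run of letters or digits, so "two-hybrid"
    yields the tokens "two" and "hybrid".
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = list(tokens)
    if use_bigrams:
        # Adjacent token pairs, joined with a single space.
        grams += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return grams

def top_k_terms(scores, k):
    """Keep the k terms with the highest score (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

grams = extract_ngrams("Yeast two-hybrid assay")

# Hypothetical scores, for illustration only.
selected = top_k_terms({"binding": 0.20, "interaction": 0.30, "patient": 0.05}, k=2)
```

Under these assumptions, a run labeled with 400 features would call `top_k_terms(scores, 400)` on the scores computed over the chosen training data.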
Top 10 Unigrams and Bigrams for IMT
| Term | Mutual information score | Chi-square value |
|---|---|---|
| two hybrid | 0.439 | 1225.574 |
| immunoprecipitation | 0.437 | 1110.124 |
| hybrid | 0.398 | 1041.587 |
| yeast two | 0.348 | 1061.789 |
| diffraction | 0.263 | 1142.337 |
| resonance | 0.236 | 969.286 |
| crystallography | 0.182 | 751.011 |
| x ray | 0.176 | 622.764 |
| yeast | 0.173 | 402.283 |
| gal4 | 0.168 | 576.122 |
The top 10 unigrams and bigrams by mutual information score and their corresponding chi-square values. The terms are sorted by their mutual information score.
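Both scores in this table can be computed from a 2x2 contingency table of term presence against class membership. The sketch below uses the standard textbook definitions (expected mutual information in bits, and Pearson's chi-square for a 2x2 table); the source does not state which exact variants were used, so treat the formulas and the function names as assumptions.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information (bits) between term presence and class.

    n11: documents with the term, in the class
    n10: documents with the term, not in the class
    n01: documents without the term, in the class
    n00: documents without the term, not in the class
    """
    n = n11 + n10 + n01 + n00
    # (cell count, row total for term state, column total for class state)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for count, term_total, class_total in cells:
        if count > 0:  # empty cells contribute nothing
            mi += (count / n) * math.log2(n * count / (term_total * class_total))
    return mi

def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for the same 2x2 table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

# A term that occurs in every class document and nowhere else
# reaches 1 bit of mutual information and a chi-square equal to N.
mi_perfect = mutual_information(10, 0, 0, 10)
chi_perfect = chi_square(10, 0, 0, 10)
```

The two scores need not rank terms identically, which is consistent with the table: "diffraction" has a higher chi-square value than "hybrid" but a lower mutual information score.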
Features used for IMT
| Feature | Feature type | Description |
|---|---|---|
| Perfect match (2 features) | Binary | For each node, checks if (1) the concept name or (2) any synonym name appears in the article |
| Term match (4 features) | Binary | For each node, checks if any unigram/bigram in the node’s (1, 2) concept name or (3, 4) synonyms appears in the article |
| Term match ratio (4 features) | Continuous | For each node, the ratio of unigrams/bigrams in the node’s (1, 2) concept name or (3, 4) synonyms that appear in the article |
| Matched terms mutual information sum (4 features) | Continuous | Sum of the mutual information scores of each matching unigram/bigram in the node’s (1, 2) concept name or (3, 4) any synonym. |
| Matched term chi-squared sum (4 features) | Continuous | Sum of chi-squared value of each matching unigram/bigram in the node’s (1, 2) concept name or (3, 4) any synonym. |
| Node popularity | Integer | The number of times this node is annotated in the training data |
| Regex annotation | Binary | Checks if the regular expression-based annotator that was provided by the organizers of BioCreative III annotates the current article-ontology node pair |
| Keyword presence | Binary | Checks if the keyword for the ontology node appears in the article |
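The first three feature groups in the table (perfect match, term match, term match ratio) can be sketched as follows. This is one reading of the descriptions, assuming that features (1, 2) are the unigram and bigram variants for the concept name and (3, 4) the same for the synonyms; the tokenizer, the function name `match_features` and the example node are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unigrams(text):
    return set(tokenize(text))

def bigrams(text):
    toks = tokenize(text)
    return {" ".join(pair) for pair in zip(toks, toks[1:])}

def match_features(article, concept_name, synonyms):
    """Compute 10 of the 21 features for one article-node pair."""
    art_text = " " + " ".join(tokenize(article)) + " "
    art_uni, art_bi = unigrams(article), bigrams(article)

    def appears(phrase):
        # Whole-phrase match on token boundaries.
        norm = " ".join(tokenize(phrase))
        return bool(norm) and f" {norm} " in art_text

    def ratio(terms, article_terms):
        return len(terms & article_terms) / len(terms) if terms else 0.0

    # Node terms: union over all synonyms for the synonym-based features.
    syn_uni = set().union(*(unigrams(s) for s in synonyms))
    syn_bi = set().union(*(bigrams(s) for s in synonyms))

    ratios = {
        "name_unigram": ratio(unigrams(concept_name), art_uni),
        "name_bigram": ratio(bigrams(concept_name), art_bi),
        "synonym_unigram": ratio(syn_uni, art_uni),
        "synonym_bigram": ratio(syn_bi, art_bi),
    }
    features = {
        "perfect_name": appears(concept_name),
        "perfect_synonym": any(appears(s) for s in synonyms),
    }
    for key, value in ratios.items():
        features[f"term_match_{key}"] = value > 0   # binary term match
        features[f"term_ratio_{key}"] = value       # continuous ratio
    return features

# Hypothetical node and article text, for illustration only.
feats = match_features(
    "We used a yeast two-hybrid screen to detect the interaction.",
    concept_name="two hybrid",
    synonyms=["2 hybrid", "yeast two hybrid system"],
)
```

The mutual information and chi-squared sum features would follow the same pattern, summing a per-term score over the matched terms instead of counting them.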
Runs submitted for IMT
| Run number | Label | Algorithm | Number of features |
|---|---|---|---|
| 1 | j48-21 | J48 | All (21 features) |
| 2 | rc-21 | Random Committee | All (21 features) |
| 3 | rf-21 | Random Forest | All (21 features) |
| 4 | j48-14 | J48 | 14 features |
| 5 | rf-12 | Random Forest | 12 features |
| 6 | rc-12 | Random Committee | 12 features |
| 7 | rc-14 | Random Committee | 14 features |
| 8 | rf-7 | Random Forest | 7 features |
| 9 | nbt-7 | Naïve Bayes Tree | 7 features |
| 10 | rf-15 | Random Forest | 15 features |
For the BioCreative III challenge, each participating team could submit 10 runs for IMT: five offline and five online, using XML-RPC. We submitted runs 1-5 offline and runs 6-10 online. For all runs, we combined the training and the development data. The table lists the 10 runs we submitted.
ACT Results
| Run number | Label | Accuracy (%) | Specificity (%) | Sensitivity (%) | F1-Score (%) | MCC | AUC iP/R (%) |
|---|---|---|---|---|---|---|---|
| 1 | NBM-12-1k-td | 80.02 | 80.90 | 75.06 | 53.26 | 0.449 | 61.29 |
| 2 | NBM-12-400-td | 81.00 | 81.75 | 76.81 | 55.08 | 0.472 | 62.13 |
| 3 | NBM-12-1k-d | 82.40 | 83.85 | 74.29 | 56.15 | 0.482 | 60.48 |
| 4 | SVM-12-400-d | 87.73 | 94.79 | 48.24 | 54.40 | 0.480 | 43.76 |
| 5 | SVM-12-400-td | 87.27 | 91.81 | 61.87 | 59.58 | 0.521 | 48.47 |
| 6 | NBM-1-1k-td | 77.80 | 77.84 | 77.58 | 51.46 | 0.432 | 57.44 |
| 7 | NBM-1-400-td | 78.05 | 78.15 | 77.47 | 51.71 | 0.434 | 57.56 |
| 8 | NBM-1-1k-d | 79.90 | 81.00 | 73.74 | 52.67 | 0.441 | 54.97 |
| 9 | SVM-1-400-d | 86.25 | 92.06 | 53.74 | 54.24 | 0.462 | 41.58 |
| 10 | SVM-1-400-td | 86.87 | 90.39 | 67.14 | 60.80 | 0.533 | 47.40 |
The results of the submitted ACT runs on the test data. Legend: MCC = Matthews correlation coefficient; AUC iP/R = area under the interpolated precision/recall curve.
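For readers who want to relate the columns to each other, the sketch below computes accuracy, specificity, sensitivity, F1-score and MCC from a binary confusion matrix using their standard definitions. The counts are hypothetical and are not taken from the ACT test data; AUC iP/R needs ranked predictions and is not covered.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes every marginal total is non-zero (no zero-division handling).
    """
    total = tp + fp + fn + tn
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),   # true negative rate
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
```

Because accuracy and specificity are dominated by the majority class when negatives outnumber positives, a run can score high on both while its sensitivity and F1-score stay low, as the SVM rows in the table show.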
IMT Results
| Run | Label | Precision (%) | Recall (%) | F1-Score (%) | MCC | AUC iP/R (%) |
|---|---|---|---|---|---|---|
| 1 | j48-21 | 52.52 | 49.53 | 50.98 | 0.500 | 28.20 |
| 2 | rc-21 | 52.02 | 48.96 | 50.44 | 0.495 | 28.59 |
| 3 | rf-21 | 50.78 | 49.34 | 50.05 | 0.490 | 27.24 |
| 4 | j48-14 | 52.50 | 49.91 | 51.17 | 0.502 | 29.22 |
| 5 | rf-12 | 52.58 | 52.18 | 52.38 | 0.514 | 29.98 |
| 6 | rc-12 | 52.71 | 51.61 | 52.16 | 0.512 | 29.93 |
| 7 | rc-14 | 52.28 | 50.10 | 51.16 | 0.502 | 30.05 |
| 8 | rf-7 | 52.28 | 52.18 | 52.23 | 0.512 | 30.05 |
| 9 | nbt-7 | 49.55 | 52.56 | 51.01 | 0.500 | 29.30 |
| 10 | rf-15 | 51.76 | 50.29 | 51.01 | 0.500 | 29.80 |
The results of the submitted IMT runs on the test data. Legend: MCC = Matthews correlation coefficient; AUC iP/R = area under the interpolated precision/recall curve.
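As a consistency check on the table, the F1-score column can be recomputed from the precision and recall columns with the usual harmonic mean, F1 = 2PR / (P + R). The sketch below does this for all ten rows; the small residual gaps are presumably due to the published precision and recall already being rounded to two decimals.

```python
# (label, precision %, recall %, reported F1 %) copied from the table.
IMT_RESULTS = [
    ("j48-21", 52.52, 49.53, 50.98),
    ("rc-21", 52.02, 48.96, 50.44),
    ("rf-21", 50.78, 49.34, 50.05),
    ("j48-14", 52.50, 49.91, 51.17),
    ("rf-12", 52.58, 52.18, 52.38),
    ("rc-12", 52.71, 51.61, 52.16),
    ("rc-14", 52.28, 50.10, 51.16),
    ("rf-7", 52.28, 52.18, 52.23),
    ("nbt-7", 49.55, 52.56, 51.01),
    ("rf-15", 51.76, 50.29, 51.01),
]

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Absolute difference between recomputed and reported F1, per run.
gaps = {
    label: abs(f1_score(p, r) - reported)
    for label, p, r, reported in IMT_RESULTS
}
max_gap = max(gaps.values())
```

Every recomputed value agrees with the reported F1-score to within 0.02 percentage points.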