Xinglong Wang, Rafal Rak, Angelo Restificar, Chikashi Nobata, C J Rupp, Riza Theresa B Batista-Navarro, Raheel Nawaz, Sophia Ananiadou.
Abstract
BACKGROUND: Selecting relevant articles for curation, and linking those articles to the experimental techniques that confirm their findings, was one of the primary subjects of the recent BioCreative III contest. The contest's Protein-Protein Interaction (PPI) task consisted of two sub-tasks: the Article Classification Task (ACT) and the Interaction Method Task (IMT). ACT aimed to automatically select documents relevant to PPI curation, whereas the goal of IMT was to recognise, in full-text articles, the experimental methods used to identify the interactions.
Year: 2011 PMID: 22151769 PMCID: PMC3269934 DOI: 10.1186/1471-2105-12-S8-S11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Distribution of articles in the training, development, and test datasets
| Task | Training | Development | Test | Scope |
|---|---|---|---|---|
| IMT | 2,035 | 587 | 305 | full-text |
| ACT | 2,280 | 4,000 | 6,000 | abstract and full-text |
Distribution of articles in the training, development, and test datasets provided by the BioCreative III organisers for each of the PPI tasks.
IMT results on the development dataset
| System | Precision | Recall | F1 score | AUC iP/R |
|---|---|---|---|---|
| m-LR | 41.36 | 53.81 | 46.37 | 22.85 |
| m-SVM | 72.12 | 51.31 | 59.96 | 39.05 |
| b-SVM (*) | 68.35 | 61.05 | | 42.02 |
| union(m-SVM,b-SVM) (*) | 65.62 | 63.11 | 64.33 | |
| union(m-LR,b-SVM) | 42.33 | | 50.73 | 27.76 |
| union(m-LR,m-SVM) | 41.39 | 54.01 | 46.46 | |
| intersect(m-LR,b-SVM) (*) | 75.24 | 54.96 | 63.52 | 44.02 |
| intersect(m-LR,m-SVM,b-SVM) (*) | | 50.17 | 61.13 | 40.92 |
Macro-averaged results on the IMT development dataset with 10 best models selected by cross-validation on the training data (%). m-LR – multi-label Logistic Regression; m-SVM – multi-label Support Vector Machines; b-SVM – binary Support Vector Machines. Asterisks (*) denote systems that were submitted to the challenge.
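The union and intersect rows combine the label sets that different systems predict for the same article. The sketch below illustrates that combination together with one possible macro-averaged scoring scheme. The per-article averaging, the `macro_prf` helper and the toy article IDs are assumptions made for illustration; this is not the organisers' evaluation script.

```python
from typing import Dict, Set, Tuple

# article ID -> set of MI IDs (toy data, not taken from the challenge)
Predictions = Dict[str, Set[str]]


def union(a: Predictions, b: Predictions) -> Predictions:
    """Keep every method predicted by at least one of the two systems."""
    return {doc: a.get(doc, set()) | b.get(doc, set()) for doc in a.keys() | b.keys()}


def intersect(a: Predictions, b: Predictions) -> Predictions:
    """Keep only the methods on which both systems agree."""
    return {doc: a.get(doc, set()) & b.get(doc, set()) for doc in a.keys() | b.keys()}


def macro_prf(gold: Predictions, pred: Predictions) -> Tuple[float, float, float]:
    """Average per-article precision and recall (assumed scheme), then take
    the harmonic mean of the two averages as F1."""
    precisions, recalls = [], []
    for doc, gold_labels in gold.items():
        predicted = pred.get(doc, set())
        true_positives = len(gold_labels & predicted)
        precisions.append(true_positives / len(predicted) if predicted else 0.0)
        recalls.append(true_positives / len(gold_labels) if gold_labels else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = {"doc1": {"MI:0007", "MI:0096"}, "doc2": {"MI:0018"}}
m_svm = {"doc1": {"MI:0007"}, "doc2": {"MI:0018", "MI:0114"}}
b_svm = {"doc1": {"MI:0007", "MI:0096"}, "doc2": {"MI:0114"}}

union_scores = macro_prf(gold, union(m_svm, b_svm))
intersect_scores = macro_prf(gold, intersect(m_svm, b_svm))
```

On this toy input the union raises recall at the cost of precision, while the intersection returns fewer labels per article. That is the trade-off the combined systems are designed to explore.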
IMT results on the test dataset
| System | Precision | Recall | F1 score | AUC iP/R |
|---|---|---|---|---|
| m-LR | 58.37 | 55.80 | 57.06 | 34.96 |
| m-SVM | 62.33 | 48.94 | 54.83 | 33.18 |
| b-SVM (*) | 52.56 | 52.45 | 52.50 | 28.45 |
| union(m-SVM,b-SVM) (*) | 53.21 | 59.61 | 56.23 | 35.85 |
| union(m-LR,b-SVM) | 52.51 | |||
| union(m-LR,m-SVM) | 57.43 | 56.18 | 56.80 | 35.26 |
| intersect(m-LR,b-SVM) (*) | 64.06 | 44.01 | 52.17 | 29.47 |
| intersect(m-LR,m-SVM,b-SVM) (*) | | 44.42 | 52.73 | 30.52 |
Results on the IMT test dataset (%). The models were trained on the combined training and development datasets. m-LR – multi-label Logistic Regression; m-SVM – multi-label Support Vector Machines; b-SVM – binary Support Vector Machines. Asterisks (*) denote systems that were submitted to the challenge.
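In the rows of the test table that report precision, recall and F1, the F1 score appears to be the harmonic mean of the reported precision and recall. The short check below recomputes it for those rows. The `f1_score` and `max_deviation` helpers are illustrative names, and the 0.02 tolerance is an assumption that allows for rounding of the published figures.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from rows of the test table
# in which all three values are present.
ROWS = {
    "m-LR": (58.37, 55.80, 57.06),
    "m-SVM": (62.33, 48.94, 54.83),
    "b-SVM": (52.56, 52.45, 52.50),
    "union(m-SVM,b-SVM)": (53.21, 59.61, 56.23),
    "union(m-LR,m-SVM)": (57.43, 56.18, 56.80),
    "intersect(m-LR,b-SVM)": (64.06, 44.01, 52.17),
}


def max_deviation(rows) -> float:
    """Largest gap between the recomputed and the reported F1."""
    return max(abs(f1_score(p, r) - f) for p, r, f in rows.values())


print(f"largest deviation: {max_deviation(ROWS):.3f} percentage points")
```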
Figure 1. Learning curves of the IMT systems. Figure 1 shows learning curves of the following IMT systems: b-SVM, m-SVM, and the union and the intersection of the outputs of b-SVM and m-SVM, as measured by precision, recall, F1 score and AUC iP/R. Each system was trained on increasing amounts of data, i.e., 20%, 40%, 60%, 80% and 100% of the training dataset, and then tested on the development set.
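The learning-curve procedure described in the caption can be sketched as a loop over training-set percentages. In the sketch below the majority-label "classifier", the accuracy score and the use of prefix subsets are all stand-ins chosen to keep the example self-contained; the caption does not say how the subsets were drawn, and the actual systems are SVMs scored by precision, recall, F1 and AUC iP/R.

```python
from collections import Counter


def fit_majority(examples):
    """Toy stand-in for a real classifier: remember the most frequent label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def accuracy(model, examples):
    """Fraction of examples whose label equals the remembered majority label."""
    return sum(label == model for _, label in examples) / len(examples)


def learning_curve(train, dev, fit, score, percentages=(20, 40, 60, 80, 100)):
    """Train on growing prefixes of `train` and score each model on `dev`."""
    curve = []
    for pct in percentages:
        n = max(1, len(train) * pct // 100)
        model = fit(train[:n])
        curve.append((pct, n, score(model, dev)))
    return curve


train_labels = ["neg", "neg", "pos", "pos", "pos", "pos", "pos", "neg", "pos", "pos"]
train = [(f"train{i}", label) for i, label in enumerate(train_labels)]
dev = [("dev0", "pos"), ("dev1", "pos"), ("dev2", "neg"), ("dev3", "pos")]

curve = learning_curve(train, dev, fit_majority, accuracy)
for pct, n, value in curve:
    print(f"{pct:3d}% ({n:2d} examples): {value:.2f}")
```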
Figure 2. Comparison of MI ID distributions in the IMT training, development and test datasets. Figure 2 shows histograms of the frequencies of MI IDs in the training, development and test datasets, respectively.
ACT results on the test dataset
| System | F1 score | Specificity | Sensitivity | Accuracy | Matthews Coef | AUC iP/R |
|---|---|---|---|---|---|---|
| SVMMeSH ID | 57.44 | | 49.23 | | 52.237 | 49.26 |
| SVMMeSH Tree | 59.01 | 94.97 | 53.63 | 88.70 | 52.890 | 51.65 |
| LRMeSH ID | | 93.93 | | 88.32 | | |
Results on the ACT test dataset (%). SVMMeSH ID – SVM with MeSH identifiers; SVMMeSH Tree – SVM with MeSH tree structure; LR – LR with MeSH identifiers.
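The ACT columns are standard binary-classification measures that can all be derived from a confusion matrix. The sketch below computes them from raw counts. The `binary_metrics` helper and the example counts are illustrative and are not taken from the challenge data; the function returns fractions, whereas the tables report percentages.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive the ACT-style measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on negatives
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "f1": f1,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 100 relevant and 100 irrelevant articles.
metrics = binary_metrics(tp=50, fp=10, tn=90, fn=50)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}")
```

With imbalanced data a system can reach high specificity and accuracy while its sensitivity stays low, which is why F1 and the Matthews coefficient are reported alongside them.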
ACT 10-fold cross-validation results on the training and development datasets
| System | F1 score | Specificity | Sensitivity | Accuracy | Matthews Coef | AUC iP/R |
|---|---|---|---|---|---|---|
| SVMMeSH ID | 75.87 | 94.26 | 69.70 | 87.13 | 67.68 | 75.11 |
| SVMMeSH Tree | | | 71.08 | | | 76.22 |
| LRMeSH ID | 76.78 | 93.49 | | 87.33 | 68.37 | |
| SVMMeSH ID&Tree | 76.90 | | 70.91 | 87.64 | 69.01 | 76.20 |
The 10-fold cross-validation results for ACT on the training and development datasets (%). SVMMeSH ID – SVM with MeSH identifiers; SVMMeSH Tree – SVM with MeSH tree structure; LR – LR with MeSH identifiers; SVMMeSH ID&Tree – SVM with both MeSH identifiers and tree structure.
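For readers unfamiliar with the protocol, 10-fold cross-validation partitions the data into ten folds and uses each fold once for testing while training on the other nine. The sketch below generates such splits with the standard library only. The shuffling seed and the absence of stratification are assumptions, since the caption does not say how the folds were drawn.

```python
import random


def k_fold_indices(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(25, k=10))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train_idx)} train / {len(test_idx)} test")
```

The scores from the ten test folds would then be averaged to produce a single row of the table.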
ACT results on the development dataset
| System | F1 score | Specificity | Sensitivity | Accuracy | Matthews Coef | AUC iP/R |
|---|---|---|---|---|---|---|
| SVMMeSH ID | 14.05 | | 8.80 | | 10.05 | 20.75 |
| SVMMeSH Tree | 15.68 | 96.11 | 10.12 | 81.45 | | |
| LRMeSH ID | | 61.09 | | 58.38 | 4.80 | 19.16 |
Results on the ACT development dataset with models trained on the training dataset (%). SVMMeSH ID – SVM with MeSH identifiers; SVMMeSH Tree – SVM with MeSH tree structure; LR – LR with MeSH identifiers.
Figure 3. Learning curves of the ACT systems. Figure 3 shows how the size of the training data affected ACT performance. We gradually increased the training size from 10% to 90%. At each percentage point the training data were randomly selected, 10-fold cross-validation was performed, and the results were plotted.
Mapping between MI IDs and MeSH terms
| Rank | MI ID | MI name | MeSH ID | MeSH term |
|---|---|---|---|---|
| 1 | MI:0007 | anti tag coimmunoprecipitation | E05.196.150.639 | Co-Immunoprecipitation |
| 2 | MI:0006 | anti bait coimmunoprecipitation | E05.196.150.639 | Co-Immunoprecipitation |
| 3 | MI:0096 | pull down | E05.196.181.400.170 | Affinity Chromatography |
| 4 | MI:0018 | two hybrid | E05.393.220.870 | Two-hybrid System Techniques |
| 5 | MI:0114 | X-ray crystallography | E05.196.309.742.225 | X-Ray Crystallography |
| 6 | MI:0071 | Molecular Sieving | E05.196.181.400.250 | Molecular Sieve Chromatography |
| 7 | MI:0416 | Fluorescence Microscopy | E01.370.350.515.458 | Fluorescence Microscopy |
| 8 | MI:0424 | Protein Kinase Assay | E05.196.630.570.700 | Protein Array Analysis |
| 9 | MI:0107 | Surface Plasmon Resonance | E05.196.890 | Surface Plasmon Resonance |
| 10 | MI:0663 | Confocal Microscopy | E01.370.350.515.395 | Confocal Microscopy |
The manually constructed mapping between interaction methods from the PSI-MI ontology and MeSH terms, ranked by occurrence frequency in the training data
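A mapping of this kind can be held in a plain dictionary and inverted to find which interaction methods share a MeSH term. The sketch below uses the first five rows of the table. The `candidate_methods` helper is hypothetical and shows only one way the mapping might be consulted; it is not a description of how the systems above use it.

```python
from collections import defaultdict

# MI ID -> (MeSH tree number, MeSH term), copied from ranks 1-5 of the table.
MI_TO_MESH = {
    "MI:0007": ("E05.196.150.639", "Co-Immunoprecipitation"),
    "MI:0006": ("E05.196.150.639", "Co-Immunoprecipitation"),
    "MI:0096": ("E05.196.181.400.170", "Affinity Chromatography"),
    "MI:0018": ("E05.393.220.870", "Two-hybrid System Techniques"),
    "MI:0114": ("E05.196.309.742.225", "X-Ray Crystallography"),
}

# Inverted index: MeSH tree number -> sorted list of MI IDs.
MESH_TO_MI = defaultdict(list)
for mi_id, (mesh_id, _term) in sorted(MI_TO_MESH.items()):
    MESH_TO_MI[mesh_id].append(mi_id)


def candidate_methods(article_mesh_ids):
    """Return the MI IDs whose mapped MeSH tree number annotates the article."""
    found = set()
    for mesh_id in article_mesh_ids:
        found.update(MESH_TO_MI.get(mesh_id, []))
    return sorted(found)


print(candidate_methods(["E05.196.150.639", "E05.393.220.870"]))
```

Because two MI IDs map to the same MeSH term, a MeSH annotation alone cannot distinguish anti tag from anti bait coimmunoprecipitation.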
Prior analysis of 10 sample abstracts
| Positive PMID | A1 | A2 | A3 | A4 | A5 | Negative PMID | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 17517622 | yes | yes | no | yes | yes | 19413980 | no | no | no | no | no |
| 17586502 | yes | no | yes | no | yes | 19416831 | yes | no | no | yes | no |
| 17666011 | yes | yes | no | no | yes | 19421224 | yes | no | no | no | no |
| 17762861 | yes | yes | no | yes | yes | 19429605 | yes | no | yes | no | no |
| 17942705 | yes | yes | yes | yes | no | 19435285 | yes | no | yes | no | no |
ACT feature knock-out experiments for SVM
| Features | F1 score | Specificity | Sensitivity | Accuracy | Matthews Coef | AUC iP/R |
|---|---|---|---|---|---|---|
| B | 73.45 | 93.02 | 67.95 | 85.75 | 64.19 | 72.44 |
| N | 31.75 | | 19.76 | 75.35 | 31.50 | 42.04 |
| C | 69.47 | 93.58 | 61.58 | 84.30 | 60.03 | 69.98 |
| M | 69.07 | 91.63 | 63.56 | 83.49 | 58.33 | 68.33 |
| BC | 74.93 | 94.10 | 68.55 | 86.69 | 66.50 | 73.92 |
| BCM | 76.71 | 94.33 | 70.86 | 87.52 | 68.70 | 76.00 |
| BNCM | | 94.48 | | | | |
Results of feature knock-out experiments on the combined ACT training and development datasets (%) with Support Vector Machine (SVM). B – bag of words; N – named entities; C – contextual words surrounding proteins; M – MeSH descriptors.
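Each row of a knock-out table corresponds to training on a different subset of the four feature groups. One way to assemble those subsets is to keep every group in its own namespace and merge only the groups named in the configuration string. The sketch below follows that idea; the feature names, values and the `combine` helper are invented for illustration.

```python
# Configurations listed in the table: single groups first, then combinations.
CONFIGS = ["B", "N", "C", "M", "BC", "BCM", "BNCM"]

# Toy features for one abstract, keyed by feature-group code.
groups = {
    "B": {"interacts": 1, "binds": 2},   # bag of words
    "N": {"PROTEIN": 3},                 # named entities
    "C": {"with": 1},                    # contextual words surrounding proteins
    "M": {"E05.196.150.639": 1},         # MeSH descriptors
}


def combine(feature_groups: dict, config: str) -> dict:
    """Merge the groups named in `config`, prefixing each feature with its
    group code so that identical strings from different groups stay distinct."""
    merged = {}
    for code in config:
        for name, value in feature_groups[code].items():
            merged[f"{code}:{name}"] = value
    return merged


for config in CONFIGS:
    print(config, sorted(combine(groups, config)))
```

Each merged dictionary would then be vectorised and passed to the classifier, and the scores compared across configurations.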
ACT feature knock-out experiments for LR
| Features | F1 score | Specificity | Sensitivity | Accuracy | Matthews Coef | AUC iP/R |
|---|---|---|---|---|---|---|
| B | 72.33 | 91.61 | 68.28 | 84.84 | 62.14 | 78.97 |
| N | 50.05 | | 38.20 | 77.88 | 40.75 | 60.12 |
| C | 69.38 | 89.39 | 66.91 | 82.87 | 57.58 | 76.30 |
| M | 69.61 | 90.83 | 65.37 | 83.44 | 58.53 | 75.06 |
| BC | 74.57 | 92.69 | 70.09 | 86.13 | 65.34 | 80.75 |
| BCM | 76.45 | 93.20 | 72.17 | 87.10 | 67.84 | 82.67 |
| BNCM | | 93.49 | | | | |
Results of feature knock-out experiments on the combined ACT training and development datasets for the logistic regression (LR) model (%). B – bag of words; N – named entities; C – contextual words surrounding proteins; M – MeSH descriptors.