Martin Krallinger, Miguel Vazquez, Florian Leitner, David Salgado, Andrew Chatr-Aryamontri, Andrew Winter, Livia Perfetto, Leonardo Briganti, Luana Licata, Marta Iannuccelli, Luisa Castagnoli, Gianni Cesareni, Mike Tyers, Gerold Schneider, Fabio Rinaldi, Robert Leaman, Graciela Gonzalez, Sergio Matos, Sun Kim, W John Wilbur, Luis Rocha, Hagit Shatkay, Ashish V Tendulkar, Shashank Agarwal, Feifan Liu, Xinglong Wang, Rafal Rak, Keith Noto, Charles Elkan, Zhiyong Lu, Rezarta Islamaj Dogan, Jean-Fred Fontaine, Miguel A Andrade-Navarro, Alfonso Valencia.
Abstract
BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations and tried to address how an end user would oversee the generated output, for instance by providing ranked results and textual evidence for human interpretation, or by measuring the time saved by using automated systems. Detecting articles that describe complex biological events such as PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose the BCIII-ACT corpus was provided, comprising training, development and test sets of over 12,000 PPI-relevant and non-relevant PubMed abstracts labeled manually by domain experts, with the human classification times also recorded. The Interaction Method Task (IMT) went beyond abstracts and required mining more than 3,500 full-text articles for associations with the interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
Year: 2011 PMID: 22151929 PMCID: PMC3269938 DOI: 10.1186/1471-2105-12-S8-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
ACT data overview
| Data set | Tot. articles | PPI | not PPI | Perc. PPI | Years | Journals |
|---|---|---|---|---|---|---|
| Training | 2,280 | 1,140 | 1,140 | 50% | 2007-2010 | 118 |
| Development | 4,000 | 682 | 3,318 | 17.05% | 2009-2010 | 113 |
| Test | 6,000 | 910 | 5,090 | 15.17% | 2009-2010 | 112 |
| Total | 12,280 | 2,732 | 9,548 | - | 2007-2010 | 121 |
Overview of the data collections provided for ACT
IMT data overview
| Data set | Tot. articles | Annotations | PSI-MI IDs | IDs/article | Years | Journals |
|---|---|---|---|---|---|---|
| Training | 2,003 | 4,348 | 86 | 2.17 | 2006-2010 | 87 |
| Development | 587 | 1,316 | 71 | 2.24 | 2006-2010 | 17 |
| Test | 223 | 528 | 46 | 2.36 | 2008-2010 | 9 |
Overview of the data collections provided for IMT
Figure 1: IMT data set class distribution. Pie charts illustrating the most frequent methods encountered in the three IMT data collections. Classes are ordered by their frequency in the test set. The most frequent training and development set classes, shown in shades of brown, together contribute more than 50% of all class assignments in those two sets. Shown in blue is class MI:0114 (x-ray crystallography), which is not frequent in the test set. Shown in green are classes that are significantly more frequent in the test set than the others. MI:0018 (two hybrid) and MI:0114 are frequent in the training and development sets, while MI:0416 (fluorescence microscopy) and MI:0019 (coimmunoprecipitation) are frequent in the test set.
PPI task participating teams
| TeamId | Leader | Institution | Country | ACT | IMT | System available |
|---|---|---|---|---|---|---|
| 65 | Fabio Rinaldi | University of Zurich | Switzerland | 5 | 5 | yes |
| 69 | Robert Leaman | Arizona State University | USA | 0 | 5 | yes |
| 70 | Sergio Matos | Universidade de Aveiro, IEETA | Portugal | 5 | 5 | - |
| 73 | W John Wilbur | NCBI | USA | 5 | 0 | yes |
| 81 | Luis Rocha | Indiana University | USA | 10 | 5 | - |
| 88 | Ashish Tendulkar | IIT Madras | India | 2 | 2 | yes |
| 89 | Shashank Agarwal | University of Wisconsin-Milwaukee | USA | 10 | 10 | yes |
| 90 | Xinglong Wang | National Centre for Text Mining | UK | 5 | 5 | - |
| 92 | Keith Noto | Tufts University | USA | 1 | 0 | yes |
| 100 | Zhiyong Lu | NCBI | USA | 4 | 5 | - |
| 104 | Jean-Fred Fontaine | Max Delbrück Center | Germany | 5 | 0 | yes |
Overview of teams that participated in the PPI tasks and availability of the resulting systems. The numbers in the ACT and IMT columns correspond to the number of runs submitted by each team.
Figure 2: ACT manual classification time per class. (A) Box plot of the manual classification time distribution. (B) ACT development and test set annotation time histogram for negative (non-PPI) abstracts and (C) for positive (PPI-relevant) abstracts.
Figure 3: ACT manual classification time per curator. Box plot of the manual classification time spent by each individual curator. The labels correspond to: four expert curators, ordered by experience (lowest = IA, highest = ID), a CNIO annotator (CO), a BioGRID curator (BG), and four MINT curators (MA-MD).
ACT participant results
| Team | Run/Srvr | Accuracy | Specificity | Sensitivity | F-Score | MCC | AUC iP/R | Time_half |
|---|---|---|---|---|---|---|---|---|
| T65 | RUN_1 | 88.68 | 97.64 | 38.57 | 50.83 | 0.48297 | 63.85 | 40.19 |
| T65 | RUN_2 | 87.93 | 93.07 | 59.23 | 59.82 | 0.52727 | 63.89 | 40.19 |
| T65 | RUN_3 | 67.05 | 64.19 | 83.08 | 43.34 | 0.34244 | 41.74 | 55.95 |
| T65 | RUN_4 | 73.68 | 74.13 | 71.21 | 45.08 | 0.34650 | 41.74 | 55.95 |
| T65 | RUN_5 | 88.00 | 94.40 | 52.20 | 56.89 | 0.50255 | 62.39 | 40.83 |
| T70 | RUN_1 | 56.45 | 49.70 | 94.18 | 39.62 | 0.31789 | 56.76 | 42.12 |
| T70 | RUN_2 | 87.41 | 96.11 | 38.79 | 48.32 | 0.43346 | 56.76 | 42.13 |
| T70 | RUN_3 | 81.92 | 83.61 | 72.53 | 54.91 | 0.46563 | 56.76 | 42.12 |
| T70 | RUN_4 | 47.77 | 39.04 | | 35.95 | 0.27060 | 56.76 | 42.12 |
| T70 | RUN_5 | 86.84 | 98.62 | 20.99 | 32.62 | 0.34488 | 56.76 | 42.13 |
| T73 | RUN_1 | 87.55 | 91.81 | 63.74 | 60.83 | 0.53524 | 65.91 | 38.33 |
| T73 | RUN_2 | | 94.95 | 56.70 | 61.32 | | 67.96 | |
| T73 | RUN_3 | 87.78 | 92.61 | 60.77 | 60.14 | 0.52932 | 65.89 | 38.19 |
| T73 | RUN_4 | 88.88 | 94.34 | 58.35 | | 0.55054 | | 37.15 |
| T73 | RUN_5 | 87.62 | 92.18 | 62.09 | 60.33 | 0.53031 | 65.37 | 38.40 |
| T81 | RUN_1 | 59.03 | 58.76 | 60.55 | 30.96 | 0.13949 | 19.93 | 82.27 |
| T81 | RUN_2 | 58.47 | 57.86 | 61.87 | 31.12 | 0.14219 | 19.69 | 82.76 |
| T81 | RUN_3 | 25.37 | 14.72 | 84.95 | 25.66 | -0.00344 | 15.66 | 102.73 |
| T81 | RUN_4 | 63.45 | 69.16 | 31.54 | 20.74 | 0.00538 | 16.20 | 104.95 |
| T81 | RUN_5 | 69.17 | 77.35 | 23.41 | 18.72 | 0.00645 | 15.63 | 98.72 |
| T81 | SRVR_9 | 84.88 | | 0.44 | 0.88 | 0.05220 | 44.19 | 50.11 |
| T81 | SRVR_10 | 85.38 | 99.61 | 5.82 | 10.78 | 0.17771 | 50.25 | 45.11 |
| T81 | SRVR_11 | 84.73 | 99.86 | 0.11 | 0.22 | -0.00272 | 46.02 | 48.23 |
| T81 | SRVR_12 | 84.30 | 98.86 | 2.86 | 5.23 | 0.05244 | 32.11 | 56.89 |
| T81 | SRVR_13 | 84.88 | 99.92 | 0.77 | 1.52 | 0.05791 | 18.59 | 113.11 |
| T88 | RUN_1 | 42.63 | 35.11 | 84.73 | 30.94 | 0.15238 | 21.97 | 84.90 |
| T88 | RUN_2 | 56.92 | 53.73 | 74.73 | 34.47 | 0.20417 | 26.04 | 75.33 |
| T89 | RUN_1 | 80.02 | 80.90 | 75.06 | 53.26 | 0.44911 | 61.29 | 41.31 |
| T89 | RUN_2 | 81.00 | 81.75 | 76.81 | 55.08 | 0.47242 | 62.13 | 40.99 |
| T89 | RUN_3 | 82.40 | 83.85 | 74.29 | 56.15 | 0.48180 | 60.48 | 41.72 |
| T89 | RUN_4 | 87.73 | 94.79 | 48.24 | 54.40 | 0.47967 | 43.76 | 43.09 |
| T89 | RUN_5 | 87.27 | 91.81 | 61.87 | 59.58 | 0.52082 | 48.47 | 44.57 |
| T89 | SRVR_4 | 77.80 | 77.84 | 77.58 | 51.46 | 0.43152 | 57.44 | 44.63 |
| T89 | SRVR_5 | 78.05 | 78.15 | 77.47 | 51.71 | 0.43424 | 57.56 | 45.20 |
| T89 | SRVR_6 | 79.90 | 81.00 | 73.74 | 52.67 | 0.44073 | 54.97 | 45.93 |
| T89 | SRVR_7 | 86.25 | 92.06 | 53.74 | 54.24 | 0.46156 | 41.58 | 45.94 |
| T89 | SRVR_8 | 86.87 | 90.39 | 67.14 | 60.80 | 0.53336 | 47.40 | 45.55 |
| T90 | RUN_1 | 88.73 | 95.15 | 52.86 | 58.73 | 0.52736 | 51.14 | 39.02 |
| T90 | RUN_2 | 88.70 | 94.97 | 53.63 | 59.01 | 0.52890 | 51.65 | 39.14 |
| T90 | RUN_3 | 88.32 | 93.93 | 56.92 | 59.64 | 0.52914 | 65.24 | 39.29 |
| T90 | RUN_4 | 88.93 | 96.03 | 49.23 | 57.44 | 0.52237 | 49.26 | 70.68 |
| T90 | RUN_5 | 88.60 | 95.05 | 52.53 | 58.29 | 0.52204 | 50.83 | 39.27 |
| T92 | RUN_1 | 86.22 | 90.77 | 60.77 | 57.22 | 0.49155 | 50.99 | 42.40 |
| T100 | RUN_1 | 88.77 | 96.82 | 43.74 | 54.15 | 0.50005 | 61.62 | 42.57 |
| T100 | RUN_2 | 88.27 | 93.89 | 56.81 | 59.49 | 0.52732 | 61.86 | 39.05 |
| T100 | RUN_3 | 81.13 | 82.69 | 72.42 | 53.80 | 0.45256 | 60.25 | 41.60 |
| T100 | RUN_4 | 81.85 | 82.85 | 76.26 | 56.04 | 0.48270 | 63.75 | 38.41 |
| T104 | RUN_1 | 80.12 | 80.69 | 76.92 | 53.99 | 0.45999 | 53.67 | 48.21 |
| T104 | RUN_2 | 80.07 | 80.47 | 77.80 | 54.21 | 0.46370 | 53.67 | 48.21 |
| T104 | RUN_3 | 64.93 | 59.86 | 93.30 | 44.66 | 0.38161 | 53.67 | 48.21 |
| T104 | RUN_4 | 69.78 | 66.25 | 89.56 | 47.34 | 0.40530 | 53.67 | 48.21 |
| T104 | RUN_5 | 86.27 | 98.47 | 18.02 | 28.47 | 0.30064 | 53.67 | 142.95 |
Evaluation results based on the unrefined gold standard, in terms of Accuracy, MCC and AUC iP/R. The highest score in each evaluation column is shown in bold typeface. Run/Srvr: RUN = offline run, SRVR = online run via the BCMS; MCC: Matthews correlation coefficient; AUC iP/R: area under the interpolated precision/recall curve. Time_half is the fraction of time needed to classify half of the positive abstracts using the output of that run, compared to unranked results. Note that some runs mistakenly submitted the opposite of the requested ranking for the negative records, which explains their higher classification times (e.g., Team 104, RUN 5, with a Time_half of 142.95); inverting the order of the negative articles in these cases yielded time savings comparable to those of the other systems.
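For reference, the classification metrics reported in the table above can be computed from a run's confusion-matrix counts. A minimal sketch (the function name and example counts are illustrative, not taken from any submitted run or from the official evaluation code):

```python
import math

def act_metrics(tp, fp, tn, fn):
    """Binary classification metrics as reported in the ACT evaluation.
    Returns percentages, except MCC, which ranges from -1 to 1."""
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    specificity = 100.0 * tn / (tn + fp)
    sensitivity = 100.0 * tp / (tp + fn)   # recall on the PPI-relevant class
    precision = 100.0 * tp / (tp + fp)
    f_score = 2.0 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient, robust to the ~15/85 class imbalance
    mcc = (tp * tn - fp * fn) / math.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, specificity, sensitivity, f_score, mcc
```

MCC is informative here precisely because the test set is imbalanced (910 positive vs. 5,090 negative abstracts): a run that labels everything non-PPI still reaches ~85% accuracy but an MCC near 0.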
Figure 4: ACT consensus analysis. (A) The two black lines represent the number of relevant articles found while traversing the data set. The diagonal line represents a random traversal; the parabolic line above it represents a traversal following the ranking proposed by the consensus predictions. The green lines are bootstrap estimates of the standard deviation. The horizontal red line marks half of the 910 relevant articles. (B) The same traversal measured in reading time rather than articles read. The diagonal line represents a random traversal; the parabolic line above it represents a traversal following the ranking proposed by the consensus predictions. The green lines represent traversals using the scores provided by the systems. Note that some runs that fall below the random traversal appear to have ranked the negative documents in the opposite order to that required by the submission format. TC: team consensus prediction.
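The time-based traversal shown in panel (B) is also what underlies the Time_half measure: reading time is accumulated while walking down a system's ranking until half of the positive abstracts have been encountered. A minimal sketch under assumed inputs (the tuple format and function name are hypothetical, not the official evaluation code):

```python
def time_to_half_positives(ranked, total_positives):
    """Cumulative reading time, following the given ranking, until half
    of all positive (PPI-relevant) abstracts have been encountered.
    `ranked` is a list of (is_positive, reading_time_seconds) tuples,
    ordered by the system's confidence scores."""
    target = total_positives / 2.0
    found, elapsed = 0, 0.0
    for is_positive, seconds in ranked:
        elapsed += seconds
        if is_positive:
            found += 1
            if found >= target:
                return elapsed
    return elapsed  # ranking exhausted before half were found
```

Time_half then expresses this time relative to an unranked traversal; a run that mistakenly inverts the ranking of negative documents pushes the positives toward the end of the list and inflates the measure, as seen for some runs in the ACT results table.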
IMT macro-averaged participant results
| Team | Run/Srvr | Docs | Precision | Recall | F-Score | AUC iP/R |
|---|---|---|---|---|---|---|
| T65 | RUN_1 | | 9.35 | 83.21 | 16.32 | 0.47884 |
| T65 | RUN_2 | | 2.45 | | 4.75 | 0.44034 |
| T65 | RUN_3 | | 9.99 | 79.38 | 17.16 | 0.47650 |
| T65 | RUN_4 | | 33.48 | 42.88 | 35.40 | 0.30927 |
| T65 | RUN_5 | | 2.44 | | 4.74 | 0.50111 |
| T69 | RUN_1 | 214 | 54.87 | 57.91 | 52.39 | 0.52112 |
| T69 | RUN_2 | 211 | 57.01 | 57.35 | 53.42 | 0.51844 |
| T69 | RUN_3 | 203 | 60.24 | 56.41 | 54.45 | 0.51470 |
| T69 | RUN_4 | 199 | 62.46 | 55.17 | | 0.51013 |
| T69 | RUN_5 | 190 | 64.24 | 52.44 | 54.35 | 0.49390 |
| T70 | RUN_1 | 143 | 51.78 | 35.01 | 37.84 | 0.31402 |
| T70 | RUN_2 | 72 | 71.76 | 36.81 | 45.61 | 0.36215 |
| T70 | RUN_3 | 30 | | 41.50 | 51.51 | 0.41500 |
| T70 | RUN_4 | 205 | 31.65 | 38.72 | 31.75 | 0.32295 |
| T70 | RUN_5 | 159 | 36.36 | 21.26 | 24.75 | 0.18976 |
| T81 | RUN_1 | | 4.44 | 63.91 | 8.19 | 0.22022 |
| T81 | RUN_2 | 221 | 9.39 | 41.92 | 14.12 | 0.19766 |
| T81 | RUN_3 | | 13.51 | 28.35 | 17.41 | 0.17010 |
| T81 | RUN_4 | | 13.21 | 29.57 | 17.34 | 0.20388 |
| T81 | RUN_5 | 209 | 21.93 | 24.64 | 21.34 | 0.18733 |
| T88 | RUN_1 | 219 | 29.10 | 45.04 | 33.60 | 0.38590 |
| T88 | RUN_2 | 220 | 28.67 | 45.53 | 33.35 | 0.38373 |
| T89 | RUN_1 | 200 | 54.78 | 53.37 | 50.91 | 0.46061 |
| T89 | RUN_2 | 200 | 54.95 | 53.23 | 50.76 | 0.46423 |
| T89 | RUN_3 | 201 | 54.05 | 53.25 | 50.23 | 0.45330 |
| T89 | RUN_4 | 199 | 54.48 | 54.18 | 51.25 | 0.47211 |
| T89 | RUN_5 | 201 | 55.30 | 56.12 | 52.38 | 0.47807 |
| T89 | SRVR_4 | 200 | 55.33 | 55.61 | 52.11 | 0.47636 |
| T89 | SRVR_5 | 199 | 54.09 | 54.00 | 50.96 | 0.47650 |
| T89 | SRVR_6 | 201 | 55.14 | 56.12 | 52.35 | 0.48047 |
| T89 | SRVR_7 | 203 | 50.46 | 55.66 | 50.06 | 0.47392 |
| T89 | SRVR_8 | 199 | 54.04 | 54.05 | 50.84 | 0.47534 |
| T90 | RUN_1 | 200 | 56.11 | 51.59 | 50.72 | 0.44687 |
| T90 | RUN_2 | 203 | 56.37 | 53.19 | 51.20 | 0.47159 |
| T90 | RUN_3 | 217 | 55.29 | 59.90 | 54.62 | |
| T90 | RUN_4 | 177 | 63.98 | 46.89 | 51.36 | 0.44118 |
| T90 | RUN_5 | 164 | 66.26 | 46.78 | 52.02 | 0.44458 |
| T100 | RUN_1 | 213 | 47.26 | 54.97 | 47.06 | 0.43312 |
| T100 | RUN_2 | | 41.19 | 54.61 | 44.18 | 0.43238 |
| T100 | RUN_3 | | 35.29 | 45.53 | 37.50 | 0.32459 |
| T100 | RUN_4 | | 35.29 | 45.53 | 37.50 | 0.32459 |
| T100 | RUN_5 | 125 | 56.40 | 30.65 | 37.01 | 0.29387 |
Macro-averaged results when evaluating only documents for which the system reported results (i.e., measuring the average per-document performance only on the documents for which each run produced annotations). The highest score in each evaluation column is shown in bold typeface, the lowest in italics. Run/Srvr: RUN = offline run, SRVR = online server run via BCMS; Docs: number of documents annotated; AUC iP/R: area under the interpolated precision/recall curve. Base Top4: baseline system that assigns the four most frequent classes, ordered by their frequencies in the training/development set. Base Regex: simple matching strategy based on regular expressions.
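The macro-averaged figures can be read as: score each annotated document separately, then average those per-document scores. A sketch under assumed inputs (hypothetical per-document (tp, fp, fn) counts against the gold-standard PSI-MI annotations; not the official evaluation code):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_average(per_doc_counts):
    """Average the per-document P/R/F scores over the documents a run
    annotated, mirroring the IMT macro-averaged evaluation."""
    scores = [prf(tp, fp, fn) for tp, fp, fn in per_doc_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))
```

Because only annotated documents enter the average, a run that annotates few documents but annotates them well can look stronger here than in the micro-averaged table below.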
IMT micro-averaged participant results
| Team | Run/Srvr | Precision | Recall | F-Score | MCC | AUC iP/R |
|---|---|---|---|---|---|---|
| T65 | RUN_1 | 8.77 | 84.82 | 15.89 | 0.23552 | 0.27588 |
| T65 | RUN_2 | 2.45 | | 4.78 | 0.06259 | 0.24484 |
| T65 | RUN_3 | 9.42 | 81.78 | 16.89 | 0.24172 | 0.27727 |
| T65 | RUN_4 | 33.48 | 42.32 | 37.39 | 0.36166 | 0.14169 |
| T65 | RUN_5 | 2.44 | | 4.76 | 0.06193 | 0.29016 |
| T69 | RUN_1 | 52.07 | 55.03 | 53.51 | 0.52519 | 0.34302 |
| T69 | RUN_2 | 54.34 | 53.51 | 53.92 | 0.52958 | 0.33824 |
| T69 | RUN_3 | 57.36 | 50.29 | 53.59 | 0.52796 | 0.32539 |
| T69 | RUN_4 | 59.25 | 48.01 | 53.04 | 0.52456 | 0.31711 |
| T69 | RUN_5 | 61.33 | 43.64 | 51.00 | 0.50896 | 0.29373 |
| T70 | RUN_1 | 48.61 | 23.15 | 31.36 | 0.32617 | 0.12949 |
| T70 | RUN_2 | 70.00 | 11.95 | 20.42 | 0.28419 | 0.08731 |
| T70 | RUN_3 | | 4.74 | 8.96 | 0.19270 | 0.03826 |
| T70 | RUN_4 | 31.22 | 36.43 | 33.63 | 0.32216 | 0.15688 |
| T70 | RUN_5 | 32.69 | 15.94 | 21.43 | 0.21717 | 0.05734 |
| T81 | RUN_1 | 4.54 | 66.03 | 8.50 | 0.11406 | 0.07716 |
| T81 | RUN_2 | 8.71 | 42.13 | 14.43 | 0.15560 | 0.06239 |
| T81 | RUN_3 | 13.51 | 28.46 | 18.33 | 0.17168 | 0.04657 |
| T81 | RUN_4 | 13.20 | 27.70 | 17.88 | 0.16667 | 0.05601 |
| T81 | RUN_5 | 21.35 | 22.20 | 21.77 | 0.20090 | 0.05283 |
| T88 | RUN_1 | 28.44 | 45.16 | 34.90 | 0.34146 | 0.20244 |
| T88 | RUN_2 | 28.17 | 45.92 | 34.92 | 0.34263 | 0.20069 |
| T89 | RUN_1 | 52.52 | 49.53 | 50.98 | 0.49997 | 0.28202 |
| T89 | RUN_2 | 52.02 | 48.96 | 50.44 | 0.49451 | 0.28589 |
| T89 | RUN_3 | 50.78 | 49.34 | 50.05 | 0.49016 | 0.27238 |
| T89 | RUN_4 | 52.50 | 49.91 | 51.17 | 0.50181 | 0.29220 |
| T89 | RUN_5 | 52.58 | 52.18 | 52.38 | 0.51382 | 0.29980 |
| T89 | SRVR_4 | 52.71 | 51.61 | 52.16 | 0.51163 | 0.29926 |
| T89 | SRVR_5 | 52.28 | 50.10 | 51.16 | 0.50168 | 0.30046 |
| T89 | SRVR_6 | 52.28 | 52.18 | 52.23 | 0.51226 | 0.30049 |
| T89 | SRVR_7 | 49.55 | 52.56 | 51.01 | 0.49972 | 0.29303 |
| T89 | SRVR_8 | 51.76 | 50.29 | 51.01 | 0.49999 | 0.29766 |
| T90 | RUN_1 | 53.33 | 47.06 | 50.00 | 0.49113 | 0.26805 |
| T90 | RUN_2 | 52.56 | 48.77 | 50.59 | 0.49625 | 0.28386 |
| T90 | RUN_3 | 52.30 | 58.25 | |||
| T90 | RUN_4 | 61.09 | 38.14 | 46.96 | 0.47436 | 0.25209 |
| T90 | RUN_5 | 64.24 | 35.10 | 45.40 | 0.46707 | 0.24270 |
| T100 | RUN_1 | 44.59 | 51.61 | 47.85 | 0.46794 | 0.26055 |
| T100 | RUN_2 | 39.86 | 54.84 | 46.17 | 0.45448 | 0.26982 |
| T100 | RUN_3 | 35.29 | 44.59 | 39.40 | 0.38240 | 0.15734 |
| T100 | RUN_4 | 35.34 | 44.59 | 39.43 | 0.38271 | 0.15758 |
| T100 | RUN_5 | 54.86 | 18.22 | 27.35 | 0.30847 | 0.11109 |
Micro-averaged results when evaluating all documents (i.e., measuring the overall performance of each run on the whole document set). The highest score in each evaluation column is shown in bold typeface, the lowest in italics. Run/Srvr: RUN = offline run, SRVR = online server run via BCMS; MCC: Matthews correlation coefficient; AUC iP/R: area under the interpolated precision/recall curve (micro-averaged by iterating over the precision/recall values of the highest-ranked annotation of all articles, then all second-ranked annotations, etc.). Base_Top4: baseline system that assigns the four most frequent classes, ordered by their frequencies in the training/development set. Base_Regex: simple matching strategy based on regular expressions.
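Micro-averaging, in contrast, pools the counts over all documents before computing a single score, so documents with many annotations weigh more heavily and unannotated documents count against recall. A sketch under assumed per-document (tp, fp, fn) counts (not the official evaluation code):

```python
def micro_average(per_doc_counts):
    """Pool TP/FP/FN counts over all documents first, then compute one
    overall precision/recall/F-score (micro-averaging)."""
    tp = sum(c[0] for c in per_doc_counts)
    fp = sum(c[1] for c in per_doc_counts)
    fn = sum(c[2] for c in per_doc_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

This weighting explains why runs that annotated only a subset of the 223 test documents (see the Docs column of the macro-averaged table) tend to drop in recall here.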
Figure 5: IMT predictions for relevant method terms. Average F-score (blue) across all runs for test set predictions of PSI-MI interaction detection method terms with at least 5 annotations, together with the best F-score (red) obtained by an individual run.
Overview of tools and resources. Collection of external tools and resources used for the PPI tasks by participating teams.
| Name | Type | Summary |
|---|---|---|
| MALLET | ML | Framework for feature extraction, logistic regression models and inference |
| SVMPerf | ML | Support Vector Machine software for optimizing multivariate performance measures |
| Weka | ML | Collection of machine learning algorithms for data mining, useful for feature selection |
| LIBSVM | ML | Software for support vector classification |
| Matlab | ML | Data analysis and numeric computation software |
| Liblinear | ML | Linear classifier software |
| MEGAM | ML | Maximum entropy model implementation |
| C&C CCG parser | NLP | Parser and taggers written in C++ |
| TreeTagger | NLP | Part-of-speech tagger (trained on the Penn Treebank) |
| SNOWBALL | NLP | Stemming program |
| NooJ | NLP | Corpus processing and dictionary matching |
| Lucene | NLP | Full-featured text search engine library |
| LingPipe | NLP | Toolkit for processing text using computational linguistics |
| PSI-MI | Lexical | Molecular Interaction ontology used by PPI databases |
| UMLS | Lexical | Unified Medical Language System, a large vocabulary database of biomedical and health-related concepts |
| MeSH | Lexical | Vocabulary thesaurus used for indexing PubMed |
| ChEBI | Lexical | Chemical Entities of Biological Interest |
| BioLexicon | Lexical | Terminological resource integrating data from various bioinformatics collections |
| Stop words | Lexical | Collection of words filtered out prior to natural language processing |
| NLProt | BioNLP | SVM-based tool for recognizing protein names in text |
| OSCAR3 | BioNLP | Tool for recognizing chemical name mentions in text |
| ABNER | BioNLP | Biomedical named entity recognition (proteins, genes, DNA, etc.) |