George Papadatos, Gerard JP van Westen, Samuel Croset, Rita Santos, Simone Trubian, John P Overington.
Abstract
BACKGROUND: The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that successfully distinguishes between publications that are 'ChEMBL-like' (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining.
Keywords: Curation; Document classification; Machine learning; Triage
Year: 2014 PMID: 25221627 PMCID: PMC4158272 DOI: 10.1186/s13321-014-0040-8
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Table 1. BoW and n-grams example for two document titles
| Source | ChEMBL | MEDLINE |
|---|---|---|
| PubMed ID | 17994679 | 17886339 |
| Title | Discovery of biaryl anthranilides as full agonists for the high affinity niacin receptor. | Automatic prediction of protein interactions with large scale motion. |
| BoW | Discover, biaryl, anthranilid, full, agonist, high, affin, niacin, receptor | Automat, predict, protein, interact, large, scale, motion |
| 2-grams | Discovery_of, full_agonists, high_affinity, niacin_receptor, … | Automatic_prediction, protein_interaction, large_scale, … |
| 3-grams | Discovery_of_biaryl, high_affinity_niacin, affinity_niacin_receptor, … | Automatic_prediction_of, protein_interaction_with, large_scale_motion, … |
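The BoW and n-gram rows above can be approximated with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the stop-word list and the function names (`tokenize`, `bag_of_words`, `ngrams`) are assumptions made for illustration, and stemming is left out, so the output keeps full words such as "automatic" rather than the stem "automat".

```python
import re

# Small illustrative stop-word list (an assumption; the list used in the paper is not given here).
STOP_WORDS = {"of", "as", "for", "the", "with", "and", "in", "to", "a"}


def tokenize(title):
    """Split a title into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())


def bag_of_words(title):
    """Return unique non-stop-word tokens in order of first appearance (no stemming)."""
    seen = []
    for token in tokenize(title):
        if token not in STOP_WORDS and token not in seen:
            seen.append(token)
    return seen


def ngrams(title, n):
    """Return all runs of n consecutive tokens, joined with underscores."""
    tokens = tokenize(title)
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title = "Automatic prediction of protein interactions with large scale motion."
print(bag_of_words(title))
print(ngrams(title, 2))
print(ngrams(title, 3))
```

Stop words are dropped only from the bag of words; the n-grams are built from the full token sequence, which matches entries such as `Discovery_of` in the table.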
Table 2. A document vector example from the titles of the documents in Table 1
| PubMed ID | Discover | Biaryl | Niacin | Receptor | Automat | Predict | Large | … |
|---|---|---|---|---|---|---|---|---|
| 17994679 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | … |
| 17886339 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | … |
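A document vector of this kind is a binary presence/absence encoding over a shared vocabulary. The sketch below rebuilds the two rows from the stemmed terms listed for the two titles; the helper names `build_vocabulary` and `vectorize` are hypothetical, and whether the authors used binary presence or term counts is not stated here, so binary encoding is an assumption.

```python
# Stemmed terms for the two example titles, keyed by PubMed ID.
docs = {
    "17994679": ["discover", "biaryl", "anthranilid", "full", "agonist",
                 "high", "affin", "niacin", "receptor"],
    "17886339": ["automat", "predict", "protein", "interact",
                 "large", "scale", "motion"],
}


def build_vocabulary(documents):
    """Collect every distinct term, in order of first appearance."""
    vocabulary = []
    for terms in documents.values():
        for term in terms:
            if term not in vocabulary:
                vocabulary.append(term)
    return vocabulary


def vectorize(terms, vocabulary):
    """Binary vector: 1 if the vocabulary term occurs in the document, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


vocabulary = build_vocabulary(docs)
for pmid, terms in docs.items():
    print(pmid, vectorize(terms, vocabulary))
```

With a realistic corpus the vocabulary grows to many thousands of terms and each vector is mostly zeros, which is why sparse representations are typically used in practice.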
Figure 1. Document processing and classification workflow. Abbreviations: NB - Naive Bayesian, RF - Random Forest.
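To make the classification step of such a workflow concrete, here is a toy Bernoulli naive Bayes classifier with Laplace smoothing in plain Python. It is a sketch under stated assumptions, not the model used in the paper: the class `TinyNaiveBayes` is hypothetical, the four training documents are invented for illustration (two reuse stems from the example titles), and the commercial and KNIME components named in the figure captions may differ in their details.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Bernoulli naive Bayes over term presence, with add-one smoothing."""

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.term_counts = {c: Counter() for c in self.classes}
        self.vocabulary = set()
        for terms, label in zip(documents, labels):
            unique_terms = set(terms)
            self.term_counts[label].update(unique_terms)  # documents containing each term
            self.vocabulary |= unique_terms
        return self

    def log_score(self, terms, label):
        """Log prior plus log likelihood of each vocabulary term being present or absent."""
        n_docs = self.doc_counts[label]
        score = math.log(n_docs / sum(self.doc_counts.values()))
        present = set(terms)
        for term in self.vocabulary:
            p = (self.term_counts[label][term] + 1) / (n_docs + 2)
            score += math.log(p if term in present else 1.0 - p)
        return score

    def predict(self, terms):
        return max(self.classes, key=lambda c: self.log_score(terms, c))


# Invented toy training set (hypothetical, for illustration only).
train_docs = [
    ["discover", "biaryl", "agonist", "niacin", "receptor"],
    ["synthesi", "inhibitor", "potent", "receptor"],
    ["automat", "predict", "protein", "interact", "motion"],
    ["clinic", "trial", "patient", "outcom"],
]
train_labels = ["ChEMBL", "ChEMBL", "MEDLINE", "MEDLINE"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["potent", "agonist", "receptor"]))
print(model.predict(["protein", "predict"]))
```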
Table 3. Summary of classification validation statistics across different methods and validation sets
| Method/validation set | AUC | MCC | Sensitivity | Specificity |
|---|---|---|---|---|
| NB EV | 0.98 | 0.88 | 0.90 | 0.97 |
| NB n-grams EV | 1.00 | 0.91 | 0.95 | 0.96 |
| NB ChEMBL_17 | 0.96 | 0.90 | 0.92 | 0.98 |
| NB BindingDB | 0.97 | 0.79 | 0.80 | 0.97 |
| RF EV | 0.99 | 0.92 | 0.95 | 0.97 |
| RF CV Out-of-Bag | 0.99 | 0.92 | 0.94 | 0.97 |
Abbreviations: AUC - Area Under the Curve, CV - cross validation, EV - external validation, MCC - Matthews Correlation Coefficient, NB - Naive Bayesian, RF - Random Forest.
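Sensitivity, specificity and MCC in the table all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function names are chosen for illustration and the counts are hypothetical, not taken from the paper.

```python
import math


def sensitivity(tp, fn):
    """True positive rate: fraction of actual positives that were recovered."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of actual negatives that were rejected."""
    return tn / (tn + fp)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC ranges from -1 to 1; 0 is returned when any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical confusion-matrix counts for a 200-document test set.
tp, fn, tn, fp = 90, 10, 97, 3
print(round(sensitivity(tp, fn), 2))
print(round(specificity(tn, fp), 2))
print(round(matthews_corrcoef(tp, tn, fp, fn), 2))
```

Unlike sensitivity and specificity, MCC combines all four cells into one number, which makes it a useful single summary when the two classes differ in size.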
Figure 2. Receiver operator characteristic curve and external validation performance (Pipeline Pilot model). The ROC curve generated by a Bayesian classifier ('Learn Good From Bad' component) in the 80% - 20% stratified partition validation is shown in (A). The performance of this classifier in the test set is shown in (B). Abbreviations: Matthews Correlation Coefficient – MCC, Receiver Operator Characteristic – ROC.
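An 80% - 20% stratified partition keeps the class proportions the same in the training and test sets. The following is a minimal sketch of one way to do this with the standard library; the function name `stratified_split`, the seed and the class sizes are assumptions for illustration, and the Pipeline Pilot component may partition differently.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_fraction=0.2, seed=0):
    """Split (item, label) pairs so each class contributes test_fraction of its members to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    train, test = [], []
    for label, members in by_class.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend((item, label) for item in shuffled[:n_test])
        train.extend((item, label) for item in shuffled[n_test:])
    return train, test


# Hypothetical corpus: 40 documents in one class and 60 in the other.
items = list(range(100))
labels = ["ChEMBL"] * 40 + ["MEDLINE"] * 60
train, test = stratified_split(items, labels)
print(len(train), len(test))
```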
Figure 3. ROC curve and external validation performance (KNIME model). The ROC curves were generated by a Random Forest model ('Tree Ensemble Learner' node). Plot A shows the ROC curve for the out-of-bag classification. Plot B shows the ROC curve for the 80% - 20% stratified cross validation.
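The area under a ROC curve equals the probability that a randomly chosen positive document receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise definition; the function name `roc_auc`, the scores and the labels are hypothetical, and the quadratic loop is only suitable for small examples.

```python
def roc_auc(scores, labels, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == positive]
    negatives = [s for s, y in zip(scores, labels) if y != positive]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores; label 1 marks the positive class.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
```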
Figure 4. Word cloud visualization of feature importance according to the NB model (A) and RF model (B). More important terms are depicted in larger and bolder type. Blue coloured terms are correlated with ChEMBL whereas orange ones are correlated with the MEDLINE class.
Figure 5. Word cloud visualization of the ChEMBL and MEDLINE data sets. (A) Words most frequent in the ChEMBL corpus (more frequent words are depicted larger). A large emphasis on chemistry related terms is apparent. (B) Word cloud visualization of the words most frequent in our MEDLINE background set. Here an emphasis on clinical data can be observed.
Figure 6. Complementarity to current literature in ChEMBL. Several medicinal chemistry journals are routinely covered in ChEMBL (A). The ChEMBL-likeness classifier is able to retrieve relevant papers from journals that are not routinely covered (B).
Figure 7. The @MalariaSARLit twitter bot. Schematic overview of the pipeline, controlled by an automated Python script (A). Examples of daily tweets with alerts for recent medicinal chemistry anti-malarial publications (B). The latter are automatically prioritized using the NB document classification model.
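The prioritization step of such an alerting pipeline can be sketched as: score each new publication with the classifier, keep those above a cut-off, and format the top hits as short messages. Everything below is an assumption made for illustration, including the `Paper` type, the threshold, the length limit, the link format and the example records; it does not reproduce the authors' script and posts nothing to any service.

```python
from typing import List, NamedTuple


class Paper(NamedTuple):
    pmid: str
    title: str
    score: float  # classifier score, higher means more 'ChEMBL-like'


MAX_MESSAGE_LENGTH = 280  # assumed limit for one alert message


def prioritize(papers: List[Paper], threshold: float = 0.5, limit: int = 3) -> List[Paper]:
    """Keep papers scoring at or above the threshold, best first, at most `limit` of them."""
    kept = [p for p in papers if p.score >= threshold]
    return sorted(kept, key=lambda p: p.score, reverse=True)[:limit]


def format_alert(paper: Paper) -> str:
    """Build 'title link', truncating the title so the message fits the length limit."""
    link = f"https://pubmed.ncbi.nlm.nih.gov/{paper.pmid}/"
    room = MAX_MESSAGE_LENGTH - len(link) - 1
    title = paper.title if len(paper.title) <= room else paper.title[:room - 3] + "..."
    return f"{title} {link}"


# Hypothetical records with placeholder identifiers and invented scores.
candidates = [
    Paper("PMID_A", "Hypothetical antimalarial SAR study", 0.92),
    Paper("PMID_B", "Hypothetical epidemiology survey", 0.15),
    Paper("PMID_C", "Hypothetical lead optimisation report", 0.71),
]

for paper in prioritize(candidates):
    print(format_alert(paper))
```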