Maria-Theodora Pandi, Peter J van der Spek, Maria Koromina, George P Patrinos.
Abstract
Text mining in biomedical literature is an emerging field which has already been shown to have a variety of applications in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code used toward this end was implemented in the R programming language, either through custom scripts, where needed, or through functions from existing libraries. Articles (abstracts or full texts) corresponding to a specified query were extracted from PubMed, while concept annotations were derived from PubTator Central. Terms denoting a Mutation or a Gene, as well as Chemical terms corresponding to drug compounds, were normalized, and the sentences containing these terms were filtered and preprocessed to create appropriate training sets. Finally, after training and adequate hyperparameter tuning, four text classifiers were created and evaluated (FastText, linear-kernel SVM, XGBoost, and lasso and elastic-net regularized generalized linear models (glmnet)) with regard to their performance in identifying pharmacogenomics associations. Although further improvements are essential before this text-mining approach can be properly implemented in clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for the identification and assessment of research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining in biomedical literature, whose resolution could substantially contribute to the further development of this field.
Keywords: FastText; biomedical text classification; supervised learning; PubMed; PubTator; natural language processing; pharmacogenomics associations; text mining
Year: 2020 PMID: 33343371 PMCID: PMC7748107 DOI: 10.3389/fphar.2020.602030
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
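The retrieval step described in the abstract (PubMed articles annotated via PubTator Central) can be sketched as follows. This is a minimal Python illustration, not the study's own code (the study used R); it assumes PubTator Central's public BioC JSON export endpoint, shown here with this paper's own PMID.

```python
from urllib.parse import urlencode

# Minimal sketch of a PubTator Central export request (BioC JSON) for a set
# of PMIDs. The study performed the equivalent retrieval from R.
BASE = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"

def pubtator_url(pmids):
    """Build the export URL for a list of PubMed identifiers."""
    return BASE + "?" + urlencode({"pmids": ",".join(str(p) for p in pmids)})

print(pubtator_url([33343371]))
```

Fetching the resulting URL returns the article text together with its Chemical, Gene, Mutation, Species, and Strain annotations, i.e., the concept types counted in the table further below.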
Presentation of the default and selected hyperparameter values for the FastText algorithm.
| Hyperparameter | Default value | Values used for both 1-pair and n-pair sentences |
|---|---|---|
| Size of vector (dim) | 100 | 200 |
| Minimal number occurrences of a word (minCount) | 1 | 5 |
| Size of the context window (ws) | 5 | 2 |
| Learning rate (lr) | 0.1 | 0.1 |
| Number of epochs (epoch) | 5 | 50 |
| Maximum length of a word ngram (wordNgrams) | 1 | 2 |
| Loss function (loss) | Softmax | ns (negative sampling) |
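The selected values in the table map directly onto the training parameters of FastText's supervised mode. The dict below is a sketch of that mapping using the parameter names of the `fasttext` Python bindings (the study itself used the R `fastrtext` wrapper, and the training-file name in the comment is hypothetical):

```python
# Selected hyperparameters from the table, keyed by the names the fasttext
# Python bindings expect (the study used the R fastrtext wrapper instead).
params = dict(
    dim=200,       # size of word vectors (default 100)
    minCount=5,    # ignore words seen fewer than 5 times (default 1)
    ws=2,          # context window size (default 5)
    lr=0.1,        # learning rate (kept at the default)
    epoch=50,      # training epochs (default 5)
    wordNgrams=2,  # include word bigrams (default 1, unigrams only)
    loss="ns",     # negative sampling instead of the default softmax
)
# Training itself needs the fasttext package and a labelled corpus, e.g.:
# model = fasttext.train_supervised(input="train.txt", **params)
print(sorted(params))
```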
Presentation of the default and selected (after grid search) hyperparameter values for the SVM, XGBoost, and glmnet models.
| Model | Hyperparameters | Default values | 1-Pair | n-Pairs |
|---|---|---|---|---|
| Linear SVM | Cost (C) | 1 | 1 | 1 |
| XGBoost | Learning rate (eta) | 0.3 [0, 1] | 0.2 | 0.2 |
| | Maximum depth of a tree (max_depth) | 6 [0, ∞) | 4 | 6 |
| | Subsample ratio of the training instances (subsample) | 1 (0, 1] | 0.7 | 0.7 |
| | Number of boosting iterations (nrounds) | — | 50 | 50 |
| glmnet | Mixing percentage (alpha) | 1 [0, 1] | 0.1 | 1 (lasso penalty) |
| | Regularization parameter (lambda) | — | 0.02888342 | 0.02229455 |
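Grid search of the kind used for the table above can be sketched as an exhaustive loop over candidate values. The grid and the `cv_error` stand-in below are illustrative only (shaped so that the table's selected 1-pair XGBoost values win), not the study's actual search space or cross-validation code:

```python
from itertools import product

# Illustrative grid for the XGBoost hyperparameters tuned in the table.
grid = {
    "eta": [0.1, 0.2, 0.3],
    "max_depth": [4, 6, 8],
    "subsample": [0.5, 0.7, 1.0],
}

def cv_error(eta, max_depth, subsample):
    # Stand-in for a real 10-fold cross-validation run; shaped so the
    # selected 1-pair values (eta=0.2, max_depth=4, subsample=0.7) win.
    return abs(eta - 0.2) + abs(max_depth - 4) / 10 + abs(subsample - 0.7)

# Evaluate every combination and keep the one with the lowest error.
best = min(product(*grid.values()), key=lambda combo: cv_error(*combo))
print(dict(zip(grid, best)))  # {'eta': 0.2, 'max_depth': 4, 'subsample': 0.7}
```

In the study itself, the equivalent search over SVM, XGBoost, and glmnet candidates was driven by cross-validated performance rather than a closed-form error.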
Total number of papers that were 1) initially retrieved, 2) annotated by PubTator Central, and 3) filtered, based on the "pharmacogenomics-related" PubMed query described in the Methods section.
| Papers resulting from query | 11,302 unique PMIDs (3,165 with PMCID) |
| PTC-annotated papers | 5,307 unique PMIDs (2,257 with PMCID) |
| PTC annotations | Chemicals: 187,850 (5,580 unique); Genes: 230,159 (8,853 unique); Mutations: 63,855 (13,610 unique); Species: 115,520 (433 unique); Strains: 54 (9 unique) |
| Normalized terms | Genes: 5,463 remained; Chemicals: 805 remained; Mutations: 5,467 remained |
| Star alleles | 11,201 entries (mistaken for gene entries) |
| Sentences of interest | With 1 pair: 3,574; with multiple pairs: 1,987 (distinct sentences) |
PMIDs, PubMed identifiers; PMCIDs, PubMed Central identifiers.
The number of star alleles is "not unique," since some of them are present in multiple copies; it reflects the number of Gene mentions that were actually star alleles.
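The star-allele confusion noted in the table stems from their surface form: a gene symbol followed by an asterisk and a number (e.g., CYP2D6*4) looks like a Gene mention to the annotator. The pattern below is a plausible Python illustration of how such mentions could be detected, not the study's exact normalization rule:

```python
import re

# Heuristic detector for star-allele mentions of the form GENE*<number>,
# optionally with a trailing suffix letter (e.g., CYP2B6*6B). Illustrative
# only; the study's normalization step may have used different rules.
STAR_ALLELE = re.compile(r"\b[A-Z][A-Z0-9]+\*\d+[A-Za-z]?\b")

mentions = ["CYP2D6*4", "TPMT", "CYP2C19*17", "VKORC1"]
star = [m for m in mentions if STAR_ALLELE.match(m)]
print(star)  # ['CYP2D6*4', 'CYP2C19*17']
```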
FIGURE 2. Presentation of the performance metrics, as calculated after using 10-fold cross-validation with the training data, for all four models trained with sentences discussing one Variant-Chemical pair (1-pair sentences).
FIGURE 3. Performance metrics, as calculated after using 10-fold cross-validation with the training data, for all four models trained with sentences discussing multiple Variant-Chemical pairs (n-pair sentences). Since this is a multiclass classification task, the resulting metrics are presented by model and by class; finally, the by-class metrics for each model are weighted by the corresponding class prevalence and summed to calculate the overall performance metrics.
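The prevalence weighting described in the FIGURE 3 caption amounts to a weighted macro-average of the per-class metrics. A minimal sketch, with hypothetical per-class recalls and class sizes (not figures from the paper):

```python
# Weighted macro-average: each class's metric is weighted by its prevalence
# (class size / total), then the weighted values are summed.
def weighted_metric(per_class_values, class_counts):
    total = sum(class_counts)
    return sum(v * n / total for v, n in zip(per_class_values, class_counts))

# Hypothetical per-class recalls and class sizes for a 3-class task.
recalls = [0.80, 0.60, 0.90]
counts = [50, 30, 20]
print(round(weighted_metric(recalls, counts), 3))  # 0.76
```

The same weighting applies to each of the by-class metrics (precision, recall, and so on) to produce the overall values shown in the figure.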
Results from comparing the classifications produced by the four models trained with 1-pair sentences against a gold-standard dataset extracted from PharmGKB.
| Metric | XGBoost | SVM | glmnet | FastText |
|---|---|---|---|---|
| Filtered unseen sentences | | | | |
| Accuracy | 0.577 | 0.526 | 0.538 | 0.526 |
| Sensitivity/recall | 0.512 | 0.465 | 0.488 | 0.349 |
| Specificity | 0.657 | 0.6 | 0.6 | 0.743 |
| Precision/positive predictive value | 0.647 | 0.588 | 0.6 | 0.625 |
| Negative predictive value | 0.523 | 0.477 | 0.488 | 0.481 |
| Original unseen sentences | | | | |
| Accuracy | 0.538 | 0.529 | 0.577 | 0.577 |
| Sensitivity/recall | 0.415 | 0.358 | 0.434 | 0.264 |
| Specificity | 0.666 | 0.706 | 0.725 | 0.902 |
| Precision/positive predictive value | 0.564 | 0.559 | 0.622 | 0.737 |
| Negative predictive value | 0.523 | 0.514 | 0.552 | 0.541 |
TP, TN, FP, and FN were calculated by comparing the resulting classification of the unseen pairs with the pairs present in the gold standard, and the corresponding metrics were calculated as described in Methods. This table presents the metrics for the unseen 1-pair sentences with and without filtering based on the defined list of words that was used to create the training sentences.
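The five metrics reported per model follow the standard confusion-matrix definitions. The sketch below uses hypothetical counts chosen so that they approximately reproduce the XGBoost column for the filtered unseen sentences (TP=22, TN=23, FP=12, FN=21 is one consistent reconstruction, not a figure taken from the paper):

```python
# Standard confusion-matrix metrics, as used throughout the table above.
def metrics(tp, tn, fp, fn):
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# Hypothetical counts consistent with the XGBoost "filtered" column.
m = {k: round(v, 3) for k, v in metrics(tp=22, tn=23, fp=12, fn=21).items()}
print(m)  # {'accuracy': 0.577, 'sensitivity': 0.512, 'specificity': 0.657, 'precision': 0.647, 'npv': 0.523}
```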