Debleena Guin, Jyoti Rani, Priyanka Singh, Sandeep Grover, Shivangi Bora, Puneet Talwar, Muthusamy Karthikeyan, K Satyamoorthy, C Adithan, S Ramachandran, Luciano Saso, Yasha Hasija, Ritushree Kukreti.
Abstract
Understanding patients' genomic variations and their effect in protecting or predisposing them to drug response phenotypes is important for providing personalized healthcare. Several studies have manually curated such genotype-phenotype relationships into organized databases from clinical trial data or published literature. However, there are no text mining tools available to extract high-accuracy information from such existing knowledge. In this work, we used a semiautomated text mining approach to retrieve a complete pharmacogenomic (PGx) resource integrating disease-drug-gene-polymorphism relationships to derive a global perspective for ease in therapeutic approaches. We used an R package, pubmed.mineR, to automatically retrieve PGx-related literature. We identified 1,753 disease types and 666 drugs, associated with 4,132 genes and 33,942 polymorphisms, collated from 180,088 publications. With further manual curation, we obtained a total of 2,304 PGx relationships. We evaluated the performance of our approach (precision = 0.806) against benchmark datasets such as the Pharmacogenomic Knowledgebase (PharmGKB) (0.904), Online Mendelian Inheritance in Man (OMIM) (0.600), and the Comparative Toxicogenomics Database (CTD) (0.729). We validated our study by comparing our results with 362 US Food and Drug Administration (FDA)-approved drug labeling biomarkers in commercial use. Of the 2,304 PGx relationships identified, 127 belonged to the FDA list of 362 approved pharmacogenomic markers, indicating that our semiautomated text mining approach may reveal significant PGx information with markers for drug response prediction. In addition, it is a scalable and state-of-the-art approach to curation for PGx clinical utility.
Keywords: disease–drug–gene–mutation relationship; pharmacogenomic knowledgebase; pharmacogenomic markers; precision medicine; text mining
Year: 2019 PMID: 31447668 PMCID: PMC6692532 DOI: 10.3389/fphar.2019.00839
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
Figure 1. Overview of the proposed approach. The process of retrieving evidence-based sentences from PubMed abstracts using pubmed.mineR includes: (A) information retrieval, (B) entity recognition, (C) normalization, (D) validation, and (E) data integration and ranking. The final list of disease–drug–gene–polymorphism relationships is tabulated population-wise.
Performance comparison of pharmacogenomic (PGx) relationships obtained from our proposed pipeline with other benchmark datasets (OMIM, CTD, and PharmGKB).
| Context type | TP | TN | FP | FN | Sensitivity | Specificity | Efficacy | Precision | Recall | Accuracy | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our pipeline with “PharmGKB” | 1,509 | 208 | 254 | 78 | 82.6 | 86.9 | 88.2 | 0.904 | 0.930 | 0.923 | 89.1 |
| Our pipeline with “OMIM” | 2,225 | – | 79 | – | 78.0 | 77.5 | 81.8 | 0.600 | 0.681 | 0.764 | 59.3 |
| Our pipeline with “CTD” | 1,776 | 153 | 375 | – | 70.7 | 65.5 | 72.2 | 0.729 | 0.803 | 0.801 | 79.7 |
| Our pipeline with (“PharmGKB” AND “OMIM” AND “CTD”) | 1,875 | 102 | 275 | 75 | 82.3 | 84.4 | 93.3 | 0.896 | 0.852 | 0.828 | 94.7 |
PharmGKB corpus compared to that of our pipeline and the articles extracted in these datasets. The formulae used for calculating the accuracy of the proposed pipeline compared to the other datasets (ref. for detailed analysis).
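The formulae referred to in the note above are not reproduced in this record. As a minimal sketch, assuming the standard textbook definitions of the confusion-matrix metrics (which may differ from the formulae the authors used, so the outputs need not match the tabulated values), the quantities could be computed from TP, TN, FP, and FN as follows. The names `Confusion` and `metrics` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of true/false positives and negatives (hypothetical helper type)."""
    tp: int
    tn: int
    fp: int
    fn: int


def _ratio(num: float, den: float) -> float:
    # Return NaN instead of raising when a denominator is zero.
    return num / den if den else float("nan")


def metrics(c: Confusion) -> dict:
    """Standard definitions; ASSUMED, not taken from the paper."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)          # also called sensitivity
    specificity = _ratio(c.tn, c.tn + c.fp)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Counts from the "PharmGKB" row of the table, used only as sample input.
    for name, value in metrics(Confusion(tp=1509, tn=208, fp=254, fn=78)).items():
        print(f"{name}: {value:.3f}")
```

Rows with a missing count (shown as "–" in the table) cannot be evaluated this way without the underlying data.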
Error analysis: results for the different types of error occurring on the test dataset.
| Sl. No. | Sources of error | True value in data | Observed value in data | Error percentage |
|---|---|---|---|---|
| 1. | Entity detection error | 633,074* | 582,428 | 8.00% |
| 2. | Entity absent in text | 633,074* | 615,650 | 2.75% |
| 3. | Failure to detect entity | 633,074* | 609,413 | 3.73% |
| 4. | Entity normalization error | | | |
| a. | Gene normalization error | 42,607 | 50,336 | 18.14% |
| b. | Disease normalization error | 71,704 | 92,481 | 28.97% |
| c. | Drug normalization error | 11,033 | 14,563 | 31.99% |
PharmGKB was taken as the gold-standard dataset for all comparisons. *In the total PGx corpus extracted from MEDLINE. The error percentage was calculated according to formula 1.
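Formula 1 itself is not shown in this record. A minimal sketch, assuming it is the relative difference between the observed and true values expressed as a percentage (an assumption that agrees with every tabulated error percentage to within 0.01 percentage points), is given below. The function name `error_percentage` is hypothetical.

```python
def error_percentage(true_value: int, observed_value: int) -> float:
    """Relative error in percent: |observed - true| / true * 100.

    ASSUMED form of 'formula 1'; the source does not reproduce the formula.
    """
    if true_value == 0:
        raise ValueError("true_value must be non-zero")
    return abs(observed_value - true_value) / true_value * 100.0


# (true value, observed value) pairs copied from the error-analysis table.
ROWS = {
    "Entity detection error": (633_074, 582_428),
    "Entity absent in text": (633_074, 615_650),
    "Failure to detect entity": (633_074, 609_413),
    "Gene normalization error": (42_607, 50_336),
    "Disease normalization error": (71_704, 92_481),
    "Drug normalization error": (11_033, 14_563),
}

if __name__ == "__main__":
    for source, (true_value, observed_value) in ROWS.items():
        print(f"{source}: {error_percentage(true_value, observed_value):.2f}%")
```

Small differences in the last digit can arise because the table appears to truncate rather than round to two decimal places.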
Figure 2. Distribution of pharmacogenomic (PGx)-specific entities obtained from 2,304 PGx relationships with at least 100 citations. Medications are ordered from left to right by increasing prescription prevalence, alongside the genes most studied for PGx association with these drugs.
Figure 3. PGx-specific enriched markers other than those mentioned in PharmGKB. Disease ontology (left), FDA-approved drugs (middle), and pharmacogenes (right) known (i.e., with statistical association in clinical-genetics studies) to alter drug response or efficacy or lead to adverse drug response.