Yi-Yu Hsu, Mindy Clyne, Chih-Hsuan Wei, Muin J Khoury, Zhiyong Lu.
Abstract
Tracking scientific research publications on the evaluation, utility and implementation of genomic applications is critical for translating basic research into improved clinical and population health. In this work, we use state-of-the-art machine learning approaches to identify translational research in genomics beyond bench to bedside in the biomedical literature. We apply convolutional neural networks (CNNs) and support vector machines (SVMs) to bench/bedside article classification on the weekly manual annotation data of the Public Health Genomics Knowledge Base. Both classifiers use salient features to estimate the probability that a publication is curation-eligible, which can effectively reduce the workload of the manual triage and curation process. On an independent test set (n = 400), the CNN and SVM models achieved F-measures of 0.80 and 0.74, respectively. We further tested the better-performing CNN model in the routine annotation pipeline for 2 weeks; it significantly reduced annotation effort and retrieved more appropriate research articles. Our approaches provide direct insight into the automated curation of genomic translational research beyond bench to bedside, and the machine learning classifiers help annotators improve the efficiency of manual curation.
Year: 2019 PMID: 30753477 PMCID: PMC6367517 DOI: 10.1093/database/baz010
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
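The triage setup described in the abstract can be sketched as a text classifier that scores each article's probability of being curation-eligible, so annotators can review the highest-scoring candidates first. A minimal illustration of the SVM variant using scikit-learn with TF-IDF features follows; the toy titles, labels, and pipeline choices are placeholders for illustration only, not the authors' actual features, training data, or model configuration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus: article titles labeled curation-eligible (1) or not (0).
docs = [
    "genomic test evaluation in clinical practice",
    "implementation of pharmacogenomic screening programs",
    "clinical utility of BRCA testing in populations",
    "population health impact of genomic applications",
    "protein folding dynamics in yeast cells",
    "mouse model of cortical neuron development",
    "crystal structure of a bacterial enzyme",
    "in vitro assay for kinase inhibitors",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF unigram/bigram features feed a linear SVM; calibration wraps the
# margin scores into probability estimates that can rank articles for triage.
triage = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    CalibratedClassifierCV(LinearSVC(), cv=2),
)
triage.fit(docs, labels)

# Probability that a new article is curation-eligible (class 1).
prob_eligible = triage.predict_proba(
    ["clinical utility of genomic screening"]
)[0][1]
```

Articles whose probability exceeds a chosen threshold would be forwarded to manual curation; the rest are filtered out, which is where the workload reduction comes from.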
Figure 1. The CNNs for the TPC task.
The 5-fold cross-validation results on the training set
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| CNN | 0.7681 | 0.8785 | 0.8196 |
| SVM | 0.7688 | 0.7354 | 0.7517 |
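The F-measure column is the harmonic mean of precision and recall (F1). The table values can be reproduced with a one-line helper:

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the cross-validation rows above.
f_cnn = f_measure(0.7681, 0.8785)  # ≈ 0.8196
f_svm = f_measure(0.7688, 0.7354)  # ≈ 0.7517
```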
The performance of the SVM and CNN models on the test set
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| CNN | 0.7614 | 0.8428 | 0.8000 |
| SVM | 0.7615 | 0.7232 | 0.7419 |
Figure 2. Comparing the CNN with the baseline date-sort method using ROC curves.
Statistics of FP and FN errors
| | FPs | FNs |
|---|---|---|
| Number of errors in the gold standard (mis-curated in the past) | 6 | 1 |
| Number of borderline articles (could be either T1 or T2 and above) | 7 | 5 |
| Number of mis-classified by our CNN method | 29 | 19 |
| Total | 42 | 25 |