Jessica Schuster, Michael Superdock, Anthony Agudelo, Paul Stey, James Padbury, Indra Neil Sarkar, Alper Uzun.
Abstract
To generate a parsimonious gene set for understanding the mechanisms underlying complex diseases, we reasoned it was necessary to combine the curation of public literature, review of experimental databases and interpolation of pathway-associated genes. Using this strategy, we previously built two databases for reproductive disorders: the Database for Preterm Birth (dbPTB) and the Database for Preeclampsia (dbPEC). The completeness and accuracy of these databases are essential for supporting our understanding of these complex conditions. Given the exponential increase in biomedical literature, it is becoming increasingly difficult to maintain these databases manually. Using our curated databases as reference data sets, we implemented a machine learning-based approach to optimize article selection for manual curation. We used logistic regression, random forests and neural networks as our machine learning algorithms to classify articles. We examined features derived from abstract text, annotations and metadata that we hypothesized would best classify articles with genetically relevant content associated with the disorder of interest. Combinations of these features were used to build the classifiers, and the performance of these feature sets was compared to a standard 'Bag-of-Words'. Several combinations of these genetics-based feature sets outperformed 'Bag-of-Words' at a threshold such that 95% of the curated gene set obtained from the original manual curation of all articles was extracted from the articles classified by machine learning as 'considered'. The performance was superior in terms of the reduction of required manual curation and two measures of the harmonic mean of precision and recall. The reduction in workload ranged from 0.814 to 0.846 for the dbPTB and 0.301 to 0.371 for the dbPEC. Additionally, a database of metadata and annotations is generated, which allows for rapid query of individual features.
Our results demonstrate that machine learning algorithms can identify articles with relevant data for databases of genes associated with complex diseases.
Year: 2019 PMID: 31768545 PMCID: PMC6877776 DOI: 10.1093/database/baz124
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overview of semi-automated pipeline used to predict articles to consider for manual curation. The pipeline takes a set of PMIDs (here shown as PMID_ID##) as input along with manual classifications (class) of 1 or 0, where 1 signifies that an article was ‘considered’ for curation and 0 signifies that an article was ‘not considered’. With these PMIDs, the computational pipeline queries various public data repositories to retrieve article-specific data. Data are converted to features useful for predictive modeling. Using this feature set, predictive models were trained using logistic regression, random forests and neural networks. These predictive models were used to predict the relevance of unread articles.
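The feature-conversion step of this pipeline can be illustrated with the bag-of-words baseline. The following is a minimal sketch using only the Python standard library. The tokenizer, the function names and the two toy abstracts with placeholder PMIDs are assumptions made for illustration; the source does not state how text was tokenized or which libraries were used.

```python
from __future__ import annotations

import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(abstracts: list[str]) -> list[str]:
    """Collect every distinct token across all abstracts, in sorted order."""
    vocab: set[str] = set()
    for abstract in abstracts:
        vocab.update(tokenize(abstract))
    return sorted(vocab)


def bag_of_words(text: str, vocab: list[str]) -> list[int]:
    """Count how often each vocabulary token occurs in one abstract."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocab]


# Hypothetical input: placeholder PMID -> (abstract text, manual class of 1 or 0).
toy_articles = {
    "PMID_A": ("IL6 polymorphism associated with preterm birth", 1),
    "PMID_B": ("Maternal diet and preterm birth outcomes", 0),
}

vocab = build_vocabulary([text for text, _ in toy_articles.values()])
features = {pmid: bag_of_words(text, vocab) for pmid, (text, _) in toy_articles.items()}
labels = {pmid: label for pmid, (_, label) in toy_articles.items()}
```

Each article becomes one count vector whose length equals the vocabulary size, which is why a bag-of-words feature set grows with the corpus. The resulting vectors and class labels are what a classifier such as logistic regression would be trained on.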
Metrics used to assess performance of predictive models. *G_TP is the number of unique curator-accepted genes in true positive (TP) articles. G_(TP + FN) is the number of unique curator-accepted genes in true positive (TP) or false negative (FN) articles.
| Metric | Formula |
|---|---|
| Recall | TP/(TP + FN) |
| Precision | TP/(TP + FP) |
| Gene-Recall | *G_TP/G_(TP + FN) |
| Reduction in workload | (TN + FN)/(TP + FP + TN + FN) |
| F1 (Gene-Recall) | 2 (Gene-Recall × Precision)/(Precision + Gene-Recall) |
| F1 | 2 (Recall × Precision)/(Precision + Recall) |
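The six formulas above can be computed directly from confusion-matrix counts and gene counts. The sketch below is illustrative only: the class and function names are hypothetical, the key `workload_reduction` is an assumed label for the (TN + FN)/(TP + FP + TN + FN) formula, and the example counts are invented rather than taken from dbPTB or dbPEC.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Counts:
    tp: int           # articles predicted 'considered' that curators considered
    fp: int           # articles predicted 'considered' that curators did not consider
    tn: int           # articles predicted 'not considered' that curators did not consider
    fn: int           # articles predicted 'not considered' that curators considered
    genes_tp: int     # unique curator-accepted genes in TP articles
    genes_tp_fn: int  # unique curator-accepted genes in TP or FN articles


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two rates; defined as 0.0 when both are zero."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


def compute_metrics(c: Counts) -> dict[str, float]:
    """Evaluate the six metrics; assumes every denominator is non-zero."""
    recall = c.tp / (c.tp + c.fn)
    precision = c.tp / (c.tp + c.fp)
    gene_recall = c.genes_tp / c.genes_tp_fn
    return {
        "recall": recall,
        "precision": precision,
        "gene_recall": gene_recall,
        # Articles predicted negative are the ones a curator would not read.
        "workload_reduction": (c.tn + c.fn) / (c.tp + c.fp + c.tn + c.fn),
        "f1_gene": harmonic_mean(gene_recall, precision),
        "f1": harmonic_mean(recall, precision),
    }


# Invented counts for illustration only.
example = compute_metrics(Counts(tp=30, fp=20, tn=140, fn=10, genes_tp=95, genes_tp_fn=100))
```

With these invented counts, article-level recall is 0.75 while Gene-Recall is 0.95, because genes found in missed articles often also appear in retrieved ones. That gap is what makes a gene-based threshold less demanding than an article-based one.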
Feature set names. A ‘–’ in the Features in Set column denotes that the Features in Set is the same as the set name. The sizes of the feature sets for both dbPEC and dbPTB are listed.
| Feature set name | Features in Set | Size (dbPEC) | Size (dbPTB) |
|---|---|---|---|
| | – | 11 467 | 11 157 |
| | – | 1462 | 1498 |
| | – | 56 | 52 |
| | – | 390 | 375 |
| | – | 1177 | 1413 |
| | – | 6 | 6 |
| | – | 34 494 | 31 926 |
| | MeSH + Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check + Gene Count | 14 558 | 14 501 |
| | Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check + Gene Count | 3091 | 3344 |
| | Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check | 3085 | 3338 |
| | MeSH + Gene Significance + Semtype Count + Species Check + Gene Count | 14 502 | 14 449 |
| | MeSH + Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check | 14 552 | 14 495 |
| | MeSH + Gene Significance + Semtype Count + Species Check | 14 496 | 14 443 |
AUC of the ROC curve and AUC of the precision-recall curve (AUCPR) for all features trained and tested on dbPTB. 5-fold cross-validation was used to determine average AUCPR with a 95% confidence interval. P-values are listed for each feature set comparing ROC curves to the bag-of-words ROC curves using pROC. A ‘–’ was used to denote bag-of-words being compared to itself.
| Feature set | AUCPR (5-fold mean ± 95% CI) | AUC (ROC) | P-value |
|---|---|---|---|
| | | | |
| Bag-of-words | 0.646 ± 0.031 | 0.906 | – |
| | 0.622 ± 0.063 | 0.908 | 0.918 |
| | 0.572 ± 0.078 | 0.919 | 0.572 |
| | 0.572 ± 0.066 | 0.920 | 0.505 |
| | 0.619 ± 0.064 | 0.919 | 0.526 |
| | 0.630 ± 0.048 | 0.901 | 0.825 |
| | 0.626 ± 0.045 | 0.907 | 0.974 |
| | | | |
| Bag-of-words | 0.621 ± 0.040 | 0.826 | – |
| | 0.680 ± 0.077 | 0.922 | 0.002 |
| | 0.628 ± 0.070 | 0.872 | 0.137 |
| | 0.616 ± 0.069 | 0.862 | 0.251 |
| | 0.678 ± 0.093 | 0.865 | 0.221 |
| | 0.673 ± 0.077 | 0.857 | 0.331 |
| | 0.667 ± 0.083 | 0.870 | 0.155 |
| | | | |
| Bag-of-words | 0.631 ± 0.045 | 0.894 | – |
| | 0.643 ± 0.022 | 0.893 | 0.979 |
| | 0.530 ± 0.059 | 0.872 | 0.488 |
| | 0.509 ± 0.075 | 0.884 | 0.755 |
| | 0.599 ± 0.031 | 0.869 | 0.393 |
| | 0.580 ± 0.048 | 0.891 | 0.892 |
| | 0.607 ± 0.098 | 0.892 | 0.923 |
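The "mean ± 95% confidence interval" summary over 5-fold cross-validation can be sketched as follows. The source does not say how the interval was computed, so the t-based interval used here (critical value 2.776 for 4 degrees of freedom) is an assumption, as are the function names, the interleaved fold assignment and the invented per-fold scores.

```python
from __future__ import annotations

import math
import statistics


def k_fold_indices(n_items: int, k: int = 5) -> list[list[int]]:
    """Assign item indices to k folds by interleaving (an assumed, unshuffled split)."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


def mean_with_ci95(fold_scores: list[float], t_crit: float = 2.776) -> tuple[float, float]:
    """Return (mean, half-width) of a t-based 95% interval.

    The default t_crit is the two-sided 95% critical value for 4 degrees of
    freedom, which matches five folds; pass another value for a different k.
    """
    mean = statistics.mean(fold_scores)
    standard_error = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, t_crit * standard_error


folds = k_fold_indices(23, k=5)

# Invented per-fold AUCPR values, for illustration only.
mean, half_width = mean_with_ci95([0.60, 0.65, 0.62, 0.68, 0.66])
summary = f"{mean:.3f} ± {half_width:.3f}"
```

Each fold serves once as the held-out test set while the remaining four are used for training, so five scores are produced per feature set and classifier. With so few folds the interval is wide, which is worth keeping in mind when comparing AUCPR values that differ by only a few hundredths.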
AUC of the ROC curve and AUC of the precision-recall curve (AUCPR) for all features trained and tested on dbPEC. 5-fold cross-validation was used to determine average AUCPR with a 95% confidence interval. P-values are listed for each feature set comparing ROC curves to the bag-of-words ROC curves using pROC. A ‘–’ was used to denote bag-of-words being compared to itself.
| Feature set | AUCPR (5-fold mean ± 95% CI) | AUC (ROC) | P-value |
|---|---|---|---|
| | | | |
| Bag-of-words | 0.653 ± 0.043 | 0.802 | – |
| | 0.652 ± 0.028 | 0.805 | 0.875 |
| | 0.600 ± 0.027 | 0.746 | 0.007 |
| | 0.597 ± 0.027 | 0.743 | 0.005 |
| | 0.642 ± 0.027 | 0.805 | 0.854 |
| | 0.639 ± 0.028 | 0.801 | 0.991 |
| | 0.633 ± 0.066 | 0.802 | 0.985 |
| | | | |
| Bag-of-words | 0.651 ± 0.041 | 0.837 | – |
| | 0.678 ± 0.023 | 0.805 | 0.064 |
| | 0.636 ± 0.033 | 0.777 | 0.001 |
| | 0.623 ± 0.043 | 0.761 | 0 |
| | 0.673 ± 0.031 | 0.802 | 0.043 |
| | 0.664 ± 0.038 | 0.806 | 0.065 |
| | 0.641 ± 0.090 | 0.804 | 0.053 |
| | | | |
| Bag-of-words | 0.613 ± 0.027 | 0.763 | – |
| | 0.619 ± 0.052 | 0.782 | 0.372 |
| | 0.531 ± 0.037 | 0.689 | 0.007 |
| | 0.554 ± 0.038 | 0.691 | 0.009 |
| | 0.595 ± 0.051 | 0.764 | 0.946 |
| | 0.598 ± 0.048 | 0.771 | 0.696 |
| | 0.610 ± 0.003 | 0.743 | 0.417 |
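For readers unfamiliar with the AUC of the ROC curve reported above, the quantity can be computed as the probability that a randomly chosen 'considered' article receives a higher score than a randomly chosen 'not considered' article. The sketch below is a plain Python illustration of that definition, not a reproduction of pROC or of its ROC-curve comparison test; the function name and the toy labels and scores are assumptions.

```python
from __future__ import annotations


def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. This pairwise form is quadratic
    in the number of articles, which is acceptable for a small illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Invented classifier output: 1 = 'considered', 0 = 'not considered'.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
```

In the toy data, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.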
The values of the performance metrics for each feature set trained and tested on dbPTB. Performance metrics were recorded for each classifier at a 95% Gene-Recall threshold.
| Feature set | Recall | Gene-Recall | Precision | F1 | F1 (Gene-Recall) | Reduction in workload |
|---|---|---|---|---|---|---|
| | | | | | | |
| | 0.805 | 0.956 | 0.379 | 0.516 | 0.543 | 0.716 |
| | 0.829 | 0.956 | 0.531 | 0.648 | 0.683 | 0.791 |
| | 0.756 | 0.956 | 0.544 | 0.633 | 0.693 | 0.814 |
| | 0.829 | 0.956 | 0.523 | 0.642 | 0.676 | 0.788 |
| | 0.756 | 0.956 | 0.534 | 0.626 | 0.686 | 0.810 |
| | 0.805 | 0.956 | 0.541 | 0.647 | 0.691 | 0.801 |
| | 0.854 | 0.956 | 0.556 | 0.673 | 0.703 | 0.794 |
| | | | | | | |
| | 0.829 | 0.965 | 0.262 | 0.398 | 0.412 | 0.575 |
| | 0.878 | 0.956 | 0.379 | 0.529 | 0.543 | 0.690 |
| | 0.707 | 0.956 | 0.460 | 0.558 | 0.621 | 0.794 |
| | 0.683 | 0.956 | 0.406 | 0.509 | 0.570 | 0.775 |
| | 0.683 | 0.956 | 0.452 | 0.544 | 0.613 | 0.797 |
| | 0.659 | 0.956 | 0.435 | 0.524 | 0.598 | 0.797 |
| | 0.659 | 0.956 | 0.429 | 0.519 | 0.592 | 0.794 |
| | | | | | | |
| | 0.829 | 0.956 | 0.374 | 0.515 | 0.537 | 0.703 |
| | 0.707 | 0.956 | 0.617 | 0.659 | 0.750 | 0.846 |
| | 0.805 | 0.956 | 0.429 | 0.559 | 0.592 | 0.748 |
| | 0.756 | 0.956 | 0.443 | 0.559 | 0.605 | 0.771 |
| | 0.756 | 0.956 | 0.508 | 0.608 | 0.664 | 0.801 |
| | 0.756 | 0.956 | 0.544 | 0.633 | 0.693 | 0.814 |
| | 0.756 | 0.956 | 0.554 | 0.639 | 0.701 | 0.817 |
The values of the performance metrics for each feature set trained and tested on dbPEC. Performance metrics were recorded for each classifier at a 95% Gene-Recall threshold.
| Feature set | Recall | Gene-Recall | Precision | F1 | F1 (Gene-Recall) | Reduction in workload |
|---|---|---|---|---|---|---|
| | | | | | | |
| | 0.950 | 0.951 | 0.425 | 0.588 | 0.588 | 0.247 |
| | 0.967 | 0.956 | 0.439 | 0.604 | 0.602 | 0.258 |
| | 0.939 | 0.951 | 0.421 | 0.582 | 0.584 | 0.249 |
| | 0.944 | 0.966 | 0.413 | 0.574 | 0.578 | 0.228 |
| | 0.967 | 0.956 | 0.444 | 0.608 | 0.606 | 0.266 |
| | 0.961 | 0.951 | 0.458 | 0.620 | 0.618 | 0.292 |
| | 0.956 | 0.951 | 0.461 | 0.622 | 0.621 | 0.301 |
| | | | | | | |
| | 0.939 | 0.956 | 0.448 | 0.607 | 0.610 | 0.294 |
| | 0.928 | 0.951 | 0.488 | 0.640 | 0.645 | 0.360 |
| | 0.922 | 0.956 | 0.445 | 0.600 | 0.607 | 0.301 |
| | 0.917 | 0.951 | 0.426 | 0.582 | 0.589 | 0.275 |
| | 0.950 | 0.951 | 0.491 | 0.648 | 0.648 | 0.348 |
| | 0.922 | 0.961 | 0.494 | 0.643 | 0.653 | 0.371 |
| | 0.922 | 0.956 | 0.484 | 0.635 | 0.643 | 0.358 |
| | | | | | | |
| | 0.967 | 0.956 | 0.407 | 0.572 | 0.570 | 0.199 |
| | 0.961 | 0.966 | 0.410 | 0.575 | 0.576 | 0.210 |
| | 0.922 | 0.951 | 0.400 | 0.558 | 0.563 | 0.223 |
| | 0.944 | 0.951 | 0.369 | 0.530 | 0.531 | 0.137 |
| | 0.967 | 0.966 | 0.371 | 0.536 | 0.536 | 0.122 |
| | 0.961 | 0.956 | 0.448 | 0.611 | 0.610 | 0.277 |
| | 0.922 | 0.956 | 0.433 | 0.590 | 0.596 | 0.283 |
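The 95% gene-recall threshold used for these performance values can be illustrated with a short sketch: lower the score cutoff until the articles classified as 'considered' contain at least 95% of the unique curator-accepted genes, then report the fraction of articles left below the cutoff. This is an assumed reading of the procedure, not the authors' code; the function names and the toy scores, labels and gene sets are invented for illustration.

```python
from __future__ import annotations


def threshold_for_gene_recall(articles, target=0.95):
    """Return the highest score cutoff at which gene recall reaches `target`.

    `articles` is a list of (score, considered, genes) tuples, where `considered`
    is the curator's label and `genes` is the set of curator-accepted genes.
    """
    curated = set()
    for _, considered, genes in articles:
        if considered:
            curated |= genes
    if not curated:
        raise ValueError("no curator-accepted genes in the labelled articles")

    covered = set()
    # Lower the cutoff one article at a time, from the highest score down.
    for score, considered, genes in sorted(articles, key=lambda a: a[0], reverse=True):
        if considered:
            covered |= genes  # only true positive articles contribute genes
        if len(covered) / len(curated) >= target:
            return score
    raise ValueError("target must be at most 1.0")


def workload_reduction(articles, cutoff):
    """Fraction of articles scored below the cutoff, i.e. (TN + FN) / all articles."""
    skipped = sum(1 for score, _, _ in articles if score < cutoff)
    return skipped / len(articles)


# Invented example: (classifier score, curator label, curator-accepted genes).
toy_articles = [
    (0.95, True, {"IL6", "TNF"}),
    (0.90, False, set()),
    (0.80, True, {"IL6", "CRP"}),
    (0.60, True, {"VEGFA"}),
    (0.40, False, set()),
    (0.30, True, {"TNF"}),  # contributes no gene that is not already covered
    (0.10, False, set()),
    (0.05, False, set()),
]

cutoff = threshold_for_gene_recall(toy_articles, target=0.95)
reduction = workload_reduction(toy_articles, cutoff)
```

In this toy case the cutoff lands at 0.60: all four genes are recovered from three of the four relevant articles, and half of the articles fall below the cutoff. The missed article lowers article-level recall to 0.75 without costing any genes, which mirrors the pattern in the tables where Recall sits below Gene-Recall.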