Kyubum Lee, Maria Livia Famiglietti, Aoife McMahon, Chih-Hsuan Wei, Jacqueline Ann Langdon MacArthur, Sylvain Poux, Lionel Breuza, Alan Bridge, Fiona Cunningham, Ioannis Xenarios, Zhiyong Lu.
Abstract
Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often yields unsatisfactory precision and recall, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
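The "classify and rank" step described in the abstract can be illustrated with a minimal sketch. It assumes a trained model has already assigned each new publication a score between 0 and 1; the `ScoredPublication` type, the `triage` function, the 0.5 threshold, and the identifiers and scores below are all hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class ScoredPublication(NamedTuple):
    pmid: str     # placeholder identifier, not a real PMID
    score: float  # hypothetical model score that the paper is curatable


def triage(scored: List[ScoredPublication],
           threshold: float = 0.5) -> List[ScoredPublication]:
    """Keep publications scoring at or above the threshold, best first."""
    positives = [p for p in scored if p.score >= threshold]
    return sorted(positives, key=lambda p: p.score, reverse=True)


# Made-up candidates, for illustration only.
candidates = [
    ScoredPublication("PMID-A", 0.12),
    ScoredPublication("PMID-B", 0.97),
    ScoredPublication("PMID-C", 0.64),
]
ranked = triage(candidates)
print([p.pmid for p in ranked])
```

A curator would then read the ranked list from the top, so the threshold trades precision (fewer irrelevant papers to read) against recall (fewer relevant papers missed).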
Year: 2018 PMID: 30102703 PMCID: PMC6107285 DOI: 10.1371/journal.pcbi.1006390
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Literature triage using our deep learning framework.
Classification performance on UniProtKB, the GWAS Catalog and mycoSet.
(CNN: Convolutional Neural Networks, SVM: Support Vector Machine, LMT: Logistic Model Trees).
| Dataset | Methods | Precision | Recall | F1 |
|---|---|---|---|---|
| UniProtKB | Our Method (CNN) | 0.913 | 0.934 | |
| | LinearSVC (SVM) | 0.896 | 0.920 | 0.908 |
| GWAS Catalog | Our Method (CNN) | 0.973 | 0.991 | |
| | LinearSVC (SVM) | 0.965 | 0.980 | 0.972 |
| mycoSet | Our Method (CNN) | 0.602 | 0.667 | |
| | LinearSVC (SVM) | 0.566 | 0.627 | 0.595 |
| | mycoSORT (LMT) | 0.552 | 0.6 | 0.575 |
* In this table, the positive vs. negative ratio of mycoSet is 1:9, and that of the other datasets is 1:1.
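The F1 column in the table above is consistent with the standard definition of F1 as the harmonic mean of precision and recall. The sketch below checks this against the rows that report all three values; the function name `f1_score` and the row tuples are our own illustration, and the rows are listed without dataset labels on purpose. The same function could be applied to rows whose F1 cell is empty, with the caveat that a value recomputed from rounded inputs may differ from the published one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the rows that list all three values.
reported_rows = [
    (0.896, 0.920, 0.908),  # LinearSVC (SVM)
    (0.965, 0.980, 0.972),  # LinearSVC (SVM)
    (0.566, 0.627, 0.595),  # LinearSVC (SVM)
    (0.552, 0.600, 0.575),  # mycoSORT (LMT)
]

for precision, recall, reported in reported_rows:
    computed = f1_score(precision, recall)
    # Allow for rounding of the published three-decimal values.
    agrees = abs(computed - reported) < 0.001
    print(f"P={precision:.3f} R={recall:.3f} "
          f"F1={computed:.3f} reported={reported:.3f} agrees={agrees}")
```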
Lists of the most significant words in the positively classified publications.
(The words that are used as queries in the query-based method of each database are highlighted.)
| UniProtKB/Swiss-Prot | UniProtKB/Swiss-Prot | NHGRI-EBI GWAS Catalog | NHGRI-EBI GWAS Catalog |
|---|---|---|---|
| mutation(s) | syndrome | wide | variants |
| gene | exon(s) | genome | meta |
| cdna | encoding | association(s) | european |
| human | chromosome | loci (or locus) | identify (-ies, -ied) |
| sequence | two | snp(s) | susceptibility |
| missense | region | gwas | near |
| families | acid | p | ancestry |
| novel | coding | = | chromosome |
| amino | domain | 10 | significance |
| identified | expressed | genetic | 8 |
| family | recessive | study | independent |
| autosomal | affected | replication (replicated) | conducted |
| protein | cloning | associated | cohorts |
Fig 2. ROC curves of the classification results on the 2017JanJul group of UniProtKB/Swiss-Prot (blue) and the GWAS Catalog (red): (a) curves for all publications; (b) curves for publications containing mutations at the abstract level.
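For readers unfamiliar with ROC curves, the sketch below shows how such a curve and its area (AUC) can be computed from classifier scores and gold-standard labels. It is a generic pure-Python illustration, not the evaluation code used for Fig 2; the function names `roc_points` and `auc` and the toy labels and scores are assumptions made for the example.

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    Lowers the decision threshold one distinct score at a time;
    tied scores are added to the counts together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: 1 = curatable, 0 = not curatable (made up for illustration).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

The AUC equals the fraction of (positive, negative) pairs in which the positive publication receives the higher score, which is why it is a natural summary for a ranking-based triage.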
Comparison of the results of our method with those of the query-based method in the UniProtKB/Swiss-Prot and GWAS Catalog triage.
Both query-based and CNN-based results were evaluated by the curators, resulting in the total number of curatable publications below.
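| | UniProtKB/Swiss-Prot | GWAS Catalog |
|---|---|---|
| Publications screened | 4,680 (3 months, variant-containing publications) | 64,405 (3 weeks, all publications) |
| Curatable publications (total) | 424 | 27 |
| Retrieved by the query-based method | 79 | 304 |
| Curatable among query-based results | 36 (P: 45.57%, R: 8.49%) | 27 (P: 8.88%, R: 100%) |
| Retrieved by our method (CNN) | 501 | 98 |
| Curatable among CNN results | 413 (P: 82.43%, R: 97.41%) | 26 (P: 26.53%, R: 96.30%) |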
* As requested by the curators, UniProtKB results are filtered using tmVar as only articles with explicit variant mentions are within the scope of its data curation.
Statistics of the datasets.
| | UniProtKB/Swiss-Prot | The NHGRI-EBI GWAS Catalog | mycoSet | mycoSet |
|---|---|---|---|---|
| Version | Sep. 20, 2017 | Oct. 11, 2017 | - | - |
| Total # of PMIDs | 12,779 | 3,164 | 749 | 6,902 |
| PMIDs with abstracts | 11,978 | 3,143 | 746 | 6,575 |
| N/A | N/A | |||
| N/A | N/A |