Ruihua Fang, Gary Schindelman, Kimberly Van Auken, Jolene Fernandes, Wen Chen, Xiaodong Wang, Paul Davis, Mary Ann Tuli, Steven J Marygold, Gillian Millburn, Beverley Matthews, Haiyan Zhang, Nick Brown, William M Gelbart, Paul W Sternberg.
Abstract
BACKGROUND: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. A critical first step in the biocuration process is to identify, from all published literature, the papers that contain results for the specific data type a curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is therefore usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been used in production for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure that uses training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach is significant because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
Year: 2012 PMID: 22280404 PMCID: PMC3305665 DOI: 10.1186/1471-2105-13-16
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Evaluation results of ten WormBase data types using the ten testing sets
| Data types | Recall (testing set) | Precision (testing set) |
|---|---|---|
| RNAi | 0.99 | 0.78 |
| Antibody | 0.94 | 0.81 |
| Phenotype | 0.86 | 0.92 |
| Gene regulation | 0.88 | 0.70 |
| Mutant allele sequence | 0.93 | 0.98 |
| Gene expression | 0.95 | 0.88 |
| Gene product interaction* | NA | NA |
| Overexpression phenotype | 0.91 | 0.81 |
| Gene interaction | 0.85 | 0.79 |
| Gene structure correction | 0.90 | 0.82 |
The SVM analysis was done using training/testing sets specified in Additional File 4, Table S3 and Methods. *Gene product interaction did not have enough labeled papers, so no evaluation was done using the testing set.
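For readers less familiar with the two metrics in these tables, recall is the fraction of truly relevant papers that the classifier flags, and precision is the fraction of flagged papers that are truly relevant. The following is a minimal sketch of those standard definitions; the function name and the paper identifiers are hypothetical and are not data from the study.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two sets of paper identifiers.

    predicted: papers the classifier flagged as containing the data type
    actual:    papers a curator labeled as containing the data type
    """
    true_positives = len(predicted & actual)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: 4 papers flagged, 4 truly relevant, 3 in common.
flagged = {"p1", "p2", "p3", "p4"}
relevant = {"p1", "p2", "p3", "p5"}
precision, recall = precision_recall(flagged, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A high-recall, lower-precision row such as RNAi (0.99 / 0.78) therefore means that almost no relevant papers are missed, at the cost of curators reviewing some false positives.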
Evaluation results of five FlyBase data types with high occurrence using the testing sets
| Data type | Recall | Precision |
|---|---|---|
| New mutant allele | 0.98 | 0.56 |
| Gene expression in wild-type background | 0.96 | 0.92 |
| Gene expression in perturbed background | 0.95 | 0.92 |
| New transgenic allele | 0.91 | 0.71 |
| Physical interaction between macro-molecules | 0.88 | 0.84 |
The SVM analysis was done using training/testing sets specified in Additional File 5, Table S4 and Methods.
Evaluation results of FlyBase RNAi data type using FlyBase and/or WormBase training papers
| Training dataset | Recall | Precision |
|---|---|---|
| FlyBase RNAi | 0.81 | 1.00 |
| WormBase production RNAi | 0.85 | 0.99 |
| FlyBase+WormBase RNAi | 0.99 | 0.99 |
The SVM analysis was done using training/testing sets specified in Additional File 6, Table S5 and Methods.
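The pooling idea behind this table, training one classifier on labeled papers drawn from two bodies of literature, can be illustrated with a toy example. The sketch below is not the authors' pipeline: the bag-of-words features, the plain subgradient-descent linear SVM, the hyperparameters, and the tiny labeled "papers" are all assumptions made for illustration.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts (an assumed, simplified feature set)."""
    return Counter(text.lower().split())


def train_linear_svm(examples, epochs=20, eta=0.1, lam=0.01):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    examples: list of (text, label) pairs with label +1 or -1.
    Returns a dict mapping each token to its learned weight.
    """
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text)
            score = sum(weights.get(tok, 0.0) * cnt for tok, cnt in x.items())
            # L2 regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - eta * lam
            # Hinge-loss step: only examples inside the margin update weights.
            if label * score < 1.0:
                for tok, cnt in x.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * label * cnt
    return weights


def predict(weights, text):
    """Return +1 if the text is classified as positive, otherwise -1."""
    x = featurize(text)
    score = sum(weights.get(tok, 0.0) * cnt for tok, cnt in x.items())
    return 1 if score > 0 else -1


# Hypothetical labeled papers (+1 = contains RNAi data, -1 = does not).
fly_rnai = [
    ("rnai knockdown of gene in drosophila", 1),
    ("antibody staining of drosophila embryo", -1),
]
worm_rnai = [
    ("rnai feeding screen in elegans", 1),
    ("mutant allele sequence in elegans", -1),
]

# Pool the two corpora into a single training set, then classify new text.
pooled_model = train_linear_svm(fly_rnai + worm_rnai)
print(predict(pooled_model, "rnai screen of gene"))
print(predict(pooled_model, "antibody staining of elegans embryo"))
```

Pooling here is simply list concatenation before training; in practice the corpora would differ in size and vocabulary, and those differences are what the table's comparison of single-corpus and combined training sets measures.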
Evaluation results of nine FlyBase data types with low occurrence using the testing sets
| Data type | Recall | Filter term (%) |
|---|---|---|
| Initial characterization of a gene | 0.97 ± 0.05 | 18.0 ± 1.3 |
| Use of expression marker | 0.95 ± 0.06 | 22.5 ± 2.3 |
| Transfection of DNA/RNA | 0.94 ± 0.04 | 7.6 ± 1.6 |
| New phenotype (characterization) | 0.93 ± 0.05 | 19.9 ± 2.1 |
| Renaming of a gene | 0.91 ± 0.10 | 10.9 ± 2.6 |
| New cis-regulatory elements | 0.88 ± 0.05 | 8.1 ± 2.2 |
| Gene model modification | 0.88 ± 0.08 | 17.1 ± 3.5 |
| Genome feature sequence mapping | 0.87 ± 0.09 | 10.9 ± 2.6 |
| Merge of gene reports | 0.86 ± 0.06 | 13.7 ± 5.3 |
The SVM analysis was done using training/testing sets specified in Additional File 7, Table S6 and Methods.
Evaluation results of three data types with low occurrence from MGI using the testing sets
| Data type | Recall | Filter term (%) |
|---|---|---|
| Mutant Phenotype allele | 0.98 ± 0.01 | 12.6 ± 1.2 |
| Embryologic expression | 0.94 ± 0.04 | 11.4 ± 1.7 |
| Tumor biology | 0.90 ± 0.08 | 3.4 ± 1.6 |
The SVM analysis was done using training/testing sets specified in Additional File 8, Table S7 and Methods.