Sun Kim, Zhiyong Lu, W John Wilbur.
Abstract
BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE.
Year: 2015 PMID: 25887671 PMCID: PMC4349776 DOI: 10.1186/s12859-015-0487-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Ambiguity of headwords for gene/protein names in SemCat
| Uniqueness | Ambiguity | Headwords |
|---|---|---|
| Yes | No | gene, protein, kinase, receptor, transporter, pseudogene, enzyme, peptide, polypeptide, glycoprotein, lipoprotein, symporter, antiporter, collagen, polyprotein, cotransporter, crystallin, lectin, globin, tubulin, oncogene, phosphoprotein, ferredoxin, opsin, antibody, porin, flavoprotein, homeobox, actin, adhesin, isoenzyme, integrin, lysozyme, chaperonin, globulin, ribonucleoprotein, immunoglobulin, isozyme, cadherin, transcript, myosin, apoprotein, cyclin, autoantigen, hemoglobin, spectrin, cytochrome, flagellin, tropomyosin, kinesin, adaptin, keratin, peroxiredoxin, pilin, chemokine, casein, catenin, ferritin, enkephalin, histone, giardin, interferon, albumin, trypsin, glutaredoxin, metallothionein, cyclophilin, proteolipid, mucin, vasopressin, proteoglycan |
| Ambiguous | Low | -ase (i.e. terms ending in “ase”), regulator, antigen, isoform, inhibitor, repressor, hormone, toxin, ras, carrier, suppressor, ligand, translocator, phosphate, thioredoxin, neurotoxin |
| | High | Greek letters (e.g. alpha, beta,...), Roman numerals, short strings (e.g. psi, orf, ib,...), precursor, subunit, homolog, chain, factor, component, family, product, channel, activator, system, variant, chaperone, superfamily, molecule, pump, exchanger, element, sequence, resistance, construct, allergen, exporter, transducer, sensor, finger, modulator, effector, antiterminator, fusion, defective, antagonist, locus, wing, acid, receiver, para, cofactor, spot, tail, pigment, class, coma, exon, interactor, coactivator |
| | Rarely used | content, percentage, gain, frame, length, ratio, response, yield, defect, fiber, resistant |
| No | No | region, domain, complex, form, fragment, binding, weight, transport, member, cell, containing, fluid, related, associated, syndrome, putative, biosynthesis, repeat, activity, segment, preparation, smear, subfamily, dependent, terminus, substrate, determinant, site, level, motif, specific, subtype, mrna, dna, synthesis, fibroblast, cdna, cluster, assembly, membrane, mutant, transmembrane, virus, terminal, group, hybrid, flip, urine, function, number, periplasmic, yield, rich, plasmid, rate, metabolism, fold |
For each term, either the last word or the word before a preposition was taken as the headword. An annotator judged the uniqueness and the ambiguity of each headword as an indicator of a gene/protein name.
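The headword rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the preposition list and the function name `headword` are assumptions, since the source does not say which prepositions were used or how terms were tokenized.

```python
# Hypothetical preposition list; the source does not specify one.
PREPOSITIONS = {"of", "in", "for", "with", "from", "to", "on", "by", "at"}


def headword(term: str) -> str:
    """Return the word before the first preposition, else the last word."""
    words = term.lower().split()
    if not words:
        raise ValueError("empty term")
    for i, word in enumerate(words):
        # A preposition in first position has no word before it, so skip it.
        if word in PREPOSITIONS and i > 0:
            return words[i - 1]
    return words[-1]


print(headword("tubulin polymerizing protein"))                  # protein
print(headword("receptor for advanced glycation end products"))  # receptor
```

A whitespace split is the simplest possible tokenizer; real term lists with punctuation or hyphenation would likely need something more careful.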
Figure 1. An example of Linguistic Pattern 1. This pattern checks whether a term with its keyword removed appears in the same abstract. For “infantile autism disease”, “infantile autism” is extracted and checked for an occurrence in the same abstract (see the red box in the title).
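A rough sketch of the check behind Linguistic Pattern 1 follows. It assumes the keyword is the last word of the term and that the remainder must occur at least once without the keyword directly after it; both are assumptions made for illustration, and the function name `pattern1_candidate` and the sample abstract are invented here rather than taken from the paper.

```python
import re
from typing import Optional


def pattern1_candidate(term: str, keyword: str, abstract: str) -> Optional[str]:
    """Strip a trailing keyword from `term` and return the remainder if it
    occurs in `abstract` without the keyword directly after it."""
    words = term.split()
    if len(words) < 2 or words[-1].lower() != keyword.lower():
        return None
    stem = " ".join(words[:-1])
    # Whole-word match of the stem, not followed by the keyword.
    pattern = rf"\b{re.escape(stem)}\b(?!\s+{re.escape(keyword)}\b)"
    return stem if re.search(pattern, abstract, flags=re.IGNORECASE) else None


# Invented abstract text, for illustration only.
abstract = ("Infantile autism: a follow-up study. "
            "Children diagnosed with infantile autism disease were re-examined.")
print(pattern1_candidate("infantile autism disease", "disease", abstract))
# infantile autism
```

If the shortened form never appears on its own, the function returns `None` and the term yields no candidate.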
Figure 2. An example of Linguistic Pattern 2. This pattern uses the construction in which a term is followed by a comma and an appositive that defines or explains it. “Cofilin” and “ArhGAP9” are obtained for the headword “protein” using this pattern.
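The appositive construction of Linguistic Pattern 2 can be approximated with a regular expression. The sketch below makes several simplifying assumptions that are not in the source: the defined term is a single token, the appositive starts with "a", "an" or "the", and at most five modifier words precede the headword. The sentences are written for this example and are not quotations from the paper's figure.

```python
import re
from typing import List


def pattern2_candidates(sentence: str, headword: str) -> List[str]:
    """Return each X found in an 'X, a/an/the ... <headword>' appositive."""
    pattern = re.compile(
        rf"(?P<term>[A-Za-z0-9][\w-]*)"   # X: a single token in this sketch
        rf",\s+(?:a|an|the)\s+"           # comma, then an assumed determiner
        rf"(?:[\w-]+\s+){{0,5}}?"         # up to five modifier words
        rf"{re.escape(headword)}\b",      # the headword closes the appositive
        flags=re.IGNORECASE,
    )
    return [m.group("term") for m in pattern.finditer(sentence)]


# Illustrative sentences, invented for this sketch.
print(pattern2_candidates(
    "Cofilin, an actin-binding protein, regulates filament turnover.", "protein"))
print(pattern2_candidates(
    "We identified ArhGAP9, a novel Rho GTPase-activating protein.", "protein"))
```

A production version would more plausibly rely on a phrase chunker than on a token-level regex, since multi-word terms are common in PubMed.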
List of “is a” relations identified in Yeganova et al. [39]
| X is a Y | X is a potent Y |
| X are Y | X is the most common Y |
| X and other Y | X are rare Y |
| X as a Y | X is a widely used Y |
| X such as Y | X is an uncommon Y |
| X is an Y | X is an autosomal dominant Y |
| X as an Y | X is a form of Y |
| X is an important Y | X is one of the major Y |
| X a new Y | X is a chronic Y |
| X are the most common Y | X and other forms of Y |
| X is a rare Y | X is a broad spectrum Y |
| X is a novel Y | X is the primary Y |
| X is a major Y | X is a rare autosomal recessive Y |
| X is an essential Y | X is the most common type of Y |
| X was the only Y | X is the second most common Y |
| X was the most common Y | X are the most frequent Y |
| X is a common Y | X is the most widely used Y |
| X is a new Y | X is the most frequent Y |
| X is a complex Y | X is the most common primary Y |
| X is an effective Y | X is one of the major Y |
These patterns are summarized as “X is/are/as DT... Y” in our method, where X is a phrase, DT is a determiner and Y is a headword.
Figure 3. An example of Linguistic Pattern 3. This pattern uses constructions in which a term is defined or explained with “is”, “are” or “as”. “TBCE” and “Cholangiocytes” are defined as “a tubulin polymerizing protein” and “the epithelial cells”, respectively.
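The summarized form “X is/are/as DT... Y” lends itself to a similar regular-expression sketch. As before, this is an illustration under assumptions: X is restricted to one token, DT is limited to "a", "an" and "the", and up to five words may sit between the determiner and the headword. The name `pattern3_matches` is hypothetical, and the two sentences are assembled from the fragments quoted in the Figure 3 caption.

```python
import re
from typing import List, Tuple

DETERMINERS = r"(?:a|an|the)"  # assumed DT set; the source only says "determiner"


def pattern3_matches(sentence: str, headwords: List[str]) -> List[Tuple[str, str]]:
    """Return (X, Y) pairs for 'X is/are/as DT ... Y' with Y a known headword."""
    heads = "|".join(re.escape(h) for h in headwords)
    pattern = re.compile(
        rf"(?P<x>[A-Za-z0-9][\w-]*)\s+(?:is|are|as)\s+{DETERMINERS}\s+"
        rf"(?:[\w-]+\s+){{0,5}}?(?P<y>{heads})\b",
        flags=re.IGNORECASE,
    )
    return [(m.group("x"), m.group("y").lower()) for m in pattern.finditer(sentence)]


keywords = ["gene", "protein", "disease", "cell", "cells"]
print(pattern3_matches("TBCE is a tubulin polymerizing protein.", keywords))
print(pattern3_matches("Cholangiocytes are the epithelial cells.", keywords))
```

The longer variants in the table, such as “X is one of the major Y”, would need extra alternatives in front of the headword; they are left out here to keep the sketch short.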
Dataset used for training SVM classifiers
| Keyword | Positives | Negatives | SemCat categories |
|---|---|---|---|
| Gene | 3532163 | 1631676 | GENE_OR_PROTEIN |
| | | | DNA_MOLECULE |
| Protein | 3533621 | 1630690 | GENE_OR_PROTEIN |
| | | | PROTEIN_MOLECULE |
| Disease | 88653 | 5096888 | DISEASE_OR_SYNDROME |
| | | | INJURY_OR_POISONING |
| | | | SIGN_OR_SYMPTOM |
| Cell(s) | 14581 | 5178142 | CELL |
For each keyword, terms from relevant SemCat categories were merged and used for the classifiers.
SVM performance using 10-fold cross-validation on the training set for the five keywords “gene”, “protein”, “disease” and “cell(s)” (“cell” and “cells” are reported together)
| Keyword | Precision | Recall | F1 score |
|---|---|---|---|
| Gene | 0.9721 | 0.9838 | 0.9779 |
| Protein | 0.9738 | 0.9846 | 0.9792 |
| Disease | 0.8938 | 0.7555 | 0.8188 |
| Cell(s) | 0.9233 | 0.6694 | 0.7761 |
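As a consistency check on the table above, the third score in each row matches the harmonic mean of the first two to within rounding, which is what an F1 score would give if the first two columns are precision and recall. That reading of the columns is an assumption, because the header row did not survive extraction. A short sketch of the check:

```python
# Values copied from the table; the column meaning is assumed, not confirmed.
rows = {
    "Gene": (0.9721, 0.9838, 0.9779),
    "Protein": (0.9738, 0.9846, 0.9792),
    "Disease": (0.8938, 0.7555, 0.8188),
    "Cell(s)": (0.9233, 0.6694, 0.7761),
}


def harmonic_mean(a: float, b: float) -> float:
    """F1-style harmonic mean of two scores."""
    return 2 * a * b / (a + b)


for keyword, (first, second, reported) in rows.items():
    computed = harmonic_mean(first, second)
    print(f"{keyword}: computed {computed:.4f}, reported {reported:.4f}")
```

Small differences in the last digit are expected, since the inputs are themselves rounded to four decimals.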
Performance for Linguistic Pattern 1
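| Keyword | Total | New | Evaluated | Precision (annotator 1) | Precision (annotator 2) | Precision (annotator 3) |
|---|---|---|---|---|---|---|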
| Gene | 37678 | 12461 | 100 | 91.0% | 91.0% | 91.0% |
| Protein | 24000 | 8630 | 100 | 91.0% | 91.0% | 91.0% |
| Disease | 438 | 163 | 163 | 93.9% | 94.5% | 93.3% |
| Cell | 50 | 21 | 21 | 95.2% | 95.2% | 95.2% |
| Cells | 565 | 380 | 380 | 97.1% | 97.6% | 97.4% |
Precisions from each annotator are shown for “gene”, “protein”, “disease”, “cell” and “cells”. “Total” is the total number of terms obtained, “New” is the number of those terms not in SemCat, and “Evaluated” is the number of terms evaluated by the reviewers.
Performance for Linguistic Pattern 2
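| Keyword | Total | New | Evaluated | Precision (annotator 1) | Precision (annotator 2) | Precision (annotator 3) |
|---|---|---|---|---|---|---|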
| Gene | 1285 | 386 | 100 | 77.0% | 77.0% | 76.0% |
| Protein | 3484 | 1048 | 100 | 93.0% | 93.0% | 93.0% |
| Disease | 274 | 64 | 64 | 98.4% | 98.4% | 96.9% |
| Cell | 77 | 63 | 63 | 98.4% | 98.4% | 98.4% |
| Cells | 56 | 30 | 30 | 96.7% | 96.7% | 96.7% |
Precisions from each annotator are shown for “gene”, “protein”, “disease”, “cell” and “cells”. “Total” is the total number of terms obtained, “New” is the number of those terms not in SemCat, and “Evaluated” is the number of terms evaluated by the reviewers.
Performance for Linguistic Pattern 3
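| Keyword | Total | New | Evaluated | Precision (annotator 1) | Precision (annotator 2) | Precision (annotator 3) |
|---|---|---|---|---|---|---|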
| Gene | 5098 | 1230 | 100 | 90.0% | 90.0% | 90.0% |
| Protein | 10439 | 3847 | 100 | 91.0% | 91.0% | 91.0% |
| Disease | 4681 | 2298 | 100 | 99.0% | 99.0% | 99.0% |
| Cell | 147 | 80 | 80 | 95.0% | 95.0% | 95.0% |
| Cells | 112 | 69 | 69 | 98.6% | 98.6% | 98.6% |
Precisions from each annotator are shown for “gene”, “protein”, “disease”, “cell” and “cells”. “Total” is the total number of terms obtained, “New” is the number of those terms not in SemCat, and “Evaluated” is the number of terms evaluated by the reviewers.
Estimated recalls for Linguistic Patterns 1, 2 and 3
| Keyword | Pattern 1 | Pattern 2 | Pattern 3 | All patterns |
|---|---|---|---|---|
| Gene | 13.5% | 0.6% | 2.4% | 14.0% |
| Protein | 8.7% | 1.6% | 3.9% | 10.6% |
| Disease | 0.5% | 0.4% | 4.5% | 4.8% |
| Cell | 0.7% | 0.5% | 2.1% | 2.4% |
| Cells | 1.8% | 0.9% | 1.3% | 2.8% |
| Average | 5.0% | 0.8% | 2.8% | 6.9% |
As no true labels are available for PubMed terms, recall was estimated from the number of SemCat terms occurring in PubMed that were discovered by each pattern.
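The recall estimate described in this note can be written as a set computation: of the SemCat terms that occur in PubMed, what fraction does a pattern rediscover? The sketch below uses toy term sets invented for illustration, and the function name `estimated_recall` is hypothetical. It also shows why a combined figure can be smaller than the sum of the per-pattern figures: the patterns may find overlapping terms.

```python
from typing import Set


def estimated_recall(semcat: Set[str], pubmed: Set[str], discovered: Set[str]) -> float:
    """Fraction of SemCat terms occurring in PubMed that were discovered."""
    reference = semcat & pubmed
    if not reference:
        return 0.0
    return len(reference & discovered) / len(reference)


# Toy data, invented for this sketch.
semcat = {"p53", "brca1", "tbce", "cofilin"}
pubmed = {"p53", "brca1", "tbce", "arhgap9"}
found_by_pattern1 = {"p53", "arhgap9"}
found_by_pattern3 = {"p53", "tbce"}

r1 = estimated_recall(semcat, pubmed, found_by_pattern1)
r3 = estimated_recall(semcat, pubmed, found_by_pattern3)
r_all = estimated_recall(semcat, pubmed, found_by_pattern1 | found_by_pattern3)
print(f"pattern 1: {r1:.3f}, pattern 3: {r3:.3f}, combined: {r_all:.3f}")
```

Terms such as "arhgap9" that are discovered but absent from SemCat do not affect this estimate; they are the new terms that the precision tables evaluate.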
Estimated recalls for Linguistic Patterns 1, 2 and 3 without SVM classification
| Keyword | Pattern 1 | Pattern 2 | Pattern 3 | All patterns |
|---|---|---|---|---|
| Gene | 17.4% | 0.8% | 3.1% | 18.2% |
| Protein | 11.6% | 2.3% | 5.4% | 14.2% |
| Disease | 1.4% | 0.6% | 6.1% | 6.8% |
| Cell | 8.0% | 1.0% | 3.5% | 10.7% |
| Cells | 29.7% | 1.7% | 2.7% | 31.6% |
| Average | 13.6% | 1.3% | 4.2% | 16.3% |
As no true labels are available for PubMed terms, recall was estimated from the number of SemCat terms occurring in PubMed that were discovered by each pattern.
Performance comparison with or without including general terms
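| Keyword | Pattern 1 (general terms included) | Pattern 2 (general terms included) | Pattern 3 (general terms included) | Pattern 1 (general terms excluded) | Pattern 2 (general terms excluded) | Pattern 3 (general terms excluded) |
|---|---|---|---|---|---|---|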
| Gene | 91.0% | 76.7% | 90.0% | 90.0% | 73.0% | 88.0% |
| Protein | 91.0% | 93.0% | 91.0% | 90.0% | 89.0% | 85.0% |
| Disease | 93.9% | 97.9% | 99.0% | 93.9% | 96.4% | 99.0% |
| Cell | 95.2% | 98.4% | 95.0% | 95.2% | 71.4% | 88.8% |
| Cells | 97.4% | 96.7% | 98.6% | 96.4% | 90.0% | 95.7% |
“General” indicates that a term is valid but too general in meaning to be useful for enriching SemCat. Scores are precisions averaged over the three reviewers.
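The averaging over reviewers can be reproduced from the per-annotator precisions reported for Linguistic Pattern 2. The short sketch below recomputes the means; rounding to one decimal is an assumption about how the averages were reported.

```python
from statistics import mean

# Per-annotator precisions (%) copied from the Linguistic Pattern 2 table.
pattern2_precisions = {
    "Gene": (77.0, 77.0, 76.0),
    "Protein": (93.0, 93.0, 93.0),
    "Disease": (98.4, 98.4, 96.9),
    "Cell": (98.4, 98.4, 98.4),
    "Cells": (96.7, 96.7, 96.7),
}

averaged = {kw: round(mean(vals), 1) for kw, vals in pattern2_precisions.items()}
for keyword, value in averaged.items():
    print(f"{keyword}: {value:.1f}%")
```

The results (76.7, 93.0, 97.9, 98.4 and 96.7) agree with the second score column of the comparison table, which suggests that column holds the Pattern 2 averages.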