Dietrich Rebholz-Schuhmann, Jee-Hyub Kim, Ying Yan, Abhishek Dixit, Caroline Friteyre, Robert Hoehndorf, Rolf Backofen, Ian Lewin.
Abstract
MOTIVATION: Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal not only requires that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).

RESULT: This study compiles a resource for lexical terms of biomedical interest in a standard format (called "LexEBI") and determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants, and chemical entities, amongst other terms. In addition, disease terms have been identified from Medline and PubMed Central and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities both comprise enzymes, which leads to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.
Year: 2013 PMID: 24124474 PMCID: PMC3790750 DOI: 10.1371/journal.pone.0075185
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Sources of baseforms and term variants.
| | Resource | Baseforms [#] | Variants [#] | Total [#] | Total/Labels | Unique terms [#] | Uniq. Terms/Labels |
| Gene/Prot. | GP7 | 516′113 | 4′005′040 | | 8.76 | 1′726′853 | 3.35 |
| | GP6 | 488′577 | 3′389′316 | | 7.94 | 1′564′436 | 3.20 |
| | Interpro | 20′671 | 0 | | 1.00 | 20′671 | 1.00 |
| | Enzymes | 4′905 | 8′082 | | 2.65 | 12′377 | 2.52 |
| Chemicals | Jochem | 278′578 | 1′691′980 | | 7.07 | 1′527′752 | 5.48 |
| | ChEBI | 19′645 | 94′748 | | 5.82 | 101′307 | 5.16 |
| | ChEBI (all) | 549′838 | 1′187′322 | | 3.16 | 863′227 | 1.57 |
| Other | Diseases | 56′010 | 165′581 | | 3.96 | 186′555 | 3.33 |
| | Species | 643′280 | 199′130 | | 1.31 | 838′135 | 1.30 |
| UMLS | Pharmact. | 104′201 | 123′840 | | 2.19 | 227′799 | 2.19 |
| | Bioact. | 54′148 | 87′209 | | 2.61 | 141′121 | 2.61 |
| | Enzymes | 26′065 | 56′332 | | 3.16 | 82′033 | 3.15 |
| | Lipid, Carb. | 11′518 | 9′770 | | 1.85 | 21′281 | 1.85 |
| | Vit., Horm. | 6′877 | 10′258 | | 2.49 | 17′007 | 2.47 |
| | Neoplast. | 4′718 | 6′488 | | 2.38 | 11′196 | 2.37 |
The table shows the distribution of terms from LexEBI sorted according to the resource that delivered the terms. The largest portions of the terms contained in LexEBI come from BioThesaurus (GP 6 and GP 7), from Jochem and ChEBI and from the NCBI taxonomy. Interpro and species terms show a low degree of term variation.
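The two ratio columns can be recomputed from the counts. The following is a minimal sketch, assuming that "Total" is the sum of baseforms and variants and that both ratios divide by the number of baseforms; this reading is inferred from the printed values and is not stated in the source. The function and variable names are illustrative only.

```python
# Counts copied from the table: (baseforms, variants, unique terms).
rows = {
    "GP7": (516_113, 4_005_040, 1_726_853),
    "Jochem": (278_578, 1_691_980, 1_527_752),
    "Species": (643_280, 199_130, 838_135),
}


def label_ratios(baseforms, variants, unique_terms):
    """Return (total, total per baseform, unique terms per baseform)."""
    total = baseforms + variants  # assumption: Total = baseforms + variants
    return (
        total,
        round(total / baseforms, 2),
        round(unique_terms / baseforms, 2),
    )


for name, counts in rows.items():
    print(name, label_ratios(*counts))
```

Under this assumption the sketch reproduces the printed ratios for the three rows shown, e.g. 8.76 and 3.35 for GP7.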
Distribution of abbreviations.
| | All occ. in Medline [#] | Unique acronyms [#] | Occ./acronym | >1 occ. [#] | >1 occ. [%] |
| GP7 | 1′705′358 | 65′674 | 26.0 | 33′031 | 49.7% |
| Enzymes | 287′219 | 10′001 | 28.7 | 5′184 | 48.2% |
| ChEBI | 7′411′169 | 27′776 | 266.8 | 14′080 | 49.3% |
| Disease | 9′034′479 | 25′377 | 356.0 | 13′610 | 46.4% |
| Species | 218′964 | 10′373 | 21.1 | 4′670 | 55.0% |
The abbreviations extracted from Medline have been attributed to a reference terminological resource, e.g. ChEBI, and the frequency of the abbreviation has been determined and added to LexEBI. Half of the abbreviations have single occurrences.
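As a plausibility check, the derived columns can be recomputed from the raw counts. This is a sketch under an assumption: the printed percentage is reproduced when it is computed as the share of acronyms that occur only once, which is inferred from the numbers and agrees with the statement that about half of the abbreviations have single occurrences. All names are illustrative.

```python
# (all occurrences in Medline, unique acronyms, acronyms with >1 occurrence)
abbreviations = {
    "GP7": (1_705_358, 65_674, 33_031),
    "ChEBI": (7_411_169, 27_776, 14_080),
    "Species": (218_964, 10_373, 4_670),
}


def abbreviation_stats(occurrences, unique, repeated):
    """Return (occurrences per acronym, % of acronyms seen exactly once)."""
    occ_per_acronym = round(occurrences / unique, 1)
    # Assumption: the percentage column is the single-occurrence share.
    single_share = round(100 * (unique - repeated) / unique, 1)
    return occ_per_acronym, single_share


for name, counts in abbreviations.items():
    print(name, abbreviation_stats(*counts))
```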
Figure 1. Baseform polysemy and nestedness: The diagram shows several comparisons between the different data resources.
The content of five resources, i.e. Enzymes, Interpro, Jochem, ChEBI and Species, is compared against the terms contained in GP7 using exact matching and fuzzy matching that considers morphological variation. All comparisons use only the baseforms of the clusters in LexEBI (left part) or the term variants from the different resources (right part). The measurements have been performed for the identification of complete terms in the resource ("Identical") and for the nestedness of the other resources' terms in the terms of GP7 ("Nestedness"). It can be seen that terms denoting enzyme entities do not show extensive term variation in GP7 and are nested only to a small extent in other terms of GP7. On the other hand, the terms for chemical entities are nested to a large extent in the terms of GP7, which gives rise to ambiguity and nestedness. Again, the terms from Jochem and from ChEBI are found among the term variants of GP7 under both exact matching and matching based on morphological variation.
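The distinction between "Identical" and "Nestedness", and between exact and fuzzy matching, can be illustrated with a small sketch. The normalization below (lower-casing, splitting on non-alphanumeric characters, stripping a plural "s") is only a crude stand-in for matching that considers morphological variation and is an assumption, not the procedure used in the study. All function names are hypothetical.

```python
import re


def tokens_exact(term):
    """Exact matching: whitespace tokens, case preserved."""
    return tuple(term.split())


def tokens_fuzzy(term):
    """Assumed normalization: lower-case, alphanumeric tokens, no plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return tuple(t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens)


def is_nested(inner, outer):
    """True if `inner` is a contiguous, strictly shorter token run of `outer`."""
    n, m = len(inner), len(outer)
    return 0 < n < m and any(outer[i:i + n] == inner for i in range(m - n + 1))


def compare(candidate, reference, fuzzy=False):
    """Classify `candidate` against `reference` as identical, nested or none."""
    key = tokens_fuzzy if fuzzy else tokens_exact
    a, b = key(candidate), key(reference)
    if a == b:
        return "identical"
    return "nested" if is_nested(a, b) else "none"


print(compare("ATP", "ATP synthase"))              # chemical term inside a PGN
print(compare("kinases", "Kinase"))                # differs under exact matching
print(compare("kinases", "Kinase", fuzzy=True))    # same after normalization
```

Matching on token sequences rather than raw substrings avoids counting "ATP" as nested in a longer token that merely contains those letters.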
Baseforms and term variants of different types contained in GP6 and GP7.
| | Against baseforms only | | | | Against all terms | | | |
| Nested term | Matching | | Nestedness | | Matching | | Nestedness | |
| | Exact M. | Fuzzy M. | Exact M. | Fuzzy M. | Exact M. | Fuzzy M. | Exact M. | Fuzzy M. |
| |
| Enzymes | 150′104 | 157′099 | 178′155 | 200′921 | 611′645 | 711′603 | 739′827 | 975′362 |
| Interpro | 88′613 | 134′129 | 131′094 | 224′739 | 92′431 | 193′964 | 177′819 | 477′168 |
| Jochem | 21′461 | 28′911 | 253′875 | 284′856 | 25′033 | 39′137 | 1′148′792 | 1′000′229 |
| ChEBI | 10′388 | 13′090 | 411′622 | 431′645 | 10′950 | 16′708 | 1′148′792 | 1′530′209 |
| Species | 7′823 | 8′820 | 36′043 | 44′454 | 7′871 | 9′087 | 51′581 | 72′848 |
| |
| Enzymes | 173′994 | 180′829 | 202′484 | 224′877 | 754′229 | 872′211 | 918′263 | 1′214′663 |
| Interpro | 93′979 | 151′797 | 138′979 | 250′599 | 100′087 | 217′898 | 197′252 | 573′201 |
| Jochem | 23′402 | 31′418 | 288′062 | 319′561 | 27′478 | 43′214 | 1′023′729 | 1′273′755 |
| ChEBI | 11′053 | 14′124 | 447′812 | 468′723 | 11′724 | 18′128 | 1′381′545 | 1′847′750 |
| Species | 7′884 | 9′071 | 38′356 | 48′419 | 7′938 | 9′294 | 57′469 | 83′237 |
The reference data resource (“tagged term”) is either GP6 or GP7 and the alternative data resources (“nested term”) are ChEBI, Enzyme, Interpro and other resources.
The percentage indicates which portion of the terms has been tagged.
Baseforms and term variants of different types contained in chemical term resources.
| | | Against baseforms only | | | | Against all terms | | | |
| Tagged terms | Nested terms | Matching | | Nestedness | | Matching | | Nestedness | |
| | | Exact M. | Fuzzy M. | Exact M. | Fuzzy M. | Exact M. | Fuzzy M. | Exact M. | Fuzzy M. |
| ChEBI | Jochem | 13′694 | 14′683 | 15′929 | 17′199 | 56′155 | 60′332 | 76′885 | 86′470 |
| | GP6 | 1′019 | 1′258 | 11′148 | 12′442 | 1′264 | 1′711 | 25′017 | 31′356 |
| | GP7 | 1′119 | 1′366 | 11′690 | 12′843 | 1′416 | 1′879 | 27′461 | 33′565 |
| | Enzymes | 17 | 22 | 94 | 130 | 17 | 25 | 98 | 165 |
| | Bioact. | 479 | 669 | 1′895 | 2′569 | 667 | 1′218 | 3′213 | 4′960 |
| | Enzymes | 27 | 35 | 163 | 435 | 31 | 45 | 260 | 1′147 |
| | Pharmact. | 2′043 | 2′569 | 5′648 | 7′580 | 4′096 | 6′096 | 12′256 | 18′454 |
| Jochem | ChEBI | 15′511 | 17′104 | 215′072 | 256′468 | 67′030 | 86′499 | 668′716 | 1′142′849 |
| | Enzymes | 114 | 125 | 373 | 474 | 143 | 169 | 607 | 930 |
| Interpro | GP6 | 2′542 | 3′488 | 15′946 | 19′888 | 2′542 | 3′488 | 15′946 | 19′888 |
| | GP7 | 2′565 | 3′709 | 15′508 | 20′108 | 2′565 | 3′709 | 15′508 | 20′108 |
| Enzyme | GP6 | 2′514 | 2′604 | 4′334 | 4′517 | 8′400 | 8′690 | 11′547 | 12′280 |
| | GP7 | 2′561 | 2′657 | 4′391 | 4′576 | 8′601 | 8′872 | 11′674 | 12′383 |
| | ChEBI | 13 | 13 | 1′732 | 3′270 | 13 | 13 | 3′166 | 7′386 |
| | Jochem | 91 | 102 | 3′042 | 3′391 | 106 | 122 | 6′392 | 7′744 |
| Bioact. | ChEBI | 432 | 579 | 41′221 | 43′864 | 553 | 1′295 | 85′412 | 98′536 |
| Enzymes | ChEBI | 21 | 28 | 16′834 | 18′985 | 21 | 42 | 35′223 | 49′917 |
| Pharmact. | ChEBI | 2′342 | 2′867 | 37′373 | 47′049 | 3′361 | 6′559 | 58′178 | 86′264 |
The table gives an overview of the number of terms from the reference data resource (“tagged term”), e.g. ChEBI or Jochem, that contain a term from the alternative data resource (“nested term”). The percentage indicates the portion of the reference data resource.
Nestedness of terms.
| Tagged term / Nested term | All | | Chemical | | Species | | Diso | | PGN | |
| | Total | Unique | Total | Unique | Total | Unique | Total | Unique | Total | Unique |
| All | | | 17′311 | 926 | 1′895 | 307 | 727 | 131 | 169 | 98 |
| PGN | 17′728 | 15′538 | 16′328 | 774 | 1′109 | 119 | 291 | 55 | | |
| Diso | 1′894 | 1′824 | 983 | 285 | 786 | 195 | | | 125 | 70 |
| Species | 435 | 435 | | | | | 435 | 82 | 0 | |
| Chemical | 45 | 45 | | | | | 1 | 1 | 44 | 31 |
The terms from LexEBI have been cross-compared for the identification of nested terms. The figures in the table have been reduced to the number of those terms that do contain a nested term of a different type. The rows represent the terms that have been hosting other terms (“tagged term”) and the columns indicate the tagging terms that nest inside of the hosting terms (“nested term”). Non-redundant counts (“Unique”) are presented in addition to several mentions of the same term, if it contains different nested terms (“Total”). Please note that table 3 counts a cluster as a single entry even if two clusters share the same baseform whereas this table takes a single term as a single count.
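The "Total" counts in the table are internally consistent, which can be checked with a short sketch. The assignment of each count to a tagged/nested pair below is an assumption inferred from the row and column sums; the variable names are illustrative.

```python
# "Total" counts per tagged (hosting) type, keyed by nested type.
# Cell assignment is an assumption inferred from the printed margins.
nested_totals = {
    "PGN": {"Chemical": 16_328, "Species": 1_109, "Diso": 291},
    "Diso": {"Chemical": 983, "Species": 786, "PGN": 125},
    "Species": {"Diso": 435, "PGN": 0},
    "Chemical": {"Diso": 1, "PGN": 44},
}

# Sum across nested types for every hosting type (row margin).
row_sums = {tagged: sum(cells.values()) for tagged, cells in nested_totals.items()}

# Sum across hosting types for every nested type (column margin).
column_sums = {}
for cells in nested_totals.values():
    for nested, count in cells.items():
        column_sums[nested] = column_sums.get(nested, 0) + count

print(row_sums)
print(column_sums)
```

Under this assignment the row sums match the leading "Total" value of each row (e.g. 17′728 for PGN) and the column sums match the first data row (e.g. 17′311 for Chemical).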
List of nested terms.
| Nested term type | Tagged term type | Most frequent nested terms (count/term) |
| ChEBI | PGN | 1434/ATP, 1026/threonine, 677/nucleotide, 673/serine, 515/peptide, 472/zinc, 445/inhibitor, 392/phosphate, 380/toxin, 292/glycoprotein, 280/NADH, 265/metal, 250/GTP, 184/UDP, 182/S-adenosyl-L-methionine, 179/amine, 168/leucine, 167/acid, 166/hormone, 157/ADP, 149/quinone, 148/tyrosine, 146/cytochrome P450, 143/amino acid, 138/NAD(P)H |
| ChEBI | Diso | 157/drug, 83/retinal, 42/alcohol, 23/steroid, 20/hormone, 14/acid, 13/hemoglobin, 13/glucose, 12/iron, 12/growth hormone, 11/pyruvate, 11/lipid, 11/inhibitor, 11/glycogen, 11/cocaine, 10/potassium, 10/group, 10/cholesterol |
| Species | PGN | 469/Beta, 87/Glycine, 86/cis, 45/helix, 39/cancer, 28/glycine, 25/Spea, 24/ammonia, 23/Scolopendra, 20/Squamosa, 18/root, 13/Cancer, 12/iso, 11/anemia, 10/Helix, 8/Paes, 8/mago, 7/transposon Tn4556, 6/prion, 6/Ammonia, 5/Transposon Tn7, 5/codon, 5/Cis |
| Species | Diso | 224/cancer, 86/anemia, 44/ataxia, 41/glaucoma, 36/bovine, 29/purpura, 23/root, 14/vertigo, 9/trichophyton, 8/salmonella, 8/agnosia, 7/scleroderma, 7/rosacea, 7/Escherichia coli, 6/trichophyton rubrum, 6/microsporum, 5/fossa, 3/trichophyton verrucosum, 3/trichophyton soudanense, 3/patella, 3/nephroma |
| Diso | PGN | 81/Sperm, 44/sperm, 18/Mpe, 17/Neuroblastoma, 14/dissociation, 9/anterior, 7/Wiskott-Aldrich syndrome, 7/Anterior, 6/azoospermia, 5/Epstein, 5/Cat eye syndrome, 4/Ten, 4/sma, 4/ns4, 4/Ifi, 4/homocystinuria, 4/ganglion, 4/defect, 3/Nod, 2/Water stress, 2/Tubulointerstitial nephritis, 2/Teratocarcinoma |
| Diso | Species | 99/Myrmecia, 55/parvovirus, 31/Sheeppox, 23/dgi1, 22/Vaccinia, 21/E11, 15/Yellow fever, 15/Vesicular stomatitis, 13/Erysipelothrix, 13/Avian sarcoma, 10/melas, 9/Hydrometra, 7/Camelpox, 5/Epstein, 5/Caprine arthritis encephalitis, 5/Canine distemper, 5/Budgerigar fledgling disease |
| Diso | Chemical | 1/sympathomimetic |
| PGN | Diso | 18/insulin, 16/hip, 8/itch, 4/prolactin, 3/angiotensin converting enzyme inhibitor, 3/agglutinin, 2/trypsin, 2/robin, 2/methylmalonyl coA mutase, 2/gastrin, 2/fibrinogen, 2/beta galactosidase, 2/arylsulfatase, 2/androgen receptor, 2/actin, 1/ubiquitin, 1/tyrosinase |
| PGN | Chemical | 4/PAP, 3/thioredoxin, 3/ferredoxin, 2/Trp, 2/L-4, 2/IMP, 2/cholinesterase, 2/adrenodoxin, 2/A14, 1/urease, 1/serine proteinase inhibitor, 1/PNP, 1/phospholipase A2 inhibitor, 1/P2Y2, 1/oxytocin, 1/neuraminidase, 1/NAD(P)H, 1/myoglobin, 1/lipoxygenase, 1/lipopeptide |
The table shows the most frequent terms from one type (column labels) that are included in the terms of another type (row labels). Note that disease terms appear as part of a species term, since a disease term with the extension “virus” forms the species term.
Figure 2. Graphs of nestedness for chemical entity terms: The figure gives an overview of the graphs based on those terms for chemical entities that are composed of a term of a different type.
An edge exists between two nodes if the term from one node is nested in the term of the other node. The color encoding is green for PGNs, red for species, yellow for diseases and blue for chemical entities. Only a few terms from ChEBI make use of generalised PGNs, in contrast to the nestedness of terms for PGNs.
Figure 3. Graphs of nestedness for species terms: Terms for living beings (LIVB) contain terms of diseases but no terms of other types.
Figure 4. Graphs of nestedness for disease terms: Disease terms are again compositional and make use of species terms, chemical entities and protein named entities.
Only a few disease terms are composed of terms of different types.
Figure 5. Graphs of nestedness for PGNs: The diagram gives an overview of the graphs based on those PGNs that are composed of a term of a different type.
The diagram shows that a large portion of protein/gene terms contain nested terms for chemical entities, but also species terms.
Figure 6. Occurrence of terms in LexEBI according to their length: The terms (baseforms and term variants) from the different resources have been matched against the GP7 terms in LexEBI.
The results have been sorted according to the term length (x = 1 to 89) and the frequencies are presented on a logarithmic scale (y = 0 to 6.0). After sorting, the results for the terms have been grouped into bins where each bin represents terms of a given length ±1. For GP7 the overall occurrence is given; for the other resources the numbers indicate how many occurrences of a GP7 term contain a term of the alternative resource, e.g. ChEBI. A large portion of GP7 terms contain ChEBI terms and, at a lower rate, a disease or a species term. Longer terms are more likely to be composed of terms of a different semantic type. According to the annotation guidelines, species terms should not be part of the PGN.
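The binning described in the caption can be sketched as follows. This is a minimal illustration, assuming that term length is measured in characters and that a bin at length x pools all terms of length x−1 to x+1; neither detail is confirmed by the source, and the function name is hypothetical.

```python
import math
from collections import Counter


def binned_log_frequencies(terms, half_width=1):
    """Map each length x to log10 of the number of terms with length x ± half_width."""
    by_length = Counter(len(term) for term in terms)  # assumption: characters
    result = {}
    for x in range(1, max(by_length, default=0) + 1):
        count = sum(
            by_length.get(x + offset, 0)
            for offset in range(-half_width, half_width + 1)
        )
        if count:  # log10 is undefined for empty bins, so they are skipped
            result[x] = math.log10(count)
    return result


example = binned_log_frequencies(["ATP", "GTP", "zinc", "insulin"])
print(example)
```

Pooling neighbouring lengths smooths the curve, which matters most for long terms where individual lengths are sparsely populated.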
Use of baseforms in Medline and BNC.
| | Medline (2,180,887,571 tokens) | | | BNC (91,852,411 tokens) | | |
| | Exact M. | Fuzzy M. | Ratio | Exact M. | Fuzzy M. | Ratio |
| GP 7.0 | 212′114 | 403′452 | 1.90 | 10′706 | 17′300 | 1.62 |
| GP 6.0 | 196′390 | 381′665 | 1.94 | 10′073 | 16′455 | 1.63 |
| InterPro | 2′626 | 15′558 | 5.92 | 119 | 232 | 1.95 |
| Enzymes | 4′856 | 23′072 | 4.75 | 122 | 170 | 1.39 |
| ChEBI | 27′750 | 70′287 | 2.53 | 2′014 | 3′734 | 1.85 |
| Species | 107′797 | 147′106 | 1.36 | 5′374 | 10′181 | 1.89 |
The table gives an overview of the identification of unique terms from the different resources across the two literature repositories: Medline abstracts and the British National Corpus. The statistics count unique terms that have been identified at least once in the two corpora.
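The "Ratio" column is reproduced by dividing the fuzzy-match count by the exact-match count. A minimal sketch with values copied from the table (the function name is illustrative):

```python
# (exact matches, fuzzy matches) per resource and corpus.
matches = {
    "GP 7.0": {"Medline": (212_114, 403_452), "BNC": (10_706, 17_300)},
    "InterPro": {"Medline": (2_626, 15_558), "BNC": (119, 232)},
    "Species": {"Medline": (107_797, 147_106), "BNC": (5_374, 10_181)},
}


def fuzzy_gain(exact, fuzzy):
    """Factor by which fuzzy matching increases the number of unique terms found."""
    return round(fuzzy / exact, 2)


for resource, corpora in matches.items():
    for corpus, (exact, fuzzy) in corpora.items():
        print(resource, corpus, fuzzy_gain(exact, fuzzy))
```

The large gain for InterPro in Medline (5.92) compared with the BNC (1.95) shows how strongly fuzzy matching affects some resources in the biomedical corpus.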
Figure 7. Occurrence of terms in Medline, sorted by term length: The terms (baseforms and term variants) from the different resources have been matched against Medline.
The results have been sorted according to the term length and are presented on a logarithmic scale (cf. fig. 6). The left diagram counts all occurrences of a GP7 term in Medline. The term list has been manually curated to remove senseless terms with high frequencies, and all occurrences of a term in a single abstract have been counted only once (“unique terms”). A large portion of GP7 terms contain ChEBI terms and, at a lower rate, a disease or a species term. For the right diagram, every GP7 term has been counted only once across all of Medline. It becomes clear that longer PGNs contain mentions of chemical entities, and also species and disease terms, which both may have shared polysemous terms (very similar distribution values).
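The two counting schemes can be contrasted in a short sketch: counting a term once per abstract (left diagram) versus once across the whole corpus (right diagram). Case-insensitive substring search is used here purely for illustration and is an assumption; it is not the matching procedure of the study, and the function name is hypothetical.

```python
from collections import Counter


def count_terms(abstracts, terms):
    """Return (number of abstracts per term, set of terms seen at least once)."""
    per_abstract = Counter()
    for text in abstracts:
        lowered = text.lower()
        # A term counts once per abstract, however often it is repeated there.
        per_abstract.update({t for t in terms if t.lower() in lowered})
    # Counting once across the corpus reduces to the set of observed terms.
    return per_abstract, set(per_abstract)


abstracts = ["ATP binds ATP synthase.", "Insulin and ATP."]
per_abstract, seen_once = count_terms(abstracts, ["ATP", "insulin", "zinc"])
print(per_abstract, seen_once)
```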