Michael Bada, Nicole Vasilevsky, William A Baumgartner, Melissa Haendel, Lawrence E Hunter.
Abstract
Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. 
Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Year: 2017 PMID: 31725864 PMCID: PMC7243923 DOI: 10.1093/database/bax087
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Examples of sentences (along with their PubMed IDs) with nested and nesting UBERON annotations in the CRAFT Corpus. For each annotation, the specific text span annotated, the class primary label, and the class ID are shown. Note that discontinuous annotations (i.e., annotations composed of two or more discontinuous text spans) have been created for the second and third sentences.
| Sentence | Nested/Nesting annotations |
|---|---|
| Glaucoma involves retinal ganglion cell death and optic nerve damage that is often associated with elevated intraocular pressure (IOP) [ | ‘retinal’: UBERON:retina (UBERON:0000966); ‘retinal ganglion’: UBERON:‘ganglionic layer of retina’ (UBERON:0001792) |
| (b) close-up surface view of ventricle of a chimeric heart generated by aggregation of two diploid morulae, one hemizygous for the CK6/ECFP (ECFP+) transgene and the other hemizygous for the YC5/EYFP (EYFP+) transgene. (PMID:12079497) | ‘heart’: UBERON:heart (UBERON:0000948); ‘ventricle of … heart’: UBERON:‘cardiac ventricle’ (UBERON:0002082) |
| To investigate this pattern in more detail, we hybridized a Tbx15 mRNA probe to a series of transverse sections at E12.5 and observed expression in multiple mesenchymal tissues of the head, trunk, and developing limbs (Figure 4A), much of which is consistent with the skull, cervical vertebrae and limb malformations reported for mice carrying the original droopy ear allele. (PMID:14737183) | ‘head’: UBERON:head (UBERON:0000033); ‘mesenchymal tissues of … head’: UBERON:‘head mesenchyme’ (UBERON:0005253); ‘trunk’: UBERON:trunk (UBERON:0002100); ‘mesenchymal tissues of … trunk’: UBERON:‘trunk mesenchyme’ (UBERON:0005256); ‘limbs’: UBERON:limb (UBERON:0002101); ‘mesenchymal tissues of … limbs’: UBERON:‘limb mesenchyme’ (UBERON:0009749) |
| The overall organogenesis of lungs was preserved in Dhcr7-/- pups; four right lung lobes and a single left lobe flanking the heart were easily seen on external examination at birth ( | ‘right lung’: UBERON:‘right lung’ (UBERON:0002167); ‘right lung lobes’: UBERON:‘right lung lobe’ (UBERON:0006518) |
| Fetal cholesterol can either be synthesized endogenously in fetal tissues or accrued from extra-embryonic tissues such as maternal serum, placenta and yolk sac [ | ‘embryonic’: UBERON:embryo (UBERON:0000922); ‘extra-embryonic tissues’: UBERON:‘extraembryonic tissue’ (UBERON:0005292) |
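To make the notions of nested and discontinuous annotations concrete, the following is a minimal sketch of one way such annotations could be represented as sets of character-offset spans. It is not the corpus's distribution format; the `Annotation` class, its methods, and the `span_of` helper are hypothetical names introduced only for illustration. The text fragment and class IDs are taken from the second example sentence above.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass(frozen=True)
class Annotation:
    """Hypothetical in-memory form of one concept annotation."""
    class_id: str
    label: str
    spans: Tuple[Span, ...]

    @property
    def is_discontinuous(self) -> bool:
        # A discontinuous annotation is made of two or more text spans.
        return len(self.spans) > 1

    def covered_text(self, text: str) -> str:
        # Join the annotated pieces, marking skipped text with an ellipsis.
        return " … ".join(text[start:end] for start, end in self.spans)

    def nests(self, other: "Annotation") -> bool:
        # True if every span of `other` lies inside some span of `self`
        # and the two annotations do not cover identical spans.
        if self.spans == other.spans:
            return False
        return all(
            any(s <= o_start and o_end <= e for s, e in self.spans)
            for o_start, o_end in other.spans
        )


def span_of(text: str, phrase: str) -> Span:
    # Assumption: the phrase occurs in the text; the first match is used.
    start = text.index(phrase)
    return (start, start + len(phrase))


text = "close-up surface view of ventricle of a chimeric heart"

heart = Annotation("UBERON:0000948", "heart", (span_of(text, "heart"),))
ventricle = Annotation(
    "UBERON:0002082",
    "cardiac ventricle",
    (span_of(text, "ventricle of"), span_of(text, "heart")),
)

print(ventricle.covered_text(text))
print(ventricle.is_discontinuous, ventricle.nests(heart))
```

Storing spans as a tuple lets a single annotation skip over intervening words such as "a chimeric", while a containment check over spans is enough to separate the nesting annotation from the one nested within it.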
Examples of sentences (along with their PubMed IDs) with UBERON extension class annotations in the CRAFT Corpus. For each annotation, the specific text span annotated and the extension class name are shown.
| Sentence | Extension class annotations |
|---|---|
| The C57BL/6J and 129P3/J groups consisted of approximately equal numbers of males and females. (PMID:11532192) | ‘males’: PATO_UBERON_EXT:male_or_bearer_of_maleness; ‘females’: PATO_UBERON_EXT:female_or_bearer_of_femaleness |
| These observations demonstrate that the pigmentary and craniofacial characteristics of deH are caused by loss of function for Tbx15. (PMID:14737183) | ‘craniofacial’: UBERON_EXT:face_or_skull |
| Interestingly, the level of endogenous muscle PPARδ protein in the transgenic mice was much higher than in the control littermates. (PMID:15328533) | ‘muscle’: UBERON_EXT:muscle_structure_or_tissue |
| Here, we use regulatory information from the mouse Gdf5 gene (a bone morphogenetic protein [BMP] family member) to develop new mouse lines that can be used to either activate or inactivate genes specifically in developing joints. (PMID:15492776) | ‘bone’: UBERON_EXT:bone_element_or_tissue |
| The undulations were accompanied by partial dissolution of the underlying basement membrane (Figure 3K and L). (PMID:15630473) | ‘basement membrane’: GO_UBERON_EXT:basement_membrane |
Total annotation counts and average, median and maximum annotation counts per article in the four distributed Uberon-based annotation sets of the public version of the CRAFT Corpus
| Annotation set | Total # annotations | Average # annotations per article | Median # annotations per article | Maximum # annotations per article |
|---|---|---|---|---|
| UBERON_core | 12 187 | 182 | 130 | 575 |
| UBERON_core+extensions | 14 811 | 221 | 166 | 702 |
| UBERON_core+nested | 13 625 | 203 | 137 | 739 |
| UBERON_core+extensions+nested | 16 592 | 248 | 173 | 811 |
Figure 1. Knowtator screenshot of a paragraph of an article in the public set of the CRAFT Corpus, in which each mention of an anatomical concept explicitly represented in the Uberon ontology has been annotated.
Total counts of referenced unique concepts and average, median and maximum counts per article of referenced unique concepts in the four distributed Uberon-based annotation sets of the public version of the CRAFT Corpus
| Annotation set | Total # unique concepts | Average # unique concepts per article | Median # unique concepts per article | Maximum # unique concepts per article |
|---|---|---|---|---|
| UBERON_core | 842 | 31 | 25 | 108 |
| UBERON_core+extensions | 889 | 36 | 30 | 125 |
| UBERON_core+nested | 867 | 33 | 25 | 118 |
| UBERON_core+extensions+nested | 915 | 38 | 31 | 136 |
Figure 2. Interannotator agreement statistics between the primary annotator and the annotation lead in the form of F1-measure versus article batch number.
Counts of annotations and semantic categories analogous to those in the Uberon ontology for other gold-standard corpora in which anatomical entities have been marked up
| Corpus | # Analogous anatomical annotations | # Analogous anatomical categories |
|---|---|---|
| AnEM | 1792 | 8 |
| CellFinder | 913 | 1 |
| GENIA | 1167 | 2 |
| MERLOT | 4449 | 1 |
| MiPACQ | 3652 | 1 |
| MLEE | 1346 | 8 |
Parameter settings, true positive counts (TPs), false positive counts (FPs), false negative counts (FNs), precision scores (P), recall scores (R), and F1-measure scores (F1) for ConceptMapper runs found to produce maximal P, R, and F1 scores on the publicly released UBERON_core set of concept annotations. (Each bolded number indicates the maximal score of the parameter that was optimized for the given row.)
| ConceptMapper parameter settings | TPs | FPs | FNs | P | R | F1 |
|---|---|---|---|---|---|---|
| caseMatch:CASE_SENSITIVE; findAllMatches:NO; orderIndependentLookup:OFF; searchStrategy:CONTIGUOUS_MATCH; stemmer:NONE; stopWords:NONE; synonyms:EXACT_ONLY | 5208 | 2132 | 6979 | **0.71** | 0.43 | 0.53 |
| caseMatch:CASE_IGNORE; findAllMatches:YES; orderIndependentLookup:ON; searchStrategy:SKIP_ANY_MATCH; stemmer:PORTER; stopWords:NONE; synonyms:ALL | 9057 | 39 389 | 3130 | 0.19 | **0.74** | 0.30 |
| caseMatch:CASE_INSENSITIVE; findAllMatches:NO; orderIndependentLookup:OFF; searchStrategy:CONTIGUOUS_MATCH; stemmer:PORTER; stopWords:NONE; synonyms:EXACT_ONLY | 8102 | 4325 | 4085 | 0.65 | 0.66 | **0.66** |
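The P, R, and F1 columns follow from the TP, FP, and FN counts by the standard definitions: P = TP / (TP + FP), R = TP / (TP + FN), and F1 as the harmonic mean of P and R. The sketch below recomputes the three scores from the counts reported for the UBERON_core runs. The `Counts` class and the run labels are hypothetical names used only for this illustration, and rounding to two decimal places is an assumption about how the reported scores were formatted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """True positive, false positive, and false negative counts of one run."""
    tp: int
    fp: int
    fn: int

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in terms of counts.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Counts copied from the UBERON_core table; the labels are illustrative.
runs = {
    "precision-optimized": Counts(tp=5208, fp=2132, fn=6979),
    "recall-optimized": Counts(tp=9057, fp=39389, fn=3130),
    "F1-optimized": Counts(tp=8102, fp=4325, fn=4085),
}

for name, c in runs.items():
    print(f"{name}: P={c.precision:.2f} R={c.recall:.2f} F1={c.f1:.2f}")
```

In every row, TP + FN sums to 12 187, the total number of gold-standard UBERON_core annotations, which gives a quick consistency check on the counts.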
Parameter settings, true positive counts (TPs), false positive counts (FPs), false negative counts (FNs), precision scores (P), recall scores (R), and F1-measure scores (F1) for ConceptMapper runs found to produce maximal P, R, and F1 scores on the publicly released UBERON_core + extensions set of concept annotations. (Each bolded number indicates the maximal score of the parameter that was optimized for the given row.)
| ConceptMapper parameter settings | TPs | FPs | FNs | P | R | F1 |
|---|---|---|---|---|---|---|
| caseMatch:CASE_SENSITIVE; findAllMatches:NO; orderIndependentLookup:OFF; searchStrategy:CONTIGUOUS_MATCH; stemmer:NONE; stopWords:NONE; synonyms:EXACT_ONLY | 6960 | 2389 | 7851 | **0.74** | 0.47 | 0.58 |
| caseMatch:CASE_IGNORE; findAllMatches:YES; orderIndependentLookup:ON; searchStrategy:SKIP_ANY_MATCH or SKIP_ANY_MATCH_ALLOW_OVERLAP; stemmer:BIOLEMMATIZER; stopWords:NONE; synonyms:ALL | 11 454 | 34 708 | 3357 | 0.25 | **0.77** | 0.38 |
| caseMatch:CASE_INSENSITIVE; findAllMatches:NO; orderIndependentLookup:OFF; searchStrategy:CONTIGUOUS_MATCH; stemmer:PORTER; stopWords:NONE; synonyms:EXACT_ONLY | 10 467 | 4688 | 4344 | 0.69 | 0.71 | **0.70** |