Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, Lawrence E Hunter.
Abstract
BACKGROUND: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
Year: 2012 PMID: 22776079 PMCID: PMC3476437 DOI: 10.1186/1471-2105-13-161
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Counts of annotations
| Annotation pass | Total annotations | Average per article | Median per article | Minimum per article | Maximum per article |
| --- | --- | --- | --- | --- | --- |
| ChEBI | 8,137 | 121 | 94 | 11 | 486 |
| CL | 5,760 | 86 | 58 | 0 | 435 |
| Entrez Gene | 12,277 | 183 | 155 | 3 | 543 |
| GO BP [a] | 16,184 | 241 | 194 | 14 | 738 |
| GO CC | 8,354/4,707 [b] | 125/70 | 97/51 | 9/0 | 499/322 |
| GO MF | 4,062 | 61 | 42 | 2 | 403 |
| NCBITaxon [c] | 7,449 | 111 | 91 | 12 | 378 |
| PRO | 15,594 | 233 | 207 | 4 | 704 |
| SO [d] | 22,090 | 330 | 328 | 72 | 935 |
| all | 99,907 | 1,491 [e] | | | |

[a] We are still in the process of reviewing and editing the GO BP & MF annotations for the official 1.0 version release; therefore, the statistics for these will likely change. We will update annotation statistics on the project Web site as needed.
[b] We have calculated statistics for the GO CC project both with and without the annotations of cell (GO:0005623), as these account for over half of the annotations of this project. These annotations skew the statistics, and because cell is such a trivial concept that is also being annotated in the CL project, users may wish to exclude them when training and evaluating systems.
[c] In addition to the hundreds of thousands of organism entries, the NCBI Taxonomy also has a small taxonomy of types of biological taxa (e.g., phylum, genus, subgenus). For the NCBI Taxonomy pass, there are also a small number of annotations of the mentions of these taxonomic concepts in the articles; however, we have excluded these from the statistics.
[d] For the SO statistics, the independent_continuant annotations (as described in the Methodology) were excluded from the analysis.
[e] The averages of the total number of annotations per article and of unique concepts per article were calculated simply by adding up the averages for each terminological annotation pass.
Counts of annotations and of average, median, minimum, and maximum counts of annotations per article for the 67 articles constituting the initial public release of the CRAFT Corpus.
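The "all" row can be checked directly against the per-pass rows. The following is a minimal Python sketch that sums the total annotation counts and the per-article averages from the table, using the GO CC figures that include the annotations of cell. The variable names are illustrative only and are not part of the CRAFT distribution.

```python
# Per-pass (total annotations, average per article), copied from the table.
# GO CC uses the figures that include annotations of cell (GO:0005623).
counts = {
    "ChEBI": (8137, 121),
    "CL": (5760, 86),
    "Entrez Gene": (12277, 183),
    "GO BP": (16184, 241),
    "GO CC": (8354, 125),
    "GO MF": (4062, 61),
    "NCBITaxon": (7449, 111),
    "PRO": (15594, 233),
    "SO": (22090, 330),
}

# Total for the "all" row: sum of the per-pass totals.
total_annotations = sum(total for total, _ in counts.values())

# Average for the "all" row: sum of the per-pass averages,
# as described in the note on that row.
summed_average = sum(avg for _, avg in counts.values())

print(total_annotations)  # 99907
print(summed_average)     # 1491
```

Because each per-pass average is rounded, the summed average is an approximation of the total divided by the 67 articles; here the two agree to the nearest whole number.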
Counts of unique annotated concepts
| Annotation pass | Unique concepts | Average per article | Median per article | Minimum per article | Maximum per article |
| --- | --- | --- | --- | --- | --- |
| ChEBI | 553 | 32 | 28 | 4 | 90 |
| CL | 155 | 7 | 6 | 0 | 22 |
| Entrez Gene | 1,024 | 18 | 17 | 1 | 95 |
| GO BP | 758 | 40 | 40 | 11 | 91 |
| GO CC | 213/212 | 12/11 | 10/9 | 1/0 | 33/32 |
| GO MF | 318 | 13 | 12 | 1 | 37 |
| NCBITaxon | 149 | 11 | 10 | 3 | 49 |
| PRO | 889 | 18 | 19 | 1 | 44 |
| SO | 260 | 41 | 43 | 8 | 89 |
| all | 4,319 | 192 | | | |
Counts of unique mentioned concepts and of average, median, minimum, and maximum counts of unique mentioned concepts per article for the 67 articles constituting the initial public release of the CRAFT Corpus.
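To illustrate how per-article statistics of this kind can be derived, and why excluding a single concept such as cell (GO:0005623) lowers the GO CC unique-concept count by exactly one, here is a minimal Python sketch. The article IDs and annotation lists are hypothetical toy data, not taken from CRAFT, and `unique_concept_stats` is an illustrative helper rather than a tool shipped with the corpus.

```python
from statistics import mean, median

# Hypothetical toy data: article ID -> concept IDs annotated in that article.
# GO:0005623 is cell; the other IDs stand in for arbitrary GO CC concepts.
annotations = {
    "article_1": ["GO:0005623", "GO:0005634", "GO:0005634", "GO:0005737"],
    "article_2": ["GO:0005623", "GO:0005623"],
    "article_3": ["GO:0005634", "GO:0005739", "GO:0005623"],
}

def unique_concept_stats(per_article, exclude=frozenset()):
    """Return the corpus-wide unique-concept count and per-article statistics."""
    # One set of distinct concept IDs per article, minus any excluded IDs.
    unique_sets = [
        {c for c in concepts if c not in exclude}
        for concepts in per_article.values()
    ]
    per_article_counts = [len(s) for s in unique_sets]
    return {
        "unique_in_corpus": len(set().union(*unique_sets)),
        "average": mean(per_article_counts),
        "median": median(per_article_counts),
        "minimum": min(per_article_counts),
        "maximum": max(per_article_counts),
    }

with_cell = unique_concept_stats(annotations)
without_cell = unique_concept_stats(annotations, exclude={"GO:0005623"})
```

In this toy data, dropping cell reduces the corpus-wide unique count from 4 to 3 and the per-article minimum from 1 to 0, which mirrors the pattern of the paired GO CC figures in the table.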
Figure 1. IAA statistics for ChEBI, GO BP/MF, and GO CC markup. Plot of inter-annotator agreement (IAA) versus number of training sessions/meetings (approximately weekly) for annotation of the corpus with the ChEBI ontology, GO BP & MF, and GO CC. IAA has been calculated as F-score, which is the harmonic mean of precision and recall.
Figure 2. IAA statistics for CL, NCBITaxon, and SO markup. Plot of inter-annotator agreement (IAA) versus number of training sessions/meetings (approximately weekly) for annotation of the corpus with the SO, CL, and NCBI Taxonomy. IAA has been calculated as F-score, which is the harmonic mean of precision and recall.
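The F-score named in the figure captions can be sketched as follows. This Python example makes two assumptions that the captions do not state: one annotator is treated as the reference, and two annotations match only when both span and concept ID are identical. The function name and the sample annotations are hypothetical.

```python
def f_score(annotations_a, annotations_b):
    """F-score between two annotators' sets of (start, end, concept_id) tuples.

    Assumption: annotator A is the reference, and an annotation matches
    only if its span and concept ID are both identical.
    """
    a, b = set(annotations_a), set(annotations_b)
    matches = len(a & b)
    if matches == 0:
        return 0.0
    precision = matches / len(b)  # share of B's annotations also made by A
    recall = matches / len(a)     # share of A's annotations also made by B
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (start offset, end offset, concept ID).
annotator_a = {
    (0, 4, "CL:0000540"),
    (10, 17, "GO:0005634"),
    (25, 30, "SO:0000704"),
}
annotator_b = {
    (0, 4, "CL:0000540"),
    (10, 17, "GO:0005634"),
    (40, 45, "SO:0000704"),
    (50, 55, "CHEBI:15377"),
}

score = f_score(annotator_a, annotator_b)  # 2 matches: P = 2/4, R = 2/3
```

Swapping the two annotators exchanges precision and recall but leaves their harmonic mean unchanged, so the choice of reference annotator does not affect the resulting value.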
Concept annotation attributes of corpora
| Corpus | Words/tokens | Component documents | Domain | Annotation schema | Concept annotations |
| --- | --- | --- | --- | --- | --- |
| CRAFT Corpus (full/initial release) | ~790,000/~560,000 | 97/67 articles | sources of MGI annotations of mouse genes/gene products | Open Biomedical Ontologies (CL, ChEBI, SO, PRO, GO BP/CC/MF, NCBITaxon), Entrez Gene | ~140,000/~100,000 |
| ABGene | | 4,265 sentences | | n/a | ~8,200 |
| BioInfer | ~34,000/~30,000 [f] | 1,100 sentences | protein-protein interactions | ~100 entity classes, ~100 relationships | ~6,300 named entities, ~2,700 relationships [g] |
| CALBC corpus | ~16,000,000 | 150,000 abstracts | immunology | UniProt, NCBITaxon, UMLS [h] | ~2,700,000 |
| CLEF Corpus | | various [i] | clinical/cancer data | 6 concept types | |
| FetchProt Corpus | | 200 articles | protein tyrosine kinase activity | 10 concept types, UniProt | ~3,800 |
| 4th i2b2/VA Challenge Corpus | | ~750 discharge summaries | clinical data | 3 concept types | ~2,000 |
| GENETAG | ~548,000 | 20,000 sentences | | n/a | ~25,000 genes/proteins, ~19,000 alternative lexical forms |
| GENIA 3.0 | ~440,000 | 2,000 abstracts | human blood-cell transcription factors | 35 entity classes, 34 process classes | ~93,000 entities, ~36,000 events |
| GREC | | 240 abstracts | | 433 classes | ~5,000 |
| ITI TXM PPI/TE Corpora | ~2,000,000/~1,900,000 | 217/238 articles | protein-protein interactions/tissue expression | 9/13 concept types, Entrez Gene, RefSeq [j], ChEBI, MeSH, NCBITaxon [k] | ~160,000/~164,000 |
| MedPost | ~156,000 | | | | |
| OntoNotes 2.0 | ~500,000 | 1,000 newswire documents | English & Chinese news | 1000s of WordNet senses, 50 concept types [l] | ~58,000 verbs [m] |
| PennBioIE Oncology/CYP v1.0 Corpora | ~381,000 (~327,000)/~313,000 (~274,000) | 1,414/1,100 abstracts | medical genetics of oncology/inhibition of cytochrome P450 enzymes | n/a | |
| Yapex Corpus | | 200 abstracts | protein-protein interactions | n/a | ~3,700 |

[f] BioInfer has ~34,000 tokens total, and ~30,000 excluding punctuation.
[g] BioInfer has ~6,300 named-entity annotations and ~2,700 annotations of what are termed relationships but that might more properly be conceptualized as process or state classes and thus are included here, totaling ~9,000 concept annotations.
[h] In the CALBC corpus, NCBI Taxonomy and UMLS concepts were respectively used to mark up species and disease mentions.
[i] The CLEF Corpus is composed of many types of medical documents: 2 entire patient records (themselves composed of 9 narratives, 1 imaging report, 7 histopathology reports, and associated data) and 50 each of clinical narratives, histopathology reports, and imaging reports.
[j] The annotators of the ITI TXM Corpora attempted to assign Entrez Gene IDs to gene annotations and RefSeq IDs to annotations of proteins, mRNAs, and cDNAs (although it is admitted that this assignment was very time-consuming and thus was not performed on the training subset of the PPI Corpus).
[k] The annotators of the ITI TXM Corpora used ChEBI, MeSH, and NCBI Taxonomy concepts for drug, tissue, and sequence mentions.
[l] In OntoNotes, the 700 most frequent polysemous verbs and 1,100 most frequent polysemous nouns have been annotated with the appropriate senses of WordNet 2.0, so the size of the schema (i.e., the total number of senses of these 1,800 words) likely numbers in the thousands; however, they note that this is different from their ontological annotation, for which only approximately 50 concept types are being used to subsume the annotated word senses.
[m] In addition to ~58,000 annotated verbs, OntoNotes has an unstated but presumably large count of annotated nouns.
A summary of counts of words/tokens, of counts and types of component documents, of domains, and of counts of concept annotations for the CRAFT Corpus and related corpora.
Statistics for MGI annotations and articles
| GO (only) | 7,263 | 1,249 | 27 |
| MP (only) | 10,469 | 2,699 | 66 |
| GO & MP | 2,174 | 633 | 5 |
Statistics for MGI annotations and articles for the assembly of the CRAFT Corpus.