Karin Verspoor, Kevin Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer, Lawrence E Hunter.
Abstract
BACKGROUND: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Year: 2012 PMID: 22901054 PMCID: PMC3483229 DOI: 10.1186/1471-2105-13-207
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Inter-annotator agreement of syntactic annotation of the CRAFT corpus
| | |||||||
|---|---|---|---|---|---|---|---|
| Recall | 91.02 | 92.31 | 89.39 | 95.92 | 94.98 | 93.16 | 67.46 |
| Precision | 90.58 | 90.18 | 90.13 | 94.98 | 94.58 | 94.39 | 33.68 |
Semantic class groupings for CRAFT
| STAR | (any class) |
| GENE | gene |
| POLY | polypeptide |
| | QTL |
| | cDNA |
| | gene |
| GENESTAR | gene or polypeptide |
| | gene or polypeptide or macromolecular complex |
| | gene or transcript or polypeptide |
| | gene or transcript or macromolecular complex |
| | macromolecular complex |
| | polypeptide |
| | polypeptide or macromolecular complex |
| POLYSTAR | promoter |
| | transcript |
| | transcript or polypeptide |
| | transcript or polypeptide or macromolecular complex |
Precision/Recall/F1-score results for gene mention detection over CRAFT development set: ABNER with distributed model trained on BioCreative I using different evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.35 | 0.46 | 0.40 | 0.12 | 0.31 | 0.18 | 0.20 | 0.62 | 0.30 |
| overlap | 0.50 | 0.69 | 0.58 | 0.23 | 0.64 | 0.34 | 0.23 | 0.74 | 0.35 |
| shared | 0.49 | 0.65 | 0.56 | 0.22 | 0.57 | 0.32 | 0.23 | 0.73 | 0.35 |
| subspan | 0.50 | 0.69 | 0.58 | 0.23 | 0.64 | 0.34 | 0.23 | 0.74 | 0.35 |
Precision/Recall/F1-score results for gene mention detection over CRAFT initial release: ABNER with distributed model trained on BioCreative I using different evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.32 | 0.36 | 0.34 | 0.11 | 0.29 | 0.16 | 0.16 | 0.41 | 0.23 |
| overlap | 0.48 | 0.55 | 0.51 | 0.19 | 0.52 | 0.28 | 0.21 | 0.57 | 0.31 |
| shared | 0.46 | 0.53 | 0.50 | 0.18 | 0.50 | 0.27 | 0.21 | 0.55 | 0.30 |
| subspan | 0.47 | 0.55 | 0.51 | 0.19 | 0.52 | 0.28 | 0.21 | 0.57 | 0.31 |
Precision/Recall/F1-score results for gene mention over CRAFT development set: ABNER with distributed model trained on NLPBA using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.38 | 0.44 | 0.41 | 0.15 | 0.58 | 0.24 | 0.15 | 0.33 | 0.21 |
| overlap | 0.47 | 0.55 | 0.51 | 0.17 | 0.69 | 0.28 | 0.21 | 0.46 | 0.29 |
| shared | 0.46 | 0.54 | 0.50 | 0.17 | 0.67 | 0.27 | 0.21 | 0.45 | 0.29 |
| subspan | 0.47 | 0.55 | 0.51 | 0.17 | 0.69 | 0.28 | 0.21 | 0.46 | 0.29 |
Precision/Recall/F1-score results for gene mention over CRAFT initial release set: ABNER with distributed model trained on NLPBA using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.30 | 0.34 | 0.32 | 0.11 | 0.41 | 0.17 | 0.13 | 0.29 | 0.18 |
| overlap | 0.39 | 0.44 | 0.41 | 0.14 | 0.57 | 0.23 | 0.15 | 0.36 | 0.22 |
| shared | 0.38 | 0.42 | 0.40 | 0.14 | 0.54 | 0.22 | 0.15 | 0.36 | 0.21 |
| subspan | 0.38 | 0.43 | 0.41 | 0.14 | 0.57 | 0.23 | 0.15 | 0.36 | 0.22 |
Precision/Recall/F1-score results for gene mention over CRAFT development set: BANNER with distributed model trained on BioCreative II using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.38 | 0.61 | 0.47 | 0.16 | 0.49 | 0.25 | 0.20 | 0.78 | 0.32 |
| overlap | 0.49 | 0.80 | 0.61 | 0.25 | 0.77 | 0.38 | 0.22 | 0.85 | 0.35 |
| shared | 0.49 | 0.79 | 0.60 | 0.25 | 0.75 | 0.37 | 0.22 | 0.85 | 0.35 |
| subspan | 0.49 | 0.80 | 0.61 | 0.25 | 0.76 | 0.38 | 0.22 | 0.85 | 0.35 |
Precision/Recall/F1-score results for gene mention over CRAFT initial release: BANNER with distributed model trained on BioCreative II using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.35 | 0.51 | 0.41 | 0.14 | 0.42 | 0.21 | 0.18 | 0.60 | 0.28 |
| overlap | 0.46 | 0.69 | 0.56 | 0.20 | 0.63 | 0.30 | 0.22 | 0.76 | 0.34 |
| shared | 0.46 | 0.68 | 0.55 | 0.20 | 0.61 | 0.30 | 0.22 | 0.74 | 0.34 |
| subspan | 0.46 | 0.69 | 0.56 | 0.20 | 0.63 | 0.30 | 0.22 | 0.75 | 0.34 |
Precision/Recall/F1-score results for gene mention over CRAFT development set: LingPipe with distributed model trained on Genia using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.29 | 0.38 | 0.33 | 0.10 | 0.25 | 0.14 | 0.30 | 0.37 | 0.33 |
| shared | 0.35 | 0.47 | 0.40 | 0.14 | 0.34 | 0.20 | 0.36 | 0.45 | 0.40 |
| subspan | 0.36 | 0.48 | 0.41 | 0.14 | 0.36 | 0.20 | 0.37 | 0.47 | 0.41 |
| overlap | 0.36 | 0.48 | 0.41 | 0.14 | 0.36 | 0.20 | 0.37 | 0.47 | 0.41 |
Precision/Recall/F1-score results for gene mention over CRAFT initial release set: LingPipe with distributed model trained on Genia using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.21 | 0.31 | 0.25 | 0.07 | 0.22 | 0.11 | 0.22 | 0.31 | 0.26 |
| shared | 0.27 | 0.41 | 0.33 | 0.09 | 0.28 | 0.14 | 0.22 | 0.31 | 0.26 |
| subspan | 0.28 | 0.42 | 0.33 | 0.09 | 0.29 | 0.14 | 0.29 | 0.41 | 0.34 |
| overlap | 0.28 | 0.42 | 0.33 | 0.09 | 0.29 | 0.14 | 0.29 | 0.41 | 0.34 |
Precision/Recall/F1-score results for gene mention over CRAFT development set: LingPipe with distributed model trained on GeneTag using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.26 | 0.69 | 0.38 | 0.12 | 0.60 | 0.20 | 0.12 | 0.61 | 0.20 |
| shared | 0.31 | 0.83 | 0.45 | 0.15 | 0.80 | 0.26 | 0.16 | 0.79 | 0.26 |
| subspan | 0.32 | 0.86 | 0.46 | 0.16 | 0.85 | 0.27 | 0.16 | 0.85 | 0.27 |
| overlap | 0.32 | 0.86 | 0.46 | 0.16 | 0.85 | 0.27 | 0.16 | 0.85 | 0.27 |
Precision/Recall/F1-score results for gene mention over CRAFT initial release set: LingPipe with distributed model trained on GeneTag using indicated evaluation mapping strategies
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|
| strict | 0.22 | 0.63 | 0.33 | 0.08 | 0.56 | 0.15 | 0.10 | 0.58 | 0.17 |
| shared | 0.30 | 0.85 | 0.44 | 0.12 | 0.84 | 0.22 | 0.14 | 0.84 | 0.24 |
| subspan | 0.30 | 0.86 | 0.45 | 0.13 | 0.86 | 0.22 | 0.14 | 0.86 | 0.24 |
| overlap | 0.30 | 0.86 | 0.45 | 0.13 | 0.87 | 0.22 | 0.14 | 0.87 | 0.25 |
Annotation comparison strategies
| Strategy | Description |
|---|---|
| strict | Requiring matches at both the left and right edges of the name span |
| overlap | Allowing any degree of overlap between the system-identified name span and the gold standard name span |
| shared | Requiring a match at only one edge of the name span, either left or right |
| subspan | Subsumption, where the boundaries of the system-identified name are within the span of the gold standard annotation, or vice versa |
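The four comparison strategies can be sketched as simple predicates over character-offset spans. This is a minimal illustration, not the evaluation code used in the study: the names `strict`, `overlap`, `shared` and `subspan` come from the row labels of the results tables, and pairing them with the four descriptions is an assumption. The `Span` type, the half-open offset convention, and both function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def spans_match(system: Span, gold: Span, strategy: str) -> bool:
    """Return True if a system span counts as a match for a gold span."""
    s_start, s_end = system
    g_start, g_end = gold
    if strategy == "strict":
        # Both edges must agree.
        return s_start == g_start and s_end == g_end
    if strategy == "overlap":
        # Any shared character is enough.
        return s_start < g_end and g_start < s_end
    if strategy == "shared":
        # At least one edge (left or right) must agree.
        return s_start == g_start or s_end == g_end
    if strategy == "subspan":
        # One span lies entirely inside the other, in either direction.
        system_in_gold = g_start <= s_start and s_end <= g_end
        gold_in_system = s_start <= g_start and g_end <= s_end
        return system_in_gold or gold_in_system
    raise ValueError(f"unknown strategy: {strategy}")


def precision_recall_f1(system_spans: List[Span], gold_spans: List[Span],
                        strategy: str) -> Tuple[float, float, float]:
    """Score system spans against gold spans under one matching strategy."""
    # A system span is correct if it matches any gold span (precision side).
    matched_system = sum(
        any(spans_match(s, g, strategy) for g in gold_spans) for s in system_spans
    )
    # A gold span is found if any system span matches it (recall side).
    matched_gold = sum(
        any(spans_match(s, g, strategy) for s in system_spans) for g in gold_spans
    )
    precision = matched_system / len(system_spans) if system_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under these definitions every strict match is also a shared match, and every shared or subspan match between non-empty spans is also an overlap match, which is consistent with the overlap rows never scoring below the other rows in the tables.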
Precision/Recall/F1-score results for gene mention over CRAFT development set: ABNER with model retrained from the CRAFT public release data set
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| overlap | 0.72 | 0.40 | 0.51 | 0.86 | 0.33 | 0.48 | 0.78 | 0.40 | 0.53 | 0.56 | 0.04 | 0.07 | 0.64 | 0.06 | 0.11 |
| shared | 0.72 | 0.40 | 0.51 | 0.86 | 0.33 | 0.48 | 0.78 | 0.40 | 0.53 | 0.56 | 0.04 | 0.07 | 0.64 | 0.06 | 0.11 |
| subspan | 0.72 | 0.40 | 0.51 | 0.86 | 0.33 | 0.48 | 0.78 | 0.40 | 0.53 | 0.56 | 0.04 | 0.07 | 0.64 | 0.06 | 0.11 |
| strict | 0.63 | 0.35 | 0.45 | 0.83 | 0.31 | 0.46 | 0.73 | 0.38 | 0.50 | 0.50 | 0.03 | 0.06 | 0.63 | 0.06 | 0.11 |
Precision/Recall/F1-score results for gene mention over CRAFT development set: LingPipe with model retrained from the CRAFT public release data set
| Strategy | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strict | 0.60 | 0.64 | 0.62 | 0.50 | 0.73 | 0.59 | 0.49 | 0.75 | 0.59 | 0.21 | 0.34 | 0.26 | 0.23 | 0.34 | 0.27 |
| subspan | 0.62 | 0.67 | 0.64 | 0.52 | 0.77 | 0.62 | 0.52 | 0.79 | 0.62 | 0.21 | 0.34 | 0.26 | 0.23 | 0.35 | 0.28 |
| shared | 0.62 | 0.67 | 0.64 | 0.52 | 0.77 | 0.62 | 0.52 | 0.79 | 0.62 | 0.21 | 0.34 | 0.26 | 0.23 | 0.35 | 0.28 |
| overlap | 0.62 | 0.67 | 0.64 | 0.52 | 0.77 | 0.62 | 0.52 | 0.79 | 0.62 | 0.21 | 0.34 | 0.26 | 0.23 | 0.35 | 0.28 |
Sentence boundary detection results on the CRAFT public release data (70% set)
| LingPipe | 0.98 | 0.98 | 0.98 |
| OpenNLP | 0.87 | 0.74 | 0.80 |
| UIMA-native | 0.85 | 0.75 | 0.80 |
Tokenization results on the CRAFT public release data (70% set)
| UCompare OpenNLP | 0.95 | 0.86 | 0.90 |
| UIMA-native | 0.96 | 0.93 | 0.95 |
| PennBio | 0.92 | 0.91 | 0.91 |
| Offset Tokenizer | 0.97 | 0.80 | 0.88 |
Part of speech tagging results on the CRAFT public release data (70% set)
| LingPipe (Brown model) | 0.59 (0.90) | 0.58 (0.84) | 0.59 (0.87) |
| LingPipe (MedPost model) | 0.47 (0.88) | 0.46 (0.83) | 0.46 (0.85) |
| LingPipe (Genia model) | 0.79 (0.88) | 0.76 (0.85) | 0.77 (0.87) |
| OpenNLP | 0.82 (0.86) | 0.74 (0.77) | 0.78 (0.81) |
Numbers in parentheses indicate the upper-bound performance potential of the tools, calculated by removing occurrences of tags that did not align to the gold-standard tagset.
Results of constituent parsers using their distributed non-biomedical models on the CRAFT release set; labeled bracket precision (LB-P), recall (LB-R) and F-score (LB-F)
| Parser | LB-P | LB-R | LB-F | |
|---|---|---|---|---|
| Berkeley | 58.35 | 61.05 | 59.67 | 24 |
| Bikel | 63.34 | 65.27 | 64.29 | 5 |
| Charniak-Johnson | 56.97 | 49.92 | 53.21 | 166 |
| Enju | 57.76 | 59.87 | 58.80 | 612 |
| Mogura | 47.45 | 55.65 | 51.22 | 105 |
| Stanford 1.6 | 57.70 | 62.31 | 59.92 | 4 |
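As a rough illustration of the labeled bracket metrics, the sketch below scores one predicted tree against its gold tree by treating each constituent as a `(label, start, end)` triple and counting triples that appear in both. This is a simplified, PARSEVAL-style calculation under stated assumptions: the source does not say which scoring tool or conventions (for example, punctuation handling) were used, and the function name and bracket representation are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical bracket representation: (label, start token index, end token index).
Bracket = Tuple[str, int, int]


def labeled_bracket_scores(gold_brackets: Iterable[Bracket],
                           predicted_brackets: Iterable[Bracket]
                           ) -> Tuple[float, float, float]:
    """Return (LB-P, LB-R, LB-F) as fractions between 0 and 1."""
    gold = Counter(gold_brackets)
    pred = Counter(predicted_brackets)
    # Multiset intersection: a bracket is matched at most as often as it
    # occurs in both the gold and the predicted tree.
    matched = sum((gold & pred).values())
    n_pred = sum(pred.values())
    n_gold = sum(gold.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

For instance, with gold brackets `S(0,5)`, `NP(0,2)`, `VP(2,5)` and predicted brackets `S(0,5)`, `NP(0,2)`, `VP(3,5)`, `PP(3,5)`, two brackets match, so precision is 2/4 and recall is 2/3.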
Results of constituent parsers using their distributed non-biomedical models on the CRAFT development set; labeled bracket precision (LB-P), recall (LB-R) and F-score (LB-F)
| Parser | LB-P | LB-R | LB-F | |
|---|---|---|---|---|
| Berkeley | 61.60 | 64.50 | 63.02 | 4 |
| Bikel | 63.97 | 65.82 | 64.89 | 2 |
| Charniak-Johnson | 62.51 | 65.55 | 64.00 | 59 |
| Enju | 71.93 | 43.56 | 54.26 | 8 |
| Mogura | 54.74 | 43.25 | 48.32 | 8 |
| Stanford 1.6 | 60.76 | 64.70 | 62.67 | 3 |
Results of constituent parsers using their distributed biomedical models on the CRAFT development set; labeled bracket precision (LB-P), recall (LB-R) and F-score (LB-F)
| Parser | LB-P | LB-R | LB-F | |
|---|---|---|---|---|
| Charniak-Johnson | 56.08 | 61.10 | 58.48 | 0 |
| Charniak-Lease | 55.53 | 59.77 | 57.57 | 2 |
| Mogura | 54.21 | 44.09 | 48.63 | 8 |
| Stanford 1.6.6 | 61.10 | 62.65 | 61.87 | 2 |
Results of constituent parsers using their distributed biomedical models on the CRAFT release set; labeled bracket precision (LB-P), recall (LB-R) and F-score (LB-F)
| Parser | LB-P | LB-R | LB-F | |
|---|---|---|---|---|
| Charniak-Johnson | 51.23 | 55.99 | 53.50 | 0 |
| Charniak-Lease | 53.28 | 57.43 | 55.28 | 8 |
| Mogura | 47.55 | 56.27 | 51.54 | 105 |
| Stanford 1.6.6 | 59.49 | 61.81 | 60.63 | 10 |
Results of constituent parsers using retrained CRAFT models for each CRAFT fold and the development set compared to untrained results on the development set; labeled bracket precision (LB-P), recall (LB-R) and F-score (LB-F)
| Parser | Metric | | | | | | | | Untrained (dev) |
|---|---|---|---|---|---|---|---|---|---|
| Berkeley | LB-P | 82.75 | 92.02 | 84.63 | 83.70 | 83.85 | 85.39 | 83.98 | 61.60 |
| | LB-R | 82.64 | 90.82 | 84.01 | 83.29 | 82.88 | 84.73 | 83.20 | 64.50 |
| | LB-F | 82.70 | 91.41 | 84.32 | 83.49 | 83.36 | 85.06 | 83.59 | 63.02 |
| Bikel | LB-P | 80.49 | 81.10 | 81.18 | 80.77 | 91.43 | 82.99 | 80.86 | 63.97 |
| | LB-R | 79.68 | 79.77 | 80.10 | 80.46 | 91.06 | 82.21 | 80.44 | 65.82 |
| | LB-F | 80.08 | 80.43 | 80.64 | 80.62 | 91.24 | 82.60 | 80.65 | 64.89 |
| Stanford | LB-P | 75.65 | 75.86 | 77.71 | 76.21 | 77.86 | 76.65 | 76.17 | 60.76 |
| | LB-R | 76.81 | 76.84 | 78.65 | 77.24 | 77.85 | 77.48 | 75.92 | 64.70 |
| | LB-F | 76.23 | 76.34 | 78.18 | 76.72 | 77.86 | 77.07 | 76.04 | 62.67 |
Micro-averaged results for dependency parsers on the CRAFT folds and dev set compared to untrained results on dev set; labeled attachment score (LAS), unlabeled attachment score (UAS), labeled accuracy score (LS)
| LAS | 85.81 | 86.29 | 87.08 | 86.13 | 86.26 | 86.34 | 86.04 | 69.78 |
| UAS | 87.94 | 88.43 | 89.16 | 88.18 | 88.16 | 88.39 | 87.91 | 73.42 |
| LS | 92.19 | 92.74 | 93.12 | 92.78 | 92.80 | 92.75 | 92.75 | 82.01 |
| LAS | 85.65 | 86.37 | 86.89 | 86.08 | 86.29 | 86.28 | 86.70 | |
| UAS | 87.96 | 88.57 | 89.04 | 88.21 | 88.43 | 88.46 | 88.86 | 75.08 |
| LS | 92.09 | 92.95 | 93.24 | 92.91 | 92.92 | 92.86 | 93.37 | 83.26 |
| LAS | 86.46 | 86.99 | 87.94 | 87.12 | 87.23 | | | 70.43 |
| UAS | 88.23 | 88.81 | 89.62 | 88.82 | 88.86 | 88.89 | 89.11 | 73.62 |
| LS | 92.71 | 93.33 | 93.93 | 93.47 | 93.66 | 93.45 | 93.99 | 83.09 |
Macro-averaged results for dependency parsers on the CRAFT folds and dev set compared to untrained results on dev set; labeled attachment score (LAS), unlabeled attachment score (UAS), labeled accuracy score (LS)
| LAS | 88.45 | 88.70 | 89.62 | 89.12 | 88.85 | 88.97 | 88.93 | 72.40 |
| UAS | 90.33 | 90.63 | 91.50 | 90.94 | 90.51 | 90.80 | 90.72 | 75.90 |
| LS | 93.43 | 93.78 | 94.23 | 94.16 | 93.93 | 93.92 | 94.03 | 82.73 |
| LAS | 88.30 | 88.85 | 89.58 | 89.12 | 88.90 | 88.98 | 89.36 | |
| UAS | 90.37 | 90.83 | 91.50 | 91.04 | 90.82 | 90.93 | 91.31 | 79.42 |
| LS | 93.32 | 94.06 | 94.37 | 94.25 | 93.98 | 94.03 | 94.52 | 85.73 |
| LAS | 89.09 | 89.43 | 90.33 | 89.86 | 89.59 | | | 74.56 |
| UAS | 90.66 | 91.09 | 91.81 | 91.42 | 91.08 | 91.23 | 91.63 | 77.78 |
| LS | 93.89 | 94.37 | 94.88 | 94.65 | 94.57 | 94.50 | 94.99 | 85.17 |
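The three dependency metrics and the two averaging schemes can be sketched as follows. LAS counts tokens whose head and label are both correct, UAS counts tokens whose head is correct, and LS counts tokens whose label is correct. The sketch assumes that micro-averaging pools tokens over all sentences and that macro-averaging takes the mean of per-sentence scores; the source does not spell out the averaging unit, so that reading, the data layout, and the function names are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical token representation: (head index, dependency label).
Arc = Tuple[int, str]
Sentence = List[Arc]


def sentence_counts(gold: Sentence, pred: Sentence) -> Tuple[int, int, int, int]:
    """Return (LAS hits, UAS hits, LS hits, token count) for one sentence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentences must have the same length")
    las = sum(1 for g, p in zip(gold, pred) if g == p)        # head and label
    uas = sum(1 for g, p in zip(gold, pred) if g[0] == p[0])  # head only
    ls = sum(1 for g, p in zip(gold, pred) if g[1] == p[1])   # label only
    return las, uas, ls, len(gold)


def micro_macro_scores(gold_sents: List[Sentence],
                       pred_sents: List[Sentence]) -> Dict[str, Dict[str, float]]:
    """Return micro- and macro-averaged LAS/UAS/LS as percentages."""
    totals = [0, 0, 0]               # pooled hit counts for the micro average
    per_sentence_sums = [0.0, 0.0, 0.0]  # summed per-sentence rates for the macro average
    total_tokens = 0
    n_sentences = 0
    for gold, pred in zip(gold_sents, pred_sents):
        *counts, n = sentence_counts(gold, pred)
        if n == 0:
            continue  # empty sentences contribute to neither average
        for i, c in enumerate(counts):
            totals[i] += c
            per_sentence_sums[i] += c / n
        total_tokens += n
        n_sentences += 1
    if n_sentences == 0:
        raise ValueError("no non-empty sentences to score")
    names = ("LAS", "UAS", "LS")
    micro = {name: 100.0 * totals[i] / total_tokens for i, name in enumerate(names)}
    macro = {name: 100.0 * per_sentence_sums[i] / n_sentences
             for i, name in enumerate(names)}
    return {"micro": micro, "macro": macro}
```

Under this reading, short sentences weigh as much as long ones in the macro average. That would be one possible explanation for the macro figures in the tables sitting above the micro figures, if shorter sentences are parsed more accurately.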
Parsing accuracy of constituency parsers, evaluated on their generated dependency correspondences
| (Micro) LAS | 76.97 | 60.21 |
| (Micro) UAS | 88.11 | 70.66 |
| (Micro) LS | 83.13 | 72.68 |
| (Macro) LAS | 80.34 | 65.19 |
| (Macro) UAS | 91.04 | 75.57 |
| (Macro) LS | 84.98 | 75.63 |
| (Micro) LAS | 72.13 | 58.42 |
| (Micro) UAS | 83.22 | 68.83 |
| (Micro) LS | 80.12 | 72.12 |
| (Macro) LAS | 75.87 | 62.10 |
| (Macro) UAS | 86.57 | 71.98 |
| (Macro) LS | 82.40 | 73.85 |
Distribution of data across the folds
| Fold 0 | 3,066 | 11532192 - 15005800 |
| Fold 1 | 3,990 | 15040800 - 15630473 |
| Fold 2 | 3,951 | 15676071 - 16110338 |
| Fold 3 | 3,723 | 16121255 - 16507151 |
| Fold 4 | 4,200 | 16539743 - 17083276 |
| Training | 18,930 | 11532192 - 17083276 |
| Development | 2,780 | 17194222 - 17696610 |