Rafael Carreira, Sónia Carneiro, Rui Pereira, Miguel Rocha, Isabel Rocha, Eugénio C Ferreira, Anália Lourenço.
Abstract
BACKGROUND: Automated extraction systems have become a time-saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes.
Year: 2011 PMID: 22122862 PMCID: PMC3259143 DOI: 10.1186/1471-2105-12-460
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The hierarchical structure of the biological concepts. The hierarchy covers biological concepts that characterize the organizational structure of microbial systems, the physiological conditions that affect them and the laboratory techniques used to identify their underlying properties. Note: The supercategory genetic information carrier is used only for organisational purposes, i.e. the category is never used as an annotation category.
Inter-annotator agreement during training
| Training cycle 1 | Training cycle 2 | Training cycle 3 |
|---|---|---|
| - | 5.74% | 21.74% |
| - | 55.81% | 65.63% |
| 69.39% | 63.26% | 83.08% |
| - | 52.71% | 48.45% |
| 41.03% | 53.78% | 65.28% |
| 0% | 38.46% | 20.51% |
| 45.28% | 65.36% | 71.54% |
| 0% | 0% | 0% |
| 27.85% | 40.94% | 42.51% |
| 23.01% | 48.98% | 48.52% |
Columns report the F-scores obtained for each semantic category after a cycle of training. Note that the annotation of concepts for categories dna and rna (under the supercategory genetic information carrier, which also includes the category gene) and protein (that aggregates the subcategories enzyme and transcription factor) was considered only after the first training cycle.
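As a rough illustration of how a per-category F-score between two annotators can be computed, the sketch below treats annotator A as the reference and counts only exact matches on document, span and category. The tuple layout, the exact-match criterion and the function name `f_scores_by_category` are assumptions made for this example; the record does not state which matching criteria were actually used.

```python
from collections import defaultdict


def f_scores_by_category(annotations_a, annotations_b):
    """Return {category: F-score} for two annotators.

    Each argument is a set of (doc_id, start, end, category) tuples
    (a hypothetical layout). Annotator A is treated as the reference;
    with exact matching the F-score is symmetric, so swapping the
    annotators gives the same values.
    """
    by_cat_a, by_cat_b = defaultdict(set), defaultdict(set)
    for ann in annotations_a:
        by_cat_a[ann[3]].add(ann[:3])
    for ann in annotations_b:
        by_cat_b[ann[3]].add(ann[:3])

    scores = {}
    for cat in set(by_cat_a) | set(by_cat_b):
        a, b = by_cat_a[cat], by_cat_b[cat]
        matched = len(a & b)  # spans both annotators put in this category
        precision = matched / len(b) if b else 0.0
        recall = matched / len(a) if a else 0.0
        total = precision + recall
        scores[cat] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy data, not taken from the corpus.
annotator_a = {
    ("doc1", 0, 4, "gene"),
    ("doc1", 10, 14, "gene"),
    ("doc1", 20, 25, "protein"),
}
annotator_b = {
    ("doc1", 0, 4, "gene"),
    ("doc1", 20, 25, "enzyme"),
}
scores = f_scores_by_category(annotator_a, annotator_b)
```

In the toy data the span at 20-25 is labelled protein by one annotator and enzyme by the other, so both categories score 0 even though the span itself was found by both.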
Inter-annotator agreement for the 130 full-texts
| Final F-score |
|---|
| 13.22% |
| 59.69% |
| 91.78% |
| 42.15% |
| 63.33% |
| 28.13% |
| 63.90% |
| 0% |
| 46.50% |
| 38.34% |
F-scores were estimated for each semantic category.
Annotator assignments per category
| Annotator B | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | ||||||||||||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | |||
| 0 | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 23 | ||
| 0 | 0 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 59 | |||
| 0 | 0 | 4 | 5 | 0 | 2 | 0 | 0 | 0 | 53 | |||
| 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 52 | |||
| 0 | 1 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 27 | |||
| 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 75 | |||
| 0 | 0 | 0 | 0 | 48 | 0 | 0 | 0 | 0 | 13 | |||
| 0 | 0 | 0 | 1 | 1 | 0 | 2 | 0 | 1 | 121 | |||
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 210 | |||
| 44 | 52 | 119 | 73 | 117 | 9 | 330 | 1 | 187 | 100 | |||
Cells represent the assignments of both annotators in terms of the number of biological concepts and the corresponding number of annotations (shown in parentheses). Consensual assignments, i.e. assignments to the same category, appear on the diagonal of the table in bold; discrepancies in category assignment appear in the off-diagonal cells; and the pseudo-category "None" represents all assignments made by only one of the annotators. For instance, the top left-hand cell indicates that annotators agreed on 8 biological concepts for the category dna, corresponding to a total of 2316 annotations.
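A table of this kind can be tallied from the two annotators' category assignments, as in the minimal sketch below. It counts concepts only (not annotations), and the input layout (one dict per annotator mapping a concept to its category) and the name `assignment_matrix` are assumptions for illustration rather than the authors' procedure.

```python
from collections import Counter

NONE = "None"  # pseudo-category for concepts assigned by only one annotator


def assignment_matrix(concepts_a, concepts_b):
    """Count (category from A, category from B) pairs over all concepts.

    Each argument maps a concept to the category chosen by that annotator.
    A concept missing from one annotator is paired with "None".
    """
    counts = Counter()
    for concept in set(concepts_a) | set(concepts_b):
        pair = (concepts_a.get(concept, NONE), concepts_b.get(concept, NONE))
        counts[pair] += 1
    return counts


# Toy data with placeholder concept names, not taken from the corpus.
annotator_a = {"term1": "gene", "term2": "protein", "term3": "gene"}
annotator_b = {"term1": "gene", "term2": "enzyme", "term4": "enzyme"}
matrix = assignment_matrix(annotator_a, annotator_b)

# Diagonal cells (same category on both sides) are the consensual assignments.
agreed = sum(n for (cat_a, cat_b), n in matrix.items() if cat_a == cat_b)
```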
General statistics about the corpus of full-text documents
| Categories | # concepts | # annotations | % concepts | % annotations | Annotation frequency | Concept distribution |
|---|---|---|---|---|---|---|
| Genetic Information Carrier | 126 | 3771 | 3.45% | 6.39% | 29.93 | 8.87 |
| | 119 | 3970 | 3.26% | 6.72% | | 8.38 |
| | 1175 | 8770 | 32.20% | 14.85% | 7.46 | |
| Protein | 175 | 2332 | 4.80% | 3.95% | 13.33 | 28.69 |
| | 388 | 4025 | 10.63% | 6.82% | 10.37 | 63.61 |
| | 47 | 1434 | 1.29% | 2.43% | 30.51 | 7.70 |
| | 767 | | 21.02% | 36.27% | 27.92 | |
| | 403 | 10166 | 11.04% | 17.22% | 25.23 | |
| | 449 | 3161 | 12.30% | 5.35% | 7.04 | |
The first statistics depict the number and percentage of biological concepts and associated annotations, and the frequency of annotations per category. Besides individual categories, there are hierarchically structured annotation categories: the categories dna, rna and gene belong to the supercategory genetic information carrier; and the categories protein, enzyme and transcription factor are subcategories of protein. For these categories, the concept distribution of a category is calculated by dividing the number of biological concepts assigned to the category by the total number of biological concepts assigned to its supercategory.
Legend: The symbol "#" stands for "number of" and the symbol "%" stands for "percentage of". Frequencies are calculated as follows:
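The formula itself is not present in this record. The values in the table are consistent with the following reading, offered here as an assumption: annotation frequency is the number of annotations divided by the number of concepts, and concept distribution is the category's share of the concepts in its supercategory, expressed as a percentage. The function names are hypothetical.

```python
def annotation_frequency(n_annotations, n_concepts):
    """Average number of annotations per biological concept (assumed definition)."""
    return n_annotations / n_concepts


def concept_distribution(n_concepts, n_concepts_in_supercategory):
    """Percentage of the supercategory's concepts that fall in this category."""
    return 100 * n_concepts / n_concepts_in_supercategory


# Row labelled "Genetic Information Carrier": 126 concepts, 3771 annotations.
print(round(annotation_frequency(3771, 126), 2))  # 29.93

# Row labelled "Protein": 175 concepts out of 175 + 388 + 47 in its group.
print(round(concept_distribution(175, 175 + 388 + 47), 2))  # 28.69
```

Both printed values match the corresponding cells of the table, which supports this reading but does not confirm that it is the exact formula used.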
General statistics about agreement rates and concept assignments for the two corpora
| Abstracts: F-score | Abstracts: final number of biological concepts | Full-texts: F-score | Full-texts: final number of biological concepts |
|---|---|---|---|
| 30.77% | 25 | 13.22% | 126 |
| 81.82% | 32 | 59.69% | 119 |
| 87.84% | 73 | 91.78% | 1175 |
| 45.16% | 35 | 42.15% | 175 |
| 70.18% | 67 | 63.33% | 388 |
| 20% | 17 | 28.13% | 47 |
| 83.09% | 188 | 63.90% | 767 |
| 0% | -(*) | 0% | -(*) |
| 46.63% | 145 | 46.50% | 403 |
| 75.27% | 58 | 38.34% | 449 |
The F-score columns refer to the F-score values achieved for the 130 documents after training and before post-processing; and the final number of biological concepts is calculated after post-processing.
(*) This biological concept was not included in the final corpora. See the Post-processing sub-section for more details.