Elizabeth T Hobbs1, Stephen M Goralski1, Ashley Mitchell1, Andrew Simpson1, Dorjan Leka1, Emmanuel Kotey1, Matt Sekira1, James B Munro2, Suvarna Nadendla2, Rebecca Jackson2, Aitor Gonzalez-Aguirre3, Martin Krallinger3,4, Michelle Giglio2, Ivan Erill1.
Abstract
Analysis of high-throughput experiments in the life sciences frequently relies upon standardized information about genes, gene products, and other biological entities. To provide this information, expert curators are increasingly relying on text mining tools to identify, extract and harmonize statements from biomedical journal articles that discuss findings of interest. For determining reliability of the statements, curators need the evidence used by the authors to support their assertions. It is important to annotate the evidence directly used by authors to qualify their findings rather than simply annotating mentions of experimental methods without the context of what findings they support. Text mining tools require tuning and adaptation to achieve accurate performance. Many annotated corpora exist to enable developing and tuning text mining tools; however, none currently provides annotations of evidence based on the extensive and widely used Evidence and Conclusion Ontology. We present the ECO-CollecTF corpus, a novel, freely available, biomedical corpus of 84 documents that captures high-quality, evidence-based statements annotated with the Evidence and Conclusion Ontology.
Keywords: annotation; biocuration; corpus; evidence; literature; text- and data mining
Year: 2021; PMID: 34327299; PMCID: PMC8313968; DOI: 10.3389/frma.2021.674205
Source DB: PubMed; Journal: Front Res Metr Anal; ISSN: 2504-0537
FIGURE 1. Schematic of the evidence statement annotation process. Boxes indicate question-based steps in the annotation process. Arrows show the alternative flow-paths. Only evidence statements that are used to support a specific set of types of assertions are annotated.
FIGURE 2. Example gene product annotations. (A) Multiple annotations in one sentence with Biological Process category. (B) Annotation with Molecular Function category. (C) Annotation with Cellular Component category. Text segments in the sentence mapping to ECO terms (green boxes) are highlighted in green. Text segments indicating the category of assertion (red boxes) are highlighted in red. The ECO term, ECO mapping confidence, Category, and Assertion Strength are displayed underneath the annotated text.
FIGURE 3. Example annotations with non-gene product annotations. (A) Annotation with Sequence Feature category. (B) Annotation with Phenotype/Traits category. (C) Annotation with Taxonomy/Phylogeny category. Text segments in the sentence mapping to ECO terms (green boxes) are highlighted in green. Text segments indicating the category to annotate (red boxes) are highlighted in red. The ECO term, ECO mapping confidence, Category, and Assertion Strength are displayed underneath the annotated text.
FIGURE 4. Example ontology IC calculation. The tree diagram depicts an example ontology with IC values calculated for each node. The RoW (Rest of World) node designates any entities not represented in ECO. An example of IC calculation for node C is shown in the top-right inset.
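The kind of per-node calculation that Figure 4 illustrates can be sketched in a few lines. This record does not give the formula, so the sketch assumes the common corpus-based definition of information content, IC(node) = -log2(p(node)), where p(node) is the share of all usage counts that fall on the node or any of its descendants. The toy tree, the node names other than RoW, and all counts are hypothetical and are not taken from the figure.

```python
import math

# Hypothetical toy ontology as child -> parent (None marks the root).
# "RoW" stands in for entities not represented in the ontology.
PARENT = {
    "root": None,
    "A": "root",
    "RoW": "root",
    "B": "A",
    "C": "A",
}

# Hypothetical direct usage counts per node (assumed values, not from the paper).
COUNTS = {"root": 0, "A": 1, "RoW": 4, "B": 1, "C": 2}


def cumulative_counts(parent, counts):
    """Credit each node's direct count to the node and to all its ancestors."""
    total = {node: 0 for node in parent}
    for node, n in counts.items():
        cur = node
        while cur is not None:
            total[cur] += n
            cur = parent[cur]
    return total


def information_content(parent, counts):
    """Return IC = -log2(p) per node, assuming every cumulative count is > 0."""
    cum = cumulative_counts(parent, counts)
    root = next(node for node, par in parent.items() if par is None)
    return {node: -math.log2(c / cum[root]) for node, c in cum.items()}


if __name__ == "__main__":
    for node, ic in information_content(PARENT, COUNTS).items():
        print(f"{node}: IC = {ic:.2f}")
```

With these assumed counts, node C accounts for 2 of the 8 total counts, so its IC is -log2(2/8) = 2 bits. The root always has IC 0 under this definition, and rarer, deeper terms receive higher values.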
Examples of sentences not appropriate for curation, with reasons.
|
|
| "We extracted 50 nucleotides directly upstream from each captured 5′-end, resulting in 1,451 sequences derived from the (delta)hrpL-FLAG sample and 1,472 sequences from the hrpL sample (overlapping sequences within a sample were merged) and used the sequences as input to MEME |
|
|
| “We found that compared to that of wild type, toxR-lacZ expression was reduced in aphB mutants, while expression of aphB from a plasmid in this mutant restored toxR expression ( |
|
|
| “To confirm that |
|
|
| "Moreover, the inability to observe direct EspR-dependent regulation at some major EspR binding sites suggests that EspR has no or little effect on these genes in the conditions tested or that other regulators counter-balance the effect of increased EspR levels." |
|
|
| "As expected, the ompF promoter activity (beta-galactosidase activity) decreased significantly in DeltaompR relative to WT grown at high medium osmolarity (0.5 M sorbitol); however, it showed almost no difference between WT and C-ompR, thereby confirming that the ompR mutation was nonpolar." |
|
|
| "Although the scan matched all annotated and new candidate hrp promoters identified in this study, the model did not match any other region in the genome that showed enrichment in the ChIP-Seq experiment (Evalue cut-off = 0.001, 245 promoter candidates in total)." |
|
|
| Sentence #1: "The stacking energy profiles of |
| Sentence #2: "In contrast, the stacking energy profiles of |
| Sentence #3: "These results suggest that despite the variability of the nucleotide composition of the |
ECO-CollecTF corpus statistics.
| Number of unique documents | 84 |
| Number of annotated documents | 282 |
| Number of annotatable sentences | 19,702 |
| Number of annotations (total) | 2,565 |
| Number of consecutive sentence annotations | 908 |
| Number of sentences annotated (when split) | 2,774 |
| Average number of annotations per document | 9.1 |
| Number of unique ECO terms used | 146 |
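The reported average can be checked against the other rows of the table. The short calculation below uses only the figures listed above. It shows that 9.1 corresponds to the total number of annotations divided by the 282 annotated documents, not by the 84 unique documents. The variable names are illustrative.

```python
# Figures copied from the corpus statistics table.
annotations_total = 2565
annotated_documents = 282
unique_documents = 84

# Average over annotated documents (matches the reported value).
avg_per_annotated = annotations_total / annotated_documents

# Average over unique documents, shown only for comparison.
avg_per_unique = annotations_total / unique_documents

print(f"per annotated document: {avg_per_annotated:.1f}")
print(f"per unique document:    {avg_per_unique:.1f}")
```

Rounded to one decimal place, the two averages are 9.1 and 30.5.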
FIGURE 5. Annotation counts per ECO term.
FIGURE 6. KwIC scores for different probabilities of success. p is the probability of success, which provides the term distance between the two simulated curators; p = 1.0 is perfect agreement. KwIC was averaged over 100 simulated annotated corpora at each p. Bars show the standard deviation of the KwIC scores.