Sophia Ananiadou, Dan Sullivan, William Black, Gina-Anne Levow, Joseph J Gillespie, Chunhong Mao, Sampo Pyysalo, Balakrishna Kolluru, Junichi Tsujii, Bruno Sobral.
Abstract
Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems (T4SS), genes within one set of orthologs may have over a dozen different names. Classifying research publications by biological process, cellular component, molecular function, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. The recognizers are built from an annotated corpus, large domain terminological resources, and machine learning techniques. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies, and term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
Year: 2011 PMID: 21468321 PMCID: PMC3066171 DOI: 10.1371/journal.pone.0014780
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Complexity of Type IV secretion system (T4SS) architecture and nomenclature.
(A) Model of the VirB/VirD P-T4SS encoded on the pTi plasmid of Agrobacterium tumefaciens. LPS = lipopolysaccharide, OM = outer membrane, M = murein layer, IM = inner membrane, C = cytoplasm. (B) Description of the VirB/VirD proteins. (C) Diversity encompassed by the major groups of T4SSs. P, P-T4SS: top = Rickettsia prowazekii (rvh) [31],[45] bottom = Helicobacter pylori (cag pathogenicity island, cag-PAI) [46]. Genes with homology to vir genes are colored accordingly. cag-PAI genes colored gray are not known to form the T4SS scaffold, while genes colored white are involved in T4SS function but have no clear homology to vir genes. F, F-T4SS: top = Escherichia coli (tra/trb of F plasmid), bottom = Neisseria gonorrhoeae (tra/trb of gonococcal genetic island). Capital letters depict tra genes while lower case letters depict trb genes, with remaining genes given their full names. I, I-T4SS: top = tra/trb of the IncI plasmid R64, bottom = Legionella pneumophila (dot/icm) [47]. Capital letters depict icm and tra genes while lower case letters depict dot and trb genes. GI, GI-T4SS: top = Haemophilus influenzae (tfc), bottom = Salmonella enterica Typhi (tfc). NOTE: Genes of F-, I- and GI-T4SSs with homology to vir genes are colored accordingly.
Corpus statistics for T4SS concepts: Bacteria, Cellular Component (Cell. Comp.), Biological Process (Bio. Process.), Molecular Function (Molecular.Fn.).
| | Fully Manual Annotation | Acela Annotation (with Manual Seeds) |
| # Documents | 10 | 27 |
| # Pseudo-sentences | 2437 | 11914 |
| # Tokens | 63465 | 222966 |
Statistics for dictionaries extracted from domain-specific resources for each of the entity classes.
| | Bacteria | Cell. Component | Bio. Process | Mol. Function |
| Full Ontology | | | | |
| Head terms | 100255 | 2451 | 17128 | 8655 |
| Total entries | 475612 | 4383 | 50566 | 31882 |
| T4SS Branches | | | | |
| Head terms | N/A | 1418 | 2453 | 2880 |
| Total entries | N/A | 2766 | 5881 | 8369 |
All GO-related categories include terms extracted across the full Gene Ontology and for only the T4SS branches.
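As an illustration of how dictionaries like these can drive recognition, the sketch below tags token spans by greedy longest-match lookup. The matching procedure is not described in this record, so the algorithm, the function names (`build_dictionary`, `dictionary_tag`), and the four toy entries are all assumptions made for illustration, not the authors' implementation.

```python
def build_dictionary(entries):
    """Map each lower-cased, whitespace-tokenized term to its entity class."""
    return {tuple(term.lower().split()): label for term, label in entries}


def dictionary_tag(tokens, dictionary, max_len=5):
    """Greedy longest-match lookup; returns (start, end, label) token spans."""
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = dictionary.get(tuple(lowered[i:i + n]))
            if label is not None:
                match = (i, i + n, label)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched term
        else:
            i += 1
    return spans


# Hypothetical toy entries; real dictionaries would hold thousands of terms.
TOY_ENTRIES = [
    ("Agrobacterium tumefaciens", "Bacteria"),
    ("outer membrane", "Cellular Component"),
    ("DNA translocation", "Biological Process"),
    ("ATP binding", "Molecular Function"),
]

if __name__ == "__main__":
    sentence = "Agrobacterium tumefaciens mediates DNA translocation across the outer membrane"
    print(dictionary_tag(sentence.split(), build_dictionary(TOY_ENTRIES)))
```

Exact-match lookup of this kind tends to miss surface variants of a term, which is one reason low recall can be expected from a dictionary alone.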
Entity Recognition across classes contrasting dictionary-based, dictionary-based with corpus enrichment, and machine learning strategies.
| | Bacteria | | | Cellular Comp. | | | Biological Proc. | | | Molecular Fun. | | |
| # Entities | 526 | | | 2237 | | | 1870 | | | 203 | | |
| | P | R | F | P | R | F | P | R | F | P | R | F |
| Dictionary | 96 | 97 | 96 | 50 | 11 | 18 | 59 | 35 | 44 | 64 | 62 | 63 |
| Dictionary+Corpus | 96 | 97 | 97 (8) | 49 | 59 | 54 (701) | 66 | 86 | 75 (366) | 69 | 83 | 75 (71) |
| Machine Learning | 93 | 91 | 93 | 74 | 62 | 68 | 87 | 81 | 84 | 92 | 82 | 86 |
Abbreviations are as follows: P = precision, R = recall, and F = F-measure, the harmonic mean of precision and recall. The number of distinct terms added by corpus enrichment is given in parentheses.
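The F-measure defined above is the balanced harmonic mean, F = 2PR / (P + R). The short sketch below recomputes it from the reported precision and recall of the Dictionary row; the function name `f_measure` is our own. Because the table reports rounded percentages, recomputing F from the rounded P and R can differ from the reported F by a point in some cells.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (P, R) pairs taken from the Dictionary row of the table above.
    dictionary_row = {
        "Cellular Comp.": (50, 11),
        "Biological Proc.": (59, 35),
        "Molecular Fun.": (64, 62),
    }
    for name, (p, r) in dictionary_row.items():
        print(f"{name}: F = {f_measure(p, r):.1f}")
```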
Number of terms in each class for Bacteria, Cellular Component, Biological Process, and Molecular Function classes for T4SS, near-miss, and general documents.
| | T4SS Documents | ‘Near-miss’ Documents | General |
| Bacteria | 230 | 259 | 30 |
| Cellular Components | 208 | 92 | 48 |
| Biological Process | 215 | 160 | 58 |
| Molecular Function | 20 | 13 | 4 |
Numbers are scaled by corpus size for each class.
Typological breakdown of entity mention variability in typographical, morphological, syntactic, reduction, and abbreviation classes.
| | Examples |
| Typographical | Nucleotide binding, nucleotide-binding, NUCLEOTIDE-BINDING |
| Morphological | localize, localizes, localized, localization |
| Syntactic | DNA translocation, translocation of DNA, translocates DNA |
| Reduction | secretion process, secretion; ATP-binding activity, ATP-binding |
| Abbreviations | type IV secretory system, T4SS, Type IV secretion system (TFSS) |
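Of the classes above, typographical variation is the simplest to normalize. The sketch below folds case and treats hyphens as spaces so that the three typographical examples collapse to one string. The specific rules and the function name `normalize_typographical` are assumptions chosen for illustration; the normalization procedure actually used is not described in this record.

```python
import re


def normalize_typographical(mention):
    """Fold case and hyphenation so typographical variants share one form."""
    lowered = mention.lower()
    # Treat ASCII and Unicode hyphens as word separators.
    dehyphenated = re.sub(r"[-\u2010\u2011]", " ", lowered)
    # Collapse any run of whitespace to a single space.
    return re.sub(r"\s+", " ", dehyphenated).strip()


if __name__ == "__main__":
    variants = ["Nucleotide binding", "nucleotide-binding", "NUCLEOTIDE-BINDING"]
    print({normalize_typographical(v) for v in variants})
```

Morphological, syntactic, reduction, and abbreviation variants would need more than string rewriting, for example stemming or an abbreviation dictionary, and are outside this sketch.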
Impact of normalization of entity mentions expressed by reduction in number of unique strings, broken down by entity class.
| | Original | Normalized | Decrease |
| Bacteria | 55 | 40 | 27% |
| Cellular Component | 698 | 563 | 19% |
| Biological Process | 323 | 217 | 33% |
| Molecular Function | 60 | 30 | 50% |
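The Decrease column follows directly from the two counts: (Original − Normalized) / Original, expressed as a percentage. The sketch below recomputes it from the table values; the names `percent_decrease` and `COUNTS` are our own.

```python
def percent_decrease(original, normalized):
    """Percentage reduction in unique strings after normalization."""
    return 100.0 * (original - normalized) / original


# (original, normalized) unique-string counts from the table above.
COUNTS = {
    "Bacteria": (55, 40),
    "Cellular Component": (698, 563),
    "Biological Process": (323, 217),
    "Molecular Function": (60, 30),
}

if __name__ == "__main__":
    for entity_class, (original, normalized) in COUNTS.items():
        print(f"{entity_class}: {percent_decrease(original, normalized):.0f}%")
```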
Figure 2. Comparison of the effect of normalization.
Different classes of entity mention variability (Typographical, Morphological, Syntactic, Reduction, and Abbreviation) across different entity classes (Bacteria, Cellular component, Biological process, and Molecular function). The graph indicates the percentage reduction in unique strings contributed by each class of normalization process.