Jin Mao, Lisa R Moore, Carrine E Blank, Elvis Hsin-Hui Wu, Marcia Ackerman, Sonali Ranade, Hong Cui.
Abstract
BACKGROUND: The large-scale analysis of phenomic data (i.e., full phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system needed to be developed to extract phenotypic characters from large numbers of taxonomic descriptions so they can be used as input to existing phylogenetic analysis software packages.
Keywords: Algorithm evaluation; Character matrices; Information extraction; Machine learning; Microbial phenotypes; Natural language processing; Phenotypic data extraction; Prokaryotic taxonomic descriptions; Support vector machine; Text mining
Year: 2016 PMID: 27955641 PMCID: PMC5153691 DOI: 10.1186/s12859-016-1396-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
High-level categories (indicated by stars) and characters implemented in MicroPIE
| Categories*/characters | Example source sentences |
|---|---|
| %G + C | DNA G + C content is |
| Cell Shape | Cells are |
| Cell Diameter | Cells are |
| Cell Length | In glucose broth, the bacilli are longer, up to |
| Cell Width | In addition, cells have an outer diameter of 1.5–3.0 μm and width of |
| Cell Relationship & Aggregations | A few strains grow |
| Gram Stain Type | The cells are |
| External features | Cells are frequently occurring in chains and |
| Internal features | |
| Motility | Cells are |
| Pigment Compounds | |
| NaCl Minimum | Growth requires |
| NaCl Optimum | |
| NaCl Maximum | |
| pH Minimum | Growth occurs at temperatures in the range |
| pH Optimum | |
| pH Maximum | |
| Temperature Minimum | |
| Temperature Optimum | |
| Temperature Maximum | |
| Salinity Requirement for Growth | |
| Aerophilicity | |
| Magnesium Requirement for Growth | |
| Vitamins and Cofactors Used For Growth | |
| Antibiotic Sensitivity | Sensitive to (μg per disc) |
| Antibiotic Resistant | The type strain is resistant to |
| Colony Shape | On MA, colonies are |
| Colony Margin | Colonies are golden-yellow, circular and convex, with an |
| Colony Texture | On MA, colonies are convex, |
| Colony Color | Colonies are |
| Fermentation Products | |
| Other Metabolic Products | |
| Pathogenic | Pathogenic for |
| Disease Caused | Pathogenic for honeybees in natural and experimental |
| Pathogen Target Organ | Nodosus is infected |
| Haemolytic/Haemadsorption Properties | |
| Organic Compounds Used Or Hydrolyzed | Utilize |
| Organic Compounds Not Used Or Not Hydrolyzed | Arabinose, mannose, N-acetylglucosamine, maltose are used as sole carbon and energy source but not |
| Inorganic Substances Used | Does not require yeast extract for growth, and can use |
| Inorganic Substances Not Used | Does not require |
| Fermentation Substrates Used | Ferments |
| Fermentation Substrates Not Used | And no acid is produced from |
Example source sentences for each character within each category are provided. Bolded text in the source sentences indicates the values that MicroPIE should extract
Fig. 1 An example input description to MicroPIE, simplified from [61], used with permission
Fig. 2 Part of a hypothetical output matrix. The row in bold corresponds to the description in Fig. 1
Fig. 3 System architecture of MicroPIE
Fig. 4 A shared extractor based on linguistic rules for the characters Cell Diameter, Cell Width, and Cell Length
Fig. 5 An example of a syntactic pattern used in MicroPIE
Examples of how performance evaluation metrics were calculated
| Example # | Character | GSM value | # GSM values | Extracted value | # extracted values | Rigid hit score | Relaxed hit score |
|---|---|---|---|---|---|---|---|
| 1 | %G + C | 55.2 │ mol% | 1 | 55.2 │ mol% | 1 | 1 | 1 |
| 2 | Organic Compounds NOT Used or NOT Hydrolyzed | esculin | 1 | Neither lactate nor pyruvate | 1 | 0 | 0 |
| 3 | Cell Shape | short plump │ rods | 1 | plump │ rods # short | 2 | 0.5 | 1 |
| 4 | Motility | not │ motile by gliding | 1 | not │ motile | 1 | 0.5 | 0.5 |
| 5 | Fermentation Substrates Used | arbutin # salicin # D-raffinose # D-mannose # sucrose # melibiose | 6 | melibiose # sucrose # D-mannose # D-raffinose # salicin # Most strains ferment arbutin | 6 | 5.5 | 6 |
| Total | | | 10 | | 11 | 7.5 | 8.5 |
Rigid and relaxed hit scores measuring the match between extracted values and gold standard matrix (GSM) values, illustrated with examples
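The totals in the example table can be turned into precision, recall, and F1 with a few lines of arithmetic. The sketch below assumes the usual information-extraction definitions (precision = hit score / number of extracted values, recall = hit score / number of GSM values, F1 = harmonic mean of the two); the function name `precision_recall_f1` is hypothetical and is not taken from MicroPIE.

```python
def precision_recall_f1(hit_score, n_extracted, n_gsm):
    """Return (P, R, F1) for one hit score.

    Assumption: P = hit_score / n_extracted and R = hit_score / n_gsm,
    with F1 as their harmonic mean. Zero denominators yield 0.0.
    """
    precision = hit_score / n_extracted if n_extracted else 0.0
    recall = hit_score / n_gsm if n_gsm else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Totals from the example table: 10 GSM values, 11 extracted values,
# rigid hit score 7.5, relaxed hit score 8.5.
N_GSM, N_EXTRACTED = 10, 11
rigid = precision_recall_f1(7.5, N_EXTRACTED, N_GSM)
relaxed = precision_recall_f1(8.5, N_EXTRACTED, N_GSM)

print("rigid   P=%.2f R=%.2f F1=%.2f" % rigid)
print("relaxed P=%.2f R=%.2f F1=%.2f" % relaxed)
```

Under these assumed definitions the five examples together give rigid P/R/F1 of about 0.68/0.75/0.71 and relaxed P/R/F1 of about 0.77/0.85/0.81.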
Performance of MicroPIE with Character Predictor
| Character | Extraction methods | # of GSM Values | # of MicroPIE Output Values | P | R | F1 | Relaxed_P | Relaxed_R | Relaxed_F1 |
|---|---|---|---|---|---|---|---|---|---|
| | linguistic rules N | 90 | 96 | 0.91 | 0.97 | 0.94 | 0.91 | 0.97 | |
| Cell Shape | term matching | 125 | 166 | 0.49 | 0.65 | 0.56 | 0.64 | 0.84 | 0.73 |
| | linguistic rules N | 14 | 18 | 0.67 | 0.86 | 0.75 | 0.72 | 0.93 | |
| | linguistic rules N | 68 | 68 | 0.89 | 0.89 | 0.89 | 0.93 | 0.93 | |
| | linguistic rules N | 56 | 58 | 0.91 | 0.95 | 0.93 | 0.93 | 0.96 | |
| | term matching | 25 | 27 | 0.72 | 0.78 | 0.75 | 0.82 | 0.88 | |
| | term matching | 64 | 62 | 1.00 | 0.97 | 0.98 | 1.00 | 0.97 | |
| External Features | term matching | 23 | 21 | 0.55 | 0.50 | 0.52 | 0.62 | 0.57 | 0.59 |
| | term matching | 63 | 56 | 0.78 | 0.69 | 0.73 | 0.91 | 0.81 | |
| | term matching | 76 | 77 | 0.71 | 0.72 | 0.71 | 0.84 | 0.86 | |
| | term matching | 58 | 51 | 0.90 | 0.79 | 0.84 | 0.97 | 0.85 | |
| | linguistic rules N | 44 | 46 | 0.74 | 0.77 | 0.76 | 0.80 | 0.84 | |
| | linguistic rules N | 33 | 30 | 0.92 | 0.83 | 0.87 | 1.00 | 0.91 | |
| | linguistic rules N | 44 | 46 | 0.75 | 0.78 | 0.77 | 0.83 | 0.86 | |
| | linguistic rules N | 24 | 24 | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | |
| | linguistic rules N | 26 | 27 | 0.96 | 1.00 | 0.98 | 0.96 | 1.00 | |
| | linguistic rules N | 23 | 24 | 0.92 | 0.96 | 0.94 | 0.92 | 0.96 | |
| Temperature Minimum | linguistic rules N | 58 | 44 | 0.89 | 0.67 | 0.77 | 0.89 | 0.67 | 0.77 |
| Temperature Optimum | linguistic rules N | 62 | 40 | 1.00 | 0.65 | 0.78 | 1.00 | 0.65 | 0.78 |
| Temperature Maximum | linguistic rules N | 58 | 44 | 0.91 | 0.69 | 0.78 | 0.91 | 0.69 | 0.78 |
| Aerophilicity | term matching | 83 | 89 | 0.63 | 0.68 | 0.65 | 0.69 | 0.74 | 0.72 |
| Magnesium Requirement for Growth | term matching | 4 | 2 | 0.50 | 0.25 | 0.33 | 1.00 | 0.50 | 0.67 |
| Vitamins and Cofactors Used For Growth | term matching | 14 | 26 | 0.39 | 0.71 | 0.50 | 0.39 | 0.71 | 0.50 |
| Salinity Requirement for Growth | linguistic rule + term matching | 42 | 65 | 0.58 | 0.89 | 0.70 | 0.60 | 0.93 | 0.73 |
| | linguistic rule + term matching | 96 | 84 | 0.91 | 0.80 | 0.85 | 0.93 | 0.81 | |
| | linguistic rule + term matching | 64 | 49 | 0.96 | 0.73 | 0.83 | 0.96 | 0.73 | |
| | term matching | 102 | 98 | 0.97 | 0.94 | 0.96 | 0.98 | 0.94 | |
| | term matching | 43 | 44 | 0.89 | 0.91 | 0.90 | 0.96 | 0.98 | |
| | term matching | 69 | 75 | 0.85 | 0.92 | 0.88 | 0.86 | 0.94 | |
| Colony Color | term matching | 80 | 127 | 0.53 | 0.84 | 0.65 | 0.59 | 0.93 | 0.72 |
| Fermentation Products | linguistic rules + term matching | 127 | 141 | 0.59 | 0.66 | 0.62 | 0.64 | 0.71 | 0.67 |
| Other Metabolic Product | term matching | 13 | 56 | 0.07 | 0.31 | 0.12 | 0.07 | 0.31 | 0.12 |
| Pathogenic | term matching | 3 | 3 | 0.50 | 0.50 | 0.50 | 0.67 | 0.67 | 0.67 |
| Disease Caused | term matching | 7 | 11 | 0.27 | 0.43 | 0.33 | 0.36 | 0.57 | 0.44 |
| Pathogen Target Organ | term matching | 4 | 9 | 0.22 | 0.50 | 0.31 | 0.22 | 0.50 | 0.31 |
| Haemolytic & Haemadsorption Properties | term matching | 10 | 7 | 0.57 | 0.40 | 0.47 | 0.57 | 0.40 | 0.47 |
| Organic Compounds Used Or Hydrolyzed | term matching | 620 | 480 | 0.85 | 0.66 | 0.74 | 0.89 | 0.69 | 0.77 |
| Organic Compounds Not Used Or Not Hydrolyzed | term matching | 733 | 468 | 0.92 | 0.58 | 0.71 | 0.92 | 0.59 | 0.72 |
| Inorganic Substances Used | term matching | 36 | 45 | 0.59 | 0.74 | 0.65 | 0.61 | 0.76 | 0.68 |
| Inorganic Substances Not Used | term matching | 61 | 41 | 0.81 | 0.54 | 0.65 | 0.81 | 0.54 | 0.65 |
| Fermentation Substrates Used | linguistic rules + term matching | 411 | 629 | 0.57 | 0.88 | 0.69 | 0.59 | 0.91 | 0.72 |
| | linguistic rules + term matching | 442 | 475 | 0.85 | 0.91 | 0.88 | 0.86 | 0.93 | |
Abbreviations: superscript N, numerical character; superscript S, string-based/categorical character. Characters with a Relaxed_F1 score ≥ 0.8 are shown in bold
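A row of the performance table can be checked for internal consistency. The sketch below again assumes P = hits / number of MicroPIE output values and R = hits / number of GSM values, so that P × output count and R × GSM count should imply roughly the same hit score, and the harmonic mean of P and R should reproduce the reported F1. The helper names and the tolerances are hypothetical choices that allow for the table values being rounded to two decimals.

```python
def f1_from(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# (character, n_gsm, n_output, P, R, reported F1), rigid scores copied
# from three complete rows of the table above.
rows = [
    ("Cell Shape", 125, 166, 0.49, 0.65, 0.56),
    ("Temperature Minimum", 58, 44, 0.89, 0.67, 0.77),
    ("Colony Color", 80, 127, 0.53, 0.84, 0.65),
]


def check_row(n_gsm, n_output, p, r, f1_reported):
    """Return (hits_agree, f1_agrees) for one table row.

    Assumed tolerances: P and R are rounded to 2 decimals, so each implied
    hit score can be off by 0.005 * count; F1 is allowed to differ by 0.015.
    """
    hits_from_p = p * n_output   # hit score implied by precision
    hits_from_r = r * n_gsm      # hit score implied by recall
    hit_tol = 0.005 * (n_output + n_gsm)
    hits_agree = abs(hits_from_p - hits_from_r) <= hit_tol
    f1_agrees = abs(f1_from(p, r) - f1_reported) <= 0.015
    return hits_agree, f1_agrees


results = {name: check_row(*rest) for name, *rest in rows}
for name, (hits_ok, f1_ok) in results.items():
    print(name, "hits consistent:", hits_ok, "F1 consistent:", f1_ok)
```

For Cell Shape, for example, both routes imply a hit score of roughly 81 (0.49 × 166 and 0.65 × 125), and the harmonic mean of 0.49 and 0.65 rounds to the reported 0.56.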
Fig. 6 Performance comparison between MicroPIE with and without the Character Predictor. a Relaxed_P, b Relaxed_R, and c Relaxed_F1 scores
The performance comparison between the students and MicroPIE on 12 characters in 46 taxonomic descriptions
| Character | Student Relaxed_P | Student Relaxed_R | Student Relaxed_F1 | MicroPIE Relaxed_P | MicroPIE Relaxed_R | MicroPIE Relaxed_F1 |
|---|---|---|---|---|---|---|
| Motility | 0.35 | 0.25 | 0.29 | 0.89 | 0.86 | 0.87 |
| Pigment Compounds | 0.07 | 0.13 | 0.09 | 1.00 | 0.96 | 0.98 |
| pH Minimum | 0.58 | 0.75 | 0.65 | 0.91 | 0.91 | 0.91 |
| pH Optimum | 0.72 | 0.59 | 0.65 | 0.96 | 1.00 | 0.98 |
| pH Maximum | 0.60 | 0.79 | 0.68 | 0.91 | 0.95 | 0.93 |
| Temperature Minimum | 0.67 | 0.35 | 0.46 | 0.92 | 0.87 | 0.90 |
| Temperature Optimum | 0.63 | 0.21 | 0.31 | 1.00 | 0.88 | 0.94 |
| Temperature Maximum | 0.75 | 0.34 | 0.47 | 0.95 | 0.88 | 0.91 |
| Aerophilicity | 0.52 | 0.55 | 0.53 | 0.71 | 0.77 | 0.74 |
| Antibiotic Sensitivity | 0.57 | 0.37 | 0.45 | 0.94 | 0.87 | 0.90 |
| Antibiotic Resistant | 0.91 | 0.70 | 0.79 | 0.98 | 0.80 | 0.88 |
| Fermentation Products | 0.47 | 0.28 | 0.35 | 0.69 | 0.69 | 0.69 |
Fig. 7 Scatter plot showing the relationship between Relaxed_F1 scores and frequency of character value occurrence in GSM for each of the 42 characters. The axis of the number of GSM values is log-transformed