Joshua S Madin, Daniel A Nielsen, Maria Brbic, Ross Corkrey, David Danko, Kyle Edwards, Martin K M Engqvist, Noah Fierer, Jemma L Geoghegan, Michael Gillings, Nikos C Kyrpides, Elena Litchman, Christopher E Mason, Lisa Moore, Søren L Nielsen, Ian T Paulsen, Nathan D Price, T B K Reddy, Matthew A Richards, Eduardo P C Rocha, Thomas M Schmidt, Heba Shaaban, Maulik Shukla, Fran Supek, Sasha G Tetu, Sara Vieira-Silva, Alice R Wattam, David A Westfall, Mark Westoby.
Abstract
A synthesis of phenotypic and quantitative genomic traits is provided for bacteria and archaea, in the form of a scripted, reproducible workflow that standardizes and merges 26 sources. The resulting unified dataset covers 14 phenotypic traits, 5 quantitative genomic traits, and 4 environmental characteristics for approximately 170,000 strain-level and 15,000 species-aggregated records. It spans all habitats including soils, marine and fresh waters and sediments, host-associated and thermal. Trait data can find use in clarifying major dimensions of ecological strategy variation across species. They can also be used in conjunction with species and abundance sampling to characterize trait mixtures in communities and responses of traits along environmental gradients.
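
One way to read "trait mixtures in communities" is as an abundance-weighted mean of a trait across the species in a sample. The sketch below illustrates that idea only; it is not taken from the paper. The species names, abundances and trait values are hypothetical, and skipping species without a trait record is an assumption of this example.

```python
def community_weighted_mean(abundances, trait_values):
    """Abundance-weighted mean of a trait over species that have a trait value."""
    weighted_sum = 0.0
    total_weight = 0.0
    for species, abundance in abundances.items():
        value = trait_values.get(species)
        if value is None:
            # Assumption: species with no trait record are left out of the mean.
            continue
        weighted_sum += abundance * value
        total_weight += abundance
    if total_weight == 0:
        raise ValueError("no species has both an abundance and a trait value")
    return weighted_sum / total_weight


# Hypothetical optimum temperatures (degrees C) keyed by species.
optimum_tmp = {"species_a": 30.0, "species_b": 37.0, "species_c": 60.0}

# Hypothetical abundance counts for one sampled community.
sample = {"species_a": 50, "species_b": 30, "species_c": 20, "species_d": 10}

cwm = community_weighted_mean(sample, optimum_tmp)
print(round(cwm, 1))
```

Repeating the calculation for samples taken along an environmental gradient would give one value per sample, which is the kind of trait response the abstract refers to.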
Year: 2020 PMID: 32503990 PMCID: PMC7275036 DOI: 10.1038/s41597-020-0497-4
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Summary of original datasets.
| | Short name | Name | Source | Direct access | Format |
|---|---|---|---|---|---|
| 1 | Amend-shock | Energetics of overall metabolic reactions of thermophilic and hyperthermophilic Archaea and Bacteria | doi.org/10.1111/j.1574-6976.2001.tb00576.x | Yes | |
| 2 | Bacdive-microa | The Bacterial Diversity Metadatabase | bacdive.dsmz.de | Yes | txt |
| 3 | Bergeys | Bergey’s Manual of Systematic Bacteriology | doi.org/10.1002/9781118960608 | No | |
| 4 | Campedelli | Genus-wide assessment of antibiotic resistance in Lactobacillus spp. | doi.org/10.1128/AEM.01738-18 | Yes | |
| 5 | Corkrey | The Biokinetic Spectrum for Temperature | doi.org/10.1371/journal.pone.0153343.s004 | Yes | txt |
| 6 | Edwards | Nutrient utilization traits of phytoplankton | doi.org/10.6084/m9.figshare.c.3307917 | Yes | txt |
| 7 | Engqvist | Correlating enzyme annotations with a large set of microbial growth temperatures reveals metabolic adaptations to growth at diverse temperatures | doi.org/10.5281/zenodo.1175609 | Yes | tsv |
| 8 | Faprotax | Functional Annotation of Prokaryotic Taxa (FAPROTAX) | pages.uoregon.edu/slouca/LoucaLab/archive/FAPROTAX/ | Yes | txt |
| 9 | Fierer | International Journal of Systematic and Evolutionary Microbiology (IJSEM) phenotypic database | doi.org/10.6084/m9.figshare.4272392.v3 | Yes | txt |
| 10 | Genbank | Annotated DNA sequences | | Yes | txt |
| 11 | GOLD | Genomes OnLine Database | gold.jgi.doe.gov/index | No | txt |
| 12 | Jemma-refseq | Refseq data extraction based on custom text-extraction code | | No | txt |
| 13 | KEGG | Kyoto Encyclopedia of Genes and Genomes | | No | html |
| 14 | Kremer | Temperature- and size-scaling of phytoplankton population growth rates: Reconciling the Eppley curve and the metabolic theory of ecology | doi.org/10.1002/lno.10523 | Yes | txt |
| 15 | Masonmm | Maximal growth rates of various bacteria under optimal conditions | | No | |
| 16 | MediaDB | Chemically-defined growth conditions | mediadb.systemsbiology.net | Yes | sql |
| 17 | Methanogen | PhyMet2 | metanogen.biotech.uni.wroc.pl/ | Yes | txt |
| 18 | Microbe-Directory | Annotation for metagenomic taxonomic analyses | microbe.directory/ (github.com/microbe-directory/microbe-directory) | Yes | txt/sql |
| 19 | Nielsensl | Size-dependent growth rates in eukaryotic and prokaryotic algae exemplified by green algae and cyanobacteria: comparisons between unicells and colonial growth forms | doi.org/10.1093/plankt/fbi134 | No | txt |
| 20 | Pasteur | Centre de Ressources Biologiques de l’Institut Pasteur - Microorganism biobank catalogue | catalogue-crbip.pasteur.fr/recherche_catalogue.xhtml | Yes | txt |
| 21 | PATRIC | Pathosystems resource integration center | | Yes | txt |
| 22 | Prochlorococcus | Various marine cyanobacteria doubling times | Lisa Moore (co-author) | No | txt |
| 23 | ProTraits | Phenotypes assigned to microbes using machine learning and text mining | protraits.irb.hr (95% precision dataset used) | Yes | txt |
| 24 | Roden-jin | Thermodynamics of Microbial Growth Coupled to Metabolism of Glucose, Ethanol, Short-Chain Organic Acids, and Hydrogen | doi.org/10.1128/AEM.02425-10 | Yes | |
| 25 | rrnDB | Ribosomal RNA operons | rrndb.umms.med.umich.edu/static/download/ | Yes | tsv |
| 26 | Silva | Growth data for ecological metagenomics | doi.org/10.1371/journal.pgen.1000808.s005 | Yes | doc |
Fig. 1 A visual representation of the microbe trait data integration workflow for four hypothetical datasets (red, blue, green and orange). Grey bands represent consistent taxonomy and trait detail that applies across the datasets. Each of the four steps, (a) prepare, (b) combine, (c) condense traits and (d) condense to NCBI species, is summarised in the Methods and explained in detail along with scripted steps in R at the GitHub repository.
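
The four steps in Fig. 1 can be illustrated with a small toy pipeline. The sketch below is in Python rather than the R used in the authors' repository and is not their implementation. The two source layouts, the column mappings, the strain-to-species lookup and the aggregation rules (most common value for categorical traits, median for numeric traits) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical raw records from two sources with different column layouts.
RAW = {
    "red": [
        {"organism": "Escherichia coli K-12", "gram": "Negative", "opt_tmp_c": "37"},
        {"organism": "Bacillus subtilis 168", "gram": "Positive", "opt_tmp_c": "30"},
    ],
    "blue": [
        {"org_name": "Escherichia coli K-12", "gram_stain": "negative", "optimum_tmp": 36.0},
        {"org_name": "Escherichia coli O157:H7", "gram_stain": "negative", "optimum_tmp": 37.0},
    ],
}

# Hypothetical mappings from each source's columns to a shared schema.
COLUMN_MAPS = {
    "red": {"organism": "strain", "gram": "gram_stain", "opt_tmp_c": "optimum_tmp"},
    "blue": {"org_name": "strain", "gram_stain": "gram_stain", "optimum_tmp": "optimum_tmp"},
}

# Hypothetical strain-to-species lookup standing in for a taxonomy mapping.
SPECIES_OF = {
    "Escherichia coli K-12": "Escherichia coli",
    "Escherichia coli O157:H7": "Escherichia coli",
    "Bacillus subtilis 168": "Bacillus subtilis",
}


def prepare(source, rows):
    """(a) Rename columns to the shared schema and normalise value types."""
    prepared = []
    for row in rows:
        record = {COLUMN_MAPS[source][key]: value for key, value in row.items()}
        record["gram_stain"] = record["gram_stain"].lower()
        record["optimum_tmp"] = float(record["optimum_tmp"])
        prepared.append(record)
    return prepared


def combine(prepared_sources):
    """(b) Stack all prepared sources and attach a species name to each record."""
    return [
        {**record, "species": SPECIES_OF[record["strain"]]}
        for records in prepared_sources
        for record in records
    ]


def condense(records, key):
    """(c)/(d) Collapse records that share `key` into one row per group.

    Assumed rules: most common value for categorical traits, median for numeric.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return [
        {
            key: name,
            "species": members[0]["species"],
            "gram_stain": Counter(m["gram_stain"] for m in members).most_common(1)[0][0],
            "optimum_tmp": median(m["optimum_tmp"] for m in members),
        }
        for name, members in groups.items()
    ]


prepared = [prepare(source, rows) for source, rows in RAW.items()]
combined = combine(prepared)
strain_rows = condense(combined, "strain")  # (c) one row per strain
species_rows = {row["species"]: row for row in condense(strain_rows, "species")}  # (d)
print(species_rows["Escherichia coli"])
```

Condensing to strains before species keeps a heavily sampled strain from dominating the species-level value, which is one possible reason for running the two condensing steps separately.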
Summary of microbe traits including information about measurements and statistics about taxon coverage in the accompanying data records.
| | Trait name | Description | Measurement type | Units or categorical terms | Trait category | Number of observations (of possible 169743) | Number of NCBI species (of possible 14884) | Percent of NCBI species (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | Isolation source | Where the microbe was sourced from | Textual description | (Multiple, hierarchical) | Habitat | 51977 | 9776 | 65.7 |
| 2 | Gram stain | Gram positive or negative | Binary | +, − | Physiological | 44196 | 10141 | 68.1 |
| 3 | Metabolism | Oxygen usage | Categorical | Obligate aerobic, Aerobic, Facultative, Microaerophilic, Anaerobic, Obligate anaerobic | Physiological | 34585 | 9869 | 66.3 |
| 4 | Pathways | List of metabolic pathways undertaken | Categorical | (Multiple, hierarchical) | Physiological | 12076 | 3822 | 25.7 |
| 5 | Carbon substrate | List of carbon substrates that can be utilised | Categorical | (Multiple, hierarchical) | Physiological | 4684 | 4151 | 27.9 |
| 6 | Sporulation | Can produce spores | Binary | Yes, No | Physiological | 19080 | 5916 | 39.7 |
| 7 | Motility | Capacity to move | Categorical | Yes, No, Flagella, Gliding, Axial filament | Physiological | 22763 | 6596 | 44.3 |
| 8 | Salinity range | Coarse environmental preference | Categorical | Low, Moderate, High, Extreme | Environmental | 922 | 536 | 3.6 |
| 9 | Temperature range | Coarse environmental preference | Categorical | Low, Medium, High, Extreme | Environmental | 8799 | 2753 | 18.5 |
| 10 | Cell shape | The typical shape of cells | Categorical | (Multiple) | Morphological | 28326 | 6891 | 46.3 |
| 11 | Cell diameter (lower) | The lower range of cell diameters | Length | µm | Morphological | 4980 | 5726 | 38.5 |
| 12 | Cell diameter (upper) | The upper range of cell diameters | Length | µm | Morphological | 1799 | 3102 | 20.8 |
| 13 | Cell length (lower) | The lower range of cell length, where applicable based on shape | Length | µm | Morphological | 5000 | 5294 | 35.6 |
| 14 | Cell length (upper) | The upper range of cell length, where applicable based on shape | Length | µm | Morphological | 2062 | 3146 | 21.1 |
| 15 | Doubling time | Growth rate estimate based on doubling number of cells | Time | Hours | Physiological | 1134 | 917 | 6.2 |
| 16 | Genome size | Number of base pairs making up the genome | Count | Base pairs | Genomic | 108558 | 9115 | 61.2 |
| 17 | GC content | Percentage of base pairs that are guanine or cytosine | Ratio | Percentage | Genomic | 29382 | 4832 | 32.5 |
| 18 | Coding genes | The number of coding genes | Count | Base pairs | Genomic | 17531 | 2791 | 18.8 |
| 19 | Optimum pH | The preferred pH in which to live | pH | pH | Physiological | 4604 | 3927 | 26.4 |
| 20 | Optimum temperature | The preferred temperature in which to live | Temperature | Degrees C | Physiological | 15193 | 6517 | 43.8 |
| 21 | 16S rRNA genes | The number of 16S rRNA genes | Count | Base pairs | Genomic | 7246 | 2430 | 16.3 |
| 22 | tRNA genes | The number of tRNA genes | Count | Base pairs | Genomic | 12865 | 2742 | 18.4 |
| 23 | Growth temperature | Temperature for specific growth measurements | Temperature | Degrees C | Contextual (for doubling time) | 13665 | 11265 | 75.7 |
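
The last column of the table above can be reproduced from the species counts and the stated total of 14884 NCBI species. The sketch below checks a few rows; the dictionary keys are labels chosen for this example.

```python
TOTAL_SPECIES = 14884  # "of possible 14884" in the table header

# Species counts copied from the table for a few traits.
species_with_trait = {
    "isolation_source": 9776,
    "gram_stain": 10141,
    "salinity_range": 536,
    "doubling_time": 917,
}


def percent_coverage(n_species, total=TOTAL_SPECIES):
    """Share of all species with at least one record for a trait, in percent."""
    return round(100 * n_species / total, 1)


coverage = {trait: percent_coverage(n) for trait, n in species_with_trait.items()}
for trait, pct in coverage.items():
    print(f"{trait}: {pct}%")
```
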
Fig. 2 A graphical representation of data coverage and gaps for the 21 core traits mapped onto a phylogeny (black tree). The phylogeny was created by grafting star phylogenies (NCBI species to phylum) onto a recent molecular phylogeny[20] (phylum and above) and was created here purely for illustrative purposes. To avoid clutter, only the six most speciose phyla are delineated at the outer rim (>100 species). Coloured bands represent the presence of traits in the dataset for 14,884 species. In order from the centre outwards, green bands are habitat traits (isolation source, optimum pH, optimum temperature, growth temperature), blue bands are organism traits (Gram stain, metabolism, metabolic pathways, carbon substrate, sporulation, motility, doubling time, cell shape, any cell diameter), and red bands are genomic traits (genome size, GC content, coding genes, 16S rRNA genes, tRNA genes).
Fig. 3 Graphical summaries of each of the 23 traits in Online-only Table 2. Barplots are used for categorical traits and frequency histograms for continuous traits. Because of the high number of distinct metabolic pathways (>80) (d) and carbon substrates (>100) (e) in these data, each was grouped into major categories to simplify presentation: pathways were grouped by the primary compound involved, or by distinct process where no primary compound exists, and carbon substrates were grouped by chemical classification.
Summary of raw trait data points per source.
| Trait | amend-shock | bacdive-microa | campedelli | corkrey | edwards | engqvist | faprotax | fierer | genbank | gold | jemma-refseq | kegg | kremer | masonmm | mediadb | methanogen | microbe-directory | nielsensl | pasteur | patric | prochlorococcus | protraits | roden-jin | rrndb | silva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gram_stain | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25,084 | 0 | 0 | 0 | 0 | 0 | 114 | 2,335 | 0 | 0 | 13,979 | 0 | 2,266 | 0 | 0 | 0 |
| metabolism | 0 | 1,336 | 182 | 661 | 0 | 0 | 0 | 4,423 | 0 | 10,311 | 0 | 0 | 0 | 0 | 0 | 153 | 0 | 0 | 5,477 | 10,534 | 0 | 579 | 0 | 0 | 0 |
| pathways | 610 | 0 | 0 | 0 | 0 | 0 | 9,515 | 1,427 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 153 | 0 | 0 | 0 | 0 | 0 | 272 | 99 | 0 | 0 |
| carbon_substrates | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4,534 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sporulation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3,322 | 0 | 7,258 | 0 | 0 | 0 | 0 | 0 | 0 | 1,564 | 0 | 0 | 4,174 | 0 | 2,738 | 0 | 0 | 0 |
| motility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4,356 | 0 | 8,724 | 0 | 0 | 0 | 0 | 0 | 126 | 0 | 0 | 0 | 8,657 | 0 | 552 | 0 | 0 | 0 |
| range_tmp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7,833 | 0 | 0 | 0 | 0 | 0 |
| range_salinity | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 922 | 0 | 0 | 0 | 0 | 0 |
| cell_shape | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4,478 | 0 | 9,602 | 0 | 0 | 0 | 0 | 0 | 153 | 0 | 0 | 0 | 13,088 | 0 | 632 | 0 | 0 | 0 |
| isolation_source | 0 | 0 | 191 | 0 | 9 | 0 | 0 | 4,672 | 0 | 45,146 | 488 | 278 | 31 | 0 | 0 | 0 | 0 | 0 | 1,104 | 0 | 22 | 0 | 0 | 0 | 0 |
| d1_lo | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3,774 | 0 | 1,014 | 0 | 0 | 0 | 0 | 0 | 147 | 0 | 6 | 0 | 0 | 12 | 0 | 0 | 0 | 0 |
| d1_up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 926 | 0 | 708 | 0 | 0 | 0 | 0 | 0 | 147 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 |
| d2_lo | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3,794 | 0 | 1,028 | 0 | 0 | 0 | 0 | 0 | 148 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| d2_up | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1,043 | 0 | 859 | 0 | 0 | 0 | 0 | 0 | 148 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| doubling_h | 0 | 0 | 0 | 661 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 42 | 37 | 119 | 0 | 6 | 0 | 0 | 22 | 0 | 0 | 0 | 207 |
| genome_size | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11,344 | 77,307 | 1,727 | 4,664 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12,311 | 0 | 0 | 0 | 0 | 0 |
| gc_content | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11,351 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16,781 | 0 | 0 | 0 | 0 | 0 |
| coding_genes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11,251 | 0 | 1,610 | 4,670 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| optimum_tmp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4,251 | 0 | 4,539 | 0 | 0 | 0 | 0 | 0 | 152 | 1,559 | 0 | 0 | 3,963 | 0 | 0 | 0 | 0 | 0 |
| optimum_ph | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3,429 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 148 | 994 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| growth_tmp | 0 | 0 | 195 | 661 | 9 | 12,530 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 6 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 202 |
| rRNA16S_genes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1,609 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5,637 | 0 |
| tRNA_genes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11,237 | 0 | 1,610 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 |
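
Rows of the per-source table above can be totalled with a few lines of parsing. The sketch below copies two rows verbatim and strips the thousands separators before summing. For these two rows the totals equal the observation counts given for doubling time (1134) and growth temperature (13665) in the trait summary table; the same is not assumed to hold for every trait.

```python
def parse_row(line):
    """Split one pipe-delimited table row into (trait name, integer counts)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    counts = [int(cell.replace(",", "")) for cell in cells[1:]]  # "12,530" -> 12530
    return cells[0], counts


# Two rows copied from the per-source table.
ROWS = [
    "| doubling_h | 0 | 0 | 0 | 661 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 42 | 37 | 119 | 0 | 6 | 0 | 0 | 22 | 0 | 0 | 0 | 207 |",
    "| growth_tmp | 0 | 0 | 195 | 661 | 9 | 12,530 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 6 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 202 |",
]

totals = {}
contributing_sources = {}
for row in ROWS:
    trait, counts = parse_row(row)
    totals[trait] = sum(counts)
    contributing_sources[trait] = sum(1 for count in counts if count > 0)

print(totals)                # {'doubling_h': 1134, 'growth_tmp': 13665}
print(contributing_sources)  # {'doubling_h': 9, 'growth_tmp': 8}
```
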
| Measurement(s) | Trait • phenotypic trait • quantitative genomic trait |
| Technology Type(s) | digital curation |
| Factor Type(s) | habitat • species |
| Sample Characteristic - Organism | Archaea • Bacteria |
| Sample Characteristic - Location | global |