Aaron Weimann, Kyra Mooren, Jeremy Frank, Phillip B Pope, Andreas Bremges, Alice C McHardy.
Abstract
The number of sequenced genomes is growing exponentially, profoundly shifting the bottleneck from data generation to genome interpretation. Traits are often used to characterize and distinguish bacteria and are likely a driving factor in microbial community composition, yet little is known about the traits of most microbes. We describe Traitar, the microbial trait analyzer, which is a fully automated software package for deriving phenotypes from a genome sequence. Traitar provides phenotype classifiers to predict 67 traits related to the use of various substrates as carbon and energy sources, oxygen requirement, morphology, antibiotic susceptibility, proteolysis, and enzymatic activities. Furthermore, it suggests protein families associated with the presence of particular phenotypes. Our method uses L1-regularized L2-loss support vector machines for phenotype assignments based on phyletic patterns of protein families and their evolutionary histories across a diverse set of microbial species. We demonstrate reliable phenotype assignment for Traitar to bacterial genomes from 572 species of eight phyla, also based on incomplete single-cell genomes and simulated draft genomes. We also showcase its application in metagenomics by verifying and complementing a manual metabolic reconstruction of two novel Clostridiales species based on draft genomes recovered from commercial biogas reactors. Traitar is available at https://github.com/hzi-bifo/traitar.
IMPORTANCE Bacteria are ubiquitous in our ecosystem and have a major impact on human health, e.g., by supporting digestion in the human gut. Bacterial communities can also aid in biotechnological processes such as wastewater treatment or decontamination of polluted soils. Diverse bacteria contribute with their unique capabilities to the functioning of such ecosystems, but lab experiments to investigate those capabilities are labor-intensive. Major advances in sequencing techniques open up the opportunity to study bacteria by their genome sequences. For this purpose, we have developed Traitar, software that predicts traits of bacteria on the basis of their genomes. It is applicable to studies with tens or hundreds of bacterial genomes. Traitar may help researchers in microbiology to pinpoint the traits of interest, reducing the amount of wet lab work required.
Keywords: ancestral trait reconstruction; genotype-phenotype inference; metagenomics; microbial traits; phenotypes; phyletic patterns; single-cell genomics; support vector machines
Year: 2016 PMID: 28066816 PMCID: PMC5192078 DOI: 10.1128/mSystems.00101-16
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1 Traitar can be used to phenotype microbial community members on the basis of genomes recovered from single-cell sequencing or (metagenomic) environmental shotgun sequencing data or of microbial isolates. Traitar provides classification models based on protein family annotation for a wide variety of different phenotypes related to the use of various substrates as source of carbon and energy for growth, oxygen requirement, morphology, antibiotic susceptibility, and enzymatic activity.
The 67 traits available in Traitar for phenotyping (we grouped each of these phenotypes into a microbiological or biochemical category)
| Phenotype | Category |
|---|---|
| Alkaline phosphatase | Enzyme |
| Beta-hemolysis | |
| Coagulase production | |
| Lipase | |
| Nitrate-to-nitrite conversion | |
| Nitrite to gas | |
| Pyrrolidonyl-β-naphthylamide | |
| Bile susceptible | Growth |
| Colistin-polymyxin susceptible | |
| DNase | |
| Growth at 42°C | |
| Growth in 6.5% NaCl | |
| Growth in KCN | |
| Growth on MacConkey agar | |
| Growth on ordinary blood agar | |
| Mucate utilization | |
| Arginine dihydrolase | Growth, amino acid |
| Indole | |
| Lysine decarboxylase | |
| Ornithine decarboxylase | |
| Acetate utilization | Growth, carboxylic acid |
| Citrate | |
| Malonate | |
| Tartrate utilization | |
| Gas from glucose | Growth, glucose |
| Glucose fermenter | |
| Glucose oxidizer | |
| Methyl red | |
| Voges-Proskauer | |
| Cellobiose | Growth, sugar |
| Esculin hydrolysis | |
| Glycerol | |
| Lactose | |
| Maltose | |
| Melibiose | |
| ONPG | |
| Raffinose | |
| Salicin | |
| Starch hydrolysis | |
| Sucrose | |
| Trehalose | |
| Urea hydrolysis | |
| Bacillus or coccobacillus | Morphology |
| Coccus | |
| Coccus—clusters or groups predominate | |
| Coccus—pairs or chains predominate | |
| Gram negative | |
| Gram positive | |
| Motile | |
| Spore formation | |
| Yellow pigment | |
| Aerobe | Oxygen |
| Anaerobe | |
| Capnophilic | |
| Facultative | |
| Catalase | Oxygen, enzyme |
| Oxidase | |
| Hydrogen sulfide | Product |
| Casein hydrolysis | Proteolysis |
| Gelatin hydrolysis | |
GIDEON phenotypes with at least 10 presence and 10 absence labels.
Phenotypes assigned to microbiological/biochemical categories.
ONPG, o-nitrophenyl-β-d-galactopyranoside.
FIG 2 Work flow of Traitar. Input to the software can be genome sequence samples in nucleotide or amino acid FASTA format. Traitar predicts phenotypes on the basis of precomputed classification models and provides graphic and tabular output. In the case of nucleotide sequence input, the protein families that are important for the phenotype predictions will be further mapped to the predicted protein-coding genes.
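The prediction step of this work flow can be pictured with a minimal sketch. It assumes that each genome has already been reduced to a list of protein family identifiers, that the phyletic pattern is a binary presence/absence vector, and that each precomputed model is a linear classifier with one weight per family plus a bias. The function names, family identifiers, and weights below are hypothetical illustrations, not Traitar's API or its trained models.

```python
from typing import Dict, Iterable, List


def phyletic_pattern(hits: Iterable[str], families: List[str]) -> List[int]:
    """Encode a genome as 1/0 presence of each protein family, in a fixed order."""
    present = set(hits)
    return [1 if fam in present else 0 for fam in families]


def predict_phenotype(pattern: List[int], weights: List[float], bias: float) -> bool:
    """Linear decision rule: phenotype-positive when w . x + b > 0."""
    score = sum(w * x for w, x in zip(weights, pattern)) + bias
    return score > 0


# Hypothetical family identifiers and model parameters, for illustration only.
families = ["FAM_A", "FAM_B", "FAM_C"]
weights = [1.2, 0.0, 0.8]  # L1 regularization tends to set many weights to zero
bias = -1.5

genome_hits: Dict[str, List[str]] = {
    "sample_1": ["FAM_A", "FAM_C", "FAM_C"],  # repeated hits collapse to presence
    "sample_2": ["FAM_B"],
}

predictions = {
    name: predict_phenotype(phyletic_pattern(hits, families), weights, bias)
    for name, hits in genome_hits.items()
}
```

In this toy setup only families with nonzero weight influence the call, which is also why such families are natural candidates to report alongside a prediction.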
Evaluation of the Traitar phypat and phypat+PGL phenotype classifiers and a consensus vote of both classifiers for 234 bacteria described in GIDEON in a 10-fold nested cross-validation by using different evaluation measures
| Data set (no. of bacteria) and classifier | Macroaccuracy | Accuracy | Recall, phenotype positive | Recall, phenotype negative |
|---|---|---|---|---|
| GIDEON I (234) | ||||
| Phypat | 82.6 | 88.1 | 86.1 | 91.4 |
| Phypat+PGL | 90.9 | |||
| Consensus | 83.0 | 88.8 | 82.2 | |
| GIDEON II (42) | ||||
| Phypat | 85.3 | 87.5 | 84.9 | 90.2 |
| Phypat+PGL | 89.7 | |||
| Consensus | 85.7 | 87.2 | 80.8 | |
| Bergey’s (296) | ||||
| Phypat | NA | 71.2 | ||
| Phypat+PGL | NA | 72.4 | 74 | 70.8 |
| Consensus | NA | 66.6 | ||
See evaluation metrics in Materials and Methods. Subsequently, we tested another 42 bacteria from GIDEON and 296 bacteria described in Bergey’s Manual of Systematic Bacteriology for an independent performance assessment of the two classifiers. Bold values depict the best performance obtained across the Phypat, Phypat+PGL, and consensus classifiers for each measure.
Only the overall accuracy is reported, as insufficient phenotype labels (fewer than five with negative and positive labels, respectively) were available for several phenotypes, to enable a comparable macroaccuracy calculation to the other data sets (see Table S1 in the supplemental material).
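The measures in the table can be illustrated with a small sketch. The definitions used here are assumptions, because Materials and Methods is not reproduced in this record: recall is computed separately for the phenotype-positive and phenotype-negative classes, accuracy is taken as the mean of the two recalls over predictions pooled across phenotypes, and macroaccuracy as the mean of the per-phenotype balanced accuracies. The toy labels are invented for illustration.

```python
from typing import Dict, List, Tuple

Labels = List[int]  # 1 = phenotype present, 0 = phenotype absent


def confusion(truth: Labels, pred: Labels) -> Tuple[int, int, int, int]:
    """Return (TP, FN, TN, FP) for one phenotype."""
    pairs = list(zip(truth, pred))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 0)), pairs.count((0, 1)))


def recalls(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Recall of the positive and negative class (both classes must occur)."""
    return tp / (tp + fn), tn / (tn + fp)


def evaluate(data: Dict[str, Tuple[Labels, Labels]]) -> Dict[str, float]:
    balanced = []          # per-phenotype balanced accuracy
    pooled = [0, 0, 0, 0]  # confusion counts summed over all phenotypes
    for truth, pred in data.values():
        counts = confusion(truth, pred)
        pos, neg = recalls(*counts)
        balanced.append((pos + neg) / 2)
        pooled = [a + b for a, b in zip(pooled, counts)]
    pos, neg = recalls(*pooled)
    return {
        "macroaccuracy": sum(balanced) / len(balanced),
        "accuracy": (pos + neg) / 2,
        "recall_positive": pos,
        "recall_negative": neg,
    }


# Invented (truth, prediction) pairs for two phenotypes.
toy = {
    "Motile": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "Catalase": ([1, 1, 1, 0], [1, 1, 1, 1]),
}
scores = evaluate(toy)
```

A phenotype with very few positive or negative labels makes its per-phenotype recall unstable, which is consistent with reporting only the pooled accuracy for a data set where several phenotypes have too few labels.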
FIG 3 Macroaccuracy for each phenotype of the Traitar phypat and phypat+PGL phenotype classifiers determined in nested cross-validation of 234 bacterial species described in GIDEON (see evaluation metrics in Materials and Methods; Table 1; see Table S1 in the supplemental material).
FIG 4 Classification accuracy for each taxon at different ranks of the NCBI taxonomy. For better visualization of names for the internal nodes, the taxon names are displayed on branches leading to the respective taxon node in the tree. The nested cross-validation accuracy obtained with Traitar for 234 bacterial species described in GIDEON was projected onto the NCBI taxonomy down to the family level. Colored circles at the tree nodes depict the performance of the phypat+PGL classifier (left-hand circles) and the phypat classifier (right-hand circles). The size of the circles reflects the number of species per taxon.
FIG 5 Single-cell phenotyping with Traitar. We used 20 genome assemblies with various degrees of completeness from single cells of the “Candidatus Cloacimonetes” phylum and a joint assembly for phenotyping with Traitar. Shown is a heat map of assembly samples versus phenotypes, which is the standard visualization for phenotype predictions in Traitar. The origin of the phenotype’s prediction (Traitar phypat and/or phypat+PGL classifier) determines the color of the heat map entries. The sample labels have their genome completeness estimates as suffixes. The colors of the dendrogram indicate similar phenotype distributions across samples, as determined by a hierarchical clustering with SciPy (http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
FIG 6 Phenotyping of simulated draft genomes and single-cell genomes. In panel a, we used 20 genome assemblies with various degrees of completeness from single cells of the “Candidatus Cloacimonetes” phylum and a joint assembly for phenotyping with the Traitar phypat and phypat+PGL classifiers. Shown is the performance of the phenotype prediction versus the genome completeness of the single cells with respect to the joint assembly. In panel b, we simulated draft genomes on the basis of an independent test set of 42 microbial (pan)genomes. The coding sequences of these genomes were downsampled (10 replications per sampling point), and the resulting simulated draft genomes were used for phenotyping with the Traitar phypat and phypat+PGL classifiers. We plotted various performance estimates (mean center values and standard deviation error bars are shown) against protein content completeness.
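The downsampling in panel b can be sketched as follows, assuming that a simulated draft genome is a uniform random subset of the coding sequences drawn without replacement. The completeness levels, gene names, and function names are hypothetical; only the 10 replications per sampling point come from the caption.

```python
import random
from typing import Dict, List


def simulate_draft(genes: List[str], completeness: float, rng: random.Random) -> List[str]:
    """Keep a random fraction of the coding sequences, without replacement."""
    keep = round(len(genes) * completeness)
    return rng.sample(genes, keep)


def simulate_series(genes: List[str], levels: List[float],
                    replicates: int = 10, seed: int = 0) -> Dict[float, List[List[str]]]:
    """Draw `replicates` simulated drafts at every completeness level."""
    rng = random.Random(seed)  # fixed seed keeps the toy run reproducible
    return {
        level: [simulate_draft(genes, level, rng) for _ in range(replicates)]
        for level in levels
    }


# Placeholder gene identifiers standing in for one (pan)genome's coding sequences.
genes = [f"gene_{i}" for i in range(200)]
drafts = simulate_series(genes, levels=[0.1, 0.5, 0.9])
```

Each simulated draft would then be phenotyped like any other sample, and the performance measures summarized as mean and standard deviation per completeness level.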
The most relevant Pfam families for the classification of three important phenotypes, nitrate-to-nitrite conversion, motility, and l-arabinose
| Accession no. | Phenotype | Pfam description | Remark |
|---|---|---|---|
| | Motile | Membrane MotB of proton-channel complex MotA/MotB | Flagellar protein |
| | Motile | Flagellar hook capping protein N-terminal region | Flagellar protein |
| | Motile | Flagellar FliS protein | Flagellar protein |
| | Motile | Flagellar FliJ protein | Flagellar protein |
| | Motile | Flagellar basal body protein FlaE | Flagellar protein |
| | Motile | Chemoreceptor zinc-binding domain | Chemotaxis related |
| | Motile | Uncharacterized protein family, UPF0114 | |
| | Motile | CHASE2 domain | Chemotaxis related |
| | Motile | P2 response regulator binding domain | Chemotaxis related |
| | Motile | HPP family | |
| | Nitrate-to-nitrite conversion | NapD protein | Involved in Nar formation |
| | Nitrate-to-nitrite conversion | 4Fe-4S dicluster domain | Iron-sulfur cluster center of beta subunit of Nar |
| | Nitrate-to-nitrite conversion | Nitrate reductase cytochrome | Periplasmic Nap subunit |
| | Nitrate-to-nitrite conversion | Nitrate reductase delta subunit | Nap subunit |
| | Nitrate-to-nitrite conversion | Succinate dehydrogenase/fumarate reductase transmembrane subunit | |
| | Nitrate-to-nitrite conversion | Prokaryotic cytochrome | |
| | Nitrate-to-nitrite conversion | TOBE domain | |
| | Nitrate-to-nitrite conversion | High-affinity nickel transport protein | |
| | Nitrate-to-nitrite conversion | Molybdopterin oxidoreductase Fe4S4 domain | Bound to alpha subunit of Nar |
| | Nitrate-to-nitrite conversion | Nitrate reductase gamma subunit | Nar subunit |
| | l-Arabinose | | Catalyzes first reaction in |
| | l-Arabinose | Galactose mutarotase-like | |
| | l-Arabinose | Domain of unknown function (DUF3459) | |
| | l-Arabinose | Fibronectin type III-like domain | |
| | l-Arabinose | α- | Acts on |
| | l-Arabinose | TraB family | |
| | l-Arabinose | Bacterial transcriptional regulator | |
| | l-Arabinose | Ferric iron reductase FhuF-like transporter | |
| | l-Arabinose | Polysaccharide pyruvyl transferase | |
We ranked the Pfam families with positive weights in the Traitar SVM classifiers by the correlation of the Pfam families with the respective phenotype labels across 234 bacteria described in GIDEON. Shown are the 10 highest ranking Pfam families along with their descriptions and a description of their phenotype-related function, where we found one.
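The ranking described in this note can be sketched as below. The note only says "correlation", so the use of the Pearson coefficient on binary presence vectors is an assumption, as are the function names, family identifiers, weights, and labels, all of which are invented for illustration.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[int], y: List[int]) -> float:
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_families(weights: Dict[str, float], patterns: Dict[str, List[int]],
                  labels: List[int], top: int = 10) -> List[Tuple[str, float]]:
    """Keep positively weighted families and order them by correlation with the labels."""
    scored = [(fam, pearson(patterns[fam], labels))
              for fam, w in weights.items() if w > 0]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


# Invented example: six genomes, phenotype labels, and family presence vectors.
labels = [1, 1, 1, 0, 0, 0]
patterns = {
    "FAM_A": [1, 1, 1, 0, 0, 0],
    "FAM_B": [1, 1, 0, 0, 0, 0],
    "FAM_C": [0, 0, 0, 1, 1, 1],
    "FAM_D": [1, 0, 1, 1, 0, 1],
}
weights = {"FAM_A": 0.9, "FAM_B": 0.4, "FAM_C": -0.7, "FAM_D": 0.2}

ranking = rank_families(weights, patterns, labels)
```

Filtering on positive weight first means that a family anticorrelated with the phenotype, such as the toy `FAM_C`, never enters the list regardless of how strong its association is.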
FIG 7 Phenotype gain and loss dynamics match protein family dynamics. Shown are the phenotype-protein family gain and loss dynamics for families identified as important by Traitar for the L-arabinose phenotype. Signed colored circles along the tree branches depict protein family gains (+) or losses (−). Taxon nodes are colored according to their inferred (ancestral) phenotype state.
Phenotype predictions for two novel Clostridiales species with genomes reconstructed from a commercial biogas reactor metagenome
Traitar output (yes, no, weak) was cross-referenced with phenotypes manually reconstructed on the basis of Kyoto Encyclopedia of Genes and Genomes orthology annotation (64), which are primarily the fermentation phenotypes of various sugars. We considered all of the phenotype predictions that Traitar inferred with either the phypat or the phypat+PGL classifier. A weak prediction means that only a minority of the classifiers in the Traitar voting committee assigned this sample to the phenotype-positive class (Traitar phenotype). Entries shaded light gray show a difference between the prediction and the reconstruction, whereas dark gray denotes an overlap; bold (no shading) is inconclusive.
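The three output levels can be illustrated with a short sketch of a voting committee. The thresholds are assumptions read off the note above: "yes" when a strict majority of committee members vote phenotype-positive, "weak" when at least one but not a majority does, and "no" otherwise. The committee size of five and the function names are hypothetical, and taking the stronger of the phypat and phypat+PGL calls is one possible reading of "either" classifier.

```python
from typing import List

ORDER = {"no": 0, "weak": 1, "yes": 2}


def committee_call(votes: List[bool]) -> str:
    """Map the positive votes of a classifier committee to yes / weak / no."""
    positive = sum(votes)
    if 2 * positive > len(votes):  # strict majority votes positive
        return "yes"
    if positive > 0:               # only a minority votes positive
        return "weak"
    return "no"


def combined_call(phypat: str, phypat_pgl: str) -> str:
    """Report the stronger call when considering either classifier."""
    return max(phypat, phypat_pgl, key=ORDER.__getitem__)


# Hypothetical votes from a five-member committee for one sample and phenotype.
call_a = committee_call([True, True, True, False, False])
call_b = committee_call([True, False, False, False, False])
call_c = committee_call([False] * 5)
```

With an even committee size a tie would fall into "weak" under these thresholds; that choice is arbitrary here and not taken from the source.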