Thomas Lingner, Stefanie Mühlhausen, Toni Gabaldón, Cedric Notredame, Peter Meinicke.
Abstract
BACKGROUND: Establishing the relationship between an organism's genome sequence and its phenotype is a fundamental challenge that remains largely unsolved. Accurately predicting microbial phenotypes solely based on genomic features will allow us to infer relevant phenotypic characteristics when the availability of a genome sequence precedes experimental characterization, a scenario that is favored by the advent of novel high-throughput and single-cell sequencing techniques.
Year: 2010 PMID: 20868492 PMCID: PMC2955703 DOI: 10.1186/1471-2105-11-481
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance comparison with pathway-based prediction
| Phenotype | Best pathway-based: aucROC | Best pathway-based: sens × spec | Domain-based: aucROC (avg/std) | Domain-based: sens × spec (avg/std) |
|---|---|---|---|---|
| Gram stain | 0.93 | 0.90 | 0.97/0.01 | 0.96/0.01 |
| Oxygen Requirement | 0.93 | 0.88 | 0.95/0.02 | 0.94/0.02 |
Prediction performance comparison between a pathway-based method [8] and the protein domain profile-based approach. The first column indicates the phenotype category. The second and fourth columns give the phenotype prediction performance in terms of the area under the ROC curve (aucROC); the third and fifth columns give it in terms of the product of sensitivity (sens) and specificity (spec). For the domain-based method the values denote the average (avg) and standard deviation (std) over 100 repetitions of a ten-fold cross-validation with random partitioning. Performance values for the pathway-based method have been taken from the supplemental material associated with the original work.
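The avg/std values in the table come from repeating a ten-fold cross-validation with a fresh random partition each time. The following is a minimal sketch of that evaluation loop, not the authors' implementation: the threshold classifier, the toy data, and the choice to pool held-out predictions per repetition before computing sens × spec are all assumptions made for illustration.

```python
import random
import statistics

def sens_spec(y_true, y_pred):
    """Return (sensitivity, specificity) for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p != 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def random_folds(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv(X, y, fit, predict, k=10, repeats=100, seed=0):
    """Mean and standard deviation of sens x spec over repeated k-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        y_pred = [None] * len(y)
        for fold in random_folds(len(y), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in fold:
                y_pred[i] = predict(model, X[i])
        sens, spec = sens_spec(y, y_pred)  # pooled over all folds (assumption)
        scores.append(sens * spec)
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_threshold(X, y):
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t != 1]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else -1

# Toy one-dimensional data, purely illustrative.
X = list(range(40))
y = [1 if x >= 20 else -1 for x in X]
avg, std = repeated_cv(X, y, fit_threshold, predict_threshold, k=10, repeats=5)
print(f"sens x spec: {avg:.2f}/{std:.2f}")
```

Any classifier exposing a `fit`/`predict` pair of this shape could be substituted for the stand-in.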
Validation performance
| Phenotype | no genus partition | with genus partition | difference |
|---|---|---|---|
| Endospores | 0.946 | 0.821 | -0.125 |
| Gram stain | 0.966 | 0.905 | -0.061 |
| Motility | 0.932 | 0.870 | -0.062 |
| Oxygen Requirement | 0.973 | 0.949 | -0.024 |
| average | 0.954 | 0.886 | -0.068 |
Comparison of phenotype prediction performance for different validation sets according to phylogenetic proximity of organisms (see also section "Methods"). The first column indicates the phenotype category; the second and third columns give the prediction performance in terms of the harmonic mean of sensitivity and specificity for the validation set without and with the genus partition, respectively. Values in the fourth column are the change in prediction performance when the genus-partitioned data set is used.
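One way to read the genus partition is that all organisms of a genus are kept in the same validation fold, so that a model is never tested on a close relative of a training organism. The sketch below illustrates that reading only; the exact procedure is described in the paper's "Methods" section, and the function name, the greedy balancing rule, and the genus labels are assumptions.

```python
from collections import defaultdict

def genus_folds(genera, k):
    """Assign sample indices to k folds so that no genus spans two folds.

    Genera are placed largest first into the currently smallest fold,
    which keeps the fold sizes roughly balanced.
    """
    by_genus = defaultdict(list)
    for i, genus in enumerate(genera):
        by_genus[genus].append(i)
    folds = [[] for _ in range(k)]
    for genus in sorted(by_genus, key=lambda g: (-len(by_genus[g]), g)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_genus[genus])
    return folds

# Illustrative genus labels, one per organism (not taken from the data set).
genera = ["Bacillus", "Bacillus", "Clostridium", "Escherichia",
          "Bacillus", "Clostridium", "Vibrio", "Escherichia"]
folds = genus_folds(genera, k=3)
print(folds)  # [[0, 1, 4], [2, 5, 6], [3, 7]]
```

With a random partition, by contrast, members of one genus usually end up in both training and validation folds, which is consistent with the higher scores in the "no genus partition" column.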
Generalization performance
| Phenotype | sensitivity | specificity | harmonic mean | aucROC | aucPRC |
|---|---|---|---|---|---|
| Endospores | 0.913 | 0.875 | 0.894 | 0.984 | 0.959 |
| Gram stain | 0.993 | 0.907 | 0.948 | 0.986 | 0.968 |
| Motility | 0.942 | 0.874 | 0.906 | 0.927 | 0.944 |
| Oxygen Requirement | 0.992 | 0.963 | 0.977 | 0.993 | 0.987 |
| average | 0.960 | 0.904 | 0.931 | 0.972 | 0.965 |
Generalization performance using an independent test data set (for details see section "Methods"). The first column indicates the phenotype category; the remaining columns give the prediction accuracy in terms of different performance measures. Here, "aucROC" and "aucPRC" denote the area under the ROC curve and the area under the precision-recall curve (PRC), respectively.
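The harmonic mean column can be recomputed from the sensitivity and specificity columns, and the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below shows both calculations; the function names and the toy scores are illustrative assumptions, and the pairwise AUC is the textbook rank formulation rather than necessarily the routine used in the paper.

```python
def harmonic_mean(sens, spec):
    """Harmonic mean of sensitivity and specificity."""
    return 2 * sens * spec / (sens + spec) if sens + spec else 0.0

def auc_roc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Sensitivity and specificity as listed in the table above.
table = {
    "Endospores": (0.913, 0.875),
    "Gram stain": (0.993, 0.907),
    "Oxygen Requirement": (0.992, 0.963),
}
for phenotype, (sens, spec) in table.items():
    print(phenotype, round(harmonic_mean(sens, spec), 3))

# Toy classifier scores, purely illustrative: 8 of 9 pairs are ordered correctly.
print(auc_roc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

Recomputing from the rounded table entries can differ from the printed value in the last digit: for "Motility", 0.942 and 0.874 give 0.907 rather than the listed 0.906, presumably because the table was computed from unrounded values.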
Figure 1. ROC curves for generalization performance. Receiver operating characteristic (ROC) curves representing the phenotype prediction performance on the independent test set. Each phenotype-specific curve is assigned a unique color according to the legend within the figure. Axes are limited to a minimum true positive rate of 50% and a maximum false positive rate of 50%. The associated area under ROC curve (aucROC) values are given in Table 3.
Discriminative domain families for phenotype category "Endospores"
| rank | weight | # groups | Pfam-ID | Pfam description |
|---|---|---|---|---|
| 1 | 0.008 | 3 | PF03419 | Sporulation factor SpoIIGA |
| 2 | 0.007 | 5 | PF07486 | Cell Wall Hydrolase |
| 3 | 0.007 | 1 | PF06686 | Stage III sporulation protein AC (SpoIIIAC) |
| 4 | 0.007 | 2 | PF00269 | Small, acid-soluble spore proteins, alpha/beta type |
| 5 | 0.007 | 1 | PF07873 | YabP family |
| 6 | 0.007 | 1 | PF09555 | Stage III sporulation protein AD (spore_III_AD) |
| 7 | 0.007 | 3 | PF00407 | Pathogenesis-related protein Bet v I family |
| 8 | 0.007 | 4 | PF00876 | Innexin |
| 9 | 0.007 | 6 | PF04672 | Protein of unknown function (DUF574) |
| 10 | 0.006 | 1 | PF04647 | Accessory gene regulator B |
List of the 10 most positively discriminative domain families associated with the RLSC model for the phenotype category "Endospores" (see also section "Methods"). The first column indicates the rank, the second column shows the discriminative model weight. The third column denotes the phylogenetic width of a particular domain family, i.e. the number of taxonomic groups at phylum level in which the family occurs. The fourth and fifth columns give the Pfam ID and family description associated with a particular domain family. The table with the 50 most positively and negatively discriminative domain families can be found in additional file 2.
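The two quantities in this table, the rank by model weight and the phylogenetic width, can be derived from a weight vector and a genome-to-domain mapping. The sketch below assumes a linear model with one weight per domain family; all identifiers and values (`famA`, `g1`, the phylum labels, the weights) are hypothetical placeholders rather than data from the study.

```python
def top_positive(weights, k):
    """Return the k features with the largest positive weights, best first."""
    positive = [(fam, w) for fam, w in weights.items() if w > 0]
    return sorted(positive, key=lambda fw: (-fw[1], fw[0]))[:k]

def phylogenetic_width(family, genome_families, genome_phylum):
    """Count the distinct phyla with at least one genome carrying `family`."""
    return len({genome_phylum[g]
                for g, fams in genome_families.items() if family in fams})

# Hypothetical model weights, one per domain family.
weights = {"famA": 0.008, "famB": -0.004, "famC": 0.007, "famD": 0.001}

# Hypothetical domain content and phylum label of four genomes.
genome_families = {
    "g1": {"famA", "famC"},
    "g2": {"famA"},
    "g3": {"famC", "famD"},
    "g4": {"famB", "famC"},
}
genome_phylum = {"g1": "Firmicutes", "g2": "Firmicutes",
                 "g3": "Proteobacteria", "g4": "Actinobacteria"}

for rank, (fam, w) in enumerate(top_positive(weights, k=2), start=1):
    width = phylogenetic_width(fam, genome_families, genome_phylum)
    print(rank, w, width, fam)
```

In this toy example `famA` ranks first but occurs in a single phylum, while `famC` spans three. This mirrors the contrast between the narrow widths listed for the "Endospores" families and the broad ones listed for "Motility".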
Figure 2. Clustering dendrogram for phenotype category "Endospores". Phylogenetic clustering dendrogram of the 50 most discriminative positive domains associated with the phenotype category "Endospores" (see also section "Methods"). Clusters with a maximum linkage branch length of 70% (Matlab® 'dendrogram' default threshold) are assigned unique colors. In the dendrogram a cluster of protein families associated with sporulation can directly be identified (colored green). This cluster also contains several uncharacterized domain families ("DUFs").
Discriminative domain families for phenotype category "Motility"
| rank | weight | # groups | Pfam-ID | Pfam description |
|---|---|---|---|---|
| 1 | 0.008 | 18 | PF00015 | Methyl-accepting chemotaxis protein (MCP) signaling domain |
| 2 | 0.007 | 20 | PF00672 | HAMP domain |
| 3 | 0.007 | 14 | PF00460 | Flagella basal body rod protein |
| 4 | 0.006 | 12 | PF08345 | Flagellar M-ring protein C-terminal |
| 5 | 0.006 | 15 | PF06429 | Domain of unknown function (DUF1078) |
| 6 | 0.006 | 14 | PF00700 | Bacterial flagellin C-terminus |
| 7 | 0.006 | 16 | PF01312 | FlhB HrpN YscU SpaS Family |
| 8 | 0.006 | 11 | PF02120 | Flagellar hook-length control protein |
| 9 | 0.006 | 14 | PF02049 | Flagellar hook-basal body complex protein FliE |
| 10 | 0.006 | 14 | PF00669 | Bacterial flagellin N-terminus |
List of the 10 most positively discriminative domain families associated with the RLSC model for the phenotype category "Motility" (see also section "Methods"). The first column indicates the rank, the second column shows the discriminative model weight. The third column denotes the phylogenetic width of a particular domain family, i.e. the number of taxonomic groups at phylum level in which the family occurs. The fourth and fifth columns give the Pfam ID and family description associated with a particular domain family. The table with the 50 most positively and negatively discriminative domain families can be found in additional file 2.