Eike Staub, Sebastian Mackowiak, Martin Vingron.
Abstract
BACKGROUND: Although baker's yeast is a primary model organism for research on eukaryotic ribosome assembly and nucleoli, the list of its proteins that are functionally associated with nucleoli or ribosomes is still incomplete. We trained a naïve Bayesian classifier to predict novel proteins that are associated with yeast nucleoli or ribosomes based on parts lists of nucleoli in model organisms and large-scale protein interaction data sets. Phylogenetic profiling and gene expression analysis were carried out to shed light on evolutionary and regulatory aspects of nucleoli and ribosome assembly.
Year: 2006 PMID: 17067374 PMCID: PMC1794573 DOI: 10.1186/gb-2006-7-10-r98
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Estimation of prediction accuracy. The accuracy of predictions was estimated from 1,000 runs of 10-fold cross-validation using 1,000 alternative training sets (see Materials and methods). The threshold/working point used for the final predictions of new nucleolar proteins is marked in each plot. (a) The sensitivity (SE = TP/(TP + FN)) of our classifier is plotted over different thresholds of classifier scores (log posterior odds ratios) applied to each cross-validation run. The logarithmic posterior odds ratios indicate how likely it is under the naïve Bayesian model that a protein is an NRCA protein (positive scores) versus that it is not an NRCA protein (negative scores). Each point on the line and its error bar stem from the average sensitivity and its standard deviation obtained from 1,000 cross-validation runs at a distinct classification score threshold. Confidence intervals span ± 2 standard deviations around the mean. Note that at the threshold that was finally used for prediction (0.4) we expect to reach a sensitivity of 50.4%. This means that we have probably still missed as many NRCA proteins as we have predicted (62). (b) The specificity (SP = TN/(TN + FP)) of our classifier is plotted over different thresholds of classifier scores (log posterior odds ratios) that were applied to the results of each of the 1,000 cross-validation runs. Confidence intervals span ± 2 standard deviations around the mean. Note that at the finally used threshold of 0.4 the specificity reaches 0.986, meaning that we expect a false-positive rate of only 1.4%. (c) The ROC curve of our classifier is plotted as sensitivity versus (1 - specificity). Each individual data point reflects the predictions of a single cross-validation run at a single prediction threshold. The central line is based on averaged SE/SP values for each threshold applied.
The ROC curve is a general indicator of classification performance: the larger the area under the curve (AUC), the better the classifier. We obtained an AUC of 0.98, which indicates a high-quality classification. The ROC curve was also the basis for the selection of our final classifier threshold, as it illustrates the trade-off between sensitivity and specificity. We chose to be very conservative (high specificity) at the cost of missing true NRCA proteins (lower sensitivity).
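The definitions SE = TP/(TP + FN) and SP = TN/(TN + FP) used in this caption can be illustrated with a minimal sketch. The scores and labels below are hypothetical toy values, not data from the study, and the function name is an assumption chosen for illustration. The sketch also assumes that both classes are present, so neither denominator is zero.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (SE, SP) when items scoring above `threshold` are called positive.

    scores: classifier scores (e.g. log posterior odds ratios)
    labels: True for known positives, False for known negatives
    """
    tp = fn = tn = fp = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score > threshold
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    se = tp / (tp + fn)  # sensitivity: fraction of true positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of true negatives rejected
    return se, sp


# Hypothetical toy data: four known positives followed by four known negatives.
scores = [0.9, 0.5, 0.1, -0.3, 0.45, -1.2, -2.0, 0.2]
labels = [True, True, True, True, False, False, False, False]

se, sp = sensitivity_specificity(scores, labels, threshold=0.4)
print(f"SE = {se:.2f}, SP = {sp:.2f}")
```

Repeating the call over a range of thresholds and plotting SE against (1 - SP) gives the points of a ROC curve like the one in panel (c).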
Classification results and annotation for 62 novel predicted nucleolar/ribosome-associated proteins
| Gene | ORF | Hs | At | Ue | It | Kr | Ga | Ho | log(O) | Description |
| SUA7 | YPR086W | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0.665 | TFIIB subunit (transcription initiation factor) factor E |
| HTA1 | YDR225W | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0.612 | Histone H2A |
| HSC82 | YMR186W | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.697 | Heat shock protein |
| TIF1 | YKR059W | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.699 | Translation initiation factor 4A |
| PRP4 | YPR178W | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0.703 | U4/U6 snRNP 52 kDa protein |
| KAR2 | YJL034W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.684 | Component of ER translocon |
| HTA2 | YBL003C | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.724 | Histone H2A.2 |
| AAC3 | YBR085W | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0.686 | Mitochondrial ADP/ATP carrier - member of the mitochondrial carrier (MCF) family |
| RFC2 | YJR068W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | DNA replication factor C 41 kDa subunit |
| TEF1 | YPR080W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | Translation elongation factor eEF1 alpha-A chain cytosolic |
| SMX2 | YFL017W-A | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.696 | snRNP G protein (the homologue of the human Sm-G) |
| BCP1 | YDR361C | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | Similarity to hypothetical protein |
| LEA1 | YPL213W | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.704 | U2 A snRNP protein |
| HSP82 | YPL240C | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | Heat shock protein |
| SMD3 | YLR147C | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.699 | Spliceosomal snRNA-associated Sm core protein required for pre-mRNA splicing |
| TIF2 | YJL138C | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0.686 | Translation initiation factor eIF4A |
| None | YBR025C | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0.610 | Strong similarity to Ylf1p |
| SPT16 | YGL207W | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0.705 | General chromatin factor |
| SUI2 | YJR007W | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.720 | Translation initiation factor eIF2 alpha chain |
| HSH49 | YOR319W | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0.702 | Essential yeast splicing factor |
| DED1 | YOR204W | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.716 | ATP-dependent RNA helicase |
| HTB1 | YDR224C | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0.709 | Histone H2B |
| HRR25* | YPL204W | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.718 | Casein kinase I Ser/Thr/Tyr protein kinase |
| SSA2 | YLL024C | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | Heat shock protein of HSP70 family cytosolic |
| SRP1 | YNL189W | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0.696 | Karyopherin-alpha or importin |
| SUB2 | YDL084W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | Probably involved in pre-mRNA splicing |
| CKA1 | YIL035C | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0.698 | Casein kinase II catalytic alpha chain |
| PRP43* | YGL120C | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0.695 | Involved in spliceosome disassembly |
| SUI3 | YPL237W | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.721 | Translation initiation factor eIF2 beta subunit |
| DST1 | YGL043W | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.692 | TFIIS (transcription elongation factor) |
| PRP8 | YHR165C | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.721 | U5 snRNP protein pre-mRNA splicing factor |
| PRP9 | YDL030W | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0.667 | Pre-mRNA splicing factor (snRNA-associated protein) |
| SUP45 | YBR143C | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0.704 | Translational release factor |
| ASC1 | YMR116C | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.698 | 40S small subunit ribosomal protein |
| DBP2* | YNL112W | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.719 | ATP-dependent RNA helicase of DEAD box family |
| CKB2 | YOR039W | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0.710 | Casein kinase II beta chain |
| YRA1 | YDR381W | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.720 | RNA annealing protein |
| GCD11 | YER025W | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0.609 | Translation initiation factor eIF2 gamma chain |
| TFG2 | YGR005C | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0.695 | TFIIF subunit (transcription initiation factor) 54 kDa |
| TOP1* | YOL006C | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0.693 | DNA topoisomerase I |
| BRR2 | YER172C | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.708 | RNA helicase-related protein |
| RVB1 | YDR190C | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0.709 | RUVB-like protein |
| MLP1 | YKR095W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.686 | Myosin-like protein related to Uso1p |
| HTZ1 | YOL012C | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.685 | Evolutionarily conserved member of the histone H2A F/Z family of histone variants |
| ATP2 | YJR121W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.685 | F1F0-ATPase complex F1 beta subunit |
| SMD2 | YLR275W | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0.688 | U1 snRNP protein of the Sm class |
| PRP3 | YDR473C | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0.704 | Essential splicing factor |
| EFT1 | YOR133W | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.682 | Translation elongation factor eEF2 |
| HTB2 | YBL002W | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0.690 | Histone H2B.2 |
| TEF4 | YKL081W | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0.718 | Translation elongation factor eEF1 gamma chain |
| HHF2 | YNL030W | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.695 | Histone H4 |
| Predictions based solely on protein interactions | | | | | | | | | | |
| RPO21 | YDL140C | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.728 | DNA-directed RNA polymerase II 215 kDa subunit |
| DHH1 | YDL160C | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0.714 | Putative RNA helicase of the DEAD box family |
| CFT1 | YDR301W | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.731 | Pre-mRNA 3-end processing factor CF II |
| KAP95 | YLR347C | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0.689 | Karyopherin-beta |
| SPT5 | YML010W | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.732 | Transcription elongation protein |
| TAF14 | YPL129W | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.733 | TFIIF subunit (transcription initiation factor) 30 kDa |
| RPB3 | YIL021W | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.728 | DNA-directed RNA-polymerase II 45 kDa |
| RPO31 | YOR116C | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.729 | DNA-directed RNA polymerase III 160 kDa subunit |
| TIF4631 | YGR162W | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.734 | mRNA cap-binding protein (eIF4F) 150K subunit |
| PRP24 | YMR268C | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.734 | Pre-mRNA splicing factor |
| RET1 | YOR207C | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0.731 | DNA-directed RNA polymerase III 130 kDa subunit |
The data used for classification and the detailed prediction results are listed for all 62 proteins that passed our threshold of log(O) > 0.4. These proteins had not been annotated as associated with nucleolar or ribosomal components before, but were classified as such in our analysis. A literature survey for the predicted proteins revealed that for four proteins a role in the nucleolus and ribosome biogenesis had already been established (see Note added in proof). The lower part of the table lists 11 proteins that were predicted as NRCA proteins solely on the basis of shared participation in complexes or interactions. For these proteins, we do not necessarily predict a nucleolar localization, but rather a direct interaction with nucleolar/ribosomal components under at least one specific cellular condition at an unspecified locus within the cell. *Four proteins for which recent articles have confirmed a role in ribosome biogenesis or the nucleolus. The results are supplemented by a concise annotation for each protein from the Comprehensive Yeast Genome Database (CYGD) [72]. The header line contains abbreviations describing the column content: Gene, gene symbol of yeast gene; ORF, yeast open reading frame ID; Hs, orthology to human nucleolar protein; At, orthology to mouse-ear cress nucleolar protein; Ue, link to nucleolar protein via Y2H interaction in Uetz data set; It, link to nucleolar protein via Y2H interaction in Ito data set; Kr, link to nucleolar protein via participation in a complex in Krogan data set; Ga, link to nucleolar protein via participation in a complex in Gavin data set; Ho, link to nucleolar protein via participation in a complex in Ho data set; log(O), average log posterior odds ratio from all prediction runs in which the protein was not used for training; Description, concise description of protein function.
Figure 2. Phylogenetic profiling of novel nucleolar/ribosome-associated proteins. Phylogenetic profiles of 62 previously unrecovered nucleolar/ribosome-associated proteins of yeast across 84 organisms. The profiles were generated using the best reciprocal hit method with yeast as a reference organism (see Materials and methods). Abbreviations given on the top of the plot represent organism names (first three letters for genus and first three letters of species names; see Materials and methods for a translation of abbreviations into organism names). Further taxonomic annotation is given on the bottom of the plot. Yeast open reading frame identifiers are given on the left side, and gene names and descriptions are given on the right side of the plot. The significance of sequence similarity is visualized by different shades of gray that reflect the logarithmic expectation (E) value from reciprocal BLAST searches (shown at the bottom of the figure). Here, the E values of BLAST searches using target proteome sequences as queries versus the yeast proteome reference database are shown. The genes are ordered according to hierarchical clustering (see Materials and methods).
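The best reciprocal hit idea mentioned in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: hits are assumed to be already parsed into (query, subject, E value) tuples, "best" is assumed to mean lowest E value, and the target protein identifiers (`t1`, `t2`, `t3`) and function names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its lowest-E-value subject.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return {yeast_id: (target_id, reverse_evalue)} for reciprocal pairs.

    forward_hits: yeast proteins searched against a target proteome.
    reverse_hits: target proteins searched against the yeast proteome.
    The reverse E value is kept because the caption reports E values of
    searches that use target sequences as queries.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    pairs = {}
    for yeast_id, (target_id, _) in forward.items():
        if target_id in reverse and reverse[target_id][0] == yeast_id:
            pairs[yeast_id] = (target_id, reverse[target_id][1])
    return pairs


# Hypothetical toy hits; only the yeast ORF names come from the table above.
forward = [
    ("YPR086W", "t1", 1e-50),
    ("YPR086W", "t2", 1e-10),
    ("YDR225W", "t3", 1e-30),
]
reverse = [
    ("t1", "YPR086W", 1e-48),
    ("t3", "YBL003C", 1e-40),  # t3 prefers a different yeast protein
    ("t3", "YDR225W", 1e-35),
]

rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```

In this toy input only YPR086W is retained, because the best yeast hit of `t3` is not the protein that selected it. Repeating the procedure for each target proteome would yield one presence/absence column of a phylogenetic profile.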
Summary of effective prediction rules obtained by Bayesian classification
| Hs | At | Ue | It | Kr | Ga | Ho | Prediction: associated with nucleolar or ribosomal component? |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | No |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | No |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | No |
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | No |
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | No |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | No |
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | No |
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | No |
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | No |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | No |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | No |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | No |
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | No |
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | No |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | No |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | No |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | No |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | No |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | No |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | No |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | Yes |
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | Yes |
| 0 | 0 | 1 | 1 | 1 | 1 | 0 | Yes |
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | Yes |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | Yes |
| 0 | 1 | 1 | 0 | 1 | 1 | 1 | Yes |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | Yes |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | Yes |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | Yes |
| 1 | 0 | 0 | 0 | 1 | 1 | 1 | Yes |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | Yes |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | Yes |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | Yes |
| 1 | 0 | 1 | 0 | 1 | 1 | 0 | Yes |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | Yes |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | Yes |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | Yes |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | Yes |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | Yes |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | Yes |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | Yes |
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | Yes |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | Yes |
Our Bayesian classification approach assigns a distinct prediction to each possible binary pattern that could be associated with a protein. With the seven data sources used here, only 128 (2^7) different combinations of binary evidence are possible. Here, all binary patterns that occur in our data set are enumerated. They are supplemented by the prediction result to illustrate which input data generate which prediction. Note that neither a single protein interaction nor a single occurrence in nucleoli of model organisms is sufficient for a positive prediction, and that evidence from three protein interaction experiments is necessary for a positive prediction in the absence of evidence based on orthologs in nucleolar preparations of model organisms. Column headers denote the data source: Hs, orthology to human nucleolar protein; At, orthology to mouse-ear cress nucleolar protein; Ue, link to nucleolar protein via Y2H interaction in Uetz data set; It, link to nucleolar protein via Y2H interaction in Ito data set; Kr, link to nucleolar protein via participation in a complex in Krogan data set; Ga, link to nucleolar protein via participation in a complex in Gavin data set; Ho, link to nucleolar protein via participation in a complex in Ho data set.
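How a naïve Bayesian model turns a binary evidence pattern into a log posterior odds score can be sketched as follows. All probabilities below are hypothetical placeholders chosen for illustration; they are not the parameters estimated in the study, so the resulting scores will not reproduce the table above. The natural logarithm is likewise an assumption, as the base is not stated here.

```python
import math
from itertools import product

SOURCES = ["Hs", "At", "Ue", "It", "Kr", "Ga", "Ho"]

# Hypothetical P(evidence = 1 | class) per data source (illustrative only).
P_GIVEN_POS = {"Hs": 0.50, "At": 0.40, "Ue": 0.10, "It": 0.10,
               "Kr": 0.60, "Ga": 0.50, "Ho": 0.30}
P_GIVEN_NEG = {"Hs": 0.05, "At": 0.05, "Ue": 0.02, "It": 0.03,
               "Kr": 0.15, "Ga": 0.10, "Ho": 0.05}
PRIOR_POS = 0.08  # hypothetical prior probability of the positive class


def log_posterior_odds(pattern):
    """Log posterior odds of the positive class for one binary pattern.

    Under the naive independence assumption, the score is the log prior
    odds plus one log likelihood ratio per data source.
    """
    log_odds = math.log(PRIOR_POS / (1.0 - PRIOR_POS))
    for source, bit in zip(SOURCES, pattern):
        p_pos = P_GIVEN_POS[source] if bit else 1.0 - P_GIVEN_POS[source]
        p_neg = P_GIVEN_NEG[source] if bit else 1.0 - P_GIVEN_NEG[source]
        log_odds += math.log(p_pos / p_neg)
    return log_odds


# Enumerate all 2^7 = 128 possible evidence patterns and score each one.
patterns = list(product([0, 1], repeat=len(SOURCES)))
scores = {pattern: log_posterior_odds(pattern) for pattern in patterns}

for pattern in [(0, 0, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0), (1, 1, 0, 0, 0, 0, 0)]:
    print(pattern, round(scores[pattern], 3))
```

Because every pattern maps to exactly one score, thresholding the score yields a fixed yes/no rule per pattern, which is why the classifier can be summarized as a lookup table.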
Figure 3. Hierarchical clustering of phylogenetic profiles of nucleolar proteins. Phylogenetic profiles of all 501 nucleolar or ribosome-associated proteins. Organisms vary along the horizontal axis, proteins along the vertical axis. Presence of a gene is indicated by dark blue, absence by light blue. Organisms from the three domains of life are separated by black bars. The dendrogram resulting from protein-wise hierarchical clustering is given on the left. Several evolutionarily meaningful clusters emerged, which are colored in the dendrogram: red, proteins of archaeal origin; yellow, ubiquitous proteins; green, proteins of (eu-)bacterial origin. Note that the eukaryote-only genes constitute the largest group, followed by the archaea/eukaryote group. There is a considerable number of genes with orthologs only in bacteria and eukaryota, but not in archaea.
Figure 4. Phylogenetic profiling of the 90S processosome. Phylogenetic profiles of known yeast 90S processosome proteins across 84 organisms. Abbreviations given on the top of the plot represent organism names (first three letters for genus and first three letters of species names; see Materials and methods for a translation of abbreviations into organism names). Further taxonomic annotation is given on the bottom of the plot. Yeast open reading frame identifiers are given on the left side, and gene names and descriptions are given on the right side of the plot. The significance of sequence similarity is visualized by different shades of gray that reflect the logarithmic expectation (E) value from reciprocal BLAST searches (shown at the bottom of the figure). Here, the E values of BLAST searches using target proteome sequences as queries versus the yeast proteome reference database are shown. The genes are ordered according to hierarchical clustering (see Materials and methods). Note that there are only a few proteins with many prokaryotic orthologs when compared to Figure 3.
Figure 5. Survey of nucleolar/ribosomal gene expression. Histograms of sets of pairwise Pearson correlation coefficients computed from vectors of gene expression ratios for gene pairs. The distribution of Pearson correlation coefficients (each obtained from the pairwise comparison of the expression profiles of two genes) gives an impression of the global similarity of expression patterns in a group of genes. Random data would give a Pearson correlation coefficient distribution centered around 0 (no correlation). The more a distribution deviates towards +1 compared to a 0-centered bell shape, the more similarly the genes of a group are expressed across the whole expression compendium. Gene pairs were formed within or between the functionally or evolutionarily defined groups of genes that are under investigation here. (a) Correlation within all yeast genes. (b) Correlation within genes that do not encode nucleolar proteins. (c) Correlation within genes for nucleolar proteins. (d) Correlation within genes for ribosomal or ribosome-associated proteins. (e) Correlation within nucleolar genes that stem from archaea. (f) Correlation within nucleolar genes that do not stem from archaea. (g) Correlation within genes that encode 90S processosome components. (h) Correlation between genes for ribosomal proteins and 90S processosome proteins. Note that the distributions for the ribosomal protein genes and the 90S processosome strongly deviate from the rather 0-centered distribution of 'all genes-versus-all genes' comparisons. However, the distribution for gene pairs in which one partner is a 90S processosome component and the other partner is a ribosomal component deviates much less from the random shape and thus indicates distinct expression programs.
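The within-group and between-group comparisons described in this caption can be sketched as follows. The expression profiles and gene labels below are hypothetical toy values, not data from the expression compendium, and the function names are assumptions chosen for illustration. The sketch assumes that no profile is constant, so the denominator is never zero.

```python
import math
from itertools import combinations, product


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def within_group(profiles, genes):
    """Correlations for all unordered gene pairs inside one group."""
    return [pearson(profiles[g], profiles[h]) for g, h in combinations(genes, 2)]


def between_groups(profiles, group_a, group_b):
    """Correlations for all pairs with one gene from each group."""
    return [pearson(profiles[g], profiles[h]) for g, h in product(group_a, group_b)]


# Hypothetical expression ratios across four experiments.
profiles = {
    "ribo1": [1.0, 2.0, 3.0, 4.0],
    "ribo2": [2.0, 4.0, 6.0, 8.0],
    "proc1": [4.0, 3.0, 2.0, 1.0],
    "proc2": [8.0, 6.0, 4.0, 2.0],
}
ribosomal = ["ribo1", "ribo2"]
processosomal = ["proc1", "proc2"]

within_ribo = within_group(profiles, ribosomal)
between = between_groups(profiles, ribosomal, processosomal)
print(within_ribo, between)
```

A histogram of each returned list corresponds to one panel of the figure: within-group lists to panels such as (d) and (g), and the between-group list to panel (h).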
Figure 6. Hierarchical clustering of gene expression patterns of ribosomal and processosomal protein genes. The central plot shows color-coded expression ratios as supplied in the ROSETTA expression compendium [47] for genes encoding ribosomal and 90S processosomal proteins. Genes vary along the horizontal axis, expression experiments vary along the vertical axis. Top: 90S processosomal genes are marked in black, ribosomal protein genes are marked in white. Bottom: hierarchical clustering yields two large clusters, here marked in cyan and in yellow, that comprise approximately 80% of all ribosomal/processosomal genes (171 of 211). Only genes of these clusters are shown here. Note that only three genes are not clustered according to their membership in either the ribosome or the 90S processosome. The separation of the 90S processosomal and ribosomal protein genes by hierarchical clustering (an unsupervised approach) confirms that the ribosomal and 90S processosomal expression programs are distinct from each other (Figure 5).