Kenneth Bryan, Pádraig Cunningham.
Abstract
BACKGROUND: Microarrays can measure the expression of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has previously been used to uncover gene expression correlations over subsets of samples, with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs), but until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA.
Year: 2008 PMID: 18831786 PMCID: PMC2559885 DOI: 10.1186/1471-2164-9-S2-S20
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Change in H-score and Hv-score with increasing bicluster scale. Figure 1 illustrates how the H-score and the improved Hv-score change as the scale of the bicluster being measured changes. Biclusters of different scales, but with the same relative row correlation, receive very different H-scores but approximately the same Hv-score. Biclusters were generated as in [28].
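The scale behaviour described in the Figure 1 caption can be sketched in a few lines of Python. The H-score is taken here to be the mean squared residue of Cheng and Church. The `hv_score` function rests on an assumption: it treats the Hv-score as the summed squared residue divided by the summed squared deviation of each value from its row mean. 'Scale' is likewise assumed to mean the magnitude of the expression values. All function names and the toy matrix are illustrative only and do not come from the paper.

```python
def _squared_sums(a):
    """Return (sum of squared residues, sum of squared row deviations, cell count)."""
    n_rows, n_cols = len(a), len(a[0])
    row_mean = [sum(row) / n_cols for row in a]
    col_mean = [sum(row[j] for row in a) / n_rows for j in range(n_cols)]
    grand_mean = sum(row_mean) / n_rows
    sq_residue = sum(
        (a[i][j] - row_mean[i] - col_mean[j] + grand_mean) ** 2
        for i in range(n_rows) for j in range(n_cols)
    )
    sq_row_dev = sum(
        (a[i][j] - row_mean[i]) ** 2
        for i in range(n_rows) for j in range(n_cols)
    )
    return sq_residue, sq_row_dev, n_rows * n_cols


def h_score(a):
    """Mean squared residue of a bicluster (rows = genes, columns = samples)."""
    sq_residue, _, n_cells = _squared_sums(a)
    return sq_residue / n_cells


def hv_score(a):
    """ASSUMED form of the Hv-score: residue normalised by row variance."""
    sq_residue, sq_row_dev, _ = _squared_sums(a)
    return sq_residue / sq_row_dev


# Toy bicluster and a copy with every value multiplied by 10.
small = [[1.0, 2.0, 4.0], [2.0, 3.5, 4.5], [0.5, 2.5, 3.0]]
large = [[10 * v for v in row] for row in small]

print(h_score(small), h_score(large))    # differ by a factor of about 100
print(hv_score(small), hv_score(large))  # approximately equal
```

Under these assumptions, multiplying every value by a constant c multiplies the H-score by c squared, while the normalised score is unchanged because numerator and denominator scale together.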
Figure 2. Illustration of the steps in the BALBOA ORF classification algorithm. This figure shows the various steps of the BALBOA ORF prediction algorithm. In step 1 the expression dataset is divided into its annotated genes and unannotated (unclassified) ORFs. In step 2 biclusters are generated in the annotated gene set only. In step 3 selected biclusters (where E ≥ E) are used to classify similarly expressed ORFs in the unclassified set. In step 4 ORFs are combined into a weighted frequency list for each functional category. Each ORF label weight is derived from the functional enrichments of the classifying biclusters. In step 5 the top ORFs (where F ≥ F) are selected from this list. In step 6 ORFs consistently classified across independent datasets are returned.
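Steps 3 to 6 of the caption above can be outlined roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the bicluster generation and ORF matching of steps 1 to 3 are taken as given, each "hit" is a hypothetical tuple of (functional category, bicluster enrichment, set of ORFs the bicluster classified), the label weight is assumed to be the sum of the enrichments of the classifying biclusters, and `e_min`, `f_min` and `min_datasets` are placeholder names for the thresholds. The ORF identifiers are invented.

```python
from collections import Counter, defaultdict


def weighted_labels(hits, e_min):
    """Steps 3-4: keep enriched biclusters and accumulate a weight per (ORF, category)."""
    weights = defaultdict(float)
    for category, enrichment, orfs in hits:
        if enrichment < e_min:  # step 3: discard weakly enriched biclusters
            continue
        for orf in orfs:
            # ASSUMPTION: the label weight is the sum of bicluster enrichments
            weights[(orf, category)] += enrichment
    return weights


def select_labels(weights, f_min):
    """Step 5: keep (ORF, category) labels whose weight reaches the threshold."""
    return {label for label, weight in weights.items() if weight >= f_min}


def consistent_labels(per_dataset_hits, e_min, f_min, min_datasets=2):
    """Step 6: return labels selected in at least `min_datasets` datasets."""
    counts = Counter()
    for hits in per_dataset_hits:
        counts.update(select_labels(weighted_labels(hits, e_min), f_min))
    return {label for label, n in counts.items() if n >= min_datasets}


# Invented toy input: two datasets, category codes as strings.
dataset_1 = [("11", 0.8, {"ORF_A", "ORF_B"}), ("11", 0.7, {"ORF_A"}), ("02", 0.3, {"ORF_C"})]
dataset_2 = [("11", 0.9, {"ORF_A"}), ("11", 0.6, {"ORF_A"}), ("02", 0.9, {"ORF_B"})]

result = consistent_labels([dataset_1, dataset_2], e_min=0.5, f_min=1.0)
print(result)
```

In the toy input only the label ("ORF_A", "11") is supported by enough enriched biclusters in both datasets, so it is the only one returned.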
Comparative cross validation of BALBOA with majority & unanimous voting kNN.
| Metabolism (01) | 0.46 | 0.43 | 0.50 | 0.00 | 0.01 | |
| Energy (02) | 0.48 | 0.15 | 0.01 | 0.57 | 0.05 | |
| Cell Cycle (10) | 0.48 | 0.41 | 0.86 | 0.03 | 0.02 | |
| Transcription (11) | 0.47 | 0.36 | 0.13 | 0.00 | 0.03 | |
| Protein Synthesis (12) | 0.51 | 0.52 | 0.97 | 0.24 | 0.08 | |
| Protein Fate (14) | 0.38 | 0.26 | 0.01 | 0.91 | 0.02 | |
| Transp. Elements (38) | 0.29 | 0.57 | 0.25 | 0.09 | 0.43 | |
| Cell Fate (43) | 0.30 | 0.09 | 0.00 | 0.00 | 0.05 | |
| Mean | 0.42 | 0.35 | 0.53 | 0.05 | 0.09 | |
| Metabolism (01) | 0.47 | 0.50 | 0.85 | 0.03 | 0.04 | |
| Energy (02) | 0.59 | 0.16 | 0.03 | 0.76 | 0.08 | |
| Cell Cycle (10) | 0.49 | 0.24 | 0.01 | 0.57 | 0.03 | |
| Transcription (11) | 0.44 | 0.43 | 0.62 | 0.01 | 0.03 | |
| Protein Synthesis (12) | 0.52 | 0.47 | 0.95 | 0.18 | 0.05 | |
| Protein Fate (14) | 0.39 | 0.24 | 0.00 | 0.47 | 0.02 | |
| Prot. Bind. Func. (16) | 0.29 | 0.14 | 0.00 | 0.00 | 0.01 | |
| Cell Transport (20) | 0.36 | 0.20 | 0.02 | 0.50 | 0.03 | |
| Transp. Elements (38) | 0.38 | 0.48 | 0.00 | 0.00 | 0.47 | |
| Biogen. Cell. Comp. (42) | 0.32 | 0.15 | 0.41 | 0.01 | 0.04 | |
| Mean | 0.43 | 0.31 | 0.60 | 0.03 | 0.09 | |
| Metabolism (01) | 0.46 | 0.49 | 0.01 | 0.83 | 0.03 | |
| Energy (02) | 0.52 | 0.20 | 0.05 | 0.83 | 0.09 | |
| Cell Cycle (10) | 0.46 | 0.29 | 0.00 | 0.58 | 0.03 | |
| Transcription (11) | 0.45 | 0.37 | 0.64 | 0.00 | 0.06 | |
| Protein Synthesis (12) | 0.50 | 0.55 | 0.95 | 0.31 | 0.17 | |
| Protein Fate (14) | 0.40 | 0.29 | 0.03 | 0.93 | 0.01 | |
| Cell Transport (20) | 0.36 | 0.20 | 0.25 | 0.00 | 0.01 | |
| Transp. Elements (38) | 0.52 | 0.00 | 0.15 | 0.75 | 0.50 | |
| Biogen. Cell. Comp. (42) | 0.33 | 0.12 | 0.38 | 0.00 | 0.03 | |
| Mean | 0.48 | 0.34 | 0.64 | 0.06 | 0.10 | |
The highest precisions for each MIPS category in each dataset evaluated are shown in bold. BALBOA achieves the highest mean precision over all MIPS categories in each dataset. P = Precision, R = Recall.
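For readers unfamiliar with the baselines, the two kNN voting rules and the per-category precision and recall reported in the table can be sketched as below. The value of k, the distance measure and the exact tie-handling used in the paper are not stated here, so this sketch assumes the neighbour labels have already been retrieved and that "majority" means a strict majority. Function names are illustrative.

```python
from collections import Counter


def majority_vote(neighbour_labels):
    """Assign the most common label if it is held by more than half the neighbours."""
    label, count = Counter(neighbour_labels).most_common(1)[0]
    return label if count > len(neighbour_labels) / 2 else None


def unanimous_vote(neighbour_labels):
    """Assign a label only when every neighbour agrees; otherwise abstain."""
    return neighbour_labels[0] if len(set(neighbour_labels)) == 1 else None


def precision_recall(predicted, actual, category):
    """Precision and recall for one category; None in `predicted` means no label."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(majority_vote(["11", "11", "12"]))   # "11"
print(unanimous_vote(["11", "11", "12"]))  # None
```

Unanimous voting abstains more often than majority voting, which typically trades recall for precision.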
BALBOA annotation of unclassified ORFs that are consistent over two or more datasets.
| YJR154W | Metabolism: Amino Acid (01.01) | Putative protein, unknown function. GFP-fusion protein localizes to cytoplasm. | Similarity ( |
| YDL072C (YET3) | Energy: Respiration (02.13) | Null mutant has decreased level of secreted invertase (enables respiration of sucrose). | Human BAP31 homolog. |
| YGR149W | Energy: Respiration (02.13) | Putative protein, unknown function | Predicted integral membrane protein. |
| YCR072C (RSA4) | Transcription: rRNA (11.04.01) | Recently verified by MIPS – ribosomal biogenesis. | |
| YDL167C (NRP1) | Transcription: rRNA (11.04.01) | Role in ribosome biogenesis and assembly (RCA). | |
| YMR259C | Transcription: rRNA (11.04.01)/ | Putative protein, unknown function; GFP-fusion protein localizes to the cytoplasm. | |
| YNL022C | Transcription: rRNA (11.04.01) | Putative protein of unknown function. GFP-fusion protein localizes to a single spot in the nucleus. | Similarity ( |
| YDR361C (BCP1) | Transcription (11)/ | Associated with RPL23a & RPL23b (ribosomal subunits) in affinity capture experiments. | |
| YJL122W (ALB1) | Transcription (11)/ | Shuttling pre-60S factor; involved in the biogenesis of ribosomal large subunit. | |
| YJR003C | Transcription: rRNA (11.04.01)/ | Putative protein. Detected in purified mitochondria. Role in ribosome biogenesis and assembly (RCA). | |
| YLR196W (PWP1) | Ribosomal Proteins (12.01.01)/Transcription (11) | Protein with WD-40 repeats involved in rRNA processing. | |
| YER049W (TPA1) | Ribosomal Proteins (12.01.01)/Cellular Transport (20) | Interacts with Sup45p (eRF1) and Sup35p (eRF3) and Pab1p; role in translation termination efficiency. | |
| YDR282C | Transported Compounds(20.01)/ | Putative protein of unknown function. | Similarity ( |
| YGR266W | Transport Routes (20.09) | Protein of unknown function. Localizes to mitochondrial outer membrane and plasma membrane | Predicted to have single trans-membrane domain. |
| YIL039W | Transport Routes (20.09) | GFP-fusion localizes to the ER. Deletion confers sensitivity to GSAO (angiogenesis inhibitor drug). | |
| YOR175C | Transport Routes (20.09) | Protein of unknown function. Co-purification with Ribosomes. | Member of MBOAT putative membrane bound O-acyltransferases. |
| YPL105C | Transport Routes (20.09)/ | Protein of unknown function. Co-purification with both Ribosomes & mitochondria. | |
| YIL060W | Transposable Elements (38) | Putative protein of unknown function. Mutant accumulates less glycogen than does wild type. | Similarity ( |
| YJR030C | Transposable Elements (38) | Putative protein of unknown function. Expression repressed in carbon limited cultures | Similar to YJL181w (cell cycle regulator) & MBP-1 binding site (cell cycle). |
| YIL157C | Biogenesis of Cellular Components: Mitochondrion (42.16) | Detected in Co-purified mitochondria. Null mutant is defective in cytochrome oxidase. | |
| YML030W | Biogenesis of Cellular Components: Mitochondrion (42.16) | Putative protein of unknown function; GFP-fusion protein localizes to mitochondria. | |
Unclassified ORFs consistently annotated over two or more datasets. Additional labels in italics are from one dataset only. GFP = Green Fluorescent Protein; RCA = Reviewed Computational Analysis.
Figure 3. Illustration of the steps in the semi-supervised functional module discovery process. This figure shows the various steps of the functional module discovery algorithm. This algorithm is related to the BALBOA algorithm and begins in the same manner, with the dataset splitting in step 1 and the bicluster analysis of the annotated genes only in step 2. In step 3, however, all biclusters are selected for analysis. In step 4 dataset cross validation is carried out to establish groups of ORFs that are consistently grouped together by biclusters over the three datasets. In step 5 these consistently grouped unclassified ORFs are returned as predicted functional modules, where the function may be inferred from the enrichment and significance of the dominant functional class in the classifying biclusters.
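One way to picture steps 4 and 5 of this caption is as a search for ORF pairs that are co-grouped in every dataset, followed by merging those pairs into modules. The sketch below is only an assumed operationalisation of "consistently grouped together": it treats two ORFs as co-grouped in a dataset when at least one bicluster classifies both, keeps pairs found in all datasets, and reports connected components as modules. The input format, function names and ORF identifiers are hypothetical.

```python
from itertools import combinations


def co_grouped_pairs(bicluster_orf_sets):
    """All ORF pairs classified together by at least one bicluster in a dataset."""
    pairs = set()
    for orfs in bicluster_orf_sets:
        pairs.update(combinations(sorted(orfs), 2))
    return pairs


def predicted_modules(per_dataset_biclusters):
    """Connected components of the ORF pairs co-grouped in every dataset."""
    shared = set.intersection(*(co_grouped_pairs(b) for b in per_dataset_biclusters))
    neighbours = {}
    for a, b in shared:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, modules = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over consistently co-grouped ORFs
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        modules.append(component)
    return modules


# Invented toy input: the ORF sets classified by each bicluster in three datasets.
dataset_1 = [{"A", "B", "C"}, {"D", "E"}]
dataset_2 = [{"A", "B"}, {"B", "C"}, {"D", "E", "F"}]
dataset_3 = [{"A", "B", "C", "D"}, {"D", "E"}]

modules = predicted_modules([dataset_1, dataset_2, dataset_3])
print(modules)
```

Using connected components is a deliberately permissive choice; a stricter variant could require every pair within a module to be co-grouped in all datasets.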
Predicted functional modules of unclassified ORFs.
| YBR271W | Localizes to the cytoplasm | S-adenosylmethionine-dependent methyltransferase. |
| YCR016W | YGL120C (PRP43) RNA helicase/maturation of rRNA, YPR135W (CTF4) Chromatin-associated protein | |
| YDL063C | YPL131W (RPL5-Protein of (60S) ribosomal subunit) | GO ribosome biogenesis & assembly (RCA) |
| YDL167C (NRP1) | GO ribosome biogenesis & assembly (RCA) | |
| YDR361C (BCP1) | Protein component of the large (60S) | GO ribosomal large subunit export from nucleus (IMP); Export of Mss4p lipid kinase |
| YGR187C (HGH1) | YDR188W (CCT6-Chaperonin Containing TCP-1) | GO ribosome biogenesis & assembly (RCA) |
| YIL064W | GO ribosome biogenesis & assembly (RCA). S-adenosylmethionine-dependent methyltransferase | |
| YIL096C | YLR009W (RLP24) 60S ribosomal subunit biogenesis | GO ribosome biogenesis & assembly (RCA) |
| YIL110W | Putative S-adenosylmethionine-dependent methyltransferase | |
| YIL127C | Localizes to the nucleolus | GO ribosome biogenesis & assembly (RCA) |
| YJR003C | Detected in purified mitochondria | GO ribosome biogenesis & assembly (RCA) |
| YLR051C (FCF2) | Essential nucleolar protein, 35S rRNA processing | |
| YLR196W (PWP1) | YPL131W (RPL5-Protein of (60S) ribosomal subunit), YDR188W (CCT6-Chaperonin Containing TCP-1) | GO rRNA processing (IMP, ISS) |
| YLR287C | YPR135W (CTF4) Chromatin-associated protein | |
| YOL022C | Null mutant accumulates 20S pre-rRNA | |
| YOR021C (TSR4) | GO ribosome biogenesis & assembly (RCA) | |
| YOR154W (SLP1) | SUN like protein | |
| YOR252W (TMA16) | YLR009W (RLP24) 60S ribosomal subunit biogenesis, YGL120C (PRP43) RNA helicase/maturation of rRNA | |
| YPL183C | Negative regulation of transposition, RNA-mediated (IMP) | |
| YBL112C | Contained within telomere TEL02L, TEL02L-YP | |
| YEL076C | Contained within telomere TEL05L, TEL05L-YP, YEL076C-A | |
| YER189W | Contained within telomere TEL05R, TEL05R-YP | |
| YHR219W | Putative protein of unknown function with similarity to helicases; Contained within telomere TEL08R, TEL08R-YP | |
| YDR493W | Null mutant displays decreased frequency of mitochondrial genome loss | |
| YKL137W | Mutation results in growth defect on a non-fermentable (respiratory) carbon source | |
| YLR204W | Mitochondrial inner membrane protein | |
| YLR218C | Growth defects on a non-fermentable carbon source | |
| YML030W | Localizes to mitochondria | Null mutant is viable & displays decreased frequency of mitochondrial genome loss |
| YMR157C | Detected in purified mitochondria | Displays increased frequency of mitochondrial genome loss |
Predicted functional modules of unclassified ORFs. Genetic/physical interaction and functional evidence supporting predicted functional modules. GO = Gene Ontology; RCA = Reviewed Computational Analysis.
Figure 4. Predicted functional module supported by biclusters enriched with Transcription: Ribosomal RNA processing (11.04.01). This figure shows the largest group of unclassified ORFs that were consistently classified together by biclusters significantly enriched for Transcription: Ribosomal RNA processing (11.04.01) in three independent datasets. The 19 unclassified ORFs in this predicted functional module correlate tightly over a subset of 30 sample features in the (a) Eisen, (b) Gasch and (c) Hughes expression datasets.
Figure 5. Predicted functional module supported by biclusters enriched with DNA Topology (10.01.02). This figure shows a group of unclassified ORFs that were consistently classified together by biclusters significantly enriched for DNA Topology (10.01.02) in three independent datasets. The 4 unclassified ORFs in this predicted functional module correlate tightly over a subset of 30 sample features in the (a) Eisen, (b) Gasch and (c) Hughes expression datasets.
Figure 6. Predicted functional module supported by biclusters enriched with Mitochondrial (42.16) & Ribosomal Proteins (12.01.01). This figure shows a group of unclassified ORFs that were consistently classified together by biclusters significantly enriched for Mitochondrial (42.16) & Ribosomal Proteins (12.01.01) in two independent datasets. The 6 unclassified ORFs in this predicted functional module correlate tightly over a subset of 30 sample features in the (a) Gasch and (b) Hughes expression datasets.
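The statement that a module's ORFs "correlate tightly over a subset of sample features" can be checked numerically with something like the mean pairwise Pearson correlation restricted to the chosen columns. The following is a hedged sketch, not the measure used in the paper: the expression profiles, ORF names and column indices are invented, and Pearson correlation is an assumed choice of similarity.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_pairwise_correlation(profiles, columns):
    """Average Pearson correlation over all ORF pairs, using only `columns`."""
    restricted = [[row[c] for c in columns] for row in profiles.values()]
    pairs = list(combinations(restricted, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Invented profiles: the ORFs agree on samples 0-2 but not on sample 3.
profiles = {
    "ORF_A": [1.0, 2.0, 3.0, 9.0],
    "ORF_B": [2.0, 4.0, 6.0, 0.0],
    "ORF_C": [0.0, 1.0, 2.0, 5.0],
}

on_subset = mean_pairwise_correlation(profiles, [0, 1, 2])
on_all = mean_pairwise_correlation(profiles, [0, 1, 2, 3])
print(on_subset, on_all)
```

The toy data show why biclustering restricts attention to a sample subset: the three profiles agree closely on the first three samples, and adding the fourth lowers the average correlation.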