Nicholas Mitsakakis, Zak Razak, Michael Escobar, J Timothy Westwood.
Abstract
BACKGROUND: While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist, and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross-validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. Approximately 5043 different genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (or 37%) of these genes are un-annotated.
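As a rough illustration of the stratified cross-validation mentioned in the abstract, the sketch below assigns examples to folds so that each fold keeps a similar ratio of positive to negative labels. The function name `stratified_folds`, the toy labels, and the fold count are assumptions made for this example; the abstract does not specify how the folds were built.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each example index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy labels (hypothetical): 1 = gene annotated with a category, 0 = not.
labels = [1] * 6 + [0] * 24
folds = stratified_folds(labels, k=3)
```

With these toy labels, each of the three folds receives two positive and eight negative examples, so a rare category is still represented in every fold.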
Year: 2013 PMID: 23547736 PMCID: PMC3669044 DOI: 10.1186/1756-0381-6-8
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. Flowchart Describing the Probability Estimation Procedure using SVM and the Sigmoid Fitting Function. First, SVM is trained using dataset A (SVM training set). Then, classification predictions (in the form of discriminant values) for dataset B (sigmoid training or tuning set) are generated. Those predictions along with the known labels of B are used for the fitting of the sigmoid function. Finally, classification results for dataset C (test set) are mapped to estimated class membership probabilities using the fitted sigmoid.
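The sigmoid-fitting and mapping steps of this flowchart can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SVM discriminant values for datasets B and C are taken as given (hypothetical numbers), the sigmoid is parameterized as p = 1 / (1 + exp(-(a·f + b))), and the parameters are fitted by plain gradient descent on the log loss. Refinements such as the regularized targets used in Platt's original method are omitted.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_sigmoid(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical discriminant values for tuning set B with known labels (1 = in category).
scores_b = [-2.1, -1.4, -0.9, -0.3, 0.2, 0.4, 1.1, 1.8, -0.1, 0.6]
labels_b = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
a, b = fit_sigmoid(scores_b, labels_b)

# Map hypothetical discriminant values for test set C to estimated probabilities.
scores_c = [-1.5, 0.0, 1.5]
probs_c = [sigmoid(a * s + b) for s in scores_c]
```

Fitting the sigmoid on a set separate from the SVM training set, as the caption describes, keeps the probability estimates from being biased by discriminant values the SVM has already been optimized on.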
Table 1. Selected GO‐BP categories with high precision values
| GO‐BP category | Number of annotated genes | Precision at recall 0.4 |
| Detection of external stimulus GO:0009581 | 36 | 1.000 |
| | 28 | 1.000 |
| Rhodopsin mediated phototransduction GO:0009586 | 14 | 1.000 |
| Detection of stimulus GO:0051606 | 38 | 0.964 |
| | 25 | 0.958 |
| | 21 | 0.938 |
| Phototransduction GO:0007602 | 25 | 0.938 |
| | 20 | 0.929 |
| | 68 | 0.919 |
| ATP synthesis coupled electron transport sensu Eukaryota GO:0042775 | 22 | 0.917 |
| | 25 | 0.909 |
| ATP synthesis coupled electron transport GO:0042773 | 22 | 0.906 |
| | 21 | 0.900 |
| | 16 | 0.900 |
| Cell substrate adhesion GO:0031589 | 25 | 0.875 |
| | 16 | 0.854 |
| | 20 | 0.850 |
| Aerobic respiration GO:0009060 | 31 | 0.845 |
| Response to light stimulus GO:0009416 | 33 | 0.839 |
| | 27 | 0.833 |
| | 12 | 0.833 |
| | 19 | 0.833 |
| | 20 | 0.833 |
| | 33 | 0.817 |
| Glucosamine metabolism GO:0006041 | 27 | 0.813 |
| N acetylglucosamine metabolism GO:0006044 | 27 | 0.813 |
| SRP dependent cotranslational protein targeting to membrane GO:0006614 | 12 | 0.813 |
| | 14 | 0.800 |
| | 20 | 0.792 |
| DNA amplification GO:0006277 | 19 | 0.788 |
| | 18 | 0.783 |
| | 53 | 0.776 |
| Response to radiation GO:0009314 | 37 | 0.774 |
| | 18 | 0.771 |
| | 13 | 0.771 |
| | 14 | 0.767 |
| Protein targeting to membrane GO:0006612 | 12 | 0.750 |
| | 12 | 0.750 |
| | 12 | 0.750 |
Each of these 39 categories has a fold‐average precision‐at‐40 (i.e. the precision that corresponds to a recall value of 0.4) equal to or larger than 0.75. The number of genes in the dataset annotated with each category is also shown. Bold fonts are used for the 24 GO‐BP categories that show minimal redundancy with genes found in other categories.
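To make the precision‐at‐40 measure concrete, the sketch below computes, for a single fold, the precision at the first ranked cutoff whose recall reaches 0.4. The function name `precision_at_recall` and the scores and labels are hypothetical, and tied scores are not treated specially; the source does not describe how the authors handled such details.

```python
def precision_at_recall(scores, labels, recall_level=0.4):
    """Precision at the first ranked cutoff whose recall reaches recall_level."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive examples")
    # Rank predictions from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= recall_level:
            return true_pos / rank
    raise ValueError("recall_level must be at most 1")

# Hypothetical fold: 5 of the 10 scored genes truly belong to the category.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
p40 = precision_at_recall(scores, labels)
```

Here recall first reaches 0.4 at the third-ranked gene, where two of the three top-ranked genes are true members, so `p40` is 2/3. The fold‐average value would then be the mean of this quantity over the cross-validation folds.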
Figure 2. Precision‐recall Plots for Two of the GO Biological Process Categories. Plots show the fold‐average precision that corresponds to each recall value for the (A) DNA‐dependent DNA replication and (B) oxidative phosphorylation GO categories. The vertical averaging method was used.
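Vertical averaging, as commonly described for ROC and precision‐recall curves, fixes a set of recall values and averages the precision of the per-fold curves at each of them. The sketch below illustrates that idea under assumptions: each fold's curve is given as hypothetical (recall, precision) points sorted by recall, and the precision at a grid value is read from the first point that reaches it. The exact interpolation rule used for the figure is not stated in the caption.

```python
def vertical_average(curves, recall_grid):
    """Average precision across curves at each fixed recall value."""
    averaged = []
    for r in recall_grid:
        # For each curve, use the precision at the first point reaching recall r.
        values = [next(p for rec, p in curve if rec >= r) for curve in curves]
        averaged.append((r, sum(values) / len(values)))
    return averaged

# Hypothetical per-fold curves as (recall, precision) points, sorted by recall.
fold_1 = [(0.25, 1.0), (0.5, 2 / 3), (0.75, 0.75), (1.0, 0.8)]
fold_2 = [(1 / 3, 1.0), (2 / 3, 1.0), (1.0, 0.6)]
avg_curve = vertical_average([fold_1, fold_2], recall_grid=[0.4, 0.8])
```

Because folds contain different numbers of positive genes, their curves are sampled at different recall values; averaging on a shared recall grid is what makes the fold curves comparable.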
Figure 3. Developmental Transcription Profiles for Two of the GO Biological Process Categories. The gene expression profiles of two of the GO‐BP categories, (A) DNA‐dependent DNA replication and (B) oxidative phosphorylation, are shown. The top portion of each category figure contains the expression profiles throughout development for each gene in the category, in the same order as they appear in Table 2 or Table 3, respectively. The bottom portion of each category figure shows an equal number of genes randomly selected from the expression profiles in the entire data set. Red denotes genes with up‐regulated transcription at a given time point and green denotes down‐regulated genes. The scale at the top of the figure indicates the degree of up‐ (in red) and down‐ (in green) regulation (in fold change).
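Table 2. Gene Lists and Fluorescent In Situ Hybridization (FISH) analysis for DNA‐dependent DNA replication GO‐BP category
| CG number | cDNA clone | FISH | Gene‐precision | Average SVM probability | CG number | cDNA clone | FISH | Gene‐precision | Average SVM probability |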
| CG5924 | LD38710 | | 0.923 | 0.705 | CG1109 | LD26389 | | 0.798 | 0.592 |
| CG1109 | LD27350 | | 0.923 | 0.736 | CG8290 | LD37351 | + | 0.798 | 0.575 |
| CG7663 | LD46979 | + | 0.923 | 0.724 | CG2910 | GH11110 | | 0.798 | 0.598 |
| CG7384 | LD46023 | | 0.923 | 0.76 | CG10364 | LD32040 | - | 0.798 | 0.58 |
| CG14464 | LD29015 | | 0.923 | 0.707 | CG5877 | LD29352 | | 0.79 | 0.572 |
| CG16892 | LD26813 | + | 0.923 | 0.728 | CG10625 | LD39545 | - | 0.79 | 0.568 |
| CG1578 | LD28359 | + | 0.923 | 0.755 | CG17509 | GH12788 | | 0.787 | 0.568 |
| CG16892 | LD42637 | | 0.923 | 0.739 | CG11409 | LD40802 | | 0.787 | 0.607 |
| CG11122 | LD29040 | | 0.86 | 0.644 | CG30007 | LD29335 | | 0.787 | 0.58 |
| CG9300 | LD21924 | | 0.86 | 0.666 | CG17681 | LD30009 | | 0.787 | 0.573 |
| CG3287 | SD03445 | | 0.86 | 0.643 | CG18622 | LD26416 | | 0.787 | 0.539 |
| CG1960 | GH21591 | | 0.86 | 0.66 | CG31152 | LD29477 | + | 0.787 | 0.545 |
| CG1024 | LD28076 | | 0.86 | 0.652 | CG11990 | LD47989 | + | 0.787 | 0.549 |
| CG13096 | SD03546 | | 0.86 | 0.664 | CG6724 | LD40657 | | 0.783 | 0.589 |
| CG11596 | LD45925 | | 0.86 | 0.682 | CG32069 | LD47413 | | 0.783 | 0.565 |
| CG4857 | LD29423 | + | 0.86 | 0.638 | CG2962 | LD27487 | | 0.783 | 0.578 |
| CG4949 | LD46305 | + | 0.86 | 0.669 | CG6049 | LD27763 | | 0.776 | 0.558 |
| CG11943 | SD04935 | | 0.86 | 0.703 | CG2260 | LD30339 | | 0.772 | 0.563 |
| CG2469 | LD30285 | + | 0.86 | 0.677 | CG3735 | LD35854 | | 0.771 | 0.553 |
| CG11596 | LD42227 | + | 0.839 | 0.66 | CG7110 | LD39933 | - | 0.771 | 0.577 |
| CG12785 | LD27528 | | 0.839 | 0.616 | CG12202 | LD30511 | + | 0.771 | 0.584 |
| CG11329 | LD26217 | | 0.839 | 0.619 | CG9591 | LD26057 | + | 0.771 | 0.554 |
| CG6066 | LD27582 | | 0.839 | 0.621 | CG12340 | LD26050 | | 0.771 | 0.552 |
| CG17050 | LD35611 | | 0.839 | 0.647 | CG30020 | LD40262 | | 0.771 | 0.549 |
| CG18004 | LD27741 | | 0.839 | 0.662 | CG12050 | LD30416 | + | 0.771 | 0.561 |
| CG1647 | LD30287 | | 0.839 | 0.591 | CG6151 | LD28933 | | 0.766 | 0.567 |
| CG31697 | SD02518 | | 0.839 | 0.618 | CG14657 | LD28447 | | 0.766 | 0.556 |
| CG15736 | LD33780 | + | 0.83 | 0.619 | CG4203 | LD29184 | | 0.761 | 0.537 |
| CG2691 | LD46946 | | 0.829 | 0.61 | CG4281 | SD03946 | | 0.76 | 0.548 |
| CG7728 | LD39680 | + | 0.812 | 0.592 | CG14005 | LD30293 | | 0.76 | 0.566 |
| CG31163 | SD09611 | + | 0.812 | 0.598 | CG9028 | LD27194 | + | 0.76 | 0.555 |
| CG3680 | LD27862 | | 0.81 | 0.63 | CG7824 | LD26655 | | 0.76 | 0.542 |
| CG3362 | LD28544 | | 0.81 | 0.595 | CG7407 | LD29166 | - | 0.758 | 0.548 |
| NA ∗ | LD42550 | | 0.81 | 0.612 | CG3338 | LD27356 | + | 0.756 | 0.524 |
| CG11906 | LD27134 | | 0.798 | 0.566 |
∗CG number not known
The DNA‐dependent DNA replication GO‐BP category (along with oxidative phosphorylation) was selected for further biological validation using fluorescent in situ hybridization (FISH) analysis of early D. melanogaster embryos. In this table, genes that are predicted to belong to this category are shown along with information regarding their presence in the FISH database. Each of the genes (i.e. CG numbers) in the list was searched in a D. melanogaster FISH database to visually examine the spatial and temporal mRNA expression for that gene during early embryogenesis. Genes that are marked with a (+) or (-) sign had images present in the database. Those whose patterns largely matched that of known genes in the category are marked with a (+). If either their temporal or their spatial pattern did not match the known gene pattern, they are marked with a (-). For each gene, the “gene‐precision” score and the average probability estimate output from the SVM are reported.
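Table 3. Gene Lists and Fluorescent In Situ Hybridization (FISH) analysis for oxidative phosphorylation GO category
| CG number | cDNA clone | FISH | Gene‐precision | Average SVM probability | CG number | cDNA clone | FISH | Gene‐precision | Average SVM probability |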
| CG12934 | LP05346 | | 1 | 0.814 | CG5608 | LD32461 | | 0.898 | 0.537 |
| CG1715 | LD33960 | + | 1 | 0.81 | CG8401 | GH01937 | | 0.897 | 0.548 |
| CG33316 | SD08735 | | 0.975 | 0.656 | CG6094 | GH26345 | | 0.897 | 0.553 |
| CG5523 | GH14535 | | 0.975 | 0.754 | CG9813 | GH04365 | | 0.892 | 0.522 |
| CG10675 | GH14673 | | 0.975 | 0.79 | CG13220 | GH06079 | + | 0.885 | 0.538 |
| CG9921 | GH07174 | | 0.975 | 0.714 | CG9056 | GH11503 | | 0.885 | 0.559 |
| CG8486 | GH04578 | | 0.975 | 0.594 | CG4577 | GH23863 | | 0.875 | 0.535 |
| CG8086 | GH25625 | | 0.975 | 0.752 | CG5325 | GM14611 | | 0.871 | 0.539 |
| CG10075 | GH25609 | + | 0.975 | 0.679 | CG7710 | LP03578 | | 0.871 | 0.544 |
| CG30116 | GH04922 | | 0.975 | 0.677 | CG14125 | GH07601 | | 0.868 | 0.502 |
| CG5532 | GH01442 | + | 0.975 | 0.662 | CG1859 | GH26443 | | 0.868 | 0.52 |
| CG12239 | GH14380 | | 0.975 | 0.66 | CG5325 | GH03076 | | 0.829 | 0.505 |
| CG8740 | GH05582 | + | 0.975 | 0.658 | CG1135 | GH01794 | | 0.829 | 0.489 |
| CG18616 | GH04932 | + | 0.975 | 0.737 | CG4757 | SD01814 | | 0.826 | 0.496 |
| CG3420 | GH11502 | + | 0.975 | 0.682 | CG5989 | GH26459 | + | 0.826 | 0.481 |
| CG15669 | GH02495 | | 0.975 | 0.629 | CG3153 | GH04701 | | 0.812 | 0.468 |
| CG11203 | GH26638 | | 0.975 | 0.691 | CG1927 | GH11112 | | 0.812 | 0.462 |
| CG6044 | GH12587 | | 0.975 | 0.61 | CG7217 | LD45324 | | 0.812 | 0.457 |
| CG5903 | GH13386 | - | 0.975 | 0.768 | CG6123 | GH13094 | | 0.812 | 0.47 |
| CG14823 | GH02020 | | 0.975 | 0.656 | CG2269 | GH06015 | - | 0.809 | 0.48 |
| CG13367 | GH14959 | | 0.975 | 0.7 | CG4589 | LP05955 | | 0.807 | 0.489 |
| CG6424 | GH08256 | | 0.975 | 0.743 | CG14438 | GH25521 | | 0.8 | 0.47 |
| CG7083 | GH27162 | | 0.947 | 0.68 | CG8206 | GH04557 | | 0.794 | 0.437 |
| CG3631 | LD29155 | + | 0.925 | 0.655 | CG9336 | GH22472 | | 0.794 | 0.453 |
| CG4281 | GH10944 | + | 0.925 | 0.604 | CG7570 | GH27163 | | 0.794 | 0.459 |
| CG12706 | GH14695 | | 0.925 | 0.648 | CG10973 | LD28549 | | 0.792 | 0.452 |
| CG10249 | GH11802 | | 0.925 | 0.56 | CG7506 | GH02466 | | 0.788 | 0.467 |
| CG14292 | GH14813 | | 0.912 | 0.592 | CG6455 | GH04666 | | 0.786 | 0.463 |
| CG4972 | GH14975 | | 0.912 | 0.592 | CG17828 | GH04984 | + | 0.778 | 0.432 |
| CG32795 | HL08104 | - | 0.912 | 0.532 | CG15765 | GH28601 | | 0.778 | 0.447 |
| CG4975 | GH18454 | | 0.912 | 0.638 | CG10585 | GH23839 | + | 0.775 | 0.432 |
| CG6550 | GH28477 | + | 0.912 | 0.609 | CG10039 | GH11404 | | 0.775 | 0.436 |
| CG3971 | GH11554 | + | 0.912 | 0.549 | CG14817 | GH01621 | | 0.771 | 0.433 |
| CG6659 | LD45943 | + | 0.912 | 0.578 | CG17666 | GH08313 | | 0.762 | 0.42 |
| CG15386 | GH19557 | | 0.906 | 0.516 | CG11737 | GH22337 | + | 0.761 | 0.439 |
| CG1231 | GH01151 | | 0.906 | 0.582 | CG7358 | GH14795 | | 0.759 | 0.436 |
| CG15067 | GH14961 | | 0.906 | 0.545 | CG5773 | GH07612 | | 0.758 | 0.432 |
| CG6008 | GH05862 | | 0.898 | 0.557 |
This table contains the genes that were predicted to belong to the oxidative phosphorylation GO‐BP category along with information regarding their presence in the FISH database. Notation is similar to Table 2.
Figure 4. The Proportion of Genes that are Annotated, Un‐annotated, and Predicted by the SVM to Belong to a GO‐BP Category. Genes from the developmental time course data set that had no GO‐BP annotation were broken down into genes that were predicted with high confidence to belong to a GO‐BP category (predicted), and those that had low prediction values (not predicted).
Figure 5. Fluorescent In Situ Hybridization (FISH) Images of Annotated and Un‐annotated Gene mRNAs. One annotated gene and one un‐annotated gene from each of the two GO‐BP categories shown in Tables 2 and 3 were chosen, and FISH images for those genes were retrieved from the FlyFISH database. Four different stage categories of early embryogenesis are shown. Green fluorescence represents the mRNA localization pattern for that transcript/gene and red fluorescence shows the position of nuclei within the organism at that stage. CG1584 = Orc6, Origin recognition complex subunit 6. Its molecular function is described as DNA binding and it is involved in the biological processes DNA replication initiation, DNA‐dependent DNA replication, and chromatin silencing. CG1970 = NADH dehydrogenase (ubiquinone) activity. It is involved in the biological process mitochondrial electron transport.