Brandon M Malone, Andy D Perkins, Susan M Bridges.
Abstract
BACKGROUND: This paper presents a framework for integrating disparate data sets to predict gene function. The algorithm constructs a graph, called an integrated similarity graph, by computing similarities based upon both gene expression and textual phenotype data. This integrated graph is then used to make predictions about whether individual genes should be assigned a particular annotation from the Gene Ontology.
Year: 2009 PMID: 19811686 PMCID: PMC3226192 DOI: 10.1186/1471-2105-10-S11-S20
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Pseudocode for annotation prediction algorithm. The algorithm predicts whether gene g should be annotated with annotation a. The algorithm consists of four key steps. First, threshold_low is calculated; it is the lowest total similarity of any gene known to have a to all other genes known to have a. Next, threshold_high is calculated; it is the highest total similarity of any gene known not to have a to all genes known to have a. Then, total_similarity of g to all genes known to have a is calculated. Finally, the prediction is made. If total_similarity exceeds threshold_high, then g is always predicted to have annotation a. If total_similarity is less than threshold_low, then g is never predicted to have annotation a. If total_similarity falls between threshold_low and threshold_high, then it is linearly interpolated between the two thresholds to produce a number between 0 and 1. Specifically, the formula for the linear interpolation is interpolated_sim = (total_similarity - threshold_low) / (threshold_high - threshold_low). A predefined cutoff, such as 0.5, is then used to decide whether to assign the annotation to gene g. Thus, if cutoff = 0.5 and interpolated_sim = 0.6 for gene g and annotation a, then gene g would be predicted to have annotation a.
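The decision rule described in the Figure 1 caption can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function names `interpolated_sim` and `predict_annotation` are hypothetical, the interpolation is assumed to be the usual linear form (total_similarity - threshold_low) / (threshold_high - threshold_low), and the handling of coinciding thresholds is an assumption, since the caption does not cover that case.

```python
def interpolated_sim(total_similarity, threshold_low, threshold_high):
    """Map total_similarity onto [0, 1] relative to the two thresholds."""
    if total_similarity > threshold_high:
        return 1.0
    if total_similarity < threshold_low:
        return 0.0
    span = threshold_high - threshold_low
    if span <= 0:
        # Assumption: when the thresholds coincide, treat the gene as
        # sitting on the upper threshold. The caption does not define this.
        return 1.0
    return (total_similarity - threshold_low) / span


def predict_annotation(total_similarity, threshold_low, threshold_high,
                       cutoff=0.5):
    """Return True if gene g would be predicted to have annotation a."""
    if total_similarity > threshold_high:
        return True   # always assigned above the upper threshold
    if total_similarity < threshold_low:
        return False  # never assigned below the lower threshold
    score = interpolated_sim(total_similarity, threshold_low, threshold_high)
    return score > cutoff


# Hypothetical values chosen to mirror the caption's example
# (interpolated similarity of about 0.6 against a cutoff of 0.5).
example_score = interpolated_sim(3.2, threshold_low=2.0, threshold_high=4.0)
example_call = predict_annotation(3.2, threshold_low=2.0, threshold_high=4.0)
print(round(example_score, 3), example_call)
```

The two early returns in `predict_annotation` keep the "always" and "never" cases independent of the cutoff, as the caption describes them; only values between the thresholds are compared against it.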
Figure 2. Computing key similarities for predicting annotations. Three key similarity calculations are required to determine whether annotation a should be assigned to gene g. Step 1: The lower threshold is the minimum of the total similarity of each gene with annotation a to all other genes with annotation a. Step 2: The upper threshold is the maximum of the total similarity of each gene without annotation a to every gene with annotation a. Step 3: The total similarity of gene g to all genes with annotation a is computed, and the decision for assigning annotation a to gene g is made as illustrated in Figure 3.
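The three calculations in Figure 2 can be sketched as below. This is an illustrative reading rather than the published implementation: it assumes that "total similarity" means the sum of pairwise similarities, that similarities are stored in a symmetric nested dictionary `sim`, and that the gene names and similarity values are made up for the example.

```python
def total_similarity(gene, targets, sim):
    """Sum of similarities from `gene` to every other gene in `targets`."""
    return sum(sim[gene][t] for t in targets if t != gene)


def compute_thresholds(annotated, unannotated, sim):
    """Return (threshold_low, threshold_high) for one annotation."""
    # Step 1: weakest link among genes already known to have the annotation.
    threshold_low = min(total_similarity(g, annotated, sim) for g in annotated)
    # Step 2: strongest link from any gene known not to have the annotation.
    threshold_high = max(total_similarity(g, annotated, sim)
                         for g in unannotated)
    return threshold_low, threshold_high


# Hypothetical pairwise similarities (not taken from the paper).
pairs = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
    ("D", "A"): 0.2, ("D", "B"): 0.1, ("D", "C"): 0.3,
    ("E", "A"): 0.7, ("E", "B"): 0.6, ("E", "C"): 0.8,
    ("G", "A"): 0.6, ("G", "B"): 0.5, ("G", "C"): 0.7,
}
sim = {}
for (u, v), w in pairs.items():
    sim.setdefault(u, {})[v] = w
    sim.setdefault(v, {})[u] = w  # keep the lookup symmetric

annotated = ["A", "B", "C"]   # genes known to have annotation a
unannotated = ["D", "E"]      # genes known not to have annotation a

low, high = compute_thresholds(annotated, unannotated, sim)
# Step 3: total similarity of the query gene G to the annotated genes.
query_total = total_similarity("G", annotated, sim)
print(low, high, query_total)
```

Note that a gene in `annotated` is compared against two other genes, while an unannotated gene is compared against all three; the caption implies this asymmetry but does not say whether any normalisation is applied, so none is applied here.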
Figure 3. Predicting transfer of annotations to genes. When considering whether an annotation a should be assigned to gene g, the total similarity of gene g to all genes with annotation a is computed. If the total similarity is greater than the upper threshold, the annotation is assigned. If the total similarity is less than the lower threshold, the annotation is never assigned. For genes with a total similarity greater than the lower threshold but less than the upper threshold, linear interpolation is used to determine where the similarity falls relative to the two thresholds. If the interpolated similarity is above a predefined cutoff, the annotation is assigned. The pink area indicates similarity values for which the annotation will be transferred.
GEO Accessions
| Sample | Sample Title |
|---|---|
| GSM112158 | Yeast cell cycle-time point 0 min 2001-10-30_O.rfm Yeast W303 cells |
| GSM112159 | Yeast cell cycle-time point 5 min 2001-11-09_0005.rfm Yeast W303 cells |
| GSM112160 | Yeast cell cycle-time point 10 min 2001-11-09_0010.rfm Yeast W303 cells |
| GSM112161 | Yeast cell cycle-time point 15 min 2001-11-09_0015.rfm Yeast W303 cells |
| GSM112162 | Yeast cell cycle-time point 20 min 2001-11-09_0020.rfm Yeast W303 cells |
| GSM112163 | Yeast cell cycle-time point 25 min 2001-11-09_0025.rfm Yeast W303 cells |
| GSM112164 | Yeast cell cycle-time point 30 min 2001-11-09_0030.rfm Yeast W303 cells |
| GSM112165 | Yeast cell cycle-time point 35 min 2001-11-09_0035.rfm Yeast W303 cells |
| GSM112166 | Yeast cell cycle-time point 40 min 2001-11-09_0040.rfm Yeast W303 cells |
| GSM112167 | Yeast cell cycle-time point 45 min 2001-11-09_0045.rfm Yeast W303 cells |
| GSM112168 | Yeast cell cycle-time point 50 min 2001-11-09_0050.rfm Yeast W303 cells |
| GSM112169 | Yeast cell cycle-time point 55 min 2001-11-09_0055.rfm Yeast W303 cells |
| GSM112170 | Yeast cell cycle-time point 60 min 2001-11-09_0060.rfm Yeast W303 cells |
| GSM112171 | Yeast cell cycle-time point 65 min 2001-11-21_0065.rfm Yeast W303 cells |
| GSM112172 | Yeast cell cycle-time point 70 min 2001-11-21_0070.rfm Yeast W303 cells |
| GSM112173 | Yeast cell cycle-time point 75 min 2001-11-28_0075.rfm Yeast W303 cells |
| GSM112174 | Yeast cell cycle-time point 80 min 2001-11-28_0080.rfm Yeast W303 cells |
| GSM112175 | Yeast cell cycle-time point 85 min 2001-11-29_0085.rfm Yeast W303 cells |
| GSM112176 | Yeast cell cycle-time point 90 min 2001-11-29_0090.rfm Yeast W303 cells |
| GSM112177 | Yeast cell cycle-time point 95 min 2001-11-29_0095.rfm Yeast W303 cells |
| GSM112178 | Yeast cell cycle-time point 100 min 2001-11-29_0100.rfm Yeast W303 cells |
| GSM112179 | Yeast cell cycle-time point 105 min 2001-12-06_0105.rfm Yeast W303 cells |
| GSM112180 | Yeast cell cycle-time point 110 min 2001-11-29_0110.rfm Yeast W303 cells |
| GSM112181 | Yeast cell cycle-time point 115 min 2001-11-29_0115.rfm Yeast W303 cells |
| GSM112182 | Yeast cell cycle-time point 120 min 2001-11-29_0120.rfm Yeast W303 cells |
| GSM81064 | Yeast cell cycle-time point 0 min 2001-05-03_0000.rfm |
| GSM81065 | Yeast cell cycle-time point 10 min 2001-05-03_0010.rfm |
| GSM81066 | Yeast cell cycle-time point 20 min 2001-05-03_0020.rfm |
| GSM81067 | Yeast cell cycle-time point 30 min 2001-05-03_0030.rfm |
| GSM81068 | Yeast cell cycle-time point 40 min 2001-04-11_0040.rfm |
| GSM81069 | Yeast cell cycle-time point 50 min 2001-04-11_0050.rfm |
| GSM81070 | Yeast cell cycle-time point 60 min 2001-04-11_0060.rfm |
| GSM81071 | Yeast cell cycle-time point 70 min 2001-04-11_0070.rfm |
| GSM81072 | Yeast cell cycle-time point 80 min 2001-04-11_0080.rfm |
| GSM81073 | Yeast cell cycle-time point 90 min 2001-04-11_0090.rfm |
| GSM81074 | Yeast cell cycle-time point 100 min 2001-04-11_0100.rfm |
| GSM81075 | Yeast cell cycle-time point 110 min 2001-04-11_0110.rfm |
| GSM81076 | Yeast cell cycle-time point 120 min 2001-04-11_0120.rfm |
These data were downloaded from GEO on November 5, 2008. The data sets were generated by a variety of researchers in many different laboratories.
Figure 4. Total number of positive predictions. As the cutoff used in the prediction algorithm is increased, all of the data sets make fewer positive predictions. That is, they predict that fewer genes should be annotated with a particular GO term. However, the number of predictions based on only the phenotype data is consistently far lower than the number based on the expression data or the combined data set. This suggests that, in general, the phenotype data will not be as much of an aid in making novel predictions as the other data sets. The generalized and exact approaches make the same number of predictions, since they differ only in which predictions are considered correct.
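The effect of the cutoff on the number of positive predictions can be illustrated with a small sweep. The interpolated similarity scores below are invented for the example and do not come from the paper; the helper name `count_positive` is likewise hypothetical.

```python
def count_positive(scores, cutoff):
    """Number of interpolated similarities strictly above the cutoff."""
    return sum(1 for s in scores if s > cutoff)


# Hypothetical interpolated similarities for six gene/annotation pairs.
scores = [0.1, 0.35, 0.5, 0.62, 0.8, 0.95]

cutoffs = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
counts = [count_positive(scores, c) for c in cutoffs]

for c, n in zip(cutoffs, counts):
    print(f"cutoff={c:.1f} positive_predictions={n}")
```

Because each score is compared against a single threshold, raising the cutoff can only keep or reduce the count, which is the monotone trend the figure reports.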
Figure 5. Total number of correct positive predictions. As the cutoff used in the prediction algorithm is increased, all of the data sets make fewer correct positive predictions. The integrated data set makes nearly as many correct predictions as the gene expression data set, and both make many more correct predictions than the phenotype data set. This confirms that the phenotype data set is not as capable as the other data sets in predicting new annotations. The generalized predictions always result in more true positives. Figure 5a shows the results for the generalized predictions. Figure 5b shows the results for the exact predictions.
Figure 6. Precision. As the cutoff used in the prediction algorithm is increased, the precision of all of the data sets increases. Precision is defined as (tp)/(tp + fp) [16]. Combined with Figures 4 and 5, this indicates that, while fewer false positives are predicted as the cutoff is increased, fewer true positives are also predicted. This is especially true in the case of the phenotype data set, which resulted in far fewer predictions than the other data sets. The integrated data set does outperform the other data sets. The generalized predictions result in a better precision than the exact predictions. Figure 6a shows the results for the generalized predictions. Figure 6b shows the results for the exact predictions.
Figure 7. Recall. In contrast to precision, recall decreases as the cutoff is increased. Recall is defined as (tp)/(tp + fn) [16]. Since false negatives are negative predictions of known positive annotations, it is not surprising that recall decreases as the cutoff is increased, because a higher cutoff results in fewer positive predictions. The gene expression data set has the highest recall, but the integrated data set is only slightly lower. The generalized predictions have a better recall than the exact predictions. Figure 7a shows the results for the generalized predictions. Figure 7b shows the results for the exact predictions.
Figure 8. F-measure. The F-measure tends to favor cutoffs which are neither very high nor very low. F-measure is calculated as (2*precision*recall)/(precision + recall) [16]. Thus, the best F-measures strike a balance between precision and recall. The integrated data set using the max to combine the similarities results in the highest F-measure with a cutoff of 0.6. Figure 8a shows the results for the generalized predictions. Figure 8b shows the results for the exact predictions.
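The three evaluation measures quoted in the captions of Figures 6 to 8 can be computed as in the sketch below. The counting helper compares predicted and known (gene, GO term) pairs by exact match, which is an assumption corresponding only to the "exact" evaluation; the function names, gene names and GO identifiers are placeholders, not results from the paper.

```python
def confusion_counts(predicted, known):
    """Return (tp, fp, fn) for sets of (gene, term) pairs, exact matching."""
    tp = len(predicted & known)
    fp = len(predicted - known)
    fn = len(known - predicted)
    return tp, fp, fn


def precision(tp, fp):
    """(tp)/(tp + fp); returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """(tp)/(tp + fn); returns 0.0 when there are no known positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """(2*precision*recall)/(precision + recall); 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Placeholder annotations purely for illustration.
known = {("g1", "GO:1"), ("g2", "GO:1"), ("g3", "GO:2"), ("g4", "GO:2")}
predicted = {("g1", "GO:1"), ("g2", "GO:1"), ("g3", "GO:2"), ("g5", "GO:2")}

tp, fp, fn = confusion_counts(predicted, known)
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
print(tp, fp, fn, p, r, f)
```

The zero-denominator guards are a convenience of this sketch; the captions do not say how such cases were handled.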
Figure 9. F-measures at different depths within the GO. Because of the more specific information available in the phenotype data set, it results in more accurate predictions at deeper levels in the GO. This figure shows that for predictions at levels between 7 and 12 in the GO, the phenotype data almost always has a higher F-measure. However, when the phenotype data are combined with the expression data, the many more predictions made by the integrated data set (see Figures 4 and 5) fare better than those of the gene expression data set alone. These were only evaluated using the generalized predictions. A cutoff of 0.6 was used in all cases.