Jean F Fontaine, Bernhard Suter, Miguel A Andrade-Navarro.
Abstract
BACKGROUND: High-throughput biological experiments can produce a large amount of data showing little overlap with current knowledge. This may be a problem when evaluating alternative scoring mechanisms for such data according to a gold standard dataset, because standard statistical tests may not be appropriate.
Year: 2011 PMID: 21388526 PMCID: PMC3060832 DOI: 10.1186/1756-0500-4-57
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Flow chart of the QiSampler algorithm. The data to be processed (all known and novel items and their corresponding scoring schemes) and the values of S and N are set from the inputs (see algorithm section for details).
Figure 2. Score comparison. These graphs, produced by QiSampler, show the average performance of two scores (scaled to [0,1]) used to select PPIs from the same experimental dataset [10]. Performance was averaged over 1000 repetitions with a sampling rate of 25%. Dashed lines represent randomized data. Based on the precision-recall and ROC graphs, the normalized DN score performs better than the z-score, and a cut-off close to 0.3 would produce optimal values of recall and precision.
Average running times on the full dataset (hh:mm:ss)
| Sampling rate | 0.25 | 0.75 | 1 |
|---|---|---|---|
| Running time for 10 repetitions | 00:00:04 | 00:00:07 | 00:00:11 |
| Running time for 100 repetitions | 00:00:54 | 00:03:30 | 00:04:46 |
| Running time for 1000 repetitions | 00:40:15 | 03:09:14 | 05:09:52 |
The full dataset contained 26,803 protein pairs, including 105 known in the literature. Times were averaged over two runs and recorded on a computer with an AMD Opteron processor (64-bit, 2.3 GHz).
Figure 3. Functional enrichment in differentially expressed genes. These graphs, produced by QiSampler, show the average performance of the z-score, which compares gene expression values between 30 OTA and 24 WT samples, in selecting gene probes related to oxidoreductase activity in a thyroid microarray gene expression dataset [12]. The performance was averaged over 1000 repetitions with a sampling rate of 25%, representing 34/137 known items and 34/3684 novel items. Dashed lines represent randomized data. The precision increases with the z-score cut-off, indicating functional enrichment in the upregulated genes.
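The sampling described in this caption (25% of the 137 known items, i.e. 34, matched by 34 of the 3684 novel items, repeated many times) can be illustrated with a minimal Python sketch. It is not the QiSampler implementation: the function names `sample_size` and `average_precision_recall` are hypothetical, and the sketch assumes that sampled novel items are counted as negatives and that an item is selected when its score is at or above the cut-off.

```python
import math
import random


def sample_size(n_known, rate):
    """Items drawn per class: a fraction `rate` of the known items (rounded down)."""
    return int(n_known * rate)


def average_precision_recall(known_scores, novel_scores, cutoff,
                             rate=0.25, repetitions=1000, seed=0):
    """Average precision and recall at `cutoff` over balanced random samples.

    Assumption (not from the source): sampled novel items count as negatives.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    rng = random.Random(seed)
    n = sample_size(len(known_scores), rate)
    if n < 1 or n > len(novel_scores):
        raise ValueError("sample size must be between 1 and the number of novel items")

    precisions, recalls = [], []
    for _ in range(repetitions):
        # Draw the same number of known and novel items without replacement.
        known = rng.sample(known_scores, n)
        novel = rng.sample(novel_scores, n)
        tp = sum(score >= cutoff for score in known)  # known items selected
        fp = sum(score >= cutoff for score in novel)  # novel items selected
        recalls.append(tp / n)
        if tp + fp > 0:  # precision is undefined when nothing is selected
            precisions.append(tp / (tp + fp))

    mean_recall = sum(recalls) / len(recalls)
    mean_precision = sum(precisions) / len(precisions) if precisions else math.nan
    return mean_precision, mean_recall


if __name__ == "__main__":
    # Synthetic scores for illustration only; these are not the paper's data.
    demo_rng = random.Random(1)
    known = [demo_rng.gauss(1.0, 1.0) for _ in range(137)]
    novel = [demo_rng.gauss(0.0, 1.0) for _ in range(3684)]
    for cutoff in (0.0, 1.0, 2.0):
        p, r = average_precision_recall(known, novel, cutoff, repetitions=100)
        print(f"cutoff={cutoff:.1f} precision={p:.2f} recall={r:.2f}")
```

Drawing equal numbers from both classes keeps the large pool of novel items from dominating the precision estimate, and averaging over repetitions smooths the variation between individual samples.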