Shannon M Bell, Lyle D Burgoon, Robert L Last.
Abstract
BACKGROUND: High-throughput methodologies such as microarrays, mass spectrometry and plate-based small molecule screens are increasingly used to facilitate discoveries ranging from gene function to drug candidate identification. These large-scale experiments are typically carried out over the course of months or years, often without the controls needed to compare directly across the dataset. Few methods are available to facilitate comparisons of high-throughput metabolic data generated in batches where explicit in-group controls for normalization are lacking.
Year: 2012 PMID: 22244038 PMCID: PMC3278354 DOI: 10.1186/1471-2105-13-10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flowchart of MIPHENO. "Input Data" (1) contains data with identifiable parameters for grouping/processing the data. The data pass through a quality control (QC) removal step (2), where groups not meeting the cutoffs are identified and removed on an attribute-by-attribute basis. Data are normalized (3) using a scaling factor based on the data distribution. Putative hits are identified (4) using a CDF built from the data or a user-defined null distribution, and an empirical p-value is assigned to each observation. Thresholds can be established based on follow-up capacity and prior knowledge (e.g. ability to detect known 'gold standard' mutant samples).
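Steps (3) and (4) of the flowchart can be sketched as follows. This is a minimal illustration, not the MIPHENO implementation: the caption only says the scaling factor is "based on the data distribution", so the use of the group median, the two-sided p-value, and the function names `normalize_by_group_median` and `empirical_p_values` are all assumptions made for this example. The QC step (2) is not shown.

```python
from bisect import bisect_left, bisect_right
from statistics import median


def normalize_by_group_median(groups):
    """Scale each group so its median matches the median of all observations.

    Assumption: the median is used as the scaling reference; values are
    expected to be positive abundances, so no group median is zero.
    """
    all_values = [v for values in groups.values() for v in values]
    global_median = median(all_values)
    return {
        name: [v * global_median / median(values) for v in values]
        for name, values in groups.items()
    }


def empirical_p_values(observations, null_values):
    """Two-sided empirical p-value of each observation against a null sample.

    The empirical CDF is built from `null_values`, which may be the
    normalized data itself or a user-supplied null distribution.
    """
    null_sorted = sorted(null_values)
    n = len(null_sorted)
    p_values = []
    for x in observations:
        at_or_below = bisect_right(null_sorted, x) / n      # empirical CDF at x
        at_or_above = 1.0 - bisect_left(null_sorted, x) / n  # fraction >= x
        p_values.append(min(1.0, 2.0 * min(at_or_below, at_or_above)))
    return p_values
```

For example, `normalize_by_group_median({"plate1": [1.0, 2.0, 3.0], "plate2": [10.0, 20.0, 30.0]})` brings both hypothetical plates onto the same scale, after which the pooled values can serve as the null sample for `empirical_p_values`.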
Figure 2. Synthetic Populations used in Testing. Synthetic data were generated to measure the performance of the three different methods in a case where 'ground truth' is known. Samples were randomly drawn from a low abundance population (Low, blue line), high abundance population (High, red line) or a WT population (WT, black line) as shown in the upper panels (A, C). Two population structures were sampled, one with a low probability of WT, P(WT) = 0.4, and the other with a high probability of WT, P(WT) = 0.93, shown in the lower panels (B, D). To test the effect of population shape, equal relative standard deviation (RSD = 15%, A and B) and equal standard deviation (SD = 5, C and D) were independently tested.
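A sampling scheme of the kind described in this caption might look like the sketch below. The population means, the normal shape of each population, and the even split of non-WT draws between Low and High are assumptions chosen for illustration; the caption specifies only P(WT), the RSD of 15% and the SD of 5. The function name `sample_population` is hypothetical.

```python
import random


def sample_population(n, p_wt, rsd=None, sd=None, means=None, seed=0):
    """Draw n labelled samples from a Low / WT / High mixture.

    Exactly one of `rsd` (relative standard deviation, e.g. 0.15) or `sd`
    (absolute standard deviation, e.g. 5.0) must be given.
    """
    if (rsd is None) == (sd is None):
        raise ValueError("give exactly one of rsd or sd")
    # Hypothetical means; the caption does not report them.
    means = means or {"Low": 50.0, "WT": 100.0, "High": 150.0}
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < p_wt:
            label = "WT"
        else:
            # Assumption: non-WT samples are equally likely Low or High.
            label = rng.choice(["Low", "High"])
        mean = means[label]
        spread = rsd * mean if rsd is not None else sd
        samples.append((label, rng.gauss(mean, spread)))
    return samples
```

Calling `sample_population(1000, p_wt=0.93, rsd=0.15)` corresponds to the high-P(WT), equal-RSD case, and `sample_population(1000, p_wt=0.4, sd=5.0)` to the low-P(WT), equal-SD case. Because each sample keeps its label, the 'ground truth' is available for scoring a classifier.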
Figure 3. Performance of Methods on Synthetic Data: AUC. The AUC was used to evaluate classification performance of MIPHENO, the use of raw data followed by a CDF classifier (RAW), and a group-based metric (Z) on synthetic data described in Figure 2. MIPHENO (pink, first in set) outperforms both RAW (green, middle) and Z (blue, last in set) across the different population parameters.
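For readers unfamiliar with the AUC, it can be computed directly from its rank interpretation: the probability that a randomly chosen true hit receives a higher score than a randomly chosen non-hit. The sketch below uses that definition; the function name `auc` and the choice of score (for instance, one minus an empirical p-value) are assumptions for illustration and are not taken from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` are truthy for true hits and falsy for non-hits. Ties between
    a hit and a non-hit count as half a win. Runs in O(hits * non-hits).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one hit and one non-hit")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every true hit outranks every non-hit, while 0.5 corresponds to a classifier that ranks no better than chance.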
Figure 4. Performance of Methods on Synthetic Data: Accuracy. Accuracy of classification was used to compare the performance of MIPHENO, the use of raw data followed by a CDF classifier (RAW), and a group-based metric (Z) on synthetic data from populations described in Figure 2. The percent accuracy is plotted along the y-axis while the false discovery rate (FDR) cutoff is along the x-axis. Each population distribution tested is shown in a separate panel. Note that MIPHENO (pink) achieved higher classification accuracy than Z (blue) (p < 2.2e-15, Wilcoxon signed-rank test) and both methods outperformed RAW (green) independent of the population parameters tested.
Figure 5. Performance of Methods on Synthetic Data: False Non-Discovery Rate. The false non-discovery rate (FNDR, or percent positive hits missed) was used to compare the performance of MIPHENO, the use of raw data followed by a CDF classifier (RAW), and a group-based metric (Z) on synthetic data from populations described in Figure 2. The FNDR is plotted along the y-axis with the different false discovery rate (FDR) cutoffs along the x-axis. Each population distribution is shown in a different panel. Note that across all populations tested, MIPHENO has a lower FNDR than the other two methods, suggesting that fewer putative hits will be missed with MIPHENO compared to using the Z-score (blue) or raw data (green).
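The accuracy, FDR and FNDR measures used in Figures 4 and 5 can all be derived from a confusion matrix, as in the sketch below. One caveat: the false non-discovery rate is commonly defined as the fraction of non-called samples that are true hits, FN / (FN + TN), whereas the caption's gloss "percent positive hits missed" reads more like FN / (TP + FN). Which form the authors used is not stated here, so this example reports both, under the assumed names `fndr` and `miss_rate`. The function name `classification_summary` is likewise hypothetical.

```python
def classification_summary(is_true_hit, is_called_hit):
    """Confusion-matrix metrics for a set of hit calls against ground truth."""
    tp = fp = tn = fn = 0
    for truth, called in zip(is_true_hit, is_called_hit):
        if truth and called:
            tp += 1
        elif truth:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn

    def ratio(num, den):
        # Report 0.0 when a denominator is empty rather than dividing by zero.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, total),
        "fdr": ratio(fp, tp + fp),        # false calls among all calls
        "fndr": ratio(fn, fn + tn),       # missed hits among all non-calls
        "miss_rate": ratio(fn, tp + fn),  # missed hits among all true hits
    }
```

With three true hits among eight samples, two of them called and one non-hit called in error, this gives an accuracy of 0.75, an FDR of 1/3, an FNDR of 0.2 and a miss rate of 1/3.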
Figure 6. Flowchart of Performance Measures for Chloroplast 2010 Data. Metabolite data from wild-type Col-0 ecotype samples were taken from the Chloroplast 2010 dataset. MIPHENO empirical p-values and z-scores were calculated separately for metabolite values reported as mol % and nmol/g fresh weight (nmol/gFW), and results were filtered according to criteria. Publicly available annotation (Aracyc and GO, Additional file 1) for annotated genes provided a basis of comparison between the two metrics.
Lines identified by MIPHENO and Z methods
| Locus | Description | Tissue | MIPHENO | Zscore | Zscore |
|---|---|---|---|---|---|
| | ADT6: Plastid-localized arogenate dehydratase | Seed | | | |
| | | Leaf | | | |
| | ATATP-PRT2: ATP phosphoribosyl transferase | Seed | | | |
| | | Leaf | | | |
| | ADT1: Plastid-localized arogenate dehydratase | Seed | | | |
| | | Leaf | | | |
| | GAD2: glutamate decarboxylase | Seed | | | |
| | | Leaf | | | |
| | P5CS1: delta1-pyrroline-5-carboxylate synthase | Seed | | | |
| | | Leaf | | | |
| | FAD7: Responsible for the synthesis of 16:3 and 18:3 fatty acids | Seed | | | |
| | | Leaf | | | |
| | IVD: Isovaleryl-CoA Dehydrogenase | Seed | | | |
| | | Leaf | | | |
| | AK-HSDK II: Bifunctional aspartate kinase, homoserine dehydrogenase. | Seed | | | |
| | | Leaf | | | |
| | FAD4: Palmitate desaturase | Seed | | | |
| | | Leaf | | | |
| | LKR/SDH: Splice variant of a bifunctional enzyme for lysine catabolism | Seed | | | |
| | | Leaf | | | |
| | ASA1: Alpha subunit of anthranilate synthase | Seed | | | |
| | | Leaf | | | |
| | GLT1: NADH-dependent glutamate synthase | Seed | | | |
| | | Leaf | | | |
*Aracyc information not updated, manually added
Results of the analysis presented in Figure 6.