Jaeyun Sung, Pan-Jun Kim, Shuyi Ma, Cory C Funk, Andrew T Magis, Yuliang Wang, Leroy Hood, Donald Geman, Nathan D Price.
Abstract
We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These molecular signatures were based on a new method reported herein, Identification of Structured Signatures and Classifiers (ISSAC), that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies: as is typical, the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, we found 90% phenotype prediction accuracy, defined as the accuracy of identifying a particular brain cancer from the background of all phenotypes. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.
Year: 2013 PMID: 23935471 PMCID: PMC3723500 DOI: 10.1371/journal.pcbi.1003148
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Description of all GEO microarray datasets used in this study.
| Phenotype name | GEO accession # | First author (publication year) | Ref. | Sample size | Affymetrix array |
| Ependymoma | GSE16155 | Donson (2009) | 17 | 19 | U133 plus2.0 |
| | GSE21687 | Johnson (2010) | 18 | 83 | U133 plus2.0 |
| Glioblastoma Multiforme | GSE4412 | Freije (2004) | 19 | 59 | U133A |
| | GSE4271 | Phillips (2006) | 20 | 76 | U133A |
| | GSE8692 | Liu (2007) | 21 | 6 | U133A |
| | GSE9171 | Wiedemeyer (2008) | 22 | 13 | U133 plus2.0 |
| | GSE4290 | Sun (2006) | 23 | 77 | U133 plus2.0 |
| Medulloblastoma | GSE10327 | Kool (2008) | 24 | 61 | U133 plus2.0 |
| | GSE12992 | Fattet (2009) | 25 | 40 | U133 plus2.0 |
| Meningioma | GSE4780 | Scheck (2006) | - | 62 | U133A/U133 plus2.0 |
| | GSE9438 | Claus (2008) | 26 | 31 | U133 plus2.0 |
| | GSE16581 | Lee (2010) | 27 | 68 | U133 plus2.0 |
| Oligodendroglioma | GSE4412 | Freije (2004) | 19 | 11 | U133A |
| | GSE4290 | Sun (2006) | 23 | 50 | U133 plus2.0 |
| Pilocytic Astrocytoma | GSE12907 | Wong (2005) | 28 | 21 | U133A |
| | GSE5675 | Sharma (2007) | 29 | 41 | U133 plus2.0 |
| Normal Brain | GSE3526 | Roth (2006) | 30 | 146 | U133 plus2.0 |
| | GSE7307 | Roth (2007) | - | 57 | U133 plus2.0 |
Studies that have not been published are denoted as ‘-’.
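The study-selection rule stated in the abstract (at least two independent studies per phenotype) can be checked directly against the table above. The following is a minimal sketch that transcribes the accession numbers and sample sizes from the table; the names `DATASETS` and `summarize` are illustrative and not part of the original work.

```python
from collections import defaultdict

# (phenotype, GEO accession, sample size), transcribed from the table above.
DATASETS = [
    ("Ependymoma", "GSE16155", 19),
    ("Ependymoma", "GSE21687", 83),
    ("Glioblastoma Multiforme", "GSE4412", 59),
    ("Glioblastoma Multiforme", "GSE4271", 76),
    ("Glioblastoma Multiforme", "GSE8692", 6),
    ("Glioblastoma Multiforme", "GSE9171", 13),
    ("Glioblastoma Multiforme", "GSE4290", 77),
    ("Medulloblastoma", "GSE10327", 61),
    ("Medulloblastoma", "GSE12992", 40),
    ("Meningioma", "GSE4780", 62),
    ("Meningioma", "GSE9438", 31),
    ("Meningioma", "GSE16581", 68),
    ("Oligodendroglioma", "GSE4412", 11),
    ("Oligodendroglioma", "GSE4290", 50),
    ("Pilocytic Astrocytoma", "GSE12907", 21),
    ("Pilocytic Astrocytoma", "GSE5675", 41),
    ("Normal Brain", "GSE3526", 146),
    ("Normal Brain", "GSE7307", 57),
]


def summarize(datasets):
    """Return {phenotype: (number of studies, total samples)}."""
    counts = defaultdict(lambda: [0, 0])
    for phenotype, _accession, n_samples in datasets:
        counts[phenotype][0] += 1
        counts[phenotype][1] += n_samples
    return {p: (n, total) for p, (n, total) in counts.items()}


SUMMARY = summarize(DATASETS)
for phenotype, (n_studies, total) in SUMMARY.items():
    print(f"{phenotype}: {n_studies} studies, {total} samples")
```

The per-phenotype totals this produces (102, 231, 101, 161, 61, 62, and 203) agree with the Total column of the ten-fold cross-validation table.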
The node marker-panel is a collection of gene-pair classifiers from the nodes of the diagnostic hierarchy.
| Node # | Node classes | Gene i | Gene j | k |
| 2 | EPN GBM MDL MNG OLG PA | | | 1 |
| 3 | normal | | | 1 |
| 4 | EPN GBM MDL OLG PA | | | 1 |
| 5 | MNG | | | 1 |
| 6 | EPN GBM OLG PA | | | 2 |
| 7 | MDL | | | 4 |
| 8 | EPN | | | 2 |
| 9 | GBM OLG PA | | | 1 |
| 10 | GBM OLG | | | 1 |
| 11 | PA | | | 3 |
| 12 | GBM | | | 1 |
| 13 | OLG | | | 1 |
Node # corresponds to numerical labels in the diagnostic hierarchy shown in Figure 1.
Disease abbreviation (name): EPN (Ependymoma), GBM (Glioblastoma Multiforme), MDL (Medulloblastoma), MNG (Meningioma), OLG (Oligodendroglioma), PA (Pilocytic astrocytoma), and normal (Normal brain).
Gene i and gene j are the genes expressed higher and lower, respectively, within each gene-pair classification decision rule. Specifically, the statement of “Gene i is expressed higher than Gene j” being true contributes to the expression profile being classified as the phenotype(s) of the node. Gene names, chromosome loci, and Affymetrix microarray platform probe IDs of the classifier genes can be found in Table S1.
k is the minimum number of gene-pair classifiers whose decision rule outcomes for an expression profile are required to be ‘true’ (= 1) for the profile to be classified as the phenotype(s) of the node.
Some genes share the same symbol/name but correspond to different Affymetrix probe IDs.
Figure 1. Gene-pair sets of the node marker-panel are shown at their corresponding twelve nodes in the brain cancer diagnostic hierarchy.
Gene i (left) and Gene j (right) are the genes expressed higher and lower within each gene-pair, respectively. A transcriptome test sample is classified as the phenotype(s) of the node if the number of corresponding gene-pairs with a ‘true’ outcome for the statement “Gene i is expressed higher than Gene j” is greater than or equal to a threshold k defined for that node.
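The node decision rule described above reduces to counting how many gene pairs satisfy "Gene i is expressed higher than Gene j" and comparing that count with k. A minimal sketch follows; the function name, the gene labels `G1` to `G6`, and the expression values are hypothetical placeholders rather than genes or data from the marker panel.

```python
def node_is_positive(profile, gene_pairs, k):
    """Return True if at least k gene pairs satisfy 'Gene i > Gene j'.

    profile    -- dict mapping gene (or probe) name to expression value
    gene_pairs -- list of (gene_i, gene_j) tuples for one node
    k          -- node-specific threshold on the number of 'true' outcomes
    """
    votes = sum(
        1 for gene_i, gene_j in gene_pairs if profile[gene_i] > profile[gene_j]
    )
    return votes >= k


# Hypothetical node with three gene pairs and threshold k = 2.
pairs = [("G1", "G2"), ("G3", "G4"), ("G5", "G6")]
profile = {"G1": 9.1, "G2": 7.4, "G3": 5.0, "G4": 6.2, "G5": 8.8, "G6": 8.1}
print(node_is_positive(profile, pairs, k=2))  # two of the three pairs are 'true'
```

Because the rule depends only on the relative ordering of two expression values within one sample, it is unaffected by any monotonic rescaling applied to that sample.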
The decision-tree marker-panel shows phenotype-specific signatures in the form of binary patterns.
| Gene symbols | | Disease binary signatures | | | | | | |
| Gene i | Gene j | EPN | GBM | MDL | MNG | OLG | PA | normal |
| | | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| | | 1 | 1 | 1 | 0 | 1 | 1 | - |
| | | 1 | 1 | 0 | - | 1 | 1 | - |
| | | 1 | 0 | - | - | 0 | 0 | - |
| | | - | 1 | - | - | 1 | 0 | - |
| | | - | 1 | - | - | 0 | - | - |
Affymetrix microarray platform probe IDs of the classifier genes are shown in Table S2.
For each gene-pair comparison (i.e., is Gene i > Gene j?), 1 and 0 denote ‘true’ and ‘false’, respectively, and ‘-’ denotes that the outcome is not used for classification.
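Reading the table column by column gives one binary pattern per phenotype, with ‘-’ acting as a wildcard. The sketch below encodes those seven patterns and matches a vector of six gene-pair outcomes against them. The gene pairs are referred to only by their row order in the table, and the names `SIGNATURES` and `matching_phenotypes` are illustrative assumptions.

```python
# One pattern per phenotype, read column by column from the table above;
# position n holds the required outcome of the n-th gene pair (row order).
SIGNATURES = {
    "EPN":    "1111--",
    "GBM":    "111011",
    "MDL":    "110---",
    "MNG":    "10----",
    "OLG":    "111010",
    "PA":     "11100-",
    "normal": "0-----",
}


def matching_phenotypes(outcomes):
    """Return phenotypes whose signature agrees with the observed outcomes.

    outcomes -- sequence of six 0/1 values, one per gene-pair comparison
    """
    observed = "".join(str(int(bool(o))) for o in outcomes)
    return [
        phenotype
        for phenotype, pattern in SIGNATURES.items()
        # A position matches if it is unused ('-') or equals the observation.
        if all(p in ("-", o) for p, o in zip(pattern, observed))
    ]


print(matching_phenotypes([1, 1, 1, 0, 0, 1]))
```

Since the patterns describe the root-to-leaf paths of a binary tree, every possible outcome vector matches exactly one phenotype.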
Figure 2. Gene-pairs of the decision-tree marker-panel are shown at their corresponding edges in the brain cancer diagnostic hierarchy.
Gene i and Gene j are the genes expressed higher and lower within the gene-pair, respectively. For a given test sample, the direction of its classification down the diagnostic hierarchy is based on the gene-pair classifiers' true/false outcomes (left/right, respectively) for the statement “Gene i is expressed higher than Gene j”.
Figure 3. Comprehensive classification of human brain cancer and normal brain transcriptomes using molecular signatures from ISSAC.
(A) The coarse-to-fine classification process is represented by a hierarchically structured grouping of phenotypes. There is a node classifier for each set of phenotypes in the hierarchy, which is designed to respond positively if the sample belongs to this set of diseases and negatively otherwise. Our diagnostic hierarchy has thirteen nodes in total, seven of which are terminal nodes (i.e., leaves). The node classifiers are executed sequentially and adaptively on a given expression profile; a classifier test for a particular node is performed if and only if all of its ancestor tests were performed and deemed positive. The node classifiers are used to screen for phenotype-specific signatures. (B) Leaves that have positive classifier outcomes correspond to the candidate phenotypes of a given expression profile. If there is no candidate phenotype, the expression profile is labeled as ‘Unclassified’. If only one candidate phenotype is identified, the profile is labeled as the phenotype of the respective leaf. If the profile is considered to contain multiple phenotype signatures, the ambiguity is resolved using the decision-tree classifiers based on the same diagnostic hierarchy. Here, the decision-tree classifiers are executed starting from the root of the tree, directing the profile to one of the two child nodes sequentially until it completes a full path to a leaf. The phenotype label of the final destination corresponds to the unique diagnosis.
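The two-stage procedure in this caption can be sketched as follows. The parent-child links are an assumption inferred from the nested node classes listed in the node marker-panel table, and the ordering of each child pair (the ‘true’ branch first) is inferred from the binary signatures; neither is taken from the authors' code. The callables `node_positive` and `go_left` are hypothetical stand-ins for the node classifiers and the decision-tree classifiers.

```python
# Assumed hierarchy: internal node -> (child on 'true' edge, child on 'false' edge).
CHILDREN = {1: (2, 3), 2: (4, 5), 4: (6, 7), 6: (8, 9), 9: (10, 11), 10: (12, 13)}
LEAF_LABEL = {3: "normal", 5: "MNG", 7: "MDL", 8: "EPN", 11: "PA", 12: "GBM", 13: "OLG"}


def candidate_leaves(node_positive, node=1):
    """Collect leaves reached through an unbroken chain of positive node tests."""
    if node in LEAF_LABEL:
        return [LEAF_LABEL[node]]
    found = []
    for child in CHILDREN[node]:
        # A child is tested only because every ancestor was already positive.
        if node_positive(child):
            found.extend(candidate_leaves(node_positive, child))
    return found


def tree_path(go_left, node=1):
    """Follow decision-tree edges from the root until a single leaf is reached."""
    while node not in LEAF_LABEL:
        true_child, false_child = CHILDREN[node]
        node = true_child if go_left(node) else false_child
    return LEAF_LABEL[node]


def classify(node_positive, go_left):
    """Label a profile as described in panels A and B."""
    candidates = candidate_leaves(node_positive)
    if not candidates:
        return "Unclassified"
    if len(candidates) == 1:
        return candidates[0]
    return tree_path(go_left)  # resolve ambiguity among several candidates


# Hypothetical outcomes: both EPN (node 8) and PA (node 11) test positive.
positive_nodes = {2, 4, 6, 8, 9, 11}
true_edges = {1, 2, 4}
print(classify(lambda n: n in positive_nodes, lambda n: n in true_edges))
```

In this illustration the node classifiers leave two candidates, so the decision tree is consulted and routes the profile to a single leaf.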
Figure 4. Molecular signatures from multi-study, integrated datasets have higher average phenotype prediction accuracy and lower performance variance than those from individual datasets.
(A) Hold-one-lab-in validation results for each of the five glioblastoma (GBM) datasets. The gray line indicates average accuracy on the four validation sets. (B) Leave-one-lab-out validation results for each of the five GBM datasets. The blue and red lines indicate the average accuracy of GBM signatures from leave-one-lab-out (L1LO) validation and hold-one-lab-in (H1LI) validation, respectively. (C) H1LI validation to test GBM signatures from GSE4412, GSE4271, and GSE4290, while keeping the number of samples in the GBM training set the same. 50 samples were randomly selected from each GBM dataset for signature learning. H1LI validation was executed ten times for each of the three GBM datasets. Error bars indicate standard deviations. (D) L1LO validation to test GBM signatures on GSE4412, GSE4271, and GSE4290 validation sets, while 50 total samples were randomly selected from the other four GBM datasets for signature learning. L1LO validation was executed ten times. (E) Data from C and D are used to show the accuracies of GBM signatures on GSE4412, GSE4271, and GSE4290 validation sets when the GBM training data were from individual or combined GBM datasets. Ψ, Φ, and Ω indicate statistical significance relative to GSE4271, GSE4290, and GSE4412, respectively (Tukey's post-hoc test, p<0.01). (F) Signal-to-noise ratios (SNRs) from data in E. SNR was calculated as the ratio of average accuracy to standard deviation.
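The two validation schemes and the SNR definition in this caption can be illustrated with a short sketch. The lab list uses the five GBM accessions from the dataset table; the function names and the accuracy values passed to `signal_to_noise` are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the caption does not say which was used.

```python
from statistics import mean, stdev

GBM_LABS = ["GSE4412", "GSE4271", "GSE8692", "GSE9171", "GSE4290"]


def hold_one_lab_in(labs):
    """Yield (training lab, validation labs): learn from one, test on the rest."""
    for lab in labs:
        yield lab, [other for other in labs if other != lab]


def leave_one_lab_out(labs):
    """Yield (training labs, validation lab): learn from the rest, test on one."""
    for lab in labs:
        yield [other for other in labs if other != lab], lab


def signal_to_noise(accuracies):
    """Ratio of mean accuracy to its sample standard deviation."""
    return mean(accuracies) / stdev(accuracies)


for train, validate in leave_one_lab_out(GBM_LABS):
    print("train on", train, "-> validate on", validate)

# Hypothetical accuracies from repeated runs, for illustration only.
print(signal_to_noise([0.80, 0.90, 1.00]))
```

Under this definition, a signature with the same mean accuracy but less run-to-run variation receives a higher SNR.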
Classification performance of brain cancer marker-panel in ten-fold cross-validation.
| Actual phenotype | Predicted phenotype (%) | | | | | | | | Total |
| | EPN | GBM | MDL | MNG | OLG | PA | normal | UC | |
| EPN | | 2.8 | 0.3 | 1.7 | 1.3 | 0.6 | 0.2 | 1.0 | 102 |
| GBM | 0.7 | | 0.2 | 0.5 | 11.9 | 0.1 | 0.3 | 1.3 | 231 |
| MDL | 2.2 | 2.3 | | 0.8 | 2.7 | 0.2 | 0.0 | 0.8 | 101 |
| MNG | 0.1 | 1.8 | 0.0 | | 0.1 | 0.2 | 0.0 | 0.2 | 161 |
| OLG | 0.5 | 20.7 | 0.2 | 0.0 | | 2.1 | 0.0 | 2.0 | 61 |
| PA | 1.3 | 2.3 | 0.0 | 0.0 | 1.3 | | 0.0 | 0.8 | 62 |
| normal | 0.0 | 0.5 | 0.0 | 0.1 | 0.7 | 0.0 | | 0.1 | 203 |
Accuracies reflect average performance in ten-fold cross-validation conducted ten times. The main diagonal gives the average classification accuracy of each class (bold), and the off-diagonal elements show the erroneous predictions.
UC: Unclassified samples. When using the node classifiers, expression profiles that did not exhibit a signature of any phenotype (i.e., did not percolate down to at least one positive terminal node) were rejected from classification. In this case, the Unclassified sample is treated as a misclassification.
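A row-normalized table of this kind, with Unclassified treated as an extra predicted column that counts against accuracy, could be computed roughly as follows. This is a sketch with hypothetical function names and toy labels, not the evaluation code used in the study.

```python
from collections import Counter

LABELS = ["EPN", "GBM", "MDL", "MNG", "OLG", "PA", "normal"]


def confusion_percentages(actual, predicted):
    """Row-normalized confusion table; 'UC' is an extra predicted column."""
    pair_counts = Counter(zip(actual, predicted))
    row_totals = Counter(actual)
    columns = LABELS + ["UC"]
    return {
        a: {p: 100.0 * pair_counts[(a, p)] / row_totals[a] for p in columns}
        for a in row_totals
    }


def class_accuracy(table, label):
    """Diagonal entry: wrong labels and 'UC' both count against accuracy."""
    return table[label][label]


# Hypothetical predictions for four GBM samples, for illustration only.
actual = ["GBM", "GBM", "GBM", "GBM"]
predicted = ["GBM", "GBM", "OLG", "UC"]
table = confusion_percentages(actual, predicted)
print(class_accuracy(table, "GBM"), table["GBM"]["UC"])
```

Each row sums to 100% because every sample receives either one phenotype label or the Unclassified label.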