Literature DB >> 23935471

Multi-study integration of brain cancer transcriptomes reveals organ-level molecular signatures.

Jaeyun Sung¹, Pan-Jun Kim, Shuyi Ma, Cory C Funk, Andrew T Magis, Yuliang Wang, Leroy Hood, Donald Geman, Nathan D Price.

Abstract

We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These signatures were based on a new method reported herein--Identification of Structured Signatures and Classifiers (ISSAC)--that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies, for it was observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, as is typical. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, 90% phenotype prediction accuracy, or the accuracy of identifying a particular brain cancer from the background of all phenotypes, was found. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：
Biomarkers, Tumor

Year: 2013 PMID： 23935471 PMCID： PMC3723500 DOI： 10.1371/journal.pcbi.1003148

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.475

Introduction

One important goal in systems medicine is to develop molecular diagnostics that can accurately and comprehensively report health and disease states of an organ system [1], [2]. The discovery of organ-level molecular signatures [3] from global biomolecule expression measurements would mark a significant advance toward this goal. In this regard, genome-wide transcriptomic data are readily available, making this a promising source for molecular signatures as well as a good means to study the robustness of signatures across different studies. During the past decade, transcriptomics analyses on clinical patient samples have been widely used to uncover cancer-associated genes [4] and to discover biomarkers for diagnosis, prognosis prediction, or optimal therapy selection [5]–[7]. Recently, RNAs measured in blood have also been used as serum-based molecular fingerprints of neurological disease [8]. While many molecular signature studies have focused on identifying differences between case (e.g., cancer) and control (e.g., normal), a more clinically relevant and challenging task is the multi-category classification problem. This task pertains especially to identifying signatures for molecular screening and monitoring purposes. Such signatures need to detect and stratify various pathological conditions simultaneously; they must therefore be highly specific for a particular disease as well as tissue of origin. The successful identification of more reliable and efficient molecular signatures will also be critical for the blood-based, organ-specific diagnostics envisioned for the future [9]. Data-driven, hierarchical approaches to multi-category classification have been investigated extensively in machine learning [10], [11]. The basic idea of these methods is first to construct a classification framework in the form of a hierarchy, so that multi-category classifications can be reformulated into a series of binary decision sets (i.e., discriminating one class or group of classes from a second class or group of classes). The next step is to identify binary classifiers for all decision points (i.e., nodes and/or edges) of the hierarchy. This principle can be applied directly towards molecular disease classification, wherein all diseases can be organized into a global hierarchy of disease sets, where the diseases in each set share common expression patterns. The sets of binary classifiers can further be aggregated into a classifier marker-panel, which can direct diagnosis of an unlabeled patient sample down the hierarchical structure towards a particular label. Therefore, the cumulative expression patterns constitute “hierarchically-structured” molecular signatures. A significant drawback to the use of molecular signatures derived from high-throughput—particularly transcriptomic—data is limited reproducibility and performance accuracy, which is often observed across independent studies of what are considered the same disease phenotype. This drawback holds true for both binary and multi-category classification problems. The lack of robustness, even for promising results, can be attributed to molecular heterogeneity within tumors or other diseased tissue-samples [12], [13], complex disease subtypes, various patient demographics, and/or other biologically relevant factors. Another major issue is batch effects, which arise from differences or inconsistencies in experimental protocols, data quality, data-processing techniques, and laboratory conditions and personnel [14]. A promising method to address some of these limitations in robustness is to accumulate and combine data from many independent studies into large meta-analyses [15], [16]. This integrated strategy naturally expands sample sizes across diverse sources and conditions and can therefore provide more reliable disease signatures as phenotype-associated signals become stronger relative to noise from batch effects and other sources of variance. In this study, we developed a computational approach called Identification of Structured Signatures And Classifiers (ISSAC) to identify molecular signatures that simultaneously distinguish major cancers of the human brain. From an integrated dataset of publicly available gene expression data, ISSAC provides a global diagnostic hierarchy and corresponding structured brain cancer signatures composed of sets of gene-pair classifiers. The signal in the transcriptomics data was sufficient to develop accurate, comprehensive signatures, as long as the training set was sampled from the same population as the validation set (i.e., cross validation). In contrast, training on one dataset and testing against an independent set (i.e., an independent study measured from another lab) generally failed to reach the same performance due to biological and technical sources of dataset variation. To address this issue, we found that integration of datasets from multiple studies enhanced the disease signal sufficiently to mitigate batch effects and greatly improve independent validation results for brain cancers.

Results/Discussion

We compiled a multi-study, integrated dataset of brain cancer and normal transcriptomes [17]–[30] (Table 1, Table S1, and Table S2), on which we used our ISSAC algorithm (described below) to assemble classifiers into a node (Table 2, Table S3, and Figure 1) and a decision-tree (Table 3, Table S4, and Figure 2) marker panel. Importantly, while developing our algorithm to identify molecular signatures of brain cancer, we explored the effects of integrating data from multiple studies on classification performance, confirming that our integrated approach does indeed lead to more robust phenotype signatures.

Table 1

Description of all GEO microarray datasets used in this study.

Phenotype name	GEO accession #	First author (publication year)	Ref.	Sample size	Affymetrix array
Ependymoma	GSE16155	Donson (2009)	17	19	U133 plus2.0
	GSE21687	Johnson (2010)	18	83	U133 plus2.0
Glioblastoma Multiforme	GSE 4412	Freije (2004)	19	59	U133A
	GSE 4271	Phillips (2006)	20	76	U133A
	GSE 8692	Liu (2007)	21	6	U133A
	GSE 9171	Wiedemeyer (2008)	22	13	U133 plus2.0
	GSE 4290	Sun (2006)	23	77	U133 plus2.0
Medulloblastoma	GSE 10327	Kool (2008)	24	61	U133 plus2.0
	GSE 12992	Fattet (2009)	25	40	U133 plus2.0
Meningioma	GSE 4780	Scheck (2006)	-	62	U133A/U133 plus2.0
	GSE 9438	Claus (2008)	26	31	U133 plus2.0
	GSE 16581	Lee (2010)	27	68	U133 plus2.0
Oligodendroglioma	GSE 4412	Freije (2004)	19	11	U133A
	GSE 4290	Sun (2006)	23	50	U133 plus2.0
Pilocytic Astrocytoma	GSE 12907	Wong (2005)	28	21	U133A
	GSE 5675	Sharma (2007)	29	41	U133 plus2.0
Normal Brain	GSE 3526	Roth (2006)	30	146	U133 plus2.0
	GSE 7307	Roth (2007)	-	57	U133 plus2.0

Studies that have not been published are denoted as ‘-’.

Table 2

The node marker-panel is a collection of gene-pair classifiers from the nodes of the diagnostic hierarchy.

Node #a	Node classesb	Gene i c	Gene j c	k d
2	EPN GBM MDL MNG OLG PA	PRPF40A	PURA	1
3	normal	PURA	PRPF40A	1
4	EPN GBM MDL OLG PA	NRCAM	ISLR	1
		IDH2	GMDS
5	MNG	ISLR	NRCAM	1
6	EPN GBM OLG PA	SALL1	PAFAH1B3	2
		SRI	NBEA
		DDR1 e	TIA1
		DDR1 e	MAB21L1
		ITPKB	PDS5B
7	MDL	PAFAH1B3	SALL1	4
		NBEA	SRI
		TIA1	DDR1 e
		MAB21L1	DDR1 e
		PDS5B	ITPKB
8	EPN	NUP62CL	ZNF280A	2
		GALNS	WAS
		CELSR1	OR10H3
		TLE4	OLIG2
9	GBM OLG PA	ZNF280A	NUP62CL	1
10	GBM OLG	DDX27	KCNMA1	1
		COX7A2	GNPTAB
11	PA	KCNMA1	DDX27	3
		GNPTAB	NDUFS2
		APOD	PPIA
		CD59	SNRPB2
		SEMA3E	ADAMTS3
		CD59	HINT1
		BAMBI	CIAPIN1
12	GBM	FLNA	TNKS2	1
		ITGB3BP	RB1CC1
		DDX27	TRIM8
13	OLG	LARP5	ANXA1	1

Node # corresponds to numerical labels in the diagnostic hierarchy shown in Figure 1.

Disease abbreviation (name): EPN (Ependymoma), GBM (Glioblastoma Multiforme), MDL (Medulloblastoma), MNG (Meningioma), OLG (Oligodendroglioma), PA (Pilocytic astrocytoma), and normal (Normal brain).

Gene i and gene j are the genes expressed higher and lower, respectively, within each gene-pair classification decision rule. Specifically, the statement of “Gene i is expressed higher than Gene j” being true contributes to the expression profile being classified as the phenotype(s) of the node. Gene names, chromosome loci, and Affymetrix microarray platform probe IDs of the classifier genes can be found in Table S1.

The minimum number of gene-pair classifiers whose decision rule outcomes for an expression profile are required to be ‘true ( = 1)’ for the profile to be classified as the phenotype(s) of the node.

Genes that share same symbol/name, but correspond to different Affymetrix probe IDs.

Figure 1

Gene-pair sets of the node marker-panel are shown at their corresponding twelve nodes in the brain cancer diagnostic hierarchy.

Gene i (left) and Gene j (right) are the genes expressed higher and lower within each gene-pair, respectively. A transcriptome test sample is classified as the phenotype(s) of the node if the number of corresponding gene-pairs with a ‘true’ outcome for the statement “Gene i is expressed higher than Gene j” is greater than or equal to a threshold k defined for that node.

Table 3

The decision-tree marker-panel shows phenotype-specific signatures in the form of binary patterns.

Gene symbolsa		Disease binary signaturesb
Gene i	Gene j	EPN	GBM	MDL	MNG	OLG	PA	normal
PRPF40A	PURA	1	1	1	1	1	1	0
NRCAM	ISLR	1	1	1	0	1	1	-
SRI	NBEA	1	1	0	-	1	1	-
NUP62CL	OR10H3	1	0	-	-	0	0	-
DDX27	KCNMA1	-	1	-	-	1	0	-
FLNA	TNKS2	-	1	-	-	0	-	-

Affymetrix microarray platform probe IDs of the classifier genes are shown in Table S2.

For each gene-pair comparison (i.e., Is Gene i>Gene j ?), 1 and 0 delineates ‘true’ and ‘false’, respectively, and ‘–’ denotes that the outcome is not used for classification.

Figure 2

Gene-pairs of the decision-tree marker-panel are shown at their corresponding edges in the brain cancer diagnostic hierarchy.

Gene-pair sets of the node marker-panel are shown at their corresponding twelve nodes in the brain cancer diagnostic hierarchy.

Gene-pairs of the decision-tree marker-panel are shown at their corresponding edges in the brain cancer diagnostic hierarchy.

Gene i and Gene j are the genes expressed higher and lower within the gene-pair, respectively. For a given test sample, the direction of its classification down the diagnostic hierarchy is based on the gene-pair classifiers' true/false outcomes (left/right, respectively) for the statement “Gene i is expressed higher than Gene j”. Studies that have not been published are denoted as ‘-’. Node # corresponds to numerical labels in the diagnostic hierarchy shown in Figure 1. Disease abbreviation (name): EPN (Ependymoma), GBM (Glioblastoma Multiforme), MDL (Medulloblastoma), MNG (Meningioma), OLG (Oligodendroglioma), PA (Pilocytic astrocytoma), and normal (Normal brain). Gene i and gene j are the genes expressed higher and lower, respectively, within each gene-pair classification decision rule. Specifically, the statement of “Gene i is expressed higher than Gene j” being true contributes to the expression profile being classified as the phenotype(s) of the node. Gene names, chromosome loci, and Affymetrix microarray platform probe IDs of the classifier genes can be found in Table S1. The minimum number of gene-pair classifiers whose decision rule outcomes for an expression profile are required to be ‘true ( = 1)’ for the profile to be classified as the phenotype(s) of the node. Genes that share same symbol/name, but correspond to different Affymetrix probe IDs. Affymetrix microarray platform probe IDs of the classifier genes are shown in Table S2. For each gene-pair comparison (i.e., Is Gene i>Gene j ?), 1 and 0 delineates ‘true’ and ‘false’, respectively, and ‘–’ denotes that the outcome is not used for classification. Our marker panel consists of 39 total gene pairs and 44 unique genes (46 unique Affymetrix microarray probe IDs). Details on how the gene-pair sets were chosen as classifiers, and how they are used for phenotype prediction, can be found in the Materials and Methods section and Text S1. In addition, we discuss how the genes and gene pairs in our marker panel were found to have previously confirmed associations with brain cancer. Overall, we generated a marker panel with reasonably high multi-class brain cancer classification accuracy and straightforward biological interpretation.

An overview of Identification of Structured Signatures and Classifiers (ISSAC)

Here, we summarize the overall method of ISSAC into three main steps (Figure S1); a detailed algorithm and step-by-step guide are presented in the Materials and Methods section and Text S1, respectively. First, ISSAC constructs the framework for brain cancer diagnosis (Figure 3A and Figure S2)—a tree-structured hierarchy of all brain phenotypes including ependymoma (EPN), glioblastoma multiforme (GBM), medulloblastoma (MDL), meningioma (MNG), oligodendroglioma (OLG), pilocytic astrocytoma (PA), and normal brain, built using an agglomerative hierarchical clustering algorithm on gene expression training data. The construction of the hierarchy relies on iteratively identifying pairs of phenotype groups based on shared features in gene expression. As shown in Figure 3A, the cumulative set of different phenotypes is partitioned into smaller and more homogeneous subsets, thereby decomposing the multi-class diagnosis problem into more tractable sub-problems of class prediction.

Figure 3

Comprehensive classification of human brain cancer and normal brain transcriptomes using molecular signatures from ISSAC.

Comprehensive classification of human brain cancer and normal brain transcriptomes using molecular signatures from ISSAC.

A The coarse-to-fine classification process is represented by a hierarchically structured groupings of phenotypes. There is a node classifier for each set of phenotypes in the hierarchy, which is designed to respond positively if the sample belongs to this set of diseases and negatively otherwise. Our diagnostic hierarchy has thirteen nodes in total, and seven terminal nodes (i.e., leaves). The node classifiers are executed sequentially and adaptively on a given expression profile; a classifier test for a particular node is performed if and only if all of its ancestor tests were performed and deemed positive. The node classifiers are used to screen for phenotype-specific signatures. B Leaves that have positive classifier outcomes correspond to the candidate phenotypes of a given expression profile. If there is no candidate phenotype, the expression profile is labeled as ‘Unclassified’. If only one candidate phenotype is identified, the profile is labeled as that phenotype of the respective leaf. If the profile is considered to consist of multiple phenotype signatures, the ambiguity is resolved using the decision-tree classifiers based on the same diagnostic hierarchy. Here, the decision-tree classifiers are executed starting from the root of the tree, directing the profile to one of the two child nodes sequentially until it completes a full path towards a leaf. The phenotype label of the final destination corresponds to the unique diagnosis. Second, ISSAC identifies gene-pair classifiers corresponding to the nodes and edges of the diagnostic hierarchy (Figures 1 and 2 and Tables 2 and 3). Both types of classifiers are binary, i.e., attempt to distinguish between two sets of phenotypes. The objective of a node classifier is to distinguish the set of phenotypes associated with the node from all other phenotypes. For example, the classifiers of node 6 in Figure 1 and Table 2 can predict the class label of a particular transcriptome sample as either glioma (EPN, GBM, OLG, and PA) or non-glioma (MNG, MDL, and normal). In the case of an edge-based, decision-tree classifier, the objective is to distinguish the two sets of phenotypes associated with the two child nodes, analogous to rules of an ordinary decision tree. In the case of the two genes PRPF40A and PURA in Figure 2 and Table 3, this classifier determines the label of a sample as either brain cancer or normal phenotype. All classifiers are based on comparing the relative expression values (i.e., ranks) between two genes or several pairs of genes within a gene expression profile (Figures 1 and 2 and Tables 2 and 3). The chosen pairs are those that best differentiate between the phenotype sets and are based entirely on the reversal of relative expression (Materials and Methods), as previously reported [31]. Briefly, the decision rule by Geman et al. is based on two genes (e.g., gene i and gene j) for distinguishing between two phenotypes (e.g., class A and class B): If the expression of gene i is greater than that of gene j for a given profile, then the phenotype is classified as class A; otherwise, class B. It has been shown that using such simple decision rules with only a small number of gene pairs can lead to highly accurate supervised classification of human cancers [32], [33]. We describe the advantages of using relative expression reversals in Text S2. In addition, we provide a summary of the expression differences between classifier genes i and j in Table S5. Overall, the collection of node classifiers represent a series of coarse-grained to fine-grained explanations of the hierarchical groupings and are used in diagnosis to screen for phenotype-specific expression patterns (described below). Thus, the hierarchy of binary predictors guides classification of an expression profile in a dynamic coarse-to-fine fashion: a classifier is executed if and only if all of its ancestor classifiers have been executed and have returned a positive response—i.e., predicted the phenotypes in each node. The cumulative outcome of the node classifiers for a given expression profile is the set of its candidate phenotypes, corresponding to all the leaves of the hierarchy that were reached and tested positively. This property means that it is possible to traverse multiple paths to multiple leaf nodes, and thus multiple diagnoses may be made in this step (though in practice it is usually just one). For tie-breaking purposes, the decision-tree classifiers at the edges of the diagnostic hierarchy are used to reach a unique diagnosis. Finally, ISSAC uses the gene-pair classifiers for class prediction (Figure 3B). Given a transcriptome sample, ISSAC executes the node classifiers in a hierarchical, top-down fashion within the disease diagnostic hierarchy to identify the phenotype(s) whose class-specific signature(s) is present. As shown in Figure 3B, transcriptome samples 4–7 all have expression signatures of at least one class, i.e., a sample is classified (positive) as at least one terminal node (leaf) phenotype. In contrast, samples 1–3 do not have any class-specific signatures, i.e., samples are not positive for any leaf, and are labeled as “Unclassified”. In case of multiple class candidates, i.e., node classifiers for multiple leaves are positive as in samples 6 and 7, the ambiguity is resolved by aggregating all the decision-tree classifiers into a classification decision-tree, thereby leading any expression signature down one unique path toward a single phenotype. Once the hierarchy and classifiers were determined, ISSAC distinguished brain cancer phenotypes with an accuracy of 90% in ten-fold cross-validation (discussed below). When the individual transcriptomic samples used in the training set were re-examined, ISSAC correctly observed all samples with an apparent (resubstitution) accuracy of 94%. This gives a sense for the relatively small degree of over-fitting compared to the cross-validation accuracy estimate.

Integrating disparate datasets identifies more robust molecular signatures across independent studies

To estimate the robustness of signature accuracy, it is best to test molecular signatures against datasets (i.e., patient samples) that are truly independent of the training set (e.g., drawn from a different patient population, clinical laboratory, etc.). To study the effects of training across multiple studies, we used glioblastoma (GBM), where we had the highest number of transcriptomic datasets for the phenotype. We trained ISSAC on each of the five transcriptomic datasets (i.e., GSE #) of GBM, coupled in each case to all the data from the other brain phenotypes. The full multi-class signatures were completely relearned (every step) with the only difference in each case being which single GBM dataset was included in the training stage. We then assessed the accuracy of correctly classifying GBM transcriptomes measured in the four held-out datasets from all other possible phenotypes. We term this evaluation method as “hold-one-lab-in validation”. The overall hold-one-lab-in validation performance, or the average of all classification accuracies in Figure 3a, was 38%. This shows that, in general, individual datasets do not consistently yield robust molecular signatures. For example, GBM signatures from GSE8692 (6 samples, ref. 21) and GSE9171 (13 samples, ref. 22) led to average accuracies of 22% and 0% for classifying independent GBM samples from other studies, respectively. These significantly low performance results are not surprising for these sets given the very small sample numbers. To an extent, relatively larger datasets could indeed yield disease signatures of higher average accuracy. However, sample size was not the sole determining factor of signature performance. For example, training on GSE4412 (59 samples, ref. 19) gave an average accuracy of 23% (Figure 4a) on the remaining GBM samples from the other studies. As a notable exception, training on GSE4271 (76 samples, ref. 20) alone resulted in the best overall average accuracy (87%) in correctly classifying samples from the four held-out GBM datasets, with individual validation set accuracies ranging from 78% to 100% (Table S6). However, when GSE4290 (77 samples, ref. 23) was used as the training set, there was over a 30% lower average GBM classification accuracy (56%) despite the nearly identical sample size with GSE4271.

Figure 4

Molecular signatures from multi-study, integrated datasets have higher average phenotype prediction accuracy and lower performance variance than those from individual datasets.

Molecular signatures from multi-study, integrated datasets have higher average phenotype prediction accuracy and lower performance variance than those from individual datasets.

A Hold-one-lab-in validation results for each of the five glioblastoma (GBM) datasets. Gray line indicates average accuracy on the four validation sets. B Leave-one-lab-out validation results for each of the five GBM datasets. Blue and red line indicates average accuracy of GBM signatures from leave-one-lab-out (L1LO) validation and hold-one-lab-in (H1LI) validation, respectively. C H1L1 validation to test GBM signatures from GSE4412, GSE4271, and GSE4290, while keeping the number of samples in the GBM training set the same. 50 samples were randomly selected from each GBM dataset for signature learning. H1LI validation was executed ten times for each of the three GBM datasets. Error bars indicate standard deviations. D L1LO validation to test GBM signatures on GSE4412, GSE4271, and GSE4290 validation sets, while 50 total samples were randomly selected from the other four GBM datasets for signature learning. L1LO validation was executed ten times. E Data from C and D are used to show GBM signatures' accuracies on GSE4412, GSE4271, and GSE4290 validation sets when the GBM training data were from individual or combined GBM datasets. Ψ, Φ, and Ω indicate statistical significance relative to GSE4271, GSE4290, and GSE4412, respectively (Tukey's post-hoc test, p<0.01). F Signal-to-noise ratios (SNRs) from data in E. SNR was calculated as the ratio of average accuracy to standard deviation. We found considerable discrepancy between the minimum and maximum validation set accuracies for training sets GSE4412 (0% and 83%, respectively) and GSE4290 (17% and 92%) (Table S6). This indicates that batch effects, as well as potential biological discrepancies between populations studied at different sites, can lead to remarkable variation among transcriptomic datasets of supposedly the same phenotype. This “dataset variation” is widespread in large-scale expression studies, causing inconsistencies in molecular signature identification and performance reproducibility [34]. Large variation within and across transcriptomic datasets of GBM is perhaps not surprising, given that GBM is known to have various molecular subtypes [35]. Therefore, as mentioned above, molecular signatures from any single dataset need to be approached with caution in terms of their generalization. We next analyzed how the multi-study integration approach affects performance robustness. One of each of the five datasets of GBM was sequentially withheld as the validation set, while all remaining gene expression data (including those from all other phenotypes) were used for training. The GBM signature was then evaluated on the held-out validation set. We term this strategy as “leave-one-lab-out validation”. Classification accuracies using this approach ranged from 63% (GBM training set: 155 samples across four datasets; validation set: GSE4271, 76 samples) to 100% (GBM training set: 225 samples across four datasets; validation set: GSE8692, 6 samples) (Figure 4b). The average accuracy of the five leave-one-lab-out validations was 83%, which was considerably higher than that obtained from training on individual GBM datasets (38%). We conjecture that this result is due to the underlying variation in the training sets better representing the true variation in the population, both by achieving a greater sample size, as well as by having the samples come from a broader range of situations. To evaluate how multi-study dataset integration alone affects performance robustness independent of sample size, we performed hold-one-lab-in and leave-one-lab-out validations for the studies with the largest number of samples, GSE4412, GSE4271, and GSE4290 (59, 76, and 77 samples, respectively) while training on the same number of samples for GBM. More specifically, the same steps in the analyses of Figure 4a and Figure 4b were used, while GBM signatures were learned from a training set of exactly 50 samples chosen randomly from either an individual dataset or across four combined datasets (with the fifth data set left out for validation). This process was conducted ten times for each GBM training set. The average performances of hold-one-lab-in and leave-one-lab-out validations were 47% and 70%, respectively. Overall, the results were consistent with our two aforementioned conclusions: 1) when a molecular signature is learned from an individual dataset, its ability to accurately and precisely represent phenotype features across a broad population highly varies depending on the particular dataset used for training (Figure 4c and Table S7); and 2) combining datasets considerably increased average accuracy (Figure 4d and Table S7). Thus, dataset integration across multiple studies, even without change in sample size, can lead to significant improvements in predictive performance. Lastly, we used the results in Figure 4c and Figure 4d to compare performances of different GBM signatures on the same validation set (Figure 4e). In all cases, signatures from combined datasets had, on average, higher classification accuracy than those from any of the individual datasets—even though the same number of samples was used in the training sets and were tested on a validation set independent of the training set. These results were then used to evaluate the precision of a GBM signature's classification accuracy by calculating its “signal-to-noise ratio (SNR)”. SNR in the accuracy estimate was calculated herein as the ratio of average classification accuracy to standard deviation in the accuracy estimate across studies. We found that, for all validation set cases, GBM signatures developed on the basis of multiple datasets had SNRs greater by at least two fold than those from individual data sets. This clearly shows that learning on integrated datasets leads to molecular signatures that have higher and more consistent (i.e. less variable) predictive performance (Figure 4f), and motivated our choice in developing the brain cancer ISSAC signature to only use cases where we had at least 2 independent studies to learn across. Overall, we have shown that when a broader range of conditions within a particular phenotype is presented during the classifier-learning stage, ISSAC can better distinguish the true disease signal from noise prior to independent validation. However, single and/or smaller training sets that were used to define the classifiers might not be representative of, or generalizable to, larger populations – leading to poor validation results. Therefore, the utilization of all currently available datasets from various sources and conditions may be a promising approach to finding novel diagnostic markers, and eventually bringing the successful adaptation of genomic biomarkers into clinical practice. Also, prospective design of studies is generally best when they utilize multiple sites to avoid over-fitting to particular contexts. It is worth mentioning that in some cases, molecular signatures from a single source can have (or at least appear to have) superior performance, as demonstrated by the molecular signatures from GSE4271. Specifically, training on a single GSE4271 data set provided higher accuracy (87%, Figure 4a) than learning on any of the four sets combined (average 83%, Figure 4b). Indeed, when such surprisingly robust single datasets are identified, they potentiate significant new insight into the underlying heterogeneities present in a patient population of a disease phenotype. Such data sets can be utilized for follow-up studies, and hence serve as a valuable resource to the scientific and medical communities. It is, however, difficult in practice to predict in advance data set robustness, which must be ensured through careful sample collection and data set preprocessing techniques. To help ensure the production of reliable omics-based data sets, we recommend the following: 1) Good experimental design, such as clearly defining clinical phenotypes of interest; 2) When collecting new experimental data, sufficient sample size must be obtained; 3) All aspects of the experimental and analytical procedures must be carefully controlled to avoid batch effects; and 4) No confounding from factors unrelated to phenotype(s) of interest must occur.

Brain cancer marker-panel achieves high classification accuracy in cross-validation

As shown by our leave-one-lab-out validations, learning signatures across multiple datasets significantly improved average classification accuracy with concomitant reduction in performance variance. In this regard, the brain cancer marker-panel obtained using all currently available microarray data simultaneously (Tables 2 and 3) should represent more robust phenotype signatures. The classification performance of this comprehensive brain cancer marker-panel was evaluated by ten-fold cross-validation (Figure S3). Our marker-panel achieved a 90% average of phenotype-specific classification accuracies (Table 4), showing strong promise against a multi-category, multi-dataset background at the gene expression level. In addition, we observed higher classification accuracy (93%) among the expression profiles for which a unique diagnosis was obtained without subsequent disambiguation from the decision-tree (Table S8). Furthermore, the glioblastoma (GBM) classification accuracy previously seen in our leave-one-lab-out analysis (83%) is comparable to that seen in cross-validation (85%). Indeed, that these two accuracies are so close suggests that, for GBM, the effects of variability among the datasets from different institutions and time-points have been mostly overcome by integration across multiple training studies.

Table 4

Classification performance of brain cancer marker-panel in ten-fold cross-validation.

Actual phenotype	Predicted phenotype (%)a								Total
	EPN	GBM	MDL	MNG	OLG	PA	normal	UCb
EPN	92.2	2.8	0.3	1.7	1.3	0.6	0.2	1.0	102
GBM	0.7	84.8	0.2	0.5	11.9	0.1	0.3	1.3	231
MDL	2.2	2.3	91.1	0.8	2.7	0.2	0.0	0.8	101
MNG	0.1	1.8	0.0	97.5	0.1	0.2	0.0	0.2	161
OLG	0.5	20.7	0.2	0.0	74.6	2.1	0.0	2.0	61
PA	1.3	2.3	0.0	0.0	1.3	94.4	0.0	0.8	62
normal	0.0	0.5	0.0	0.1	0.7	0.0	98.5	0.1	203

UC (Unclassified samples). When using the node classifiers, expression profiles that did not exert a signature of any phenotype (i.e., did not percolate down to at least one positive terminal node) were rejected from classification. In this case, the Unclassified sample is treated as a misclassification.

Accuracies reflect average performance in ten-fold cross-validation conducted ten times. The main diagonal gives the average classification accuracy of each class (bold), and the off-diagonal elements show the erroneous predictions. UC (Unclassified samples). When using the node classifiers, expression profiles that did not exert a signature of any phenotype (i.e., did not percolate down to at least one positive terminal node) were rejected from classification. In this case, the Unclassified sample is treated as a misclassification. Four other brain cancers (ependymoma, medulloblastoma, meningioma, and pilocytic astrocytoma) have estimated accuracies of at least 91%, suggesting clear differences between them and the other phenotypes at the transcriptomic level. The anatomical region specificity of these four cancers may have contributed toward their highly accurate separation, as there are regional areas of unique gene expression patterns. Roth et al. analyzed gene expression of 20 anatomically distinct regions of the central nervous system [30] and clustered all anatomical sites into distinct groups, providing evidence of region-specific expression patterns. However, results from another study analyzing gene expression data from distinct brain regions suggested that clustering disparities might also be due to activity of distinct brain cell types, rather than solely on region [36], [37]. Furthermore, if region specificity played a dominant role in classification, we would expect to see a high number of misdiagnoses to occur between the normal brain, which was derived from 25 different locations (Text S3), and the six cancers. Such a trend was not observed in Table 4. Therefore, our predictive results suggest a stronger contribution from underlying cell-type specific and disease-intrinsic elements than from region effects alone. Compared to the cross-validation accuracies of other phenotypes, lower performance was observed for GBM and oligodendroglioma (OLG) (85% and 75%, respectively). This could have been mainly a consequence of the limited ability of the marker-panel to correctly differentiate these two cancers from each other. Indeed, the distinction of these two phenotypes from transcriptomics seems to be rather difficult in general, and our accuracies here are comparable to those reported previously in two-phenotype comparison studies [38], [39]. Furthermore, our signatures did show an excellent degree of sensitivity (96%) and specificity (97%) for distinguishing these two well-progressed gliomas as a set from all other brain phenotypes. There exist genetic tests and methods that differentiate GBM and OLG well, such as the combined loss of chromosome arms 1p and 19q [40], and over-expression of the transcription factor protein Olig2 [41], but our goal in this particular study was to evaluate molecular discriminatory power as represented in transcriptomes across multiple brain cancers.

Marker-panel genes' association to cancer biology

Several genes in our marker panel are strongly associated with brain cancers, suggesting putative relationships to the underlying pathophysiology of their corresponding phenotypes. One such gene is NRCAM (nodes 4 and 5 of Figure 1 and Table 2), which was reported as a marker for high-risk neuroblastoma [42] and poor prognostic ependymoma [43]. NRCAM was also found to be over-expressed in cell lines derived from pilocytic astrocytomas and glioblastoma multiforme tumors [44]. The receptor tyrosine kinase DDR1, a predicted marker gene for PA when expressed higher than TIA1 and MAB21L1 (nodes 6 and 7), was found to be over-expressed in high-grade gliomas and to promote tumor cell invasion [45]. FLNA was detected in the serum of high-grade astrocytoma (grade 3 and GBM) patients [46], and ANXA1, a gene that encodes an anti-inflammatory phospholipid binding protein, was implicated in astrocytoma progression [47]. These reports are consistent with our identification of FLNA and ANXA1 as two classifier genes expressed higher in GBM than in oligodendroglioma (nodes 12 and 13). The basic helix-loop-helix (bHLH) transcription factor OLIG2 is innately expressed in oligodendrocytes and was recently characterized as a key antagonist of p53 function in neural stem cells and malignant gliomas [48]. In accordance with lower expression of OLIG2 as an EPN classifier in this study (node 8), OLIG2 expression was used as a negative marker to differentiate EPN from other gliomas [49]. SEMA3E, one of several classifier genes for PA (node 11), has been reported to drive invasiveness of melanoma cells in mice [50]. And finally, mutation to IDH2 (node 4) in GBM is well known, with occurrence reported in 80% of secondary glioblastomas [51], [52]. That the genes in our marker panel have previously confirmed ties to brain cancers raises the question of what is the underlying molecular framework surrounding the generation of gene-pair classifiers, which would be an interesting direction for future studies. Among the gene pairs in our marker panel, we focus on two pairs (below) in which the genes' common functional roles or relevance to cancer suggest putative relationships to corresponding pathology. Our discussions below point to potential biological relationships underlying the observed gene expression reversals, representing hypotheses that require further experimental validation. One of the classifier gene pairs involved in the differentiation between meningioma and the remaining five brain cancers (EPN, GBM, MDL, OLG, PA) are two metabolic enzymes, IDH2 and GMDS (node 4). IDH2 converts isocitrate to α-ketoglutarate within the TCA cycle. This reaction produces NADPH, which not only is an essential cofactor for many metabolic reactions, but also helps to protect the cell against oxidative damage [53]. Moreover, GMDS aids the biosynthesis of GDP-fucose from GDP-mannose in mannose metabolism, in which NADPH is produced [54]. That the enzymatic activities of both IDH2 and GMDS participate in the conversion between NADP+ and NADPH is interesting, considering the well-known alteration to cellular metabolism and deregulated redox balance in cancer [55]. Possible MNG-specific mutations in IDH2 and/or GMDS, or changes in the regulatory network that controls the expression of these two genes, may affect cellular redox balance and functions of other metabolic enzymes. The TLE4 and OLIG2 gene pair is used to differentiate EPN from GBM, OLG, and PA (node 8). TLE4, a human homolog of the Drosophila Groucho protein, represses the Wnt and FGF developmental signaling pathways [56]–[58] by recruiting deacetylases to histones H3 and H4 [59]. FGF receptor signaling was reported to control neuronal and glial cell development by regulating OLIG2 expression in zebrafish [58]. This connection between these two genes in regards to brain cell development could be reflective of the extent of cell-type differentiation (a hallmark of cancer), or lack thereof, unique to EPN compared with the other gliomas. To develop further hypotheses of the functional relationships between the classifiers and pathophysiological traits, we looked for statistical enrichment of biological properties (e.g. biological processes, chromosome numbers) on an exhaustive list of gene pairs discriminating GBM and OLG (Text S4). Our statistical enrichment of biological processes of gene-set i and gene-set j (the union of genes in each gene-pair classifier that are expressed relatively higher and lower in GBM, respectively) showed that the genes reflect disease properties. Specifically, the genes that are in gene-set i, or those expressed higher in GBM compared to OLG, were the most enriched in the biological process of ‘Immunity and Defense’ (Figure S4); this is in concordance with clinical observations showing high degree of inflammation inside malignant tumors (such as GBM), as well as the subsequent high number of immune cells. Our additional reports on the statistical enrichment of certain chromosome numbers link our classifiers to known genomic aberrations of their respective brain cancers, providing further insight as to why certain genes might have been selected as classifiers.

Looking ahead: Molecular signatures based on putative blood borne biomolecules offer a glimpse into possible molecular diagnostics

The work reported herein has focused on identifying a structured molecular signature that can separate major brain cancers simultaneously, as well as on evaluating issues related to reproducibility in molecular signatures. However, our long-term motivation for wanting molecular signatures of an organ system is ultimately to find corresponding signatures in the blood, where they can be assayed non-invasively. Blood bathes virtually all organs, which secrete proteins and nucleic acids. Subsets of these secreted biomolecules can potentially constitute disease signatures for molecular diagnostics, as measurement technologies mature. Moreover, the blood is easily accessible in contrast to biopsies of diseased organs for obtaining transcript or protein profiles. In this regard, the brain represents an organ system where a critical need exists to develop non-invasive techniques to monitor its health state through secreted proteins. Previously, organ-specific proteins have been detected in blood; when these proteins changed in concentration or chemical structure, the tissue origin of this change was identified [60]. For blood-based, organ-specific diagnostics, molecular signatures need to detect and stratify various possible cancers and other pathological conditions simultaneously. In the context of this current study, an intriguing question is if training ISSAC on shed or secreted blood borne biomolecule measurements identifies molecular signatures that allow us to distinguish health from disease; and if diseased, which one and how far has it progressed? Thus, the approach laid out herein for transcriptomics is a foundation for identifying similar signatures from blood proteins as these measurements become more abundant. As proof of concept and to provide candidates for targeted proteomics analysis, we performed the above transcriptomic analysis of finding brain cancer signatures using only the genes that are annotated to encode extracellular proteins (Materials and Methods). We trained ISSAC on a total of 767 genes that matched this criterion, which led to a new brain cancer marker-panel composed of 41 gene-pair classifiers from 71 unique features (Figure S5). When looking at the case of GBM gene-pair classifiers, i.e. 59 node-based genes involved in the detection of either GBM or phenotype groups that include GBM, 11 were previously identified as potential GBM-specific serum markers (detected either from GBM cell-line secretome experiments or in human plasma): APOD [61], CALU [62], CD163 [63], [64], CHI3L1 [65]–[67], CSF1 [68], [69], EGFR [68], [70], [71], IGFBP2 [62], [72]–[75], NID1 [76], PDGFC [77], [78], PSG9 [72], and PTN [79]. We provide the functional roles of these genes in Table S9. None of these previous studies performed relative abundance comparisons or measured expression ratios, so we are unable to answer at this time whether the particular relative expression reversal patterns would be observed in serum. We were not able to find any direct available evidence associating the remaining GBM classifier genes to potential serum-based markers. Nevertheless, we were encouraged that ISSAC was able to verify some previously identified potential GBM markers, which provides support for its use towards a blood-based test since there is currently no clinically approved GBM-specific, serum-based biomarker. Our marker-panel, composed entirely of genes encoding extracellular products, obtained an average classification accuracy of 87% in 10-fold cross-validation (Table S10), which compares favorably to the average accuracy we previously achieved using all the genes in the microarray (90%). This suggests that strong signal may possibly persist for phenotype distinction even when using only secreted biomolecules from diseased organs. If indeed there are enough biomolecules secreted into the blood at concentrations that can be accurately and consistently detected by e.g., targeted mass spectrometry, then there is the very exciting possibility that organ-specific pathologies, such as those described above, can be detected from the blood. This would truly make blood a powerful window into health and disease.

Materials and Methods

Multi-study dataset of human brain cancer transcriptomes

All transcriptomic data used in our analysis are publicly available at the NCBI Gene Expression Omnibus (GEO). We integrated 921 microarray samples of six brain cancers (ependymoma, glioblastoma multiforme, medulloblastoma, meningioma, oligodendroglioma, pilocytic astrocytoma) and normal brain across 16 independent studies into a transcriptome multi-study dataset. Importantly, we obtained the raw data (.CEL files) from each of these studies and preprocessed them uniformly using identical techniques to greatly reduce extraneous sources of technical artifacts (discussed below). All data manipulation and numerical calculations were performed using MATLAB (MathWorks). To ensure data quality and to help control for systemic bias and batch effects), we used the following strict criteria and reasoning for brain phenotype selection: 1) Expression profiles must have been conducted on either the Affymetrix Human Genome U133A or U133 Plus 2.0 microarray platform. This allowed maximum microarray sample collection without considerable reduction in number of overlapping classifier features (i.e., microarray probe-sets). 2) Transcriptomic datasets (i.e., GSE #) for each phenotype must have been collected from at least two independent sources to help mitigate batch effects. 3) All datasets must have consisted of no fewer than 5 microarray samples. 4) All datasets must have originated from primary brain tumor or tissue biopsies. Expression profiles from cell-lines or laser micro-dissections were not used in our study to better ensure sample consistency. 5) Raw microarray intensity data (.CEL files) must have been available on GEO for consensus preprocessing (described below). 6) Sample preparation protocols must have been fully disclosed on GEO. 7) All microarray samples in a dataset of a given phenotype were used in order to take into consideration all sources of heterogeneity. That is, no samples were excluded because their gene expression profiles were abnormal for their associated phenotypes. We are aware that this may allow mislabeled samples, e.g. samples that were originally misclassified by the histopathologist upon class labeling (Text S5), to be used in the classifier-learning stage, and thereby limit the biological “purity” of a phenotype in the training set. This can pose a serious challenge in interpreting misclassified samples that actually seem to be a much better match (or even perfect match) to a different phenotype, leading to questions of whether a misclassification is due to ISSAC's limitation in distinguishing phenotypes, or whether a re-evaluation of the original tumor biopsy is required. Despite these concerns, we concluded this to be the most stringent test. After an exhaustive search on GEO, we identified 921 microarray samples from 16 studies that met the above criteria (as of January 2011). Information on all datasets (e.g., publication sources, Affymetrix platforms, GEO dataset IDs, and microarray sample IDs), studies, and GEO microarray sample IDs used in our study is available in Table 1, Table S1, and Table S2, respectively. Raw microarray intensity data (.CEL files) were obtained online from GEO and preprocessed uniformly. More specifically, common probe-sets were found across all transcriptome samples, and consensus preprocessing was performed on all the raw microarray image data to build a consensus dataset. This step removes one major non-biological source of variance between different studies. These preprocessed samples were used to build a multi-study integrated dataset of human brain cancer and normal brain transcriptomes. Finally, stringent probe-set filtering was used to remove spurious classifier features. Our consensus preprocessing and probe-set filtering methods are explained in further detail below. Our integrated and uniformly pre-processed dataset is available on our group's webpage (http://price.systemsbiology.net/downloads) as a community resource for those who wish to conduct their own analyses.

Consensus preprocessing using GCRMA

All gene expression data used in our study were measurements conducted on either the Affymetrix Human Genome U133A or U133Plus2.0 oligonucleotide microarrays. The expression level of a target gene on these two platforms is measured by first quantifying the total intensity of fluorescently labeled RNA fragments (from patient specimens) that bind to a probe set, or the set of complementary 25-mer oligonucleotide probe sequences. The intensities of all probe sets (raw measurements in the form of .CEL files) are then adjusted for background variability and normalized across all samples to obtain the target genes' final expression values. Raw .CEL data files were downloaded directly from GEO. Probe set information used in this study were based on the latest Affymetrix annotations. Raw intensity measurements of all microarray samples considered in our study were preprocessed collectively (consensus preprocessing) using the MATLAB implementation of the microarray preprocessing GCRMA [80]. Only the probe sets that map to known genes and exist on both Affymetrix platforms (same oligonucleotide sequences) were considered for preprocessing. The use of individual Affymetrix probe sets as classifiers (and not the mean or median of their expression values as demonstrated in other microarray-based studies) imposes limitations in the classifiers' multi-platform compliance, as discussed in Text S6 and Text S7.

Probe set filtering using MAS5 calls

Probe sets of Affymetrix microarrays have “perfect match” probes that are exactly complementary to the target gene's mRNA sequence. They also have “mismatch” probes that contain a mismatched nucleotide halfway along the probe sequence, and are used to estimate the degree of non-specific binding. To ensure that a probe set is reliably detected, the measurements of the “perfect match” probes must be significantly greater than those of the “mismatch” probes. This is usually assessed based on statistical measures. The MAS5 preprocessing software makes expression quality calls based on the nonparametric Wilcoxon signed-rank test. The “absent” call is made when the p-value is greater than 0.06, representing no significant difference between the measurements of the “perfect match” and those of the ‘mismatch’ probes [81]. We eliminated probes that were determined to be “absent” in all samples of the consensus dataset. After this probe set filtering step, 19,656 probe sets (corresponding to target genes) within each microarray sample were kept for further analysis. All GCRMA preprocessing and MAS5 probe set filtering procedures were conducted separately for training and test set samples, i.e., inside each cross-validation or hold-out loop, in order to avoid possible cross-talk between the two datasets. Genes that were excluded based on the MAS5 “absent” calls on the training data were also removed from the corresponding test data.

Description of ISSAC algorithm

A tree-structured framework for brain cancer diagnostics

Let denote the set of class labels, in our case the seven brain phenotypes: six cancers and normal. Given an expression profile x, the objective is to determine its true phenotype Y . The main assumption is that there are natural groupings L among the phenotypes. Thus, testing for these groupings can more efficiently utilize the available training data, leading to more accurate classification than testing for each phenotype individually. Based on these attributes, the natural structure to represent is then a diagnostic hierarchy in the form of a binary hierarchical decision tree T. Each node t is associated with a set of phenotypes L . The root of T contains all the phenotypes and each leaf (terminal node) of T delineates a single phenotype. Overall, this representation is nested, in the sense that the set of phenotypes at every non-terminal node is the disjoint union of the phenotypes of the two child nodes. This tree is built from the training data by agglomerative hierarchical clustering derived from features of the profiles, as discussed below.

Node classifiers are assembled according to the diagnostic hierarchy

There is a binary classifier f for every node t except for the root. The classifier f is a function of the expression profile x. Put simply, f is a collective “test” for phenotypes in L versus all other phenotypes. More formally, the classifier returns two possible outcomes: f (x) = 1 (i.e., positive) signals that we accept the hypothesis that Y L, and f (x) = 0 (i.e., negative) signals that we reject this hypothesis and conclude that Y L. In particular, f is not a test for L versus the phenotypes in the sibling of t, as would be the case with a standard decision tree. Rather, f looks for traits within a given profile x which characterize all phenotypes in L simultaneously, such that a positive result signifies that the classifier assumes the true class of x belongs to L. Classifier learning begins at the two child nodes of the root, and the classifiers are learned from two types of training data. The positive training data for learning the classifier f for node t are all the expression profiles of the phenotypes in L. The negative training data are all the profiles of the phenotypes that are not in L. Being binary, each classifier has two performance metrics: the sensitivity of f is the probability that f (x) = 1 given x is from the positive training data, and the specificity of f is the probability that f (x) = 0 given x is from the negative training data. Due to the coarse-to-fine, hierarchical manner in which the classifiers are processed, we required the sensitivity of f to be as close to unity as possible. This can be accomplished at the expense of specificity by adjusting a threshold, as discussed below. The reason for imposing a high sensitivity on each classifier is that if a test profile is rejected from belonging to L by the classifier when in fact it does belong to L, it cannot be recovered. However, the reduced specificity is only local to each node, and the overall specificity increases with testing at subsequent nodes.

A coarse-to-fine screening yields candidate phenotypes

The strategy for processing any given profile x with the diagnostic hierarchy is breadth-first, coarse-to-fine. Starting from the two child nodes of the root, classifiers are executed sequentially and adaptively, with f performed if and only if all its ancestor tests are performed and are positive. More specifically, f is performed if and only if f = 1 for every node s T between t and the root. As soon as f = 0 for a non-terminal node t, none of the descendant classifiers in the sub-tree rooted at t are performed. This is because a negative response of f means that the phenotype is unlikely to belong to L and the set of phenotypes associated with descendant of t, which are necessarily subsets of L. This facilitates pruning whole subsets of phenotypes at once. The complete coarse-to-fine screening process for x results in a set of detected phenotypes. We denote this set by L(x) . These are the phenotypes corresponding to a complete chain of positive results for all f from root to leaf. Equivalently, L(x) is the set of phenotypes that are not ruled out by any test performed. During the diagnostic process, a profile may traverse only one path all the way to the terminal node. In this case, L(x) consists of a single phenotype d, and the diagnostic process terminates with Y = d as the predicted phenotype. However, a profile may also traverse multiple branches to the terminal nodes of T, in which case L(x) consists of multiple candidate phenotypes (see the discussion on resolving ambiguities below). Moreover, a profile may reach no terminal nodes, in which case L(x) is empty. When no terminal node is reached, the profile is determined to be outside of L, and labeled as ‘Unclassified’.

Resolving ambiguities using a decision-tree approach

When L(x) consists of multiple phenotypes, it becomes necessary to refine the diagnosis. The ambiguity is resolved by another tree-structured process – an ordinary decision tree based on the same diagnostic hierarchy. For every pair of sibling nodes, e.g., and , we learn a classifier g which tests Y L versus Y L, just as in an ordinary decision-tree (the process of classifier identification is elaborated below). Starting from the root of the tree, execution of the decision-tree classifiers directs a profile to one of two sibling nodes sequentially until it reaches a terminal node. Unlike the process of traversing the hierarchy of node classifiers, a sample that enters the decision tree is directed to one and only one leaf node, and hence uniquely labeled.

Classifier design and learning

Every node classifier is based on expression level comparisons between two genes. Let G be the set of all genes for which we have microarray expression data, and denote the set of all distinct pairs of genes by . For each gene-pair , consider the Boolean feature (x) of an expression profile x = . (x) assumes the value 1 if gene is expressed higher than gene (i.e., ) in x, and the value 0 otherwise (i.e., ). These are the features that have been used in previous work on relative expression reversals [31]. Each node classifier f is constructed from a small set of gene-pairs P , the binary outcomes of for all P, and a constant threshold k. More specifically, f (x) = 1 if ≥k, and f (x) = 0 otherwise. The threshold k takes values between 1 and |P|. There is a classifier of this nature for every node t T, except for the root. The set of gene-pairs P = P and threshold k = k depend on the node t. Hence, for each t, the classifier f = 1 if at least k of the gene-pair comparisons in P are positive ; otherwise, f = 0. The comparisons are chosen such that, for each pair in P, we expect to see gene expressed more than gene under the assumption that the phenotype of x belongs to L, whereas if the phenotype of x does not belong to L, we expect to see the reverse. For every node t, every pair of all gene-pair combinations is “scored” by the difference between the probability of the event that = 1 given Y L and the probability given Y L. These probabilities are estimated from the training data, and the subset of pairs with the highest scores are chosen. Since each positive (resp., negative) comparison is viewed as evidence for Y L (resp., Y L), we can then favor sensitivity over specificity by varying the threshold k. That is, by choosing a relatively small value for k relative to the number of comparisons in P, we can make it highly likely that the classifier responds positively when in fact the sample belongs to the set L. We show the sets of gene-pairs P for each of the twelve nodes in our diagnostic hierarchy in Table 2 and an illustrative example in Figure 1. Finally, the decision tree classifiers g are all based on comparisons of single gene-pairs at all edges of the diagnostic hierarchy. While multiple gene pairs were used at each decision point in the node-based tree, only a single gene pair was used at each decision point in the decision-tree. This is due to the difference in the motivation of building the two trees; the node-based tree was constructed to maximize sensitivity and minimize false-positives with as many pairs as necessary, while the decision-tree was designed to resolve multiple diagnoses (i.e. ties) which could be done with only one pair. We show the pair of genes for each of the six decision-tree classifiers in Table 3 and Figure 2. MATLAB implementations of the ISSAC algorithm and a step-by-step tutorial are available to download at http://price.systemsbiology.net/downloads.

Selecting genes that encode extracellular products

Using Gene Ontology (GO) annotations, we have identified a list of 767 genes (mapped on 1,085 total probes) in every transcriptome sample that encode for possible blood-borne proteins. Specifically, we selected only the genes whose products are annotated to be in either the ‘Extracellular Space’ or the ‘Extracellular Region’ cellular locations. We use this gene set as a starting point for targeted blood diagnostics. All computational steps and analyses in regards to molecular signature discovery are identical to those discussed above. The overall method of ISSAC can be summarized into three main steps. A ISSAC constructs the framework for brain cancer diagnosis – a tree-structured hierarchy of all brain cancer phenotypes built using an agglomerative hierarchical clustering algorithm on gene expression training data. B Training on gene-expression data from all brain phenotypes, ISSAC identifies disjoint, gene-pair classifiers at all nodes (excluding the root) and edges of the diagnostic hierarchy, and accumulates them into their respective marker panels. The chosen pairs are the ones that best differentiate between the phenotype sets, and are based entirely on the reversal of relative expression. C ISSAC uses the gene-pair classifiers for class prediction. Briefly, given a gene expression profile, ISSAC executes the node classifiers in a hierarchical, top-down fashion within the disease diagnostic hierarchy to identify the phenotype(s) whose class-specific signature(s) is present. In case of multiple class candidates (i.e. node classifiers for multiple leaves are positive), the ambiguity is resolved by aggregating all the decision-tree classifiers into a classification decision-tree, thereby leading any expression signature down one unique path toward a single phenotype. (TIFF) Click here for additional data file. Brain phenotypes are grouped into a global diagnostic hierarchy, which allows an intuitive representation of the classification process. The diagnostic hierarchy is built using a data-driven, iterative approach, and is free of manual, ad-hoc construction. In each iteration, two classes, or two groups of classes, with the lowest TSP score (Materials and Methods and Text S1) among all pair-wise comparisons, come together to form a node. This approach optimizes overall classification by placing the more challenging decisions further away from the base of the tree (i.e. root), thereby ensuring only the minimum misclassifications percolate down the tree. The final form of the brain phenotype diagnostic hierarchy represents a hierarchical structure of nested partitions, where the multi-class problem is decomposed into smaller and smaller groups using a sequence of diagnostic decision rules. (TIFF) Click here for additional data file. Performance evaluation using ten-fold cross-validation. A Ten-fold cross-validation is conducted ten times to obtain the average accuracy. In every iteration of cross-validation, the order of samples within a particular class are randomly permuted before training/test set allocations. B Our marker panel achieved a 90.4% average of phenotype-specific classification accuracies, showing strong promise against a multi-category, multi-dataset background at the gene expression level. (TIFF) Click here for additional data file. Statistical enrichment analysis on PANTHER database biological processes and chromosome numbers of the top 500, 1,000, and 1,500 gene-pair classifiers for GBM vs. OLG. A ‘Immunity and Defense’ was the most enriched biological process for ‘gene-set i’, reflecting the chronic inflammatory conditions inside the GBM tumor. B ‘Neuronal Activities’ was the most enriched biological process for ‘gene-set j’, reflecting decrease in neuronal behavior and possibly other brain cell activity inside the GBM tumor. The genes in ‘gene-set i’ and ‘gene-set j’ were the most enriched in C Chromosome 1 and D Chromosome 10, respectively, reflecting the major chromosome aberrations of the two brain cancers. Ψ delineates the most enriched category. aBiological process abbreviation (name): AAM (Amino acid metabolism), APT (Apoptosis), CM (Carbohydrate metabolism), CA (Cell adhesion), CC (Cell cycle), CPD (Cell proliferation and differentiation), CSM (Cell structure and motility), DP (Developmental processes), HMS (Homeostatis), IMD (Immunity and defense), IPRT (Intracellular protein transport), MCN (Muscle contraction), NA (Neuronal activities), NAM (Nucleic acid metabolism), PMM (Protein metabolism and modification), SM (Sulfur metabolism), ST (Signal transduction), and TR (Transport). (TIFF) Click here for additional data file. Gene-pair classifiers based on only the genes that encode extracellular products. Gene pairs are shown at their corresponding nodes in the brain disease diagnostic hierarchy. The corresponding node-based marker panel consists of 41 classifier pairs and 71 unique classifier features. (TIFF) Click here for additional data file. Phenotype specimen descriptions and main results for all GEO accessions used in this study. (PDF) Click here for additional data file. GEO microarray sample IDs used in this study. (PDF) Click here for additional data file. Node marker panel for brain cancer and normal transcriptome classification. Node #: Corresponds to numerical labels shown in the brain phenotype diagnostic hierarchy (Figure 1). Brain phenotype abbreviation (name): EPN (Ependymoma), GBM (Glioblastoma multiforme), MDL (Medulloblastoma), MNG (Meningioma), normal (Normal brain), OLG (Oligodendroglioma), and PA (Pilocytic astrocytoma). Gene i/Gene j: the gene expressed higher and lower in the gene-pair, respectively, within each corresponding phenotype. Gene name/Chromosome locus: according to Entrez Gene. Affymetrix Probe ID: For both Affymetrix Human Genome U133A and U133Plus2.0 Arrays. k: The minimum number of gene-pair classifiers whose decision rule outcomes for a test sample are required to be ‘true ( = 1)’ for the sample to be classified as the phenotype(s) of the corresponding node. (PDF) Click here for additional data file. Decision-tree marker panel for brain cancer and normal transcriptome classification. For each classifier decision rule (i.e. Is Gene i>Gene j ?), 1 and 0 delineates ‘true’ and ‘false’, respectively, and ‘–’ denotes that the outcome is not used for classification. The vertical binary pattern under each class label corresponds to a phenotype-specific molecular signature. (PDF) Click here for additional data file. Summary of expression differences between genes-pair classifiers. Node #: Corresponds to numerical labels shown in the brain phenotype diagnostic hierarchy (Figure 1). Brain phenotype abbreviation (name): EPN (Ependymoma), GBM (Glioblastoma multiforme), MDL (Medulloblastoma), MNG (Meningioma), normal (Normal brain), OLG (Oligodendroglioma), and PA (Pilocytic astrocytoma). Sample number: Number of total samples in classes of respective Node #. Gene i/Gene j: the gene expressed higher and lower in the gene-pair, respectively, within each corresponding phenotype. Gene name/Chromosome locus: according to Entrez Gene. Affymetrix Probe ID: For both Affymetrix Human Genome U133A and U133Plus2.0 Arrays. k: The minimum number of gene-pair classifiers whose decision rule outcomes for a test sample are required to be ‘true ( = 1)’ for the sample to be classified as the phenotype(s) of the corresponding node. Ranked expression differences of each gene pair (i.e. Rank_gene_i – Rank_gene_j) were calculated for each sample, and Mean, St. dev., Max., Min., and Median were found across all samples within classes of respective Node #. (PDF) Click here for additional data file. Hold-one-lab-in validation accuracies of glioblastoma signatures. (PDF) Click here for additional data file. Hold-one-lab-in (H1LI) and leave-one-lab-out (L1LO) validation accuracies of glioblastoma signatures when training data were constrained to 50 total samples. H1LI and L1LO validations were performed ten times for each category of training data. In each validation trial, 50 samples were randomly selected from the single microarray dataset (for H1L1) or from the multi-study, combined dataset (for L1LO). (PDF) Click here for additional data file. Ten-fold cross-validation accuracies when only the node marker panel was required to reach unique diagnoses. Sample size: Average proportion of total samples that reached unique diagnoses via node marker panel. Accuracy: Reflects average performance in ten-fold cross-validation conducted ten times. (PDF) Click here for additional data file. Functional roles of 11 previously identified GBM serum markers that are present in our extracellular-product encoding marker-panel. a http://www.ncbi.nlm.nih.gov/gene. (PDF) Click here for additional data file. Ten-fold cross-validation accuracies of gene-pair classifiers composed of genes that encode extracellular products. aAccuracies reflect average performance in ten-fold cross-validation conducted ten times. The main diagonal gives the average classification accuracy of each class (bold), and the off-diagonal elements show the erroneous predictions. bUC (Unclassified samples). When using the node classifiers, expression profiles that did not exert a signature of any phenotype (i.e., did not percolate down to at least one positive terminal node) were rejected from classification. In this case, the Unclassified sample is treated as a misclassification. (PDF) Click here for additional data file. Step-by-step description of how ISSAC works. (PDF) Click here for additional data file. Advantages of using relative expression reversals to build classifiers. (PDF) Click here for additional data file. Twenty-five anatomical regions of the human brain from which normal transcriptome samples were obtained. (PDF) Click here for additional data file. Global statistical enrichment analysis of gene-pair classifiers. (PDF) Click here for additional data file. Whether ISSAC can play a role in identifying misdiagnoses. (PDF) Click here for additional data file. Reasoning for selecting only Affymetrix microarray platforms and for not using probe-specific offsets. (PDF) Click here for additional data file. Candidates of brain cancer molecular signatures. (PDF) Click here for additional data file.

79 in total

1. Gene expression profiling of gliomas strongly predicts survival.

Authors: William A Freije; F Edmundo Castro-Vargas; Zixing Fang; Steve Horvath; Timothy Cloughesy; Linda M Liau; Paul S Mischel; Stanley F Nelson
Journal: Cancer Res Date: 2004-09-15 Impact factor: 12.701

2. Systems biology and new technologies enable predictive and preventative medicine.

Authors: Leroy Hood; James R Heath; Michael E Phelps; Biaoyang Lin
Journal: Science Date: 2004-10-22 Impact factor: 47.728

Review 3. Microarray data analysis: from disarray to consolidation and consensus.

Authors: David B Allison; Xiangqin Cui; Grier P Page; Mahyar Sabripour
Journal: Nat Rev Genet Date: 2006-01 Impact factor: 53.242

Review 4. Tackling the widespread and critical impact of batch effects in high-throughput data.

Authors: Jeffrey T Leek; Robert B Scharpf; Héctor Corrada Bravo; David Simcha; Benjamin Langmead; W Evan Johnson; Donald Geman; Keith Baggerly; Rafael A Irizarry
Journal: Nat Rev Genet Date: 2010-09-14 Impact factor: 53.242

5. Simple decision rules for classifying human cancers from gene expression profiles.

Authors: Aik Choon Tan; Daniel Q Naiman; Lei Xu; Raimond L Winslow; Donald Geman
Journal: Bioinformatics Date: 2005-08-16 Impact factor: 6.937

6. Deregulated Wnt/beta-catenin program in high-risk neuroblastomas without MYCN amplification.

Authors: X Liu; P Mazanek; V Dam; Q Wang; H Zhao; R Guo; J Jagannathan; A Cnaan; J M Maris; M D Hogarty
Journal: Oncogene Date: 2007-08-27 Impact factor: 9.867

Review 7. Roles of insulin-like growth factor binding protein-2 (IGFBP-2) in glioblastoma.

Authors: Tsuyoshi Fukushima; Hiroaki Kataoka
Journal: Anticancer Res Date: 2007 Nov-Dec Impact factor: 2.480

8. Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas.

Authors: Nathan D Price; Jonathan Trent; Adel K El-Naggar; David Cogdell; Ellen Taylor; Kelly K Hunt; Raphael E Pollock; Leroy Hood; Ilya Shmulevich; Wei Zhang
Journal: Proc Natl Acad Sci U S A Date: 2007-02-21 Impact factor: 11.205

9. Proteins with altered levels in plasma from glioblastoma patients as revealed by iTRAQ-based quantitative proteomic analysis.

Authors: Poonam Gautam; Sudha C Nair; Manoj Kumar Gupta; Rakesh Sharma; Ravindra Varma Polisetty; Megha S Uppin; Challa Sundaram; Aneel K Puligopu; Praveen Ankathi; Aniruddh K Purohit; Giriraj R Chandak; H C Harsha; Ravi Sirdeshmukh
Journal: PLoS One Date: 2012-09-28 Impact factor: 3.240

10. Olig2-induced neural stem cell differentiation involves downregulation of Wnt signaling and induction of Dickkopf-1 expression.

Authors: Sung-Min Ahn; Kyunghee Byun; Deokhoon Kim; Kiyoung Lee; Jong Shin Yoo; Seung U Kim; Eek-Hoon Jho; Richard J Simpson; Bonghee Lee
Journal: PLoS One Date: 2008-12-18 Impact factor: 3.240

8 in total

1. A Cell-Surface Membrane Protein Signature for Glioblastoma.

Authors: Dhimankrishna Ghosh; Cory C Funk; Juan Caballero; Nameeta Shah; Katherine Rouleau; John C Earls; Liliana Soroceanu; Greg Foltz; Charles S Cobbs; Nathan D Price; Leroy Hood
Journal: Cell Syst Date: 2017-03-29 Impact factor: 10.304

Review 2. An argument for mechanism-based statistical inference in cancer.

Authors: Donald Geman; Michael Ochs; Nathan D Price; Cristian Tomasetti; Laurent Younes
Journal: Hum Genet Date: 2014-11-09 Impact factor: 4.132

3. Measuring the effect of inter-study variability on estimating prediction error.

Authors: Shuyi Ma; Jaeyun Sung; Andrew T Magis; Yuliang Wang; Donald Geman; Nathan D Price
Journal: PLoS One Date: 2014-10-17 Impact factor: 3.240

4. Global Meta-analysis of Airborne Bacterial Communities and Associations with Anthropogenic Activities.

Authors: Xiaoqing Jiang; Chunhui Wang; Jinyuan Guo; Jiaheng Hou; Xiao Guo; Haoyu Zhang; Jie Tan; Mo Li; Xin Li; Huaiqiu Zhu
Journal: Environ Sci Technol Date: 2022-07-04 Impact factor: 11.357

5. Transcriptional analysis of aggressiveness and heterogeneity across grades of astrocytomas.

Authors: Chunjing Wang; Cory C Funk; James A Eddy; Nathan D Price
Journal: PLoS One Date: 2013-10-11 Impact factor: 3.240

6. Selective analysis of cancer-cell intrinsic transcriptional traits defines novel clinically relevant subtypes of colorectal cancer.

Authors: Claudio Isella; Francesco Brundu; Sara E Bellomo; Francesco Galimi; Eugenia Zanella; Roberta Porporato; Consalvo Petti; Alessandro Fiori; Francesca Orzan; Rebecca Senetta; Carla Boccaccio; Elisa Ficarra; Luigi Marchionni; Livio Trusolino; Enzo Medico; Andrea Bertotti
Journal: Nat Commun Date: 2017-05-31 Impact factor: 14.919

7. A predictive index for health status using species-level gut microbiome profiling.

Authors: Vinod K Gupta; Minsuk Kim; Utpal Bakshi; Kevin Y Cunningham; John M Davis; Konstantinos N Lazaridis; Heidi Nelson; Nicholas Chia; Jaeyun Sung
Journal: Nat Commun Date: 2020-09-15 Impact factor: 14.919

8. CancerLivER: a database of liver cancer gene expression resources and biomarkers.

Authors: Harpreet Kaur; Sherry Bhalla; Dilraj Kaur; Gajendra Ps Raghava
Journal: Database (Oxford) Date: 2020-01-01 Impact factor: 3.451

8 in total