Henry A Ogoe, Shyam Visweswaran, Xinghua Lu, Vanathi Gopalakrishnan.
Abstract
BACKGROUND: Most transcriptomic data from microarrays are generated from small sample sizes relative to the large number of measured biomarkers, which makes it very difficult to build accurate and generalizable disease state classification models. Integrating information from different, but related, transcriptomic data may help build better classification models. However, most proposed methods for integrative analysis of transcriptomic data cannot incorporate domain knowledge, which can improve model performance. To this end, we have developed a methodology that leverages transfer rule learning and functional modules, which we call TRL-FM, to capture and abstract domain knowledge in the form of classification rules to facilitate integrative modeling of multiple gene expression data. TRL-FM is an extension of the transfer rule learner (TRL) that we developed previously. The goal of this study was to test our hypothesis that "an integrative model obtained via the TRL-FM approach outperforms traditional models based on single gene expression data sources".
Year: 2015 PMID: 26202217 PMCID: PMC4512094 DOI: 10.1186/s12859-015-0643-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. The TRL-FM framework. The framework for knowledge transfer using functional mapping and classification rules works as follows. First, use a feature selector to select relevant variables from the source and target datasets. Second, combine the selected variables into a single list and partition them into functional modules (FMs). Third, using the discovered functional modules together with rules induced from the source data, build a prior hypothesis of classification rules. Finally, using the prior hypothesis as a seed, learn a new classification rule model on the target data.
Fig. 2. A protocol for identifying functional modules using spectral clustering and the Gene Ontology. Given an input set of genes, first map each gene to the GO term(s) that annotate it according to the GO annotation database [23]. That is, if G denotes the set of input genes, map each gene g (where g ∈ G) to the GO terms go (where go ∈ GO) that annotate it. Here, GO refers to the set of biological process terms in the Gene Ontology. For example, the mapping M(g1) = {go1, go3} means that terms go1 and go3 annotate gene g1. Second, form the set union of all GO terms that annotate at least one member of the input gene set. Third, using semantic similarity [24] as a distance measure, construct a similarity matrix among the GO terms. Fourth, with the similarity matrix as input, apply the spectral clustering algorithm [25] to group the GO terms into functionally similar clusters. Fifth, apply the Silhouette value technique [26] to estimate an appropriate cluster size and to assess cluster validity. Finally, map each gene g (i.e., each key of map M) to cluster C if there exists at least one term in C that annotates g.
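The set-based steps of the protocol in Fig. 2 (the gene-to-term map M, the union of terms, and the final gene-to-cluster assignment) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy genes and terms, and the `term_to_cluster` assignment are all hypothetical, and the similarity matrix, spectral clustering, and Silhouette steps are assumed to have been run elsewhere.

```python
from collections import defaultdict


def union_of_terms(gene_to_terms):
    """Step 2: set union of all GO terms that annotate at least one input gene."""
    terms = set()
    for annotated in gene_to_terms.values():
        terms |= set(annotated)
    return terms


def genes_to_modules(gene_to_terms, term_to_cluster):
    """Final step: gene g joins cluster C if at least one term in C annotates g."""
    modules = defaultdict(set)
    for gene, annotated in gene_to_terms.items():
        for term in annotated:
            cluster = term_to_cluster.get(term)
            if cluster is not None:
                modules[cluster].add(gene)
    return dict(modules)


# Toy mapping M (hypothetical genes and terms, echoing M(g1) = {go1, go3}).
M = {"g1": {"go1", "go3"}, "g2": {"go2"}, "g3": {"go3", "go4"}}
# Assumed output of the clustering step; it is not computed in this sketch.
term_to_cluster = {"go1": "C1", "go2": "C1", "go3": "C2", "go4": "C2"}

all_terms = union_of_terms(M)
modules = genes_to_modules(M, term_to_cluster)
```

Because a gene can be annotated by terms in several clusters, the same gene may appear in more than one functional module, as g1 does here.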
Fig. 3. An algorithm for implementing the TRL-FM framework. This algorithm, an extensive modification of the TRL algorithm [8], incorporates a subroutine (see Fig. 4) for mapping functionally related variables between the source and target data. The statements in red font are additions to the TRL algorithm.
Fig. 4. An algorithm for generating prior rules for seeding learning of a rule model. This algorithm, a subroutine within the TRL-FM framework, leverages information from domain knowledge, through functional modules, to instantiate prior rules to seed learning on the target data.
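The caption does not spell out how prior rules are instantiated, so the following Python sketch shows only one plausible reading: a source rule whose variable is absent from the target data is re-expressed on target variables that share a functional module with it. The rule representation (variable, test, class label), the function name, and the decision to carry the test over unchanged are all assumptions made for illustration, not the published algorithm.

```python
def instantiate_prior_rules(source_rules, modules, target_vars):
    """Build seed rules for the target data from source rules (hypothetical scheme).

    source_rules: list of (variable, test, label) tuples induced on the source data.
    modules:      dict mapping a functional module name to a set of variables.
    target_vars:  set of variables available in the target data.
    """
    prior = []
    for variable, test, label in source_rules:
        if variable in target_vars:
            # The variable exists in the target data: reuse the rule as is.
            candidates = [variable]
        else:
            # Assumption: substitute target variables from any shared module.
            candidates = sorted(
                v
                for members in modules.values()
                if variable in members
                for v in members & target_vars
            )
        for candidate in candidates:
            rule = (candidate, test, label)
            if rule not in prior:
                prior.append(rule)
    return prior


# Hypothetical variables, modules, and rules.
source_rules = [("gA", "high", "case"), ("gB", "low", "control")]
modules = {"FM1": {"gA", "gC"}, "FM2": {"gB", "gD", "gE"}}
target_vars = {"gA", "gD", "gE"}

prior_rules = instantiate_prior_rules(source_rules, modules, target_vars)
```

In this toy run the rule on gA is kept directly, while the rule on gB, which is missing from the target, is mapped onto gD and gE through module FM2.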
Experimental data sources. Sources of data for experiments and their descriptions
| Disease | Author | Year | Platform | Sample Size (Cases/Controls) | Source |
|---|---|---|---|---|---|
| Prostate Cancer | Singh | 2002 | HG-U95Av2 | 102 (52/50) | www.broad.mit.edu |
| Prostate Cancer | Lapointe | 2004 | cDNA | 103 (62/41) | GSE3933 |
| Prostate Cancer | Wallace | 2008 | HGU133A2 | 89 (69/20) | GSE6956 |
| Prostate Cancer | Nanni | 2006 | HG-U133A | 30 (23/7) | GSE3868 |
| Prostate Cancer | Varambally | 2005 | HG-U133 Plus 2 | 13 (7/6) | GSE3325 |
| Prostate Cancer | Welsh | 2001 | HG-U95A | 34 (25/9) | public.gnf.org/cancer |
| Prostate Cancer | Yu | 2004 | HG-U95Av2 | 83 (65/18) | GSE6919 |
| Brain Cancer | Freije | 2004 | HG-U133A,B | 85 (59/26) | GSE4412 |
| Brain Cancer | Phillips | 2006 | HG-U133A,B | 100 (76/24) | GSE4271 |
| Brain Cancer | Sun | 2006 | HG-U133 Plus 2 | 100 (81/19) | GSE4290 |
| Brain Cancer | Petalidis | 2008 | HG-U133A | 58 (39/19) | GSE1993 |
| Brain Cancer | Gravendeel | 2009 | HG-U133 Plus 2 | 175 (159/16) | GSE16011 |
| Brain Cancer | Paugh | 2010 | HG-U133 Plus 2 | 42 (33/9) | GSE19578 |
| Brain Cancer | Yamanaka | 2006 | Agilent | 29 (22/7) | GSE4381 |
| Lung Disease Studies (IPF) | Pardo | 2005 | Codelink | 24 (13/11) | GSE2052 |
| Lung Disease Studies (IPF) | Yang | 2007 | Agilent 43 K | 29 (20/9) | GSE5774 |
| Lung Disease Studies (IPF) | Konishi | 2009 | Agilent 4x44K | 38 (23/15) | GSE10667 |
| Lung Disease Studies (IPF) | KangA | 2011 | Agilent 4x44K | 63 (52/11) | Dr. Kaminski |
| Lung Disease Studies (IPF) | KangB | 2011 | Agilent 8x60K | 96 (75/21) | Dr. Kaminski |
| Lung Disease Studies (IPF) | Larsson | 2008 | HG-U133 Plus 2 | 12 (6/6) | GSE11196 |
| Lung Disease Studies (IPF) | Emblom | 2010 | cDNA | 58 (38/20) | GSE17978 |
FMs for target, Petalidis. Functional modules to facilitate functional mapping to target (Petalidis) variables from sources (Freije, Gravendeel, Paugh, Phillips, Sun, Yamanaka) variables
| Clusters | GO Functional Theme | Markers |
|---|---|---|
| FM1 | DNA repair | |
| FM2 | Apoptotic processes | |
| FM3 | Regulation of protein phosphorylation | |
| FM4 | Cell differentiation | |
| FM5 | Transport | |
| FM6 | Signal transduction | |
| FM7 | Cell proliferation | |
| FM8 | Response to glucose stimuli | |
| FM9 | Toll-like receptor signaling | |
| FM10 | Transcription | |
| FM11 | Response to stress | |
FMs for target, KangA. Functional modules to facilitate functional mapping to target (KangA) variables from sources (Emblom, KangB, Konishi, Larsson, Pardo, Yang) variables
| Clusters | GO Functional Theme | Markers |
|---|---|---|
| FM1 | Regulation of kinase activity | |
| FM2 | Notch signaling | |
| FM3 | Cell junction assembly | |
| FM4 | Cell adhesion | |
| FM5 | T cell receptor signaling pathway | |
| FM6 | Brain development | |
| FM7 | Protein homooligomerization | |
| FM8 | Pyrimidine nucleobase catabolic process | |
| FM9 | Transcription | |
| FM10 | Transsulfuration | |
| FM11 | Muscle cell differentiation | |
| FM12 | Sex determination | |
| FM13 | Superoxide metabolic process | |
| FM14 | Cellular protein localization | |
FMs for target, Lapointe. Functional modules to facilitate functional mapping to target (Lapointe) variables from sources (Nanni, Singh, Varambally, Wallace, Welsh, Yu) variables
| Clusters | GO Functional Theme | Markers |
|---|---|---|
| FM1 | Cardiac and urinary organ morphogenesis | |
| FM2 | Lipid metabolism | |
| FM3 | Regulation of chemokine production | |
| FM4 | Histone acetylation and methylation | |
| FM5 | Signal transduction | |
| FM6 | Chemotaxis | |
| FM7 | Transcription | |
| FM8 | Regulation of transcription | |
| FM9 | Translation | |
| FM10 | Cellular response to cytokines | |
Comparison of TRL-FM with baseline RL. AUCs when RL (baseline) and TRL-FM are applied to build a classification rule model on three datasets, Petalidis (brain), KangA (IPF), and Lapointe (prostate). For TRL-FM, the FMs are the medium through which knowledge transfer occurs. “Union” is an ensemble of all FMs. The mean and the standard error of the mean (SEM) for the AUC of a dataset was obtained by 10-fold cross-validation
| Dataset | Petalidis | KangA | Lapointe |
|---|---|---|---|
| | AUC (SEM) | AUC (SEM) | AUC (SEM) |
| Baseline | 0.83 (0.06) | 0.86 (0.07) | 0.93 (0.03) |
| FM1 | | | |
| FM2 | | | |
| FM3 | | | |
| FM4 | | | |
| FM5 | | | |
| FM6 | | | |
| FM7 | | | |
| FM8 | | | |
| FM9 | | 0.86 (0.07) | |
| FM10 | | 0.86 (0.07) | |
| FM11 | | | |
| FM12 | | 0.86 (0.07) | |
| FM13 | | | |
| FM14 | | | |
| Union | | | |
For each dataset, positive transfer is shown in bold font, while underlined AUCs denote negative transfer
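Each "AUC (SEM)" entry above is a mean over 10 cross-validation folds together with the standard error of that mean, i.e. the sample standard deviation divided by the square root of the number of folds. A minimal Python sketch of this arithmetic follows; the per-fold AUC values are invented for illustration and do not come from the study.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean (SEM) of a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    # SEM = sample standard deviation (n - 1 denominator) / sqrt(n)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical AUCs from 10 cross-validation folds (not data from the paper).
fold_aucs = [0.8, 0.9, 0.7, 1.0, 0.8, 0.9, 0.7, 1.0, 0.8, 0.9]
mean_auc, sem_auc = mean_and_sem(fold_aucs)
print(f"{mean_auc:.2f} ({sem_auc:.2f})")
```

With small datasets some folds contain very few test samples, so per-fold AUCs can vary widely and the SEM should be read alongside the mean.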
Comparison of TRL with baseline RL. AUCs when RL (baseline) and TRL are applied to build a classification rule model on three datasets, Petalidis (brain), KangA (IPF), and Lapointe (prostate). SRC means the source dataset (e.g., for target Petalidis, SRC1 is Freije, see Additional File 3). The mean and the standard error of the mean (SEM) for the AUC of a dataset was obtained by 10-fold cross-validation
| Dataset | Petalidis | KangA | Lapointe |
|---|---|---|---|
| | AUC (SEM) | AUC (SEM) | AUC (SEM) |
| Baseline | 0.83 (0.06) | 0.86 (0.07) | 0.93 (0.03) |
| SRC1 | | 0.86 (0.07) | 0.93 (0.03) |
| SRC2 | | 0.86 (0.07) | |
| SRC3 | | | |
| SRC4 | | 0.86 (0.07) | 0.93 (0.03) |
| SRC5 | | 0.86 (0.07) | |
| SRC6 | | 0.86 (0.07) | |
For each dataset, positive transfer is shown in bold font, while underlined AUCs denote negative transfer
Comparison of classification performance of all classifiers on all datasets. Comparison of classification performance (AUC) among selected machine learning methods, namely Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF), C4.5, Naïve Bayes (NB), and Penalized Logistic Regression (PLR), as well as RL (baseline), TRL, and TRL-FM, on all datasets. For TRL, the AUC for the highest performing source is shown, while for TRL-FM, the medium of knowledge transfer is the union of all FMs. In addition, the average AUC (AVG AUC) and the average standard error of the mean (AVG SEM) for each classifier across all datasets are provided (see Additional File 5 for detailed results)
| Dataset | SVM | LDA | RF | C4.5 | NB | PLR | RL | TRL | TRL-FM |
|---|---|---|---|---|---|---|---|---|---|
| Emblom | 1.00 | 1.00 | 1.00 | 0.98 | 0.96 | 0.98 | 0.97 | 0.97 | 0.94 |
| Freije | 0.74 | 0.72 | 0.72 | 0.73 | 0.82 | 0.76 | 0.76 | 0.78 | 0.80 |
| Gravendeel | 0.52 | 0.59 | 0.59 | 0.53 | 0.63 | 0.56 | 0.49 | 0.49 | 0.59 |
| KangA | 0.93 | 0.86 | 0.86 | 0.79 | 0.94 | 0.90 | 0.86 | 0.93 | 0.97 |
| KangB | 0.91 | 0.87 | 0.87 | 0.87 | 0.91 | 0.95 | 0.83 | 0.91 | 0.95 |
| Konishi | 0.90 | 0.68 | 0.68 | 0.74 | 0.90 | 0.90 | 0.78 | 0.83 | 0.95 |
| Lapointe | 0.96 | 0.91 | 0.91 | 0.94 | 0.97 | 0.96 | 0.93 | 0.93 | 0.97 |
| Larsson | 0.33 | 0.67 | 0.67 | 0.58 | 0.67 | 0.67 | 0.75 | 0.75 | 1.00 |
| Nanni | 0.70 | 0.61 | 0.61 | 0.44 | 0.57 | 0.65 | 0.54 | 0.54 | 0.64 |
| Pardo | 0.83 | 0.85 | 0.85 | 0.63 | 0.80 | 0.88 | 0.85 | 0.90 | 0.95 |
| Paugh | 0.48 | 0.45 | 0.45 | 0.43 | 0.50 | 0.45 | 0.51 | 0.52 | 0.54 |
| Petalidis | 0.75 | 0.71 | 0.71 | 0.69 | 0.80 | 0.80 | 0.83 | 0.88 | 0.91 |
| Phillips | 0.73 | 0.70 | 0.70 | 0.66 | 0.75 | 0.80 | 0.66 | 0.73 | 0.78 |
| Singh | 0.89 | 0.90 | 0.90 | 0.89 | 0.88 | 0.91 | 0.89 | 0.89 | 0.93 |
| Varambally | 1.00 | 0.92 | 0.92 | 0.67 | 1.00 | 1.00 | 0.83 | 1.00 | 1.00 |
| Wallace | 0.82 | 0.85 | 0.85 | 0.76 | 0.81 | 0.87 | 0.76 | 0.81 | 0.84 |
| Welsh | 0.94 | 0.66 | 0.66 | 0.79 | 0.93 | 0.94 | 0.92 | 0.95 | 0.93 |
| Yamanaka | 0.57 | 0.57 | 0.57 | 0.56 | 0.71 | 0.56 | 0.50 | 0.50 | 0.79 |
| Yang | 0.69 | 0.51 | 0.51 | 0.89 | 0.57 | 0.73 | 0.94 | 0.94 | 0.89 |
| Yu | 0.94 | 0.93 | 0.93 | 0.80 | 0.97 | 0.94 | 0.88 | 0.90 | 0.93 |
| AVG AUC | 0.77 | 0.74 | 0.74 | 0.71 | 0.80 | 0.81 | 0.77 | 0.80 | 0.86 |
| AVG SEM | 0.06 | 0.07 | 0.07 | 0.07 | 0.06 | 0.05 | 0.07 | 0.06 | 0.04 |
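As a quick arithmetic check on the AVG AUC row, the short Python sketch below recomputes the column means for RL and TRL-FM from the 20 dataset rows listed above and counts the datasets on which TRL-FM scores higher than the RL baseline. The values are copied from the table; the variable names are illustrative only.

```python
# AUCs copied from the table above, in row order (Emblom ... Yu).
rl = [0.97, 0.76, 0.49, 0.86, 0.83, 0.78, 0.93, 0.75, 0.54, 0.85,
      0.51, 0.83, 0.66, 0.89, 0.83, 0.76, 0.92, 0.50, 0.94, 0.88]
trl_fm = [0.94, 0.80, 0.59, 0.97, 0.95, 0.95, 0.97, 1.00, 0.64, 0.95,
          0.54, 0.91, 0.78, 0.93, 1.00, 0.84, 0.93, 0.79, 0.89, 0.93]

avg_rl = sum(rl) / len(rl)
avg_trl_fm = sum(trl_fm) / len(trl_fm)

# Number of datasets where TRL-FM has a strictly higher AUC than RL.
wins = sum(1 for base, transfer in zip(rl, trl_fm) if transfer > base)

print(f"RL mean AUC: {avg_rl:.3f}")
print(f"TRL-FM mean AUC: {avg_trl_fm:.3f}")
print(f"TRL-FM > RL on {wins} of {len(rl)} datasets")
```

The recomputed means (about 0.774 for RL and 0.865 for TRL-FM) agree with the reported 0.77 and 0.86 up to rounding of the two-decimal table entries.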
Pairwise significance test for classification performance among all methods. A Mann–Whitney paired-sample signed rank test with significance level α = 5%. P-values were adjusted with the Benjamini-Hochberg method [45]
| Method | SVM | LDA | RF | C4.5 | NB | PLR | RL | TRL |
|---|---|---|---|---|---|---|---|---|
| LDA | 0.1230 | | | | | | | |
| RF | 0.1230 | | | | | | | |
| C4.5 | | 0.0943 | 0.0943 | | | | | |
| NB | 0.3453 | | | | | | | |
| PLR | 0.0737 | | | | 0.6280 | | | |
| RL | 0.3473 | 0.6825 | 0.6825 | | 0.1151 | 0.0700 | | |
| TRL | 0.6924 | 0.0648 | 0.0648 | | 0.8666 | 0.6825 | | |
| TRL-FM | | | | | | | | |
Significant p-values are displayed in bold font
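The Benjamini-Hochberg adjustment named in the caption can be written in a few lines of Python. The sketch below implements the standard step-up procedure; the function name and the example p-values are illustrative and are not taken from the study, and the signed rank test that would produce the raw p-values is not shown.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
```

Each adjusted value is then compared against the chosen significance level (5% in the table above) to decide which pairwise differences to report as significant.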
Comparison of classification performance of all non-transfer rule learning classifiers on post meta-analysis datasets. Using the AW [34] meta-analysis method, only biomarkers with a statistically significant effect size within a particular disease type are used for the class prediction task (see Additional File 6 for further details)
| Dataset | SVM | LDA | RF | C4.5 | NB | PLR | RL |
|---|---|---|---|---|---|---|---|
| Emblom | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.96 |
| Freije | 0.77 | 0.74 | 0.74 | 0.71 | 0.72 | 0.79 | 0.73 |
| Gravendeel | 0.50 | 0.73 | 0.73 | 0.59 | 0.69 | 0.67 | 0.49 |
| KangA | 0.82 | 0.72 | 0.72 | 0.86 | 0.93 | 0.96 | 0.86 |
| KangB | 0.94 | 0.89 | 0.89 | 0.85 | 0.94 | 0.93 | 0.83 |
| Konishi | 0.88 | 0.58 | 0.58 | 0.77 | 0.88 | 0.87 | 0.80 |
| Lapointe | 0.95 | 0.92 | 0.92 | 0.91 | 0.95 | 0.95 | 0.91 |
| Larsson | 0.33 | 0.33 | 0.33 | 0.67 | 0.42 | 0.67 | 0.75 |
| Nanni | 0.56 | 0.68 | 0.68 | 0.55 | 0.72 | 0.66 | 0.75 |
| Pardo | 0.88 | 0.88 | 0.88 | 0.78 | 0.83 | 0.88 | 0.80 |
| Paugh | 0.57 | 0.45 | 0.45 | 0.65 | 0.51 | 0.66 | 0.52 |
| Petalidis | 0.86 | 0.66 | 0.66 | 0.75 | 0.82 | 0.79 | 0.84 |
| Phillips | 0.75 | 0.83 | 0.83 | 0.68 | 0.80 | 0.81 | 0.64 |
| Singh | 0.92 | 0.86 | 0.86 | 0.84 | 0.87 | 0.90 | 0.92 |
| Sun | 0.73 | 0.66 | 0.66 | 0.66 | 0.73 | 0.73 | 0.69 |
| Varambally | 0.75 | 0.92 | 0.92 | 0.79 | 0.92 | 1.00 | 0.83 |
| Wallace | 0.81 | 0.82 | 0.82 | 0.77 | 0.76 | 0.81 | 0.70 |
| Welsh | 0.98 | 0.80 | 0.80 | 0.85 | 0.98 | 0.91 | 0.92 |
| Yamanaka | 0.63 | 0.42 | 0.42 | 0.79 | 0.71 | 0.70 | 0.61 |
| Yang | 0.74 | 0.51 | 0.51 | 0.90 | 0.71 | 0.80 | 0.94 |
| Yu | 0.91 | 0.92 | 0.92 | 0.87 | 0.92 | 0.95 | 0.90 |
| AVG AUC | 0.78 | 0.73 | 0.73 | 0.77 | 0.80 | 0.83 | 0.78 |
| AVG SEM | 0.06 | 0.07 | 0.07 | 0.07 | 0.06 | 0.06 | 0.07 |
Comparing average performance per disease type to merged datasets per disease type. This table shows the average classification performance per disease type as compared to merged datasets per disease type. In the dataset column, Avg denotes average, MM denotes merged by meta-analysis, and M means merged by cross-platform data merging
| Dataset | SVM | LDA | RF | C4.5 | NB | PLR | RL | TRL | TRL-FM |
|---|---|---|---|---|---|---|---|---|---|
| Average performance per disease type | |||||||||
| Avg_brain | 0.67 | 0.66 | 0.66 | 0.64 | 0.73 | 0.69 | 0.66 | 0.68 | 0.76 |
| Avg_ipf | 0.80 | 0.78 | 0.78 | 0.78 | 0.82 | 0.86 | 0.85 | 0.89 | 0.95 |
| Avg_prostate | 0.89 | 0.83 | 0.83 | 0.76 | 0.88 | 0.90 | 0.82 | 0.86 | 0.89 |
| Merged per disease type by meta-analysis | |||||||||
| MM_brain | 0.67 | 0.70 | 0.70 | 0.69 | 0.70 | 0.69 | 0.67 | * | * |
| MM_ipf | 0.88 | 0.88 | 0.88 | 0.85 | 0.74 | 0.88 | 0.81 | * | * |
| MM_prostate | 0.89 | 0.84 | 0.84 | 0.81 | 0.70 | 0.85 | 0.76 | * | * |
| Merged per disease type by batch effect removal | |||||||||
| M_Brain | 0.50 | 0.51 | 0.51 | 0.48 | 0.53 | 0.51 | 0.54 | * | * |
| M_IPF | 0.67 | 0.63 | 0.63 | 0.60 | 0.63 | 0.64 | 0.68 | * | * |
| M_Prostate | 0.53 | 0.53 | 0.53 | 0.53 | 0.53 | 0.55 | 0.59 | * | * |
\* denotes that transfer learning methods were not evaluated. Currently, TRL and TRL-FM cannot be applied to cross-domain studies (i.e., transfer from one disease type to another)