Bioinformatics identification of modules of transcription factor binding sites in Alzheimer's disease-related genes by in silico promoter analysis and microarrays.

Regina Augustin, Stefan F Lichtenthaler, Michael Greeff, Jens Hansen, Wolfgang Wurst, Dietrich Trümbach.

Abstract

The molecular mechanisms and genetic risk factors underlying Alzheimer's disease (AD) pathogenesis are only partly understood. To identify new factors, which may contribute to AD, different approaches are taken including proteomics, genetics, and functional genomics. Here, we used a bioinformatics approach and found that distinct AD-related genes share modules of transcription factor binding sites, suggesting a transcriptional coregulation. To detect additional coregulated genes, which may potentially contribute to AD, we established a new bioinformatics workflow with known multivariate methods like support vector machines, biclustering, and predicted transcription factor binding site modules by using in silico analysis and over 400 expression arrays from human and mouse. Two significant modules are composed of three transcription factor families: CTCF, SP1F, and EGRF/ZBPF, which are conserved between human and mouse APP promoter sequences. The specific combination of in silico promoter and multivariate analysis can identify regulation mechanisms of genes involved in multifactorial diseases.

Year: 2011 PMID: 21559189 PMCID: PMC3090009 DOI: 10.4061/2011/154325

Source DB: PubMed Journal: Int J Alzheimers Dis

1. Introduction

Alzheimer's disease (AD) is the most common form of dementia, which slowly destroys neurons and causes serious cognitive disability [1]. Characteristics of AD are insoluble amyloid plaques and neurofibrillary tangles in the brains of AD patients, which extend progressively to neocortical brain areas during AD [2]. AD exists in a sporadic and familial (heritable) form. Mutations in amyloid-beta precursor protein (APP), presenilin1, and presenilin2 are associated with early-onset forms of familial AD, whereas sporadic AD occurs in people over the age of 65 years [3]. APP was the first gene linked to AD and is located on chromosome 21. APP is cleaved by different proteases named α-, β-, and γ-secretase. These proteases control the generation of the amyloid-β peptide (Aβ), which is considered the culprit in AD. β- and γ-secretase cleavage leads to Aβ formation. β-secretase is the aspartyl protease BACE1 [4, 5]. A homolog of BACE1, BACE2, cleaves within the Aβ domain and does not contribute to Aβ generation. γ-secretase is a heterotetramer consisting of the four subunits presenilin 1 or 2 (PS1, PS2), Aph1, Nicastrin, and Pen-2 [6]. Aggregates of Aβ are neurotoxic and start the so-called amyloid cascade, which describes the molecular mechanisms leading to AD, including formation of plaques and tangles [1]. The third protease, the alpha-secretase ADAM10 [7, 8], avoids formation of Aβ, because it cleaves APP inside the Aβ sequence [9]. Additionally, α-secretase generates the soluble sAPPα, which enhances memory in normal and amnesic mice [10]. In AD patients, the amount of sAPPα in the cerebrospinal fluid is reduced [11]. Microarray studies measure the changes in expression level for thousands of genes simultaneously, detect single nucleotide polymorphisms, and therefore are an unbiased approach to identify genes with an altered expression, for example, in diseases such as AD [12, 13]. 
Aims of the analysis of microarray datasets include the discovery of gene function [14], insights into human disease progression [15], and the prediction of gene regulatory elements such as transcription factor binding sites (TFBSs) of coregulated genes [16]. Transcription factors (TFs) bind predominantly to the upstream region of genes, and each TF has its own specific binding motif. TFs with a similar binding motif are grouped into a TF family. Combinations of TF binding sites in a defined order, distance range, and orientation are known as TFBSs modules [17]. Modules common to a set of genes, which act together in the same biological context, are able to control the expression of these gene products [18]. Conversely, the finding that the expression of different genes is coregulated in a certain biological process may indicate that they are functionally linked in this process. This may be applicable to the identification of new disease-linked genes, for example, in AD. Our aim was to identify modules of transcription factor binding sites in the promoters of AD-linked genes, which may contribute to AD pathogenesis. In our study, we developed a workflow that analyzes three existing microarray datasets, which were generated under different conditions and together contain over 400 arrays, using state-of-the-art bioinformatics tools with a focus on multivariate methods. Multivariate variable selection was performed because variables (transcripts) contribute to the discrimination of the input dataset only in combination with other variables rather than in isolation; this helps to identify highly correlated genes (i.e., interaction networks) [19]. We hypothesized that β- and γ-secretase, which are responsible for Aβ formation, are coregulated. Thus, we started by analyzing TF binding modules in the genes for β- and γ-secretase, that is, BACE1, PS1, and PS2. We also included the BACE1 homolog BACE2.

2. Materials and Methods

2.1. In Silico Promoter Analysis

Promoter analysis was done with Genomatix software (Munich/Germany). All promoter sequences are derived from the promoter sequence retrieval database ElDorado (Release 4.9, Human Genome NCBI build 37/hg19). The DiAlignTF task of GEMS Launcher was used to check for conserved TFBSs between the human and mouse APP promoter sequence (Matrix Family Library, Version 8.0, Vertebrates; Genomatix: 690 matrices from 162 families). The Frameworker tool (GEMS Launcher) searches for all modules composed of two or more TFBSs in aligned promoter sequences. A module is defined as a set of two or more TFBSs with a defined order, distance range between the individual TFBSs, and strand orientation. A total of 727 matrices from 170 families (Matrix Family Library, Version 8.2, Vertebrates; Genomatix) were used for the analysis. The ModelInspector searches for all determined modules of TFBSs in the human promoter library (first approach: ElDorado 07–2009: 93372 promoter regions; second approach: ElDorado 02–2010: 97259 promoter regions; Genomatix Promoter Database).
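
The module definition above (TFBSs in a defined order, with a distance range between neighbouring sites and a fixed strand orientation) can be illustrated with a small sketch. This is not the Genomatix FrameWorker or ModelInspector algorithm; the `Site`, `Element`, and `Module` types, the coordinates, and the distance bounds are hypothetical and only show what it means for a promoter to contain a given module.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Site:
    family: str  # TF family of a predicted binding site, e.g. "CTCF"
    pos: int     # position in the promoter in bp (hypothetical values)
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Element:
    family: str
    strand: str


@dataclass(frozen=True)
class Module:
    elements: Tuple[Element, ...]           # required TFBSs, in order
    distances: Tuple[Tuple[int, int], ...]  # (min, max) bp between neighbours


def find_module(sites: List[Site], module: Module) -> Optional[List[Site]]:
    """Return the first chain of sites that satisfies the module, else None."""
    ordered = sorted(sites, key=lambda s: s.pos)

    def extend(chain: List[Site], idx: int) -> Optional[List[Site]]:
        if idx == len(module.elements):
            return chain
        want = module.elements[idx]
        for s in ordered:
            if s.family != want.family or s.strand != want.strand:
                continue
            if chain:
                lo, hi = module.distances[idx - 1]
                if not (lo <= s.pos - chain[-1].pos <= hi):
                    continue
            found = extend(chain + [s], idx + 1)
            if found is not None:
                return found
        return None

    return extend([], 0)


# Hypothetical module and promoters (all coordinates are made up).
MODULE = Module(
    elements=(Element("CTCF", "+"), Element("EGRF", "+"), Element("SP1F", "-")),
    distances=((5, 50), (5, 80)),
)
PROMOTER_A = [Site("SP1F", 140, "-"), Site("CTCF", 60, "+"), Site("EGRF", 90, "+")]
PROMOTER_B = [Site("CTCF", 60, "+"), Site("EGRF", 200, "+"), Site("SP1F", 230, "-")]

hit_a = find_module(PROMOTER_A, MODULE)  # all constraints satisfied
hit_b = find_module(PROMOTER_B, MODULE)  # CTCF-EGRF gap exceeds the range
```

In this toy example only `PROMOTER_A` contains the module, because the CTCF and EGRF sites in `PROMOTER_B` lie further apart than the allowed distance range.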

2.2. Microarray Datasets

We used three microarray datasets downloaded from the Gene Expression Omnibus in this study. The dataset of Blalock et al. [12] consists of hippocampal probes: 9 controls and 22 AD patients with different severity (GSE1297). Gene expression was measured using GPL96: Affymetrix Human Genome U133A Array (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96) covering 22283 probesets. The second dataset used in our analysis consists of total RNA of brains from five-month-old double-transgenic (6 ADAM10/APP, 6 dnADAM10/APP, 6 monotransgenic APP control) mice (GSE10908) from Prinzen et al. [20]. Gene expression was measured using GPL1261: Affymetrix Mouse Genome 430 2.0 Array (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL1261) covering 45101 probesets. The third dataset from Webster et al. [21] consists of human cortical samples: 187 controls and 176 patients with diagnosis of late onset AD (LOAD) (GSE15222). Gene expression was measured using GPL2700: Sentrix HumanRef-8 Expression BeadChip (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL2700) covering 24354 probesets.

2.3. Multivariate Analysis

Statistical analysis was performed with R statistical software (R version 2.8.0, http://www.r-project.org/). Background correction and normalization of the microarray datasets (GSE1297, GSE10908) were done with the R function expresso from the R package affy. The parameter setting was as follows: bgcorrect.method (background adjustment method) = “mas”, normalize.method (normalization method) = “quantiles”, pmcorrect.method (perfect matches and mismatches adjustment) = “mas”, and summary.method (computation of expression values) = “mas”. The dataset GSE15222 is already rank-invariant normalized. We applied multiclass support vector machines with recursive feature elimination (mSVM-RFE). An SVM separates a set of objects (here, biological replicates) into classes such that the margin around the class boundaries, which is free of data points, is as wide as possible. We used the svm function from the e1071 package in R for SVM prediction and developed an algorithm for multiclass gene selection with recursive feature elimination according to Zhou and Tuck [22] and Guyon et al. [23]. We implemented our own mSVM-RFE method, as described in the following, because this specific combination of mSVM and RFE has not been available in R so far. First, the samples of a microarray dataset are randomly grouped into four stratified folds, that is, four equally sized folds in which each class is uniformly distributed, and all combinations of three folds are used for mSVM-RFE (four combinations per grouping). Stratified cross validation has smaller bias and variance than regular cross validation [24]. mSVM-RFE starts with all the features of a microarray, in our case gene expression values, and recursively eliminates the 10% of the remaining expression values that contribute least to the classification according to the cost function of the SVM classifier (based on the coefficients and support vectors), until a given number of expression values is reached.
This given number (the stop condition for the iterations) is passed to the program as a parameter when mSVM-RFE is started; in our case, it is 400 in the first part and 1000 in the second part of the workflow. The grouping into folds was done three times (to obtain stable results), so the mSVM-RFE algorithm was applied twelve times on different subsets of the original dataset (three groupings into four folds, with four mSVM-RFE runs per grouping because there are four combinations of three folds). We got twelve different gene selections and computed the frequency with which each gene occurs across the gene selections to identify the most important genes. In the first part of the workflow, the mSVM-RFE output is restricted to genes occurring in at least 2 out of 12 gene selections; in the second part, genes occurring in at least 1 out of 12 gene selections are taken. The gene selection in the second part is less stringent than in the first part, since a rigorous restriction of the number of input genes for the subsequent biclustering analysis reduces the number of genes in the resulting clusters. The function svm was used with default settings except the parameters type = “C-classification”, kernel = “linear”, cost = 0.1. The dataset GSE15222 was not filtered by mSVM-RFE but by the Illumina detection score [25]. Transcripts that have a detection score ≥0.99 in less than 90% of cases or 90% of controls are excluded, and 8650 probesets remain [21]. Furthermore, a two-sample t-test was performed for the expression values of the remaining 8650 probesets with the R function t.test with default settings, and afterwards FDR correction [26] was applied with the R function p.adjust with method = “fdr”. Genes were considered significantly regulated if the FDR value was equal to or below 0.05.
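
The resampling scheme around mSVM-RFE (three groupings into four stratified folds, 10% elimination per round, and counting how often each gene is selected) can be sketched as follows. This is a minimal illustration in Python, not the authors' R implementation: the SVM ranking step itself is omitted, all function names are hypothetical, the way the 10% is rounded is an assumption, and the two-class labelling of the 9 controls and 22 AD patients is a simplification of the multiclass setting.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[str], k: int,
                     rng: random.Random) -> List[List[int]]:
    """Split sample indices into k folds with each class spread evenly."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for lab in sorted(by_class):
        idx = by_class[lab]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        offset += len(idx)
    return folds


def training_subsets(labels: Sequence[str], k: int = 4, repeats: int = 3,
                     seed: int = 0) -> List[List[int]]:
    """All combinations of k-1 folds over several groupings (repeats * k)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for held_out in range(k):
            train = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
            subsets.append(sorted(train))
    return subsets


def rfe_sizes(n_features: int, stop: int, drop: float = 0.10) -> List[int]:
    """Number of features kept after each elimination round."""
    sizes = [n_features]
    while sizes[-1] > stop:
        n = sizes[-1]
        n_drop = max(1, int(n * drop))  # rounding rule is an assumption
        sizes.append(max(stop, n - n_drop))
    return sizes


def selection_frequency(selections: List[List[str]],
                        min_count: int) -> Dict[str, int]:
    """Keep genes that occur in at least min_count of the gene selections."""
    counts = Counter(g for sel in selections for g in set(sel))
    return {g: c for g, c in counts.items() if c >= min_count}


labels = ["control"] * 9 + ["AD"] * 22  # simplified two-class labelling
subsets = training_subsets(labels)      # 3 groupings x 4 combinations
schedule = rfe_sizes(22283, 400)        # probeset count and stop value from the text
```

Each entry of `subsets` would be passed to one mSVM-RFE run; the resulting gene lists would then go through `selection_frequency` with `min_count=2` (first part of the workflow) or `min_count=1` (second part).
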
Pearson's chi-squared tests were performed with the R function chisq.test from the R package stats to show whether the overlap between two different genesets is significant. The function chisq.test was used with default settings. Next, we applied biclustering by the biclust function from the biclust package in R to the output of the mSVM-RFE and the 8650 probesets of the third microarray study and used the method BCPlaid according to Turner et al. [27] to group coregulated genes into clusters. The method allows a gene to belong to more than one cluster, and each cluster is defined with regard to some, but not necessarily all, samples. In principle, we used default parameters except the following ones. Parameter setting for AD patients: cluster = “r” (to cluster rows (probesets)), row.release (threshold to prune rows in the clusters depending on row homogeneity) = 0.1, col.release (as before, with columns) = 0.2, shuffle (before cluster is added, its statistical significance is compared against random clusters defined by this parameter) = 10, back.fit (after a cluster is added, additional iterations can be done to refine the fitting of the cluster) = 10, max.layers (maximum number of clusters) = 10, iter.startup (number of iterations to find starting values) = 80, and iter.layer (number of iterations to find each cluster) = 80. Parameter setting for double transgenic mice: cluster = “r”, row.release = 0.3, col.release = 0.5, shuffle = 100, back.fit = 500, max.layers = 100, iter.startup = 1000, and iter.layer = 1000. Parameter setting for LOAD patients: cluster = “r”, row.release = 0.3, col.release = 0.5, shuffle = 10, back.fit = 100, max.layers = 20, iter.startup = 100, and iter.layer = 100. The expression profiles were established with the R function matplot from the R package graphics. The function was used with default values except col (colour of lines in plot) = c(1), type (type of plot) = “l”, and lty (line type) = “solid”. 
The coloured lines of the described genes are added to the plot by matplot function with the parameter add = TRUE (if TRUE, plots are added to current one), and the width of the coloured lines is enlarged by the parameter lwd = 4.
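
The overlap test described in this section (Pearson's chi-squared test on two gene sets) can be sketched for a 2×2 table as follows. The text does not state exactly how the contingency table was built, so the layout below (membership in set A versus membership in set B over a common gene universe) is an assumption, and the numbers are hypothetical. The continuity correction mirrors the default behaviour of R's chisq.test for 2×2 tables.

```python
import math
from typing import List, Tuple


def overlap_table(n_universe: int, n_a: int, n_b: int,
                  n_overlap: int) -> List[List[int]]:
    """2x2 table: rows = in set A / not in A, columns = in set B / not in B."""
    return [
        [n_overlap, n_a - n_overlap],
        [n_b - n_overlap, n_universe - n_a - n_b + n_overlap],
    ]


def chi2_2x2(table: List[List[int]], yates: bool = True) -> Tuple[float, float]:
    """Pearson chi-squared statistic and P value (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n
            diff = abs(obs - exp)
            if yates:
                diff = max(0.0, diff - 0.5)  # Yates continuity correction
            stat += diff * diff / exp
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical example: 1000 genes, set A = 100, set B = 50, overlap = 20.
enriched = chi2_2x2(overlap_table(1000, 100, 50, 20))
# Overlap equal to the expectation under independence (100 * 50 / 1000 = 5).
independent = chi2_2x2(overlap_table(1000, 100, 50, 5))
```

An overlap far above the expected count gives a small P value, whereas an overlap equal to the expected count gives a statistic of zero.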

2.4. Enrichment Analysis

Each cluster of coregulated genes was explored for enrichment of genes in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (version 6.07.2010). After retrieving the number of coregulated genes in each pathway, the P value was computed by the R function fisher.test, and afterwards FDR correction [26] was applied by the R function p.adjust with method = “fdr”. For the pathway analysis, we report the P value and FDR value. Additionally, we analyzed the enrichment of the modules in the corresponding cluster of coregulated genes in contrast to the whole set of human promoters. After searching for each module in human promoters, once again the P value computation was performed, and afterwards FDR correction [26] was applied by the R function p.adjust with method = “fdr”. The results are indicated to be significant if the FDR value is equal to or below 0.05.
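
The two steps of the enrichment analysis, an exact test per pathway followed by FDR correction, can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses a one-sided hypergeometric tail (over-representation only), whereas R's fisher.test is two-sided by default, and the function names and example values are hypothetical. The adjustment follows the Benjamini-Hochberg procedure, which is what method = “fdr” selects in p.adjust.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) for a cluster of n genes drawn from N genes,
    of which K belong to the pathway."""
    top = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return hits / comb(N, n)


def bh_adjust(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 of 3 cluster genes fall into a 3-gene pathway
# in a universe of 10 genes.
p_single = hypergeom_upper_tail(3, 3, 3, 10)

# Hypothetical raw P values for four pathways.
fdr = bh_adjust([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in fdr]
```

The last line applies the same threshold as in the text (FDR equal to or below 0.05).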

2.5. Literature Mining

A literature search in PubMed was done to extract information about the target genes of the identified TF modules and their relation to AD. Target genes are genes which are regulated by TF modules. To verify the modules, searches were performed for TFBS-target gene interactions in all PubMed abstracts with the help of two text mining programs, Pathway Studio 7.1 (Ariadne Genomics) and EXCERBT (MIPS, Helmholtz Zentrum München; http://tinyurl.com/excerbt/) [28], based on natural language processing (NLP) technology. Additionally, information about the TFs of the modules was collected from the BIOBASE Biological Databases (Wolfenbüttel/Germany): TRANSPATH. The Mouse Genome Informatics (MGI) database (Mouse Genome Database, The Jackson Laboratory, Bar Harbor, Maine; http://www.informatics.jax.org/) [29] was searched for the tissue expression of genes, and gene ontology was used for the functional annotation and classification of the target genes [30].

3. Results and Discussion

3.1. Workflow

The workflow (Figure 1) consists of two different approaches starting at different points, but resulting in similar modules in the end.

Figure 1

Workflow of bioinformatics analysis of promoter sequences and gene expression data to identify modules of TFBSs in AD-related genes. The workflow is divided into two parts, the first and second approach. The coloured boxes describe the methods which were used. The yellow boxes represent tools of the Genomatix Software (1: DiAlignTF, 2: FrameWorker, 3: ModelInspector), and the blue boxes indicate multivariate methods or filtering by Illumina detection score. The beginning and the end of the arrow specify input and output of the methods, respectively. The grey arrow denotes the comparison of the target genes of the modules with genes differentially regulated in microarray analyses. The scheme at the end of the second approach indicates a module composed of three TFBSs (blue, red, and green), which is common to three promoter sequences with transcription start site at the red arrow.

The first approach starts with an alignment between human and mouse APP promoter sequences to search for conserved TFBSs. The result yielded 22 conserved TFBSs (data not shown). Some of them have previously been verified. For example, the transcription factors CTCF and AP1F (including JUN) bind to APP [31, 32], and some are expressed in the hippocampus according to MGI database. In the next step of the approach, the conserved TFBSs and the promoter sequences of genes involved in Aβ formation (APP, BACE1, PS1/2, PEN-2, APH1A, and NCSTN) and the homolog of BACE1, BACE2, were used as input to search for all modules of TFBSs, which occur in a multiple alignment of a subset of AD key genes. With this analysis, we got 17 modules composed of two or more TFBSs families (Table 1). Modules regulating key genes of AD are thought to play a role in AD. The three microarray datasets should verify these modules by searching for significantly regulated genes on the microarray. We applied mSVM-RFE, a frequently used tool, because it is very accurate and fast in classification and has a low error rate [22]. RFE is applied due to the large number of the gene expression values on the microarray and helps to reduce the search space and avoids overfitting. Each of the TFBSs families is represented by several TFs (Table 2). After searching for these 17 modules in all human promoters, we limited the result to genes, which have been tested for genetic association with AD according to the AlzGene database (http://www.alzgene.org/; Version: 12.05.2010) [33] in order to get only those target genes possibly involved in AD. AlzGene database is a regularly updated aggregation of all published genetic association studies including GWAS (genome-wide association studies) performed on AD phenotypes. It is an important resource for AD and contains all considerable genetic association studies and key genes of AD. We detected these 17 modules in the promoters of 369 putative AD-risk genes.

Table 1

Modules identified by the first approach: 17 modules composed of two or more TFBSs families. The TFBSs families consist of several TFs (Table 2). The second column specifies the key genes of AD the TFs of the module bind to, according to the search of the module in all human promoters by ModelInspector. Human and mouse APPs are indicated by hAPP and mApp, respectively.

Module	AD key genes—targets of module
CTCF-E2FF-SP1F	hAPP, mApp, BACE1, NCSTN, APH1A
CTCF-SP1F	hAPP, mApp, BACE1, PS2, NCSTN, APH1A
E2FF-E2FF-EGRF	hAPP, mApp, BACE1, BACE2, PEN-2, APH1A
CTCF-E2FF-EGRF	hAPP, BACE2, PEN-2, NCSTN, APH1A
CTCF-E2FF-EGRF	hAPP, mApp, BACE1, BACE2, PEN-2, APH1A
CTCF-HAND-SP1F	hAPP, mApp, BACE2, PS2
CTCF-SP1F-SP1F	hAPP, mApp, BACE1, BACE2
CTCF-NRF1-SP1F	hAPP, mApp, BACE1, APH1A
CTCF-EGRF-NRF1	hAPP, BACE2, PS2, PEN-2
CTCF-SP1F-ZBPF	hAPP, mApp, BACE2
CTCF-EGRF-ZBPF	hAPP, BACE2, PS2
CTCF-NRF1	hAPP, mApp, BACE1, BACE2, PEN-2
CTCF-EGRF-SP1F	hAPP, BACE2, PS2
NRF1-ZBPF	hAPP, mApp, BACE2, PEN-2, APH1A
CTCF-EGRF	hAPP, mApp, BACE1, BACE2, PS1, PEN-2, APH1A
CTCF-E2FF	hAPP, mApp, BACE1, BACE2, NCSTN, APH1A
SP1F-ZBPF-ZBPF	BACE1, PS1, PEN-2, APH1A

Table 2

Composition of the TF families of the modules. Each TF family consists of several transcription factors (TFs). Additional information about description and binding domains of the families is given in the second and fourth column, respectively.

TF family	Description	TFs	Binding domains
CTCF	CTCF and BORIS gene family, transcriptional regulators with 11 highly conserved zinc finger domains	CTCF, CTCFL	C2H2 zinc finger domain
E2FF	E2F-myc activator/cell cycle regulator	E2F1, E2F2, E2F3, E2F4, E2F5, E2F6, E2F7, E2F8, TFDP1, TFDP2, TFDP3	E2F winged helix
EGRF	EGR/nerve growth factor-induced protein C and related factors	EGR1, EGR2, EGR3, EGR4, WT1, ZBTB7A, ZBTB7B	C2H2 zinc finger domain
HAND	Twist subfamily of class B bHLH transcription factors	HAND1, HAND2, LYL1, MESP1, MESP2, NHLH1, NHLH2, SCXA, SCXB, TAL1, TAL2, TCF12, TCF15, TCF3, TWIST1, TWIST2	bHLH
KLFS	Krueppel-like transcription factors	KLF1, KLF2, KLF3, KLF4, KLF6, KLF7, KLF8, KLF9, KLF12, KLF13, KLF15	—
NRF1	Nuclear respiratory factor 1	NRF1	bZIP
SP1F	GC-Box factors SP1/GC	KLF10, KLF11, KLF16, KLF5, SP1, SP2, SP3, SP4, SP5, SP6, SP7, SP8	C2H2 zinc finger domain
ZBPF	Zinc binding protein factors	ZKSCAN3, ZNF148, ZNF202, ZNF219, ZNF281, ZNF300	C2H2 zinc finger domain

Subsequently, the 369 putative AD-risk genes were compared to the mSVM-RFE output for the microarray datasets. For the AD patients dataset, which comprises AD patients at different stages of severity and controls, mSVM-RFE yielded 948 genes, 31 of which overlap with the putative AD-risk genes. A chi-squared test gave a P value of 0.0001182, which shows that 31 is a significantly high number of overlapping genes between the two gene sets compared to the number of genes in the whole dataset. Additionally, we took a second dataset from a transgenic mouse model of AD to affirm the results of the AD patients dataset. The probes of the dataset are expressed in the brain, and the amount of plaques in the mice is controlled by the active as well as the dominant-negative form of α-secretase ADAM10 in order to imitate the situation of plaque formation in AD brain. The three mouse lines show different amounts of plaques in the brain. The fewest plaques are found in the brains of ADAM10/APP mice, an intermediate amount occurs in the brains of monotransgenic APP control mice, and the most plaque formation appears in dnADAM10/APP mice. The reason for the different plaque formation is that neurotoxic Aβ peptide levels are increased and neuroprotective sAPPα is drastically decreased in dnADAM10/APP mice, whereas in ADAM10/APP mice the levels of these APP fragments change in the opposite direction. The different mouse lines show different stages of plaque formation just like AD patients at different stages of severity [34]. Thus, this AD mouse model and its microarray dataset are appropriate to be included in this analysis to verify modules [20]. The double-transgenic mice dataset was reduced by mSVM-RFE to 878 genes with an overlap of 26 with the 369 AD-related genes. A chi-squared test with P value = 4.21 × 10^-12 shows that 26 is a significantly high number of overlapping genes compared to the whole number of genes on the array.
The third dataset was not filtered by mSVM-RFE, since it is already reduced by the Illumina detection score and therefore consists of only 8650 normalized expression values, which is a suitable number for biclustering. Comparing the 8457 genes of the GSE15222 dataset with the 369 putative AD-risk genes, we got an overlap of 199 genes between these two genesets. The P value = 8.99 × 10^-15 of the chi-squared test indicates a significant number of overlapping genes. The starting point of the second approach is the three microarray datasets established from AD patients at three different stages of severity [12], double-transgenic ADAM10/APP, dominant-negative ADAM10/APP as well as APP control mice [20], and AD patients with late-onset AD (LOAD) [21]. Genes that are not differentially regulated are excluded by mSVM-RFE, and regulated genes of these microarrays may potentially play a role in AD. By mSVM-RFE, we reduced the dataset of the AD patients at three different stages of severity from 22283 probesets to 4844 probesets, and biclustering of these 4844 probesets gave five and eight clusters of coregulated genes in two biclustering runs with the same parameter setting. The double-transgenic mice dataset was reduced by mSVM-RFE from 45101 to 5198 probesets, and after biclustering, we obtained 13 clusters of coregulated genes. The third dataset of LOAD patients is already reduced by the Illumina detection score to 8650 probesets, and therefore, we did not apply mSVM-RFE. By biclustering, we got 18 clusters of coregulated genes. By grouping the regulated genes into clusters of coregulated genes and searching for modules in the promoters of these coregulated genes, we got modules that are possibly responsible for the common regulation of these genes and may also play an important role in the modification of AD. In the end, we obtained several modules for each cluster of coregulated genes.
The target genes of three selected modules, which are described in more detail in section 3.3, 3.4, and 3.5, are listed in the Supplementary Tables 1–8 (see Tables 1–8 in Supplementary Material available online at doi: 10.4061/2011/154325). Target genes are activated or repressed by TF modules and have corresponding TFBSs in their promoter sequences. After the whole analysis composed of the first and second approach, we got four different sets of TFBSs modules. We obtained one set of 17 TFBSs modules from the first approach and three sets (one set for each microarray study) of on average five different TFBSs modules per cluster from the second approach. We compared these four sets with regard to similar modules and found two modules in common: CTCF-EGRF-SP1F as well as CTCF-SP1F-ZBPF (Figure 2). According to the first module, the target genes VAPA and EIF5 overlap between AD patients and double-transgenic mice dataset, and the target genes REEP5 and SYP overlap between double-transgenic mice and LOAD patients dataset. The overlapping target gene ADD3 between AD and LOAD patients dataset of module CTCF-SP1F-ZBPF is also target gene of the module KLFS-SP1F-ZBPF, which is common to the three microarray datasets in the second approach. The third module has additionally overlapping target genes between AD and LOAD patients dataset: CLU and NUCKS1.

Figure 2

Relations of predicted target genes of three TFBSs modules. This picture summarizes important target genes of the modules, the relation of target genes to KEGG pathways playing a role in AD (blue rectangle), and the relation of the target genes to some AD key genes (red pentagon). The target genes are coloured according to their membership to microarray studies, and some target genes with two colours are derived from analysis of two different microarray studies. The grey arrows are the predicted regulations of the target genes by the modules (orange rectangle), and the black lines indicate that the target gene is part of the corresponding KEGG pathway. Additionally, three different relations of the target genes to AD key genes are shown by purple, green, and blue lines, which indicate protein-protein binding, protein modification, and regulation, respectively.

In general, TFBSs frequently occurring in modules of both approaches are CTCF, EGRF, SP1F, and ZBPF, but the composition of the TFBSs for a module is slightly different between the first and second approach. The modules of the second approach mostly contain one TFBS, which is not conserved between human and mouse APP, and therefore these modules could not be found by the first approach focussing on conserved TFBSs. Moreover, the starting points of the two approaches are different, because in the first approach we include conservation and in the second approach not. This leads to slightly different modules between the two approaches, but in the end, we have two modules in common. Although TFBSs can occur almost anywhere in the promoter and do not show any pattern with respect to location, TFBSs are often grouped together and such functional modules have been described in many cases. The arrangement of TFBSs of a promoter module seems to be much more restricted than the variety and distribution of TFBSs in the promoter sequence [35]. However, TFBSs modules found in the promoter of genes suggest functional connection but do not prove it [36]. The common regulation of genes from different gene classes can depend on regulatory modules of TFBSs, which are often conserved regions in the promoter sequences and therefore can be identified, while other TFBSs are nonconserved between the promoters [37, 38]. This workflow with the combination of multivariate analysis methods and in silico promoter analysis based on the hypothesis that the key genes of AD are coregulated is a new approach and different to already existing ones to find possible transcriptional regulations addressing AD pathogenesis. The confirmation of this hypothesis would offer better possibilities to develop new strategies for the therapeutic treatment of AD. 
In this study, commercial software has been used as part of the bioinformatics workflow, but comparable open-source software and databases can be used as well. VISTA (http://genome.lbl.gov/vista/index.shtml) and JASPAR (http://jaspar.genereg.net/) can be used for the analysis of TFBSs. For literature mining, STRING (EMBL, http://string-db.org/) and EXCERBT (MIPS, http://tinyurl.com/excerbt/) are available, and for finding tissue expression or binding partners of the genes, the GeneCards database (Weizmann Institute of Science, http://www.genecards.org/) is one alternative.

3.2. Modules and Confirmations of TFBSs and AD-Risk Genes

The first common module is composed of the binding sites of three TF families, CTCF, EGRF, and SP1F, and the second module consists of CTCF, SP1F, and ZBPF, all of which are conserved between human and mouse APP promoter sequences. TFs representing both modules are predicted to bind to the promoter sequences of a significantly high number of target genes in the corresponding cluster of the AD patients dataset (FDR (CTCF-EGRF-SP1F) = 0.0003; FDR (CTCF-SP1F-ZBPF) = 0.0003), of the transgenic mice dataset (FDR (CTCF-EGRF-SP1F) = 4.2 × 10^-7; FDR (CTCF-SP1F-ZBPF) = 1.3 × 10^-12), and of the LOAD patients dataset (FDR (CTCF-EGRF-SP1F) = 0.0139; FDR (CTCF-SP1F-ZBPF) = 0.0139), compared to the incidence in the whole set of human promoters. The expression profiles of the coregulated genes from the three datasets in each module show similar expression patterns among a subset of microarray samples (Figure 3 and supplementary Figure 1). Additionally, a third significant module was detected in a set of coregulated genes of the AD patients dataset (Figure 3(b)) and the LOAD patients dataset (Figure 3(c)). The corresponding motif of this module consists of KLFS, SP1F, and ZBPF binding sites and occurs in the promoter sequences of several interesting genes, in particular Clusterin (CLU/APOJ), which according to the AlzGene database is the gene second most strongly associated with AD [39]. The enrichment of the module in the cluster of coregulated genes from the AD patients dataset (FDR (KLFS-SP1F-ZBPF) = 0.0003) and the LOAD patients dataset (FDR (KLFS-SP1F-ZBPF) = 0.0139) is significant compared to the occurrence in all human promoters.

Figure 3

Expression profiles of five different clusters of coregulated genes. On the x-axis, the sample IDs (specified by accession numbers of GEO/NCBI) incorporated in the cluster are given, and y-axis indicates values of expression. One gene corresponds to a single line in the profile, and the target genes of the modules as mentioned in the text are coloured. The two upper profiles (a, b) are clusters of coregulated genes of the AD patients dataset, the profile in the middle (c) corresponds to coregulated genes of the LOAD patients dataset, and the two profiles below (d, e) correspond to coregulated genes of the double-transgenic mice dataset. The target genes of the profiles (a) and (d) were used for the establishment of the module CTCF-EGRF-SP1F and the profiles (b) and (e) for the module CTCF-SP1F-ZBPF, at which (b) was also used for the module KLFS-SP1F-ZBPF. The five lines in profile (e) correspond to genes, which are involved in the MAPK signaling pathway. The target genes of the profile (c) were used for the establishment of the modules CTCF-SP1F-ZBPF and KLFS-SP1F-ZBPF.

Already known regulations of AD key genes by the transcription factors CTCF, EGR1, SP1, and ZNF202 confirm these modules. EGR1 is known to upregulate Presenilin2 [40]; CTCF binds to the APBβ domain, a nuclear factor binding site in the proximal APP promoter, and acts as a transcriptional activator of the APP gene promoter [31]; and APOE is repressed by ZNF202 (ZBPF) according to the TRANSPATH database. SP1 can regulate both the BACE1 and BACE2 genes [41], activates the transcription of PS1 [42], and upregulates APP gene expression [43]. In addition, we found SP1 to be significantly regulated on the LOAD patients microarrays (FDR = 3.4 × 10⁻⁹), which agrees with reports of a dysregulation of this transcription factor in AD [44, 45]. While most of the transcription factors of all families in the resulting modules play a role in apoptosis, the transcription factor families have additional, differing main functions according to the TRANSPATH database. The CTCF zinc finger proteins are involved in chromatin remodelling, the early growth response transcription factors (EGRFs) in learning and memory and brain development, the GC-Box factors (SP1Fs) in chromatin silencing as well as embryonic development, the zinc binding protein factors (ZBPFs) in lipid metabolism, and the Krueppel-like transcription factors (KLFSs) in nervous system development and response to stress. Most of the TFs of these families are expressed in whole brain, hippocampus, or cortex. To evaluate the importance of the detected modules, we incorporated information from the AlzGene database. Some target genes of the modules are already listed in the AlzGene database as associated with AD, such as Gsk3b and GOT1 (CTCF-EGRF-SP1F), Col25a1, Il33, and Tanc2 (CTCF-SP1F-ZBPF), and CLU (KLFS-SP1F-ZBPF).

3.3. First Module CTCF-EGRF-SP1F

Literature mining revealed known relations between the target genes of the first module and AD (supplementary Tables 1, 2, and 3), confirming this module. Long-term depression and calcium signaling are pathways connected to AD: upregulation of calcium signaling leads to amyloid metabolism with neuronal cell apoptosis and to an enhancement of long-term depression, which in turn leads to loss of memory initiated by long-term potentiation [46]. GNAS is part of both pathways, which are significantly overrepresented in the cluster of the coregulated genes of the AD patients dataset (P-value = 0.0404, FDR = 0.0952 (human long-term depression); P-value = 0.0476, FDR = 0.0952 (human calcium signaling)). The study by Zhang et al. shows that the gene expression of APP and GNAS is significantly upregulated in patients with endogenous depression [47]. Interestingly, clinically significant depression develops in at least 40% of all demented patients, and depressive symptoms such as significant loss of appetite, insomnia, and fatigue commonly occur in the course of Alzheimer's disease [48]. Another pathway related to AD is the mitogen-activated protein kinase (MAPK) signaling pathway, which is involved in the Aβ-induced production of proinflammatory cytokines in the hippocampus and is a potential target for future therapeutics in AD [49]. Cacna2d1, a target gene of the double-transgenic mice dataset, is involved in MAPK signaling in the mouse, and this pathway is significantly overrepresented among the coregulated genes of the double-transgenic mice dataset (P-value = 0.0024, FDR = 0.0121). The target gene Gsk3b is involved in the AD pathway and in Wnt signaling (enrichment analysis: P-value = 0.0021, FDR = 0.0121 (mouse AD pathway); P-value = 0.0206, FDR = 0.0497 (mouse Wnt signaling)) and regulates the APP accumulation after Aβ formation [50]. 
Another target gene, Ppp3cb, is also involved in AD and MAPK signaling in the mouse according to KEGG pathways, which are significantly enriched as mentioned before. Comparing the target genes of this module from the AD patients and double-transgenic mice microarray studies, two overlapping genes are found, providing further evidence for the functionality of the module. Additional literature search and text mining reveal relations of these genes to AD. The target gene EIF5 (eukaryotic translation initiation factor 5) is not mentioned in the AlzGene database, but in the EIF2 regulation pathway it is downstream of its family member EIF2AK2, which is listed in the AlzGene database. Another member of the gene family, EIF2alpha, is also linked to AD. The phosphorylation of EIF2alpha leads to termination of global protein translation and induces apoptosis. In addition, degenerating neurons in AD brain show high immunoreactivity for phosphorylated EIF2alpha, indicating that phosphorylation of EIF2alpha is associated with the degeneration of neurons in AD [51]. Additionally, GSK3B, which is involved in the AD pathway, EIF5, EIF2AK2, and EIF2alpha play a role in the regulation of EIF2 according to BioCarta (http://www.biocarta.com/pathfiles/h_eif2Pathway.asp#). The second overlapping target gene, VAPA, a vesicle-associated membrane protein, interacts with its family member VAPB through the transmembrane domain [52], and both are reduced in patients with amyotrophic lateral sclerosis (ALS), another neurodegenerative disease. Additionally, both genes interact with lipid binding proteins; VAPA in particular is involved in lipid export and neurite outgrowth. The mutation VAPB-P56S, which forms stable aggregates that are continuous with the endoplasmic reticulum (ER) and mitochondria and impairs normal VAP function, may result in abnormal lipid transport and biosynthesis and induce slow degeneration of neurons [53]. Approximately 30% of ALS patients with dementia have AD [54]. 
Furthermore, two overlapping target genes from the double-transgenic mice and LOAD patients datasets were found: SYP and REEP5. SYP, a synaptic vesicle marker, was colocalized with the reactivity of APP and PS1, two key genes of AD. SYP is also localized in the synaptosomal vesicles, where an association of N- and C-terminal PS1 fragments and APP was detected as well [55]. Other diseases associated with SYP are schizophrenia, ALS, and dementia, and SYP is expressed in hippocampus and cortex according to the TRANSPATH database. The promoter region of SYP contains four SP1 binding sites located within 100 bp of the transcription start point [56]. An abnormally elevated SYP level in the frontal cortex and hippocampal molecular layer exists in old mice lacking BACE1. Studies demonstrate that the absence of BACE1 eliminates plaque pathology. Additionally, SYP deficits correlate with levels of soluble Aβ, and the loss of the associated presynaptic protein SYP is a key pathological feature of AD [57]. The second overlapping gene, REEP5 (receptor accessory protein 5), is expressed in the brain and central nervous system according to the MGI database and induces apoptosis according to gene ontology. A common characteristic in the brains of patients suffering from neurodegenerative diseases like AD is massive neuronal death due to apoptosis, and apoptotic cell death has been found in neurons and glial cells in AD [58]. Several studies have shown a direct effect of REEP5 on shaping ER tubules and propose that this protein is involved in the stabilization of highly curved ER membrane tubules. The peripheral ER consists of a network of membrane tubules, and in the study by Voeltz et al. the IP3 (inositol trisphosphate) receptor was a candidate for involvement in ER network formation as well as in the rapid Ca²⁺ efflux that correlates with ER network formation [59]. Moreover, there are indications that ER stress is involved in AD pathogenesis. 
The ER can release stored Ca²⁺ through ER membrane receptor channels such as the IP3 receptor, and some findings suggest that perturbed ER Ca²⁺ homeostasis contributes to the dysfunction and degeneration of neurons that occur in AD [60]. In conclusion, the literature search revealed several target genes with a relation to AD, such as SYP and Gsk3b, strengthening the assumption that this module may be part of neurodegenerative processes as observed in AD.
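The pathway results above report an FDR next to each raw P-value. As a hedged illustration of how such adjusted values can arise, the sketch below applies the Benjamini-Hochberg step-up procedure; the correction method is not restated in this section, so its use here is an assumption, and the P-values in the example are illustrative rather than taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative P-values for four hypothetical pathways.
raw = [0.01, 0.04, 0.03, 0.5]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"P = {p:.2f} -> FDR = {q:.4f}")
```

Because the adjustment depends on the total number of pathways tested, two different raw P-values can receive the same adjusted value, which is consistent with the identical FDRs reported for some pathway pairs above.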

3.4. Second Module CTCF-SP1F-ZBPF

The AD and LOAD patients target genes of the second module (supplementary Tables 4, 5) partially overlap with the AD and LOAD patients target genes of the third module KLFS-SP1F-ZBPF (supplementary Tables 6, 7), respectively. Additional hints for the importance of the target genes putatively activated or repressed by a module were also found for the double-transgenic mice target genes of the second module (supplementary Table 8). For example, the C-terminal binding protein CTBP2 (a target gene of the AD patients dataset) binds ZNF219, a member of the ZBPF TF family, in vitro [61]. Verification of the second module using the double-transgenic mice target genes showed that Cdc42 regulates synapse formation in neurons and that Cdc42-GTPase activity is increased in neurons stimulated with Aβ1–42. Additionally, Cdc42 is upregulated in neuronal populations of AD brains in comparison to controls [62] and is involved in the mouse MAPK signaling pathway (enrichment analysis: P-value = 0.0010, FDR = 0.0242). The MAPK pathway, which is significantly enriched in the coregulated genes of the double-transgenic mice, incorporates the target genes Map2k4, Mapk1, Mapk9, and Ppm1a of the second module (Figure 3(e)). Further, PS1, an AD key gene, inhibits Map2k4 activity [63]. The phosphorylation of Mapk1, an extracellular signal-regulated kinase (Erk), is activated by Aβ, and this Erk signal is involved in the repression of L-glutamate uptake in astrocytes, possibly defending against neurodegeneration in the pathogenesis of AD [64]. Additionally, the activation of Erk cascades through EGF yields increased CTCF expression, showing that CTCF is a downstream target of Erk cascades [65]. The ubiquitously expressed Mapk9 (JNK2) phosphorylates APP and amyloid precursor-like protein 2 (APLP2) in response to cellular stress. This phosphorylation is involved in neural functions and AD pathogenesis [66]. Some studies indicate that Mapk9 possibly interacts with SP1 and that the JNK pathway targets SP1 [67]. 
The gene ADD3 (adducin 3 (gamma)) is shared between the target genes of the AD and LOAD patients datasets. According to the TRANSPATH database, it is known to be upregulated in ALS and to play a role in apoptosis, which leads to neuronal death in AD. ADD3 belongs to the adducin family of proteins, which is involved in postsynaptic changes in the actin cytoskeleton that occur as a result of synaptic activation. A study from 2005 suggests that adducin is involved in setting synaptic strength, as well as in the synaptic plasticity underlying learning and memory [68, 69]. Memory loss is the cardinal and one of the earliest clinical manifestations of AD, and studies suggest that aging promotes the formation of soluble Aβ assemblies that mediate negative effects on memory [70]. In summary, five target genes (Cdc42, Map2k4, Mapk1, Mapk9, and Ppm1a) of the module are involved in MAPK signaling, confirming the relation of the module to AD, and a possible new gene involved in AD, ADD3, was identified.

3.5. Additional Information of the Third Module KLFS-SP1F-ZBPF

One TF of the third module, KLF3 (BKLF), a member of the KLFS TF family, binds the AD patients target gene CTBP2 in vivo [71], which is involved in the Wnt signaling pathway. In our study, the human Wnt pathway genes are significantly enriched in the cluster of coregulated genes (P-value = 0.0149, FDR = 0.0464), and the Wnt pathway has been found to prevent neurodegenerative diseases like AD by inhibiting Aβ-dependent cytotoxic effects [72]. Comparing the AD and LOAD patients target genes of the third module, three genes were found in common: ADD3, CLU, and NUCKS1. ADD3 is also shared with the second module described above, and its relation to AD through learning and memory is already known in the literature. Literature support also exists for the target gene CLU, which is induced by KLF4 overexpression [73] and contains binding sites for SP1 [74]. Genome-wide association studies in AD have implicated CLU in the development of AD by finding a strong association for an intronic single nucleotide polymorphism. Biologically, CLU seems to be involved in the pathogenesis of AD by interacting with different molecules such as lipids or amyloid proteins; in addition, the level of CLU mRNA is significantly higher in brains of AD patients than in control brains [39]. Furthermore, APP/PS1 transgenic mice showed increased plasma CLU, an age-dependent increase in brain CLU, and colocalization of amyloid and CLU in plaques [75]. Another disease associated with CLU is ALS, and according to gene ontology, CLU, which has increased protein levels in frontal cortex and hippocampus in AD [76], is involved in lipid transporter activity, apoptosis, and neuron development. The third overlapping target gene, NUCKS1 (nuclear casein kinase and cyclin-dependent kinase substrate 1), is known to be strongly associated with Parkinson's disease (PD) [77], another neurodegenerative disease. A study in 2003 by Wilson et al. 
showed that older persons in whom the classical symptoms of PD progress are eight times as likely to develop AD as well. The results of this study suggest a strong link between progressive motor impairment and the development of AD. Interestingly, pathologic findings such as Lewy bodies are similar in PD and dementia, and perhaps there is a connection in how the two diseases progress over time [78]. Furthermore, NUCKS1 may play a role in cell proliferation [79]. The proliferation of neural progenitor cells is reduced in a transgenic mouse model of AD carrying a mutated form of amyloid precursor protein that causes early-onset familial AD, and it was shown that Aβ can affect the proliferation of neural progenitor cells [80]. Taken together, all relations described here between the target genes of the modules and AD key genes and KEGG pathways are shown in Figure 2. The three modules composed of TFs, which are mainly expressed in brain and partially associated with AD, are connected to each other by common target genes or KEGG pathways playing a role in AD. Some target genes have known relations to AD key genes, further supporting the relation of the modules and their target genes to AD. It is promising that three microarray studies, which differ in platform, in the stage of AD in patients, and in the amount of plaques in an AD mouse model, ultimately have two modules in common. Additionally, one module common to two microarray studies includes the target gene CLU, which is the gene second most strongly associated with AD according to the AlzGene database. Although the datasets differ, we ultimately obtained significantly regulated target genes of the modules, with some even shared between datasets of different species and brain tissues, again supporting the significance of the three modules for AD.
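The overlaps between target gene lists discussed in these sections amount to set intersections. The sketch below is illustrative only: the helper `shared_targets` is hypothetical, the lists are partial toy lists built from gene symbols named in the text rather than the full supplementary tables, and matching mouse and human symbols by upper-casing is a simplifying assumption that a real analysis would replace with an ortholog table.

```python
def shared_targets(*gene_lists):
    """Return the sorted gene symbols present in every list.

    Symbols are upper-cased so that mouse-style (Gsk3b) and human-style
    (GSK3B) spellings match; this is a simplification, not true ortholog
    mapping. At least one list must be given.
    """
    normalized = [{gene.upper() for gene in genes} for genes in gene_lists]
    return sorted(set.intersection(*normalized))


# Partial toy lists for the third module (KLFS-SP1F-ZBPF).
ad_targets = ["ADD3", "CLU", "NUCKS1", "CTBP2"]
load_targets = ["NUCKS1", "ADD3", "CLU"]

print(shared_targets(ad_targets, load_targets))
# prints ['ADD3', 'CLU', 'NUCKS1']
```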

4. Conclusion

As described above, many of the genes that we identified in our bioinformatics analysis have links to the molecular mechanisms of AD. In particular, the binding sites of the TF families CTCF, EGRF, KLFS, SP1F, and ZBPF are proposed for further investigation and could provide potential targets for therapeutic treatment of AD and other neurodegenerative diseases. Already known regulations of target genes by transcription factors, for example of CLU, which is linked to AD, confirm these modules. Additionally, several target genes such as ADD3, which are not yet described as AD-related genes, are possibly involved in AD pathogenesis. The most likely candidate genes are ADD3, CLU, EIF5, NUCKS1, REEP5, SYP, and VAPA, which are derived from the analysis of two different microarray studies. A next step is the experimental validation of TFBS-target interactions to verify the modules. First, the differential regulation of candidate genes from AD mouse models can be validated by qRT-PCR. We also suggest a knockout (or knockdown) of single TFs or combinations of TFs in mice, followed by qRT-PCR measurement of the expression of the best downstream candidate genes in comparison to wildtype mice. Another possible experiment would be screening for SNPs in the sequences of the candidate genes in DNA from postmortem AD patients.
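For the proposed qRT-PCR validation, relative expression between knockout and wildtype animals is commonly summarized with the 2^(−ΔΔCt) calculation. The sketch below assumes that method, roughly 100% amplification efficiency, and a single reference gene; the function name and the Ct values are hypothetical and not taken from the study.

```python
def fold_change_ddct(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression (test vs. control) by the 2^(-ddCt) method.

    Each Ct is the threshold cycle of the target or reference gene in the
    test sample (e.g. TF knockout) or the control sample (e.g. wildtype).
    Assumes the amplicon roughly doubles every cycle.
    """
    delta_test = ct_target_test - ct_ref_test  # normalize test sample
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl  # normalize control sample
    return 2.0 ** -(delta_test - delta_ctrl)


# Hypothetical Ct values: the target crosses threshold two cycles earlier
# in the test sample after normalization, i.e. a fourfold increase.
print(fold_change_ddct(24.0, 18.0, 26.0, 18.0))  # prints 4.0
```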
