
Linking disease-associated genes to regulatory networks via promoter organization.

S Döhr, A Klingenhoff, H Maier, M Hrabé de Angelis, T Werner, R Schneider.

Abstract

Pathway- or disease-associated genes may participate in more than one transcriptional co-regulation network. Such gene groups can be readily obtained by literature analysis or by high-throughput techniques such as microarrays or protein-interaction mapping. We developed a strategy that defines regulatory networks by in silico promoter analysis, finding potentially co-regulated subgroups without a priori knowledge. Pairs of transcription factor binding sites conserved in orthologous genes (vertically) as well as in promoter sequences of co-regulated genes (horizontally) were used as seeds for the development of promoter models representing potential co-regulation. This approach was applied to a Maturity Onset Diabetes of the Young (MODY)-associated gene list, which yielded two models connecting functionally interacting genes within MODY-related insulin/glucose signaling pathways. Additional genes functionally connected to our initial gene list were identified by database searches with these promoter models. Thus, data-driven in silico promoter analysis allowed integrating molecular mechanisms with biological functions of the cell.

Substances:
Transcription Factors

Year: 2005 PMID: 15701758 PMCID: PMC549397 DOI: 10.1093/nar/gki230

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The completion of several whole-genome sequencing projects has provided extensive lists of genes (DNA), RNAs and proteins of mammalian organisms (1–3). However, it quickly became evident that the complexity of higher organisms cannot be explained solely by the number of parts, but mainly arises from more sophisticated interactions and networks of the DNAs, RNAs and proteins (4). This triggered a new focus towards the analysis of gene groups, their products and their network interactions (e.g. signaling and metabolic networks), which is now defined as the ultimate goal of systems biology (5,6). Part of that effort is the elucidation of transcriptional co-regulation networks, which can be seen as one of the most important levels at which network connections emerge (7,8). Considerable progress has been made in analysis of yeast regulatory networks from microarray experiments (9,10). However, those results cannot be generally transferred to the human system (11). Therefore, mammalian transcriptome analysis, which is a current focus of research (12,13), requires different strategies suitable for mammalian networks. A common theme to all analyses aiming at gene or gene product interactions is the definition of one or several interacting subsets associated by some evidence to a biological process, disease or condition. Such gene groups often are not well defined and contain several functionally distinct subgroups, which cannot be separated by conventional clustering methods (14). However, genes within such subgroups contributing to a particular biological pathway or process may be transcriptionally coupled to insure coordinated availability of the proteins. Transcription is primarily regulated by the binding of transcription factors to their specific binding sites in the promoter/enhancer of the genes (7). 
Therefore, one way to trace co-regulated transcription on the molecular level is by promoter analysis revealing shared organization of sets of transcription factor binding sites (referred to as frameworks hereafter). Such frameworks can be represented by computational models, which can be used to scan sequence databases for genes showing a similar promoter organization (15). Unfortunately, promoter sequence conservation is not general (15) and even conserved sequence regions, called phylogenetic footprints (16) are not directly associated with functional conservation. Each mammalian promoter represents a mixture of conserved frameworks (associated with different signaling responses of the same promoter) necessary to ensure correct timing and spatial distribution of expression during development as well as correct function in the adult stage. Therefore, separation of individual functions by phylogenetic promoter analysis without further information about the biological context is usually not possible. On the other hand, horizontally co-regulated promoters (different genes within one mammalian species) often also share arbitrary frameworks that cannot be distinguished from those associated with the observed co-regulation. We have designed a completely new strategy that combines phylogenetic analysis (inter-species analysis) with cross-gene analysis within one species (intra-species analysis) to identify single process-associated frameworks, overcoming the functional ambiguities of the individual approaches. We demonstrate on an example of a disease-related gene list that in silico promoter analysis contributes to bridging the gap between molecular mechanisms and biological functions of the cell.

METHODS

Terminology

Framework: Two or more transcription factor binding sites (TFBSs) arranged in a defined order and orientation, with a defined distance range between adjacent TFBSs.

Model: Computational description of a framework, used for computer-assisted detection of framework occurrences in long DNA sequences.

Recall: Percentage of input gene promoters recognized by a model; 100% recall means that all input gene promoters are found.

Selectivity: The ratio of recall (in %) to the fraction (in %) of promoters from a large promoter database (control) matched by the model.

The step numbers below refer to the numbers in Figure 1.
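The definitions of recall and selectivity can be illustrated with a minimal Python sketch. The function names are illustrative and not part of any Genomatix tool; the worked example uses the values reported for model M5 in Table 2 (6 of 27 input promoters matched, 1.0% of GPD promoters matched).

```python
def recall_percent(matched_input: int, total_input: int) -> float:
    """Percentage of input gene promoters recognized by a model."""
    return 100.0 * matched_input / total_input


def selectivity(recall_pct: float, background_pct: float) -> float:
    """Ratio of recall (%) to the fraction (%) of database promoters matched."""
    if background_pct <= 0:
        raise ValueError("background_pct must be positive")
    return recall_pct / background_pct


# Worked example with the values reported for model M5 in Table 2:
# 6 of 27 input promoters matched, and 1.0% of the GPD promoters matched.
m5_recall = recall_percent(6, 27)
m5_selectivity = selectivity(round(m5_recall, 1), 1.0)
print(f"recall = {m5_recall:.1f}%, selectivity = {m5_selectivity:.1f}")
```

Under these definitions a model that matches the input list no more often than it matches the control database has a selectivity of about 1, so values well above 1 indicate enrichment in the input list.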

Figure 1

General strategy for problem-oriented promoter modeling. The bold numbers to the left of the short descriptions indicate the different steps of the strategy and correspond to the numbering used in Methods and Results. Step 2 indicates selection of orthologous promoters. Genes are symbolized by squares and the three species used are indicated (human, mouse, rat). Step 3 symbolizes the generation of models each containing two transcription factor binding sites (TFBSs) from orthologous promoter sets of individual genes obtained in Step 2. Horizontal optimization is done in Steps 4–6 across promoters from the initial problem-specific gene list (IPL). The links between promoter models and the functional association of genes in the cell is symbolized at the bottom (Step 8). For details of our application example, see Figure 4.

Literature analysis software (Step 1): Current literature data on subject-related gene expression, gene function and gene–disease relationships were collected with the programs LitMiner (GSF, H. Maier, S. Döhr, K. Grote, S. O'Keeffe, T. Werner, M. Hrabé de Angelis and R. Schneider, in preparation), BiblioSphere™ (Genomatix), GeneCards™ (17) and OMIM (18). LitMiner is a web-based resource developed by the GSF group. It generates ranked lists of genes associated with diseases and tissues from abstracts of scientific publications available from PubMed®.

Promoter extraction (Step 2): We extracted the promoter sequences from human, mouse and rat, where available, using the ‘Comparative Genomics’ task of the ElDorado™ database (Genomatix Suite–ElDorado™, release 3.0, Human Genome NCBI build 34, Mouse Genome MGSCv3, Rat Genome NCBI build 2). The promoter sequences used in this study are available as Supplementary Material.

Promoter selection and modeling (Steps 3–4): The DiAlign (19) task of GEMS Launcher was used for nucleotide sequence alignments to check overall promoter similarity for each orthologous promoter set. The GEMS Launcher task ‘FrameWorker’ was applied using the available weight matrix library (GEMS Launcher Version 3.0, matrix library vertebrate section, Matrix Family Library 4.0 containing 535 matrices in 253 families, Genomatix software, Munich).

Model optimization (Step 5): The FastM (20) task of GEMS Launcher was used to optimize models. ModelInspector (21) (a GEMS Launcher task) was used to search databases with the optimized models. Selectivity was determined against the Eukaryotic Promoter Database (EPD, release 76, >4000 promoters) (22) and against the human promoter database (Genomatix Promoter Database, GPD, Genomatix software, Munich, release 3.0, >50 000 promoters).

Model extension (Step 6): The FastM task of GEMS Launcher was used to extend models by manually adding TFBSs (identified by MatInspector (23) analysis) to existing models.

Database search with final models (Step 7): ModelInspector database searches in the GPD were carried out with the final models.

Functional association (Step 8): Additional information about connections between the genes from the initial list and candidate genes found by the model search was taken from BiblioSphere™ analyses (basis for Figure 4).

Figure 4

Functional association between the biological networks and promoter model-derived regulatory networks. The gray arc symbolizes the cell membrane. Dark gray symbols indicate gene products. Membrane receptors are shown inserted into the membrane (with symbolized ligand docking site outside the membrane); ion channels are shown as bipartite structures crossing the membrane; gray circles indicate intracellular proteins. The functional connections between the genes from the IPL were derived by BiblioSphere™ analysis and are indicated by gray arrows; ‘?’ indicates putative connections. M1,2,3 above the gene symbols indicates that models 1, 2 and 3 all match within the promoter of the respective gene. Shaded areas underlying the graphics indicate potential regulatory networks, which are linked by shared promoter models (regulatory network M1b and regulatory network M5a).

Default parameters were applied for the initial analyses in all programs, if not indicated otherwise.

RESULTS

Rationale of the strategy

Functional conservation of promoter organization is evident in two directions: vertically, in promoter sequences from orthologous genes (inter-species) and horizontally, in promoter sequences of co-regulated genes within one species (intra-species). Thus, selection of promoter substructures conserved vertically as well as horizontally should be best correlated with particular biological functions. The only prerequisites for this strategy are a list of genes associated with the biological or medical question to be analyzed, and that the underlying biological processes are evolutionarily conserved. This allows generation of promoter models based on combined conservation (vertical and horizontal). Ensuring tight association of models with the biological problem requires further optimization. We propose to use selectivity for this purpose because biologically meaningful models are expected to show better association with the problem-correlated gene promoters. This resulted in the following strategy (Steps 1–8; Figure 1).

Strategy

1. Problem-oriented gene selection: The first step is the identification of an Initial Problem-specific List (IPL) of genes correlated with a disease, a signaling pathway, a metabolic pathway or any other gene group linked by a biological function.
2. Orthologous promoters: Orthologous promoter sets from three mammalian species (human, mouse and rat where available) are collected for every gene in the IPL.
3. 2-TFBS-models: Orthologous promoter sets are analyzed for frameworks consisting of two elements, resulting in initial models (each model representing one vertically conserved framework).
4. Shared models: Networks have to contain at least three members. Therefore, models from the orthologous promoter sets of the genes in the IPL are selected for further analysis if they match at least two additional promoters of the IPL.
5. Optimization of model selectivity: The models are refined solely based on promoters present within the IPL using the following restrictions: each TFBS is oriented strand-specific, and distance range variability between the two TFBSs is minimized. Selectivity (defined in Methods) versus a genome-wide promoter database is used as the sole optimization criterion.
6. Extension of models: In this step, models resulting from Step 5 are extended by at least one additional TFBS (missed by standard parameters) resulting in models of more than two elements. Optimization of models proceeds as in Step 5. At this point, orthologous conservation of the extended models in the additionally identified genes is no longer required.
7. Database search with final models: Next, the complete match list for the models defined in Step 6 is determined from a database of all available human promoters. This provides the basis for extension of the initial gene list. Hitherto unrelated genes can be linked to the original problem on the basis of their promoter organization, and subsequent verification of the connection from an independent source justifies extending the initial list.
8. Functional associations: The regulatory networks of IPL-genes defined by matches to shared promoter models are then superimposed onto the literature-derived biological process network of all IPL genes to assess concurrence between these independently derived networks.
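The shared-model criterion of Step 4 (a network needs at least three members, so a model must match at least two IPL promoters besides the gene it was derived from) can be written as a small filter. This is an illustrative sketch only: the function name and the IPL excerpt are assumptions, and the first example uses the matches reported for model M4 in Table 2.

```python
def is_shared_model(origin_gene: str, matched_genes: set[str], ipl: set[str],
                    min_additional: int = 2) -> bool:
    """Step 4 criterion: the model must match at least `min_additional`
    IPL promoters besides the gene whose orthologous set produced it."""
    additional = (matched_genes & ipl) - {origin_gene}
    return len(additional) >= min_additional


# Excerpt of the IPL (gene symbols from Table 1), not the full 27-gene list.
IPL_EXCERPT = {"GCG", "ANXA7", "INSR", "GCK", "INS"}

# Model M4 was derived from GCG and also matches ANXA7 and INSR (Table 2).
print(is_shared_model("GCG", {"GCG", "ANXA7", "INSR"}, IPL_EXCERPT))  # True

# Hypothetical model matching only one additional IPL promoter: rejected.
print(is_shared_model("GCK", {"GCK", "INS"}, IPL_EXCERPT))  # False
```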

Application example

We have applied this strategy to identify genes and their transcriptional networks important in the context of maturity onset diabetes of the young (MODY). We were able to identify at least two potential co-regulation networks clearly associated with different biological networks directly connected to insulin/glucose signaling. We also extended the original gene list by several new candidate genes for these networks.

Problem-oriented gene selection by automated literature analysis (Step 1)

Mechanisms of glucose homeostasis are disturbed in the MODY-syndromes (diabetes mellitus type II) that were used as model system. We initiated the analysis with an exhaustive automatic literature search using all available PubMed® abstracts. LitMiner was used to extract disease-associated genes. The following queries were used independently: , <Non-insulin-dependent diabetes mellitus>, , , <glucose homeostasis insulin signaling>. The result of each query was a separate list of genes. All of these were merged to compile the list shown in Table 1.

Table 1

Problem-oriented gene selection: MODY

Gene	LocusID	Description	Ortholog	Functional data (Literature)
ABCC8^*	6833	ATP-binding cassette, subfamily C (CFTR/MRP), member 8	hmr	Insulin release
ANXA7	310	Annexin VII: calcium-channel, voltage-gated	hmr	Membrane fusion
CACNA1A	773	Calcium channel, voltage-dependent, P/Q type, alpha 1A subunit	hr	Hormone release
CACNA1D	776	Calcium channel, voltage-dependent, L type, alpha 1D subunit	h	Calcium signaling
CACNA1H	8912	Calcium channel, voltage-dependent, L type, alpha 1H subunit	h	Calcium signaling
GCG^*	2641	Glucagon	hm	Glucose metabolism
GCGR	2642	Glucagon receptor	hm	Carbohydrate metabolism
GCK^*	2645	Glucokinase	hmr	Glucose metabolism
GCKR	2646	Glucokinase regulatory peptide	hm	Glucose metabolism
GIPR^*	2696	Gastric inhibitory polypeptide receptor	hmr	Stimulates insulin release
GLP1R^*	2740	Glucagon-like peptide 1 receptor	hmr	Stimulates insulin release
IGF1^*	3479	Insulin-like growth factor 1	hmr	Glucose metabolism
IGF1R^*	3480	Insulin-like growth factor 1 receptor	hmr	Carbohydrate metabolism
INS	3630	Insulin	hmr	Glucose metabolism
INSR^*	3643	Insulin receptor	hmr	Carbohydrate metabolism
INSRR	3645	Insulin related receptor	hmr	Carbohydrate metabolism
IRS1^*	3667	Insulin receptor substrate 1	hmr	Inhibition of insulin signaling
ITPR3	3710	Inositol 1,4,5-triphosphate receptor 3	hm	Calcium channel, signaling
KCNJ3^*	3760	Potassium inwardly rectifying channel, subfamily J, member 3	hmr	Insulin release (assumed)
KCNJ5	3762	Potassium inwardly rectifying channel, subfamily J, member 5	hr	Insulin release (assumed)
KCNJ6^*	3763	Potassium inwardly rectifying channel, subfamily J, member 6	hm	Insulin release
KCNJ11^*	3767	Potassium inwardly rectifying channel, subfamily J, member 11	hmr	Insulin release
LEPR	3953	Leptin receptor	h	Adipose-tissue regulation
NPY1R	4886	Neuropeptide Y/peptide YY receptor Y1	h	Gastrointestinal signaling
PCSK1^*	5122	EC 3.4.21.93, proprotein convertase 1	hmr	Insulin processing
PCSK2^*	5126	EC 3.4.21.94, proprotein convertase 2	hmr	Insulin processing
SLC2A2^*	6514	Solute carrier family 2	hmr	Carbohydrate metabolism

The initial problem-specific list (IPL) of 27 genes; all gene names are according to HUGO officially preferred symbols (46). Availability of orthologous gene promoters is indicated by single-letter abbreviations in column 4. h = human, m = mouse, r = rat. The 15 final orthologous promoter sets used for promoter modeling are indicated by asterisks (*).

Orthologous promoter sets (Step 2)

Promoters were identified and extracted for all the genes in our list (Table 1). For the majority of genes, promoter sequences were available from all species (human, mouse and rat) that were chosen for the interspecies comparison. For seven genes, promoters were only available in two species (human and mouse or human and rat) and for four genes, promoters were available only for human (CACNA1D, CACNA1H, LEPR, NPY1R). We obtained 23 sets of orthologous gene promoters from a total of 62 promoter sequences, some of which consisted only of two sequences (see Table 1, column 4). Promoter sequences were extracted from ElDorado™. Functionally conserved frameworks cannot be distinguished from trivial occurrences caused by sequence identity in case of high overall sequence similarity (every sequence-associated feature is necessarily ‘conserved’ when the sequence is identical). Therefore, we first checked the degree of sequence similarity for each orthologous promoter set by sequence alignment. Overall sequence similarity ranged from 36% to 77% for human versus mouse/rat and from 62% to 95% for mouse versus rat. Twenty-one sets with an overall sequence similarity up to 60% (empirical limit) were accepted for further analysis. Models of 2-TFBS-frameworks represent the smallest functional transcriptional units as known from composite elements (24) and transcriptional modules (25). Therefore, 2-TFBS-frameworks were generated within these orthologous promoter sets (interspecies comparison). Each promoter set was subjected to three separate FrameWorker runs using distances of 5–150 bp between elements. These models were required to be present in all orthologous promoters of each set. The remaining 15 suitable promoter sets fulfilling both criteria (up to 60% sequence similarity and matching all available orthologous promoters, marked by * in Table 1) yielded 89 different models.
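The set and sequence counts quoted above follow directly from column 4 of Table 1. The short Python tally below is an illustrative check, not part of the original analysis; the dictionary is transcribed from Table 1, with INSRR spelled as in Table 2.

```python
# Ortholog availability per gene, transcribed from column 4 of Table 1
# (h = human, m = mouse, r = rat).
ORTHOLOGS = {
    "ABCC8": "hmr", "ANXA7": "hmr", "CACNA1A": "hr", "CACNA1D": "h",
    "CACNA1H": "h", "GCG": "hm", "GCGR": "hm", "GCK": "hmr", "GCKR": "hm",
    "GIPR": "hmr", "GLP1R": "hmr", "IGF1": "hmr", "IGF1R": "hmr",
    "INS": "hmr", "INSR": "hmr", "INSRR": "hmr", "IRS1": "hmr",
    "ITPR3": "hm", "KCNJ3": "hmr", "KCNJ5": "hr", "KCNJ6": "hm",
    "KCNJ11": "hmr", "LEPR": "h", "NPY1R": "h", "PCSK1": "hmr",
    "PCSK2": "hmr", "SLC2A2": "hmr",
}

# An orthologous set needs promoters from at least two species.
ortholog_sets = {g: s for g, s in ORTHOLOGS.items() if len(s) >= 2}

n_three_species = sum(1 for s in ORTHOLOGS.values() if len(s) == 3)
n_two_species = sum(1 for s in ORTHOLOGS.values() if len(s) == 2)
n_human_only = sum(1 for s in ORTHOLOGS.values() if s == "h")
n_sets = len(ortholog_sets)
n_sequences = sum(len(s) for s in ortholog_sets.values())

print(n_three_species, n_two_species, n_human_only)  # 16 7 4
print(n_sets, n_sequences)                           # 23 62
```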

Shared models (Step 4) and optimization of model selectivity (Step 5)

Five of the 89 models recognized at least two additional gene promoters in the IPL and were selected for further optimization (M1–M5 depicted on top in Figure 2). The different parameters for matrix similarity, matrix orientation (strand specificity), model similarity and distance variation between weight matrices could be adjusted manually for three models to maximize selectivity against the EPD (Figure 3). We found that all five 2-TFBS-models contain at least one TFBS associated with endocrine tissues, and four of the eight transcription factors associated with weight matrices in our models are described as being expressed in endocrine tissues (V$FKHD, V$HOXF, V$MAZF, V$NEUR, BiblioSphere™ analysis).

Figure 2

Model descriptions. The selected five 2-TFBSs-models (TFBSs symbolized by gray boxes) generated from promoter analysis are shown on the top (M1–M5). Naming of TFBSs is according to vertebrate matrix families in MatInspector (Genomatix). The threshold used (opt = optimized; −0.02 = optimized − 0.02) is indicated above the boxes. ‘+’ and ‘−’ signs inside the boxes indicate strand orientation of the respective TFBS. Numbers centered below the boxes denote distances between TFBSs. Extended models (M1a, M1b, M1c, M5a) are shown below models M1–M5 (newly added TFBSs are indicated by open boxes).
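The elements of a 2-TFBS model described in this caption (matrix family, strand orientation and distance range) can be captured in a small data structure. The following Python sketch is a simplified illustration and does not reproduce FrameWorker, FastM or ModelInspector: the class names, the pairing of V$FKHD with V$NEUR, and all coordinates are hypothetical. The 5–150 bp range mirrors the distance setting quoted for the FrameWorker runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    family: str    # matrix family name, e.g. "V$FKHD"
    position: int  # hypothetical anchor coordinate in the promoter (bp)
    strand: str    # "+" or "-"


@dataclass(frozen=True)
class TwoSiteModel:
    first: tuple[str, str]   # (family, strand) of the upstream element
    second: tuple[str, str]  # (family, strand) of the downstream element
    min_dist: int            # smallest allowed distance in bp
    max_dist: int            # largest allowed distance in bp

    def matches(self, sites: list[Site]) -> bool:
        """True if any ordered pair of sites fits family, strand and distance."""
        for a in sites:
            if (a.family, a.strand) != self.first:
                continue
            for b in sites:
                if (b.family, b.strand) != self.second:
                    continue
                if self.min_dist <= b.position - a.position <= self.max_dist:
                    return True
        return False


# Hypothetical model and promoters, for illustration only.
model = TwoSiteModel(("V$FKHD", "+"), ("V$NEUR", "-"), 5, 150)
promoter_a = [Site("V$FKHD", 120, "+"), Site("V$NEUR", 180, "-")]
promoter_b = [Site("V$NEUR", 120, "-"), Site("V$FKHD", 180, "+")]  # wrong order
promoter_c = [Site("V$FKHD", 120, "-"), Site("V$NEUR", 180, "-")]  # wrong strand

print(model.matches(promoter_a), model.matches(promoter_b), model.matches(promoter_c))
```

Only `promoter_a` satisfies order, orientation and distance at once, which is what distinguishes a framework match from the mere co-occurrence of two binding sites.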

Figure 3

Optimization of model selectivity. The histogram shows the increase in selectivity (as defined in Methods) determined for the gene list against the Genomatix Human Promoter Database (see also Table 2). The joined boxes below the histogram indicate the different model structures with 2-, 3- or 4-TFBSs.

Extension of models (Step 6)

Models containing 3-TFBSs were generally found to be more selective than 2-TFBS models (26,27). Therefore, we inspected the orthologous promoter sets for the genes KCNJ11, ABCC8, GIPR, GCG and GLP1R (models M1–M5, Table 2) by MatInspector™ for additional less well-conserved TFBSs in all three organisms, and within a distance range limit of 100 bp from one of the two initial TFBSs. Again as in Step 3, this range was manually adjusted for individual models. This process resulted in extension of model M1 and model M5 by a third TFBS leading to models M1a, M1b (one additional EBOX binding site each) and M5a (additional SP1 binding site). We noticed that model M1a and M1b extended the same model in two directions and then merged them into model M1c (schematic drawing in Figure 2), which now consists of four TFBSs.

Table 2

Model evaluation

Model	Origin	Model matches in IPL (27 genes)	Recall in IPL (%)	Hits in EPD (N)	Hits in EPD (%)	Hits in GPD (N)	Hits in GPD (%)	Selectivity (EPD)	Selectivity (GPD)
M1	KCNJ11	KCNJ11, ABCC8, ANXA7, GCGR, INSRR, IRS1, ITPR3, KCNJ3	30.0	96	3.2	1335	2.7	9.4	11.1
M2	ABCC8	ABCC8, ANXA7, CACNA1H, GIPR, IGF1R, KCNJ11, LEPR, PCSK1, PCSK2	33.0	253	8.4	3283	6.5	3.9	5.1
M3	GIPR	GIPR, KCNJ3, CACNA1H, IRS1, KCNJ11	18.5	95	3.2	1650	3.3	5.8	5.6
M4	GCG	GCG, ANXA7, INSR	11.1	145	4.8	3093	6.2	2.3	1.8
M5	GLP1R	GLP1R, ABCC8, GIPR, INS, PCSK1, PCSK2	22.2	35	1.2	484	1.0	18.5	22.2
M1a	KCNJ11	KCNJ11, ABCC8, ITPR3	11.1	34	1.1	490	1.0	9.8	11.3
M1b	KCNJ11	KCNJ11, ABCC8, ANXA7, INSRR, IRS1, ITPR3, KCNJ3	25.9	36	1.2	505	1.0	21.6	25.6
M5a	GLP1R	GLP1R, GIPR, INS, PCSK2	14.8	20	0.7	260	0.5	22.1	28.5
M1c	KCNJ11	KCNJ11, ABCC8, ITPR3	11.1	15	0.5	191	0.4	22.2	29.2

Selected models and their matches found in the list (IPL) of 27 genes and in two different databases (EPD and GPD). All gene names are according to HUGO officially preferred symbols (46). Origin of the model (column 2) denotes the respective set of orthologous gene promoters used for modeling. Promoters of four genes (ABCC8, ANXA7, GIPR, KCNJ11) match to three different models indicating highly interconnected networks. Models with three TFBSs show higher selectivity than models with two TFBSs (columns 5, 6 and 7, absolute match numbers, percentage recognized of all sequences in database and selectivity).
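The statement that four genes match three different models can be checked from the match lists in Table 2. The tally below is an illustrative sketch that assumes only the five 2-TFBS models (M1–M5) are counted, since the extended models are refinements of M1 and M5.

```python
from collections import Counter

# IPL matches of the five 2-TFBS models, transcribed from Table 2.
MATCHES = {
    "M1": ["KCNJ11", "ABCC8", "ANXA7", "GCGR", "INSRR", "IRS1", "ITPR3", "KCNJ3"],
    "M2": ["ABCC8", "ANXA7", "CACNA1H", "GIPR", "IGF1R", "KCNJ11", "LEPR",
           "PCSK1", "PCSK2"],
    "M3": ["GIPR", "KCNJ3", "CACNA1H", "IRS1", "KCNJ11"],
    "M4": ["GCG", "ANXA7", "INSR"],
    "M5": ["GLP1R", "ABCC8", "GIPR", "INS", "PCSK1", "PCSK2"],
}

# Count, for each gene, how many models match its promoter.
models_per_gene = Counter(g for genes in MATCHES.values() for g in genes)
hubs = sorted(g for g, n in models_per_gene.items() if n >= 3)
print(hubs)  # ['ABCC8', 'ANXA7', 'GIPR', 'KCNJ11']
```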

Model selectivity was assessed against the GPD. Among the 2-TFBS-models, the most selective (M5) matched 484 gene promoters (1.0%, Table 2) and the least selective (M2) matched 3283 gene promoters (6.5%, Table 2). Model M2 exhibited the best recall (33%, Table 2). The recall of the 3-TFBS-models was lower than that of the 2-TFBS-models, but their selectivity was higher (Figure 3). The increase in selectivity of the 3- and 4-TFBS-models based on the GPD (>50 000 promoters) is clearly evident (Figure 3) and was essentially paralleled in an analysis based on the EPD (>4000 promoters, data not shown).

Database search with final models (Step 7)

The GPD was searched with all models M1–M5 as well as models M1a, M1b, M1c and M5a (Table 2). A clear reduction in the number of matches in the database (3- to 6-fold) can be seen between the 2-TFBS-models and the extended models, which is reflected in a corresponding increase in selectivity. Inspection of the matches of the extended models also allowed extension of the IPL. We found additional genes already known to be involved in insulin/glucose signaling that were not contained in the IPL, as they did not match our LitMiner queries (PRKAA1, ADRB3, PPARGC1B, CLIC3, RyR2, VIPR).

Functional association (Step 8)

Biological links between the genes of the IPL were identified from BiblioSphere™, which is a gene-centered approach combining literature with sequence analysis (used to compile the scheme shown in Figure 4). This biological network revolving around insulin/glucose signaling is overlaid with gray areas indicating the groups of IPL genes identified by the two models M1b and M5a, which are extensively linked in the biological networks (summarized in Figure 4). Briefly, the ATP-sensitive K+ channels composed of KCNJ11 and ABCC8 (28) (probably extended by KCNJ3 through models 1 and 3) are involved in glucose-induced insulin secretion (29), and seem to be co-regulated as indicated by their shared promoter framework. INSRR is known to form heterodimers with INSR and IGF1R (30) and is involved in tyrosine-phosphorylation of the IRS1 product (31), which in turn inhibits insulin secretion (32). The CACNA1H gene encodes the L-type voltage-dependent calcium channel VDCC, which is linked to other genes: It may be involved in the actions of two insulin pro-protein convertases PCSK1, PCSK2 (33). VDCC might also influence the GIPR and GLP1R receptor genes both of which enhance insulin secretion (34).

DISCUSSION

We show that promoter modeling can link disease-associated genes to potential regulatory networks. The most important result obtained in this study is achieving this by using a generally applicable strategy based on optimization of selectivity of promoter models that also identifies regulatory subgroups when necessary. We were able to identify putative regulatory networks within the initial gene list, adding another level of evidence derived from promoter analysis to links known from the literature. We also identified novel members of the putative regulatory networks, which were clearly associated with the biological processes analyzed. Thus, a link between known biological networks and regulatory networks described by molecular promoter organization became evident. Although such links have been established in previous studies (27,35), these depended on particular expert knowledge and/or problem-specific conditions preventing generalization of the approach. As shown in Figure 4, literature analysis identified a group of genes, which are tightly linked in larger functional networks. Furthermore, for nine genes (ABCC8, KCNJ11, PCSK1, PCSK2, INS, INSR, GCG, IGF1R, LEPR) the BiblioSphere™ literature co-citation analysis revealed a connection to one of the transcription factors that are part of the 2-TFBS-models. However, we used the literature analysis solely to compile the IPL, and then relied entirely on sequence analysis to find and improve subgroups of potentially co-regulated genes as exemplified by shared promoter frameworks. This allowed us to use a systematic approach, purging the huge list of possible frameworks to only five. The final extended models M1a,b,c and M5a preferentially link the promoters of genes that are also functionally connected, such as binding to each other (e.g. receptor complexes) or acting in a common pathway (e.g. insulin processing, Figure 4). 
This further supports the idea that promoter organizational models can indeed provide the link between the genomic sequence and their biological function. We found at least six new candidate genes for the insulin/glucose signaling network by searching the human promoter database with models M5a and M1c that were not in the IPL, but clearly associated with insulin/glucose signaling (PRKAA1, ADRB3, PPARGC1B, CLIC3, RyR2, VIPR). They were not included into the IPL either because the literature was not yet available at the time of IPL compilation or they ranked too low in the initial list (e.g. no explicit link to beta cells). The PPARGC1B gene (coding for PGC-1beta) for example is clearly affected in diabetes (36,37). However, this gene is not solely associated with beta-cells and, for example, may be involved in diabetes-related events in the liver (38), further extending the range of the regulatory network. Promoter analysis added another line of evidence for the relevance of these newly identified genes, which allows better experimental setup for further evaluation of these signaling networks. This should help to gain a better understanding of complex biological processes. Our strategy described here has several advantages over problem-specific approaches. Compilation of a complete gene list from literature would require a priori knowledge of the solution in order to define the correct queries. In our approach, the initial problem-oriented list of genes does not need to be complete, and it can be compiled semiautomatically. When starting with a single gene or even just a disease name, it is possible to collect a list of genes definitely related to the topic of interest. This was shown using the literature tools described here for mammalian systems. There is also no need to exactly know how the selected genes are linked. 
Our strategy successfully analyzed mixed data sets not restricted to a single transcriptional mechanism, and identified subsets connected by shared promoter frameworks (see Figure 4). Mixed data sets usually present an obstacle to pattern analysis, and only recently has the problem been approached successfully in mammalian systems (39). However, this and other approaches (40,41) focused on individual elements rather than complete promoter organization, which is the focus of this study. Throughout the analysis, selectivity was evaluated against databases that were orders of magnitude larger than our training set. Selectivity was chosen because sensitivity and specificity require knowledge of the true positives and false negatives, neither of which is available for whole-genome promoter databases. Evaluation of results against the background of all promoters in the human genome is desirable as it excludes any artificial bias in control sampling, supporting the biological relevance of our findings. Selectivity proved to be a suitable optimization criterion as demonstrated in Figure 4. The importance of combinations of TFBSs for biological function had already been well established (20), and the particular organization of frameworks has already been used successfully to describe individual functions (42). Phylogenetic conservation of TFBSs has also been used for promoter analysis (43,44). However, the combination of vertical (inter-species) and horizontal (intra-species) framework conservation has so far not been exploited to the extent implemented here. The key to success was the extension from single gene analysis (orthologous sets) towards non-orthologous gene groups, which provided the basis for separating different gene groups matching distinct models. This required limiting the first step (orthologous promoter analysis) to frameworks of two elements, which are usually neither selective nor necessarily linked to a particular function.
Larger models of four or more TFBSs in orthologous promoter sets begin to show over-fitting (we generally found them recognizing only the training set), a feature not desirable in this context. Selectivity and functional association were brought to these models by the interactive optimization process. Gain in selectivity almost always causes a loss of recall. Models containing three TFBSs turned out to represent a good balance between selectivity and recall in our example, which is required for a successful search for potential new candidates in a regulatory network. This strategy currently requires interactive decisions (such as which models to extend and how). However, such decisions are reached in a data-driven approach and the selectivity analysis provides an objective measure of improvement. Thus, model finding and optimization are principally suitable for automation, which could be achieved by systematic parameter range variation. Detailed expert knowledge of the problem is only required for the functional assessment in Step 8, but will also facilitate compilation of the IPL. The systematic extraction of promoter structures (frameworks) from a group of genes related to a wide variety of questions or fields of interest and linking these frameworks to biological functions becomes possible by our strategy. However, as the input gene list may be incomplete, so may the result. This strategy will probably not identify all the models or all the functions hidden in the input genes. Nevertheless, even being aware that the result will only be a partial analysis of the problem, this strategy can be used for most problems involving evolutionarily conserved mechanisms of gene regulation. Elucidation of regulatory mechanisms (45) through functional models as demonstrated here, significantly contributes to the functional annotation of mammalian genomes.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

46 in total

1. Identifying regulatory networks by combinatorial analysis of promoter elements.

Authors: Y Pilpel; P Sudarsanam; G M Church
Journal: Nat Genet Date: 2001-10 Impact factor: 38.330

2. Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors.

Authors: A E Kel; O V Kel-Margoulis; P J Farnham; S M Bartley; E Wingender; M Q Zhang
Journal: J Mol Biol Date: 2001-05-25 Impact factor: 5.469

3. Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity.

Authors: A Klingenhoff; K Frech; K Quandt; T Werner
Journal: Bioinformatics Date: 1999-03 Impact factor: 6.937

4. DIALIGN: finding local similarities by multiple sequence alignment.

Authors: B Morgenstern; K Frech; A Dress; T Werner
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

5. A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter.

Authors: K Frech; J Danescu-Mayer; T Werner
Journal: J Mol Biol Date: 1997-08-01 Impact factor: 5.469

6. Preserved pancreatic beta-cell development and function in mice lacking the insulin receptor-related receptor.

Authors: T Kitamura; Y Kido; S Nef; J Merenmies; L F Parada; D Accili
Journal: Mol Cell Biol Date: 2001-08 Impact factor: 4.272

7. Insulin receptor-related receptor is expressed in pancreatic beta-cells and stimulates tyrosine phosphorylation of insulin receptor substrate-1 and -2.

Authors: I Hirayama; H Tamemoto; H Yokota; S K Kubo; J Wang; H Kuwano; Y Nagamachi; T Takeuchi; T Izumi
Journal: Diabetes Date: 1999-06 Impact factor: 9.461

8. Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells.

Authors: A Kel; O Kel-Margoulis; V Babenko; E Wingender
Journal: J Mol Biol Date: 1999-05-07 Impact factor: 5.469

9. Enrichment of regulatory signals in conserved non-coding genomic sequence.

Authors: S Levy; S Hannenhalli; C Workman
Journal: Bioinformatics Date: 2001-10 Impact factor: 6.937

10. Human-mouse genome comparisons to locate regulatory sites.

Authors: W W Wasserman; M Palumbo; W Thompson; J W Fickett; C E Lawrence
Journal: Nat Genet Date: 2000-10 Impact factor: 38.330
