
Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements.

Zhaolei Zhang, Mark Gerstein.

Abstract

Phylogenetic footprinting is an approach to finding functionally important sequences in the genome that relies on detecting their high degrees of conservation across different species. A new study shows how much it improves the prediction of gene-regulatory elements in the human genome.

Year:  2003        PMID: 12814519      PMCID: PMC193683          DOI: 10.1186/1475-4924-2-11

Source DB:  PubMed          Journal:  J Biol        ISSN: 1475-4924


It has been a great challenge for biologists to understand the complicated and varied mechanisms of gene regulation. The recent success of genome sequencing projects [1,2], combined with very effective gene-prediction algorithms, has generated abundant gene sequences, but our understanding of gene regulation has remained very limited. In humans and other higher eukaryotes, gene expression is modulated by the binding of various transcription factors to cis-regulatory regions of a gene. Binding of different combinations of transcription factors may result in a gene being expressed in different tissue types or at different developmental stages. To fully understand a gene's function, therefore, it is essential to identify the transcription factors that regulate the gene and the corresponding transcription-factor-binding sites (TFBSs) within the DNA sequence. Traditionally, these regulatory sites were determined by labor-intensive wet-lab techniques such as DNase footprinting or gel-shift assays [3]; several online databases, such as TRRD, COMPEL and TRANSFAC [4,5], have been constructed to store experimentally determined TFBSs. Now, Lenhard and colleagues [6] describe a new addition to the toolkit for TFBS prediction.

In recent years, various computational methods have been developed to model and predict gene-regulatory elements. But predicting TFBSs has proved to be much harder than predicting genes, the intrinsic difficulty being that TFBSs are in general very short and often degenerate in sequence. Most TFBSs are short sequences of 6–12 base-pairs located in the non-coding regions of a gene, most often in the 5' flanking region but sometimes in the 3' region or even in introns. Only between four and six bases within each TFBS are fully conserved, however, with the other positions being highly variable from gene to gene.
As a result, TFBSs are often modeled using position-specific weight matrices (PWMs) [7], which in essence summarize the relative frequencies of each of the four nucleotides at each position. Figure 1 shows an example of such a matrix, for the human transcription factor GATA-1, from the widely used TRANSFAC database [5].
Figure 1

An example of a position-specific weight matrix (PWM) adapted from the TRANSFAC database [5]. The sequences that have been shown experimentally to bind to the human transcription factor GATA-1 have 14 positions, among which only positions 6–10 are fully conserved. Abbreviations: R, G or A (purine); N, any; S, G or C (strong); D, G or A or T. Twelve sequences were used to build this matrix.
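
As a rough illustration of the matrix idea described above, the sketch below builds a log-odds weight matrix from a handful of aligned binding sites and scores a candidate sequence against it. It is a minimal sketch under stated assumptions: the four example sites are invented for illustration and are not the twelve GATA-1 sequences behind the TRANSFAC matrix, and the function names, the pseudocount of 1 and the uniform background frequency of 0.25 are all choices made here rather than anything specified in the text.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per column of the aligned sites.

    Hypothetical helper: pseudocount and background are illustrative defaults.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            # relative frequency of the base in this column, versus background
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score_site(pwm, site):
    """Sum the per-position scores of a candidate site of the same width."""
    return sum(column[base] for column, base in zip(pwm, site))

# Invented example sites sharing a fully conserved GATA core (not real data).
toy_sites = ["AGATAA", "TGATAG", "AGATAG", "CGATAA"]
pwm = build_pwm(toy_sites)

good = score_site(pwm, "AGATAA")   # matches the conserved core
bad = score_site(pwm, "AGCTAA")    # breaks the core at one position
```

In this toy matrix the conserved core columns carry most of the score, so a single mismatch inside the core lowers the total far more than a mismatch at a variable flanking position would.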

Given a PWM and a reliable scoring function, one can scan genomic DNA sequences and identify potential TFBSs. But because TFBSs are highly degenerate, the majority of predicted sites are 'false positives' that have no biological significance [8]. Several strategies have therefore been developed to reduce the false-positive rate; these include combining predictions with gene-expression data [9] or using prior knowledge of gene co-regulation [10]. Another approach is to take advantage of the fact that genes are often regulated by multiple transcription factors, so potential TFBSs tend to be clustered or adjacent to each other [11]. Alternatively, some researchers have tried to create more precise and sensitive tools for local sequence alignment and pattern discovery [12,13].

With the advance of genome sequencing projects, it has become obvious that comparing genomic sequences across species ('comparative genomics') is a very effective way to identify functionally important DNA sequences. At first, comparative techniques were primarily applied to the coding regions of genomes, to identify genes or exon-intron boundaries [14]. More recently, such evolutionary approaches have become central to the efforts to predict gene-regulatory sites, and the technique itself in this context has become known as 'phylogenetic footprinting' [15,16], a term inspired by the wet-lab technique of DNase footprinting. The reasoning behind the approach is that, just like coding sequences, regulatory elements are functionally important and are under evolutionary selection, so they should have evolved much more slowly than other non-coding sequences. Genome-wide sequence comparison and studies on individual genes have confirmed that regulatory elements are indeed conserved between related species [17-19].
Thus, if we align the non-coding regions of orthologous genes from two species that are sufficiently evolutionarily distant (but not too distant), we should be able to detect the conserved regulatory elements interspersed between the truly non-functional background sequences. This approach is illustrated schematically in Figure 2, in which a hypothetical human gene and its orthologs from mouse, rat and chimpanzee are shown together; alignment of the orthologous sequences reveals conserved TFBSs that are present in more than one species.
Figure 2

Using phylogenetic footprinting to detect conserved TFBSs. This schematic diagram shows a hypothetical human gene aligned with its orthologs from three other mammals. Cross-species sequence comparison reveals conserved TFBSs in each sequence. Sequence motifs of the same shape (colored in green) represent binding-sites of the same class of transcription factors. TFBS1 and TFBS4 are conserved in all four mammals; TFBS3 represents a newly acquired, primate-specific binding site. TFBS2 and TFBS2' represent orthologous regulatory sites that have diverged significantly between the primate and rodent lineages. Blue rectangles represent TATA boxes.
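
The idea that conserved elements stand out against a diverged background can be illustrated with a sliding-window identity profile over a pairwise alignment. This is only a sketch under assumptions: the two aligned strings are invented toy sequences, the function names are hypothetical, and the window of 5 columns and cut-off of 0.8 are arbitrary values chosen to keep the example small.

```python
def window_identity(aln_a, aln_b, window=5):
    """Fraction of identical, non-gap columns in each window of a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both bases agree and neither is a gap.
    matches = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]

def conserved_windows(aln_a, aln_b, window=5, cutoff=0.8):
    """Start columns of the windows whose identity reaches the cut-off."""
    profile = window_identity(aln_a, aln_b, window)
    return [i for i, identity in enumerate(profile) if identity >= cutoff]

# Invented toy alignment: a shared GATAAG block inside otherwise diverged sequence.
human = "ACGTTGATAAGCC-TA"
mouse = "TCTGAGATAAGGCATC"

print(conserved_windows(human, mouse))  # [4, 5, 6, 7, 8]
```

Only the windows overlapping the shared block pass the cut-off here; raising the cut-off or widening the window makes the filter stricter, which is the trade-off such parameters control.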

Phylogenetic footprinting was first performed by visually examining the alignment of orthologous sequences; later, automated computer programs were developed to assist the process. In this issue of Journal of Biology, Lenhard, Sandelin and colleagues describe their most recent success in predicting TFBSs by comparative genome analysis [6]. They also introduce an interactive, web-based computational platform, ConSite [20], which allows users to do their own phylogenetic footprinting.

The power of any TFBS prediction algorithm that uses PWMs depends on the quality of the matrix models that it uses, since the matrices represent an abstraction of experimentally verified TFBSs. Lenhard and colleagues [6] collected TFBSs from both in vivo and in vitro assays and used an improved motif discovery algorithm, ANN-Spec [21], to construct over 100 distinct and high-quality TFBS profile matrices. These comprehensive profiles were collected into an online database, JASPAR [22], which is freely available to the scientific community.

Users of ConSite can either provide an existing alignment of two orthologous sequences or input just the sequences, in which case the program will generate the alignment. The program then scans the individual sequences for potential TFBSs and compares the potential sites between the aligned sequences. Only those conserved sites that are present in both sequences and also, more importantly, are located in equivalent positions in the two aligned sequences, are selected and reported in the output. The remainder of the sites, which are not conserved between the two species, are considered to be false positives and are eliminated. This phylogenetic filtering procedure significantly improves the power of TFBS prediction, as is demonstrated by an example described in detail in the article by Lenhard et al. [6].
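
The filtering steps just described (scan each sequence, then keep only the hits that fall at equivalent alignment positions in both) can be sketched as follows. This is a simplified illustration and not ConSite's actual implementation: the function names, the toy +1/-1 matrix for a GATA core, the threshold and the two aligned sequences are all assumptions made for the example, only one strand is scanned, and 'equivalent position' is taken to mean the same starting alignment column.

```python
def scan(seq, pwm, threshold):
    """Start positions in an ungapped sequence where the matrix score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if score >= threshold:
            hits.append(start)
    return hits

def column_of_each_base(aligned):
    """Alignment column index of every non-gap character, in sequence order."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]

def conserved_hits(aln_a, aln_b, pwm, threshold):
    """Alignment columns where a predicted site starts in both sequences."""
    cols_a = column_of_each_base(aln_a)
    cols_b = column_of_each_base(aln_b)
    # Scan the ungapped sequences, then map each hit back to its alignment column.
    hits_a = {cols_a[p] for p in scan(aln_a.replace("-", ""), pwm, threshold)}
    hits_b = {cols_b[p] for p in scan(aln_b.replace("-", ""), pwm, threshold)}
    return sorted(hits_a & hits_b)

# Hypothetical toy matrix: +1 for the expected base of "GATA", -1 otherwise.
toy_pwm = [{b: (1 if b == c else -1) for b in "ACGT"} for c in "GATA"]

# Invented aligned sequences; each contains two GATA matches.
human_aln = "TTGATACC-GATAGG"
mouse_aln = "ACGATATTGATACCG"

print(conserved_hits(human_aln, mouse_aln, toy_pwm, threshold=4))  # [2]
```

Each toy sequence has two matches on its own, but only the one starting at alignment column 2 is shared; the second match sits one column apart in the two sequences and is discarded, which is the sense in which the filter trades some sensitivity for fewer false positives.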
The authors compared the human β-globin promoter sequence with the orthologous sequences from mouse and cow; this dramatically reduced the false-positive prediction of TFBSs, and they were able to identify a previously documented regulatory site. The authors also studied a larger set of human-mouse gene pairs and compared the results predicted by ConSite with the previously verified regulatory sites. On average, phylogenetic footprinting improved the selectivity of TFBS prediction by 85% compared to using matrix models alone, and could detect the majority of verified sites.

When compared with other available systems, ConSite has a flexible and easy-to-use web interface. Users of the website can choose to search for binding sites for any number of transcription factors or can even provide their own defined PWMs. The entire procedure and the output graphs can be modulated by many user-specified parameters, such as the extent of required conservation (cut-off) and the length of sequence to search (window size).

It is becoming evident that comparative genome analysis is very powerful and will be of use not only for genome annotation but also as an adjunct to more traditional disciplines, such as molecular biology and genetics. Just like the sequence-alignment programs that emerged in the early 1990s, ConSite and other similar programs [23,24] will prove very valuable and timely research tools for the scientific community. Many new research directions are currently being pursued in this area; for example, pair-wise sequence comparisons can be expanded to include multiple species and to make use of additional information, such as evolutionary distance and phylogenetic relationships [25]. More precise and effective sequence alignment programs have been created to handle genome-scale sequences [26,27].
In addition to the human-mouse comparisons, some researchers are also proposing cross-species comparison between human and other primates, which has been described as 'phylogenetic shadowing' [28]. This approach complements human-rodent comparisons and will detect primate-specific regulatory elements (see Figure 2). On the 'wet' experimental front, recent developments include microarray-based technologies such as 'ChIP-chip', which combines chromatin immunoprecipitation (ChIP) with analysis of the precipitated DNA on a microarray (chip), to detect TFBSs within a whole genome [29]. It can be imagined that, with the emergence of more mammalian genome sequences in the near future, we can finally identify all the gene regulatory elements in the human genome and use them as a blueprint for understanding the mysteries of gene regulation.
  27 in total

1.  The TRANSFAC system on gene expression regulation.

Authors:  E Wingender; X Chen; E Fricke; R Geffers; R Hehl; I Liebich; M Krull; V Matys; H Michael; R Ohnhäuser; M Prüss; F Schacherer; S Thiele; S Urbach
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.

Authors:  G Z Hertz; G D Stormo
Journal:  Bioinformatics       Date:  1999 Jul-Aug       Impact factor: 6.937

Review 3.  Discovery and modeling of transcriptional regulatory regions.

Authors:  J W Fickett; W W Wasserman
Journal:  Curr Opin Biotechnol       Date:  2000-02       Impact factor: 9.740

4.  Initial sequencing and analysis of the human genome.

Authors:  E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; 
H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal:  Nature       Date:  2001-02-15       Impact factor: 49.962

5.  ANN-Spec: a method for discovering transcription factor binding sites with improved specificity.

Authors:  C T Workman; G D Stormo
Journal:  Pac Symp Biocomput       Date:  2000

6.  Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome.

Authors:  Benjamin P Berman; Yutaka Nibu; Barret D Pfeiffer; Pavel Tomancak; Susan E Celniker; Michael Levine; Gerald M Rubin; Michael B Eisen
Journal:  Proc Natl Acad Sci U S A       Date:  2002-01-22       Impact factor: 11.205

7.  Conserved noncoding sequences are reliable guides to regulatory elements.

Authors:  R C Hardison
Journal:  Trends Genet       Date:  2000-09       Impact factor: 11.639

8.  Combining frequency and positional information to predict transcription factor binding sites.

Authors:  S M Kiełbasa; J O Korbel; D Beule; J Schuchhardt; H Herzel
Journal:  Bioinformatics       Date:  2001-11       Impact factor: 6.937

9.  Human and mouse gene structure: comparative analysis and application to exon prediction.

Authors:  S Batzoglou; L Pachter; J P Mesirov; B Berger; E S Lander
Journal:  Genome Res       Date:  2000-07       Impact factor: 9.043

10.  Human-mouse genome comparisons to locate regulatory sites.

Authors:  W W Wasserman; M Palumbo; W Thompson; J W Fickett; C E Lawrence
Journal:  Nat Genet       Date:  2000-10       Impact factor: 38.330

