| Literature DB >> 16855295 |
Abstract
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.Entities:
Mesh:
Year: 2006 PMID: 16855295 PMCID: PMC1524905 DOI: 10.1093/nar/gkl372
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1The IUPAC (International union of pure and applied chemistry) code for representing degenerate nucleotide sequence patterns.
Figure 2(A) The collection of eight known Rox1-binding sites taken from SCPD (47). Scores of the sites are according to the PWM described in (C). (B) Alignment matrix and IUPAC representation of the eight Rox1-binding sites. The cells represent the number of times a base i is observed at position j in the alignment of sites. The frequencies, f, of base i at position j of the binding sites can be obtained by dividing the values in the cells of the alignment matrix by the total number of sites, e.g. f = f = 4/8 = 0.5. (C) PWM for scoring sequences. Each weight is given by log2(f/P) (see text), where P is the probability of observing the base i in the data; here we have taken PA = PT = 0.32, and PC = PG = 0.18 (corresponding to the S.cerevisiae genome). A pseudocount of 1 was added to the alignment before deriving the weights. This matrix was used to score the sites in A. As an example, the score of the site in red (sequence CCAATTGTTTTG, score 13.87) is given by the summation of the scores that are circled in red. Note that the scores of the two consensus sequences, CCCATTGTTCTC and TCCATTGTTCTC are different because PC ≠ PT. (D) Sequence logo representation (187) of the alignments, visually showing the IC and conservation at each of the alignment positions. The IC of this matrix is 11.3 bits or 7.83 nats (Equation 1).
Figure 3Predicted TF-binding sites in human–mouse conserved regions around the CKM (creatine kinase, muscle) gene. The genomic regions, along with 5 kb upstream and 2 kb downstream, of the CKM gene were extracted from the human and mouse genomes and aligned using the BLASTZ software (151). The BLASTZ alignments were then fed into the rVISTA program (153) through the website . Binding sites for several TFs that are known to regulate gene expression in the muscle tissue were then predicted on the human sequence using the PWM models available from the TRANSFAC database (12). The predicted sites can be dynamically viewed and clustered through the above website. For the purpose of this current figure, we required that at least two binding sites belonging to different TFs be present within a window of 100 nt. A cluster of sites was observed in the immediate 5′ upstream region of this gene (boxed). Percent conservation between the two sequences is shown; regions with ≥75% conservation are colored. The human gene structure is shown at the top in blue.