Ruth Van Hellemont, Pieter Monsieurs, Gert Thijs, Bart de Moor, Yves Van de Peer, Kathleen Marchal.
Abstract
Although proven successful in the identification of regulatory motifs, phylogenetic footprinting methods still show some shortcomings. To address these difficulties, which are most apparent when phylogenetic footprinting is applied to distantly related organisms, we developed a two-step procedure that combines the advantages of sequence alignment and motif detection approaches. The results on well-studied benchmark datasets indicate that the presented method outperforms other methods when the sequences become either too long or too heterogeneous in size.
Year: 2005 PMID: 16420672 PMCID: PMC1414112 DOI: 10.1186/gb-2005-6-13-r113
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Schematic representation of the two-step procedure for phylogenetic footprinting. In the data reduction step, regions conserved among closely related (mammalian) orthologs are selected. Subsequently, these strongly conserved sequences are combined with a more distant ortholog (for example, Fugu); this set of genes is then subjected to motif detection. Finally, significantly conserved blocks are identified using a threshold defined by a random analysis.
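The caption does not say how the random analysis sets the threshold. As a minimal sketch, assuming each candidate block has a numeric score and that the threshold is an empirical quantile of scores obtained from randomized data, the final filtering step could look like the following. The function names, the default 0.95 quantile and the idea of a single score per block are illustrative assumptions, not details taken from the paper.

```python
def empirical_threshold(random_scores, quantile=0.95):
    """Return the score found at the given quantile of the random scores.

    Assumption: the 'random analysis' yields a list of block scores
    observed on randomized input; the quantile is a hypothetical choice.
    """
    if not random_scores:
        raise ValueError("need at least one random score")
    ordered = sorted(random_scores)
    # Clamp the index so that quantile=1.0 returns the largest score.
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def significant_blocks(block_scores, random_scores, quantile=0.95):
    """Keep only blocks that score strictly above the random threshold."""
    threshold = empirical_threshold(random_scores, quantile)
    return {name: score for name, score in block_scores.items()
            if score > threshold}
```

A stricter quantile trades sensitivity for fewer false positives; the right value depends on how many randomizations are available.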
Conserved blocks detected in benchmark datasets
| Gene | Number of blocks: two-step | Number of blocks: UCSC | Number of blocks: UCR |
| cfos | 0 | 0 | 0 |
| hoxb2 | 8 | 5 | 0 |
| pax6 | 13 | 11 | 0 |
| scl | 1 | 0 | 0 |
| Total | 22 | 16 | 0 |
Number of blocks two-step: number of conserved blocks identified using the two-step procedure. For more details on the blocks see Tables 2 (hoxb2), 3 (pax6) and 4 (scl). Number of blocks UCSC: the number of blocks detected by the two-step procedure that were recovered in the UCSC genome browser (aligned between mammals and Fugu) [51]. Number of blocks UCR: the number of blocks detected by the two-step procedure that correspond to an ultra-conserved region [20].
List of the significant blocks detected in the hoxb2 dataset
| Block | Consensus sequence and possible binding sites |
| Hoxb2 1.1 (-) | |
| Hoxb2 2.1 (-) | |
| Hoxb2 2.2 (UCSC) | |
| Hoxb2 2.3 (UCSC) | |
| Hoxb2 2.4 (UCSC) | |
| Hoxb2 2.5 (UCSC) | |
| Hoxb2 3.1 (UCSC) | NF-Y, M00185, TRRCCAATSRN: 12-22 - (0.915) |
| Hoxb2 3.2 (-) | USF, M00217, NCACGTGN: 1-8 + (0.902) |
For each block, the consensus sequence is given followed by the possible binding sites situated in this block: motifs previously described in the literature [46] are marked with an asterisk. The motifs are summarized by their motif name (in bold), by their consensus sequence, if known, as described in the original article, by the sequence of the motif instance in our search, by the positions of the motif instance relative to the consensus sequence of the entire block and by the strand (indicated by a '+' or a '-') on which the motif occurred. Motif hits derived by Transfac are indicated by their matrix accession number, the consensus of this binding site and the instances of this motif in our search. These are further characterized by their positions relative to the consensus sequence of the entire block, by the strand on which the motif occurred and by the corresponding MotifLocator score (in parentheses). The blocks identified by the UCSC genome browser as conserved between mammals and Fugu are marked with 'UCSC', while the blocks detected by our two-step methodology but not present in the UCSC genome browser are indicated with a '-'.
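The Transfac consensus strings in the table (for example TRRCCAATSRN and NCACGTGN) use IUPAC nucleotide ambiguity codes. The sketch below shows how such a consensus can be compared against a sequence and how 1-based, inclusive positions can be reported. It is a plain consensus match for illustration only: the scores in the table come from MotifLocator, which is not reimplemented here, and the function names and test sequences are hypothetical. The reading of the positions as 1-based and inclusive is an assumption, although it is consistent with the table, where 12-22 spans 11 positions (the length of TRRCCAATSRN) and 1-8 spans 8 (the length of NCACGTGN).

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(consensus, site):
    """True if every base of `site` is allowed by the IUPAC consensus."""
    if len(consensus) != len(site):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(consensus.upper(), site.upper()))


def find_sites(consensus, sequence):
    """Return 1-based inclusive (start, end) pairs for every matching window."""
    width = len(consensus)
    return [(i + 1, i + width)
            for i in range(len(sequence) - width + 1)
            if matches(consensus, sequence[i:i + width])]
```

Only the given strand is scanned; hits on the '-' strand would require scanning the reverse complement as well.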
List of the significant blocks detected in the pax6 dataset
| Block | Consensus sequence and possible binding sites |
| pax6 1.1 (UCSC) | CTTAATGATGAGAGATCTTTCCGCTCATTGCCCATTCAAATACAATTGTAGATCGAAGCCGGCCTTGTCAsGTTGAGAAAAAGTGAATTTCTAACATCCAGGACGTGCCTGTCTACT |
| pax6 1.2 (UCSC) | |
| pax6 1.3 (UCSC) | |
| pax6 1.4 (UCSC) | |
| pax6 1.5 (UCSC) | |
| pax6 1.6 (UCSC) | |
| pax6 2.1 (UCSC) | |
| pax6 2.2 (-) | ATTTTGGTTGCTTTCAGGTwTAATTAACTTT |
| pax6 2.3 (UCSC) | |
| pax6 2.4 (-) | GGTTGCTTTCAGGTwTAATTAACTTTGAACAACAAATA |
| pax6 3.1 (UCSC) | |
| pax6 3.2 (UCSC) | |
| pax6 3.3 (UCSC) | |
For each block, the consensus sequence is given followed by the possible binding sites situated in this block: motifs previously described in the literature [47] are marked with an asterisk. The motifs are summarized by their motif name (in bold), by their consensus sequence, if known, as described in the original article, by the sequence of the motif instance in our search, by the positions of the motif instance relative to the consensus sequence of the entire block and by the strand (indicated by a '+' or a '-') on which the motif occurred. Motif hits derived by Transfac are indicated by their matrix accession number, the consensus of this binding site and the instances of this motif in our search. These are further characterized by their positions relative to the consensus sequence of the entire block, by the strand on which the motif occurred and by the corresponding MotifLocator score (in parentheses). The blocks identified by the UCSC genome browser as conserved between mammals and Fugu are marked with 'UCSC', while the blocks detected by our two-step methodology but not present in the UCSC genome browser are indicated with a '-'.
List of the significant blocks detected in the scl dataset
| Block | Consensus sequence and possible binding sites |
| scl 1.1 (-) | |
For each block, the consensus sequence is given followed by the possible binding sites situated in this block: motifs previously described in the literature [48] are marked with an asterisk. The motifs are summarized by their motif name (in bold), by their consensus sequence, if known, as described in the original article, by the sequence of the motif instance in our search, by the positions of the motif instance relative to the consensus sequence of the entire block and by the strand (indicated by a '+' or a '-') on which the motif occurred. Motif hits derived by Transfac are indicated by their matrix accession number, the consensus of this binding site and the instances of this motif in our search. These are further characterized by their positions relative to the consensus sequence of the entire block, by the strand on which the motif occurred and by the corresponding MotifLocator score (in parentheses). The blocks identified by the UCSC genome browser as conserved between mammals and Fugu are marked with 'UCSC', while the blocks detected by our two-step methodology but not present in the UCSC genome browser are indicated with a '-'.
Figure 2. Localization of clusters and conserved blocks in the (a) hoxb2, (b) pax6 and (c) scl datasets. For each dataset, the different orthologous intergenic sequences are shown: Rn, Rattus norvegicus; Mm, Mus musculus; Pt, Pan troglodytes; Hs, Homo sapiens; Fr, Fugu rubripes. Clusters of conserved mammalian subsequences that were subjected to motif detection (that is, clusters containing at least one subsequence per mammalian organism) are represented on the respective mammalian sequences (cluster 1 in red, cluster 2 in blue and cluster 3 in green). The conserved blocks identified using BlockSampler are represented on the Fugu intergenic sequence (in the color of the mammalian cluster in which each block is located). For each block, the localization relative to the start of the Fugu gene is given. The transcription start sites are marked with an inverted triangle.
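The selection rule stated in the caption, that only clusters containing at least one subsequence per mammalian organism are passed on to motif detection, can be sketched as a simple filter. The data layout (a mapping from cluster ID to a list of species and coordinate pairs) and the function name are assumptions made for illustration; only the species abbreviations come from the caption.

```python
MAMMALS = frozenset({"Rn", "Mm", "Pt", "Hs"})


def complete_clusters(clusters, required=MAMMALS):
    """Keep clusters that hold at least one subsequence from every required species.

    `clusters` maps a cluster ID to a list of (species, (start, end)) tuples;
    this layout is hypothetical.
    """
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        if required <= {species for species, _ in members}
    }


# Hypothetical example: cluster 2 lacks Pt and Hs and is therefore dropped.
example = {
    1: [("Rn", (10, 50)), ("Mm", (12, 52)), ("Pt", (9, 49)), ("Hs", (11, 51))],
    2: [("Rn", (100, 140)), ("Mm", (101, 141))],
}
kept = complete_clusters(example)
```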
Comparison of two-step procedure with other methodologies
| Gene | Number of motifs | Two-step BS | BS | Two-step MS | MS | MAVID | TBA |
| cfos | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| hoxb2 | 17 | 8 (+5) | 13 | 2 | 1 | 0 | 0 |
| pax6 | 6 | 6 | 1* | 0 | 0 | 6 | 6 |
| scl | 5 | 3 (+1) | 1 | 0 | 0 | 0 | 0 |
| Total | 30 | 17 (+6) | 15 | 2 | 1 | 6 | 6 |
Number of motifs: the number of motifs reported by Blanchette and Tompa [26] in cfos, Scemama et al. [46] in hoxb2, Kammandel et al. [47] in pax6 and Göttgens et al. [48] in scl. Two-step BS: the number of previously described motifs detected by the two-step procedure, combining data reduction and motif detection using BlockSampler. The numbers in parentheses are the number of motifs present in non-significant blocks. BS: the number of previously described motifs detected by BlockSampler in initial full-length datasets. Two-step MS: the number of previously described motifs detected by combining data reduction and motif detection using MotifSampler. MS: the number of previously described motifs detected by MotifSampler in initial full-length datasets. MAVID: the number of previously described motifs detected (correctly aligned) by MAVID. TBA: the number of previously described motifs detected by TBA. *Only part of a motif was detected.
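To make the comparison easier to read, the totals in the table can be converted into the fraction of the 30 previously described motifs that each method recovered. The numbers below are copied from the Total row; the variable names are illustrative, and the separate figure for the two-step BlockSampler run simply adds the 6 motifs found in non-significant blocks.

```python
TOTAL_MOTIFS = 30

# Motifs recovered per method, from the Total row of the table.
recovered = {
    "Two-step BS": 17,
    "BS": 15,
    "Two-step MS": 2,
    "MS": 1,
    "MAVID": 6,
    "TBA": 6,
}

# Fraction of known motifs recovered by each method.
fractions = {method: count / TOTAL_MOTIFS for method, count in recovered.items()}

# Two-step BS when motifs in non-significant blocks (+6) are also counted.
two_step_bs_with_nonsignificant = (17 + 6) / TOTAL_MOTIFS

for method, fraction in fractions.items():
    print(f"{method}: {fraction:.1%}")
```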
Base pair lengths of the intergenic sequences for each benchmark dataset
| Gene | | | | | Fr |
| cfos | 40,154 | 33,157 | 40,132 | 40,154 | 3,606* |
| | | | | | 3,606† |
| | | | | | 1,244‡ |
| hoxb2 | 4,973 | 6,744 | 7,640 | 4,878 | 39,219 |
| pax6 | 40,102 | 40,000 | 40,000 | 40,000 | 21,204 |
| scl | 20,981 | 16,471 | 20,343 | 39,999 | 20,155 |
The Fugu cfos intergenic sequences are derived from *SINFRUG00000132418, †SINFRUG00000132419 and ‡SINFRUG00000143787. The Ensembl IDs (+ 1 GenBank accession number) are given in [56]. Fr, Fugu rubripes; Hs, Homo sapiens; Mm, Mus musculus; Pt, Pan troglodytes; Rn, Rattus norvegicus.
Figure 3. Comparison of the two-step strategy with MAVID for the scl dataset. (a) Conserved block: alignment of the different scl orthologs. The conserved block identified by BlockSampler is marked with a boxed area. (b) Visualization of the MAVID alignment of the corresponding region. The dashed line denotes a gap in the alignment. Rn, Rattus norvegicus; Mm, Mus musculus; Pt, Pan troglodytes; Hs, Homo sapiens; Fr, Fugu rubripes.
Figure 4. Schematic representation of subclusters, that is, clusters of conserved orthologous sequences that contain one region in each ortholog. See text for details. Rn, Rattus norvegicus; Mm, Mus musculus; Pt, Pan troglodytes; Hs, Homo sapiens.