Dieter De Witte, Jan Van de Velde, Dries Decap, Michiel Van Bel, Pieter Audenaert, Piet Demeester, Bart Dhoedt, Klaas Vandepoele, Jan Fostier.
Abstract
MOTIVATION: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected.
Year: 2015 PMID: 26254488 PMCID: PMC4653392 DOI: 10.1093/bioinformatics/btv466
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of BLSSpeller. The input consists of homologous promoter sequences grouped into gene families. During the intrafamily phase, conserved words are exhaustively enumerated for each gene family individually. A word is considered to be conserved in a gene family if its branch length score (BLS) exceeds a threshold T. Multiple BLS thresholds T can be used in a single run. In the alignment-free mode, the BLS of a word is computed irrespective of its orientation or relative position within the promoter sequences. Alternatively, in the alignment-based mode, words must appear aligned in the multiple sequence alignment. During the sorting phase, conserved words of all gene families are sorted according to permutation group, i.e. words with the same length and base content are grouped together. In the interfamily phase, permutation groups are handled individually. First, for each word, the conserved family count, i.e. the number of gene families in which the word is conserved at BLS threshold T, is established for all BLS thresholds T. Next, a background model is created by selecting the median value of the conserved family count of a large number of randomly generated instances of the permutation group, again for each threshold T. Finally, a confidence score is computed for each T. Words for which both the conserved family count and the confidence score reach their respective cutoffs for any threshold T are considered to be genome-wide conserved motifs and are retained.
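The interfamily phase described in the caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the BLSSpeller implementation: all function names, the input layout (a mapping from each word to the BLS it reaches in each gene family), the number of random samples, the inclusive comparison `>=`, and the confidence formula `1 - F_bg / F` are hypothetical choices made for the example. The caption only states that a confidence score is derived for each threshold; the formula used here was chosen because it is consistent with the count and confidence values reported in the KN1 motif table.

```python
import random
from statistics import median


def permutation_group(word):
    """Key shared by all words with the same length and base content."""
    return "".join(sorted(word))


def family_count(word, conserved, threshold):
    """Number of gene families in which `word` is conserved at `threshold`.

    `conserved` maps word -> {family_id: BLS} (hypothetical layout).
    Whether the bound is inclusive is an assumption of this sketch.
    """
    return sum(1 for bls in conserved.get(word, {}).values() if bls >= threshold)


def background_count(word, conserved, threshold, n_samples=1000, seed=0):
    """Median family count over random members of the word's permutation group."""
    rng = random.Random(seed)
    letters = list(word)
    counts = []
    for _ in range(n_samples):
        rng.shuffle(letters)  # random word with the same length and base content
        counts.append(family_count("".join(letters), conserved, threshold))
    return median(counts)


def confidence(f, f_bg):
    """Assumed confidence score: share of the count not explained by background."""
    return 0.0 if f == 0 else 1.0 - f_bg / f


def is_genome_wide_conserved(word, conserved, thresholds, f_min, c_min):
    """Retain the word if both cutoffs are reached for any BLS threshold."""
    for t in thresholds:
        f = family_count(word, conserved, t)
        c = confidence(f, background_count(word, conserved, t))
        if f >= f_min and c >= c_min:
            return True
    return False
```

Sampling random members of the permutation group keeps the background comparison restricted to words of equal length and base content, which is the property the caption attributes to the sorting phase.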
Fig. 2. Number of genome-wide conserved motifs for both alignment-based and alignment-free discovery, for different values of the two cutoffs (conserved family count and confidence score) and for different subsets of the six BLS thresholds T. Top number: real Monocot dataset; bottom number in brackets: random dataset (zeroth-order Markov model). The colors represent the false discovery rate (see legend).
Overlap between conserved genomic regions as identified by BLSSpeller and experimentally profiled open chromatin regions in rice and transcription factor binding sites inferred through protein-binding microarrays in rice and maize
Overlap with experimentally profiled open chromatin regions (OCR) in rice

| BLSSpeller thresholds | No. of conserved regions | No. of OCR regions | No. of conserved regions within OCR regions | No. of rand. conserved regions within OCR regions | Enrichment fold |
|---|---|---|---|---|---|
| BLS | 754 205 | 77 247 | 121 026 | 40 277 | 3.005 |
| BLS | 464 229 | 77 247 | 98 681 | 25 996 | 3.796 |
Regions are required to fully overlap in order to be scored.
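The overlap criterion and the enrichment column can be illustrated with a short Python sketch. It assumes that "fully overlap" means a conserved region lies entirely inside an OCR region, and that the enrichment fold is the ratio of real to random conserved regions within OCR regions; the second assumption reproduces both values in the table. The function names and the `(chromosome, start, end)` tuple layout are hypothetical.

```python
def fully_within(region, container):
    """True if `region` lies entirely inside `container` (same chromosome, closed intervals)."""
    chrom_r, start_r, end_r = region
    chrom_c, start_c, end_c = container
    return chrom_r == chrom_c and start_c <= start_r and end_r <= end_c


def count_within(regions, containers):
    """Count regions fully contained in at least one container.

    Quadratic scan: adequate for a sketch, too slow for genome-scale data.
    """
    return sum(any(fully_within(r, c) for c in containers) for r in regions)


def enrichment_fold(n_real_within, n_random_within):
    """Ratio of real to random conserved regions inside OCR regions (assumed definition)."""
    return n_real_within / n_random_within


print(round(enrichment_fold(121026, 40277), 3))  # 3.005
print(round(enrichment_fold(98681, 25996), 3))   # 3.796
```

For real data, an interval tree or a sorted sweep would replace the nested scan in `count_within`.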
List of genome-wide conserved ga2ox1-like KN1 motif variants identified by BLSSpeller using both alignment-free (AF) and alignment-based (AB) discovery
AF: alignment-free discovery; AB: alignment-based discovery.

| KN1 motif variant (AF) | Gene families (AF) | Confidence (AF) | Maize genes (AF) | Profiled maize genes (AF) | KN1 motif variant (AB) | Gene families (AB) | Confidence (AB) | Maize genes (AB) | Profiled maize genes (AB) |
|---|---|---|---|---|---|---|---|---|---|
| TGATNGATKGAY | 59 | 0.93 | 75 | 24 | TGATNGAYGGAY | 11 | 0.91 | 10 | 3 |
| TGATNGAYKGAT | 59 | 0.93 | 74 | 20 | TGATNGATKGAY | 11 | 0.82 | 11 | 3 |
| TGAYNGATKGAT | 54 | 0.93 | 68 | 21 | TGAYNGACKGAC | 10 | 0.90 | 11 | 3 |
| TGATNGAYWGAT | 40 | 0.88 | 50 | 11 | TGAYGGAYGGAY | 9 | 1.00 | 9 | 3 |
| TGAYNGAYTGAT | 36 | 0.89 | 48 | 11 | TGATNGAYRGAT | 9 | 0.89 | 10 | 3 |
| TGAYTGAYTGAY | 33 | 0.97 | 42 | 9 | TGAYNGAYTGAC | 8 | 0.88 | 9 | 2 |
| TGATNGAYTGAY | 32 | 0.88 | 40 | 7 | TGACNGAYTGAY | 8 | 0.88 | 10 | 3 |
| TGAYNGATWGAT | 31 | 0.84 | 42 | 12 | TGACNGACWGAY | 7 | 0.86 | 7 | 2 |
| TGATNGATWGAY | 30 | 0.83 | 36 | 9 | TGACAGAYRGAY | 3 | 1.00 | 4 | 0 |
| TGATNGATRGAY | 29 | 0.86 | 39 | 9 | | | | | |
| TGAYNGATRGAT | 27 | 0.85 | 37 | 9 | | | | | |
| TGATNGAYRGAT | 26 | 0.85 | 35 | 8 | | | | | |
| TGAYNGATTGAY | 25 | 0.84 | 34 | 7 | | | | | |
| TGAYNGATGGAY | 24 | 0.88 | 35 | 9 | | | | | |
| TGATNGAYGGAY | 24 | 0.88 | 31 | 8 | | | | | |
| TGAYTGAYWGAT | 22 | 0.91 | 27 | 6 | | | | | |
| TGAYNGACTGAY | 22 | 0.91 | 28 | 9 | | | | | |
| TGAYNGAYTGAC | 21 | 0.90 | 27 | 8 | | | | | |
| TGAYNGACKGAC | 20 | 0.90 | 25 | 10 | | | | | |
| Union (all variants) | 165 | – | 213 | 51 | Union (all variants) | 37 | – | 41 | 10 |
For each discovery mode, the columns after the motif variant give, in order: the number of gene families in which the motif is conserved, the corresponding confidence score, the number of maize genes contained in those gene families, and the intersection with experimentally profiled maize genes.
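The motif variants in the table are written in the IUPAC degenerate nucleotide alphabet (for example N for any base, K for G or T, Y for C or T). As an illustration of how such a variant can be matched against a promoter sequence on either strand, here is a minimal Python sketch based on regular expressions. It is not how BLSSpeller enumerates words; the function names and the example sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. K = G/T pairs with M = A/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of a (possibly degenerate) IUPAC motif."""
    return motif.translate(COMPLEMENT)[::-1]


def to_regex(motif):
    """Regular-expression source matching every instance of the motif."""
    return "".join(IUPAC[base] for base in motif)


def find_sites(motif, sequence):
    """0-based start positions of motif instances on either strand.

    A lookahead is used so that overlapping instances are all reported.
    """
    hits = set()
    for m in (motif, reverse_complement(motif)):
        pattern = re.compile("(?=" + to_regex(m) + ")")
        hits.update(match.start() for match in pattern.finditer(sequence))
    return sorted(hits)


print(find_sites("TGATNGATKGAY", "CCTGATAGATGGACTT"))  # [2]
```

Searching with the reverse complement of the motif is equivalent to scanning the opposite strand, which matches the orientation-independent counting described for the alignment-free mode.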