Henry R Bigelow, Adam S Wenick, Allan Wong, Oliver Hobert.
Abstract
BACKGROUND: All known genomes code for a large number of transcription factors. It is important to develop methods that will reveal how these transcription factors act on a genome-wide level, that is, through what target genes they exert their function.
Year: 2004 PMID: 15113408 PMCID: PMC406492 DOI: 10.1186/1471-2105-5-27
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow chart of program pipeline. Information is shown as rectangles, procedures as ovals. The only user-defined inputs are the Transcription Factor Binding Site Alignment file and the number of hits to retrieve. All other input files are downloaded from sources mentioned in the text.
Figure 2. Screenshot of the Web Interface. The address is: . The program will eventually be run by WormBase at .
Figure 3. Classification of non-exonic regions. A hypothetical gene arrangement is shown. "5' intergenic": between the first exons of two separate genes; "3' intergenic": between the last exons of two separate genes; "5'/3' intergenic": between the first exon of one gene and the last exon of the other gene; "intronic#": between any two exons of one gene; "other": all other possible combinations. In cases where the gene flanking a segment is known to exhibit alternative splicing, the segment was prefixed with 'alt_', e.g. 'alt_intronic#', 'alt_3'intergenic', etc. Two other categories, BEGIN and END, denote regions at the beginning or end of the chromosome, in the case of C. elegans, or of the sequencing reads, in the case of C. briggsae. There were two exceptions to the procedure. First, the C. briggsae genome we used was an unassembled collection of 578 individual sequence reads. 112 of these reads had no exon annotations and were ignored in this study. These 112 reads had an average length of 3679.3 nucleotides, and only two were longer than 10,000 bases. Second, there were 16 C. elegans and 35 C. briggsae exon annotations one nucleotide long. By visual inspection, we determined that for C. elegans these exons were in fact longer than one nucleotide, but noncoding: in all cases the single nucleotide is 'A' and, when spliced, forms a TGA stop codon. They were treated as non-existent for this study, which has very little effect on the procedure except that the last true intron of the gene is considered its 3' region. For C. briggsae, they appear to be errors in the gene annotations and fall within introns. Thus, they were treated as part of the intron in which they occur.
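The classification rules in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `FlankingExon` record, the `classify_segment` function, the `role` values, and the exact label spellings are all hypothetical. It assumes that the caller has already decided whether each flanking exon is the first, last, or an internal exon of its gene, that '#' in "intronic#" stands for an intron number, and that the 'alt_' prefix applies when either flanking gene is alternatively spliced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlankingExon:
    """Hypothetical record for the exon bordering a non-exonic segment."""
    gene: str                  # gene identifier
    role: str                  # "first", "last", or "internal" (transcript order)
    alt_spliced: bool = False  # gene is known to be alternatively spliced


def classify_segment(left: Optional[FlankingExon],
                     right: Optional[FlankingExon],
                     intron_number: Optional[int] = None) -> str:
    """Label a non-exonic segment from its two flanking exons.

    `left`/`right` are None when the segment touches the start/end of the
    chromosome (C. elegans) or of the sequence read (C. briggsae).
    """
    if left is None and right is None:
        # Reads without any exon annotation were ignored in the study.
        raise ValueError("segment has no flanking exon")
    if left is None:
        return "BEGIN"
    if right is None:
        return "END"

    if left.gene == right.gene:
        # Assumption: '#' in "intronic#" is the intron number.
        label = "intronic" + ("" if intron_number is None else str(intron_number))
    else:
        roles = {left.role, right.role}
        if roles == {"first"}:
            label = "5'intergenic"
        elif roles == {"last"}:
            label = "3'intergenic"
        elif roles == {"first", "last"}:
            label = "5'/3'intergenic"
        else:
            label = "other"

    # Assumption: either flanking gene being alternatively spliced
    # is enough to add the prefix.
    if left.alt_spliced or right.alt_spliced:
        label = "alt_" + label
    return label
```

For example, `classify_segment(FlankingExon("geneA", "first"), FlankingExon("geneB", "last"))` yields the 5'/3' intergenic label, while passing the same gene on both sides yields an intronic label.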
Figure 4. Output of the program pipeline. Hits of a search with the TTX-3 consensus binding site are shown. num: number in list; mis: number of base mismatches between the first C. elegans and first C. briggsae hits; segtype: type of non-exonic region (see Figure 3); str1/2: negative (N) or positive (P) strand on which the first/second of the two genes that flank the identified target site is located; offset1/2: distance of the target site to the flanking gene(s) (in relation to the start codon if the target site is 5' or located in an intron; in relation to the stop codon if the site is 3' to the gene; in the latter two cases, the number has a positive value); ID: cosmid name of the flanking genes; name: flanking gene names (if available). Gene IDs/names are linked to the WormBase gene model at , which contains further information about the gene. If multiple target sites are located in a defined inter/intragenic region, there is an option to report the n highest-scoring hits for each ortholog. If this option is used, the top-scoring C. elegans or C. briggsae hit in each hit-pair is highlighted, and the next n-1 hits are shown in gray. Color coding: Orthologous C. elegans/C. briggsae genes ("hit-pairs") are color coded in blue (Y39A3B.5 and CBG15122 are orthologs) and green (M01E10.2 and CBG15118 are orthologs).
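The "mis" column can be illustrated with a short sketch. It assumes that the two hits have equal length and are compared position by position on the same strand, which the caption does not state explicitly; the function name `count_mismatches` and the example sequences are made up for illustration and are not taken from the program or from the TTX-3 site.

```python
def count_mismatches(hit_a: str, hit_b: str) -> int:
    """Count positions at which two equal-length site sequences differ."""
    if len(hit_a) != len(hit_b):
        raise ValueError("hits must have the same length")
    # Compare case-insensitively, base by base.
    return sum(a != b for a, b in zip(hit_a.upper(), hit_b.upper()))


# Hypothetical sequences, for illustration only.
elegans_hit = "AATTGGCT"
briggsae_hit = "AATTAGCT"
mis = count_mismatches(elegans_hit, briggsae_hit)
```

Here the two made-up sequences differ at a single position, so `mis` would be reported as 1.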