Scott Christley, Neil F Lobo, Greg Madey.
Abstract
BACKGROUND: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) in the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with microRNAs, mRNA processing, development, and transcription regulation. Identifying and characterizing these elements across genomes is necessary to further understand their function.
Year: 2008 PMID: 18186941 PMCID: PMC2244594 DOI: 10.1186/1471-2105-9-15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Workflow for Computing Maximal Common Prefixes between Two Organisms. Computing the maximal common prefixes (MCPs) between two organisms involves four steps: preparing the raw genome data as a set of sequence files in FASTA format, generating suffix array data structures from the sequence files, computing the MCPs in a pairwise fashion with the suffix arrays, and performing the union of the pairwise files to produce a single MCP result file.
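The pairwise MCP step can be illustrated with a minimal Python sketch. It assumes that the MCP of a suffix of organism A is the length of the longest prefix it shares with any suffix of organism B; that definition, the function names (`suffix_array`, `lcp_len`, `maximal_common_prefixes`), and the naive suffix sort are all assumptions made for illustration, not the authors' implementation. It relies on the fact that, in a sorted list of suffixes, the best match for a query string is always one of the two neighbors of its insertion point.

```python
from bisect import bisect_left


def suffix_array(seq):
    """Offsets of all suffixes of seq, in lexicographic order (naive sort)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def lcp_len(a, i, b, j):
    """Length of the common prefix of a[i:] and b[j:]."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def maximal_common_prefixes(a, b):
    """For every offset i in a, the longest prefix a[i:] shares with any suffix of b.

    Hypothetical helper: the suffixes of b are materialized as strings for
    clarity, which is only practical for short sequences.
    """
    sa_b = suffix_array(b)
    suffixes_b = [b[j:] for j in sa_b]
    mcp = [0] * len(a)
    for i in range(len(a)):
        pos = bisect_left(suffixes_b, a[i:])
        # Only the sorted neighbors of the insertion point can be the best match.
        for k in (pos - 1, pos):
            if 0 <= k < len(sa_b):
                mcp[i] = max(mcp[i], lcp_len(a, i, b, sa_b[k]))
    return mcp


if __name__ == "__main__":
    print(maximal_common_prefixes("ACGT", "CGTA"))
```

A genome-scale implementation would keep only offsets into the original sequence and walk both suffix arrays together rather than materializing suffix strings.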
Figure 2. Tournament-style Intersection for Computing Maximal Common Prefixes between Multiple Organisms. A generic workflow for intersecting the MCPs tournament-style to produce the sequences common to any number of organisms. The number of MCPs typically falls below the total number of suffixes as more organisms are intersected, so the later stages of the workflow execute faster than the earlier ones. Because the output of each stage can be fed directly into the next stage without additional processing, this generic workflow is easily adapted to more automated, specialized processes, for example comparing all the organisms within a clade of a phylogenetic tree. Trimming the file is a separate step needed only for reporting final results; it does not alter the MCP file, which can therefore still be used in subsequent stages of the workflow.
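The tournament structure can be sketched in Python under some loudly stated assumptions: each MCP result is modeled as a dictionary mapping an offset in a shared reference genome to a match length, two results are intersected by keeping the offsets present in both with the shorter of the two lengths, and matches below a hypothetical `min_len` threshold are dropped. The data layout and the names `intersect` and `tournament` are illustrative and are not taken from the paper.

```python
def intersect(m1, m2, min_len=1):
    """Keep offsets present in both results, with the shorter match length."""
    out = {}
    for pos, n1 in m1.items():
        n2 = m2.get(pos)
        if n2 is not None:
            n = min(n1, n2)
            if n >= min_len:
                out[pos] = n
    return out


def tournament(results, min_len=1):
    """Intersect MCP results pairwise, round by round, until one remains."""
    if not results:
        raise ValueError("need at least one MCP result")
    current = list(results)
    while len(current) > 1:
        # Pair up neighbors; an unpaired last entry advances to the next round.
        nxt = [
            intersect(current[k], current[k + 1], min_len)
            for k in range(0, len(current) - 1, 2)
        ]
        if len(current) % 2:
            nxt.append(current[-1])
        current = nxt
    return current[0]


if __name__ == "__main__":
    # Hypothetical MCP results of one reference genome against three others.
    results = [{0: 5, 3: 10, 7: 4}, {0: 6, 3: 8}, {3: 12, 7: 9}]
    print(tournament(results))
```

Since each round's output has the same shape as its input, any intermediate result can be passed on to a further round, which mirrors the caption's point that stages chain without extra processing.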
Figure 3. Suffix Array from Sequence. A sequence of length N produces N suffix strings, which are sorted in the suffix array data structure, as shown with this 11 bp sequence and its corresponding suffix array. The suffix strings do not have to be stored separately; instead, character positions (offsets) refer back to the original sequence. The offsets are the numbers to the right of each suffix.
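As a minimal Python sketch of the idea in this caption, the suffix array below stores only offsets into the sequence and sorts them by the suffix each one refers to. The 11 bp example sequence is hypothetical (it is not the sequence shown in the figure), and the naive sort is chosen for readability; it is far too slow for whole genomes, where a dedicated construction algorithm would be used.

```python
def suffix_array(seq):
    """Return the offsets of all suffixes of seq in lexicographic order.

    Only integer offsets are stored; each suffix is recovered as seq[offset:].
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


if __name__ == "__main__":
    seq = "GATTACAGATT"  # hypothetical 11 bp sequence
    for offset in suffix_array(seq):
        # Print each suffix with its offset to the right, as in the figure layout.
        print(f"{seq[offset:]:<12}{offset}")
```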
Vertebrate Genomes
| Vertebrate Genome | Assembly | Size |
| Human | Mar. 2006 (hg18) | 3000 Mbp |
| Chimp | Mar. 2006 (panTro2) | 3100 Mbp |
| Rhesus | Jan. 2006 (rheMac2) | 2800 Mbp |
| Chicken | May 2006 (galGal3) | 1200 Mbp |
| Mouse | Mar. 2006 (mm8) | 2500 Mbp |
| Rat | Nov. 2004 (rn4) | 2800 Mbp |
| Cow | May 2005 (bosTau2) | 3000 Mbp |
| Dog | May 2005 (canFam2) | 2400 Mbp |
| Armadillo | May 2005 (dasNov1) | 3000 Mbp |
| Elephant | May 2005 (loxAfr1) | 3000 Mbp |
| Opossum | Jan. 2006 (monDom4) | 3400 Mbp |
| Fugu | Aug. 2002 (fr1) | 350 Mbp |
| Rabbit | May 2005 (oryCun1) | 3500 Mbp |
| Zebrafish | Mar. 2006 (danRer4) | 1700 Mbp |
| Tetraodon | Feb. 2004 (tetNig1) | 380 Mbp |
| X. tropicalis | Aug. 2005 (xenTro2) | 1700 Mbp |
| Tenrec | July 2005 (echTel1) | 3800 Mbp |
All vertebrate genomes with the given assembly data were downloaded from the UCSC Genome Bioinformatics Site [33]. Sizes are taken from the NCBI Genome Project web pages or approximated from the downloaded assembly data.