Sung Tae Doh, Yunyu Zhang, Matthew H Temple, Li Cai.
Abstract
BACKGROUND: Completion of the human genome sequence, along with the genomes of other species, allows for a greater understanding of the biochemical mechanisms and processes that govern healthy as well as diseased states. The large size of the genome sequences has made them difficult to study using traditional methods. Many studies focus on protein-coding sequences; however, not much is known about the function of the non-coding regions of the genome. It has been demonstrated that parts of the non-coding region play a critical role as gene regulatory elements. Enhancers that regulate transcription processes have been found in intergenic regions. Furthermore, regulatory elements found in non-coding regions have been observed to be highly conserved across different species. However, the analysis of these regulatory elements is not as straightforward as it may first seem. The development of a centralized resource that allows for the quick and easy retrieval of non-coding sequences from multiple species and is capable of handling multi-gene queries is critical for the analysis of non-coding sequences. Here we describe the development of a web-based non-coding sequence retrieval system.
Year: 2007 PMID: 17362514 PMCID: PMC1838437 DOI: 10.1186/1471-2105-8-94
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Statistics of gene annotation for ENSEMBL and NCBI.
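Ensembl:
| Assembly | Date | Version | Known | Novel | Total Predictions |
| --- | --- | --- | --- | --- | --- |
| NCBI 36 | Aug 2006 | 41.36c | 22205 | 1019 | 69185 |
| NCBI m36 | Apr 2006 | 41.36b | 21839 | 2599 | 71259 |
| WASHUC 1 | Dec 2005 | 41.1p | 5123 | 5417 | 76146 |

NCBI:
| Assembly | Date | | Entrez Genes | Total Unigene Clusters |
| --- | --- | --- | --- | --- |
| NCBI 36 | Oct 2006 | 197 | 38597 | 85590 |
| NCBI m36 | Oct 2006 | 159 | 60745 | 64618 |
| WASHUC 1 | Aug 2006 | 31 | 24313 | 30837 |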
Known – genes that have species-specific protein sequences already available in the public sequence databases. Novel – genes that could not be mapped with confidence to existing entries. Total Predictions – the number of 'known', 'novel' and 'pseudogenes' predicted by the Ensembl analysis and annotation pipeline.
Entrez Genes – number of genes defined by sequence and/or located in the NCBI Map Viewer. Total Unigene Clusters – the number of non-redundant sets of gene-oriented clusters automatically partitioned by UniGene.
Statistics of homology prediction for human, mouse, and chicken
Homolog data table files:
| - | 13049/46.7% | 9839/50.7% |
| 12036/38.6% | - | 11698/60.3% |
| 11773/37.7% | 12187/43.6% | - |
| 31206 | 27964 | 19399 |

homologene.data:
| - | 16325/73.0% | 10498/84.0% |
| 16325/41.2% | - | 10299/83.3% |
| 10498/26.5% | 10299/46.6% | - |
| 39605 | 22364 | 12500 |
The homolog data table files for each of the baseline species were queried to find the total number of genes along with the number of homologous genes that are present in another given species' genome. Similarly, the homologene.data file was used to generate the HomoloGene statistics. Shown are the number of homologs and the percentage of coverage (the number of genes that have homologs in a particular species' genome divided by the total number of genes for the baseline species).
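The coverage figure defined in this legend is a simple ratio, and a minimal Python sketch can make the arithmetic explicit. The function name `coverage` is an illustrative assumption and is not part of the NCSRS; the input numbers are copied from the table above.

```python
def coverage(homolog_count: int, baseline_total: int) -> float:
    """Percentage of baseline-species genes that have a homolog in another species."""
    if baseline_total <= 0:
        raise ValueError("baseline_total must be positive")
    return 100.0 * homolog_count / baseline_total


# Values taken from the table: 13049 homologs against a baseline total of 27964 genes.
print(f"13049/{coverage(13049, 27964):.1f}%")  # prints "13049/46.7%"
```

The same call reproduces the other cells, for example `coverage(16325, 22364)` rounds to 73.0.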
Analysis of known and predicted genes for chicken, rat, mouse, and human from Ensembl Mart v.41
| 2726 | 24939 | 10.93% | 1 | 24910 | 0.00% |
| 9119 | 37825 | 24.11% | 9731 | 38778 | 25.09% |
| 21336 | 36898 | 57.82% | 16931 | 46566 | 36.36% |
| 29836 | 62076 | 48.06% | 9849 | 63575 | 15.49% |

| 2727 | 49849 | 5.47% |
| 18850 | 76603 | 24.61% |
| 38267 | 83464 | 45.85% |
| 39685 | 125651 | 31.58% |
Figure 1. Snapshot of the web-based user interface for the NCSRS. The user interface allows the user to input the HUGO (Human Genome Organization) ID, i.e., Entrez gene ID (LocusLink ID), Gene Symbol, and Ensembl ID numbers, and to set the other search options.
Figure 2. Workflow diagram of the NCSRS. The Refseq annotation uses Entrez gene IDs as the database key, while Ensembl uses gene stable IDs. The input ID is converted into the appropriate database key if necessary. Entrez gene IDs are used directly for the Refseq annotation but are converted to gene stable IDs for the Ensembl annotation. Gene symbols are translated into Entrez gene IDs and gene stable IDs. Once the database keys are acquired, the homologous genes can be identified using the available homology databases if the "pull ortholog" option is activated. The database key is then used to access the mapping information that has been compiled from the annotation data. The mapping information is then used to locate the relevant sequences. These sequences are extracted and copied to a new ".fa" file in FASTA sequence format, and the annotation information about the exons is written to the ".exon" file. Thus, for each requested gene, there is one pair of files for each genome.
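The workflow in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NCSRS implementation: every table, identifier, genome string, and function name below is hypothetical, only the Ensembl-style path (conversion to a gene stable ID) is shown, and the tab-separated layout of the ".exon" file is assumed.

```python
from typing import Dict

# Toy lookup tables (all hypothetical, for illustration only).
SYMBOL_TO_ENTREZ = {"GENEA": "1001"}
ENTREZ_TO_STABLE = {"1001": "ENSG_TOY_1"}
# database key -> list of (species, homologous database key)
HOMOLOGS = {"ENSG_TOY_1": [("mouse", "ENSMUSG_TOY_1")]}
# database key -> (species, chromosome, start, end, exon intervals), 0-based half-open
MAPPING = {
    "ENSG_TOY_1": ("human", "chr1", 2, 14, [(4, 7), (10, 12)]),
    "ENSMUSG_TOY_1": ("mouse", "chr1", 0, 10, [(3, 6)]),
}
GENOMES = {
    "human": {"chr1": "AACCGGTTAACCGGTT"},
    "mouse": {"chr1": "TTGGCCAATTGG"},
}


def to_stable_id(query: str) -> str:
    """Convert a gene symbol, Entrez gene ID, or stable ID into a gene stable ID."""
    if query.startswith("ENS"):
        return query  # already a stable ID, no conversion needed
    entrez = query if query.isdigit() else SYMBOL_TO_ENTREZ[query]
    return ENTREZ_TO_STABLE[entrez]


def retrieve(query: str, pull_ortholog: bool = False) -> Dict[str, str]:
    """Return {filename: content} with one .fa/.exon pair per genome."""
    key = to_stable_id(query)
    keys = [key]
    if pull_ortholog:
        keys += [homolog for _species, homolog in HOMOLOGS.get(key, [])]
    files: Dict[str, str] = {}
    for k in keys:
        species, chrom, start, end, exons = MAPPING[k]
        sequence = GENOMES[species][chrom][start:end]
        # FASTA record: a header line starting with ">" followed by the sequence.
        files[f"{k}_{species}.fa"] = f">{k} {species} {chrom}:{start}-{end}\n{sequence}\n"
        # Assumed exon layout: one "start<TAB>end" line per exon.
        files[f"{k}_{species}.exon"] = "".join(f"{s}\t{e}\n" for s, e in exons)
    return files


for name in sorted(retrieve("GENEA", pull_ortholog=True)):
    print(name)
```

With the "pull ortholog" flag set, the toy query yields two file pairs, one per genome, mirroring the last sentence of the caption.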
Figure 3. An example webpage that displays the results for the NCSRS. The sequences and annotation information written to the FA and EXON files, respectively, are bundled and zipped into a single file that can be accessed by the "Download Results!" link. A table with links to NCBI, the UCSC genome browser, and Ensembl for the gene and specific species is also provided.
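The bundling step described in Figure 3 could look roughly like the following sketch, which uses Python's standard `zipfile` module. The function name `bundle_results` and the demo file names and contents are assumptions for illustration; the source does not say how the NCSRS builds its archive.

```python
import io
import zipfile
from typing import Dict


def bundle_results(files: Dict[str, str]) -> bytes:
    """Zip {filename: text} into a single in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in sorted(files.items()):
            archive.writestr(name, text)
    return buffer.getvalue()


# Hypothetical result files for one gene in one genome.
demo = {"GENEA_human.fa": ">GENEA human\nACGT\n", "GENEA_human.exon": "0\t4\n"}
payload = bundle_results(demo)

# Reopen the archive to confirm both files were bundled.
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    print(archive.namelist())  # ['GENEA_human.exon', 'GENEA_human.fa']
```

Building the archive in memory keeps the sketch free of temporary files; a web application would typically serve `payload` as the download behind a link such as "Download Results!".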