Zhenyu Xuan, Fang Zhao, Jinhua Wang, Gengxin Chen, Michael Q Zhang.
Abstract
Large-scale and high-throughput genomics research needs reliable and comprehensive genome-wide promoter annotation resources. We have conducted a systematic investigation on how to improve mammalian promoter prediction by incorporating both transcript and conservation information. This enabled us to build a better multispecies promoter annotation pipeline and hence to create CSHLmpd (Cold Spring Harbor Laboratory Mammalian Promoter Database) for the biomedical research community, which can act as a starting reference system for more refined functional annotations.
Year: 2005 PMID: 16086854 PMCID: PMC1273639 DOI: 10.1186/gb-2005-6-8-r72
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Number of genes and transcripts of different types in the three mammalian genomes
| Type | HSPD Gene* | HSPD Transcript† | MMPD Gene | MMPD Transcript | RNPD Gene | RNPD Transcript |
| RefSeq | 17,354 | 22,425 | 16,329 | 17,438 | 6,400 | 6,807 |
| mRNA | 8,846 | 106,279 | 2,641 | 40,552 | 1,967 | 11,116 |
| Ensembl | 3,160 | 33,653 | 6,601 | 31,022 | 14,276 | 27,989 |
| RefSeq_XM | 2,400 | 6,105 | 4,974 | 5,829 | 3,021 | 15,023 |
| TwinScan | 3,189 | 25,633 | 4,528 | 25,583 | 5,015 | 25,499 |
| EST | 0 | 4,488,530 | 0 | 3,254,853 | 0 | 477,321 |
| Total | 34,949 | 4,682,625 | 35,073 | 3,375,277 | 30,679 | 563,755 |
*Number of genes in non-overlapping gene types. †Number of all transcripts of this type.
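Because the gene counts refer to non-overlapping gene types, each column of the table should add up to its Total row. The following is a minimal Python sketch of that check, with the counts transcribed from the table above; the dictionary layout and the function name `column_totals` are illustrative only and are not part of CSHLmpd.

```python
# (genes, transcripts) per transcript type, transcribed from the table above.
COUNTS = {
    "HSPD": {
        "RefSeq": (17354, 22425), "mRNA": (8846, 106279),
        "Ensembl": (3160, 33653), "RefSeq_XM": (2400, 6105),
        "TwinScan": (3189, 25633), "EST": (0, 4488530),
    },
    "MMPD": {
        "RefSeq": (16329, 17438), "mRNA": (2641, 40552),
        "Ensembl": (6601, 31022), "RefSeq_XM": (4974, 5829),
        "TwinScan": (4528, 25583), "EST": (0, 3254853),
    },
    "RNPD": {
        "RefSeq": (6400, 6807), "mRNA": (1967, 11116),
        "Ensembl": (14276, 27989), "RefSeq_XM": (3021, 15023),
        "TwinScan": (5015, 25499), "EST": (0, 477321),
    },
}

# "Total" row of the same table.
REPORTED_TOTALS = {
    "HSPD": (34949, 4682625),
    "MMPD": (35073, 3375277),
    "RNPD": (30679, 563755),
}


def column_totals(per_type):
    """Sum gene and transcript counts over all transcript types."""
    genes = sum(g for g, _ in per_type.values())
    transcripts = sum(t for _, t in per_type.values())
    return genes, transcripts


for genome, per_type in COUNTS.items():
    assert column_totals(per_type) == REPORTED_TOTALS[genome], genome
```

The EST rows contribute no genes of their own, so only the five other types enter the gene totals.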
Figure 1. Distribution of conservation scores in promoter alignments. (a) Pairwise promoter alignments of human-rodent and mouse-rat non-orthologous genes (control set II) with different promoter GC content. (b) Pairwise promoter alignments of most conserved promoter pairs and randomly selected 1 kb sequence pairs (control set I). (c) Alignments of mouse-rat and human-rodent homologous promoter pairs. (d) Three-way promoter alignments of homologous promoter triplets and sequence triplets from control set II.
Figure 2. Flowchart of the pipeline to construct the promoter database. Ovals indicate data and rectangles indicate methods. Ovals shaded gray represent the data stored in CSHLmpd.
Sensitivity and specificity of promoter prediction with different methods
| Method | Sensitivity | Specificity |
| Method 0* | 72% | 46% |
| Method 1† | 67% | 46% |
| Method 2‡ | 70% | 57% |
| Method 3§ | 69% | 66% |
| Method 1 + script¶ | 64% | 33% |
| Method 2 + script | 67% | 44% |
| Method 3 + script | 66% | 60% |
| Method 1 + script | 80% | 37% |
| Method 2 + script | 84% | 46% |
| Method 3 + script | 82% | 69% |
*Method 0 used original FirstEF alone to predict promoters in the upstream and genic regions of these genes. †Method 1 used de novo FirstEF to predict promoters in the upstream and genic regions of these genes. ‡Method 2 compared mRNAs or predicted transcripts with original FirstEF predictions to filter out promoters that were neither located upstream of the gene region nor overlapping the 5' end of any transcript of that gene. §Method 3 first tried to find promoters in a gene that have homologous rodent promoters. If no such promoters were found, it used Method 2 to select promoters for that gene. ¶Script: a post-clustering script that selects, from the output of each method described above, representative TSSs that are at least 500 bp apart (see Materials and methods for details).
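The Method 2 filter described in the footnote can be sketched as follows. This is a hedged illustration, not the pipeline's actual code: the `Interval` and `Gene` types, the function names, and the `upstream_window` parameter are hypothetical, and the size of the upstream region is not stated in this excerpt, so the value used in the example is arbitrary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    start: int  # 0-based, inclusive
    end: int    # exclusive


@dataclass(frozen=True)
class Gene:
    strand: str                    # "+" or "-"
    region: Interval               # genomic span of the gene
    five_prime_ends: Tuple[int, ...]  # 5' end of each supporting transcript


def upstream_region(gene: Gene, upstream_window: int) -> Interval:
    """Region immediately upstream of the gene, relative to its strand."""
    if gene.strand == "+":
        return Interval(max(0, gene.region.start - upstream_window),
                        gene.region.start)
    return Interval(gene.region.end, gene.region.end + upstream_window)


def overlaps(a: Interval, b: Interval) -> bool:
    return a.start < b.end and b.start < a.end


def keep_prediction(pred: Interval, gene: Gene, upstream_window: int) -> bool:
    """Keep a predicted promoter if it lies upstream of the gene region
    or covers the 5' end of at least one transcript of the gene."""
    if overlaps(pred, upstream_region(gene, upstream_window)):
        return True
    return any(pred.start <= p < pred.end for p in gene.five_prime_ends)


# Toy example on the plus strand; the 2,000 bp window is an assumption.
gene = Gene("+", Interval(10_000, 20_000), (10_000, 14_000))
predictions = [Interval(9_500, 10_100),    # overlaps the upstream region
               Interval(13_900, 14_200),   # covers a transcript 5' end
               Interval(17_000, 17_500)]   # supported by neither
kept = [p for p in predictions if keep_prediction(p, gene, 2_000)]
```

In this toy case the first two predictions are kept and the third is filtered out.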
Figure 3. Sensitivity and specificity of promoter prediction for CpG-island related and non-CpG-island related promoters in different gene sets. (a) 5,893 human genes with homologous rodent promoters. (b) All 8,949 human genes in the test set. The methods are defined in the text and in Materials and methods.
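How sensitivity and specificity figures of this kind can be computed is sketched below. The exact definitions used for the table and figure are given in the paper's Materials and methods, not in this excerpt, so the sketch assumes a convention common in promoter-prediction work: sensitivity is the fraction of known TSSs with a prediction nearby, and specificity is the fraction of predictions that fall near a known TSS. The function name `evaluate` and the `tolerance` parameter are hypothetical.

```python
def evaluate(known, predicted, tolerance):
    """Return (sensitivity, specificity) for TSS positions on one strand.

    Assumed definitions (not confirmed by the source):
      sensitivity = known TSSs with a prediction within `tolerance` bp
                    / all known TSSs
      specificity = predictions within `tolerance` bp of a known TSS
                    / all predictions
    """
    hit_known = sum(
        any(abs(k - p) <= tolerance for p in predicted) for k in known
    )
    hit_pred = sum(
        any(abs(k - p) <= tolerance for k in known) for p in predicted
    )
    sensitivity = hit_known / len(known) if known else 0.0
    specificity = hit_pred / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data: two of four known TSSs are recovered, two of three
# predictions are supported.
sens, spec = evaluate(
    known=[1000, 5000, 9000, 12000],
    predicted=[1100, 5200, 7000],
    tolerance=500,
)
```

Under these assumed definitions, removing unsupported predictions raises specificity without necessarily changing sensitivity, which is the pattern seen from Method 0 to Method 3 in the table.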
Statistics of promoters and genes in CSHLmpd
| Statistic | HSPD | MMPD | RNPD |
| Total genes | 34,949 | 35,073 | 30,679 |
| Known genes (RefSeq and mRNA) | 26,200 | 18,970 | 8,367 |
| Canonical genes (RefSeq, mRNA, and Ensembl) | 29,360 | 25,571 | 22,643 |
| Genes with promoters | 26,820 | 25,592 | 21,125 |
| Genes with homologous promoters | 13,432 | 14,626 | 12,302 |
| Predicted genes with promoters | 4,340 | 7,343 | 13,230 |
| Total promoters* | 55,513 | 46,207 | 37,479 |
| Known promoters | 14,314 | 8,141 | 943 |
| FirstEF predicted promoters | 39,233 | 34,994 | 34,227 |
| Transcript-supported FirstEF predicted promoters | 19,331 | 16,913 | 11,798 |
| RefSeq END promoters | 1,828 | 2,988 | 2,270 |
| Bidirectional gene promoters | 138 | 84 | 39 |
| Core promoters | 26,820 | 25,592 | 21,125 |
| Homologous promoters | 21,594 | 21,501 | 17,257 |
| Homologous known promoters | 10,561 | 6,854 | 817 |
| CpG-island related RefSeq genes | 12,259 (71%) | 9,831 (60%) | 2,987 (47%) |
| CpG-island related other mRNA genes | 2,679 (30%) | 993 (38%) | 907 (46%) |
| CpG-island related canonical genes | 15,707 (54%) | 12,293 (48%) | 8,420 (37%) |
| CpG-island related promoters | 37,572 (68%) | 24,726 (54%) | 20,826 (56%) |
| CpG-island related known promoters | 10,332 (72%) | 5,115 (63%) | 444 (47%) |
| CpG-island related predicted promoters | 26,936 (69%) | 19,363 (55%) | 20,207 (59%) |
| CpG-island related RefSeq END promoters | 187 (10%) | 201 (7%) | 153 (7%) |
| CpG-island related bidirectional gene promoters | 53 (38%) | 47 (56%) | 22 (56%) |
| CpG-island related homologous promoters | 13,974 (82%) | 11,867 (76%) | 9,372 (80%) |
*Predicted promoters were separated from other predicted or known promoters by at least 500 bp.
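One way to enforce a minimum spacing of 500 bp between selected TSSs is a greedy pass over the candidates, sketched below. The source does not say how representatives are chosen within a cluster, so ranking by a prediction score is an assumption made here for illustration; the function name `select_representatives` and the `(position, score)` tuples are likewise hypothetical, and all candidates are assumed to lie on the same chromosome and strand.

```python
def select_representatives(candidates, min_gap=500):
    """Greedily keep candidate TSSs that are at least `min_gap` bp apart.

    `candidates` is a list of (position, score) tuples. Higher-scoring
    candidates are considered first (an assumed criterion); ties are
    broken by position. The result is sorted by position.
    """
    kept = []
    for pos, score in sorted(candidates, key=lambda c: (-c[1], c[0])):
        if all(abs(pos - k) >= min_gap for k, _ in kept):
            kept.append((pos, score))
    return sorted(kept)


candidates = [(1000, 0.9), (1200, 0.95), (1800, 0.6),
              (2500, 0.7), (2600, 0.4)]
representatives = select_representatives(candidates)
```

Here the candidates at 1000 and 2600 are dropped because each lies within 500 bp of a higher-scoring neighbour, leaving the candidates at 1200, 1800, and 2500.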
Figure 4. Screenshots of the CSHLmpd user interface. (a) GBrowse for genome-wide gene and promoter display. (b) Homologous promoter search and analysis.