F A Wright, W J Lemon, W D Zhao, R Sears, D Zhuo, J P Wang, H Y Yang, T Baer, D Stredney, J Spitzner, A Stutz, R Krahe, B Yuan.
Abstract
BACKGROUND: The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena.
Year: 2001 PMID: 11516338 PMCID: PMC55322 DOI: 10.1186/gb-2001-2-7-research0025
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Table 1. Identification of exons on the genome
| Category | Database | Total records | Percent placed (%) | Total unique exons | Exons in complete ORFs | Exons in partial ORFs | Exon length (bp) | ORF length (bp) | Putative genes (non-splicing singletons) | Protein homology (Pfam hits) | CpG islands |
| Known genes | UTR-DB | 40,258 | 80 | 19,195 | 5,075 | 1,895 | 6,925,762 | 1,990,818 | 10,007 (426) | 5,701 (3,813) | 3,866 |
| Known genes | HTDB | 15,305 | 89 | 48,477 | 12,077 | 7,706 | 11,893,081 | 4,043,544 | 4,816 (148) | 2,938 (1,943) | 1,960 |
| Consensus transcripts | HINT | 87,125 | 77 | 103,817 | 47,055 | 15,061 | 23,381,024 | 10,144,988 | 20,357 (959) | 9,121 (6,453) | 7,557 |
| Consensus transcripts | EG | 62,064 | 80 | 13,085 | 5,389 | 1,904 | 4,562,954 | 1,873,723 | 4,800 (154) | 2,177 (1,679) | 2,462 |
| Consensus transcripts | THC | 84,837 | 81 | 38,806 | 15,463 | 6,671 | 12,406,081 | 5,078,661 | 8,604 (322) | 2,907 (2,026) | 3,983 |
| Transcripts | GenBank CDS | 110,222 | 81 | 41,917 | 31,626 | 1,452 | 5,303,064 | 4,299,272 | 2,634 (227) | 1,858 (1,607) | 1,178 |
| Transcripts | dbEST Human | 2,154,995 | 73 | 273,881 | 147,819 | 17,694 | 32,288,385 | 14,975,758 | 20,073 (7,136) | 5,377 (3,745) | 11,807 |
| Rodent transcripts | MINT | 92,531 | 30 | 8,284 | 5,433 | 120 | 866,046 | 780,566 | 777 | 123 (56) | 486 |
| Rodent transcripts | RINT | 37,367 | 46 | 5,600 | 3,588 | 75 | 592,788 | 546,932 | 458 | 65 (32) | 255 |
| Rodent transcripts | EMBL | 43,488 | 28 | 5,819 | 4,108 | 59 | 724,630 | 655,993 | 202 | 68 (72) | 135 |
| Protein homology | SWISS-PROT | 86,593 | 38 | 27,526 | 12,072 | 1,163 | 9,858,797 | 7,784,205 | 1,648 | 1,648 (1,244) | 158 |
| Protein homology | TrEMBL | 351,834 | 13 | 22,670 | 8,134 | 1,677 | 4,385,497 | 2,886,034 | 1,185 | 1,185 (654) | 92 |
| Protein homology | PIR | 182,106 | 16 | 4,106 | 1,175 | 383 | 1,355,644 | 764,339 | 321 | 321 (132) | 20 |
| Total | | | | 613,183 | 299,014 | 55,860 | 114,543,753 | 55,824,833 | 75,982 (9,372) | 33,489 (23,008) | 33,959 |
Exons were identified after vector screening using transcript, rodent, and protein databases. The definition of a record varies by database, while 'exons' refers to high-scoring segment pairs in BlastN comparisons to the genome (E < 10⁻¹⁵ and sequence identity >90%). Unique exons and all subsequent columns refer to placements that were possible after considering the preceding databases. Placement of rodent transcripts required evidence of splicing and sequence identity >80%. ORFs were identified with getorf [84], reporting only ORFs of at least 30 bp. Protein homology required BlastX E < 10⁻¹⁵. Pfam hits required a score >20 using hmmpfam [92]. Gene prediction programs are described in Table 2. CpG islands were identified with cpgreport [84] using standard criteria [45].
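The BlastN placement criteria in this caption amount to a threshold filter on high-scoring segment pairs. The following is a minimal sketch of that idea, not the authors' pipeline: the `Hsp` record and its field names are hypothetical, and the splicing evidence required for rodent transcripts is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    """One high-scoring segment pair (hypothetical record layout)."""
    query_id: str
    start: int            # genome start coordinate
    end: int              # genome end coordinate
    evalue: float
    pct_identity: float


def passes_threshold(hsp, max_evalue=1e-15, min_identity=90.0):
    """True if the hit satisfies E < max_evalue and identity > min_identity."""
    return hsp.evalue < max_evalue and hsp.pct_identity > min_identity


def filter_hsps(hsps, max_evalue=1e-15, min_identity=90.0):
    """Keep only the hits that pass both thresholds."""
    return [h for h in hsps if passes_threshold(h, max_evalue, min_identity)]


hits = [
    Hsp("est1", 100, 250, 1e-40, 98.5),
    Hsp("est2", 300, 420, 1e-10, 99.0),   # fails the E-value cutoff
    Hsp("est3", 500, 640, 1e-30, 85.0),   # fails the 90% identity cutoff
]

human_exons = filter_hsps(hits)                      # human transcript criteria
rodent_like = filter_hsps(hits, min_identity=80.0)   # relaxed identity, as for rodent
```

Relaxing `min_identity` to 80.0 mirrors the lower identity requirement stated for rodent transcripts; a real placement step would additionally check for evidence of splicing.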
Table 2. Genome-wide assessment of ab initio gene prediction methods
| Programs predicting the exon (Genscan, Grail, Fgenes) | Transcript-confirmed exons | Protein-supported exons | Unconfirmed exons |
| • | 25,619 | 2,890 | 45,025 |
| • | 52,644 | 14,685 | 434,409 |
| • | 7,791 | 796 | 257,676 |
| • • | 17,841 | 3,761 | 28,556 |
| • | 13,915 | 1,711 | 11,628 |
| • • | 3,990 | 450 | 49,420 |
| • • • | 53,566 | 9,871 | 20,569 |
| Total exons | 175,366 | 34,164 | 847,283 |
Three gene prediction programs, Genscan [93], Fgenes [94] and Grail1.3 [95], were used to screen individual genomic contigs. Exons consistently predicted by more than one program were merged into a unique exon index, which was then compared to transcript- and protein-based exons in complete ORFs. Transcript-confirmed exons are predicted exons that overlap transcript-based exons; protein-supported exons are predicted exons with at least strong protein homology (E < 10⁻¹⁵); unconfirmed exons are predicted exons with neither transcript overlap nor protein homology.
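The merge-and-classify procedure described in this caption can be sketched as interval operations. This is an illustrative sketch under stated assumptions: exons are inclusive `(start, end)` pairs on a single contig, any overlapping predictions are merged, and transcript overlap is checked before protein homology so that each merged exon falls into exactly one class. The function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals; ends are inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_any(exon, others):
    """True if exon shares at least one base with any interval in others."""
    start, end = exon
    return any(start <= o_end and o_start <= end for o_start, o_end in others)


def classify(predicted, transcript_exons, protein_exons):
    """Build a unique exon index and count exons per support class.

    Assumption: transcript evidence takes precedence over protein evidence.
    """
    index = merge_intervals(predicted)
    counts = {"transcript-confirmed": 0, "protein-supported": 0, "unconfirmed": 0}
    for exon in index:
        if overlaps_any(exon, transcript_exons):
            counts["transcript-confirmed"] += 1
        elif overlaps_any(exon, protein_exons):
            counts["protein-supported"] += 1
        else:
            counts["unconfirmed"] += 1
    return index, counts


# Toy predictions pooled from several programs.
predicted = [(100, 200), (150, 220), (500, 600), (900, 950)]
index, counts = classify(
    predicted,
    transcript_exons=[(180, 210)],
    protein_exons=[(590, 700)],
)
```

A quadratic overlap scan is adequate for a toy example; at genome scale a sorted sweep or interval tree would be the usual choice.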
Ontological classification of 22,339 human gene products
| Biological function | Number of transcripts | Biological process | Number of transcripts |
| Transcription factor | 958 (306) | Carbohydrate metabolism | 281 (84) |
| Translation factor | 62 (27) | Nucleotide and nucleic acid metabolism | 173 (51) |
| RNA binding | 142 (41) | DNA replication | 240 (126) |
| Ribosomal protein | 232 (130) | Transcription | 1,059 (651) |
| Cell cycle regulator | 42 (16) | RNA processing | 204 (59) |
| Structural protein | 145 (48) | Amino acid and derivative metabolism | 87 (29) |
| Cytoskeleton structural protein | 329 (181) | Protein biosynthesis | 264 (162) |
| Extracellular matrix | 361 (87) | Protein modification | 235 (88) |
| Actin binding | 66 (25) | Protein targeting | 26 (5) |
| Motor protein | 245 (77) | Protein degradation | 136 (45) |
| Chaperone | 87 (27) | Proteolysis and peptidolysis | 96 (36) |
| Enzyme | 2,664 (1,404) | Lipid metabolism | 424 (187) |
| Protein kinase | 895 (484) | Monocarbon compound metabolism | 9 (3) |
| Protein kinase inhibitor | 19 (12) | Coenzyme and prosthetic group metabolism | 92 (29) |
| Protein phosphatase | 43 (7) | Steroid compound metabolism | 40 (10) |
| Protein phosphatase inhibitor | 17 (3) | Prostaglandin metabolism | 12 (3) |
| Protease | 441 (255) | Transport | 549 (288) |
| Protease inhibitor | 92 (37) | Electron transport | 491 (273) |
| Enzyme activator | 18 (3) | Ion transport | 302 (90) |
| Enzyme inhibitor | 14 (4) | Small molecular transport | 19 (9) |
| Alkyl transfer | 17 (3) | Neurotransmitter transport | 9 (3) |
| Amide transfer | 15 (3) | Ion homeostasis | 201 (57) |
| Carbonyl transfer | 191 (38) | Organelle organization and biogenesis | 408 (254) |
| Hydroxyl transfer | 13 (6) | Nuclear organization and biogenesis | 1,380 (647) |
| Phosphoryl transfer | 823 (281) | Cytoplasm organization and biogenesis | 42 (20) |
| Oxireduction | 148 (76) | Meiosis | 15 (2) |
| Transmembrane protein | 184 (48) | Mitosis | 25 (6) |
| Receptor | 921 (478) | Cell cycle | 271 (100) |
| G protein-linked receptor | 164 (106) | DNA packaging | 15 (6) |
| Defense/immunity protein | 353 (164) | DNA repair | 132 (41) |
| Ligand binding or carrier | 691 (331) | DNA recombination | 31 (3) |
| Ion channel | 245 (141) | Methylation | 185 (53) |
| Oncogene | 128 (42) | Signal transduction | 1,231 (383) |
| Tumor suppressor | 8 (6) | Growth regulation | 15 (4) |
| Growth factor | 95 (40) | Differentiation | 24 (6) |
| Hormone | 42 (14) | Apoptosis | 160 (49) |
| Cell communication | 247 (84) | Angiogenesis | 11 (4) |
| Cell adhesion | 433 (252) | Defense/immunity | 112 (49) |
| | | Detoxification | 33 (15) |
| | | Stress response | 90 (41) |
| | | Developmental process | 278 (99) |
| | | Neurogenesis and regeneration | 147 (43) |
| | | Physiological process | 159 (43) |
| | | Sensory perception | 292 (65) |
| Functionally classified | 12,334 (5,204) | Process classified | 10,005 (4,225) |
Each transcriptional unit and HINT transcript (in parentheses) was assigned to a unique biological function or process.
Figure 1. Overview map of features on the entire human genome, based on the working draft assembly (15 June 2000 release) and finished sequences for chromosomes 21 and 22. Ideograms are oriented with the p-arm at the top, and are assembly-corrected to form an approximate cytogenetic alignment with the features of the draft assembly depicted to the right of each ideogram. Sequencing gaps at the centromeres and contiguous heterochromatic regions are represented by horizontal lines. Chromosome 19 is an exception, for which evidence suggests that both heterochromatic regions are at least partially sequenced. Genomic features are presented as densities (that is, proportion of base pairs occupied by each feature) in nonoverlapping 1 Mb intervals. The densities are corrected for sequencing gaps, indicated in the draft assembly as 50-200 kb segments of Ns (unsequenced nucleotides), but (with the exception of GC content) are not corrected for sporadic Ns of lower-quality base calls, because these would not interfere with assignment of the feature to the assembly. Exon density (red) is based on high-scoring pairs from Table 1, not necessarily in ORFs. CpG island density (blue) is based on standard definitions [45] of a run of at least 200 bases with GC content >50% and observed over expected CpG >0.6, and implemented using the program cpg [90]. GC content (green) is the number of G or C bases divided by the number of non-N bases in the 1 Mb interval. LINE1 (blue) and Alu (black) repeat elements were determined using RepeatMasker [91] and minisatellites of repeat size 20-50 bp by the etandem program of the EMBOSS suite [84]. Density ranges were selected to illuminate features across the genome while preserving a common scale to facilitate comparison. A number of values exceed the range for the feature and are truncated, with a small dot of the corresponding color placed under the ordinate. The data points for the figure are available in the additional data file.
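The quantities named in this caption (feature density per interval, GC content over non-N bases, and the CpG island criteria) can be illustrated with a short sketch. It is not the `cpg` program or the authors' code: the function names are hypothetical, features are assumed to be non-overlapping half-open `(start, end)` pairs, the observed/expected ratio uses the common form (CpG count × length) / (C count × G count), and the correction for sequencing gaps is not modeled.

```python
def gc_content(seq):
    """G+C count divided by the number of non-N bases; None if all bases are N."""
    seq = seq.upper()
    non_n = sum(1 for base in seq if base != "N")
    if non_n == 0:
        return None
    return (seq.count("G") + seq.count("C")) / non_n


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the caption's criteria: >= 200 bases, GC > 50%, obs/exp CpG > 0.6."""
    gc = gc_content(seq)
    return (
        len(seq) >= min_len
        and gc is not None
        and gc > min_gc
        and cpg_obs_exp(seq) > min_ratio
    )


def feature_density(features, chrom_len, window=1_000_000):
    """Proportion of bases covered by features in each non-overlapping window.

    Assumes features are non-overlapping, half-open, 0-based (start, end) pairs.
    """
    n_windows = (chrom_len + window - 1) // window
    covered = [0] * n_windows
    for start, end in features:
        pos = start
        while pos < end:
            w = pos // window
            stop = min(end, (w + 1) * window)   # split features at window edges
            covered[w] += stop - pos
            pos = stop
    sizes = [min(window, chrom_len - i * window) for i in range(n_windows)]
    return [c / s for c, s in zip(covered, sizes)]


# Small window size chosen only to keep the example readable.
densities = feature_density([(0, 10), (95, 105)], chrom_len=200, window=100)
```

With `window=1_000_000` the same function yields the per-megabase densities the caption describes, apart from the gap correction.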
(a) Density of features per megabase in Giemsa-staining cytogenetic bands
| Feature | R | G | R/G ratio |
| Exons | 0.0415 | 0.0319 | 1.30 |
| CpG islands | 0.0119 | 0.0075 | 1.59 |
| GC content | 42.23% | 39.76% | 1.06 |
| LINE1 repeats | 0.1435 | 0.1602 | 0.90 |
| Alu repeats | 0.1204 | 0.0937 | 1.28 |
| Minisatellites | 0.0090 | 0.0078 | 1.15 |
(b) Rank correlations of features in 1 Mb intervals
| | Exon | CpG | GC | LINE1 | Alu | Minisatellite |
| Exon | 1.00 | 0.65 | 0.64 | -0.26 | 0.73 | 0.19 |
| CpG | | 1.00 | 0.73 | -0.42 | 0.58 | 0.16 |
| GC | | | 1.00 | -0.54 | 0.61 | 0.13 |
| LINE1 | | | | 1.00 | -0.20 | 0.28 |
| Alu | | | | | 1.00 | 0.23 |
| Minisatellite | | | | | | 1.00 |
(a) Pale-staining (R) and dark-staining (G) bands are compared, with alignment of cytogenetic bands to sequence as described in the text. All of the features except LINE1 elements are denser in the R bands. The true differences are likely to be larger, as errors in cytoband alignment will tend to understate the differences in the band types. The differences in the bands are highly significant at p < 0.001 for all features except for minisatellites (p = 0.006). (b) Rank correlations of features, in 1 Mb intervals (p = 0.03, corrected for multiple comparisons).
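The two summaries in this table, an R/G density ratio and a rank correlation between per-interval feature densities, can be reproduced in outline as follows. This is a sketch only: the caption says "rank correlations" without naming the statistic, so Spearman's coefficient is an assumption here, and the density vectors are made-up toy values rather than data from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# R/G ratio for exons, using the densities reported in part (a).
exon_ratio = round(0.0415 / 0.0319, 2)

# Toy per-interval densities (illustrative values only).
exon_density = [0.01, 0.03, 0.02, 0.05]
line1_density = [0.20, 0.12, 0.15, 0.10]
rho = spearman(exon_density, line1_density)
```

In the toy data the LINE1 density falls whenever the exon density rises, so the coefficient is at its negative extreme; the real values in part (b) are far weaker (for example -0.26 for exons versus LINE1).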
Figure 2. Coding sequence density for human chromosomes. (a) The proportion of assembled sequence that is exonic provides direct confirmation of previously hypothesized patterns of gene density. (b) Transcriptional units per megabase. Additional plots and data are in the additional data files.
Figure 3. Total number of embryo-specific genes (based on HINT clusters) for each chromosome. Chromosomes 13, 18, 21 and Y clearly have lower numbers than other chromosomes.
Figure 4. The correspondence between physical location and maps constructed using different mapping methods. (a) Correspondence between the genetic map and physical location. (b) Correspondence between radiation hybrid maps and physical location. The GB4 (black) radiation hybrid map shows a jump at the centromere, reflecting a sequencing gap and possible increased radiation sensitivity in the region. The jump for the Stanford G3 map (blue) is not easily estimated and is suppressed in the published map. Chromosome 1 is shown here for illustration, and the corresponding figures and data points for the entire genome are available in the additional data files.
Figure 5. Repeat-masked chromosome sequences were divided into 1 Mb segments and analyzed against the entire chromosomal sequence. Matches of at least 70% identity (both forward and reverse) and E < 10⁻²⁵ are plotted. The diagonal line of complete identity has been removed to clarify features near the diagonal. Plots for each chromosome are available in the additional data files.