Eric E Schadt, Stephen W Edwards, Debraj GuhaThakurta, Dan Holder, Lisa Ying, Vladimir Svetnik, Amy Leonardson, Kyle W Hart, Archie Russell, Guoya Li, Guy Cavet, John Castle, Paul McDonagh, Zhengyan Kan, Ronghua Chen, Andrew Kasarskis, Mihai Margarint, Ramon M Caceres, Jason M Johnson, Christopher D Armour, Philip W Garrett-Engele, Nicholas F Tsinoremas, Daniel D Shoemaker.
Abstract
BACKGROUND: Computational and microarray-based experimental approaches were used to generate a comprehensive transcript index for the human genome. Oligonucleotide probes designed from approximately 50,000 known and predicted transcript sequences from the human genome were used to survey transcription from a diverse set of 60 tissues and cell lines using ink-jet microarrays. Further, expression activity over at least six conditions was more generally assessed using genomic tiling arrays consisting of probes tiled through a repeat-masked version of the genomic sequence making up chromosomes 20 and 22.
Year: 2004 PMID: 15461792 PMCID: PMC545593 DOI: 10.1186/gb-2004-5-10-r73
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. A process to generate a comprehensive transcript index (CTI) for the human genome. The first step is the assembly of a comprehensive set of annotations to generate a predicted transcript index (PTI). Sets of microarrays capable of monitoring the transcription activity over the entire genome can then be designed on the basis of the PTI. The different microarray types that can be used in this process include predicted transcript arrays (PTA), exon junction arrays (EJA) [21] and genome tiling arrays (GTA). After hybridizing a diversity of conditions onto these arrays, the transcription data are processed to identify a comprehensive set of transcripts (the CTI) and associated probes that are capable of querying all forms of transcripts that may exist in the genome. This set of probes comprises a focused set of microarrays that can be used in more standard microarray-based experiments.
Comparison of locus projections in the PTI on chromosomes 20 and 22 to Sanger-annotated genes
| | | Sanger chromosome 20, genes | Non-Sanger chromosome 20, genes | Sanger chromosome 22, genes | Non-Sanger chromosome 22, genes |
| Sanger genes (including pseudogenes) | | 1,297 | | 936 | |
| Locus projection categories | | | | | |
| High-confidence categories | RefSeq | 676 (30) | 8 | 375 (47) | 12 |
| | | 336 (63) | 10 | 285 (127) | 10 |
| | | 38 (2) | 96 | 28 (7) | 74 |
| | | 28 (11) | 37 | 31 (18) | 29 |
| | Expressed sequence + protein | 38 (30) | 37 | 36 (30) | 24 |
| Low-confidence categories | | 22 (4) | 674 | 50 (21) | 362 |
| | Protein | 17 (14) | 157 | 18 (13) | 121 |
| | Expressed sequence | 22 (2) | 1,591 | 31 (7) | 1,127 |
| High-confidence categories, subtotal | | 1,116 (136) | 188 | 755 (229) | 149 |
| All categories | | 1,177 (156) | 2,610 | 854 (270) | 1,759 |
Columns 1 and 3 provide the number of locus projections in the PTI set that overlap Sanger genes for chromosomes 20 and 22, respectively. The numbers given in parentheses indicate the number of Sanger-annotated pseudogenes; these pseudogenes were not used when summarizing the results. Columns 2 and 4 give the number of genes in the PTI set that were not overlapping Sanger genes.
Summary of expression-validated genes (EVGs) from predicted transcripts over the entire human genome
| Gene categories | Sanger/PTI chromosome 20 | Non-Sanger PTI chromosome 20 | Sanger/PTI chromosome 22 | Non-Sanger PTI chromosome 22 | PTI genome-wide |
| Total Sanger genes represented | 1,177 (826) | | 854 (575) | | |
| RefSeq | 676 (552) | 8 (2) | 375 (290) | 12 (5) | 10,720 (7,992) |
| | 336 (229) | 10 (2) | 285 (202) | 10 (5) | 8,801 (4,269) |
| | 38 (17) | 96 (8) | 28 (15) | 74 (8) | 3,733 (784) |
| | 28 (9) | 37 (7) | 31 (16) | 29 (4) | 1,983 (233) |
| Expressed sequence + protein | 38 (2) | 37 (2) | 36 (10) | 24 (4) | 1,126 (271) |
| Expressed sequence | 22 (3) | 1,591 (44) | 31 (3) | 1,127 (33) | 7,170 (1,428) |
| | 22 (12) | 674 (39) | 50 (35) | 362 (17) | 16,822 (555) |
| Protein | 17 (2) | 157 (7) | 18 (4) | 121 (4) | 540 (110) |
| High-confidence categories | 1,116 (809) | 188 (21) | 755 (533) | 149 (26) | 26,363 (13,549) |
| All categories | 1,177 (826) | 2,610 (111) | 854 (575) | 1,759 (80) | 50,895 (15,642) |
Columns 1 and 3 provide the total number of Sanger genes for each category for chromosomes 20 and 22, respectively, with the number of EVGs detected given in parentheses. Columns 2 and 4 provide the total number of LPs that did not overlap Sanger genes, with the number of EVGs detected given in parentheses. The last column provides the total number of LPs in the PTI represented on the PTA microarrays, with the number of EVGs detected over the entire genome given in parentheses.
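The parenthesized EVG counts can be read as detection rates by dividing each by the total number of locus projections in the same cell. The following is a minimal Python sketch of that arithmetic for the genome-wide column, restricted to rows whose labels are present in the table; the names `genome_wide` and `detection_rate` are illustrative and do not come from the paper.

```python
# Genome-wide column of the table above:
# category -> (locus projections on the PTA arrays, EVGs detected).
# Only rows whose category labels are present in the table are included.
genome_wide = {
    "RefSeq": (10720, 7992),
    "Expressed sequence + protein": (1126, 271),
    "Expressed sequence": (7170, 1428),
    "Protein": (540, 110),
    "High-confidence categories": (26363, 13549),
    "All categories": (50895, 15642),
}


def detection_rate(total, detected):
    """Fraction of locus projections that were expression-validated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return detected / total


rates = {name: detection_rate(*counts) for name, counts in genome_wide.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

On these figures, roughly three quarters of the RefSeq locus projections were detected as EVGs, against a little under a third of all locus projections genome-wide.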
Figure 2. Gene Ontology (GO) classification of novel expression-validated genes (EVGs). EVGs not supported by the expressed sequence data (2,093) were submitted to a search against the Pfam database. Those with significant alignments (339) were assigned GO codes based on Pfam. The pie charts show the distribution of GO terms within this set of EVGs. Note that the total number of GO terms in each category is greater than the number of EVGs because of assignment of multiple GO terms to some EVGs. (a) Distribution of the different 'biological process' GO codes assigned to the EVGs with significant hits to the Pfam database: a total of 526 GO terms. (b) Distribution of the different 'molecular function' GO codes assigned to the EVGs with significant hits to the Pfam database: a total of 374 GO terms.
Figure 3. Utilizing PTA data as an expression index. Absolute transcript abundance over the 60 conditions described in [19] for two expression-supported transcripts. RLP09885002 represents a known gene (ATP1A1, ATPase, Na+/K+ transporting, alpha 1 polypeptide) whereas RLP10406004 was supported solely by gene model predictions before microarray validation.
Figure 4. Examples of tiling results for known genes. The colored bars across the bottom of the data window are color matched with the corresponding exon annotations shown in the genome viewer. (a) The KDELR3 gene shows strong agreement between the public transcript annotations and the tiling results. The top panel represents a screen shot from the UCSC genome browser [60] highlighting KDELR3. The bottom panel represents transcription activity as raw intensities (y-axis) for each probe used to tile through KDELR3 (x-axis), in one of the eight conditions monitored by the genomic tiling arrays. (b) The EWSR1 gene potentially contains a larger number of false-positive predictions, but more probably lends additional experimental support to previously predicted alternative splice forms (EWSR.b and EWSR.g), giving a more accurate representation of the putative structure of this gene. The top panel represents a screen shot from the UCSC genome browser [60] highlighting EWSR1. The bottom panel represents transcription activity as raw intensities (y-axis) for each probe used to tile through EWSR1 (x-axis), in one of the eight conditions monitored by the genomic tiling arrays. (c) Conserved regions between mouse and human upstream of the beta-actin gene. The tiling data readily detect all of the transcribed parts of the gene, but not the conserved regulatory regions. The green bars in the probe-intensity plot represent the annotated transcribed regions for the beta-actin gene, while the blue bars indicate regions that are not known to be transcribed. The lower section shows the sequence conservation between human and mouse as obtained through the program rVISTA [36,61]. Conserved coding (blue peaks) and non-coding regions (red peaks) are shown where the two genomic sequences align with 75% identity over 100-bp windows. The rows marked ELK, ETF, and SRF show binding sites for these transcription factors predicted using TRANSFAC matrix models and the MATCH™ program, which are part of the rVISTA suite. The exons for the gene are shown in blue.
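The conservation criterion in panel (c), 75% identity over 100-bp windows, can be illustrated with a sliding-window calculation over a pairwise alignment. The Python sketch below is not rVISTA's implementation: the function names, the treatment of gap columns as mismatches, and the use of alignment columns rather than genomic coordinates are all assumptions made for illustration.

```python
def window_identity(aligned_a, aligned_b, window=100):
    """Percent identity for every window of `window` alignment columns.

    Both inputs are rows of one pairwise alignment ('-' marks a gap).
    Assumption: a column is a match only if both bases are equal and
    neither is a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    if len(matches) < window:
        return []
    # Running sum: add the column entering the window, drop the one leaving.
    current = sum(matches[:window])
    identities = [100.0 * current / window]
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        identities.append(100.0 * current / window)
    return identities


def conserved_windows(aligned_a, aligned_b, window=100, threshold=75.0):
    """Start offsets (alignment columns) of windows at or above the threshold."""
    return [
        start
        for start, pct in enumerate(window_identity(aligned_a, aligned_b, window))
        if pct >= threshold
    ]


# Toy example with a 4-column window; the windows starting at columns 0 and 1 pass.
print(conserved_windows("ACGTACGT", "ACGTTTTT", window=4))
```

With the default arguments the function applies the 100-column window and 75% cutoff quoted in the caption; the toy call uses a 4-column window only so that the result can be checked by hand.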
Summary of transcription activity detected from the chromosome 20 and 22 genome tiling data
| Locus projection categories | Sanger tiling chromosome 20 | Non-Sanger tiling chromosome 20 | Sanger tiling chromosome 22 | Non-Sanger tiling chromosome 22 |
| Total Sanger genes | 1,278 | | 933 | |
| Sanger category 1 | 577 (398) | | 368 (184) | |
| Sanger category 2 | 155 (32) | | 121 (60) | |
| Sanger category 3 | 338 (150) | | 144 (52) | |
| Sanger category 4 | 161 (117) | | 294 (138) | |
| RefSeq | | 3 | | 0 |
| | | 1 | | 0 |
| | | 15 | | 8 |
| | | 6 | | 4 |
| Expressed sequence + protein | | 4 | | 1 |
| | | 71 | | 26 |
| Protein | | 11 | | 21 |
| Expressed sequence | | 80 | | 46 |
| Outside all annotations* | | 1,936 | | 1,058 |
| High-confidence categories | NA | 25 | NA | 12 |
| All annotation categories | 1,231 (697) | 191 | 927 (434) | 106 |
*Number of probes detected as components of EVGs. Columns 1 and 3 provide the number of Sanger genes represented on the genome tiling arrays for chromosomes 20 and 22, respectively, with the number of genes detected given in parentheses. Columns 2 and 4 provide the number of LPs not overlapping Sanger genes that were detected on chromosomes 20 and 22, respectively. NA, not applicable.