Can Cenik, Adnan Derti, Joseph C Mellor, Gabriel F Berriz, Frederick P Roth.
Abstract
BACKGROUND: Approximately 35% of human genes contain introns within the 5' untranslated region (UTR). Introns in 5'UTRs differ from those in coding regions and 3'UTRs with respect to nucleotide composition, length distribution and density. Despite their presumed impact on gene regulation, the evolution and possible functions of 5'UTR introns remain largely unexplored.
Year: 2010 PMID: 20222956 PMCID: PMC2864569 DOI: 10.1186/gb-2010-11-3-r29
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Characterization of fundamental properties of 5'UTR introns. (a) Histogram of the total 5'UTR intron length. A well-annotated set of RefSeq transcript IDs was used in this analysis; the histogram shows the distribution of the log10 of the total number of intronic nucleotides in the 5'UTR. (b) Distribution of the number of introns in the 5'UTR. The log10 of the number of transcripts that have a given number of introns in their 5'UTR is shown. The number of transcripts with a given number of 5'UTR introns decreases exponentially. (c) Heat map depicting the relationship between total lengths of 5'UTR introns and 5'UTR exons. (d) Heat map depicting the relationship between total lengths of 5'UTR introns and non-5'UTR introns. In both heat maps, darker shades of gray indicate more transcripts.
Figure 2. Expression analysis as a function of total 5'UTR intron length. (a) Heat map of the mean expression level versus the total 5'UTR intron length. The shade of gray represents the number of transcripts in each bin, with darker shades implying more transcripts. The overrepresentation of short 5'UTR-intron-containing genes among the highest expression levels is apparent. (b) Quantile-quantile plot of total 5'UTR intron length of short 5'UTR-intron-containing genes divided into highly expressed (top 5%) and other genes. The most highly expressed genes tend to have shorter 5'UTR introns. (c) Smoothed histogram of the mean expression level with respect to the presence/absence of a 5'UTR intron and its length. A kernel density estimator was fitted to the expression data and the corresponding probability density is plotted as a function of the mean expression level. The black line corresponds to the probability density for transcripts without any 5'UTR introns. Genes with long 5'UTR introns are represented by the red line, while genes with short 5'UTR introns are represented by the blue line. The vertical line represents the top 5% of mean expression level of all genes. (d) Total 5'UTR intron length of genes in different expression level categories. The width of the boxes represents the relative number of data points in each category. Transcripts in the top 1% and top 5% in expression level tend to have shorter 5'UTR introns.
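The smoothed histogram in panel (c) rests on a kernel density estimate. A minimal sketch of such an estimator follows, assuming a Gaussian kernel and a fixed bandwidth; the caption does not state which kernel or bandwidth was used, and the function name `gaussian_kde_at` and the sample values are hypothetical.

```python
import math

def gaussian_kde_at(x, samples, bandwidth):
    """Gaussian kernel density estimate at x (kernel and bandwidth are assumptions)."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

# Hypothetical mean expression values on a log scale, for illustration only.
expression = [2.1, 2.4, 2.5, 3.0, 3.2, 4.8]

# Evaluate the density on a grid from 0.0 to 7.9 in steps of 0.1.
density = [gaussian_kde_at(x / 10.0, expression, bandwidth=0.5) for x in range(0, 80)]
```

Plotting one such curve per transcript group (no 5'UTR intron, short introns, long introns) would give a figure of the kind described; a larger bandwidth yields a smoother curve.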
Figure 3. Analysis of variability in expression across tissues as a function of the total 5'UTR intron length. (a) Transcripts with low mean expression have higher normalized expression variability. A standardized measure of the variability in gene expression across tissues was calculated and plotted against the natural logarithm of mean expression level. The black vertical line represents the 25th percentile in mean expression. Since transcripts with low levels of mean expression tend to exhibit an artificially high variability in expression, they were removed from further analysis. (b) Boxplot of the coefficient of variation (standard deviation-to-mean ratio) of genes grouped by the total length of 5'UTR introns. The width of the boxes represents the relative number of data points in each category. There are no apparent differences between the three groups. (c) Boxplot of log10 of total 5'UTR intron length of genes grouped by their across-tissue variability. Genes are divided into six categories depending on their coefficient of variation. Error bars correspond to standard deviation of the mean. No obvious dependence of expression variability on total 5'UTR intron length can be observed, except for the most highly variable genes, which tend to have slightly shorter 5'UTR introns. (d) Boxplot of log10 of total 5'UTR intron length for gene groups defined by the number of tissues in which expression of each gene was detected. A gene was defined to have detectable expression in a given tissue if its expression was higher than the 25th percentile of mean expression of all genes. We found no differences in total 5'UTR intron length among the different gene groups. (e) Histogram of the number of genes divided by the presence of 5'UTR introns and by the number of tissues in which expression was detected. The number of tissues in which expression was detected was independent of the presence of 5'UTR introns.
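The variability measure in panels (b) and (c) and the low-expression filter in panel (a) can be sketched as follows. This is an illustration under assumptions: the caption does not say whether the sample or population standard deviation was used (the sample version is used here), and the function names and expression profiles are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation-to-mean ratio of one gene's expression across tissues."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Sample standard deviation; this choice is an assumption.
    return statistics.stdev(values) / mean

def drop_low_expression(profiles, fraction=0.25):
    """Remove the genes whose mean expression ranks in the lowest `fraction`."""
    ranked = sorted(profiles, key=lambda gene: statistics.fmean(profiles[gene]))
    n_drop = int(len(ranked) * fraction)
    return {gene: profiles[gene] for gene in ranked[n_drop:]}

# Hypothetical profiles (gene -> expression per tissue), for illustration only.
profiles = {
    "geneA": [10.0, 12.0, 11.0],
    "geneB": [0.1, 0.9, 0.2],
    "geneC": [50.0, 5.0, 20.0],
    "geneD": [30.0, 30.0, 31.0],
}
kept = drop_low_expression(profiles)
cv = {gene: coefficient_of_variation(values) for gene, values in kept.items()}
```

Filtering before computing the ratio matters because a mean close to zero inflates the coefficient of variation, which is the artifact panel (a) describes.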
Overrepresented Gene Ontology attributes for genes with 5'UTR introns
| N | X | LOD | P | P-adj | Gene Ontology attribute | |
|---|---|---|---|---|---|---|
| 25 | 35 | 0.650 | 1.4e-05 | 0.0153 | GO:0004715: | non-membrane spanning protein tyrosine kinase activity |
| 27 | 38 | 0.644 | 7.5e-06 | 0.0073 | GO:0051261: | protein depolymerization |
| 31 | 44 | 0.633 | 2.1e-06 | 0.0017 | GO:0051494: | negative regulation of cytoskeleton organization and biogenesis |
| 32 | 48 | 0.560 | 9.2e-06 | 0.0085 | GO:0032956: | regulation of actin cytoskeleton organization and biogenesis |
| 32 | 49 | 0.534 | 1.8e-05 | 0.0193 | GO:0032970: | regulation of actin filament-based process |
| 48 | 76 | 0.497 | 6.6e-07 | 0.0004 | GO:0051493: | regulation of cytoskeleton organization and biogenesis |
| 39 | 62 | 0.491 | 8.3e-06 | 0.0078 | GO:0016459: | myosin complex |
| 43 | 71 | 0.449 | 1.2e-05 | 0.0120 | GO:0051129: | negative regulation of cellular component organization and biogenesis |
| 51 | 88 | 0.404 | 1.1e-05 | 0.0114 | GO:0033043: | regulation of organelle organization and biogenesis |
| 105 | 216 | 0.243 | 3.5e-05 | 0.0398 | GO:0015629: | actin cytoskeleton |
| 1094 | 2356 | 0.232 | 5.7e-33 | <0.0001 | GO:0008270: | zinc ion binding |
| 139 | 294 | 0.220 | 1.3e-05 | 0.0139 | GO:0003779: | actin binding |
| 996 | 2218 | 0.199 | 1.4e-23 | <0.0001 | GO:0006355: | regulation of transcription, DNA-dependent |
| 1000 | 2233 | 0.197 | 3.4e-23 | <0.0001 | GO:0051252: | regulation of RNA metabolic process |
| 1061 | 2380 | 0.195 | 7.5e-24 | <0.0001 | GO:0045449: | regulation of transcription |
| 1013 | 2273 | 0.193 | 1.2e-22 | <0.0001 | GO:0006351: | transcription, DNA-dependent |
| 1015 | 2277 | 0.193 | 9.5e-23 | <0.0001 | GO:0032774: | RNA biosynthetic process |
| 191 | 420 | 0.190 | 8.3e-06 | 0.0077 | GO:0008092: | cytoskeletal protein binding |
| 1078 | 2436 | 0.189 | 6.6e-23 | <0.0001 | GO:0019219: | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process |
| 1106 | 2512 | 0.185 | 1.3e-22 | <0.0001 | GO:0010468: | regulation of gene expression |
| 1189 | 2713 | 0.183 | 1.6e-23 | <0.0001 | GO:0031323: | regulation of cellular metabolic process |
| 1088 | 2477 | 0.182 | 8.6e-22 | <0.0001 | GO:0006350: | transcription |
| 1211 | 2791 | 0.175 | 4.7e-22 | <0.0001 | GO:0019222: | regulation of metabolic process |
| 989 | 2267 | 0.174 | 1.2e-18 | <0.0001 | GO:0003677: | DNA binding |
| 1507 | 3515 | 0.172 | 2.9e-25 | <0.0001 | GO:0003676: | nucleic acid binding |
| 1212 | 2825 | 0.165 | 5.5e-20 | <0.0001 | GO:0046914: | transition metal ion binding |
| 1682 | 4053 | 0.147 | 1e-20 | <0.0001 | GO:0050794: | regulation of cellular process |
| 1157 | 2784 | 0.136 | 5.6e-14 | <0.0001 | GO:0016070: | RNA metabolic process |
| 1758 | 4305 | 0.134 | 3.7e-18 | <0.0001 | GO:0050789: | regulation of biological process |
| 1772 | 4364 | 0.129 | 4.2e-17 | <0.0001 | GO:0005634: | nucleus |
| 1463 | 3584 | 0.127 | 1.1e-14 | <0.0001 | GO:0006139: | nucleobase, nucleoside, nucleotide and nucleic acid metabolic process |
N represents the number of transcripts in the RefSeq collection that have both a 5'UTR intron and a given GO attribute; X represents the total number of transcripts having that GO attribute. For each attribute, P is the nominal P-value obtained from a one-tailed Fisher's Exact Test that calculates the probability that at least N transcripts have the particular attribute given the number of genes with 5'UTR introns. This nominal P-value is adjusted for multiple hypothesis testing to yield P-adj using a resampling approach that accounts for dependencies among the tested hypotheses (see [40] for the precise procedure). The table is sorted in descending order by the log10 of the odds ratio (LOD score); in the LOD calculation, M is the number of all genes, e is a pseudocount of 0.5 and q is the query set size. All attributes with LOD > 0.125 and P-adj < 0.05 are reported.
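The two statistics in this legend can be sketched in a few lines. The expression for the LOD score is not reproduced in this copy of the legend, so the form used below is an assumption: the pseudocount-adjusted odds ratio of the 2x2 contingency table, LOD = log10( ((N + e)/(q - N + e)) / ((X - N + e)/(M - X - q + N + e)) ), which is consistent with the symbols defined above. The function names are hypothetical, and the resampling adjustment that yields P-adj is not shown.

```python
import math

def fisher_upper_tail(n, x, q, m):
    """One-tailed Fisher's exact P-value: P(at least n query genes carry the attribute).

    n: query genes with the attribute; x: all genes with the attribute;
    q: query set size; m: number of all genes.
    """
    total = math.comb(m, q)
    upper = min(x, q)
    favorable = sum(math.comb(x, k) * math.comb(m - x, q - k) for k in range(n, upper + 1))
    return favorable / total

def lod_score(n, x, q, m, e=0.5):
    """log10 odds ratio with pseudocount e (assumed form, not quoted from the source)."""
    odds_in_query = (n + e) / (q - n + e)
    odds_outside = (x - n + e) / (m - x - q + n + e)
    return math.log10(odds_in_query / odds_outside)

# Toy numbers for illustration only; M and q for the table above are not given here.
p_value = fisher_upper_tail(n=4, x=5, q=8, m=20)
lod = lod_score(n=4, x=5, q=8, m=20)
```

The pseudocount keeps the ratio finite when a cell of the contingency table is empty, for example when every transcript with an attribute has a 5'UTR intron.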
Figure 4. Comparative genomics of 5'UTR introns within non-receptor tyrosine kinases (NRTKs). Several human NRTKs have multiple splice isoforms, and for these we used three different methods for calculating total 5'UTR intron length: mean of 5'UTR intron length for isoforms with 5'UTR introns (HS_Mean); longest total 5'UTR intron length (HS_Longest); 5'UTR intron length most similar to its ortholog in the genome of interest (HS_Closest). (a) Heat map of length correlation (considering genes with non-zero 5'UTR intron lengths) was plotted for the specified comparisons. As expected from the evolutionary distances between the analyzed species, the highest correlation (93%) was observed between mouse and rat NRTKs. (b) For each mouse ortholog of a human NRTK, the heat map depicts the changes in total 5'UTR intron length (color reflects log10 of total 5'UTR intron length). The histogram above the color scale summarizes the distribution of changes in 5'UTR intron length. A 5'UTR intron may be present in mouse but not in the compared species (light blue) or vice versa (dark blue). Comparisons require an annotated 5'UTR for each ortholog, and were therefore not possible in some cases (white). (c) Same as (b) but substituting 'rat' for 'mouse'. (d) Human genomic region containing the 5'UTR and first few coding exons (UCSC Genome Browser view). '7X Regulatory Potential', for which higher scores indicate a greater potential for harboring regulatory sequence elements, was calculated using alignments of seven mammalian genomes as previously described [44].
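The three per-gene summaries named in this caption can be sketched as below. This is one reading of the caption under stated assumptions: isoforms without a 5'UTR intron (length 0) are excluded from all three summaries, which the caption states only for HS_Mean, and the function name and example lengths are hypothetical.

```python
def summarize_isoform_lengths(human_lengths, ortholog_length):
    """Return HS_Mean, HS_Longest and HS_Closest for one human gene.

    human_lengths: total 5'UTR intron length of each human isoform (0 = no intron).
    ortholog_length: total 5'UTR intron length of the ortholog being compared.
    """
    with_introns = [length for length in human_lengths if length > 0]
    if not with_introns:
        return None  # no isoform has a 5'UTR intron
    return {
        "HS_Mean": sum(with_introns) / len(with_introns),
        "HS_Longest": max(with_introns),
        "HS_Closest": min(with_introns, key=lambda length: abs(length - ortholog_length)),
    }

# Hypothetical isoform lengths in nucleotides, for illustration only.
summary = summarize_isoform_lengths([0, 1200, 5000, 800], ortholog_length=900)
```

Of the three, only HS_Closest depends on the species being compared, so it has to be recomputed for each ortholog.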
Figure 5. Characterization of an 8-nucleotide DNA motif in the 5'UTR of human NRTKs. (a) Representative motif and its reverse complement. (b) Comparison of the representative motif to the TRANSFAC v11.3 database of known transcription factor binding sites. (c) Comparison of the representative motif to a list of conserved human predicted motifs [46]. The STAMP website was used for the comparisons [47]. The default ungapped Smith-Waterman alignment was used and the P-value was calculated using the methods of Sandelin and Wasserman [74].
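Panel (a) pairs the motif with its reverse complement, which is the same site read from the opposite DNA strand. A minimal sketch of that operation follows; the 8-nucleotide sequence used here is a hypothetical placeholder, not the motif reported in the figure.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Complement every base, then reverse the sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

# Placeholder 8-mer for illustration only.
placeholder_motif = "AAGGCTCC"
opposite_strand = reverse_complement(placeholder_motif)
```

Applying the function twice returns the original sequence, which is a quick sanity check when scanning both strands for a motif.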
Figure 6. The effect of 5'-proximal coding intron presence on gene expression. (a) Smoothed histogram of the mean expression level with respect to presence/absence of 5'-proximal coding region introns (5PCIs). A kernel density estimator was fitted to the expression data and the corresponding probability density is plotted as a function of the mean expression level. The black line corresponds to the probability density for transcripts without any 5'UTR introns or any 5PCIs. The red line represents the probability density for 5'UTR intronless transcripts that have 5PCIs. The vertical line represents the top 5% of mean expression level of all genes without 5'UTR introns.