Jill L Wegrzyn, Thomas M Drudge, Faramarz Valafar, Vivian Hook.
Abstract
BACKGROUND: Utilization of alternative initiation sites for protein translation directed by non-AUG codons in mammalian mRNAs is observed with increasing frequency. Alternative initiation sites are utilized for the synthesis of important regulatory proteins that control distinct biological functions. It is, therefore, of high significance to define the parameters that allow accurate bioinformatic prediction of alternative translation initiation sites (aTIS). This study has investigated 5'-UTR regions of mRNAs to define consensus sequence properties and structural features that allow identification of alternative initiation sites for protein translation.
Year: 2008 PMID: 18466625 PMCID: PMC2396638 DOI: 10.1186/1471-2105-9-232
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Mammalian mRNAs that Utilize Alternative Initiation Sites (non-AUG) for Protein Translation
| Protein name | Start codon | Species | Start position | Reference | |
|---|---|---|---|---|---|
| hemopoietic cell kinase (hck) | CUG | Homo sapiens | 172 | [3] | |
| hemopoietic cell kinase (hck) | CUG | Mus musculus | 186 | [3] | |
| phosphoribosyl pyrophosphate synthetase 1-like 1 (PRPS1L1) | ACG | Homo sapiens | 82 | [7] | |
| leukocyte tyrosine kinase (ltk), tv 1 | CUG | Mus musculus | 228 | [8] | |
| leukocyte tyrosine kinase (ltk), tv 2 | CUG | Mus musculus | 228 | [8] | |
| v-myc myelocytomatosis viral oncogene homolog (Myc) | CUG | Homo sapiens | 525 | [9] | |
| myelocytomatosis viral oncogene homolog (Myc) | CUG | Rattus norvegicus | 537 | [9] | |
| hemopoietic cell kinase (hck) | CUG | Rattus norvegicus | 116 | [3] | |
| cyclin-dependent kinase 10 (CDK10), tv 1 | CUG | Homo sapiens | 237 | [10] | |
| cyclin-dependent kinase 10 (CDK10), tv 1 | CUG | Homo sapiens | 255 | [10] | |
| cyclin-dependent kinase 10 (CDK10), tv 2 | CUG | Homo sapiens | 91 | [10] | |
| cyclin-dependent kinase 10 (CDK10), tv 2 | GUG | Homo sapiens | 109 | [10] | |
| cyclin-dependent kinase 10 (CDK10), tv 2 | CUG | Mus musculus | 91 | [10] | |
| solute carrier family 30 (zinc), member 2 (Slc30a2) | CUG | Rattus norvegicus | 52 | [11] | |
| Wilms tumor 1 (WT1), tv A | CUG | Homo sapiens | 197 | [4] | |
| TEA domain family member 4 (TEAD4), tv 1 | UUG | Homo sapiens | 275 | [12] | |
| p97, repressor translation | GUG | Mus musculus | 333 | [13] | |
| neuronal pentraxin receptor (NPR) | CUG | Homo sapiens | 155 | [14] | |
| neuronal pentraxin receptor (NPR) | CUG | Mus musculus | 126 | [14] | |
| neuronal pentraxin receptor (NPR) | CUG | Rattus norvegicus | 115 | [14] | |
| Bcl2-associated athanogene 1 (BAG1) | CUG | Mus musculus | 34 | [15] | |
| BCL2-associated athanogene (BAG1) | CUG | Homo sapiens | 73 | [16] | |
| Human nuclear receptor (hPAR) | CUG | Homo sapiens | 1840 | [17] | |
| fibroblast growth factor 2 (FGF2) | CUG | Homo sapiens | 303 | [18] | |
| fibroblast growth factor 2 (FGF2) | CUG | Homo sapiens | 330 | [18] | |
| fibroblast growth factor 2 (FGF2) | CUG | Homo sapiens | 345 | [18] | |
| fibroblast growth factor 2 (FGF2) | CUG | Homo sapiens | 69 | [18] | |
| minor histocompatibility antigen HB-1 | CUG | Homo sapiens | 108 | [19] | |
| tumor suppression (MRVI1) | CUG | Homo sapiens | 502 | [20] | |
| tumor suppression (MRVI1) | CUG | Mus musculus | 647 | [20] | |
| TREF-5 transcription enhancer factor | AUA | Homo sapiens | 161 | [21] | |
| Human I-mfa domain-containing protein (HIC) | GUG | Homo sapiens | 264 | [22] | |
| calcium channel, voltage-dependent (CACNG8) | CUG | Homo sapiens | 104 | [23] | |
| spatial stromal protein associated thymii and lymph node | CUG | Mus musculus | 84 | [24] | |
| stromal interaction molecule 2 (STIM2) | UUG | Homo sapiens | 531 | [25] | |
| regulatory factor X-associated protein (RFXAP) | ACG | Mus musculus | 97 | [26] | |
| DEAD box protein (DDX17) | CUG | Homo sapiens | 75 | [27] | |
| Leucine zipper DNA binding protein (JUND1) | CUG | Mus musculus | 139 | [28] | |
| Leucine zipper DNA binding protein (JUND1) | CUG | Rattus norvegicus | 294 | [28] | |
| vascular endothelial growth factor (VEGF), tv 1 | CUG | Homo sapiens | 492 | [1] | |
| vascular endothelial growth factor (VEGF), tv 3 | CUG | Homo sapiens | 666 | [1] | |
| vascular endothelial growth factor (VEGF), tv 4 | CUG | Homo sapiens | 908 | [1] | |
| vascular endothelial growth factor (VEGF), tv 2 | CUG | Homo sapiens | 645 | [1] | |
| Sp3 transcription factor (SP3), tv 2 | AUA | Homo sapiens | 385 | [29] | |
| DNase X (LOC515176) | CUG | Bos taurus | 185 | [30] |
This positive set of 45 validated RefSeq mammalian sequences has been identified as containing at least one alternative initiation site. This information was obtained from GenBank records and includes accession, protein name, annotated start codon, species, and the relative start position. References for each alternative start site are provided (last column).
Figure 1. Functional classification of proteins derived from mRNAs utilizing alternative translation initiation sites. The positive training set, consisting of verified mammalian RefSeq mRNA sequences, was analyzed for functional biological categories. These annotations were compiled via BLAST searches and subsequent Gene Ontology (GO) and protein family analysis. The chart depicts the protein functions represented by the identified aTIS sequences. The functions of these proteins are significant for biological regulation.
Figure 2. Unique consensus sequences at aTIS compared to non-aTIS (AUG) in the 5'-UTR region of mRNAs. Figures 2A and 2B were generated with an adaptation of the WebLogo application [32]. The overall height of the nucleotide stack indicates the sequence conservation at that position, while the height of nucleotide symbols within the stack indicates the relative frequency of each base at that position. The start site is indicated at positions 1, 2, and 3. (A) Distinct consensus nucleotide sequences near confirmed alternative translation initiation sites in mammalian mRNAs. The relative abundance of nucleotides (A, T, C, G) at aTISs is shown for a window of -10 to +10 bases around the initiation codon, with the aTIS start codon in positions 1, 2, and 3. Conservation around all of the aTISs is illustrated. Note the strong conservation of (G/C) at the -6 position and C at the -7 position. (B) Consensus nucleotides near AUG translation initiation sites in mammalian mRNAs. Graphical representation of relative nucleotide abundances at AUG sites is shown for bases in the region of -10 to +10 bases relative to the initiation codon, with the AUG codon in positions 1, 2, and 3. Conservation at the -3 and +4 positions is consistent with the traditional Kozak consensus sequence. These features are distinct from those of the aTIS sequences, which show conservation at positions -6 and -7 (Figure 2A).
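The two position patterns described in this caption can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the helper names are hypothetical, the numbering assumes the caption's convention (start codon at +1 to +3, no position 0), and the Kozak test assumes the common form of a purine at -3 and G at +4.

```python
def base_at(seq, start_index, position):
    """Return the base at a signed position relative to a start codon.

    start_index is the 0-based index of the first codon base (position +1).
    Position -1 is the base immediately upstream; position 0 does not exist.
    Returns None when the position falls outside the sequence.
    """
    if position == 0:
        raise ValueError("position 0 is not defined in this numbering")
    offset = position - 1 if position > 0 else position
    index = start_index + offset
    return seq[index] if 0 <= index < len(seq) else None


def has_atis_pattern(seq, start_index):
    """Pattern #1: G or C at position -6 and C at position -7."""
    return (base_at(seq, start_index, -6) in ("G", "C")
            and base_at(seq, start_index, -7) == "C")


def has_kozak_pattern(seq, start_index):
    """Pattern #2 (assumed form): purine at position -3 and G at position +4."""
    return (base_at(seq, start_index, -3) in ("A", "G")
            and base_at(seq, start_index, 4) == "G")


# Toy sequences (not from the paper); the start codon begins at index 9.
cug_site = "AACGAAAAACUGGAA"
aug_site = "AAAAAAACCAUGG"
print(has_atis_pattern(cug_site, 9), has_kozak_pattern(cug_site, 9))
print(has_atis_pattern(aug_site, 9), has_kozak_pattern(aug_site, 9))
```

Returning `None` for out-of-range positions lets a start site that lies too close to the 5' end simply fail both tests instead of raising an error.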
5'-UTR Sequence Parameters Utilized for Analyses by the Classification Tree
| Parameter | Derivation | Range |
|---|---|---|
| Pattern #1 (G/C, C) | Position -6/-7 | Yes/No |
| Pattern #2 (Kozak) | Position -3/+4 | Yes/No |
| 5'-UTR Length | Length of 5'-UTR | 80 to 2000 bp |
| mRNA Sequence Length | Full mRNA length | 350 to 5000 bp |
| G/C Ratio | Ratio of G to C | 0.4 to 3 |
| GC Percentage | Percentage of GC content | 0.3 to 0.9 |
| A/T Ratio | Ratio of A to T | 0.2 to 8 |
| Number of AUGs | Number of upstream AUGs from first start site. | 0 to 19 |
| Codon Bias | Codonw | 0.02 to 1.0 |
| IRES | UTRScan/NCBI | Yes/No |
| GLUT1 | UTRScan/NCBI | Yes/No |
| TOP | UTRScan/NCBI | Yes/No |
This table lists the parameters (properties) of 5'-UTR regions of mRNAs utilizing aTIS: the distinct consensus sequence pattern compared to the traditional Kozak pattern, sequence parameters of 5'-UTR regions, and secondary structure features. The distinct consensus sequence pattern consists of G/C at position -6 and C at position -7 (Fig. 2A). 5'-UTR length is defined as the length of the mRNA sequence from the beginning of the 5'-UTR to the translation initiation site; mRNA sequence length is taken as annotated in GenBank. Number of AUGs is counted from the start of the 5'-UTR to the first annotated start site. Secondary structure features considered are Internal Ribosome Entry Sites (IRES), Glucose Transporter 1 (GLUT1), and Terminal Oligopyrimidine (TOP).
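The composition parameters in this table are simple counts and ratios, which the following sketch illustrates. The function and key names are hypothetical, and treating a ratio as undefined when its denominator base is absent is an assumption, not something the source specifies.

```python
def utr_features(utr):
    """Compute simple composition features for a 5'-UTR (RNA or DNA string)."""
    seq = utr.upper().replace("T", "U")
    counts = {base: seq.count(base) for base in "ACGU"}
    length = len(seq)

    def ratio(numerator, denominator):
        # Assumption: the ratio is undefined (None) when the denominator is 0.
        return numerator / denominator if denominator else None

    return {
        "utr_length": length,
        "g_to_c_ratio": ratio(counts["G"], counts["C"]),
        "gc_fraction": (counts["G"] + counts["C"]) / length if length else None,
        "a_to_t_ratio": ratio(counts["A"], counts["U"]),
        # AUG cannot overlap with itself, so a plain substring count is exact.
        "upstream_augs": seq.count("AUG"),
    }


# Toy input (not from the paper), given as DNA to show the T -> U handling.
print(utr_features("GGCATG"))
```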
Figure 3. Classification tree for identification of mRNAs with alternative translation initiation sites (aTIS). The C4.5 classification results are displayed here in the form of a decision tree. Training, testing, and cross-validation produced a set of 'if-then' rules that allowed the sequences containing aTISs to be classified independently from sequences that do not possess aTISs. The critical parameters for the classification tree consisted of the number of upstream AUGs, 5'-UTR length, consensus sequences (at positions -6/-7), the presence or absence of the Internal Ribosome Entry Site (IRES) structure, and the G/C ratio.
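C4.5 chooses each split in the tree by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below shows that criterion for a categorical attribute such as a Yes/No pattern flag. It is a simplified illustration with hypothetical function names and toy data: threshold search for continuous attributes (such as 5'-UTR length) and pruning are omitted.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the attribute `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(values)
    # An attribute with a single value cannot split the data.
    return information_gain / split_information if split_information else 0.0


# Toy data (not from the paper): a flag that separates the classes perfectly,
# and one that carries no information about the class.
labels = ["aTIS", "aTIS", "non-aTIS", "non-aTIS"]
print(gain_ratio(["yes", "yes", "no", "no"], labels))
print(gain_ratio(["yes", "no", "yes", "no"], labels))
```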
C4.5 Classification Tree Results
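| Data set | aTIS sequences | Total sequences | Correctly classified | Incorrectly classified | Mean absolute error | False negatives | False positives | |
|---|---|---|---|---|---|---|---|---|
| Training | 41 | 82 | 81 | 1 | 0.0122 | 0 | 1 | |
| Testing | 4 | 8 | 7 | 1 | 0.1250 | 0 | 1 | |
| Cross-validation | 45 | 90 | 87 | 3 | 0.0333 | 0 | 3 | |
| Full negative set | 0 | 500 | 469 | 31 | 0.062 | 0 | 31 | |
| Provisional | 43 | 86 | 83 | 3 | 0.0349 | 0 | 3 |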
Results of the C4.5 classification tree on the training, testing, cross-validation, full negative, and provisional data sets are presented here. The classification tree was able to effectively distinguish between sequences that contain aTIS sites and those that do not. The 45 RefSeq sequences validated as containing alternative start sites were combined with 45 randomly selected sequences without alternative start sites, resulting in 90 sequences used for the data sets and training. Of these 90 sequences, 82 were used for the 'training' set and the remaining 8 were used as an 'independent' testing set. 'Cross-validation' of the C4.5 classification tree was performed using 10-fold cross-validation on the full set of 90 sequences, represented by the training and independent testing sets. The 'Full Negative Set' was the full set of 500 non-aTIS sequences that was compiled for generation of testing sets. The performance on this set provides a measure of the ability of the classification tree to generalize to larger datasets. The 'provisional' set consisted of sequences predicted to contain at least one aTIS; this data set performed well during classification. The Mean Absolute Error is calculated as the fraction of incorrectly classified sequences relative to the total number of mRNA sequences in the designated data set.
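As a check on the definition above, the error values in the table can be reproduced by dividing the number of incorrectly classified sequences by the set size. The counts below are copied from the table; the data set names follow the order given in the caption, and the function name is hypothetical.

```python
def mean_absolute_error(incorrect, total):
    """Fraction of incorrectly classified sequences in a data set."""
    return incorrect / total


# (incorrectly classified, total sequences) for each data set in the table.
datasets = {
    "training": (1, 82),
    "testing": (1, 8),
    "cross-validation": (3, 90),
    "full negative set": (31, 500),
    "provisional": (3, 86),
}

for name, (incorrect, total) in datasets.items():
    print(f"{name}: {mean_absolute_error(incorrect, total):.4f}")
```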
Significant Parameters Utilized by C4.5 Classification Tree as Properties to Determine aTIS
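| Parameter | aTIS (average) | | Non-aTIS (average) | | |
|---|---|---|---|---|---|
| Number of AUGs | 2.41 | 3.16 | 1.72 | 2.73 | |
| 5'-UTR length | 477.23 | 1303.24 | 100.24 | 278.97 | |
| Consensus sequence (-6/-7) | 60% | n/a | 27.6% | n/a | |
| IRES | 58% | n/a | 11.84% | n/a | |
| G/C ratio | 1.11 | 0.47 | 1.00 | 0.33 | |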
Critical parameters in the classification tree are shown with their average values for alternative start sites and non-alternative start sites. Differences among these values facilitated the effective classification of mRNAs with aTIS. The number of AUGs is a count of the AUGs in the 5'-UTR. The consensus sequence test matches a C in the -7 position from the start site and a G or C in the -6 position. The sequences are also scanned for the IRES secondary structure, and the ratio of guanine to cytosine in the 5'-UTR is measured. Although the G/C ratio was the least important variable, it resulted in a split in the classification tree and is, therefore, listed as a parameter required for the final classification results.
5'-UTR Sequence Properties Utilized for Analyses by the Artificial Neural Network (ANN)
| Property | Derivation | Range |
|---|---|---|
| Pattern #1 (G/C, C) | Position -6/-7 | Yes/No |
| Pattern #2 (Kozak) | Position -3/+4 | Yes/No |
| 5'-UTR Length | Length of 5'-UTR | 80 to 2000 bp |
| ORF Length | Length of ORF from the annotated start codon to the stop codon | 350 to 5000 bp |
| Start Codon | Frequency of aTIS in training set | 0 to 1 |
| Number of AUGs | Number of upstream AUGs from aTIS | 0 to 19 |
| G/C Ratio | Normalized ratio of G to C | -1 to 1 |
| Free Energy | 50 bp UnaFold | -40 to 0 |
| Secondary Structure | UnaFold | 0 = stem, 3 = loop |
Properties of 5'-UTR sequences of mRNAs using aTIS are shown according to their application in the ANN. The derivation of each feature is shown, as well as the range in which it is represented to the ANN for training and testing. These features are implemented in the ANN analyses, which include refined representations of secondary structure.
Figure 4. Secondary structural features of alternative translation initiation sites in 5'-UTR regions of mRNA. Folding of a representative alternative start site in a stem-loop model using UNAFold, centered on the start site, is illustrated in this figure. The alternative start site, CUG, is circled in red. Features of this secondary structure served as inputs to the ANN. For each 50 base pair window surrounding the putative alternative initiation site (as shown in this figure), the local stability of the start codon itself and the free energy of the structure were recorded. The window size (50 bp) was experimentally determined as the minimum window size that produced consistent foldings through shifts in the folding window. On the scale of 0 to 3, the stability here would be measured as 3, since the codon (all three bases) is present entirely in the loop structure.
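One way to read the 0 to 3 scale is as the number of start codon bases that are unpaired in the folded window. That reading is an assumption drawn from the example in this caption (a fully looped codon scores 3), not a definition given by the source. The sketch below also assumes the fold is available as a dot-bracket string, where `.` marks an unpaired base; the function name and the toy structure are hypothetical.

```python
def start_codon_loop_score(dot_bracket, codon_index):
    """Count how many of the three start codon bases are unpaired.

    dot_bracket: structure string, '.' for unpaired and '(' or ')' for paired.
    codon_index: 0-based index of the first codon base within the window.
    Returns 0 (codon fully in a stem) through 3 (codon fully unpaired).
    """
    codon = dot_bracket[codon_index:codon_index + 3]
    if len(codon) != 3:
        raise ValueError("start codon extends past the folded window")
    return codon.count(".")


# Toy hairpin (not from the paper): a 4 bp stem closing a 3-base loop.
structure = "((((...))))"
print(start_codon_loop_score(structure, 4))  # codon lies in the loop
print(start_codon_loop_score(structure, 0))  # codon lies in the stem
```

Note that this counts any unpaired base, so bulges and dangling ends would score the same as hairpin loops; a stricter definition would need the full structure parsed.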
Figure 5. Organization of the artificial neural network (ANN) for identification of alternative translation initiation sites (aTIS). To identify aTISs, this study used a feed-forward back-propagation ANN built with Matlab's Neural Network toolbox. An artificial neural network is a computational model that uses layers of neurons, with weighted edges connecting each layer, to perform classification. To determine the specific ANN architecture, this study started with a static training set and modified the number of neurons in the hidden layer of the ANN as well as the activation function used for the neurons in each layer. The resulting ANN contained 10 neurons in the input layer, 20 neurons in the hidden layer, and a single output neuron. Inputs to the ANN are normalized in order to negate the effect of measurements in different ranges. The output neuron provides values in the range [0, 1].
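The 10-20-1 layout described here can be sketched as a plain forward pass. This is an illustration in Python rather than the Matlab toolbox the study used, and several details are assumptions: the tanh hidden activation, the logistic output (chosen because it yields values in [0, 1]), min-max scaling to [-1, 1], and the random untrained weights. Back-propagation training is omitted.

```python
import math
import random


def min_max_scale(value, low, high):
    """Map a raw feature from [low, high] onto [-1, 1] (assumed scheme)."""
    return 2 * (value - low) / (high - low) - 1


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def init_layer(n_inputs, n_neurons, rng):
    """Each neuron holds n_inputs weights followed by one bias term."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_neurons)]


def layer_output(inputs, layer, activation):
    # zip stops at len(inputs), so the trailing bias is added separately.
    return [activation(sum(w * x for w, x in zip(neuron, inputs)) + neuron[-1])
            for neuron in layer]


def forward(inputs, hidden, output):
    """Feed-forward pass: 10 inputs -> 20 hidden neurons -> 1 output."""
    hidden_values = layer_output(inputs, hidden, math.tanh)
    return layer_output(hidden_values, output, sigmoid)[0]


rng = random.Random(0)          # fixed seed, untrained placeholder weights
hidden = init_layer(10, 20, rng)
output = init_layer(20, 1, rng)

features = [0.0] * 10           # ten already-normalized inputs (toy values)
features[0] = min_max_scale(477, 80, 2000)  # e.g. a 5'-UTR length in bp
score = forward(features, hidden, output)
print(score)
```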
Results of ANN aTIS Identification
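| Data set | aTIS sites | Total sites | | | | Accuracy (%) | False positives | False negatives | |
|---|---|---|---|---|---|---|---|---|---|
| Training | 41 | 82 | 0.1165 | 0.1109 | 0.1221 | 82.9 | 7 | 7 | |
| Testing | 4 | 8 | 0.0969 | 0.1436 | 0.0502 | 87.5 | 0 | 1 | |
| Full | 45 | 7627 | 0.0471 | 0.1138 | 0.0467 | 93.4 | 495 | 8 |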
Results of the ANN on the training, testing, and full data sets are shown here. For the purpose of identifying the aTISs, the 45 validated alternative start sites were combined with 45 non-alternative start sites randomly selected from the same sequences, resulting in a set of 90 sites. Of these 90 sites, 82 were used for the 'training' set and the remaining 8 were used as an 'independent testing' set. Further, all 7627 potential start sites from the full set of 45 sequences containing alternative start sites were compiled and tested by the ANN. The full set represents all potential start sites present in the positive set of 45 mRNAs, and tests the ability of the ANN to identify which sites are likely to act as alternative initiation sites.
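The percentage column in the table is consistent with reading the last two columns as counts of misclassified sites and computing accuracy as the share of sites classified correctly. That reading is an assumption, since the column headers are not preserved in this copy; the function name is hypothetical and the counts are taken from the table.

```python
def accuracy_percent(total_sites, misclassified):
    """Percentage of sites classified correctly."""
    return 100 * (total_sites - misclassified) / total_sites


# (total sites, sum of the last two table columns) for each data set.
results = {
    "training": (82, 7 + 7),
    "testing": (8, 0 + 1),
    "full": (7627, 495 + 8),
}

for name, (total, errors) in results.items():
    print(f"{name}: {accuracy_percent(total, errors):.1f}%")
```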