Pavel Dvorak, Sarah Leupen, Pavel Soucek.
Abstract
Single nucleotide polymorphisms located in 5' untranslated regions (5'UTRs) can regulate gene expression and have clinical impact. Recognition of functionally significant sequences within 5'UTRs is crucial in next-generation sequencing applications. Furthermore, information about the behavior of 5'UTRs during gene evolution is scarce. Using the example of the ATP-binding cassette transporter A1 (ABCA1) gene (Tangier disease), we describe our algorithm for functionally significant sequence finding. 5'UTR features (upstream start and stop codons, open reading frames (ORFs), GC content, motifs, and secondary structures) were studied using freely available bioinformatics tools in 55 vertebrate orthologous genes obtained from Ensembl and UCSC. The most conserved sequences were suggested as hot spots. Exon and intron enhancers and silencers (sc35, ighg2 cgamma2, ctnt, gh-1, and fibronectin eda exon), transcription factors (TFIIA, TATA, NFAT1, NFAT4, and HOXA13), some of them cancer related, and microRNA (hsa-miR-4474-3p) were localized to these regions. An upstream ORF, overlapping with the main ORF in primates and possibly coding for a small bioactive peptide, was also detected. Moreover, we showed several features of 5'UTRs, such as GC content variation, hairpin structure conservation or 5'UTR segmentation, which are interesting from a phylogenetic point of view and can stimulate further evolutionary oriented research.
Keywords: 5′ untranslated region; ABCA1; bioinformatics; gene regulation; single nucleotide polymorphism
Year: 2019 PMID: 31234415 PMCID: PMC6627321 DOI: 10.3390/cells8060623
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Figure 1. Workflow of significant sequence finding in 5′UTRs applied in this study. ORF, open reading frame.
Figure 2. Simplified phylogenetic tree highlighting the 55 vertebrate species evaluated in this study. Clades are colored as follows: primates, light violet; rodents, dark violet; mammals, blue; ray-finned fishes, green; reptiles and birds, orange; coelacanth, red.
Figure 3. Detail of the 5′ end of the human ABCA1 gene. 5′UTR, 5′ untranslated region; bp, base pairs; mORF, main open reading frame.
Figure 4. Length comparison of 5′UTRs, their subsections, and ABCA1 proteins in vertebrates. All p values are from pairwise Mann–Whitney tests and are Bonferroni corrected. (A) Before-Intron-1 sections of 5′UTRs (Pr vs. Ro p = 0.018); (B) after-Intron-1 sections (Pr vs. Ro p = 0.000003, Pr vs. Plm p = 0.031, Pr vs. Ma p = 0.012, Pr vs. Rebi p = 0.001, Pr vs. Rfa p = 0.0003); (C) whole 5′UTRs (Pr vs. Ro p = 0.032); (D) whole Intron 1 regions (Pr vs. Ro p = 0.004, Pr vs. Plm p = 0.043, Pr vs. Rfa p = 0.034); and (E) ABCA1 proteins (Pr vs. Rfa p = 0.002, Pr vs. Rfb p = 0.007). Pr, primates; Ro, rodents; Plm, other placental mammals; Ma, marsupials; Rebi, reptiles and birds; Rfa, ray-finned fishes, ABCA1a genes; Rfb, ray-finned fishes, ABCA1b genes.
Figure 5. Results of the ABCA1 5′UTR multi-sequence alignment of transcripts from 15 vertebrate species, performed in ClustalO and visualized in Jalview. Lines represent individual transcripts; nucleotide conservation among vertebrate species is highlighted, with the most conserved nucleotides in the darkest color; identity levels and occupancy scores are shown as histograms at the bottom. (A) Alignment of whole 5′UTR regions. 5′UTR segments that contain conserved nucleotides can be distinguished. (B) Detail of the 5′UTR segment that contains the most conserved sequences. The position of the spliced Intron 1 is marked by a red arrow.
Comparison of the conserved uORF predicted in ABCA1 orthologous genes.
| Species | uORF Start | uORF Stop | uORF Length (nts) | uORF Length (aa) | uORF Protein Sequence | mORF Start |
|---|---|---|---|---|---|---|
| Human | 307 | 417 | 111 | 36 | MTSHGVPAVSSGRCLPGLPSHTLGVLAEGTWLVGLS | 396 |
| Macaque | 364 | 474 | 111 | 36 | MTSHGIPAVSSGRCLPGLLSHTLWVLAEGTWLVGLS | 453 |
| Mouse lemur | 314 | 424 | 111 | 36 | MTSHSIPAVSSGHCLPGLLSHTLWVPAEVTWLVGPS | 403 |
| Rabbit | 327 | 440 | 114 | 37 | MTSHSGFATSSGRCLQGRATSRLPWVPAEVTWPAGLS | 419 |
| Mouse | 224 | 319 | 96 | 31 | MTSHRVTALCSGCSLQGSRAADAGRCGCRLW | 321 |
| Squirrel | 279 | 365 | 87 | 28 | MTSHSVCCELRPVPPGLLSHTQVALGAG | 372 |
| Cat | 304 | 393 | 90 | 29 | MTSHSVPAVSCCCCLQKLLSHTQVAAAAG | 400 |
| Armadillo | 304 | 399 | 96 | 31 | MTSHSVPAVSSGHCPHGLPTSHTQVAWARLR | 401 |
| Tasmanian devil | 4 | 66 | 63 | 20 | MTSHSVPAQRYLCSLHYLPG | 96 |
| Opossum | 333 | 395 | 63 | 20 | MTSHGVLAQCCLCSLHYLLD | 425 |
| Platypus | 54 | 131 | 78 | 25 | MTSHSVPAVCCCHCPCHTRGAVPAC | 138 |
| Chicken | 135 | 200 | 66 | 21 | MPSHNVLVVYCCCCTKGRRHC | 207 |
| Flycatcher | 4 | 60 | 57 | 18 | MPGHNICTVLLLLHKESF | 77 |
| Anole lizard | 221 | 271 | 51 | 16 | MTSHSSSAVCCFHPRC | 295 |
| Coelacanth | 245 | 274 | 30 | 9 | MSDNNIPAA | 297 |
Note: Start/Stop positions from transcription start site.
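The length and overlap relationships in the table above follow from the coordinates alone. The sketch below is a minimal illustration, assuming that Start and Stop are 1-based, inclusive positions and that Stop is the last nucleotide of the stop codon (consistent with 111 nts encoding 36 aa in the human row). The class and property names are hypothetical and are not part of the study's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UorfRecord:
    """One table row: positions counted from the transcription start site."""
    species: str
    uorf_start: int
    uorf_stop: int
    morf_start: int

    @property
    def length_nt(self) -> int:
        # Inclusive coordinates (assumption): length = stop - start + 1
        return self.uorf_stop - self.uorf_start + 1

    @property
    def length_aa(self) -> int:
        # The final codon is the stop codon and encodes no amino acid
        return self.length_nt // 3 - 1

    @property
    def overlaps_morf(self) -> bool:
        # The uORF overlaps the main ORF if it ends at or after the mORF start
        return self.uorf_stop >= self.morf_start


# Values copied from the table above
RECORDS = [
    UorfRecord("Human", 307, 417, 396),
    UorfRecord("Macaque", 364, 474, 453),
    UorfRecord("Mouse lemur", 314, 424, 403),
    UorfRecord("Rabbit", 327, 440, 419),
    UorfRecord("Mouse", 224, 319, 321),
    UorfRecord("Squirrel", 279, 365, 372),
    UorfRecord("Cat", 304, 393, 400),
    UorfRecord("Armadillo", 304, 399, 401),
    UorfRecord("Tasmanian devil", 4, 66, 96),
    UorfRecord("Opossum", 333, 395, 425),
    UorfRecord("Platypus", 54, 131, 138),
    UorfRecord("Chicken", 135, 200, 207),
    UorfRecord("Flycatcher", 4, 60, 77),
    UorfRecord("Anole lizard", 221, 271, 295),
    UorfRecord("Coelacanth", 245, 274, 297),
]

for rec in RECORDS:
    print(rec.species, rec.length_nt, rec.length_aa, rec.overlaps_morf)
```

Under these assumptions, the overlap test is true only for human, macaque, mouse lemur, and rabbit; in every other listed species the uORF ends upstream of the mORF start.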
Comparison of uORF and mORF start nucleotide contexts in ABCA1 orthologous genes.
| Species | uORF nt Context | mORF nt Context |
|---|---|---|
| Human | AAACAGTTA | TGAGGGAAC |
| Macaque | AAACAGTTA | TGAGGGAAC |
| Mouse lemur | AAGCAGTTA | TGAGGTGAC |
| Rabbit | AAGCAGTTA | TGAGGTAAC |
| Mouse | AAACAGTTA | TGTGGTGAC |
| Squirrel | AAACAGTTA | TGAGGTAAC |
| Cat | AAACAGTTA | TGAGGAAAC |
| Armadillo | AAACAGTTA | TGAGGTAAC |
| Tasmanian devil | TTA | TGAGGAAAG |
| Opossum | AAGCAGTTA | TGAGGAGAG |
| Platypus | TTCCAGTTA | TGAGGAAAG |
| Chicken | CCGGAGTTA | TGAAGAACG |
| Flycatcher | TTA | TGAAGGAAG |
| Anole lizard | GAGGAGTTG | AGAAGGAAG |
| Coelacanth | AAAAAGTTA | TGGGAAAAG |
Abbreviations: nt, nucleotide; ATG codons are highlighted in red font.
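One way to compare the two start contexts is to check the nucleotide three positions upstream of the start codon, where a purine (A or G) is a well-known feature of a strong Kozak context. The sketch below is illustrative only and rests on an assumption the table does not state explicitly: that each context string lists the nucleotides immediately upstream of the ATG, so that its last character is position −1. The function name is hypothetical.

```python
def purine_at_minus3(upstream: str) -> bool:
    """Return True if position -3 relative to the start codon is A or G.

    Assumes `upstream` ends at position -1, directly before the ATG.
    """
    if len(upstream) < 3:
        raise ValueError("need at least 3 upstream nucleotides")
    return upstream[-3].upper() in ("A", "G")


# (uORF context, mORF context) pairs copied from the table above
CONTEXTS = {
    "Human": ("AAACAGTTA", "TGAGGGAAC"),
    "Mouse": ("AAACAGTTA", "TGTGGTGAC"),
    "Anole lizard": ("GAGGAGTTG", "AGAAGGAAG"),
    "Coelacanth": ("AAAAAGTTA", "TGGGAAAAG"),
}

for species, (uorf_ctx, morf_ctx) in CONTEXTS.items():
    print(species, purine_at_minus3(uorf_ctx), purine_at_minus3(morf_ctx))
```

If the assumption holds, the four species shown have a pyrimidine at −3 of the uORF start and a purine at −3 of the mORF start.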
Analysis of conserved RNA motifs and sites in ABCA1 5′UTRs of 15 vertebrate species.
| Motif Type | Motif Name | Nr. of Hits |
|---|---|---|
| ESE | beta-globin ex. 2 | 2 |
| ESE | ct/cgrp | 3 |
| ESE | | 9 |
| ESS | | 7 |
| ISE | cftr, in. 9 | 3 |
| ISE | | 7 |
| ISE | | 8 |
| ISE | | 10 |
| rho-ind. ter. | Rho-ind. ter. | 4 |
| TRM | Sp1 | 2 |
| TRM | Msx-1 | 2 |
| TRM | Freac | 2 |
| TRM | c-Ets-1_p54 | 2 |
| TRM | SMAD | 2 |
| TRM | GATA-5 | 2 |
| TRM | ZNF333 | 2 |
| TRM | ELF1 | 3 |
| TRM | SPI1 | 3 |
| TRM | MZF1 | 3 |
| TRM | MYB | 3 |
| TRM | AP-2 | 3 |
| TRM | BEN | 3 |
| TRM | Kid3 | 4 |
| TRM | PEA3 | 4 |
| TRM | E2F-3 | 4 |
| TRM | GABP | 4 |
| TRM | ZF5 | 4 |
| TRM | SOX10 | 4 |
| TRM | Elk-1 | 5 |
| TRM | ER81 | 5 |
| TRM | ETV7 | 5 |
| TRM | LXR_direct | 5 |
| TRM | | 7 |
| TRM | | 8 |
| TRM | | 8 |
| TRM | | 8 |
| TRM | | 8 |
| UTR motifs | Musashi bin. el. (MBE) | 6 |
| microRNA target sites | hsa-miR-5581-5p | 2 |
| microRNA target sites | hsa-miR-3194-3p | 2 |
| microRNA target sites | hsa-miR-4435 | 3 |
| microRNA target sites | | 9 |

Abbreviations: bin., binding; el., element; ESE, exon splicing enhancer; ESS, exon splicing silencer; ex., exon; in., intron; ind., independent; ISE, intron splicing enhancer; Nr., number; ter., terminator; TRM, transcriptional regulatory motif; UTR, untranslated region. Species evaluated: human, macaque, mouse lemur, squirrel, mouse, rabbit, cat, armadillo, Tasmanian devil, opossum, platypus, chicken, flycatcher, anole lizard, and coelacanth. Nr. of Hits gives the number of species in which the motif was found within the relevant 5′UTR in one or more copies; the most important motifs are highlighted in bold.
Figure 6. Results of DNA motif analyses performed in the Motif Alignment and Search Tool (MAST), part of the motif-based sequence analysis tools (MEME). 5′UTR DNA sequences of 15 vertebrate species were evaluated. Sequence logos of the seven significant motifs discovered by the program are shown in the upper part of the figure. Colors were assigned to the 10 motifs found (7 significant and 3 not significant), and their distribution along the 5′UTRs is shown in the bottom part of the figure. The motif distribution along the 5′UTRs is conserved, and a pattern can be recognized.
Figure 7. Secondary structure prediction of the human ABCA1 5′UTR. Two web servers, RNAstructure (secondary structure diagram) and RNAfold (mountain plot), were employed independently for this analysis. The mountain plot shows the minimum free energy prediction of secondary structure (mfe), the thermodynamic ensemble of RNA structures (pf), and the centroid structure (centroid). A mountain plot represents secondary structure as height versus position, where the height m(k) is the number of base pairs enclosing the base at position k; that is, loops correspond to plateaus (hairpin loops are peaks) and helices to slopes. Additionally, the positional entropy for each position is shown (the higher the entropy, the lower the structural stability). The start position of the upstream ORF (uORF) revealed and discussed in the text is marked with an arrow.
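The mountain-plot definition in the caption above can be sketched directly from a structure in dot-bracket notation. This is a minimal illustration, not the RNAfold implementation: it assumes a pseudoknot-free structure written with `(`, `)`, and `.`, counts only strictly enclosing pairs (a paired base does not count its own pair), and uses hypothetical function names and toy input strings rather than any ABCA1 prediction.

```python
import re


def mountain_heights(dot_bracket: str) -> list[int]:
    """Height m(k): number of base pairs strictly enclosing position k."""
    heights = []
    depth = 0  # number of currently open base pairs
    for ch in dot_bracket:
        if ch == "(":
            heights.append(depth)  # own pair not counted
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure: extra ')'")
            heights.append(depth)  # own pair not counted
        elif ch == ".":
            heights.append(depth)
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    if depth != 0:
        raise ValueError("unbalanced structure: unclosed '('")
    return heights


def count_hairpin_loops(dot_bracket: str) -> int:
    """Count hairpin loops: an opening bracket, unpaired bases, a closing bracket."""
    return len(re.findall(r"\(\.+\)", dot_bracket))


print(mountain_heights("((..))"))                  # [0, 1, 2, 2, 1, 0]
print(count_hairpin_loops("((...))..(((....)))"))  # 2
```

In the first toy structure the helix produces the rising and falling slopes and the two unpaired loop bases form the plateau at the peak, matching the description in the caption.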
Analysis of ABCA1 5′UTR secondary structures in RNAstructure and RNAfold programs.
| Species | No. of Hairpin Loops (RNAstructure, Before In. 1 Section) | No. of Hairpin Loops (RNAstructure, After In. 1 Section) | No. of Hairpin Loops (RNAfold, Before In. 1 Section) | No. of Hairpin Loops (RNAfold, After In. 1 Section) |
|---|---|---|---|---|
| Human | 6 | 1 | 6 | 1 |
| Mouse lemur | 4 | 1 | 5 | 2 |
| Rabbit | 7 | 1 | 7 | 1 |
| Mouse | 5 | 1 | 4 | 2 |
| Armadillo | 7 | 1 | 5 | 1 |
| Opossum | 7 | 1 | 5 | 1 |
| Platypus | 2 | 2 | ||
| Flycatcher | 2 | 2 | ||
| Anole lizard | 7 | 1 | 6 | 1 |
| Coelacanth | 6 | 1 | 4 | 1 |
Abbreviations: In., intron; No., number.
Summary of software and database resources.
| Resource | Function |
|---|---|
| | genome browsers for the retrieval of genomic information |
| | builds trees based on the classification in the NCBI taxonomy database |
| | phylogenetic tree development and visualisation |
| | translates nucleic acid sequences to their corresponding peptide sequences |
| | transcription, translation, reverse transcription |
| | multiple sequence alignment editing, visualisation and analysis |
| | bioinformatics resources including alignment programs |
| | searches for open reading frames in DNA sequences |
| | produces neural network predictions of translation start in nucleotide sequences |
| | location and identification of upstream open reading frames that have the potential to encode bioactive peptides |
| | predicts the initiation site of mRNA sequences that lack the preferred nucleotides |
| | GC content, GC content distribution plots |
| | finds motifs that characterize 3′UTR and 5′UTR sequences |
| | identifies functional RNA motifs and sites in RNA sequences |
| | discovers novel motifs in nucleotide or protein sequences |
| | predicts RNA secondary structures |
| | scientific data analysis, statistics |