| Literature DB >> 25961948 |
Makoto K Shimada1, Noriko Sasaki-Haraguchi2,3, Akila Mayeda4.
Abstract
According to the length distribution of human introns, there is a large population of short introns with a threshold of 65 nucleotides (nt) and a peak at 85 nt. Using human genome and transcriptome databases, we investigated the introns shorter than 66 nt, termed ultra-short introns, the identities of which are scarcely known. Here, we provide for the first time a list of bona fide human ultra-short introns, which have never been characterized elsewhere. By conducting BLAST searches of the databases, we screened 22 introns (37-65 nt) with conserved lengths and sequences among closely related species. We then provide experimental and bioinformatic evidence for the splicing of 15 introns, of which 12 introns were remarkably G-rich and 9 introns contained completely inefficient splice sites and/or branch sites. These unorthodox characteristics of ultra-short introns suggest that there are unknown splicing mechanisms that differ from the well-established mechanism.Entities:
Mesh:
Substances:
Year: 2015 PMID: 25961948 PMCID: PMC4463651 DOI: 10.3390/ijms160510376
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1The length distribution of human introns (≤949 nt). (A) The lengths of introns with terminal GT and AG bases were plotted (red shading) based on the calculated lengths using H-InvDB annotation. The red arrow indicates the threshold (65 nt) from which the number of introns with GT–AG increases drastically toward the first peak at 83 nt; and (B) The lengths of introns with terminal non-GT and non-AG bases were plotted (green shading) in the same way. The rate of the number of non-GT–AG introns to the number of GT–AG introns was plotted with black line. The red arrow indicates the threshold (65 nt) from which the ratios of non-GT–AG introns to GT–AG introns decreased markedly.
Candidate of human introns (≤65 nt) conserved in both genome and transcriptome sequences.
| SN a | Length (nt) b | ID number of HIT c | Intron number d | Total no. of introns e | Site of intron | Data in Ensembl f | AA-seq g | Intron frequency h | RT–PCR analysis * i | RNA-Seq data * j | Individually sequenced * k | Confirm. l | ID number of HIX m | Host gene (HGNC) n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | HIT000059291 | 1 | 3 | CDS | Yes | I | 1/2 | Expressed | No | HIX0029777 | AQP12A | |||
| 2 | 41 | HIT000276161 | 4 | 4 | CDS | Yes | II | 1/23 | Expressed | No | RM | No | HIX0001032 | ENSA |
| 43 | HIT000008845 | 6 | 14 | CDS | Yes | I | 1/4 | No | HIX0013170 | ESRP2 | ||||
| 47 | HIT000325704 | 2 | 15 | CDS | Yes | II | 1/1 | No-Exp | No | HIX0003317 | IFRD2 | |||
| 49 | HIT000009363 | 12 | 13 | CDS | Yes | I | 3/11 | No | HIX0023123 | NDOR1 | ||||
| 6 | 50 | HIT000084762 | 8 | 10 | CDS | Yes | III | 1/13 | Expressed | No | No | No | HIX0022245 | SAMD14 |
| 54 | HIT000325704 | 3 | 15 | CDS | Yes | II | 1/1 | Expressed | No | HIX0003317 | IFRD2 | |||
| 8 | 54 | HIT000333308 | 1 | 2 | CDS | No | VII | 1/1 | Expressed | No | No | No | HIX0059400 | HSP90B2P |
| 9 | 55 | HIT000278575 | 1 | 5 | CDS | No | IV | 1/7 | Expressed | No | RM | No | HIX0006057 | AKIRIN2 |
| 56 | HIT000192494 | 7 | 13 | CDS | Yes | I | 9/10 | HIX0005482 | HNRNPH1 | |||||
| 11 | 61 | HIT000302202 | 1 | 13 | 5′ UTR | Yes | I | 1/15 | Expressed | No | RM | No | HIX0001133 | MSTO1 |
| 62 | HIT000279220 | 1 | 7 | CDS | Yes | I | 9/11 | HIX0027515 | SIGLEC6 | |||||
| 13 | 62 | HIT000333305 | 1 | 2 | CDS | Yes | II | 1/1 | Expressed | No | No | No | HIX0202199 | HSP90AB4P |
| 62 | HIT000495960 | 1 | 6 | CDS | Yes | II | 5/5 | No | HIX0202884 | SIGLECP3 | ||||
| 63 | HIT000191419 | 3 | 4 | CDS | Yes | I | 3/3 | n/a | No | HIX0079411 | PRH1 | |||
| 16 | 63 | HIT000091849 | 1 | 2 | 5′ UTR | Yes | VI | 1/1 | Expressed | No | No | No | HIX0036362 | – |
| 65 | HIT000324311 | 10 | 28 | CDS | Yes | II | 1/1 | HIX0003640 | PLXNA1 | |||||
| 65 | HIT000058074 | 1 | 20 | CDS | Yes | I | 1/4 | No-PCR | HIX0034231 | RECQL4 | ||||
| 65 | HIT000052133 | 11 | 13 | CDS | Yes | IV | 2/2 | No-PCR | No | HIX0026183 | C11orf35 | |||
| 65 | HIT000082518 | 3 | 11 | CDS | Yes | I | 1/9 | No | HIX0202311 | PDIA2 | ||||
| 65 | HIT000252921 | 4 | 4 | CDS | Yes | I | 6/8 | UC | HIX0028549 | TNFRSF18 | ||||
| 65 | HIT000058190 | 7 | 26 | CDS | Yes | I | 4/4 | HIX0039022 | ADAM11 |
a Serial number (SN). The 15 ultra-short introns that were confirmed are highlighted in bold font (see “Confirmation”); b Intron length within the ultra-short range (≤65 nt); c H-InvDB transcript (HIT) identifier; d Intron number (position) in the host gene (in the HITs); e Total number of introns (in the HITs); f Whether or not the intron is also found in the Ensembl transcript database (“Yes” or “No”); g Levels of sequence similarity of the encoded amino-acids (AA) sequence to known proteins or protein domains; h Intron frequency in the aligned HITs represented by the ratio of the number of HITs spliced at the ultra-short introns to the number of all aligned HITs across the ultra-short intron region; i The RT–PCR detection of the endogenous splicing or transcription. “Spliced”: splicing of endogenous ultra-short intron was observed. “Expressed”: splicing was not observed but transcription was observed; “No-Exp”: expression was not detected by RT–PCR but genomic PCR worked properly. “No-PCR”: PCR did not work, even with genomic DNA. “n/a”: RT–PCR could not performed because of the difficulty in designing a primer for the repetitive region; j Whether splicing of the ultra-short introns could be checked in mRNA-Seq data (“Yes”) or not (“No”); k Whether the transcripts were cloned and sequenced individually by researchers in the INSDC databases (“Yes”) or whether they just automatically sequenced in high-throughput studies (“No”). “RM” indicates that the original accession data were removed by the contributors, and “UC” indicates that the description was unclear. In the case of SN12, the original accession data were removed (CR600025 in INSDC), but the individually sequenced data were proposed by another source (D86358 in INSDC); l Confirmation of the ultra-short introns if at least one of three experimental studies (labeled with *) is positive (“Yes”), otherwise no evidence (“No”); m H-Invitational cluster (locus, HIX) identifier; n Approved gene symbols by HUGO Gene Nomenclature Committee (HGNC).
Figure 2Splicing of 9 human ultra-short introns was detected in the indicated human cells (Hep, HepG2 cells; MDA, MDA-MB231 cells) and tissues (Cer, Cerebrum; Pla, Placenta; Leu, Leukocytes). See Table 1 for the gene names and serial numbers (SN). RT–PCR targeting the indicated endogenous gene transcripts was performed. G indicates the fragments that were amplified from genomic DNA. The amplified fragments corresponding to the pre-mRNAs and spliced mRNAs, separated by 5% PAGE, are indicated on the right with their schematic structures and the lengths of the introns (in nt). Asterisks (*) indicate nonspecific by-products that were not relevant to splicing. See Table S3 for more detailed information.
Sequence analyses and scoring of essential splicing signals of selected ultra-short introns.
See Table 1 for the introns and their corresponding serial numbers (SN; bold numbers are the confirmed introns). a The consensus human branch site sequences, (C/T)TNA(C/T) [16], are underlined. The G nucleotide and the branched A nucleotide are highlighted in red and blue, respectively. Previously identified G-rich ISSs [8] are indicated with boxes; b The most frequent bases are indicated with underlined bold font; c The 5′ splice site, 3′ splice site, and branch site sequences were scored by SRROGLE [17]; ΔG (free energy of the base pairing with U1 snRNA), MAX (maximum entropy model) and S&S (Shapiro and Senapathy score), Kol (human-mouse comparative analysis) and Sch. (large-scale comparative analysis in eukaryotes). The detailed estimations of these algorithms have been reported elsewhere [18,19]. “NA” indicates that the value was not available due to lack of a target sequence. The inefficient splice sites and branch sites, with scores of <0.1 or “NA” for all the values in each pair/triplet, are highlighted in red.