Jeroen Crappé, Wim Van Criekinge, Geert Trooskens, Eisuke Hayakawa, Walter Luyten, Geert Baggerman, Gerben Menschaert.
Abstract
BACKGROUND: It was long assumed that proteins are at least 100 amino acids (AAs) long. Moreover, short translation products (e.g. those encoded by small open reading frames, sORFs) are very difficult to detect, because their short length makes it hard to distinguish true coding ORFs from ORFs occurring by chance. Nevertheless, over the past few years many such non-canonical genes (with ORFs < 100 AAs) have been discovered in different organisms, such as Arabidopsis thaliana, Saccharomyces cerevisiae, and Drosophila melanogaster. Thanks to advances in sequencing, bioinformatics and computing power, it is now possible to scan the genome with unprecedented scrutiny, for example in search of this type of small ORF.
Year: 2013 PMID: 24059539 PMCID: PMC3852105 DOI: 10.1186/1471-2164-14-648
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Basic sORF characteristics
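| Chromosome | Chromosome length (bp) | Total sORFs | Distribution (total sORFs) | sORFs with high coding potential | Distribution (high coding potential) |
|---|---|---|---|---|---|
| 1 | 197,195,432 | 3,070,032 | 128 | 160,770 | 2,453 |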
| 2 | 181,748,087 | 2,830,394 | 128 | 176,654 | 2,058 |
| 3 | 159,599,783 | 2,507,691 | 127 | 124,217 | 2,570 |
| 4 | 155,630,120 | 2,385,489 | 130 | 155,419 | 2,003 |
| 5 | 152,537,259 | 2,335,678 | 131 | 158,789 | 1,921 |
| 6 | 149,517,037 | 2,342,614 | 128 | 130,505 | 2,291 |
| 7 | 152,524,553 | 2,235,697 | 136 | 146,314 | 2,085 |
| 8 | 131,738,871 | 1,990,727 | 132 | 134,093 | 1,965 |
| 9 | 124,076,172 | 1,910,809 | 130 | 126,743 | 1,958 |
| 10 | 129,993,255 | 2,024,292 | 128 | 121,848 | 2,134 |
| 11 | 121,843,856 | 1,845,184 | 132 | 142,610 | 1,709 |
| 12 | 121,257,530 | 1,875,766 | 129 | 109,403 | 2,217 |
| 13 | 120,284,312 | 1,867,333 | 129 | 108,919 | 2,209 |
| 14 | 125,194,864 | 1,959,570 | 128 | 102,912 | 2,433 |
| 15 | 103,494,974 | 1,599,415 | 129 | 101,315 | 2,043 |
| 16 | 98,319,150 | 1,528,958 | 129 | 81,787 | 2,404 |
| 17 | 95,272,651 | 1,441,669 | 132 | 102,617 | 1,857 |
| 18 | 90,772,031 | 1,404,482 | 129 | 80,524 | 2,255 |
| 19 | 61,342,430 | 912,412 | 134 | 65,280 | 1,879 |
| X | 166,650,296 | 2,594,439 | 128 | 82,073 | 4,061 |
| Y | 15,902,555 | 41,696 | 762 | 1,566 | 20,310 |
Overview of putatively coding sORFs grouped by Mus musculus chromosomes, showing the total number and the distribution of sORFs for each chromosome, as well as the number and distribution of sORFs with high coding potential according to sORFfinder.
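The distribution columns appear consistent with an average spacing between sORFs when both strands are counted, that is, twice the chromosome length divided by the number of sORFs. This reading is inferred from the table values and is not a definition given in the text; the function name `mean_spacing` and the `strands` parameter are hypothetical. A minimal Python sketch that recomputes the value for a few rows:

```python
def mean_spacing(chrom_length_bp: int, n_sorfs: int, strands: int = 2) -> float:
    """Average number of bases per sORF.

    Assumption (not stated in the source): both strands are scanned,
    so the searchable length is strands * chromosome length.
    """
    return strands * chrom_length_bp / n_sorfs


# (chromosome length, total sORFs, reported distribution) copied from the table
rows = {
    "1": (197_195_432, 3_070_032, 128),
    "7": (152_524_553, 2_235_697, 136),
    "X": (166_650_296, 2_594_439, 128),
}

for chrom, (length, count, reported) in rows.items():
    # Print the recomputed value next to the reported one for comparison
    print(chrom, round(mean_spacing(length, count)), reported)
```

The same function applied to the high-coding-potential counts gives values close to the last column, which supports the interpretation without confirming it.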
Figure 1. Overview of the coding sORF prediction. (A) Histogram of the total number of sORFs by ORF length (in AA). (B) Distribution of sORFs according to their genomic location. sORFs overlapping more than one category are grouped as “others”. (C) Evaluation of the sORF coding probability. The fractions of annotated and predicted coding and non-coding sORFs within the test dataset are plotted. (D) Visual representation of the classification of all 9,612 test subjects, based on both SVMs (SVMlight and libSVM). True coding sORFs are depicted in green and true non-coding sORFs in red (see Additional file 1: Figure S2).
Coding potential of sORFs in different genomic locations
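| Genomic region | High coding potential, sORFfinder (a) | Coding, SVMlight (b) | Pcod above lower threshold (c) | Pcod above higher threshold (c) | Ribo sORFs (d) | Ribo sORFs classified as coding (e) |
|---|---|---|---|---|---|---|
| ncRNA | 20,810 | 9,922 | 6,443 | 1,100 | 528 | 401 |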
| Exonic | 63,180 | 34,063 | 21,546 | 10,872 | | |
| Other | 155,633 | 80,891 | 37,730 | 9,894 | | |
| Intronic | 417,277 | 34,845 | 14,582 | 2,361 | | |
| Intergenic | 1,757,458 | 223,235 | 107,567 | 27,371 | 226 | 89 |
Number of sORFs per genomic region for which in silico and/or expression evidence was found. Included are the total number of sORFs with high coding potential (according to sORFfinder), the number of sORFs scoring above certain thresholds (according to SVM analysis), the number of sORFs showing ribosome profiling expression, and the number of sORFs for which both in silico coding evidence and expression evidence are available.
a Total number of sORFs with high coding potential according to sORFfinder.
b Total number of sORFs classified as coding by SVMlight.
c Pcod is the coding probability score as predicted by SVMlight.
d sORFs with mapped ribosome profiles, attaining sequence read coverage > 75% of the total ORF (based on cycloheximide treatment), and ribosome profile hits at the ORF start site (based on harringtonine treated samples).
e Ribo sORFs (see under d for description) classified as coding by SVMlight.
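The criteria in footnotes d and e can be read as a two-stage filter. The sketch below is a minimal illustration in Python, not the authors' implementation: the `SorfProfile` record, its field names, and the choice of "at least one read" as a start-site hit are all assumptions made for this example. Only the 75% coverage cutoff comes from footnote d.

```python
from dataclasses import dataclass


@dataclass
class SorfProfile:
    """Hypothetical per-sORF summary of ribosome profiling and SVM results."""
    sorf_id: str
    chx_coverage: float    # fraction (0..1) of the ORF covered by cycloheximide reads
    harr_start_reads: int  # harringtonine reads mapped at the ORF start site
    svm_coding: bool       # True if SVMlight classified the sORF as coding


def has_ribo_evidence(p: SorfProfile,
                      min_coverage: float = 0.75,
                      min_start_reads: int = 1) -> bool:
    """Footnote d: coverage > 75% of the ORF plus a hit at the start site.

    min_start_reads=1 is an assumed definition of a "hit".
    """
    return p.chx_coverage > min_coverage and p.harr_start_reads >= min_start_reads


def is_ribo_and_coding(p: SorfProfile) -> bool:
    """Footnote e: ribosome profiling evidence and an SVMlight coding call."""
    return has_ribo_evidence(p) and p.svm_coding


profiles = [
    SorfProfile("sorf_A", 0.90, 3, True),   # passes d and e
    SorfProfile("sorf_B", 0.80, 0, True),   # no start-site hit
    SorfProfile("sorf_C", 0.95, 2, False),  # passes d only
]
print([p.sorf_id for p in profiles if has_ribo_evidence(p)])
print([p.sorf_id for p in profiles if is_ribo_and_coding(p)])
```

Keeping the two predicates separate mirrors the table, where column d counts sORFs with expression evidence and column e counts the subset that is also classified as coding.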
Figure 2. The combined approach identifies many putatively functional sORFs in intergenic regions. (A) Visual representation of the intergenic sORF located on the forward strand of chromosome 11 (69,794,326-69,794,388), based on data from the H2G2 genome browser. (B) DNA multiple alignments for the intergenic sORF presented in Figure 2A, based on the 8 species under investigation from the UCSC mm9 multi-species alignment. (C) Visual representation of the overlap between intergenic sORFs with ribosome profiling evidence and the classified test subjects. True coding sORFs are depicted in green and true non-coding sORFs in red (see Additional file 1: Figure S2); black dots represent the intergenic sORFs. Classification and presentation are based on the coding probability scores from the 2 SVMs used during the analysis (see Methods). (D) Visual representation of the intergenic sORF located on the reverse strand of chromosome X (71,212,050-71,212,082), based on data from the H2G2 genome browser. The sORF is located approximately 400 bp upstream of a known protein-coding gene (Hcfc1).
Figure 3. The combined approach identifies many putatively functional sORFs in ncRNA regions. (A) Visual representation of the lincRNA-overlapping sORF located on the reverse strand of chromosome 2 (127,618,033-127,618,203), based on data from the H2G2 genome browser. (B) Visual representation of the overlap between ncRNA-overlapping sORFs with ribosome profiling evidence and the classified test subjects. True coding sORFs are depicted in green and true non-coding sORFs in red (see Additional file 1: Figure S2); black dots represent the ncRNA-overlapping sORFs. Classification and presentation are based on the coding probability scores from the 2 SVMs used during the analysis (see Methods). (C) AA multiple alignments for the lincRNA-overlapping sORF presented in Figure 3A, based on the 8 species under investigation from the UCSC mm9 multi-species alignment. Next to the AA sequence for each species, a conservation line annotated as synonymous (S) or non-synonymous (N) is added to aid interpretation (see Additional file 1: Figure S1 for the complete sORF overview file).
Figure 4. General layout of the identification pipeline. The identification pipeline consists of the steps outlined in the workflow. Central to the analysis is the MySQL sORF database, where all obtained and calculated data are stored. This overall sORF data matrix can be downloaded via Additional file 1. (1) Genome-wide search for sORFs (with high coding potential) with the sORFfinder package. (2) Calculation of different peptide conservation measures based on the UCSC Mouse multiple alignments. (3) Coding capability assessment of the sORFs by means of a Support Vector Machine (SVM) learning algorithm. (4) Inspection of the sORF locations for the presence of ribosome profiling signals obtained from mESC experiments. (5) Genome-wide visualization of all (experimental) data and all sORF information on our in-house developed H2G2 Genome Browser.
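To make step (1) concrete, the following Python sketch scans a DNA string for ATG-to-stop ORFs in the three forward reading frames and keeps those within a length window. It is a naive illustration of what an sORF search involves, not the sORFfinder algorithm, which additionally scores coding potential. The function name `find_sorfs` and the default bounds of 10 to 100 AAs are assumptions chosen for this example; scanning the reverse strand would require running the same function on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq: str, min_aa: int = 10, max_aa: int = 100):
    """Return (start, end, n_aa) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon, and n_aa counts codons from ATG up to (not including)
    the stop. The first ATG after the previous stop is used, so each stop
    codon yields at most one ORF per frame.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3
                if min_aa <= n_aa <= max_aa:
                    found.append((start, i + 3, n_aa))
                start = None  # resume the search after this stop codon
    return found


# Toy sequence: ATG, ten GCT codons, then a TAA stop
toy = "ATG" + "GCT" * 10 + "TAA"
print(find_sorfs(toy))
```

In a pipeline like the one in the figure, the tuples returned here would be the records written to the central database before conservation, SVM and ribosome profiling evidence are attached.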