An update on sORFs.org: a repository of small ORFs identified by ribosome profiling.

Volodimir Olexiouk¹, Wim Van Criekinge¹, Gerben Menschaert¹.

Abstract

sORFs.org (http://www.sorfs.org) is a public repository of small open reading frames (sORFs) identified by ribosome profiling (RIBO-seq). This update elaborates on the major improvements implemented since its initial release. sORFs.org now supports three additional species (zebrafish, rat and Caenorhabditis elegans) and currently includes 78 RIBO-seq datasets, a vast increase compared to the three that were processed in the initial release. To this end, a novel pipeline was constructed that also enables sORF detection in RIBO-seq datasets comprising solely elongating RIBO-seq data, whereas previously matching initiating RIBO-seq data was necessary to delineate the sORFs. Furthermore, a novel noise filtering algorithm was designed that is able to distinguish sORFs with true ribosomal activity from simulated noise, consequently reducing the false positive identification rate. The inclusion of other species also led to the development of an inner BLAST pipeline, assessing sequence similarity between sORFs in the repository. Building on the proof-of-concept model in the initial release of sORFs.org, a full PRIDE-ReSpin pipeline has now been released, reprocessing publicly available MS-based proteomics PRIDE datasets and reporting on true translation events. Next to reporting those identified peptides, sORFs.org allows visual inspection of the annotated spectra within the Lorikeet MS/MS viewer, thus enabling detailed manual inspection and interpretation.

Year: 2018; PMID: 29140531; PMCID: PMC5753181; DOI: 10.1093/nar/gkx1130

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

The probability of generating a start site (‘ATG’) by randomly sampling the nucleotide space is 1 out of 64. In addition, the probability of sampling a stop codon (‘TAA’, ‘TAG’, ‘TGA’) within the next 99 codons is ∼99%. Together, these imply that approximately 1.5% of the genome would consist of small open reading frames (sORFs, ≤300 nucleotides), assuming that the genome is generated by a random event, without considering splice events, reading frames, nucleotide biases, GC-content of the genome, or strandedness (1). Identifying translated sORFs in this vast pool of random sORFs is challenging, and is further complicated by the lack of sequence similarity between sORFs and known protein-coding ORFs (2–4). Also, RNA-sequencing is unable to delineate ORFs and MS-based proteomic approaches have difficulties in detecting small protein products, illustrating the technological complications we are facing in the micropeptide detection process (5,6). As a result of this complex process, sORFs have historically been labelled as lacking coding potential. It was the advent of ribosome profiling (RIBO-seq) (7,8) that forced us to reconsider our opinion on the truly non-coding nature of these small ORFs (9–11). Since the initial release of sORFs.org, the Ensembl consortium (12) has re-annotated 147 non-protein-coding transcripts as protein coding (updated annotation from Ensembl version 81 to 90), where the protein product is <100 AA long. This set holds 54 long non-coding RNA (lncRNA) transcripts. Our initial release nourished this growing field of sORF-encoded polypeptides by establishing a first public portal bundling this focussed information; other initiatives soon followed, such as ARA-PEP (13) and SmProt (14). Here, an update on the sORFs.org repository is provided, incorporating 78 new RIBO-seq datasets and including support for three new species, currently harbouring 34 human, 27 mouse, 5 rat, 3 zebrafish, 3 fruit fly and 6 Caenorhabditis elegans datasets.
This vast increase in the number of processed datasets (three at initial release) is mainly attributable to the development of a modified pipeline enabling the detection of sORFs in the absence of data on initiating ribosomes, where an extra noise filtering step controls for false positive events. The addition of data on new species to sORFs.org drove the development of a ‘between species’ sORF BLAST (15) to detect sORFs with sequence similarity. Next, publicly available mass spectrometry (MS) datasets from PRIDE (16) are rescanned to acquire translational evidence for sORFs, as already available in our initial release of sORFs.org (17) as a proof of concept. Additionally, a visual platform was developed allowing the inspection of annotated identified MS/MS fragmentation spectra in the Lorikeet MS/MS viewer (https://github.com/jmchilton/lorikeet). This valuable feature provides a significant advantage over conventional MS-based identification reporting, which reports identifications either by a score, as in SmProt (14), or by a static figure (18). Figure 1 summarizes the most important improvements to sORFs.org since its initial release.
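
The ∼99% and ∼1.5% figures quoted at the start of the Introduction can be reproduced with a few lines of arithmetic. The sketch below is an illustration only and assumes uniform, independent nucleotides, exactly the simplification stated in the text; the variable names are chosen here and do not come from the sORFs.org pipeline.

```python
# Back-of-the-envelope check of the random-sORF estimate.
# Assumption: every nucleotide is drawn uniformly and independently.

p_start = 1 / 64            # one specific triplet ('ATG') out of 4**3 = 64
p_stop_per_codon = 3 / 64   # any of 'TAA', 'TAG', 'TGA'
n_codons = 99               # codons following the start triplet

# Probability that at least one of the next 99 codons is a stop codon.
p_stop_within = 1 - (1 - p_stop_per_codon) ** n_codons

# Probability that a given position starts a random sORF of <= 300 nt.
p_sorf_start = p_start * p_stop_within

print(f"P(stop within {n_codons} codons)  = {p_stop_within:.4f}")
print(f"P(position starts a sORF)    = {p_sorf_start:.4f}")
```

Under these assumptions the first value comes out at roughly 0.99 and the second at roughly 0.015, in line with the estimates given in the text.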

Figure 1.

An overview of the most important improvements to sORFs.org since its initial release. The modified TIS-calling pipeline together with the noise filtering algorithm enabled the inclusion of datasets on additional species for which no initiating RIBO-seq data (LTM or HAR treated) was available. Currently, a total of 78 RIBO-seq datasets are processed, identifying numerous novel sORFs with ribosome occupancy. Implementation of the inner-BLAST pipeline revealed sORFs with sequence similarity identified in multiple species, and the PRIDE-ReSpin pipeline provides an extra layer of translation evidence based on MS data for a plethora of sORFs.

MATERIALS AND METHODS

Summary of the initial sORFs.org features

The initial release of sORFs.org provided two query interfaces. A default query interface enables quick, real-time lookup of specific sORFs, whereas a second BioMart query interface (19) provides advanced query and export functionality. The query interfaces were optimized and improved based on community requests and input. Every sORF within the repository has its own detail page, bundling all available information. All metrics and information from our initial release (17) are still present, but we would like to stress that this page also contains two RIBO-seq coverage representations. The first presents dataset-specific ribosome occupancy information within the UCSC genome browser interface (20), enabling inspection of the ribosome profile in or surrounding the sORF. The second, an intuitive visualization developed in-house, allows more detailed inspection, including selection of specific reading frames or ribosome protected fragment (RPF) lengths. In our initial release, conservation was calculated using PhyloCSF (21); the inclusion of many new datasets forced us to switch to PhastCons (22) and PhyloP (23) because of computational limitations. However, in a future release we plan to optimize and implement PhyloCSF (21). Also, the BLASTp (15) search for sORFs against the non-redundant protein database from NCBI (24,25), which is periodically updated, is presented alongside.

TIS calling

The initial TIS-calling method required data on initiating ribosomes (e.g. by means of lactimidomycin (LTM) or harringtonine (HAR) treatment), with matching data on elongating ribosomes (e.g. by means of cycloheximide (CHX) treatment) (26). Only a limited number of published studies combine the two types of ribosome profiling experiments, measuring both initiating and elongating ribosomes. This called for the development of a modified TIS-calling algorithm based solely on translating ribosomes. In a first step, all start sites are identified genome-wide, taking into account only the four most prominent start triplets ‘ATG’, ‘CTG’, ‘TTG’ and ‘GTG’, as opposed to the initial TIS-calling algorithm, which considers all near-cognate start triplets. Data on initiating ribosomes make it possible to pinpoint the correct TIS, and the lack thereof increases the difficulty of non-ATG start site detection, resulting in an increase in truncations and extensions caused by near-cognate start sites occurring by chance. However, for well-translated sORFs, data on initiating ribosomes should not be necessary for detection. Next, all start sites are scanned for an in-frame stop codon within 300 nt, both with and without considering splice information extracted from the Ensembl annotation (12). For each possible sORF, the in-frame coverage and the RPF read count are calculated. A lenient threshold of at least 10% in-frame coverage and 10 RPFs is imposed to retain sORFs. For those passing these criteria, the identified TIS are used in the assembly step as described in the initial release (17). The modified TIS-calling method enabled the addition of numerous datasets, resulting in the identification of novel sORFs as well as recurring sORFs (∼45% of sORFs are identified in multiple datasets, see supplementary file, Figure S1).
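
The scanning and thresholding steps described above can be sketched as follows. This is a minimal illustration on a single unspliced forward-strand sequence, not the sORFs.org pipeline itself. Several points are assumptions made here: "within 300 nt" is read as the stop codon ending at most 300 nt from the start, "in-frame coverage" is taken as the fraction of frame-0 positions carrying at least one P-site read, and the function names and the `psite_counts` input are hypothetical.

```python
START_CODONS = {"ATG", "CTG", "TTG", "GTG"}   # the four start triplets named in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_NT = 300                             # assumed to include the stop codon


def find_candidate_sorfs(seq, max_len=MAX_SORF_NT):
    """Return (start, end) pairs, 0-based, end exclusive, stop codon included."""
    seq = seq.upper()
    candidates = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in START_CODONS:
            continue
        # Walk codon by codon in the frame of the start site.
        limit = min(start + max_len, len(seq))
        for pos in range(start + 3, limit - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                candidates.append((start, pos + 3))
                break
    return candidates


def passes_thresholds(psite_counts, start, end, min_cov=0.10, min_rpf=10):
    """Apply the lenient filter: >= 10% in-frame coverage and >= 10 RPFs.

    psite_counts[i] is the (hypothetical) number of RPFs whose P-site maps
    to position i of the same sequence.
    """
    in_frame = psite_counts[start:end:3]
    coverage = sum(1 for c in in_frame if c > 0) / len(in_frame)
    rpf_total = sum(psite_counts[start:end])
    return coverage >= min_cov and rpf_total >= min_rpf


# Toy usage: one ATG-initiated ORF; the TTG further downstream has no in-frame stop.
toy_seq = "CCATGAAATTTGGGTAACC"
toy_counts = [0] * len(toy_seq)
toy_counts[2], toy_counts[5] = 6, 5           # 11 RPFs on two in-frame positions
for s, e in find_candidate_sorfs(toy_seq):
    print(s, e, passes_thresholds(toy_counts, s, e))
```

A real implementation would additionally handle both strands and spliced transcript models, as the text indicates.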

Noise filtering

As the novel TIS-calling algorithm does not build on two layers of evidence, comprising data from both elongating and initiating RIBO-seq experiments, (non-AUG) start site prediction becomes more difficult and more false positive results are introduced. To counteract this, an accompanying novel noise filtering approach was developed that compares the RPF occupancy of sORFs with ‘simulated’ noise, in order to assess whether these translation events are genuine. First, the transcript of the corresponding sORF is converted into a binary array, where ‘1’ represents a position covered by ribosomes and ‘0’ an uncovered position. After calculating the in-frame coverage for the sORF, this binary array is shuffled and the in-frame coverage is recalculated. This shuffling and recalculation of the in-frame coverage is repeated 10 000 times, creating a distribution of shuffled in-frame coverages that represents randomly allocated RPF coverage. Next, the probability of obtaining an in-frame coverage at least as high as the actual in-frame coverage is calculated (Figure 2). The resulting P-values are subjected to the Benjamini–Hochberg (27) procedure for multiple testing to control the FDR at α = 0.05. Notably, for intronic sORFs, the intron in which the sORF resides is considered as an exon in the noise filtering step, and for intergenic sORFs the transcript is considered to be the region 1000 nt up- and downstream of the sORF. Also, sORFs are inspected for overlap with any protein-coding exon on any transcript: sORFs overlapping with protein-coding exons are reported, and sORFs overlapping and in-frame with protein-coding exons are discarded. The noise filtering algorithm has been validated on the crappé_2014 dataset (GSM1403307) using annotated canonical protein-coding transcripts as a positive control and 3′UTR regions as a negative control. These results are represented in supplementary Figures S4 and S5.
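
A minimal sketch of the shuffling test and the subsequent multiple-testing correction is given below. It follows the description above but is not the authors' code. The definition of in-frame coverage (fraction of frame-0 positions of the sORF that are covered), the add-one correction in the empirical P-value, the fixed random seed and all function names are assumptions introduced here for illustration.

```python
import random


def in_frame_coverage(covered, start, end):
    """Fraction of frame-0 positions in [start, end) flagged as covered (1)."""
    frame0 = covered[start:end:3]
    return sum(frame0) / len(frame0)


def shuffle_p_value(covered, start, end, n_shuffles=10_000, seed=0):
    """Empirical P-value of the observed in-frame coverage.

    `covered` is the binary array for the whole transcript; the sORF
    occupies positions [start, end). The array is shuffled n_shuffles
    times and the coverage is recomputed in the sORF region each time.
    """
    rng = random.Random(seed)
    observed = in_frame_coverage(covered, start, end)
    shuffled = list(covered)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if in_frame_coverage(shuffled, start, end) >= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_shuffles + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank          # largest rank meeting the step-up criterion
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= max_rank
    return rejected


# Toy usage: a 60-nt transcript whose first 30 nt form the sORF,
# with every in-frame position of the sORF covered and nothing else.
toy = [1 if (i < 30 and i % 3 == 0) else 0 for i in range(60)]
print(shuffle_p_value(toy, 0, 30, n_shuffles=2000))
```

In practice one P-value would be computed per candidate sORF and the full list passed to the Benjamini–Hochberg step, retaining the sORFs for which the null hypothesis of randomly allocated coverage is rejected.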

Figure 2.

Visual representation of the noise filtering algorithm. The transcript of the sORF is reconstructed into a binary array, where ‘1’ represents a position covered by a ribosome P-site and ‘0’ an uncovered position. This array is then shuffled 10 000 times; each iteration calculates the in-frame coverage in the sORF region, shaping a distribution of shuffled in-frame coverage (represented in gray). Next, the probability of sampling a value equal to or greater than the actual in-frame coverage of the sORF is calculated (represented in red).

Inner BLAST

The addition of new species enabled us to investigate whether sORFs with sequence similarity across different species are present. Linking these related sORF sequences also allows experimentalists to perform functional characterization in a more convenient test model based on another organism. The inner BLAST is performed by searching for sequence similarity among sORFs identified in distinct species using BLASTp (15) at an expected value (E-value) threshold of 0.0000000001 (1 × 10⁻¹⁰). Roughly 18% of sORFs show sequence similarity with at least one other sORF (see supplementary file, Figure S2).
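
The filtering logic of such a between-species search can be sketched as below. The example assumes that BLASTp results are available in the standard 12-column tabular format (`-outfmt 6`, with the E-value in the eleventh column) and that a mapping from sORF identifier to species exists; the identifiers, the `species_of` mapping and the function name are hypothetical and do not reflect the sORFs.org implementation.

```python
E_VALUE_CUTOFF = 1e-10   # threshold stated in the text


def cross_species_hits(tabular_lines, species_of, cutoff=E_VALUE_CUTOFF):
    """Keep hits between sORFs of different species at or below the cutoff.

    tabular_lines: iterable of tab-separated BLAST lines (12 standard columns).
    species_of:    dict mapping sORF identifier -> species label (hypothetical).
    """
    hits = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue                       # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if species_of[query] != species_of[subject] and evalue <= cutoff:
            hits.append((query, subject, evalue))
    return hits


# Toy usage with made-up identifiers and scores.
species_of = {"hs_1": "human", "hs_2": "human", "mm_1": "mouse", "dr_1": "zebrafish"}
lines = [
    "hs_1\tmm_1\t92.0\t50\t4\t0\t1\t50\t1\t50\t3e-15\t110",   # kept
    "hs_1\ths_2\t99.0\t50\t1\t0\t1\t50\t1\t50\t1e-30\t150",   # same species
    "hs_2\tdr_1\t40.0\t30\t18\t0\t1\t30\t1\t30\t2e-05\t35",   # E-value too high
]
print(cross_species_hits(lines, species_of))
```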

PRIDE-ReSpin

Acquiring proteomic evidence for micropeptides has proven to be strenuous (4,5,28,29). Many features, such as their low abundance and putative hydrophobicity, but also the lack of enzymatic cleavage sites and of specific extraction protocols, make their identification hard with MS approaches. Yet, technological and computational advancements have recently resulted in the identification of several micropeptides using proteomics approaches (4,5,18,30,31). Including all possible translated sORF sequences on a genome-wide scale impairs their identification and validation by inflating the search space, which is why these micropeptide sequences are generally excluded. sORFs.org provides a focussed database of putative micropeptides with translational evidence from RIBO-seq, suitable for inclusion into the search space within proteomics experiments. Most proteomics experiments are tailored for a specific purpose and are only examined once within the context of the study. Much information thus remains undetected, a fact of which the community is becoming increasingly aware. The potential of reprocessing public proteomics datasets has been stressed (32–38), and is applied here for micropeptide detection. The PRIDE-ReSpin pipeline runs continuously, periodically adding validated peptides to sORFs.org. At the time of writing, 302 human, 126 mouse, 18 rat, 10 zebrafish and 3 C. elegans datasets were processed, identifying 463 678 PSMs that account for 10 583 uniquely identified peptides. For human, 291 3′-UTR, 675 5′-UTR, 1 954 exonic, 131 intronic, 129 intergenic and 19 lincRNA unique sORF peptides were identified (see supplementary file, Figure S3). sORFs.org allows visual inspection of the identified spectra in the Lorikeet MS/MS browser, enabling manual assessment and validation of the identifications rather than bluntly reporting identified peptides (Figure 3). A detailed description of the PRIDE-ReSpin methodology can be found in the supplementary file.

Figure 3.

General overview of the PRIDE-ReSpin pipeline. First, MS-based proteomics experiments are downloaded from the PRIDE public repository. Next, a reverse engineering mechanism based on PRIDE-ASAP and Pladipus extracts the database (DB) search parameters for that study. These are passed to the SearchGUI search engine management software, launching a DB search against a concatenated database consisting of the UniProt reference proteome, the cRAP database and the sORFs.org database, using X!Tandem and MS-GF+ as search engines. Subsequently, the output is imported into PeptideShaker to validate and export identified peptides at an FDR of 1%, with a minimum of 30% spectrum coverage and no PSMs having a higher confidence for non-sORF peptides. The resulting peptides are then imported into sORFs.org for visualization in the Lorikeet MS/MS browser.

COMPARISON WITH OTHER RESOURCES

Since the initial release of sORFs.org, several other public databases containing small open reading frame information have emerged (39). The ARA-PEP repository (http://www.biw.kuleuven.be/CSB/ARA-PEPs/) (13) focusses on Arabidopsis thaliana and presents genomic, transcriptional and conservation information in order to annotate sORFs. The SmProt repository (14) has more overlap with sORFs.org, harbouring a vast number of identified sORFs across distinct species. SmProt uses the RiboTaper (40) tool to identify putatively translated sORFs from ribosome profiling data and thus differs significantly from our approach, which is not primarily based on triplet periodicity. sORFs.org includes all sORFs with evidence of ribosome occupancy and computes various sORF translation detection metrics (e.g. FLOSS, ORFscore) alongside genomic and proteomic features, thus enabling researchers to tailor sORFs.org information to their own research projects using our query interfaces. SmProt provides limited translation detection metrics and genomic features (conservation, variation), but does detect sORFs through literature mining, a feature currently missing in sORFs.org. Furthermore, sORFs.org aims to be as transparent as possible in data acquisition and processing, providing information and statistics on the datasets used, as well as visual tools to inspect the data, for instance by representing RPF data in the UCSC genome browser (20) or in our in-house developed browser. In contrast, SmProt reports only limited genomic and RPF-based features and provides no means to inspect the credibility of the reported information. In our opinion this is a very important feature, especially in a field where false positive detection is possible. Also, SmProt reports 117 099 sORFs with MS evidence, including 83 159 exonic sORFs, 24 539 lincRNA sORFs, 5 272 antisense sORFs and 1 854 ‘sense no exonic’ sORFs.
This large number of micropeptides identified from MS information has not been corroborated by us or by other studies. As SmProt does not offer the ability to validate or inspect the MS data (only a raw score of the identified peptide is reported), these findings could not be verified. sORFs.org allows the inspection of matched fragmentation through the Lorikeet viewer and also dynamically scans additional deposited datasets through the PRIDE-ReSpin approach, which is in sharp contrast to the SmProt database.

CONCLUSION AND FUTURE PERSPECTIVES

sORFs.org now supports three additional species (rat, zebrafish and C. elegans) and includes 78 extra datasets. This has been achieved by implementing a novel TIS-calling algorithm, enabling the identification of sORFs from RIBO-seq experiments comprising solely elongating ribosome data (through CHX treatment). Moreover, a novel noise filtering algorithm was devised to distinguish sORF translation events with true ribosome occupancy from simulated noise. The addition of new species led to the development of the inner-BLAST pipeline, identifying homologous sORFs in our repository. Lastly, the PRIDE-ReSpin MS data reprocessing pipeline was released and incorporated into sORFs.org, periodically scanning publicly available datasets to acquire relevant translational evidence for sORFs. The Lorikeet MS/MS viewer enables visual inspection of the annotated fragmentation spectra. sORFs.org will continue to periodically include new datasets supporting extra species. Also, PRIDE-ReSpin will be fine-tuned and optimized, increasing the amount of processable data. To build in a second layer of translational evidence based on MS, integration of sORFs.org with PeptideAtlas (41) and neXtProt (42) is being investigated. At present, the incorporation of short linear motifs (SLiMs) into sORFs.org is being examined by exploring the potential integration with the ELM database (43). Also, ways to incorporate protein family domains and motifs, such as those in Pfam (44), are being investigated (including, e.g., transmembrane motifs (45) and signal peptides (46)). In general, integration with different sources such as HaltORF (47) and RPFdb (48) will strengthen sORFs.org by accumulating relevant evidence for translation. A text-mining approach could help the annotation of sORFs by reporting recent scientific manuscripts. In all, sORFs.org will continue to follow the sORF research community, enabling the implementation of novel features when requested.

AVAILABILITY

sORFs.org is publicly available at http://www.sorfs.org. The underlying pipelines used for sORFs.org can be made available upon request; however, they were not optimized for public usage.

43 in total

1. PRIDE: the proteomics identifications database.

Authors: Lennart Martens; Henning Hermjakob; Philip Jones; Marcin Adamski; Chris Taylor; David States; Kris Gevaert; Joël Vandekerckhove; Rolf Apweiler
Journal: Proteomics Date: 2005-08 Impact factor: 3.984

2. SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci.

Authors: Yajing Hao; Lili Zhang; Yiwei Niu; Tanxi Cai; Jianjun Luo; Shunmin He; Bao Zhang; Dejiu Zhang; Yan Qin; Fuquan Yang; Runsheng Chen
Journal: Brief Bioinform Date: 2018-07-20 Impact factor: 11.622

Review 3. Mining for Micropeptides.

Authors: Catherine A Makarewich; Eric N Olson
Journal: Trends Cell Biol Date: 2017-05-18 Impact factor: 20.808

4. State of the Human Proteome in 2014/2015 As Viewed through PeptideAtlas: Enhancing Accuracy and Coverage through the AtlasProphet.

Authors: Eric W Deutsch; Zhi Sun; David Campbell; Ulrike Kusebauch; Caroline S Chu; Luis Mendoza; David Shteynberg; Gilbert S Omenn; Robert L Moritz
Journal: J Proteome Res Date: 2015-07-24 Impact factor: 4.466

5. The UCSC Genome Browser database: 2015 update.

Authors: Kate R Rosenbloom; Joel Armstrong; Galt P Barber; Jonathan Casper; Hiram Clawson; Mark Diekhans; Timothy R Dreszer; Pauline A Fujita; Luvina Guruvadoo; Maximilian Haeussler; Rachel A Harte; Steve Heitner; Glenn Hickey; Angie S Hinrichs; Robert Hubley; Donna Karolchik; Katrina Learned; Brian T Lee; Chin H Li; Karen H Miga; Ngan Nguyen; Benedict Paten; Brian J Raney; Arian F A Smit; Matthew L Speir; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 19.160

6. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments.

Authors: Nicholas T Ingolia; Gloria A Brar; Silvia Rouskin; Anna M McGeachy; Jonathan S Weissman
Journal: Nat Protoc Date: 2012-07-26 Impact factor: 13.491

7. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue.

Authors: Jiao Ma; Carl C Ward; Irwin Jungreis; Sarah A Slavoff; Adam G Schwaid; John Neveu; Bogdan A Budnik; Manolis Kellis; Alan Saghatelian
Journal: J Proteome Res Date: 2014-02-14 Impact factor: 4.466

8. ELM 2016--data update and new functionality of the eukaryotic linear motif resource.

Authors: Holger Dinkel; Kim Van Roey; Sushama Michael; Manjeet Kumar; Bora Uyar; Brigitte Altenberg; Vladislava Milchevskaya; Melanie Schneider; Helen Kühn; Annika Behrendt; Sophie Luise Dahl; Victoria Damerell; Sandra Diebel; Sara Kalman; Steffen Klein; Arne C Knudsen; Christina Mäder; Sabina Merrill; Angelina Staudt; Vera Thiel; Lukas Welti; Norman E Davey; Francesca Diella; Toby J Gibson
Journal: Nucleic Acids Res Date: 2015-11-28 Impact factor: 16.971

9. sORFs.org: a repository of small ORFs identified by ribosome profiling.

Authors: Volodimir Olexiouk; Jeroen Crappé; Steven Verbruggen; Kenneth Verhegen; Lennart Martens; Gerben Menschaert
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971

10. PRIDE Inspector Toolsuite: Moving Toward a Universal Visualization Tool for Proteomics Data Standard Formats and Quality Assessment of ProteomeXchange Datasets.

Authors: Yasset Perez-Riverol; Qing-Wei Xu; Rui Wang; Julian Uszkoreit; Johannes Griss; Aniel Sanchez; Florian Reisinger; Attila Csordas; Tobias Ternent; Noemi Del-Toro; Jose A Dianes; Martin Eisenacher; Henning Hermjakob; Juan Antonio Vizcaíno
Journal: Mol Cell Proteomics Date: 2015-11-06 Impact factor: 5.911
