Alyssa Zi-Xin Leong1, Pey Yee Lee1, M Aiman Mohtar1, Saiful Effendi Syafruddin1, Yuh-Fen Pung2, Teck Yew Low3.
Abstract
A short open reading frame (sORF) comprises ≤ 300 bases and encodes a microprotein, or sORF-encoded protein (SEP), of ≤ 100 amino acids. Traditionally dismissed by genome annotation pipelines as meaningless noise, sORFs were found to possess coding potential with ribosome profiling (RIBO-Seq), which unveiled sORF-based transcripts at various genome locations. Nonetheless, the existence of corresponding stable and functional microproteins was initially little substantiated by experimental evidence. With recent advancements in multi-omics, the identification, validation, and functional characterisation of sORFs and microproteins have become feasible. In this review, we discuss the history and development of the emerging research field of sORFs and microproteins. In particular, we focus on an array of bioinformatics and OMICS approaches used for predicting, sequencing, validating, and characterising these recently discovered entities. These strategies include RIBO-Seq, which detects sORF transcripts via ribosome footprints, and mass spectrometry (MS)-based proteomics for sequencing the resultant microproteins. Subsequently, our discussion extends to the functional characterisation of microproteins by incorporating CRISPR/Cas9 screens and protein-protein interaction (PPI) studies. Our review not only discusses detection methodologies but also highlights the challenges and potential solutions in identifying and validating sORFs and their microproteins. The novelty of this review lies in its validation of the functional role of microproteins, which could contribute towards the future landscape of microproteomics.
Keywords: Mass spectrometry; Microproteins; Proteogenomics; Ribosome profiling (RIBO-Seq); Short open reading frame (sORF); Small open reading frame (smORF)
Year: 2022 PMID: 35300685 PMCID: PMC8928697 DOI: 10.1186/s12929-022-00802-5
Source DB: PubMed Journal: J Biomed Sci ISSN: 1021-7770 Impact factor: 8.410
Fig. 1 A comparison between sORF and altORF transcripts in terms of length and initiation codons. (A) sORF transcript structure with AUG or non-AUG initiation codons, characterised by its short length of up to 100 codons after post-transcriptional modifications. (B) altORF transcript structure with an AUG initiation codon, longer than 30 codons and without an upper limit on length, differing from sORFs
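The sORF definition in this caption can be illustrated with a minimal sketch that scans an RNA sequence in three forward frames and reports ORFs of at most 100 codons. This is not one of the prediction tools discussed in the review. The function name `find_sorfs`, the `Orf` record, and the choice of CUG and GUG as the accepted non-AUG start codons are assumptions made for illustration only, and the codon count here includes the start codon but excludes the stop codon.

```python
from typing import List, NamedTuple

STOP_CODONS = {"UAA", "UAG", "UGA"}
# Assumption: AUG plus two near-cognate codons stand in for "non-AUG" starts.
START_CODONS = {"AUG", "CUG", "GUG"}


class Orf(NamedTuple):
    start: int        # 0-based index of the first base of the start codon
    end: int          # index one past the last base of the stop codon
    start_codon: str  # the initiation codon that opened the ORF
    codons: int       # sense codons, start included, stop excluded


def find_sorfs(seq: str, max_codons: int = 100) -> List[Orf]:
    """Return forward-strand ORFs with at most `max_codons` codons.

    For each stop codon, only the most upstream in-frame start is reported.
    ORFs that run off the end of the sequence without a stop are ignored.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
    hits: List[Orf] = []
    for frame in range(3):
        open_at = None  # position of the start codon of the ORF being read
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if open_at is None and codon in START_CODONS:
                open_at = i
            elif open_at is not None and codon in STOP_CODONS:
                n_codons = (i - open_at) // 3
                if n_codons <= max_codons:
                    hits.append(Orf(open_at, i + 3, rna[open_at:open_at + 3], n_codons))
                open_at = None
    return hits
```

A call such as `find_sorfs("CCAUGGCUAAAUAGCC")` reports a single three-codon ORF that starts at the AUG. Real pipelines would also scan the reverse strand and weigh coding potential, which this sketch deliberately leaves out.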
Fig. 2 Localities of sORFs in the genome and transcripts. Genomic locations of sORFs include the 5’ UTR (uORF), the 3’ UTR (dORF), regions overlapping the main ORF, intergenic regions and pseudogenes. sORF-containing long intergenic non-coding RNAs (lincRNAs) are also localised in the nucleus. In the mitochondria, sORFs are found in the mitochondrial DNA (mtDNA). In the cytoplasm, sORFs are scattered across different RNA transcripts, i.e., circular RNA (circRNA), long non-coding RNA (lncRNA), and pri-microRNA
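For sORFs that share a transcript with a main ORF, the positional classes can be expressed as a simple coordinate comparison. The sketch below takes uORF to mean an sORF lying entirely upstream of the main ORF (in the 5’ UTR) and dORF to mean one lying entirely downstream (in the 3’ UTR). The function name `classify_sorf` and the use of half-open, 0-based transcript coordinates are assumptions for illustration, and the single "overlapping" label does not distinguish partial overlap from full nesting.

```python
def classify_sorf(sorf_start: int, sorf_end: int, cds_start: int, cds_end: int) -> str:
    """Classify an sORF relative to the main ORF (CDS) on the same transcript.

    All coordinates are 0-based, half-open [start, end), and read 5' to 3'.
    """
    if sorf_start >= sorf_end or cds_start >= cds_end:
        raise ValueError("intervals must have start < end")
    if sorf_end <= cds_start:
        return "uORF"         # entirely within the 5' UTR
    if sorf_start >= cds_end:
        return "dORF"         # entirely within the 3' UTR
    return "overlapping"      # shares at least one base with the main ORF
```

For a transcript whose main ORF spans positions 200 to 1100, `classify_sorf(50, 140, 200, 1100)` returns `"uORF"`.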
sORF prediction tools
| Prediction tool | Description |
|---|---|
| Coding Non-Coding Identifying Tool (CNIT) | Distinguishes between coding and non-coding regions based on intrinsic sequence composition |
| Coding Region Identification Tool Invoking Comparative Analysis (CRITICA) | Analyses nucleotide sequence composition and conservation at the amino acid level |
| Coding Potential Calculator (CPC)/CPC2 | Assesses protein-coding potential based on important features (ORF size, coverage, integrity); CPC2 improves run speed and accuracy |
| Coding Potential Predictor (CPPred) | Predicts the coding potential of RNA transcripts |
| CPPred-sORF | Extends CPPred with two new features (GCcount and mRNN-11codons) and with CUG and GUG start codons |
| MicroPeptide Tool (MiPepid) | Identifies coding sORFs based on a set of existing microproteins |
| sORF Finder | Identifies sORFs with high coding potential based on nucleotide composition bias and potential functional constraint at the amino acid level |
| smORFunction | Provides function prediction of sORFs/microproteins |
| miPFinder | Identifies and evaluates microprotein functionality using information on size, domain, protein interactions and evolutionary origin |
| PhastCons | Based on conservation scoring and identification of conserved elements |
| PhyloCSF | Determines a conserved protein-coding region based on formal statistical comparison of phylogenetic codon models |
| uPEPperoni | Specifically for 5’ UTR sORFs, based on conservation |
| AnABLAST | Identifies putative protein-coding regions in DNA regardless of ORF length and reading frame shifts |
| Small Peptide Alignment Discovery Application (SPADA) | Homology-based gene prediction programme |
| Deep Neural Network for coding potential prediction (DeepCPP) | Effective for RNA coding potential prediction, specifically sORF mRNA prediction |
This table shows prediction tools that can be used for putative sORF detection based on sequence homology and similarity in all genomes. CNIT and CPPred utilise a positive set of normal-sized proteins and may not be optimised for sORF and microprotein detection. CPPred-sORF is an improved version of CPPred for sORF detection. MiPepid, sORF Finder, miPFinder and smORFunction are designed especially for sORF detection, identification, and function prediction. PhastCons, PhyloCSF, SPADA and uPEPperoni utilise conservation analyses for prediction, with the latter designed specifically for sORFs in the upstream region. DeepCPP is based on a deep learning method to evaluate RNA coding potential and demonstrated high performance on sORF data
Fig. 3 Ribosome profiling process where ribosome footprints are obtained for deep sequencing. Ribosome-bound mRNAs are isolated by treatment with non-specific nucleases such as RNase I or micrococcal nuclease. Ribosome footprints (showing positioning between the start and stop codons of a gene) are then used for library generation and deep sequencing. Novel small peptides can be identified because the actively translated regions of the transcript are isolated and mapped directly back to genomic coding regions
Fig. 4 Mass spectrometry-based approaches to isolate microproteins. Sample preparation prior to LC–MS/MS analysis to isolate microprotein species < 30 kDa in size includes size exclusion approaches. Molecular weight cut-off filters (MWCOs) can sieve for microproteins depending on the type of filter used, i.e., 10 kDa or 30 kDa. Acid precipitation is a common enrichment step used to precipitate larger proteins. Solid phase extraction (SPE) enrichment occurs via reverse-phase C8 cartridges and elutes the microproteins of interest. Further methods for reducing sample complexity include electrostatic repulsion-hydrophilic interaction chromatography (ERLIC) and high-resolution isoelectric focusing (Hi-RIEF). ERLIC separates analytes based on charge and utilises SAX resin for strong anion exchange, whereas Hi-RIEF separates peptides based on their isoelectric points (pI) on a pH gradient gel. Post-fractionation accuracy is dependent on high sequence coverage and low background noise in mass spectra. This can be achieved by using High-energy Collision Induced Dissociation (HCD) on a Fusion Tribrid MS or Q-Exactive MS
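To make the choice between a 10 kDa and a 30 kDa MWCO filter concrete, the following minimal sketch estimates the average molecular weight of an unmodified peptide from its sequence and compares it with a nominal cut-off. It uses standard average residue masses plus one water per chain. The function names `molecular_weight_da` and `passes_mwco` are hypothetical, and real filters do not separate as sharply as a single threshold suggests.

```python
# Average residue masses in daltons (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini


def molecular_weight_da(peptide: str) -> float:
    """Average molecular weight of an unmodified linear peptide, in daltons."""
    total = WATER_MASS
    for residue in peptide.upper():
        if residue not in AVG_RESIDUE_MASS:
            raise ValueError(f"unknown residue: {residue!r}")
        total += AVG_RESIDUE_MASS[residue]
    return total


def passes_mwco(peptide: str, cutoff_kda: float = 10.0) -> bool:
    """True if the peptide is lighter than the nominal filter cut-off."""
    return molecular_weight_da(peptide) / 1000.0 < cutoff_kda
```

The heaviest possible 100-residue chain, one made only of tryptophan, weighs about 18.6 kDa. It would be retained by a nominal 10 kDa filter but would pass a 30 kDa one, which is one reason the filter type matters for microproteins near the 100 amino acid limit.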
Online repositories tailored for sORF identification
| Database | Type | Description |
|---|---|---|
| sORFs.org | sORF repository | Obtains experimental data from RIBO-seq, with conservation analyses and rescanning of MS data from PRIDE for updated small peptide validation |
| SmProt | sORF repository | Database of small proteins, specifically from lncRNA; obtains data from RIBO-seq, literature mining and MS data, and integrates conservation analyses |
| OpenProt | altORF resource | Contains information on protein isoforms and altORFs with experimental evidence; integrates RIBO-seq, MS, conservation analyses and functional domains |
| ARA-PEPs | sORF repository | Repository of putative sORF-encoded peptides specifically in A. thaliana |
| PsORF | sORF repository | Database of sORFs across different plant species, incorporating genomic, transcriptomic, RIBO-Seq and MS data |
| MetamORF | sORF repository | A repository of unique sORFs in |
| nORFs.org | novel ORF (nORF) repository | Provides aggregated information from databases such as sORFs.org, OpenProt and OpenCB |
This table shows the publicly available databases for sORF identification. sORFs.org and OpenProt evaluate protein sequence identity based on BLASTp score, whereas SmProt provides a BLAST alignment search for manual evaluation of protein sequence identity. OpenProt annotates sORFs, but under the label of altORFs that are longer than 30 codons and originate from ncRNAs, pseudogenes or transcripts with multiple ORFs; hence, the limits set during search identification should be noted. ARA-PEPs was developed specifically from A. thaliana sORF experimental data, and PsORF aims to store a more complete record of plant sORFs. A large bulk of both MetamORF and nORFs.org data was obtained from sORFs.org and OpenProt. nORFs.org additionally provides a protein sequence viewer, OpenCB variants and customisable annotation metric functions