Jianfeng Pan1, Ruijun Wang1,2,3,4, Fangzheng Shang1, Rong Ma1, Youjun Rong1, Yanjun Zhang1,2,3,4.
Abstract
Long non-coding RNAs (lncRNAs) were originally defined as non-coding RNAs (ncRNAs) that lack protein-coding ability. However, with the emergence of technologies such as ribosome profiling sequencing and ribosome-nascent chain complex sequencing, it has been demonstrated that most lncRNAs contain short open reading frames (sORFs) and hence have the potential to encode functional micropeptides. Such micropeptides have been reported to be widely involved in life-sustaining activities in several organisms, including homeostasis regulation, disease and tumor occurrence and development, and the morphological development of animals and plants. In this review, we focus on the latest developments in the field of lncRNA-encoded micropeptides and describe the relevant computational tools and techniques for micropeptide prediction and identification. This review aims to serve as a reference for future research on lncRNA-encoded micropeptides.
Keywords: Ribo-seq; coding potential prediction; lncRNA; micropeptide; sORF
Year: 2022 PMID: 35769907 PMCID: PMC9234465 DOI: 10.3389/fmolb.2022.817517
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
Advantages and disadvantages of translatomics-related techniques.
| Technique | Advantages | Disadvantages | References |
|---|---|---|---|
| Polysome profiling | RNC-mRNA can be obtained; mRNAs of any length and sequence variation, and the number of ribosomes on each mRNA, can be detected | In-depth analysis of all translated mRNAs is difficult | |
| RNC-Seq | Effectively reveals full-length information on the RNA being translated, including abundance and type | Prone to ribosome dissociation or RNA degradation after cell lysis; low sequencing precision; no access to ribosome, ORF, or uORF information | |
| TRAP-Seq | RNC-mRNA can be obtained; avoids contamination by eliminating the need for ultracentrifugation; can isolate RNC-mRNA from complex tissues and specific cell types | Stably transfected cell lines must be established to produce labeled ribosomal proteins; over-labeling of ribosomal proteins may alter the structure and properties of the ribosome | |
| Ribo-Seq | Accurately locates genes under translation; accurately quantifies gene translation levels; instantaneously measures translation efficiency; obtains ribosome position, density, ORF, and uORF information | Complex experiment; expensive; can only detect ribosome-protected RNA fragments; poor reproducibility | |
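The translation efficiency that Ribo-Seq is said to measure is often computed as the ratio of ribosome footprint density to mRNA abundance for the same gene. The sketch below illustrates that one common convention; the RPKM normalization, the pseudocount, and all function and parameter names are assumptions made for illustration and are not taken from the review.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count / (length_nt / 1_000) / (total_mapped_reads / 1_000_000)


def translation_efficiency(ribo_count, rna_count, length_nt,
                           ribo_total, rna_total, pseudocount=0.1):
    """Ratio of Ribo-Seq density to RNA-Seq density for one feature.

    The pseudocount (an assumed value) avoids division by zero for
    features with no RNA-Seq reads.
    """
    ribo = rpkm(ribo_count, length_nt, ribo_total)
    rna = rpkm(rna_count, length_nt, rna_total)
    return (ribo + pseudocount) / (rna + pseudocount)


if __name__ == "__main__":
    # Hypothetical counts for a single sORF-containing transcript.
    te = translation_efficiency(
        ribo_count=200, rna_count=100, length_nt=1_000,
        ribo_total=1_000_000, rna_total=1_000_000,
    )
    print(f"TE = {te:.3f}")
```

A ratio well above the transcriptome-wide median suggests a feature is translated more actively than its mRNA level alone would predict, although dedicated tools usually add replicate-aware statistics on top of this simple ratio.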
ORF prediction and evaluation related calculation tools.
| Name | Characteristics | Website | References |
|---|---|---|---|
| CPC | Uses sequence features and a support vector machine (SVM) to evaluate the protein-coding potential of transcripts; assesses the extent, quality, and integrity of ORFs | | |
| sORF finder | Package for identifying sORFs with high coding potential | | |
| PhyloCSF | Based on a formal statistical comparison of phylogenetic codon models, analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region; can delimit likely protein-coding ORFs within transcript models that include untranslated regions | | |
| RNAcode | Compares conserved regions in coding and non-coding sequence data and evaluates coding potential; used for the analysis of sORFs or bifunctional RNAs | | |
| CNCI | Classifies protein-coding and long non-coding transcripts using intrinsic sequence composition (adjacent nucleotide triplets) (SVM-based) | | |
| CPAT | The Coding-Potential Assessment Tool uses an alignment-free logistic regression model that allows ORF size and coverage to be assessed | | |
| iSeeRNA | Identifies long intergenic non-coding RNA (lincRNA) transcripts in transcriptome sequencing data (SVM-based) | | |
| PLEK | Efficient alignment-free computational tool for differentiating coding and non-coding transcripts in RNA-seq transcriptomes of species lacking a reference genome (SVM-based) | | |
| LncRNA-ID | Calculates the coding potential of transcripts based on a machine learning model (random forest) and multiple features | | |
| lncRNA-MFDL | Identifies human lncRNAs by fusing multiple features and applying deep learning classification algorithms, so that coding and long non-coding RNAs can be quickly distinguished | | |
| COME | A multi-feature-based coding potential calculation tool for assessing lncRNA coding potential | | |
| CPC2 | A fast and accurate coding potential calculator based on intrinsic sequence features for ORF feature evaluation (SVM-based) | | |
| CNIT | A tool for identifying protein-coding and long non-coding transcripts based on intrinsic sequence composition (upgraded version of CNCI) | | |
| ORF Finder | Software provided by NCBI that performs six-frame translation of a nucleotide sequence, allowing all possible ORFs to be inferred | | |
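The six-frame translation performed by ORF Finder, and the ORF size and coverage features used by CPAT, can be illustrated with a short scan for ATG-to-stop ORFs. This is a minimal sketch and not the implementation of either tool: the function names, the 10 to 100 amino acid length window, and the definition of coverage as ORF length divided by transcript length are assumptions made here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sorfs(seq, min_aa=10, max_aa=100):
    """Scan all six frames for ATG-to-stop ORFs of min_aa..max_aa residues.

    Coordinates are 0-based on the scanned strand; `end` is exclusive and
    includes the stop codon. The length limits are assumed values.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG after the previous stop
                elif start is not None and CODON_TABLE.get(codon) == "*":
                    n_aa = (i - start) // 3
                    if min_aa <= n_aa <= max_aa:
                        peptide = "".join(
                            CODON_TABLE.get(s[j:j + 3], "X")
                            for j in range(start, i, 3)
                        )
                        hits.append({
                            "strand": strand, "frame": frame,
                            "start": start, "end": i + 3,
                            "peptide": peptide,
                            "coverage": (i + 3 - start) / len(s),
                        })
                    start = None
    return hits


if __name__ == "__main__":
    # Hypothetical transcript fragment containing one 11-residue sORF.
    toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "G"
    for hit in find_sorfs(toy):
        print(hit)
```

Candidate peptides found this way would still need the coding potential assessment, conservation analysis, and experimental validation described in the review; an open reading frame alone is not evidence of translation.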
Commonly used databases for micropeptide research.
| Name | Characteristics | Website | References |
|---|---|---|---|
| BLAST | A tool for similarity analysis against protein or gene databases that finds sequences similar to the query sequence. It includes programs such as blastp and blastx | | |
| Pfam | A database that classifies protein sequences into families and domains and can be queried for conserved protein structural domains | | |
| CDD | The NCBI Conserved Domain Database, which annotates biomolecular sequences with the locations of evolutionarily conserved protein domain footprints and with functional sites inferred from these footprints | | |
| cncRNADB | A manually curated resource of bifunctional RNAs (cncRNAs) with both protein-coding and non-coding functions | | |
| LNCipedia | A public database for storing lncRNA sequences and annotation information | | |
| lnCAR | A comprehensive resource for lncRNAs from cancer arrays (including lncRNA coding information) | | |
| NONCODE | A database annotated with a large amount of lncRNA information | | |
| UCSC | Genome Browser database that provides high-quality visualization of genomic data and genome annotation, with tools such as BLAT and track hubs for viewing, analyzing, and downloading data | | |
| UniProt | The most comprehensive database of protein sequence and annotation information, consisting of UniProtKB, UniRef, and UniParc, and integrating data from three major databases: Swiss-Prot, TrEMBL, and PIR-PSD | | |
| Expasy | A resource hosting reliable, state-of-the-art bioinformatics tools and resources, with tools such as ProtScale and TMpred for viewing, analyzing, and downloading data | | |
| LncPep | A database of lncRNA-encoded peptides | | |
| SPENCER | A comprehensive database of small peptides encoded by non-coding RNAs in cancer patients | | |
FIGURE 1. Schematic illustration of the workflow for bioinformatics prediction and experimental analysis of lncRNA-encoded micropeptides. (A) Bioinformatics prediction: first, construct a database of putative micropeptide-encoding lncRNAs from the results of omics sequencing, and search for putative lncRNA sequences with coding potential in the NCBI or NONCODE databases; second, use computational tools such as CPC2, CNIT, ORF Finder, and PhyloCSF to evaluate the coding potential of the putative lncRNA and deduce the corresponding sORF and amino acid sequence; third, search the deduced amino acid sequences against the Pfam and CDD databases and, if they match, continue the search for information on the putative micropeptide in the UniProt database; finally, predict and model the characteristics and structure of the putative micropeptide with tools and databases such as SignalP-5.0, TMHMM, ProtScale, and SWISS-MODEL. (B) Laboratory identification: design a series of dedicated vectors, transfect them into specific cells, and identify micropeptides by western blot and immunofluorescence; in parallel, raise polyclonal antibodies against the micropeptide and use them for detection in sample cells and tissues by western blot and LC-MS/MS. Based on the results of both experimental procedures, the putative micropeptide is identified as a novel micropeptide, and its function and mechanism are then investigated.
FIGURE 2. Schematic illustration of the regulatory roles of lncRNA-encoded micropeptides in muscle physiological processes and in disease and tumor occurrence and development. (A) Mechanism of action of the micropeptide MLN, encoded by lncRNA LINC00948, in skeletal muscle physiology; (B) Mechanism of action of the conserved peptide SPAR, encoded by lncRNA LINC00961, in muscle regeneration; (C) Mechanism of action of the micropeptide miPEP155 (P155), encoded by lncRNA MIR155HG, in immunity and inflammation; (D) Mechanism of action of the 53-aa conserved peptide encoded by lncRNA HOXB-AS3 in CRC; (E) Mechanism of action of the micropeptide SRSP, encoded by lncRNA LOC90024, in CRC; (F) Mechanism of action of the micropeptide CASIMO1, encoded by lncRNA NR_029453, in BC; (G) Mechanism of action of the conserved peptide SMIM30, encoded by LINC00998, in HCC; (H) Mechanism of action of the 99-aa conserved peptide KRASIM, encoded by lncRNA NCBP2-AS2, which interacts with KRAS in HCC; (I) Mechanism of action of the micropeptide PINT87aa, encoded by LINC-PINT, which interacts with FOXM1 in HCC cell senescence; (J) Mechanism of action of the micropeptide RPS4XL, encoded by lnc-Rps4l, which interacts with RPS6 in PASMC.
Micropeptide information and structure-related prediction tools.
| Name | Characteristics | Website | References |
|---|---|---|---|
| TMHMM | Prediction software for transmembrane structural domains (uses a hidden Markov model to predict the topology of transmembrane proteins) | | |
| TMpred | Predicts transmembrane regions and their orientation | | |
| SignalP | Signal peptide prediction tool | | |
| ProtScale | An online tool for plotting hydrophilicity and hydrophobicity profiles of proteins | | |
| SWISS-MODEL | An automated protein structure homology modeling platform that uses comparative methods to generate protein 3D models | | |
| I-TASSER | An integrated platform for automated protein structure and function prediction based on the sequence-to-structure-to-function paradigm | | |
| AlphaFold2 | A tool for accurately predicting the 3D structure of a protein from its amino acid sequence | | |
| RoseTTAFold | A tool for accurate structure prediction of proteins and protein complexes using three-track neural networks | | |
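A hydrophobicity profile of the kind that ProtScale plots can be approximated with a sliding-window average over a per-residue scale. The sketch below uses the Kyte-Doolittle hydropathy scale; the window size of 9, the function name, and the example peptide are assumptions made here for illustration, and the code is not ProtScale's own implementation.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(peptide, window=9):
    """Mean Kyte-Doolittle score for every full window along the peptide.

    Returns len(peptide) - window + 1 values; raises ValueError if the
    peptide is shorter than the window, and KeyError on non-standard
    residue letters.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in peptide.upper()]
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and the peptide length")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical 30-residue micropeptide with a hydrophobic core.
    peptide = "MSEKRDGS" + "LLIVALLIAVLGLL" + "KRNQEDSK"
    profile = hydropathy_profile(peptide)
    peak = max(range(len(profile)), key=profile.__getitem__)
    print(f"most hydrophobic window starts at residue {peak + 1}")
```

Sustained stretches of high average hydropathy are a first hint of a possible transmembrane segment, which is why such profiles are usually read alongside dedicated predictors such as TMHMM or TMpred.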