Mingming Su1, Yunchao Ling1, Jun Yu2, Jiayan Wu2, Jingfa Xiao2.
Abstract
Polypeptides containing ≤100 amino acid residues (AAs) are generally considered to be small proteins (SPs). Many studies have shown that some SPs are involved in important biological processes, including cell signaling, metabolism, and growth. SPs generally have a simple domain, which makes them advantageous as model systems for overcoming folding speed limits in protein folding simulation and drug design. However, SPs were once thought to be trivial molecules in biological processes compared with large proteins. Because of the constraints of experimental methods and bioinformatics analysis, many genome projects have used a length threshold of 100 amino acid residues to minimize erroneous predictions, so SPs are relatively under-represented in earlier studies. General protein discovery methods can have problems predicting and validating SPs, and very few effective tools and algorithms have been developed specifically for SP identification. In this review, we mainly consider the diverse strategies applied to SP prediction and discuss the challenge of differentiating SP-coding genes from artifacts. We also summarize current large-scale discovery of SPs in species at the genome level. In addition, we present an overview of SPs with regard to biological significance, structural application, and evolutionary characterization in an effort to gain insight into the significance of SPs.
Keywords: evolution characterization; protein annotation coherence; protein identification; small ORFs; small proteins
Year: 2013 PMID: 24379829 PMCID: PMC3864261 DOI: 10.3389/fgene.2013.00286
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
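Functional characterization of small proteins in the COG/KOG databases.

| Functional class | Archaea | Bacteria | Fungi | Ath | Cel | Dme | Hsa |
|---|---|---|---|---|---|---|---|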
| [J] Translation, ribosomal structure, and biogenesis | + | + | + | + | + | + | + |
| [A] RNA processing and modification | − | − | − | + | + | + | + |
| [K] Transcription | + | + | + | + | + | + | + |
| [L] Replication, recombination, and repair | + | + | − | + | + | + | + |
| [B] Chromatin structure and dynamics | + | + | + | + | + | + | + |
| [D] Cell cycle control, cell division, chromosome partitioning | + | + | + | + | + | + | + |
| [Y] Nuclear structure | − | − | − | − | − | − | − |
| [V] Defense mechanisms | + | + | − | + | − | + | + |
| [T] Signal transduction mechanisms | + | + | + | + | + | + | + |
| [M] Cell wall/membrane/envelope biogenesis | + | + | − | − | + | + | + |
| [N] Cell motility | + | + | − | + | − | + | + |
| [Z] Cytoskeleton | − | + | + | + | + | + | + |
| [W] Extracellular structures | − | − | − | + | + | + | + |
| [U] Intracellular trafficking, secretion, and vesicular transport | + | + | + | + | + | + | + |
| [O] Posttranslational modification, protein turnover, chaperones | + | + | + | + | + | + | + |
| [C] Energy production and conversion | + | + | + | + | + | + | + |
| [G] Carbohydrate transport and metabolism | + | + | + | + | + | − | + |
| [E] Amino acid transport and metabolism | + | + | − | + | + | + | + |
| [F] Nucleotide transport and metabolism | + | + | − | − | + | − | + |
| [H] Coenzyme transport and metabolism | + | + | + | + | + | − | + |
| [I] Lipid transport and metabolism | + | + | + | + | + | + | + |
| [P] Inorganic ion transport and metabolism | + | + | + | + | + | + | + |
| [Q] Secondary metabolites biosynthesis, transport, and catabolism | + | + | − | − | + | + | + |
| [R] General function prediction only | + | + | + | + | + | + | + |
| [S] Function unknown | + | + | + | + | + | + | + |
“+,” found; “−,” not found. The table lists the functional classes of SPs from Archaea, Bacteria, and Fungi in the COG database and of SPs from eukaryotic species in the KOG database. In Archaea, Bacteria, and Fungi, three constitutive or structural classes are not covered: RNA processing and modification, nuclear structure, and extracellular structures. In Arabidopsis thaliana (Ath), Caenorhabditis elegans (Cel), Drosophila melanogaster (Dme), and Homo sapiens (Hsa), the nuclear structure class is not covered in any of the organisms.
Figure 1. Domain number distribution of small proteins in NCBI genpept. SPs usually contain a single domain. The NCBI genpept database contains 14,324,397 proteins, including 1,796,324 (12.54%) SPs. Only 310,909 SPs (17.31% of SPs, about 2.17% of all proteins) have annotated domains, and most of these (85.26%) have only one domain.
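The percentages in the Figure 1 caption can be rechecked from the three counts it quotes. The snippet below is only an arithmetic sketch; the variable and function names are illustrative and are not taken from the paper.

```python
# Counts quoted in the Figure 1 caption (NCBI genpept).
total_proteins = 14_324_397
small_proteins = 1_796_324
annotated_small_proteins = 310_909


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of all proteins that are SPs.
sp_share = percent(small_proteins, total_proteins)
# Share of SPs that carry a domain annotation.
annotated_share_of_sps = percent(annotated_small_proteins, small_proteins)
# Share of all proteins that are domain-annotated SPs.
annotated_share_of_all = percent(annotated_small_proteins, total_proteins)

print(sp_share, annotated_share_of_sps, annotated_share_of_all)
```

The three printed values correspond, in order, to the 12.54%, 17.31%, and 2.17% figures given in the caption.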
Figure 2. An overview of integrated strategies for small protein prediction. It is challenging to differentiate meaningful protein-coding sORFs from non-functional sORFs because the shorter the protein sequence, the higher the detection error rate. First, we suggest separating the annotation of SPs from that of other proteins. Second, it is better to combine in silico algorithms with evidence-based analysis. The two sets of results are then merged to give two sets of SPs: strictly validated SPs, which are validated by both methods, and other validated SPs, which are validated by only one of the two methods.
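The merge step described in the Figure 2 caption amounts to a set intersection and a symmetric difference. The sketch below assumes each method reports its hits as a collection of sORF identifiers; the function name `merge_predictions` and the identifiers are hypothetical and only illustrate the idea.

```python
def merge_predictions(in_silico, evidence_based):
    """Split candidate SPs into the two sets named in the caption.

    Returns (strict, other):
      strict - candidates supported by both methods
      other  - candidates supported by exactly one method
    """
    in_silico = set(in_silico)
    evidence_based = set(evidence_based)
    strict = in_silico & evidence_based   # validated by both methods
    other = in_silico ^ evidence_based    # validated by only one method
    return strict, other


# Hypothetical identifiers, used purely for illustration.
in_silico_hits = {"sORF_001", "sORF_002", "sORF_003"}
evidence_hits = {"sORF_002", "sORF_003", "sORF_004"}

strict, other = merge_predictions(in_silico_hits, evidence_hits)
print(sorted(strict))
print(sorted(other))
```

Keeping the two sets separate, rather than taking a plain union, preserves the distinction between doubly supported candidates and those that still rest on a single line of evidence.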
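Summary of large-scale sORF studies in different organisms.

| Kingdom | | Annotated protein-coding genes | Coding or annotated sORFs (<100 AA) | sORFs with experimental evidence or known function | Verified fraction (%) |
|---|---|---|---|---|---|
| Prokaryotes | 4 | 4100 | 345 | 180 | 4 |
| Eukaryotes | 12 | 5865 | 299 | 247 | 4 |
| | 120 | 29,157 | 7159 | 3241 | 11 |
| | 180 | 13,907 | 4561 | 401 | 3 |
| | 2500 | 31,035 | 1240 | 1167 | 4 |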
The table summarizes studies focused on the large-scale discovery of SPs in different species, together with their results.
Numbers of coding or annotated sORFs (<100 AA);
Numbers of sORFs with experimental evidence or known function;
The fraction of verified sORFs relative to previously annotated protein coding genes.
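The last footnote defines the final column as verified sORFs divided by previously annotated protein-coding genes. The sketch below recomputes that fraction from the table rows, assuming (this is an inference from the arithmetic, not stated in the table itself) that the second numeric column holds the annotated gene counts and the fourth holds the verified sORF counts.

```python
# (annotated protein-coding genes, verified sORFs, reported fraction in %)
# Values copied from the table rows; the column roles are an assumption.
rows = [
    (4100, 180, 4),
    (5865, 247, 4),
    (29157, 3241, 11),
    (13907, 401, 3),
    (31035, 1167, 4),
]


def verified_fraction_percent(verified, annotated_genes):
    """Verified sORFs as a whole-number percentage of annotated genes."""
    return round(100 * verified / annotated_genes)


recomputed = [verified_fraction_percent(v, g) for g, v, _ in rows]
reported = [r for _, _, r in rows]
print(recomputed)
print(reported)
```

Under this reading of the columns, the recomputed percentages agree with the reported ones for all five rows.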