Yanyan Li, Honghong Zhou, Xiaomin Chen, Yu Zheng, Quan Kang, Di Hao, Lili Zhang, Tingrui Song, Huaxia Luo, Yajing Hao, Runsheng Chen, Peng Zhang, Shunmin He.
Abstract
Small proteins are proteins of fewer than 100 amino acids translated from small open reading frames (sORFs), which were usually missed in previous genome annotations. Their significance has become apparent in recent years with the discovery of their diverse functions. However, systematic annotation of small proteins is still insufficient. SmProt was developed to provide information on small proteins for the scientific community. Here we present an update of SmProt that emphasizes the reliability of translated sORFs, genetic variants in translated sORFs, and disease-specific sORF translation events or sequences, and that offers a substantially larger data volume. Additional components such as non-ATG translation initiation, function, and new sources are also included. SmProt incorporates 638,958 unique small proteins curated from 3,165,229 primary records, which were computationally predicted from 419 ribosome profiling (Ribo-seq) datasets or collected from the literature and other sources, covering 370 cell lines or tissues in 8 species (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, and Escherichia coli). In addition, small protein families identified from human microbiomes were collected. All datasets in SmProt are freely accessible and available for browsing, searching, and bulk download at http://bigdata.ibp.ac.cn/SmProt/.
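To make the size criterion concrete, the following is a minimal Python sketch of how candidate sORFs encoding fewer than 100 amino acids could be located on the forward strand of a nucleotide sequence. It is an illustration only and not the SmProt pipeline, which relies on Ribo-seq evidence and curated sources; the function name `find_sorfs`, its parameters, and the toy sequence are assumptions introduced here.

```python
# Illustrative sketch only; not the method used to build SmProt.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_sorfs(seq, start_codons=("ATG",), max_aa=100):
    """Return (start, end, n_aa) for forward-strand ORFs encoding fewer than max_aa residues.

    start and end are 0-based, end-exclusive nucleotide coordinates;
    end includes the stop codon, n_aa counts codons before the stop.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in start_codons:
            continue
        # Walk in-frame codons until the first stop codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                n_aa = (j - i) // 3
                if n_aa < max_aa:
                    hits.append((i, j + 3, n_aa))
                break
    return hits

# Hypothetical toy sequence: ATG AAA TTT TAG embedded in flanking bases.
toy_hits = find_sorfs("CCATGAAATTTTAGGG")
```

Passing a longer tuple as `start_codons` would extend the scan to non-ATG initiation, at the cost of many more candidates that need experimental support.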
Keywords: Disease; Ribosome profiling; Small open reading frame; Upstream open reading frame; Variants
Year: 2021 PMID: 34536568 PMCID: PMC9039559 DOI: 10.1016/j.gpb.2021.09.002
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409
Figure 1. Construction of SmProt. Items in blue background represent data sources. Items in yellow background represent management processes. Items in red background represent results. WGS, whole genome sequencing; MS, mass spectrometry; TIS, translation initiation site; ORF, open reading frame; sORF, small ORF; uORF, upstream ORF.
Figure 2. Usage of SmProt. SmProt provides multiple ways to search, browse, and visualize small proteins, as well as related diseases and variants.
Figure 3. Contents of SmProt. Detailed information on small proteins is provided, including general annotation, information from ribosome profiling data, literature, other databases, MS, functional domain prediction, related diseases, and related variants from WGS projects, as well as their corresponding effects.
Statistics of unique small proteins in SmProt
| Source | Start codon | Homo sapiens | Mus musculus | Rattus norvegicus | Drosophila melanogaster | Danio rerio | Saccharomyces cerevisiae | Caenorhabditis elegans | Escherichia coli | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ribo-seq | ATG | 70,931 | 48,909 | 5269 | 3560 | 4334 | 4535 | 1881 | 1924 | 141,343 |
| Ribo-seq | Near-cognate codons | 229,653 | 133,037 | 29,679 | 9910 | 9894 | 12,339 | 10,004 | 1347 | 435,863 |
| Literature | ATG and near-cognate codons | 38,157 | 8875 | 22,228 | 163 | 4 | 355 | 296 | 3612 | 73,690 |
| Databases | ATG and near-cognate codons | 786 | 797 | 100 | 271 | 120 | 336 | 955 | 64 | 3429 |
| MS | ATG and near-cognate codons | 768 | 51 | 66 | 38 | 0 | 3 | 0 | 1 | 927 |
| All IDs examined | ATG and near-cognate codons | 327,995 | 189,433 | 56,574 | 13,829 | 14,255 | 17,312 | 12,881 | 6679 | 638,958 |
Note: Small protein families from human microbiomes are not included. Near-cognate codons refer to non-ATG start codons that differ from the canonical ATG start codon by a single base but are able to initiate translation, such as TTG, GTG, CTG, AAG, AGG, ACG, ATA, ATT, and ATC. ID refers to a unique entry with identical genomic loci in one species. Ribo-seq, ribosome profiling; MS, mass spectrometry.
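The definition of near-cognate codons in the note can be checked with a short Python sketch that enumerates every codon differing from ATG at exactly one position. The helper name `single_base_variants` is an assumption made for illustration. The code only lists sequence neighbours of ATG; whether a given codon actually initiates translation is an empirical question that the enumeration does not address.

```python
BASES = "ACGT"

def single_base_variants(codon="ATG"):
    """List all codons that differ from `codon` at exactly one position."""
    variants = []
    for pos, original in enumerate(codon):
        for base in BASES:
            if base != original:
                variants.append(codon[:pos] + base + codon[pos + 1:])
    return variants

near_cognate = single_base_variants("ATG")

# The nine codons named in the table note.
listed_in_note = {"TTG", "GTG", "CTG", "AAG", "AGG", "ACG", "ATA", "ATT", "ATC"}
matches_note = set(near_cognate) == listed_in_note
```

Three positions with three alternative bases each give nine variants, which is the number of codons named in the note.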