SmProt: A Reliable Repository with Comprehensive Annotation of Small Proteins Identified from Ribosome Profiling.

Yanyan Li¹, Honghong Zhou², Xiaomin Chen³, Yu Zheng¹, Quan Kang², Di Hao², Lili Zhang³, Tingrui Song², Huaxia Luo², Yajing Hao⁴, Runsheng Chen⁵, Peng Zhang⁶, Shunmin He⁷.

Abstract

Small proteins are proteins of fewer than 100 amino acids translated from small open reading frames (sORFs), which were usually missed in previous genome annotation. The significance of small proteins has been revealed in recent years, along with the discovery of their diverse functions. However, systematic annotation of small proteins is still insufficient. SmProt was specially developed to provide valuable information on small proteins for the scientific community. Here we present the update of SmProt, which emphasizes reliability of translated sORFs, genetic variants in translated sORFs, disease-specific sORF translation events or sequences, and a remarkably increased data volume. More components such as non-ATG translation initiation, function, and new sources are also included. SmProt incorporates 638,958 unique small proteins curated from 3,165,229 primary records, which were computationally predicted from 419 ribosome profiling (Ribo-seq) datasets or collected from literature and other sources from 370 cell lines or tissues in 8 species (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, and Escherichia coli). In addition, small protein families identified from human microbiomes were also collected. All datasets in SmProt are free to access, and available for browsing, searching, and bulk download at http://bigdata.ibp.ac.cn/SmProt/.

Keywords: Disease; Ribosome profiling; Small open reading frame; Upstream open reading frame; Variants

Substances: Proteins

Year: 2021 PMID: 34536568 PMCID: PMC9039559 DOI: 10.1016/j.gpb.2021.09.002

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409

Introduction

Genome annotation is fundamental to life science. In recent years, it has been found that small open reading frames (sORFs) widely exist in genomes of many organisms including humans [1] and human microbiomes [2], and some are able to be translated into small proteins [3], [4], [5]. Small proteins are proteins with fewer than 100 amino acids, which may be derived from untranslated regions (UTRs) of mRNAs [6] or non-coding RNAs (ncRNAs) [7], [8] including primary microRNAs (pri-miRNAs) [9], [10], long ncRNAs (lncRNAs) [11], and circular RNAs (circRNAs) [12]. Small proteins were usually missed in previous coding sequence annotation, while their significance has been revealed in recent years through their diverse functions [13], such as embryonic development [14], [15], cell apoptosis [16], muscle contraction [17], and antimicrobial activity [18]. Some small proteins play roles in multiple diseases [19], [20] including tumors [9], [11], [12]. Despite the abundance of sORFs in genomes, the number of well-studied small proteins is very limited. Annotation of numerous small proteins will contribute to studies on various physiological and pathological processes. Identification of small proteins at the proteomic level is challenging. Mass spectrometry (MS) can provide direct evidence of small proteins, but it relies heavily on the coverage of existing libraries, which mainly focus on large proteins rather than small proteins. Protease cleavage sites are scarce in small proteins because of their limited length. Besides, small proteins are usually of low abundance, and tend to be filtered out during the enrichment process [21].
Ribosome profiling (also known as ribosomal footprinting or Ribo-seq) provides a more sensitive way for global detection of translation events based on the deep sequencing of ribosome-protected mRNA fragments (RPFs) [22], [23], which allows for identifying the location of translated ORFs and translation initiation sites (TISs), the distribution of ribosomes on mRNA, and the speed of translating ribosomes [24]. Reference libraries for MS can also be constructed with Ribo-seq data. The regular Ribo-seq (rRibo-seq) utilizes cycloheximide (CHX) [25], a drug bound at the ribosome E-site [26], as a translation elongation inhibitor to freeze the translating ribosomes. Translation is principally regulated at the initiation stage. Translation initiation sequencing (TI-seq) is a variation of the rRibo-seq technique that uses different translation inhibitors, usually lactimidomycin (LTM) [25] or harringtonine (HARR) [27], which can induce ribosome stasis at TISs. TI-seq enables the global mapping of TISs, and is more accurate in prediction of non-ATG start codons. Many sORFs have been shown to use non-canonical (non-ATG) start codons [28], which is also an important mechanism for generating protein isoforms [29], [30]. rRibo-seq data usually show clear triplet periodicity [26]. Different computational analysis strategies [31], [32], [33], [34], [35], [36], [37], [38] have been developed to identify translated sequences using Ribo-seq data. Emerging evidence shows that many upstream ORFs (uORFs) act in cis to regulate the translation of downstream ORFs by leaky scanning [39], reinitiation [40], and ribosome stalling [41]. Recently, variants creating new upstream start codons or disrupting stop sites of existing uORFs (uORF-perturbing) were found to be under strong negative selection [42]. uORF-perturbing variants have been demonstrated to be an under-recognized functional class that contributes to human disease.
Given the great importance of small proteins, in-depth investigations of small proteins across various species are needed. SmProt is dedicated to integrating knowledge of small proteins translated from various sources, especially those from UTRs and ncRNAs. The annotation information and functional sections in the current release are much richer than those in the initial release [43], and the data volume and reliability are also greatly improved.

Data collection and processing

Data sources

rRibo-seq and TI-seq datasets derived from diverse tissues/cell lines were collected from the GEO [44] and European Nucleotide Archive (ENA) [45] databases. The latest reference genomes and gene annotations were downloaded from Ensembl [46], GENCODE [47], and the NCBI-Genome database. Whole-genome sequencing (WGS) variants were collected from various websites. The construction pipeline of SmProt is summarized in Figure 1.

Figure 1

Construction of SmProt. Items in blue background represent data sources. Items in yellow background represent management processes. Items in red background represent results. WGS, whole genome sequencing; MS, mass spectrometry; TIS, translation initiation site; ORF, open reading frame; sORF, small ORF; uORF, upstream ORF.

Ribo-seq data processing

The fastq files of 547 Ribo-seq datasets were downloaded from the GEO and ENA databases. Each dataset was checked manually to confirm the sequencing adapters. The adapters were removed using cutadapt 1.18 [48], and only reads with 25−35 bp in length were retained. The reads were then mapped to the latest genome using STAR 2.5.2a [49] in EndToEnd mode, allowing up to 2 mismatches. Ribo-seq quality and P-site offsets were assessed with the Ribo-TISH [34] quality module. For TI-seq data, more attention was paid to TIS quality (-t). Manual checks were then carried out to verify offset values and eliminate datasets without obvious triplet periodicity. After quality control, 419 Ribo-seq datasets (Table S1) were retained. Translated ORFs were predicted with the Ribo-TISH predict module. Biological and technical replicates under the same treatment in one dataset were merged. The minimum amino acid length of candidate ORFs was set to 5. To consider both ATG and near-cognate start codons (differing from ATG by one base), rRibo-seq datasets using only CHX without matched TI-seq data were analyzed twice: once to predict ORFs with the canonical ATG start codon, and once to predict ORFs with near-cognate start codons (--alt). To favor data evidence over prior assumptions, only the best frame test result among multiple candidate start codons in the same ORF was reported (--framebest). For datasets containing TI-seq data, alternative start codons were included (--alt), and different parameters were set for LTM-based TI-seq and HARR-based TI-seq (--harr). sORFs encoding fewer than 100 amino acids were selected from the prediction results above. Furthermore, we removed prediction results that may be supported by RPFs from other classic proteins with more than 100 amino acids. These include ORFs marked as known (i.e., TIS annotated in another transcript), CDSFrameOverlap (i.e., ORF overlapping with annotated CDS in another transcript in the same reading frame), and Truncated (i.e., ORF as a part of annotated CDS in the same transcript) without translation initiation evidence (i.e., no significant results identified from paired TI-seq datasets). In-frame reads of sORFs were counted and normalized by library sequencing depth (total in-frame read count) and sORF length, a method similar to reads per kilobase per million mapped reads (RPKM) in RNA-seq but applied to ribosome profiling data, so that the value represents translation level. Finally, 3,060,793 records (i.e., unmerged primary results from all datasets and tissues) were retained. Results with identical genomic loci in one species were merged as the same small protein, generating 577,206 unique IDs, while information derived from multiple datasets was retained, an integration method similar to that used for piRBase [50].
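The normalization and merging steps described above can be illustrated with a minimal sketch. The exact formula and record layout used by SmProt are not given in the text, so the RPKM-style expression below (in-frame reads per kilobase of sORF per million in-frame reads) and the field names `species`, `chrom`, `strand`, `blocks`, and `dataset` are assumptions made for illustration only.

```python
from collections import defaultdict


def ribo_rpkm(in_frame_count, sorf_length_nt, total_in_frame_reads):
    """RPKM-style translation level (assumed form, not confirmed by the source):
    in-frame reads per kilobase of sORF per million in-frame reads in the library."""
    if sorf_length_nt <= 0 or total_in_frame_reads <= 0:
        raise ValueError("sORF length and library depth must be positive")
    return in_frame_count * 1e9 / (sorf_length_nt * total_in_frame_reads)


def merge_by_locus(records):
    """Group primary records that share species and genomic locus under one key,
    keeping the list of datasets that support each merged entry."""
    merged = defaultdict(list)
    for rec in records:
        key = (rec["species"], rec["chrom"], rec["strand"], tuple(rec["blocks"]))
        merged[key].append(rec["dataset"])
    return dict(merged)


# Hypothetical primary records: two datasets report the same locus.
records = [
    {"species": "human", "chrom": "chr1", "strand": "+", "blocks": [(1000, 1090)], "dataset": "D1"},
    {"species": "human", "chrom": "chr1", "strand": "+", "blocks": [(1000, 1090)], "dataset": "D2"},
    {"species": "human", "chrom": "chr2", "strand": "-", "blocks": [(500, 560)], "dataset": "D1"},
]
unique_entries = merge_by_locus(records)
example_level = ribo_rpkm(in_frame_count=50, sorf_length_nt=90, total_in_frame_reads=2_000_000)
```

Keying on the full block list rather than only the start and end keeps spliced sORFs with different exon structures apart, which matches the idea of "identical genomic loci" but is a design choice of this sketch.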

Variants from ribosome profiling data

We performed germline variant detection on 96 human ribosome profiling datasets, following the workflow for processing RNA data for germline short variant discovery with GATK v4.1.8 [51], [52], [53], [54]. Duplicate reads were identified using the MarkDuplicates tool after alignment, then reads containing N operators (skipped regions) in their CIGAR strings were split using the SplitNCigarReads tool. Base quality score recalibration was carried out based on true sites in training sets using the BaseRecalibrator tool and applied using the ApplyBQSR tool. Variants were called individually in each sample using the HaplotypeCaller tool. Variants with QualByDepth (QD) < 2 were removed using the VariantFiltration tool. Germline single nucleotide variants (SNVs) were linked to small proteins in SmProt according to genomic positions.
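The QD threshold above can be illustrated outside GATK with a small pure-Python sketch that reads the QD annotation from the INFO column of a VCF record. This is not the VariantFiltration tool itself; the helper names are hypothetical, and keeping records that carry no QD annotation is an assumption of this sketch.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=20;QD=1.5' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item:
            info[item] = True  # flag-type field without a value
    return info


def passes_qd(vcf_line, min_qd=2.0):
    """Return True when the record's QD is at least min_qd.
    Records without a QD annotation are kept (an assumption of this sketch)."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the 8th VCF column
    if "QD" not in info:
        return True
    return float(info["QD"]) >= min_qd


# Hypothetical records: the first has QD below the threshold.
low = "\t".join(["chr1", "100", ".", "A", "G", "30", ".", "DP=20;QD=1.5"])
high = "\t".join(["chr1", "200", ".", "C", "T", "90", ".", "DP=30;QD=12.3"])
kept = [line for line in (low, high) if passes_qd(line)]
```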

Variants from WGS data

Variants from 1KGP3 [55], GAsP [56], TOPMed [57], gnomAD3 [42], [58], and NyuWa [59] were collected. VCF files were lifted over from old genome version to GRCh38 using LiftoverVcf tool of GATK with allowance to recover swapped reference and alternative alleles. Variants in 5′ UTRs were evaluated for their effects on uORFs using VEP [60] with plugin UTRannotator [42], [61], and classified by their functional consequences. Translation evidence of uORFs was based on small proteins recorded in SmProt.

Disease-specific small proteins

Small proteins identified only from diseased cell lines/tissues but not from corresponding normal cell lines/tissues were predicted as disease-specific translation events. If there were matched data of normal and diseased groups in the same dataset, small proteins derived uniquely from the diseased group were screened as disease-specific ones. If there was no matched control group in the same dataset, the same type of healthy tissue/cell line in other datasets was used as control. If there was no matched same tissue/cell line, all data from diverse normal tissues/cell lines were merged for comparisons (Table S2), and small proteins identified only from the diseased cell lines/tissues were predicted as tissue-specific. Disease-specific or tissue-specific translation events require a Ribo P value lower than 0.01 in the disease group, while similar proteins with different TISs at the same loci must not be detected in the control group (Ribo P value higher than 0.05). SNVs in diseased cell lines/tissues derived from ribosome profiling data and located within the genomic region of small proteins were regarded as diseased variant sets. SNVs detected only in diseased variant sets but not in normal sets were predicted as disease-specific SNVs. SNVs in corresponding normal cell lines/tissues (Table S2) derived from ribosome profiling data were combined with all variants derived from multiple WGS projects, as control variant sets for comparison.
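The two P value thresholds above can be read as a simple selection rule. The sketch below is one possible reading, assuming that each group is summarized as a mapping from a locus identifier to the smallest Ribo P value observed among the TISs at that locus; the function name and input layout are hypothetical.

```python
def disease_specific_loci(disease_p, control_p, p_disease=0.01, p_control=0.05):
    """Select loci translated in the disease group but not detected in controls.

    disease_p / control_p: dict mapping a locus id to the smallest Ribo P value
    among all TISs at that locus (an assumed summary, not the SmProt schema).
    """
    selected = set()
    for locus, p_value in disease_p.items():
        if p_value >= p_disease:
            continue  # not confidently translated in the disease group
        control_value = control_p.get(locus)
        if control_value is None or control_value > p_control:
            selected.add(locus)  # absent or not detected in the control group
    return selected


# Hypothetical P values for five loci.
disease = {"A": 0.001, "B": 0.005, "C": 0.03, "D": 0.002, "E": 0.0001}
control = {"A": 0.2, "B": 0.001, "D": 0.03}
specific = disease_specific_loci(disease, control)
```

Loci whose control P value falls between the two thresholds (such as "D" here) are left out, which reflects the gap between 0.01 and 0.05 in the stated criteria.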

Function domain prediction

In addition to functions of small proteins collected through literature mining, we used InterProScan [62] to predict functional domains of small proteins. InterProScan combines protein family membership with functional domains/sites, and has been extensively used by genome sequencing projects and the UniProt Knowledgebase [63]. Default thresholds and the additional parameters -goterms -pa were adopted for Gene Ontology and pathway annotations.

PhyloCSF calculation

Pre-calculated BigWig data of PhyloCSF [64] scores at each base across the whole genome were downloaded from Broad Institute (https://data.broadinstitute.org/compbio1/PhyloCSFtracks/), and the score for genomic region of each small protein was extracted with our script using pyBigWig (https://github.com/deeptools/pyBigWig).
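A minimal sketch of this kind of extraction is shown below. It is not the authors' script: the function names, the block layout (0-based, half-open intervals), and the choice of averaging per-base scores are assumptions, and the BigWig path passed in would be whichever PhyloCSF track file matches the strand and frame of interest. Only `pyBigWig.open`, `values`, and `close` are used from the library.

```python
import math


def mean_score(block_values):
    """Average per-base scores over all blocks, ignoring NaN (bases without a score)."""
    scores = [v for block in block_values for v in block if not math.isnan(v)]
    return sum(scores) / len(scores) if scores else None


def phylocsf_for_blocks(bigwig_path, chrom, blocks):
    """Read per-base scores for each (start, end) block and average them.

    bigwig_path is a hypothetical local copy of one PhyloCSF track file.
    """
    import pyBigWig  # imported lazily so mean_score works without the library

    bw = pyBigWig.open(bigwig_path)
    try:
        values = [bw.values(chrom, start, end) for start, end in blocks]
    finally:
        bw.close()
    return mean_score(values)


# Averaging works on plain lists, so it can be checked without a BigWig file.
demo = mean_score([[1.0, float("nan")], [3.0]])
```

Whether to summarize a region by its mean, maximum, or another statistic is a design choice; the text only states that a score was extracted for the genomic region of each small protein.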

Database implementation

The database website was built with HTML (https://html.spec.whatwg.org/), JavaScript (https://www.javascript.com/), PHP (https://www.php.net/), and MySQL (https://www.mysql.com/). UCSC Genome Browser (http://genome.ucsc.edu/) was used to visualize the small proteins and variants. NCBI BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was used for sequence similarity searches.

Database content and usage

Overview

SmProt was constructed with the pipeline described in Figure 1. Multiple ways are provided to search, browse, visualize, and study small proteins (Figure 2). Small proteins were found mainly from rRibo-seq and TI-seq data. All information for small proteins from different data sources and datasets was integrated. General information for small proteins is provided, such as sequence, mass, location, blocks, tissue or cell line, predicted functions, conservation, and multiple IDs including small protein ID, Ensembl ID, and NONCODE [65] ID. The translation level (in-frame counts and Ribo RPKM) of small proteins identified from each dataset and record is provided. Details of their related variants and diseases are also provided (Figure 3). SmProt now has 638,958 unique small proteins and 3,165,229 small protein records in total (Table 1; Table S3).

Figure 2

Usage of SmProt. SmProt provides multiple ways to search, browse, and visualize small proteins, as well as related diseases and variants.

Figure 3

Contents of SmProt. Detailed information for small proteins is provided, including general annotation, information from ribosome profiling data, literature, other databases, MS, function domain prediction, related diseases, and related variants from WGS projects, as well as corresponding effects.

Table 1

Statistics of unique small proteins in SmProt

Source	Start codon	Human	Mouse	Fruit fly	Rat	C. elegans	Yeast	E. coli	Zebrafish	All species examined
Ribo-seq	ATG	70,931	48,909	5269	3560	4334	4535	1881	1924	141,343
Ribo-seq	Near-cognate codons	229,653	133,037	29,679	9910	9894	12,339	10,004	1347	435,863
Literature	ATG and near-cognate codons	38,157	8875	22,228	163	4	355	296	3612	73,690
Databases	ATG and near-cognate codons	786	797	100	271	120	336	955	64	3429
MS	ATG and near-cognate codons	768	51	66	38	0	3	0	1	927
All IDs examined	ATG and near-cognate codons	327,995	189,433	56,574	13,829	14,255	17,312	12,881	6679	638,958

Note: Small protein families from human microbiomes are not included. Near-cognate codons refer to non-ATG start codons that differ from the canonical ATG start codon by a single base but are able to initiate translation, such as TTG, GTG, CTG, AAG, AGG, ACG, ATA, ATT, and ATC. ID refers to a unique entry with identical genomic loci in one species. Ribo-seq, ribosome profiling; MS, mass spectrometry.
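The nine near-cognate codons listed in the note are exactly the single-base substitutions of ATG, which can be checked with a short enumeration. The function name below is hypothetical.

```python
def near_cognate_codons(canonical="ATG"):
    """List every codon that differs from the canonical start codon at exactly one base."""
    codons = []
    for pos in range(len(canonical)):
        for base in "ACGT":
            if base != canonical[pos]:
                codons.append(canonical[:pos] + base + canonical[pos + 1:])
    return codons


codons = near_cognate_codons()
listed_in_note = ["TTG", "GTG", "CTG", "AAG", "AGG", "ACG", "ATA", "ATT", "ATC"]
same_set = sorted(codons) == sorted(listed_in_note)
```

Three positions with three alternative bases each give 3 × 3 = 9 codons, matching the list in the note.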

Reliability of small proteins

SmProt emphasizes reliability of small proteins, which is ensured mainly by the significance of 3-nt periodicity in the RPF P-site profile. Firstly, we constructed a new pipeline based on the independently published toolkit Ribo-TISH [34], which allows for accurate detection of ORFs and TISs using rRibo-seq and TI-seq. Ribo-TISH uses a rank sum test to detect 3-nt periodicity and a negative binomial test to detect TISs, and outperforms other established methods in prediction accuracy. Secondly, in addition to the quality control based on the Ribo-TISH quality module, manual checks were also carried out to ensure clear triplet periodicity and unambiguous offsets of Ribo-seq data, which further reduces noise. Thirdly, we provided several evaluations as supporting evidence. These include 1) P values of small proteins called from multiple ribosome profiling datasets, which indicate the confidence in different samples and conditions; 2) PhyloCSF conservation of genomic regions, which reflects coding potential; and 3) peptide evidence derived from mass spectrometry data. All these lines of evidence are shown on the small protein page. Moreover, a set with evidence of both translation events and protein fragments is provided on the download page. In addition, information on small proteins derived from multiple sources is also integrated on the small protein information page.
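To make the notion of 3-nt periodicity concrete, the sketch below tallies P-site positions by reading frame relative to an ORF start and reports the share that falls in frame. This is a descriptive summary only, not the rank sum test used by Ribo-TISH; the function names and the example positions are hypothetical.

```python
def frame_counts(psite_positions, orf_start):
    """Count P-sites in frames 0, 1, and 2 relative to the ORF start (transcript coordinates)."""
    counts = [0, 0, 0]
    for pos in psite_positions:
        counts[(pos - orf_start) % 3] += 1
    return counts


def in_frame_fraction(psite_positions, orf_start):
    """Fraction of P-sites in frame 0; values well above 1/3 suggest triplet periodicity."""
    counts = frame_counts(psite_positions, orf_start)
    total = sum(counts)
    return counts[0] / total if total else None


# Hypothetical P-site positions for an ORF starting at position 100.
positions = [100, 103, 103, 106, 107, 109, 111]
counts = frame_counts(positions, orf_start=100)
fraction = in_frame_fraction(positions, orf_start=100)
```

A statistical test is still needed to decide whether an observed enrichment is significant, since short sORFs carry few reads and a high fraction can arise by chance.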

Variants related to small proteins

In total, 25,475 variants located on translated sORFs are provided and displayed on the related small protein page. Given that uORF-perturbing variants are likely to impact translation of downstream proteins [42], variants from multiple WGS projects and ribosome profiling data were evaluated for their effects on uORFs. These effects include creating a new start codon ATG, removing an existing start codon ATG, creating a new stop codon within an existing uORF, removing the stop codon of an existing uORF, and creating a frameshift mutation in an existing uORF, all of which can be found on the variants page. Disease-specific small proteins are potential candidates of molecular markers or targets for diagnosis and treatment. Disease-specific translation events as well as disease-specific SNVs of small proteins in 16 types of diseases were identified (see “Data collection and processing” section) (Table S4). Besides, small proteins that have been verified experimentally in certain diseases were also documented through literature mining.
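The effect categories listed above can be illustrated for single-nucleotide substitutions with a small sketch. It is not the UTRannotator plugin: it works on a plain 5′ UTR string in transcript orientation, considers only ATG-initiated uORFs that terminate within the UTR, and leaves out frameshifts, which require insertions or deletions. All function names and labels are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def atg_positions(seq):
    """0-based positions of every ATG in the sequence."""
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def first_stop(seq, start):
    """Position of the first in-frame stop codon downstream of start, or None."""
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def classify_snv(utr, pos, alt):
    """Label the effect of substituting base `alt` at 0-based `pos` of a 5' UTR."""
    ref_seq = utr.upper()
    alt_seq = ref_seq[:pos] + alt.upper() + ref_seq[pos + 1:]
    effects = set()
    ref_atg, alt_atg = atg_positions(ref_seq), atg_positions(alt_seq)
    if alt_atg - ref_atg:
        effects.add("start_gained")
    if ref_atg - alt_atg:
        effects.add("start_lost")
    for start in ref_atg & alt_atg:  # uORFs whose start codon is unchanged
        ref_stop = first_stop(ref_seq, start)
        alt_stop = first_stop(alt_seq, start)
        if ref_stop is None:
            continue  # uORF not closed within the UTR; outside this sketch
        if alt_stop is None or alt_stop > ref_stop:
            effects.add("stop_lost")
        elif alt_stop < ref_stop:
            effects.add("stop_gained")
    return effects


utr = "CCATGAAACCCTAGGG"  # hypothetical UTR: uORF ATG AAA CCC TAG
stop_lost = classify_snv(utr, 11, "C")    # TAG -> CAG
stop_gained = classify_snv(utr, 5, "T")   # AAA -> TAA
start_lost = classify_snv(utr, 4, "C")    # ATG -> ATC
start_gained = classify_snv("CCACGAAATAGCC", 3, "T")  # ACG -> ATG
```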

Human microbiome small proteins

Over 4000 conserved small protein families identified from human microbiomes were collected [2]. A new section HumanMicroBio was created to integrate and display selected information of these small protein families.

Other sources

We used a set of keywords (File S1) to search articles about small proteins in the PubMed database. High-confidence small proteins in CCDS [66] and Swiss-Prot [67] were also integrated. Literature mining is processed in stages, and newly published data from other sources will be released continuously after completion of manual review and curation. For successfully predicted functions of small proteins derived from ribosome profiling and literature mining, SmProt provides a graph for visualization and prediction details including Gene Ontology (GO) and pathway annotations. Users can choose predicted functions on the Browse page to filter the results with function domain prediction.

Inner BLAST

The abundant small proteins across multiple species allow for sequence similarity searches at both nucleotide and protein levels. Users can search for sequences of interest using BLASTp and BLASTx (NCBI BLAST 2.2.24 release) online.

Visualization using UCSC Genome Browser

SmProt incorporates UCSC Genome Browser [68] to visualize all the information including genomic loci of small proteins, as well as variants from ribosome profiling data and multiple WGS projects related to small proteins, MS data, and gene annotations. The latest genome versions including hg38, mm10, rn6, dm6, ce11, sacCer3, and danRer11 were provided.

Comparison with other databases

SmProt currently includes 419 Ribo-seq datasets derived from 116 cell lines/tissues, compared to 60 datasets derived from 37 cell lines/tissues in the initial version. The number of small protein records identified from ribosome profiling in the current release is 60 times that of the initial release (3 million vs. 0.05 million). The current release of SmProt merged a large number of duplicate records from the initial release [43], and the Ribo-seq analysis pipeline was optimized to ensure the reliability of our results. Variants in translated sORFs identified from Ribo-seq data as well as uORF-perturbing variants identified from WGS projects are provided. Disease-specific small proteins may provide new perspectives for clinical studies. Currently, there are a few databases for small proteins such as ARA-PEPs [69], PsORF [70], and sORFs.org [71]. ARA-PEPs and PsORF only harbor small proteins in plants. sORFs.org developed a simple in-house TIS-calling algorithm that is not based on triplet periodicity, which should be the most important feature of Ribo-seq. SmProt emphasizes high confidence using our Ribo-TISH pipeline, which is more accurate than previous methods. In total, 419 Ribo-seq datasets have been analyzed in SmProt, while there were only 78 Ribo-seq datasets in sORFs.org. Moreover, SmProt pays special attention to function, variants, and related diseases of small proteins. Furthermore, WGS data resources are also integrated in SmProt, which are not covered in other databases. Other proteomic databases such as UniProt, neXtProt [72], and OpenProt [73] are not specifically designed for small proteins. neXtProt only harbors human proteins, while SmProt harbors small proteins of 8 species. Similar to SmProt, OpenProt also used ribosome profiling and mass spectrometry to predict proteins, including some small proteins longer than 30 amino acids. Nonetheless, SmProt has analyzed many more ribosome profiling datasets (419), about 5 times the number in OpenProt (87), and provides information for small proteins longer than 5 amino acids.

Conclusion

In brief, SmProt integrates small proteins from a large amount of ribosome profiling data, and provides more abundant details. We strongly believe that SmProt will provide valuable and accurate information on small proteins for the scientific community. Moreover, SmProt provides a new resource for users interested in functional and mechanistic studies, and a reference for construction of MS libraries of small proteins.

Data availability

SmProt is publicly available at http://bigdata.ibp.ac.cn/SmProt/.

Competing interests

The authors have declared no competing interests.

Handled by Zhang Zhang

2021 The Authors. Published by Elsevier B.V. and Science Press on behalf of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Orcid

0000-0001-5256-6696 (Yanyan Li)
0000-0001-7409-3092 (Honghong Zhou)
0000-0002-0633-2984 (Xiaomin Chen)
0000-0003-4936-8407 (Yu Zheng)
0000-0001-6790-5259 (Quan Kang)
0000-0003-0082-0730 (Di Hao)
0000-0002-3601-0150 (Lili Zhang)
0000-0003-2967-7704 (Tingrui Song)
0000-0001-9944-0345 (Huaxia Luo)
0000-0003-1384-4176 (Yajing Hao)
0000-0001-6049-8347 (Runsheng Chen)
0000-0001-9303-1639 (Peng Zhang)
0000-0002-7294-0865 (Shunmin He)

CRediT authorship contribution statement

Yanyan Li: Conceptualization, Methodology, Investigation, Formal analysis, Data curation, Writing – original draft, Software, Visualization. Honghong Zhou: Investigation, Data curation, Funding acquisition. Xiaomin Chen: Investigation, Data curation. Yu Zheng: Data curation, Software, Visualization. Quan Kang: Software, Visualization. Di Hao: Data curation, Software. Lili Zhang: Visualization. Tingrui Song: Visualization. Huaxia Luo: Writing – review & editing. Yajing Hao: Writing – review & editing. Runsheng Chen: Resources, Supervision, Funding acquisition. Peng Zhang: Conceptualization, Methodology, Investigation, Software, Writing – review & editing, Visualization, Project administration, Funding acquisition. Shunmin He: Conceptualization, Methodology, Resources, Investigation, Writing – review & editing, Supervision, Funding acquisition.

70 in total

1. The Translational Landscape of the Human Heart.

Authors: Sebastiaan van Heesch; Franziska Witte; Valentin Schneider-Lunitz; Jana F Schulz; Eleonora Adami; Allison B Faber; Marieluise Kirchner; Henrike Maatz; Susanne Blachut; Clara-Louisa Sandmann; Masatoshi Kanda; Catherine L Worth; Sebastian Schafer; Lorenzo Calviello; Rhys Merriott; Giannino Patone; Oliver Hummel; Emanuel Wyler; Benedikt Obermayer; Michael B Mücke; Eric L Lindberg; Franziska Trnka; Sebastian Memczak; Marcel Schilling; Leanne E Felkin; Paul J R Barton; Nicholas M Quaife; Konstantinos Vanezis; Sebastian Diecke; Masaya Mukai; Nancy Mah; Su-Jun Oh; Andreas Kurtz; Christoph Schramm; Dorothee Schwinge; Marcial Sebode; Magdalena Harakalova; Folkert W Asselbergs; Aryan Vink; Roel A de Weger; Sivakumar Viswanathan; Anissa A Widjaja; Anna Gärtner-Rommel; Hendrik Milting; Cris Dos Remedios; Christoph Knosalla; Philipp Mertins; Markus Landthaler; Martin Vingron; Wolfgang A Linke; Jonathan G Seidman; Christine E Seidman; Nikolaus Rajewsky; Uwe Ohler; Stuart A Cook; Norbert Hubner
Journal: Cell Date: 2019-05-30 Impact factor: 41.582

2. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation.

Authors: Ariel A Bazzini; Timothy G Johnstone; Romain Christiano; Sebastian D Mackowiak; Benedikt Obermayer; Elizabeth S Fleming; Charles E Vejnar; Miler T Lee; Nikolaus Rajewsky; Tobias C Walther; Antonio J Giraldez
Journal: EMBO J Date: 2014-04-04 Impact factor: 11.598

3. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

Authors: Geraldine A Van der Auwera; Mauricio O Carneiro; Christopher Hartl; Ryan Poplin; Guillermo Del Angel; Ami Levy-Moonshine; Tadeusz Jordan; Khalid Shakir; David Roazen; Joel Thibault; Eric Banks; Kiran V Garimella; David Altshuler; Stacey Gabriel; Mark A DePristo
Journal: Curr Protoc Bioinformatics Date: 2013

4. Bayesian prediction of RNA translation from ribosome profiling.

Authors: Brandon Malone; Ilian Atanassov; Florian Aeschimann; Xinping Li; Helge Großhans; Christoph Dieterich
Journal: Nucleic Acids Res Date: 2017-04-07 Impact factor: 16.971

5. NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population.

Authors: Peng Zhang; Huaxia Luo; Yanyan Li; You Wang; Jiajia Wang; Yu Zheng; Yiwei Niu; Yirong Shi; Honghong Zhou; Tingrui Song; Quan Kang; Tao Xu; Shunmin He
Journal: Cell Rep Date: 2021-11-16 Impact factor: 9.423

6. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans.

Authors: Sarah E Calvo; David J Pagliarini; Vamsi K Mootha
Journal: Proc Natl Acad Sci U S A Date: 2009-04-16 Impact factor: 11.205

7. A Peptide Encoded by a Putative lncRNA HOXB-AS3 Suppresses Colon Cancer Growth.

Authors: Jin-Zhou Huang; Min Chen; Xing-Cheng Gao; Song Zhu; Hongyang Huang; Min Hu; Huifang Zhu; Guang-Rong Yan
Journal: Mol Cell Date: 2017-10-05 Impact factor: 17.970

8. NONCODE v2.0: decoding the non-coding.

Authors: Shunmin He; Changning Liu; Geir Skogerbø; Haitao Zhao; Jie Wang; Tao Liu; Baoyan Bai; Yi Zhao; Runsheng Chen
Journal: Nucleic Acids Res Date: 2007-11-13 Impact factor: 16.971

9. A peptide encoded by circular form of LINC-PINT suppresses oncogenic transcriptional elongation in glioblastoma.

Authors: Maolei Zhang; Kun Zhao; Xiaoping Xu; Yibing Yang; Sheng Yan; Ping Wei; Hui Liu; Jianbo Xu; Feizhe Xiao; Huangkai Zhou; Xuesong Yang; Nunu Huang; Jinglei Liu; Kejun He; Keping Xie; Gong Zhang; Suyun Huang; Nu Zhang
Journal: Nat Commun Date: 2018-10-26 Impact factor: 14.919

10. GENCODE reference annotation for the human and mouse genomes.

Authors: Adam Frankish; Mark Diekhans; Anne-Maud Ferreira; Rory Johnson; Irwin Jungreis; Jane Loveland; Jonathan M Mudge; Cristina Sisu; James Wright; Joel Armstrong; If Barnes; Andrew Berry; Alexandra Bignell; Silvia Carbonell Sala; Jacqueline Chrast; Fiona Cunningham; Tomás Di Domenico; Sarah Donaldson; Ian T Fiddes; Carlos García Girón; Jose Manuel Gonzalez; Tiago Grego; Matthew Hardy; Thibaut Hourlier; Toby Hunt; Osagie G Izuogu; Julien Lagarde; Fergal J Martin; Laura Martínez; Shamika Mohanan; Paul Muir; Fabio C P Navarro; Anne Parker; Baikang Pei; Fernando Pozo; Magali Ruffier; Bianca M Schmitt; Eloise Stapleton; Marie-Marthe Suner; Irina Sycheva; Barbara Uszczynska-Ratajczak; Jinuri Xu; Andrew Yates; Daniel Zerbino; Yan Zhang; Bronwen Aken; Jyoti S Choudhary; Mark Gerstein; Roderic Guigó; Tim J P Hubbard; Manolis Kellis; Benedict Paten; Alexandre Reymond; Michael L Tress; Paul Flicek
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
