Marie A Brunet, Mylène Brunelle, Jean-François Lucier, Vivian Delcourt, Maxime Levesque, Frédéric Grenier, Sondos Samandi, Sébastien Leblanc, Jean-David Aguilar, Pascal Dufour, Jean-Francois Jacques, Isabelle Fournier, Aida Ouangraoua, Michelle S Scott, François-Michel Boisvert, Xavier Roucou.
Abstract
Advances in proteomics and sequencing have highlighted many non-annotated open reading frames (ORFs) in eukaryotic genomes. Genome annotations, cornerstones of today's research, mostly rely on protein prior knowledge and on ab initio prediction algorithms. Such algorithms notably enforce an arbitrary criterion of one coding sequence (CDS) per transcript, leading to a substantial underestimation of the coding potential of eukaryotes. Here, we present OpenProt, the first database fully endorsing a polycistronic model of eukaryotic genomes to date. OpenProt contains all possible ORFs longer than 30 codons across 10 species, and cumulates supporting evidence such as protein conservation, translation and expression. OpenProt annotates all known proteins (RefProts), novel predicted isoforms (Isoforms) and novel predicted proteins from alternative ORFs (AltProts). It incorporates cutting-edge algorithms to evaluate protein orthology and re-interrogate publicly available ribosome profiling and mass spectrometry datasets, supporting the annotation of thousands of predicted ORFs. The constantly growing database currently cumulates evidence from 87 ribosome profiling and 114 mass spectrometry studies from several species, tissues and cell lines. All data is freely available and downloadable from a web platform (www.openprot.org) supporting a genome browser and advanced queries for each species. Thus, OpenProt enables a more comprehensive landscape of eukaryotic genomes' coding potential.
Year: 2019 PMID: 30299502 PMCID: PMC6323990 DOI: 10.1093/nar/gky936
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. OpenProt pipeline graphical representation. The OpenProt pipeline has two main features: prediction and evidence collection. OpenProt enforces a polycistronic model of eukaryotic genes, contrary to the current dogma of one CDS per transcript. The protein sequence similarity filter (Homology) takes two arguments, as described in the Materials and Methods. The hidden proteome consists of currently non-annotated ORFs highlighted by OpenProt. These ORFs are either novel isoforms of known CDSs (II_ accessions) or novel alternative proteins (IP_ accessions). All evidence collection parameters are described in the Materials and Methods section.
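The prediction step can be illustrated with a minimal sketch that reports every ORF above the length cutoff on a transcript, rather than a single CDS. This is not OpenProt's implementation: the function name `find_orfs`, the ATG-only start codon, the forward-strand-only scan, and the choice to count the start codon but not the stop codon toward the 30-codon cutoff are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, length_cutoff=30):
    """Return (start, end, frame) for each ATG-to-stop ORF longer than the cutoff.

    Coordinates are 0-based and end-exclusive; the span includes the stop codon.
    Assumptions (not taken from the source): ATG is the only start codon, only
    the given strand is scanned, and the stop codon is not counted in the length.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3  # coding codons, stop excluded
                if n_codons > length_cutoff:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# A toy transcript carrying two ORFs in different reading frames.
orf_a = "ATG" + "GCC" * 35 + "TAA"   # 36 coding codons, frame 0
orf_b = "ATG" + "GCA" * 40 + "TGA"   # 41 coding codons, frame 1 after the spacer
transcript = orf_a + "C" + orf_b

for start, end, frame in find_orfs(transcript):
    print(f"frame {frame}: {start}-{end} ({(end - start) // 3 - 1} codons)")
```

Under a one-CDS-per-transcript rule only one of these two ORFs would be annotated; a polycistronic model keeps both.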
OpenProt (1.3) prediction pipeline output
| Species | Genome assembly | Annotation: NCBI RefSeq | Annotation: Ensembl | ORFeome: Total # | ORFeome: Ref # | ORFeome: II_ # | ORFeome: IP_ # |
|---|---|---|---|---|---|---|---|
| Homo sapiens | GRCh38.p5 | GRCh38.p7 | GRCh38.83 | 646 403 | 129 888 | 55 053 | 461 462 |
| Pan troglodytes | CHIMP2.1.4 | CHIMP2.1.4 | CHIMP2.1.4.87 | 227 950 | 37 059 | 16 402 | 174 489 |
| Mus musculus | GRCm38.p4 | GRCm38.p4 | GRCm38.84 | 486 198 | 82 477 | 30 220 | 373 501 |
| Rattus norvegicus | Rnor_6.0 | Rnor_6.0 | Rnor_6.0.84 | 289 077 | 51 423 | 6718 | 230 936 |
| Bos taurus | UMD_3.1 | UMD_3.1 | UMD_3.1.86 | 220 483 | 49 026 | 6942 | 164 515 |
| Ovis aries | Oar_v3.1 | Oar_v3.1 | Oar_v3.1.89 | 340 974 | 40 000 | 19 022 | 281 952 |
| Danio rerio | GRCz10 | GRCz10 | GRCz10.84 | 257 534 | 56 247 | 14 523 | 186 764 |
| Drosophila melanogaster | Release 6 plus ISO1 MT | BDGP6 | BDGP6.84 | 97 934 | 22 204 | 2148 | 73 582 |
| Caenorhabditis elegans | WBcel235 | WBcel235 | WBcel235.84 | 94 087 | 29 563 | 2340 | 62 184 |
| Saccharomyces cerevisiae | R64 | R64 | R64.83 | 16 865 | 6613 | 28 | 10 224 |
Ref = currently annotated protein (RefProt), II_ = novel isoforms of known protein, IP_ = novel protein from alternative ORF (AltProt).
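As a quick sanity check on the table above, the sketch below verifies that each ORFeome total equals the sum of its Ref, II_ and IP_ counts, and reports the share of each ORFeome made up of IP_ entries. The numbers are copied from the table and keyed by genome assembly; the names `ORFEOME`, `inconsistent_rows` and `ip_share` are illustrative, not part of OpenProt.

```python
# (Total, Ref, II_, IP_) per genome assembly, copied from the table above.
ORFEOME = {
    "GRCh38.p5": (646403, 129888, 55053, 461462),
    "CHIMP2.1.4": (227950, 37059, 16402, 174489),
    "GRCm38.p4": (486198, 82477, 30220, 373501),
    "Rnor_6.0": (289077, 51423, 6718, 230936),
    "UMD_3.1": (220483, 49026, 6942, 164515),
    "Oar_v3.1": (340974, 40000, 19022, 281952),
    "GRCz10": (257534, 56247, 14523, 186764),
    "Release 6 plus ISO1 MT": (97934, 22204, 2148, 73582),
    "WBcel235": (94087, 29563, 2340, 62184),
    "R64": (16865, 6613, 28, 10224),
}


def inconsistent_rows(table):
    """Return the assemblies whose Total differs from Ref + II_ + IP_."""
    return [
        assembly
        for assembly, (total, ref, ii, ip) in table.items()
        if total != ref + ii + ip
    ]


def ip_share(table):
    """Return the fraction of each ORFeome made up of IP_ entries."""
    return {assembly: ip / total for assembly, (total, _, _, ip) in table.items()}


print("Rows that do not add up:", inconsistent_rows(ORFEOME))
for assembly, share in ip_share(ORFEOME).items():
    print(f"{assembly}: {share:.1%} of predicted ORFs are IP_")
```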
Figure 2. OpenProt features. (A) OpenProt Search page. All OpenProt pages are accessible from the top menu. The Search page displays advanced query settings (species, annotation, gene, transcript and protein) and allows filtering by supporting evidence and other characteristics. The result table displays the main protein characteristics, the supporting evidence, and a link to a Details page (shown in C). A getting-started tutorial for the Search page is available from the help menu (www.openprot.org/p/help). (B) OpenProt Genome Browser page. The Browser page displays advanced query settings (species, annotation, gene, transcript, protein and genomic coordinates) and a genome browser with customizable tracks. By default, the genome browser shows tracks of transcripts, proteins and peptide detection (from re-analyzed mass spectrometry studies). A getting-started tutorial for the Browser page is available from the help menu (www.openprot.org/p/help). (C) Each ORF has an individual ‘Details’ page. The page holds 5 tabs; the first one, shown here, displays information such as genomic and transcript coordinates, and other characteristics. Protein and DNA sequences are available from the details link. (D) The ‘Conservation’ tab displays the ORF orthologs and paralogs. The size of the nodes represents the number of related ORFs and the colour shows the identity percentage. The conservation score (top of the tab) corresponds to the number of species in which an ortholog was identified, out of the 10 currently supported by OpenProt. (E) The ‘Translation’ tab displays PRICE analysis results with the associated P-value and the read count per sample. The TE score (top of the tab) corresponds to the number of studies in which this ORF was detected. (F) The ‘Mass spectrometry’ tab displays the identified peptides with the study name and a link to the original data. The match count corresponds to the number of PSMs (peptide spectrum matches) per peptide, per study. The MS score (top of the tab) corresponds to the sum of unique peptides per study. More information and tutorials are available from the Help page (www.openprot.org/p/help).
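The three scores described in the caption can be sketched as simple counts. The record layouts below (lists of species names, `(study, sample)` pairs and `(study, peptide)` pairs) and the function names are hypothetical and do not reflect OpenProt's schema or API. Reading "unique peptides" as distinct peptide sequences within a study is also an assumption.

```python
from collections import defaultdict


def conservation_score(ortholog_species):
    """Number of distinct species in which an ortholog was identified."""
    return len(set(ortholog_species))


def te_score(detections):
    """Number of distinct Ribo-seq studies in which the ORF was detected.

    `detections` holds one (study, sample) pair per detection.
    """
    return len({study for study, _sample in detections})


def ms_score(psms):
    """Sum over studies of the number of distinct peptides seen in each study.

    `psms` holds one (study, peptide) pair per peptide spectrum match, so a
    peptide matched several times in one study is counted once for that study.
    """
    peptides_by_study = defaultdict(set)
    for study, peptide in psms:
        peptides_by_study[study].add(peptide)
    return sum(len(peptides) for peptides in peptides_by_study.values())


# Hypothetical evidence for a single ORF.
orthologs = ["Mus musculus", "Pan troglodytes", "Mus musculus"]
detections = [("ribo_1", "rep1"), ("ribo_1", "rep2"), ("ribo_2", "rep1")]
psms = [("ms_1", "PEPA"), ("ms_1", "PEPA"), ("ms_1", "PEPB"), ("ms_2", "PEPA")]

print(conservation_score(orthologs), te_score(detections), ms_score(psms))
```

Keeping the match count (PSMs per peptide, per study) separate from the MS score means that a single abundant peptide cannot inflate the score on its own.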
OpenProt (1.3) evidence collection output
| Species | Conservation: Sp # | Conservation: Ref # | Conservation: II_ # | Conservation: IP_ # | Translation (TE): St # | Translation (TE): Ref # | Translation (TE): II_ # | Translation (TE): IP_ # | Protein (MS): St # | Protein (MS): Ref # | Protein (MS): II_ # | Protein (MS): IP_ # |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Homo sapiens | 9 | 189 319 | 38 325 | 239 394 | 33 | 17 435 | 2048 | 5696 | 62 | 113 006 | 1455 | 28 641 |
| Pan troglodytes | 9 | 63 408 | 28 930 | 148 989 | 0 | n/a | n/a | n/a | 0 | n/a | n/a | n/a |
| Mus musculus | 9 | 131 130 | 21 245 | 121 890 | 22 | 14 607 | 1088 | 3081 | 28 | 61 440 | 165 | 2877 |
| Rattus norvegicus | 9 | 82 951 | 5354 | 81 600 | 2 | 6661 | 202 | 870 | 8 | 21 282 | 19 | 410 |
| Bos taurus | 9 | 70 697 | 6086 | 88 550 | 0 | n/a | n/a | n/a | 1 | 12 778 | 5 | 37 |
| Ovis aries | 9 | 56 331 | 28 247 | 107 900 | 0 | n/a | n/a | n/a | 1 | 1466 | 18 | 69 |
| Danio rerio | 9 | 81 958 | 19 560 | 8965 | 2 | 9 | 1 | 0 | 7 | 26 114 | 263 | 386 |
| Drosophila melanogaster | 9 | 39 246 | 763 | 452 | 3 | 2453 | 39 | 113 | 3 | 9783 | 20 | 113 |
| Caenorhabditis elegans | 9 | 28 429 | 861 | 450 | 5 | 8142 | 161 | 84 | 0 | n/a | n/a | n/a |
| Saccharomyces cerevisiae | 9 | 5842 | 5 | 38 | 20 | 5357 | 4 | 283 | 4 | 4028 | 0 | 20 |
Sp = number of species evaluated for orthology relationships (not counting the queried species); St = number of studies re-analyzed by OpenProt; Ref = currently annotated CDS (RefORF); II_ = novel isoforms of known CDS; IP_ = novel CDS from alternative ORF (AltORF); n/a = no dataset has been re-analyzed for this species yet (OpenProt release 1.3). Conservation evidence = all proteins with at least one ortholog in at least one species. Translation evidence = all ORFs detected at least once by PRICE analysis of Ribo-seq data. Protein evidence = all proteins with at least one unique peptide in at least one study.
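The per-species study counts (St #) in the table above can be cross-checked against the totals quoted in the abstract, 87 ribosome profiling and 114 mass spectrometry studies. The sketch below lists the counts in table row order; the variable names are illustrative only.

```python
# St # columns from the table above, in row order (ten species).
RIBO_SEQ_STUDIES = [33, 0, 22, 2, 0, 0, 2, 3, 5, 20]
MS_STUDIES = [62, 0, 28, 8, 1, 1, 7, 3, 0, 4]

ribo_total = sum(RIBO_SEQ_STUDIES)
ms_total = sum(MS_STUDIES)

# Species rows with no re-analyzed dataset are the ones shown as n/a.
ribo_missing = RIBO_SEQ_STUDIES.count(0)
ms_missing = MS_STUDIES.count(0)

print(f"Ribo-seq studies: {ribo_total} ({ribo_missing} species without data)")
print(f"MS studies: {ms_total} ({ms_missing} species without data)")
```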