Sebastien A Choteau, Audrey Wagner, Philippe Pierre, Lionel Spinelli, Christine Brun.
Abstract
The development of high-throughput technologies revealed the existence of non-canonical short open reading frames (sORFs) on most eukaryotic ribonucleic acids. They are ubiquitous genetic elements conserved across species and suspected to be involved in numerous cellular processes. MetamORF (https://metamorf.hb.univ-amu.fr/) aims to provide a repository of unique sORFs identified in the human and mouse genomes with both experimental and computational approaches. By gathering publicly available sORF data, normalizing them and summarizing redundant information, we were able to identify a total of 1 162 675 unique sORFs. Although ORFs are usually characterized as short, upstream or downstream, there is currently no clear consensus on the definition of these categories. The data have therefore been reprocessed using a normalized nomenclature. MetamORF enables new analyses at the locus, gene, transcript and ORF levels, which should make it possible to address new questions regarding sORF functions in the future. The repository is available through a user-friendly web interface that supports browsing, visualization, filtering on multiple criteria and export. sORFs can be searched starting from a gene, a transcript or an ORF ID, by looking in a genome area or by browsing the whole repository for a species. The database content has also been made available through track hubs at the UCSC Genome Browser. Finally, we demonstrated an enrichment of genes harboring upstream ORFs among genes expressed in response to reticular stress. Database URL: https://metamorf.hb.univ-amu.fr/
Year: 2021 PMID: 34156446 PMCID: PMC8218702 DOI: 10.1093/database/baab032
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. MetamORF pipeline. This figure represents the workflow used to build MetamORF. First, the data from the selected sources were inserted into the database, and the absolute genomic coordinates were homogenized from their original annotation version to the most recent version (GRCh38 or GRCm38). Then, redundant information, i.e. entries describing the same ORF (same start, stop and splicing), was merged to obtain a single unique entry for each ORF detected on the human and mouse genomes. The missing information (sequences and transcript biotypes) was downloaded from Ensembl, and the ORF relative coordinates were computed. Finally, the cell types and ORF classes were normalized, and the Kozak contexts were computed using the sequences flanking the start codons.
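The merging step of the pipeline can be illustrated with a minimal sketch. It assumes that two entries describe the same ORF when they share chromosome, strand, start, stop and exon coordinates; the field names (`chrom`, `exons`, `source`, `cell_types`) and the function `merge_redundant_orfs` are hypothetical and are not taken from the MetamORF code base.

```python
from collections import defaultdict


def merge_redundant_orfs(entries):
    """Group entries sharing chromosome, strand, start, stop and exon structure.

    Each unique ORF keeps the set of sources and cell types that reported it.
    """
    merged = defaultdict(lambda: {"sources": set(), "cell_types": set()})
    for entry in entries:
        # The exon list is turned into a tuple so it can be part of a dict key.
        key = (
            entry["chrom"],
            entry["strand"],
            entry["start"],
            entry["stop"],
            tuple(entry["exons"]),
        )
        merged[key]["sources"].add(entry["source"])
        merged[key]["cell_types"].update(entry.get("cell_types", ()))
    return dict(merged)


# Toy entries (invented coordinates): the first two describe the same ORF,
# the third has the same start and stop but a different splicing pattern.
entries = [
    {"chrom": "1", "strand": "+", "start": 100, "stop": 160,
     "exons": [(100, 160)], "source": "A", "cell_types": ["HeLa"]},
    {"chrom": "1", "strand": "+", "start": 100, "stop": 160,
     "exons": [(100, 160)], "source": "B", "cell_types": ["HEK293"]},
    {"chrom": "1", "strand": "+", "start": 100, "stop": 160,
     "exons": [(100, 120), (140, 160)], "source": "B"},
]

unique_orfs = merge_redundant_orfs(entries)
```

Keeping the exon structure in the key means that two ORFs with identical start and stop codons but different splicing remain distinct entries, which matches the "same start, stop and splicing" criterion of the caption.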
Information about the data sources used to build MetamORF
| Publication | DOI |
|---|---|
| Mackowiak | 10.1186/s13059-015-0742-x |
| Erhard | 10.1038/nmeth.4631 |
| Johnstone | 10.15252/embj.201592759 |
| Laumont | 10.1038/ncomms10238 |
| Samandi | 10.7554/eLife.27860 |
| Olexiouk | 10.1093/nar/gkx1130 |
See Supplementary Table S1 for more information about these data sources.
Features used to characterize the sORFs
| Family | Feature | Details |
|---|---|---|
| Location | Chromosome | The chromosome or scaffold on which the ORF is located |
| | Strand | The strand of the sORF |
| | ORF start | The absolute genomic coordinates of the start codon (position of the first nucleotide) |
| | ORF stop | The absolute genomic coordinates of the stop codon (position of the third nucleotide) |
| | Splicing status | Is the sORF spliced? |
| | Splicing coordinates | The coordinates of the start and end of each exon constituting the sORF |
| | Transcript | The name or ID of the transcript(s) related to the sORF (possibly with transcript strand, start and end positions and transcript biotype) |
| | Gene | The name, symbol, alias or ID of the gene(s) related to the sORF (when not intergenic) |
| Lengths | Length | The length of the sORF (in nucleotides) |
| | Putative sPEP length | The length of the (putative) sORF-encoded peptide in amino acids |
| Category | Category | The category to which the sORF belongs (e.g. upstream or downstream) |
| Sequence signature | Start codon sequence | The nucleic sequence of the sORF start codon |
| | Nucleic sequence | The nucleic sequence of the sORF |
| | Amino acid sequence | The amino acid sequence of the (putative) sORF-encoded peptide |
| Environmental signature | Kozak context | Has a Kozak context been identified for the sORF start codon? |
| Conservation | PhyloCSF score | The PhyloCSF score computed for the sORF |
| | PhastCons score | The PhastCons score computed for the sORF |
| Coding potential assessment | FLOSS class and score | The FLOSS class and score computed for the sORF |
| | ORF score | The ORF score computed for the sORF |
| Biological context | Cell context | The cellular context in which the sORF has been identified or detected |
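The Kozak context feature is derived from the nucleotides flanking the start codon. The exact rule used by MetamORF is not given here, so the sketch below assumes a commonly used criterion: with the first nucleotide of the start codon numbered +1, a purine (A or G) at position −3 and a G at position +4 define a strong context, one of the two a moderate context, and neither a weak context. The function name `kozak_strength` and the three labels are illustrative assumptions.

```python
def kozak_strength(transcript_seq, start_idx):
    """Classify the Kozak context of the start codon beginning at start_idx.

    start_idx is the 0-based index of the first nucleotide of the start codon.
    Assumed rule (not taken from the source): purine at -3 and G at +4.
    """
    seq = transcript_seq.upper().replace("U", "T")
    # Position -3 is three nucleotides upstream of the start codon (no position 0).
    minus3 = seq[start_idx - 3] if start_idx >= 3 else None
    # Position +4 is the nucleotide immediately after the start codon.
    plus4 = seq[start_idx + 3] if start_idx + 3 < len(seq) else None
    hits = (minus3 in ("A", "G")) + (plus4 == "G")
    return ("weak", "moderate", "strong")[hits]


# Toy sequences (invented) with the start codon at index 6.
strong = kozak_strength("GCCACCATGGCG", 6)
moderate = kozak_strength("TTTATTATGTTT", 6)
weak = kozak_strength("TTTTTTATGTTT", 6)
```

The function does not check that the codon at `start_idx` is ATG, since sORFs may start at non-canonical codons; a stricter implementation could validate it against the start codon sequence feature.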
Key MetamORF statistics
| | Feature | H. sapiens | M. musculus |
|---|---|---|---|
| Original data sources | ORFs | 1 344 978 | 1 249 176 |
| | Transcripts | 101 597 | 85 653 |
| | Predicted ORFs for which the transcript is unknown | 181 122 | 213 301 |
| | ORFs detected by Ribo-seq for which the transcript is unknown | 79 422 | 8546 |
| | ORFs detected by MS for which the transcript is unknown | 54 | 0 |
| | ORF to transcript associations | 3 379 219 | 2 066 627 |
| | ORFs predicted | 202 309 | 222 705 |
| | ORFs identified by ribosome profiling | 1 142 669 | 1 026 471 |
| | ORFs identified by MS | 166 | 0 |
| | ORFs for which the homogenization of genomic coordinates failed | 709 | 0 |
| MetamORF database | ORFs | 664 771 | 497 904 |
| | Transcripts | 90 406 | 63 147 |
| | Predicted ORFs for which the transcript is unknown | 13 440 | 14 327 |
| | ORFs detected by Ribo-seq for which the transcript is unknown | 71 158 | 2 |
| | ORFs detected by MS for which the transcript is unknown | 48 | 0 |
| | ORFs for which the transcript is unknown | 83 403 | 14 329 |
| | ORF to transcript associations | 729 793 | 696 785 |
| | ORFs predicted | 17 027 | 14 500 |
| | ORFs identified by ribosome profiling | 664 771 | 497 904 |
| | ORFs identified by MS | 147 | 0 |
| | Genes harboring at least 1 sORF | 23 767 | 15 869 |
| | ORFs having at least one class annotation (short, upstream) | 630 953 | 497 904 |
MS: mass spectrometry.
Figure 2. Count of ORFs in each class. The bar plots represent the count of ORFs annotated for each class for (A) H. sapiens and (B) M. musculus. The percentages displayed over the bars indicate the proportion of ORFs annotated in the class over the total number of ORFs registered in the database for the species. NMD: non-sense-mediated decay; NSD: non-stop decay.
Figure 3. MetamORF gene-centric view. The page displays the transcripts and the ORFs related to the SGK3 gene. A filter has been applied to select exclusively the ORFs detected in HFF, Jurkat, RPE-1, HEK293 or HeLa cells. Other filters may be used, and the results can be exported as CSV, FASTA or BED files.
Enrichment analysis
| Gene list | List size | Genes harboring uORFs | Intersection size | Universe | FDR | Odds ratio |
|---|---|---|---|---|---|---|
| ATF4 targets | 392 | 8863 | 256 | 19 985 | 5.52 × 10⁻¹⁷ | 2.40 |
| CHOP targets | 256 | 8863 | 166 | 19 985 | 3.34 × 10⁻¹¹ | 2.34 |
| Genes congruently upregulated | 484 | 8863 | 268 | 19 985 | 5.41 × 10⁻⁷ | 1.57 |
| Genes transitionally upregulated | 1068 | 8863 | 736 | 19 985 | 1.21 × 10⁻⁶¹ | 2.94 |
See Supplementary Table S7 for more information about the gene lists.
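The odds ratios in the enrichment table can be checked from the other columns. The sketch below assumes that each odds ratio is the plain sample odds ratio of a 2×2 contingency table (in the gene list or not, harboring uORFs or not); the statistical test and the FDR correction are not described here and are not reproduced. The function name `odds_ratio` is illustrative.

```python
def odds_ratio(list_size, uorf_genes, intersection, universe):
    """Sample odds ratio of the 2x2 table: gene list membership vs. uORF status."""
    in_list_with = intersection                    # in list, harboring uORFs
    in_list_without = list_size - intersection     # in list, no uORF
    out_list_with = uorf_genes - intersection      # not in list, harboring uORFs
    out_list_without = universe - uorf_genes - in_list_without  # neither
    return (in_list_with * out_list_without) / (in_list_without * out_list_with)


# (list size, genes harboring uORFs, intersection size, universe) from the table.
rows = {
    "ATF4 targets": (392, 8863, 256, 19985),
    "CHOP targets": (256, 8863, 166, 19985),
    "Genes congruently upregulated": (484, 8863, 268, 19985),
    "Genes transitionally upregulated": (1068, 8863, 736, 19985),
}

computed = {name: round(odds_ratio(*values), 2) for name, values in rows.items()}
```

Under this assumption, the computed values round to the odds ratios reported in the table (2.40, 2.34, 1.57 and 2.94).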