Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.

Husen M Umer¹, Enrique Audain², Yafeng Zhu³, Julianus Pfeuffer⁴, Timo Sachsenberg⁵, Janne Lehtiö¹, Rui Branca¹, Yasset Perez-Riverol⁶.

Abstract

SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools generate protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs, and other non-canonical transcripts, such as those produced by alternative splicing events. They also support exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tools generate variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD, and mutations detected by sequencing patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation by the DecoyPyrat algorithm. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to more than 5% of the total number of peptides identified.

AVAILABILITY: The software is freely available: pypgatk at https://github.com/bigbio/py-pgatk/ and pgdb at https://nf-co.re/pgdb.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 34904638 PMCID: PMC8825679 DOI: 10.1093/bioinformatics/btab838

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Proteogenomics is a rapidly developing multiomics field that integrates genomics and transcriptomics information with proteomics to improve gene annotation, often uncovering novel or non-canonical protein-coding regions in the genome (Branca et al.). One of its most important applications is the study of cancer cells and tumors, where identifying cancer-specific proteins holds great potential both for elucidating cancer biology and for developing cancer therapies. However, the discovery of such proteins remains particularly challenging and still relies largely on evidence from genome sequencing data rather than on the protein data that have become abundant (Perez-Riverol et al.). Recent applications of proteogenomics have enabled multiomics detection of novel peptide sequences that are not present in the canonical protein database. For instance, Ruiz Cuevas et al. recently identified a large number of non-canonical proteins in B cell lymphomas. However, customized protein databases are needed to identify such peptides. Recently, tools for generating sample-specific protein databases have been implemented using genomic sequencing data (Ruggles et al.) and transcriptomics data (Cesnik et al.; Cifani et al.). Since matching sequencing data are not available for a large fraction of the currently available proteomics datasets, resources have been developed to provide protein databases generated from cancer somatic mutations and genomic variants (Zhang et al.). To support high-throughput proteogenomics analysis, we present a Python application integrated into a Nextflow workflow that facilitates the generation of proteogenomics databases from sample-specific and public resources under varying conditions (e.g. cancer type and transcript biotype). The aim is to enable the identification of variant proteins (derived from single nucleotide variant mutations) and non-canonical or cryptic proteins (from normally dormant regions of the genome).

2 Implementation

We implemented pypgatk, a Python package that provides tools to generate protein databases from non-canonical sequences as well as DNA variants and mutations from public resources and custom files (Fig. 1a).

Fig. 1.

(a) pypgatk and pgdb components to generate ENSEMBL-based proteogenomics databases. (b) Reanalyzed datasets (four human and two mouse); number of canonical, non-canonical, variant and mutated peptides identified using cell-type specific proteogenomics databases

2.1 Non-canonical protein databases

Non-canonical proteins are a product of translation of transcripts that are not reported as protein coding in the reference protein databases, or a product of out-of-frame translation of canonical transcripts (Ruiz Cuevas et al.). While many non-canonical proteins could be attributed to the still incomplete reference databases, they might also be attributed to the activation of those genes under certain conditions, such as genetic and epigenetic misregulation in cancer (Zhu et al.). We have developed the dnaseq-to-proteindb tool to generate protein sequences from non-canonical transcripts such as pseudogenes and lncRNAs by performing three-frame translation. It also extracts alternative reading frames from canonical protein-coding genes to enable the detection of out-of-frame cryptic proteins. Furthermore, the ensembl-downloader tool enables automatic download of the latest ENSEMBL resources, including gene annotations, the reference genome and canonical proteins for the species of interest.
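The three-frame translation described above can be illustrated with a minimal Python sketch. It is not the dnaseq-to-proteindb implementation: the function names, the standard genetic code table, and the `min_length` cut-off are assumptions chosen for illustration only.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translation(transcript):
    """Return the translations of the three forward reading frames."""
    return [translate(transcript[offset:]) for offset in range(3)]


def stop_free_segments(protein, min_length=7):
    """Split a translation at stop codons ('*') and keep segments of at
    least min_length residues (hypothetical cut-off, not a pypgatk default)."""
    return [segment for segment in protein.split("*") if len(segment) >= min_length]


if __name__ == "__main__":
    for frame, protein in enumerate(three_frame_translation("ATGGCCTAA")):
        print(frame, protein)
```

Applied to a canonical coding sequence, frames 1 and 2 of this sketch correspond to the out-of-frame translations mentioned in the text, while frame 0 reproduces the annotated protein.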

2.2 Variant protein databases

Detection of altered proteins from proteomics data requires the inclusion of the mutated sequences in the target databases. However, because of the large number of potential DNA variants, only potentially relevant variant sequences should be included to keep the database size under control. Here, we implemented methods to automate the generation of variant proteins from publicly available cancer mutation datasets, cancer cell lines and custom Variant Call Format (VCF) files obtained from genome sequencing. cosmic-to-proteindb and cbioportal-to-proteindb enable the generation of cancer-type specific protein databases by generating mutated protein sequences based on genomic mutations identified in cancer samples. cosmic-to-proteindb curates mutations from the Catalogue Of Somatic Mutations In Cancer (COSMIC) and allows the mutations to be filtered by cancer type or tissue of origin. Alternatively, cbioportal-to-proteindb translates genomic mutations reported by thousands of cancer studies through cBioPortal. pypgatk also enables downloading and processing mutations from ENSEMBL and gnomAD resources. vcf-to-proteindb translates the genomic variants into variant protein sequences. The variants can be filtered based on functional consequences as well as allele frequency to enable a special focus on common variants. The vcf-to-proteindb command accepts a custom VCF file from any species or sample of interest and generates a database of altered protein-coding sequences, which is valuable when whole-exome or whole-genome sequencing data are available; for instance, to detect cancer neoantigens from passenger mutations.
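As an illustration of the allele-frequency filtering mentioned above, the following Python sketch selects ALT alleles from VCF records whose frequency reaches a threshold. It does not reproduce vcf-to-proteindb: the function names, the `AF` INFO key and the 0.01 threshold are assumptions, and the mapping of a retained variant onto a transcript and its translation are left out.

```python
def parse_info(info_field):
    """Parse a VCF INFO column ('KEY=VALUE;FLAG;...') into a dict."""
    info = {}
    for item in info_field.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flags carry no value
    return info


def filter_by_allele_frequency(vcf_lines, min_af=0.01, af_key="AF"):
    """Yield (chrom, pos, ref, alt, af) for each ALT allele with af >= min_af.

    min_af and af_key are illustrative assumptions, not pypgatk defaults.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alts = fields[:5]
        info = parse_info(fields[7])
        af_values = str(info.get(af_key, "")).split(",")
        for alt, af_text in zip(alts.split(","), af_values):
            try:
                af = float(af_text)
            except ValueError:
                continue  # missing or non-numeric frequency: skip this allele
            if af >= min_af:
                yield chrom, int(pos), ref, alt, af


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\trs1\tA\tG,T\t.\tPASS\tAF=0.2,0.001;DP=50",
        "2\t200\t.\tC\tT\t.\tPASS\tDP=10",
    ]
    for record in filter_by_allele_frequency(example):
        print(record)
```

Multi-allelic records are handled per allele, so a common and a rare ALT allele at the same position are kept or dropped independently.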

3 ENSEMBL-based proteogenomic databases

To enable the generation of ENSEMBL-based proteogenomic databases, we have also built the Proteomics-Genomics DataBase (pgdb—https://nf-co.re/pgdb) workflow in Nextflow using bioconda and BioContainers. The pipeline integrates the various commands of pypgatk allowing the user to generate protein databases by simple parameter selection without any additional input required from the user. Also, the pipeline can be used to generate protein databases for any ENSEMBL species, except for the processes that are dependent on data that are only available for Homo sapiens.

3.1 Identification of non-canonical peptides

We applied pgdb to generate cell-type specific databases for 64 human cell lines (Fig. 1b and Supplementary Note S1 and S2). Mutations from the COSMIC Cell Line project and the Broad CCLE project through cBioPortal were downloaded for each cell line to generate the respective set of variant protein sequences. Additionally, a database of non-canonical proteins was generated from the latest human genome assembly. The variant protein database from each cell line was appended to the non-canonical and canonical protein databases and the decoy sequences were generated to search MS/MS proteomics datasets from the corresponding cell lines. The proteomics data were obtained through the PRIDE database (PXD005946, PXD019263, PXD004452 and PXD014145). proteomicsLFQ (https://nf-co.re/proteomicslfq) was used to identify the novel peptides (Supplementary Note S3). Overall, 402 512 target peptide sequences were identified, including 43 501 non-canonical peptides and 786 variant peptide sequences (Table 1 and Supplementary Note S4 and S5). The majority of the non-canonical peptides were novel coding sequences in their entirety whereas only 16% matched canonical protein sequences with one amino acid mismatch.
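The database assembly step described above (appending the variant database to the canonical and non-canonical databases, then generating decoys) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: decoys are plain reversed sequences with a hypothetical `DECOY_` prefix, whereas DecoyPyrat, the algorithm used by pypgatk, is more elaborate and is not reproduced here.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of header -> sequence (insertion ordered)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def build_target_decoy(databases, decoy_prefix="DECOY_"):
    """Merge target databases and append one reversed decoy per target.

    When the same header occurs in several databases, the first one wins.
    Plain reversal is a stand-in for a real decoy algorithm.
    """
    targets = {}
    for database in databases:
        for header, sequence in database.items():
            targets.setdefault(header, sequence)
    decoys = {decoy_prefix + header: seq[::-1] for header, seq in targets.items()}
    return {**targets, **decoys}


if __name__ == "__main__":
    canonical = read_fasta(">P1\nMKT\nAYR\n")
    non_canonical = read_fasta(">NC1\nMLLK\n")
    variant = read_fasta(">VAR1\nMKTAHR\n")
    combined = build_target_decoy([canonical, non_canonical, variant])
    for header, sequence in combined.items():
        print(f">{header}\n{sequence}")
```

Keeping targets and decoys in one database lets the search engine estimate the false discovery rate from the proportion of decoy matches.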

Table 1.

Number of peptides identified per class

Species	Class	#PSMs	#Peptide sequences	#Novel peptides
Homo sapiens	Canonical	4 125 497	322 967	NA
	Non-canonical	315 085	74 001	43 501
	Mutated	16 518	5544	786
Mus musculus	Canonical	1 159 049	105 338	NA
	Variant	4630	1928	374
	Mutated	2883	913	166
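As a consistency check on the human rows of Table 1, the short Python calculation below sums the peptide sequences per class and computes the share of novel non-canonical peptides. The variable names are illustrative; the numbers are copied from the table.

```python
# Peptide sequences per class for Homo sapiens (Table 1).
human_peptide_sequences = {
    "Canonical": 322_967,
    "Non-canonical": 74_001,
    "Mutated": 5_544,
}
# Novel peptides per class for Homo sapiens (Table 1).
human_novel_peptides = {"Non-canonical": 43_501, "Mutated": 786}

total_sequences = sum(human_peptide_sequences.values())
novel_share = human_novel_peptides["Non-canonical"] / total_sequences

print(f"Total human peptide sequences: {total_sequences}")
print(f"Novel non-canonical share: {novel_share:.1%}")
```

The sum equals the 402 512 target peptide sequences reported in the text, and the novel non-canonical share of roughly 11% is consistent with the "more than 5%" stated in the abstract.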

Additionally, we reanalyzed two mouse datasets (PXD018891 and PXD006439) obtained from the B16 melanoma cell line. A proteogenomic database was generated using mouse germline variants from the ENSEMBL variation database (release 104) and somatic mutations detected in mouse melanoma tumors. Overall, 374 variant peptides and 166 mutated peptides were identified. The identified peptides with the corresponding mass spectra and metadata annotations can be accessed via ProteomeXchange (PXD029360 and PXD029362).

4 Conclusions

The developed tools facilitate the creation of proteogenomics databases based on ENSEMBL genomes and other relevant sources of genome variation information. pgdb is the first Nextflow workflow for proteogenomics database generation, and its development within the nf-core community will ensure its stability, continued development and community support. pypgatk (https://pgatk.readthedocs.io/en/latest/pypgatk.html) and pgdb (https://nf-co.re/pgdb/1.0.0/usage) include extensive documentation to help researchers create their custom proteogenomics databases.

Funding

This work was supported by the Swedish Cancer Society [CAN 2017/685 and CAN 2020/1269 PjF], the Erling-Persson Family Foundation [12/12-2017 and 22/9-2020], DART and Rescuer EU-projects to H.U., J.L. and R.B.; the National Natural Science Foundation of China [32100505] and Guangdong Science and Technology Department [2020B1212060018, 2020B1212030004] to Y.Z.; the German Ministry of Research and Education [BMBF, project 031A535A] to T.S.; and the Wellcome Trust [208391/Z/17/Z] to Y.P.R.

Conflict of Interest: none declared.

Data availability:

We here explored proteomics datasets PXD005946, PXD019263, PXD004452 and PXD014145, which are from the public domain PRIDE database, at https://www.ebi.ac.uk/pride/. Further data underlying this article are available in its online supplementary material.

References

1. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics.

Authors: Rui M M Branca; Lukas M Orre; Henrik J Johansson; Viktor Granholm; Mikael Huss; Åsa Pérez-Bercoff; Jenny Forshed; Lukas Käll; Janne Lehtiö
Journal: Nat Methods Date: 2013-11-17 Impact factor: 28.547

2. CanProVar 2.0: An Updated Database of Human Cancer Proteome Variation.

Authors: Menghuan Zhang; Bo Wang; Jia Xu; Xiaojing Wang; Lu Xie; Bing Zhang; Yixue Li; Jing Li
Journal: J Proteome Res Date: 2016-12-15 Impact factor: 4.466

3. An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer.

Authors: Kelly V Ruggles; Zuojian Tang; Xuya Wang; Himanshu Grover; Manor Askenazi; Jennifer Teubl; Song Cao; Michael D McLellan; Karl R Clauser; David L Tabb; Philipp Mertins; Robbert Slebos; Petra Erdmann-Gilmore; Shunqiang Li; Harsha P Gunawardena; Ling Xie; Tao Liu; Jian-Ying Zhou; Shisheng Sun; Katherine A Hoadley; Charles M Perou; Xian Chen; Sherri R Davies; Christopher A Maher; Christopher R Kinsinger; Karen D Rodland; Hui Zhang; Zhen Zhang; Li Ding; R Reid Townsend; Henry Rodriguez; Daniel Chan; Richard D Smith; Daniel C Liebler; Steven A Carr; Samuel Payne; Matthew J Ellis; David Fenyő
Journal: Mol Cell Proteomics Date: 2015-12-02 Impact factor: 5.911

4. Spritz: A Proteogenomic Database Engine.

Authors: Anthony J Cesnik; Rachel M Miller; Khairina Ibrahim; Lei Lu; Robert J Millikin; Michael R Shortreed; Brian L Frey; Lloyd M Smith
Journal: J Proteome Res Date: 2020-10-07 Impact factor: 4.466

5. Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow.

Authors: Yafeng Zhu; Lukas M Orre; Henrik J Johansson; Mikael Huss; Jorrit Boekel; Mattias Vesterlund; Alejandro Fernandez-Woodbridge; Rui M M Branca; Janne Lehtiö
Journal: Nat Commun Date: 2018-03-02 Impact factor: 14.919

6. The PRIDE database and related tools and resources in 2019: improving support for quantification data.

Authors: Yasset Perez-Riverol; Attila Csordas; Jingwen Bai; Manuel Bernal-Llinares; Suresh Hewapathirana; Deepti J Kundu; Avinash Inuganti; Johannes Griss; Gerhard Mayer; Martin Eisenacher; Enrique Pérez; Julian Uszkoreit; Julianus Pfeuffer; Timo Sachsenberg; Sule Yilmaz; Shivani Tiwary; Jürgen Cox; Enrique Audain; Mathias Walzer; Andrew F Jarnuczak; Tobias Ternent; Alvis Brazma; Juan Antonio Vizcaíno
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

7. ProteomeGenerator: A Framework for Comprehensive Proteomics Based on de Novo Transcriptome Assembly and High-Accuracy Peptide Mass Spectral Matching.

Authors: Paolo Cifani; Avantika Dhabaria; Zining Chen; Akihide Yoshimi; Emily Kawaler; Omar Abdel-Wahab; John T Poirier; Alex Kentsis
Journal: J Proteome Res Date: 2018-10-19 Impact factor: 4.466

8. Most non-canonical proteins uniquely populate the proteome or immunopeptidome.

Authors: Maria Virginia Ruiz Cuevas; Marie-Pierre Hardy; Jaroslav Hollý; Éric Bonneil; Chantal Durette; Mathieu Courcelles; Joël Lanoix; Caroline Côté; Louis M Staudt; Sébastien Lemieux; Pierre Thibault; Claude Perreault; Jonathan W Yewdell
Journal: Cell Rep Date: 2021-03-09 Impact factor: 9.423
