PGx: Putting Peptides to BED.

Manor Askenazi¹, Kelly V Ruggles², David Fenyö².

Abstract

Every molecular player in the cast of biology's central dogma is being sequenced and quantified with increasing ease and coverage. To bring the resulting genomic, transcriptomic, and proteomic data sets into coherence, tools must be developed that do not constrain data acquisition and analytics in any way but rather provide simple links across previously acquired data sets with minimal preprocessing and hassle. Here we present such a tool: PGx, which supports proteogenomic integration of mass spectrometry proteomics data with next-generation sequencing by mapping identified peptides onto their putative genomic coordinates.

Entities: Chemical

Keywords: proteogenomic mapping; proteogenomics; proteomics

Substances:
Neoplasm Proteins

Year: 2015 PMID: 26638927 PMCID: PMC4782174 DOI: 10.1021/acs.jproteome.5b00870

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Systems biology is premised on the ability to integrate data sets covering all aspects of cellular biochemistry. One such integrative approach is termed proteogenomics[1] and is defined as the integration of proteomic and genomic information, usually referring to the use of mass spectrometry (MS)-based proteomics to improve gene annotation. The field has taken off thanks to recent improvements in both next-generation sequencing (NGS) and proteomic methodologies. It has become feasible both in terms of cost and time to sequence the DNA and RNA of every sample set being studied by MS proteomics. Additionally, modern mass spectrometers are able to sequence peptides at such a depth of coverage that they are now becoming useful in the very identification and validation of genes (whereas historically, proteomics depended entirely on a complete predicted proteome). The integration of proteomics and genomics can therefore improve our understanding of both genomic annotation and of course the functional characterization of protein products in their biological context. As the results of proteogenomics research accumulate, be they in the form of genome annotation, splice isoform prediction, or novel protein discovery, there arises a pressing need to map and visualize all data types onto the same unified coordinate system. There currently exist many tools for the analysis and display of genomic features where the coordinate system of choice is naturally the underlying reference sequence for the organism being studied. This is true even for the most advanced and challenging forms of next generation sequencing data. It follows naturally that the ideal unified coordinate system for proteogenomics should remain genomic in nature. 
Indeed, effective tools that can map MS-based proteomics results onto genomic coordinates have recently become available (Peppy,[2] Proteogenomic Mapping Tool,[3] Pepline,[4] MS-Dictionary,[5] GappedDictionary,[6] IggyPep,[7] MSProGene,[8] ProteoAnnotator,[9] PGNexus,[10] and GalaxyP[11]); however, these tools are usually couched in a relatively involved and comprehensive pipeline (e.g., the GalaxyP pipeline consists of up to 140 steps) and typically impose a specific mass-informatic[12] workflow on the practitioner, by, for example, requiring the generation of short peptide sequence tags (PSTs) or some complex form of de novo peptide sequencing followed by a lookup against the full six-frame translation of the genomic sequence. Our experience suggests that a more common scenario involves the production, by the genomic arm of the workflow, of a (liberally) predicted proteome (containing what is assumed to be a superset of the observable proteome) so as to leverage existing PSM search engines (such as Mascot,[13] Sequest,[14] X!Tandem[15]) that require a straightforward representation of the predicted proteome (in the form of a FASTA file). We have thus identified the need for an exceedingly focused tool following the Unix tradition (“do one thing and do it well”[16]) that simply leverages the analysis done at the genomic level (represented as a BED file to accompany the FASTA file provided to the search engine), thereby enabling the efficient mapping of proteogenomics results onto the common sequence map. The coupling between the proteomic workflow and the genomic arms of the research project is minimized, which allows the proteomic analysis to proceed using standard proteomic software tools. Our solution is implemented as a Python framework called PGx, which allows for sensitive, relevant, and rapid proteogenomic data integration either at the command line or through a user-friendly web-accessible interface. 
The key distinguishing property of PGx is that it relies solely on three standard files that succinctly summarize the contribution of the three main arms of the proteogenomics effort: a BED file integrating the results of DNA and RNA sequencing, a FASTA file representing the complete predicted proteome, and a peptide list representing the results of peptide sequencing (Figure 1).

Figure 1

PGx integrates all “ome” data sets using only a BED file, FASTA file, and a peptide list as input.

Our choice of BED files as input stems from the fact that nearly every genome browser supports the visualization of this file format, including the Broad Institute’s Integrative Genomics Viewer (IGV),[17] UCSC’s Genome Browser,[18] the WashU Epigenome Browser,[19] and the Ensembl Genome Browser.[20] Additionally, many alignment and genomic tools output data in the form of a BED file, such as the RNA-Seq alignment tool TopHat,[21] which outputs a splice junction file in BED format, and the commonly used bedtools software,[22] which supports a wide range of genomic analyses using the BED format. Because of its frequent use by the genomic and transcriptomic communities, simple conversion of proteomics data to the familiar BED format allows for seamless inclusion of proteomics data in already existing genome-based tools. The FASTA file, in contrast, embodies a prior interaction between the sequencing efforts in that it is usually a key enabler in sample-specific proteomics: given the diversity of protein isoforms in different cell types and the growing affordability of next generation sequencing (NGS) technology, it has become advantageous to create sample-specific protein sequence databases for comprehensive peptide identification.
RNA-Seq and genome sequencing information can be used to create these databases, incorporating variant proteins, alternatively spliced isoforms, and novel expression, as coded within the genome and transcriptome, allowing for the identification of sample-specific peptides from the tandem MS analysis.[23−25] In the Clinical Proteomic Tumor Analysis Consortium (CPTAC) we have combined patient-specific protein databases and used PGx to map identified peptide sequences and a relative estimate of their abundance (via spectral counting, Figure 4) onto sample-specific genomic coordinates, providing easy-to-use proteogenomic integration techniques for these patient-centric studies.
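The abundance track mentioned above can be illustrated with a minimal sketch. It assumes that spectral counting means tallying the peptide-spectrum matches per peptide sequence and that each peptide has already been mapped to a single genomic interval. The function names and the track name are hypothetical and are not taken from the PGx source code.

```python
from collections import Counter


def spectral_counts(psm_peptides):
    """Count how many peptide-spectrum matches were reported per peptide."""
    return Counter(psm_peptides)


def bedgraph_lines(mapped, counts, track_name="peptide_spectral_counts"):
    """Format one bedGraph line (chrom, start, end, value) per mapped peptide.

    `mapped` is a hypothetical dict: peptide -> (chrom, start, end), with
    0-based, half-open coordinates as used by BED and bedGraph.
    """
    lines = [f'track type=bedGraph name="{track_name}"']
    ordered = sorted(mapped.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    for peptide, (chrom, start, end) in ordered:
        lines.append(f"{chrom}\t{start}\t{end}\t{counts[peptide]}")
    return lines


if __name__ == "__main__":
    psms = ["AAK", "AAK", "CCR"]  # one entry per identified spectrum
    mapped = {"AAK": ("chr1", 100, 109), "CCR": ("chr1", 50, 59)}
    print("\n".join(bedgraph_lines(mapped, spectral_counts(psms))))
```

A peptide that spans a splice junction would need one line per exon block, and overlapping peptides would have to be merged or summed before loading into browsers that expect non-overlapping bedGraph intervals. Both cases are left out of this sketch.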

Figure 4

Multi-omic integration: The quantitative peptide track is provided by a PGx bedGraph file. The view contains data from a tumor sample for (A) single nucleotide variants (SNVs) (from VCF files), (B) global and phosphoproteomic quantitative data (from PGx-derived bedGraph files), (C) RNA expression and coverage data (BAM file), (D) global and phosphoproteomic peptide mapping (from PGx-derived BED files), and (E) RNA splice junction predictions (from junction BED file).

Finally, the choice of a simple peptide list as the third PGx input file minimizes any formatting requirement imposed on the proteomic software. PGx simply re-searches the protein sequence space provided in the FASTA file, thereby maximizing the decoupling between the various tools in the proteogenomic workflow.

Implementation

PGx is an open-source project released under the MIT license and is also publicly accessible via a web-based API supporting access by researchers using a web browser (Figure 2) and programmers using HTTP-based API calls to a simple RESTful interface. It is implemented in pure Python and is therefore extremely portable and easy to customize. PGx leverages a memory-resident indexing scheme[26] to perform a very fast (essentially interactive) mapping of peptides to genomic sequence. PGx performs this mapping by using two indexes for each protein sequence database (e.g., RefSeq, Ensembl, or a sample-specific database based on RNA-Seq and whole genome sequencing or exome sequencing data). The first index is a peptide dictionary that contains all four-amino-acid peptides (4-mers) in the protein sequence database. The dictionary is designed to consider leucine and isoleucine as equivalent because they cannot be distinguished by typical mass spectrometry workflows. The dictionary is used to rapidly look up and retrieve all proteins that might contain an experimentally observed peptide based on the occurrence of its constituent 4-mers. The presence of the peptide is then validated in every candidate protein. The second index is a mapping of each protein sequence in the database onto the genome, and this index is used in the second step to map each peptide onto its genomic coordinates. PGx supports the mapping of many peptides at the same time, and the submission of a list with peptide sequences and their quantities will return a BED file (qualitative information) and a bedGraph file (quantitative information) that can be used to visualize the proteomics data using a broad range of genome browsers such as the UCSC browser (Figure 2) or IGV (Figures 3 and 4).
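The first index and the validation step can be sketched as follows. This is a minimal illustration of a 4-mer dictionary with leucine/isoleucine equivalence, not the PGx implementation itself. The function names and the in-memory dict representation of the proteome are assumptions made for the example.

```python
from collections import defaultdict

K = 4  # word length of the peptide dictionary described above


def normalize(seq):
    """Treat isoleucine (I) and leucine (L) as the same residue."""
    return seq.upper().replace("I", "L")


def build_kmer_index(proteome):
    """Map every 4-mer to the set of protein IDs that contain it.

    `proteome` is a hypothetical dict: protein ID -> amino acid sequence.
    """
    index = defaultdict(set)
    for protein_id, sequence in proteome.items():
        norm = normalize(sequence)
        for i in range(len(norm) - K + 1):
            index[norm[i:i + K]].add(protein_id)
    return index


def find_peptide(peptide, proteome, index):
    """Return (protein_id, 0-based offset) for every validated occurrence."""
    norm = normalize(peptide)
    if len(norm) < K:
        return []  # too short to be looked up through the 4-mer dictionary
    kmers = [norm[i:i + K] for i in range(len(norm) - K + 1)]
    # Candidate proteins must contain every constituent 4-mer.
    candidates = set(index.get(kmers[0], ()))
    for kmer in kmers[1:]:
        candidates &= index.get(kmer, set())
        if not candidates:
            return []
    hits = []
    for protein_id in sorted(candidates):
        # Validation: the 4-mers must occur contiguously and in order.
        target = normalize(proteome[protein_id])
        pos = target.find(norm)
        while pos != -1:
            hits.append((protein_id, pos))
            pos = target.find(norm, pos + 1)
    return hits
```

Sharing 4-mers is necessary but not sufficient for containing the peptide, which is why the final substring check is needed. The returned protein offsets are what a second, protein-to-genome index would then translate into genomic coordinates.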

Figure 2

Typical interaction with the PGx Web site: The user simply drags a file containing query peptides onto the dashed rectangle. The example text file yielding this visualization is provided on the Web site itself.

Figure 3

Example of a novel peptide resulting from intronic expression, mapped using the PGx framework. (The exact command line required to generate the final BED file is shown in the purple inset; for more details see the tutorial included with the source code.)

PGx is available for testing against the standard RefSeq build at the following Web site: http://pgx.fenyolab.org. The site simply expects a file containing peptides to be dragged and dropped onto it. The results are then available for download and visualization on the UCSC Genome Browser. (The whole process is shown in Figure 2.) A test file is made available on the site, which is the exact input used to generate the figure. While the Web site is useful in gaining an understanding of PGx’s functionality, the framework is implemented and distributed first and foremost as a collection of Python-based command line tools. In addition to the core indexing, query, and formatting scripts, the framework provides some support functionality such as the automatic downloading of genomic resources (e.g., RefSeq[27] gpff files) or the ability to query for the position of peptides that might be present only as nsSNPs[28] relative to the existing sequence base.
The complete set of scripts is hosted on GitHub, along with end-user documentation in the form of a tutorial and a test data set capable of regenerating Figure 3. In brief, custom proteomes are stored in directories containing two files called “proteome.fasta” and “proteome.bed”, referring, respectively, to the sequence space and its genomic mapping. All PGx commands take a proteome (directory path) as the first argument and a stream or named file as the second argument. As a result, the full power and succinctness of command line streaming can be leveraged, resulting in a simple one-liner capable of generating Figure 3 (purple inset).
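To illustrate how a genomic mapping such as the one stored in “proteome.bed” can place a peptide on the genome, the following sketch converts a peptide's position within a protein into genomic blocks. It assumes a 12-column BED record whose blocks cover exactly the coding exons of the protein, with translation starting at the first coding base on the feature's strand. The function names are hypothetical, and the actual PGx scripts may organize this step differently.

```python
def peptide_to_genomic_blocks(bed_fields, aa_start, aa_len):
    """Map a peptide (0-based protein offset, length) onto genomic blocks.

    `bed_fields` is one BED12 line split on tabs. Assumption: the blocks
    contain coding sequence only, so 3 nucleotides correspond to 1 residue.
    """
    chrom = bed_fields[0]
    chrom_start = int(bed_fields[1])
    strand = bed_fields[5]
    sizes = [int(x) for x in bed_fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in bed_fields[11].rstrip(",").split(",")]
    total = sum(sizes)
    nt_start, nt_end = 3 * aa_start, 3 * (aa_start + aa_len)
    if nt_end > total:
        raise ValueError("peptide extends beyond the coding sequence")
    if strand == "-":
        # On the minus strand the protein starts at the end of the last block.
        nt_start, nt_end = total - nt_end, total - nt_start
    segments = []
    consumed = 0  # coding nucleotides covered by the preceding blocks
    for size, rel_start in zip(sizes, starts):
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + size)
        if lo < hi:
            g_start = chrom_start + rel_start + (lo - consumed)
            segments.append((g_start, g_start + (hi - lo)))
        consumed += size
    return chrom, strand, segments


def to_bed12(chrom, strand, segments, name):
    """Format the genomic blocks of one peptide as a BED12 line."""
    start, end = segments[0][0], segments[-1][1]
    block_sizes = ",".join(str(e - s) for s, e in segments)
    block_starts = ",".join(str(s - start) for s, _ in segments)
    return "\t".join([chrom, str(start), str(end), name, "0", strand,
                      str(start), str(end), "0", str(len(segments)),
                      block_sizes, block_starts])


if __name__ == "__main__":
    record = "chr1\t1000\t1130\tprot1\t0\t+\t1000\t1130\t0\t2\t30,60\t0,70"
    chrom, strand, segs = peptide_to_genomic_blocks(record.split("\t"), 8, 4)
    print(to_bed12(chrom, strand, segs, "PEPTIDE"))
```

In this toy record the peptide spans the junction between the two exons, so it is reported as a single BED feature with two blocks, which genome browsers draw as two boxes joined by a thin line.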

Discussion

PGx allows for seamless integration of proteomic mapping and quantitation data into pre-existing multi-omic pipelines. Two examples of this are shown. Figure 3 demonstrates the ability of PGx to map peptides to the genome in cases where a peptide is not contained within the reference protein database. Here a novel splice junction was identified by RNA-seq within the intronic region of the mitochondrial cochaperone HSCB in a tumor sample. Using a proteogenomic-based method of peptide searching in which these novel junction sites were included in the search database,[23−25] one is able to identify the peptide “SPPSDPTDALMQLAK” corresponding to the same novel intronic expression, and subsequent PGx processing allows for the visualization of this peptide within a genomic context. Furthermore, Figure 4 demonstrates how PGx can be used to easily obtain a comprehensive visualization of genomic, transcriptomic, and proteomic data. PGx files can be directly uploaded into the IGV or UCSC genome browser to display peptide mapping (Figure 4D) or peptide quantitation (Figure 4B) alongside RNA expression and coverage data (Figure 4C), RNA-seq splice junction mapping (Figure 4E), and genomic single nucleotide variants (SNVs) (Figure 4A). In this example, data from whole genome sequencing, RNA-seq, and quantitative MS/MS of a tumor sample were mapped for the serine/threonine kinase AKT1.

Conclusions

We believe that PGx represents a useful contribution to the software toolset of any proteogenomics practitioner because it does not impose any mass-informatic workflow on the proteomics branch of the project but simply relies on three files summarizing the results of the intermediate sequencing efforts to establish full data integration: a BED file, a FASTA file, and a set of peptides. A live instance of PGx supporting peptide mapping onto the standard RefSeq build is freely available for public use at http://pgx.fenyolab.org, and the Python scripts, released under the MIT license, are hosted at https://github.com/FenyoLab/PGx.

26 in total

1. TANDEM: matching proteins with tandem mass spectra.

Authors: Robertson Craig; Ronald C Beavis
Journal: Bioinformatics Date: 2004-02-19 Impact factor: 6.937

2. The complete peptide dictionary--a meta-proteomics resource.

Authors: Manor Askenazi; Jarrod A Marto; Michal Linial
Journal: Proteomics Date: 2010-12 Impact factor: 3.984

3. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays.

Authors: Natalie E Castellana; Zhouxin Shen; Yupeng He; Justin W Walley; California Jack Cassidy; Steven P Briggs; Vineet Bafna
Journal: Mol Cell Proteomics Date: 2013-10-18 Impact factor: 5.911

4. Gapped spectral dictionaries and their applications for database searches of tandem mass spectra.

Authors: Kyowon Jeong; Sangtae Kim; Nuno Bandeira; Pavel A Pevzner
Journal: Mol Cell Proteomics Date: 2011-03-28 Impact factor: 5.911

5. Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing.

Authors: Chi Nam Ignatius Pang; Aidan P Tay; Carlos Aya; Natalie A Twine; Linda Harkness; Gene Hart-Smith; Samantha Z Chia; Zhiliang Chen; Nandan P Deshpande; Nadeem O Kaakoush; Hazel M Mitchell; Moustapha Kassem; Marc R Wilkins
Journal: J Proteome Res Date: 2013-11-12 Impact factor: 4.466

6. A bioinformatics workflow for variant peptide detection in shotgun proteomics.

Authors: Jing Li; Zengliu Su; Ze-Qiang Ma; Robbert J C Slebos; Patrick Halvey; David L Tabb; Daniel C Liebler; William Pao; Bing Zhang
Journal: Mol Cell Proteomics Date: 2011-03-09 Impact factor: 5.911

7. BEDTools: The Swiss-Army Tool for Genome Feature Analysis.

Authors: Aaron R Quinlan
Journal: Curr Protoc Bioinformatics Date: 2014-09-08

8. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.

Authors: Helga Thorvaldsdóttir; James T Robinson; Jill P Mesirov
Journal: Brief Bioinform Date: 2012-04-19 Impact factor: 11.622

9. The proteogenomic mapping tool.

Authors: William S Sanders; Nan Wang; Susan M Bridges; Brandon M Malone; Yoginder S Dandass; Fiona M McCarthy; Bindu Nanduri; Mark L Lawrence; Shane C Burgess
Journal: BMC Bioinformatics Date: 2011-04-22 Impact factor: 3.307

10. Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework.

Authors: Pratik D Jagtap; James E Johnson; Getiria Onsongo; Fredrik W Sadler; Kevin Murray; Yuanbo Wang; Gloria M Shenykman; Sricharan Bandhakavi; Lloyd M Smith; Timothy J Griffin
Journal: J Proteome Res Date: 2014-10-23 Impact factor: 4.466
