Flexible Data Analysis Pipeline for High-Confidence Proteogenomics.

Hendrik Weisser, James C Wright, Jonathan M Mudge, Petra Gutenbrunner, Jyoti S Choudhary.

Abstract

Proteogenomics leverages information derived from proteomic data to improve genome annotations. Of particular interest are "novel" peptides that provide direct evidence of protein expression for genomic regions not previously annotated as protein-coding. We present a modular, automated data analysis pipeline aimed at detecting such "novel" peptides in proteomic data sets. This pipeline implements criteria developed by proteomics and genome annotation experts for high-stringency peptide identification and filtering. Our pipeline is based on the OpenMS computational framework; it incorporates multiple database search engines for peptide identification and applies a machine-learning approach (Percolator) to post-process search results. We describe several new and improved software tools that we developed to facilitate proteogenomic analyses that enhance the wealth of tools provided by OpenMS. We demonstrate the application of our pipeline to a human testis tissue data set previously acquired for the Chromosome-Centric Human Proteome Project, which led to the addition of five new gene annotations on the human reference genome.

Keywords: bioinformatics; genome annotation; mass spectrometry; proteogenomics; testis; workflow

Year: 2016 PMID: 27786492 PMCID: PMC5703597 DOI: 10.1021/acs.jproteome.6b00765

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Proteogenomics is an expanding field of inquiry at the intersection of proteomics and genomics that has been growing in line with the advance of the omics era. Studies in this field analyze proteomic data in search of direct evidence of protein expression to help improve the annotation of protein-coding regions in genomes. In the case of mass spectrometry (MS) data, of particular interest are peptides of “unexpected” origin, i.e., peptides that do not match to a known protein sequence, because these could point to previously unrecognized coding regions. However, for genomes that are already well-characterized (particularly the human genome), novel peptides represent needles in a haystack of peptides matching known proteins. The difficulty of finding them is exacerbated by the importance of avoiding false positive hits that could lead to spurious annotations.[1] Several prerequisites are essential for a successful, high-quality proteogenomics endeavor: a suitable proteomics data set, a comprehensive database composed of both known and potential novel protein-coding sequences, and the collaboration of experts for manual genome annotation. Furthermore, a data analysis workflow is needed that reliably and sensitively identifies peptides and filters them according to rigorous criteria. Such a workflow should operate in a reproducible fashion and allow high-throughput processing to enable the analysis of large data sets. We have recently reported the detection of novel proteins in the human genome based on stringent guidelines for the processing of proteomic MS data for genome annotation efforts.[2] Here, we describe an automated processing pipeline based on these principles. Our pipeline is implemented using OpenMS 2.0[3] as an example of combining task-specific tools into a powerful data analysis workflow. The pipeline introduced here aims to identify novel peptides based on a suitable sequence database. 
However, due to the great flexibility inherent in the modular workflow approach, this “core” proteogenomics pipeline can be easily adapted to handle different requirements and extended to include additional functionality from the OpenMS toolbox, such as quantification or post-translational modification analysis. Our approach achieves the goals of modularity, flexibility, and extensibility that are shared by many previously proposed workflows.[4−7] Significantly, our pipeline directly implements the quality criteria that were developed in collaboration between proteomics experts and genome annotators and published previously.[2] Furthermore, it benefits from the strengths of OpenMS as a mature, reliable platform with an active user and developer base. OpenMS offers high performance suitable for large-scale analyses, supports all major operating systems, and provides integration with various workflow managers. Unlike some alternatives, our pipeline does not generate an amino acid sequence database from genomic data because we prefer a manually curated database that incorporates a variety of sources for proteogenomics. Moreover, our pipeline condenses the core of a proteogenomic analysis into a manageable workflow composed of 18 steps, significantly smaller and more straightforward than alternative proteogenomics workflows. In addition to presenting the proteogenomics pipeline itself, we describe here the contributions that we made in this context to the code base of the OpenMS project and to MascotPercolator.[8,9] Finally, we show the application of our pipeline to a relevant published data set, the human testis tissue data set acquired for the Chromosome-Centric Human Proteome Project (C-HPP).[10]

Materials and Methods

Proteomics Data Set

To demonstrate the capabilities of our proteogenomics pipeline, we applied it to the human testis tissue data set published by Zhang and colleagues.[10] We will refer to this as the “C-HPP testis” data set. The data set contains three biological replicates, each fractionated using two different protein separation methods (regular SDS-PAGE and tricine-SDS-PAGE), with six samples in total. Each sample gave rise to 28 (regular) and 22 (tricine) fractions, respectively, which were digested using trypsin and analyzed on an LTQ Orbitrap Velos (Thermo Scientific) mass spectrometer, acquiring fragment ion (MS2) spectra using CID activation. We downloaded the 150 RAW files from these liquid chromatography–tandem mass spectrometry (LC–MS/MS) runs from the PRIDE repository[11] (accession PXD002179) and converted them to mzML format, extracting only the MS2 scans, using the “msconvert” program from the ProteoWizard software suite[12] (version 3.0.8789). A single RAW file was renamed to fit the naming scheme of the remaining files (“CHPP_Testis_tricine_1010.raw” to “CHPP_TESTIS_Tricine_1010.raw”). Another file, “CHPP_SDS_3003.raw”, turned out to be corrupt, so we downloaded the corresponding MGF file from PRIDE and converted that to mzML. The mzML files were then used as inputs for our OpenMS pipeline.

Sequence Database

The amino acid sequence database against which fragment ion (MS2) spectra are matched is critical for comprehensive identification of peptides. A sequence database for use with our pipeline should consist of four parts: (1) known protein-coding sequences; (2) sequences of common contaminants; (3) prospective protein sequences currently unannotated or presumed noncoding that could potentially contain unrecognized coding regions (e.g., lncRNA transcripts, RNA-Seq models, predicted transcripts, etc.); and (4) decoy sequences. The four parts are concatenated into a single sequence database that is used for database searching of MS2 spectra; parts 1–3 are also used individually to filter the downstream search results. When analyzing data with the aim of finding “novel” peptides, we are interested in high-confidence matches to peptides that are in part 3 of the database but are not also in parts 1 or 2. For the analysis of the C-HPP testis data, we used the sequence database from Wright et al.[2] This database consists of human translated amino acid sequences from the following sources. For part 1, it contains the translated CDS from GENCODE[13] v20 and the UniProt[14] reference proteome. For part 2, it contains the collection of common contaminant proteins from the Max Planck Institute of Biochemistry, together with sequences of the major histocompatibility complex from the IPD-IMGT/HLA database.[15] For part 3, it contains noncoding sequences from GENCODE v20 (pseudogenes, lncRNA, 5′ UTR), gene predictions from AUGUSTUS,[16] pseudogene predictions from Pseudogene.org,[17] and three-frame translated transcripts from three different large-scale RNA-Seq experiments.[13,18,19] For part 4, it contains randomized decoy sequences generated from parts 1–3 using the Mimic software (https://github.com/percolator/mimic). To simplify dealing with the isobaric amino acids leucine and isoleucine, all occurrences of “I” in the sequences were replaced by “L”. 
In total, the database contains 8 406 627 entries; the FASTA file takes up 1.1 GB of memory (including accessions).
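The isoleucine-to-leucine replacement described above can be illustrated with a minimal Python sketch. The function names `read_fasta` and `normalize_isobaric` are hypothetical and serve only as an illustration; they are not part of OpenMS or of the tooling used to build the published database.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize_isobaric(sequence: str) -> str:
    """Replace every isoleucine (I) with leucine (L).

    The two residues have identical mass, so collapsing them avoids
    treating I/L variants of a peptide as different sequences.
    """
    return sequence.upper().replace("I", "L")
```

The same normalization has to be applied to every part of the database (and to any peptide sequence compared against it) so that later string matching stays consistent.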

Software

Our data analysis pipeline leverages many existing software tools. We used TOPP tools[20] from the OpenMS framework (version 2.0.1; http://openms.org) for most data-processing steps, although in some areas we extended the functionality offered by OpenMS (see below). Importantly, OpenMS provides an easy-to-use graphical interface for designing and running TOPP-based data analysis workflows, called TOPPAS.[21] It also offers a mechanism for wrapping non-TOPP command line applications so that they can be included in TOPPAS workflows (“GenericWrapper”). Central to any proteomics analysis pipeline is the identification of peptides from the MS2 spectra, for which we used the database search engines Mascot[22] (version 2.5.1, Matrix Science) and MS-GF+[23] (version 10089). The following parameters were used with both engines: 10 ppm precursor mass tolerance, 0.5 Da fragment mass tolerance; trypsin cleavage with full specificity, allowing two missed cleavages; fixed modification: carbamidomethylation of cysteine; variable modifications: oxidation of methionine, deamidation of asparagine and glutamine, N-terminal acetylation, and conversion to pyroglutamic acid of N-terminal glutamine and glutamic acid. We applied Percolator[24] (revision “273ff55” from https://github.com/percolator/percolator) for statistical evaluation and rescoring of the search results (in the Mascot case via an adapted MascotPercolator,[8,9] version 2.16; http://www.sanger.ac.uk/science/tools/mascotpercolator). To summarize the results of our pipeline and visualize the data, we used the R software environment for statistical computing (version 3.1.2).[25]
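To make the precursor tolerance concrete: a tolerance given in parts per million scales with the mass being matched. The short Python sketch below computes the accepted window; the helper name `ppm_window` is hypothetical and not taken from any of the tools named above.

```python
from typing import Tuple


def ppm_window(mass: float, tolerance_ppm: float) -> Tuple[float, float]:
    """Return the (lower, upper) mass bounds for a relative ppm tolerance."""
    delta = mass * tolerance_ppm / 1e6
    return mass - delta, mass + delta
```

For a precursor mass of 1000 Da, a 10 ppm tolerance corresponds to a window of ±0.01 Da, which is far narrower than the absolute 0.5 Da tolerance applied to fragment ions.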

Results

Software Contributions

We were able to rely on existing applications for the construction of our proteogenomic pipeline and for many of the data processing tasks; however, for several tasks, we developed new tools or refined and extended existing ones. We added wrappers for the search engine MS-GF+[23] and for the postprocessing tool Percolator[24] to OpenMS. In addition, we adapted several existing TOPP tools and underlying OpenMS library classes, tailoring them for use in an integrated pipeline and for the needs of proteogenomic data analysis. We not only added missing functionality but in several cases also improved the quality (and thus reliability and maintainability) of the source code.

Adapter for the MS-GF+ Search Engine

OpenMS includes TOPP tools that serve as adapters to several widely used database search engines (e.g., Mascot, OMSSA, and X! Tandem). We developed a TOPP tool called MSGFPlusAdapter that wraps the newer search engine MS-GF+. The adapter takes as input the location of the MS-GF+ Java package, a spectral data file (mzML), a sequence database file (FASTA), and a number of search parameters, including lists of desired fixed and variable modifications. It creates a temporary file containing the user-defined modifications in the format required by MS-GF+. A Java process runs the MS-GF+ program, supplying search parameters and input files. MS-GF+ performs the search and writes results into an mzIdentML file. Because OpenMS internally still uses its own format, idXML, for peptide and protein identification (ID) data, the adapter can optionally convert the mzIdentML file to idXML once the search is complete. This conversion was initially implemented via an intermediary step in which a tabular text file is generated from the mzIdentML file using a function of MS-GF+. Information from this file is then used to fill internal data structures and, from there, written out to idXML. Support for the mzIdentML format in OpenMS has improved in the meantime, so it is now also possible to convert the mzIdentML file to idXML directly. Irrespective of the method used for the conversion, the adapter needs to look up retention time (RT) values for peptide IDs in the spectral data. This information is missing from the mzIdentML file generated by MS-GF+, but it is vital for many downstream applications and analyses in OpenMS. To support this, we developed two related classes for the OpenMS C++ library, SpectrumLookup and SpectrumMetaDataLookup, to handle the recurring task of looking up spectra and their associated information based on different kinds of spectral references (e.g., scan numbers). These classes have now been broadly adopted in OpenMS.

Percolator Wrappers

Percolator is a tool for postprocessing peptide–spectrum matches (PSMs) from sequence database searches of LC–MS/MS data.[24] It can improve the number of confident PSMs recovered from the data by applying a semisupervised machine learning approach to distinguish correct from incorrect PSMs. Percolator also calculates statistically meaningful scores for PSMs, i.e., q values (a measure of the false discovery rate, FDR) and posterior error probabilities (PEP).[26]
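The relationship between decoy hits, FDR, and q values can be sketched as follows. This is a simplified textbook target–decoy estimate written in Python, not Percolator's actual procedure (which additionally rescores PSMs and uses more refined statistics); the function name `q_values` and the input layout are assumptions made for illustration.

```python
from typing import List, Tuple


def q_values(psms: List[Tuple[float, bool]]) -> List[Tuple[float, bool, float]]:
    """Assign q values to PSMs given as (score, is_decoy); higher score is better.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Estimated FDR when accepting every PSM down to the current rank.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q value is the smallest FDR at which a PSM would still be accepted,
    # i.e., the minimum FDR at its own rank or any less strict rank.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]
```

Taking the running minimum makes q values monotonic in the score, so a single cutoff (e.g., q ≤ 0.01) always selects a contiguous top segment of the ranked list.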

Percolator for MS-GF+

To run Percolator on MS-GF+ search results, we created two wrappers using the GenericWrapper mechanism provided by OpenMS: one for the “msgf2pin” program packaged with Percolator, which converts mzIdentML files generated by MS-GF+ to Percolator’s input format,[27] and one for the Percolator executable itself. In addition, we implemented a parser for the output files produced by Percolator. Percolator can generate results on the levels of PSMs, peptides, or proteins. We implemented support for PSM-level output in the C++ class PercolatorOutfile. Our changes enable the reading of the corresponding Percolator result files and their conversion to idXML using the TOPP tool IDFileConverter. Internally, the class SpectrumMetaDataLookup is used to annotate peptide IDs with RT values from original spectra.

MascotPercolator

For applying Percolator to Mascot search results, the program MascotPercolator provides a convenient solution that operates directly on the “raw” Mascot search results (.dat files). We adapted MascotPercolator to make it interoperable with OpenMS. To this end, we added the ability to read an idXML file produced by the Mascot search adapter (MascotAdapterOnline) and to extract the ID number of the Mascot search, which MascotPercolator uses to find the corresponding .dat file and perform its analysis. We also modified the format of the input file submitted by MascotPercolator to Percolator itself to ensure that the output would be fully compatible with OpenMS. Specifically, this meant (a) adding annotations of post-translational modifications to the peptide sequences and (b) generating ID strings for the PSMs in a format that would facilitate the lookup of meta information such as retention times, precursor mass-to-charge values, and charge states. With these changes, MascotPercolator could be wrapped using the GenericWrapper approach and thus integrated into our analysis pipeline.

Changes to Existing OpenMS Tools

MascotAdapterOnline

MascotAdapterOnline is the TOPP tool that facilitates running a database search on a remote Mascot server. We contributed a small usability improvement to this adapter by using the name of the input spectra file (mzML) to set the title of the Mascot search, which makes it easier to identify search runs submitted by OpenMS in the Mascot search log. In addition, we made adaptations geared toward interfacing MascotAdapterOnline with MascotPercolator. First, MascotPercolator relies on the Mascot search number to find the .dat file to process, so this number needed to be extracted from the Mascot server’s response to a search query and written to the output file as a metadata entry. Second, because MascotPercolator operates directly on the Mascot .dat file, we do not need to retrieve the search results from MascotAdapterOnline. We thus added a flag to the adapter that allows us to skip the lengthy export process of the search results and to write an essentially empty idXML file containing only the Mascot search number.

ConsensusID

Analyzing a sufficiently large set of MS2 spectra with different peptide and protein identification (ID) engines, including database search engines, generally gives partially different, often complementary results.[28] To take advantage of this, OpenMS provides the ConsensusID tool, which combines search results on the PSM level from different ID engines. Given ranked lists of PSMs produced by different engines, several algorithms are available to merge and rescore PSMs derived from the same spectrum: similarity scoring based on sequence or fragmentation pattern similarity,[29] ranked voting, or simply using the average or the best score for each peptide. There were, however, problems with the implementations of these algorithms; hence, we rewrote large parts of them, refactored the code into a class hierarchy, and increased the test coverage. These improvements made it easy for us to add new features to ConsensusID. First, a filter allowing the user to specify a minimum fraction of the involved ID engines (e.g., “two out of three”) that must have identified a peptide. Second, a new conservative rescoring algorithm, which assigns to each peptide hit the worst score that it has received from any of the ID engines.
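The two additions to ConsensusID, the minimum-support filter and the conservative “worst score” rescoring, can be expressed in a few lines. The Python sketch below is an illustration only and does not reproduce the OpenMS C++ implementation; the function name `consensus_worst` and the dictionary-based input are hypothetical, and scores are assumed to be error probabilities (lower is better).

```python
from typing import Dict, List


def consensus_worst(engine_hits: List[Dict[str, float]],
                    min_support: float = 1.0) -> Dict[str, float]:
    """Combine the hits of several ID engines for a single spectrum.

    engine_hits: one dict per engine, mapping peptide sequence -> score,
                 where a lower score is better (e.g., a PEP).
    min_support: minimum fraction of engines that must report a peptide.
    Returns peptide -> worst (highest) score among the engines reporting it.
    """
    n_engines = len(engine_hits)
    if n_engines == 0:
        return {}

    collected: Dict[str, List[float]] = {}
    for hits in engine_hits:
        for peptide, score in hits.items():
            collected.setdefault(peptide, []).append(score)

    return {
        peptide: max(scores)  # conservative: keep the worst score
        for peptide, scores in collected.items()
        if len(scores) / n_engines >= min_support
    }
```

With two engines and `min_support=1.0`, only peptides reported by both engines survive, and each carries the less favorable of its two scores.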

IDFilter

The IDFilter TOPP tool offers a plethora of options for filtering peptide and protein identification data. We added some additional options to complete our proteogenomics pipeline. In preparation for that, we refactored the ID filtering code in the OpenMS library to provide a more-consistent interface, more-descriptive function names, and a cleaner, often more-efficient implementation based on functional programming principles. We realized that most of the filtering options involved checking elements of a list and either removing or keeping those that fulfilled a criterion. Our code is thus built on two aspects. First, simple predicates implemented as functors, e.g., to check if a quality score is above a given threshold. Second, two generic, higher-order functions that filter C++ vectors (e.g., containing identified peptides or proteins) using the erase-remove idiom to either keep or remove elements that match a given predicate. Using these building blocks, we added new filtering options required in our pipeline to the library and exposed them in the IDFilter tool: First, a filter that takes a set of post-translational modifications and removes all peptide hits featuring any of those modifications; this is later used to remove deamidated PSMs (see the Proteogenomics Pipeline section below). Second, a filter that removes all peptide and protein hits matching accessions in a given FASTA file; this functionality is needed for protein-level filtering in our analysis pipeline.
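The design described above, small predicate objects combined with two generic keep/remove functions, carries over to other languages. The following Python sketch mirrors the idea; it is an analogy to the C++ functors and erase-remove idiom rather than the OpenMS code itself, and all names (`keep_matching`, `remove_matching`, `ScoreAtMost`, `HasModification`) as well as the dictionary layout of a hit are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def keep_matching(items: List[T], predicate: Callable[[T], bool]) -> None:
    """Keep only the elements for which the predicate holds (in place)."""
    items[:] = [x for x in items if predicate(x)]


def remove_matching(items: List[T], predicate: Callable[[T], bool]) -> None:
    """Remove the elements for which the predicate holds (in place)."""
    items[:] = [x for x in items if not predicate(x)]


class ScoreAtMost:
    """Predicate: the hit's score does not exceed a threshold (lower is better)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def __call__(self, hit: dict) -> bool:
        return hit["score"] <= self.threshold


class HasModification:
    """Predicate: the hit carries at least one of the given modifications."""

    def __init__(self, modifications: Iterable[str]) -> None:
        self.modifications = set(modifications)

    def __call__(self, hit: dict) -> bool:
        return bool(self.modifications & set(hit["mods"]))
```

A PEP cutoff followed by removal of deamidated hits then reads `keep_matching(hits, ScoreAtMost(0.01))` and `remove_matching(hits, HasModification(["Deamidated"]))`; new filters only require a new predicate.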

PeptideIndexer

To update the protein references for a set of peptide IDs, the TOPP tool PeptideIndexer can be used. Given an idXML file with peptide IDs and a FASTA file containing amino acid sequences, PeptideIndexer matches the peptide sequences to the database sequences and annotates the peptides with corresponding accessions. Typically, this indexing works in two passes. First, exact string matching using the Aho–Corasick algorithm[30] is used to quickly find matches for the majority of peptides. Second, if any peptides remain unmatched, an error-tolerant search using suffix arrays is performed. This step can recover matches to database sequences that contain ambiguity codes for sets of amino acids (“B”: D or N; “Z”: E or Q; and “X”: any), but it may take a long time depending on the number of sequences involved. In our proteogenomics pipeline, we index identified peptides against parts of the full sequence database, in which case we expect that some peptides will not match. We had to make a small change to the PeptideIndexer code to be able to skip the error-tolerant search in these cases, which allowed us to reduce the runtime of the pipeline by several hours per input file.
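The two-pass logic, including the option to skip the tolerant pass, can be outlined as below. This Python sketch is deliberately naive: it uses plain substring search in place of the Aho–Corasick algorithm and a sliding-window comparison in place of suffix arrays, so it illustrates the behavior but not the performance of PeptideIndexer. The function names and the `tolerant` switch are hypothetical.

```python
from typing import Dict, List

# Ambiguity codes that may occur in database sequences.
AMBIGUOUS = {"B": {"D", "N"}, "Z": {"E", "Q"}}


def residue_matches(db_residue: str, peptide_residue: str) -> bool:
    """True if a database residue (possibly ambiguous) is compatible with a peptide residue."""
    if db_residue == peptide_residue or db_residue == "X":
        return True
    return peptide_residue in AMBIGUOUS.get(db_residue, set())


def tolerant_find(peptide: str, protein: str) -> bool:
    """Sliding-window search that honors the ambiguity codes B, Z, and X."""
    n = len(peptide)
    return any(
        all(residue_matches(protein[start + j], peptide[j]) for j in range(n))
        for start in range(len(protein) - n + 1)
    )


def index_peptides(peptides: List[str], database: Dict[str, str],
                   tolerant: bool = True) -> Dict[str, List[str]]:
    """Map each peptide to the accessions of the database sequences containing it."""
    result = {}
    for peptide in peptides:
        # Pass 1: fast exact matching.
        hits = [acc for acc, seq in database.items() if peptide in seq]
        # Pass 2: slower tolerant matching, only for peptides still unmatched.
        if not hits and tolerant:
            hits = [acc for acc, seq in database.items() if tolerant_find(peptide, seq)]
        result[peptide] = hits
    return result
```

Calling `index_peptides(..., tolerant=False)` corresponds to the change described above: when unmatched peptides are expected, the expensive second pass is simply not attempted.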

Additional Tools Not Used in the “Core” Pipeline

MzMLSplitter

There are limits to the size of raw data files suitable for use in our pipeline. On the lower end, they should contain at least several thousand PSMs to allow reliable Percolator training. On the higher end, input files for MascotAdapterOnline should not be much larger than 1 GB. Small raw files can be merged into larger files using the TOPP tool FileMerger; we added a utility called MzMLSplitter for the opposite operation: splitting a large mzML file into multiple equally sized parts. In practice, when dealing with a set of raw data files of widely varying sizes from a fractionation experiment, a useful approach may be to merge all files from one sample using FileMerger and then split the result into manageable 1 GB parts using MzMLSplitter.
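The merge-then-split approach amounts to two small calculations: how many parts are needed to stay under a size limit, and how to distribute the spectra evenly across them. The Python sketch below shows both; it assumes that splitting is done by spectrum count, which is an illustrative simplification, and the helper names are hypothetical rather than MzMLSplitter options.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def parts_needed(total_bytes: int, max_bytes: int) -> int:
    """Smallest number of parts such that no part exceeds max_bytes (ceiling division)."""
    return max(1, -(-total_bytes // max_bytes))


def split_evenly(items: Sequence[T], n_parts: int) -> List[Sequence[T]]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # first `extra` parts get one more
        parts.append(items[start:start + size])
        start += size
    return parts
```

For example, a merged file of 2.5 GB with a 1 GB limit would be divided into three parts.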

IDScoreSwitcher

One important limitation of the idXML format that OpenMS uses to store peptide and protein identification data is that only one primary score statistic can be associated with each peptide or protein hit. Secondary scores can be stored as metadata, but only the primary score is used for ranking and filtering PSMs. Usually only one final, meaningful score for peptides or proteins is required. An exception to this is Percolator, which calculates FDRs (q values) and PEPs, which can be used in conjunction for filtering, e.g., using the common cutoff combination of 1% FDR and 5% PEP. To better support multiple peptide and protein scores in OpenMS, we added a tool called IDScoreSwitcher, which facilitates switching between secondary scores and the primary score. Filtering by two different score types is possible by running IDFilter twice, applying IDScoreSwitcher between the filters.
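The score-switching idea can be pictured with a small data model in which each hit has one primary score and a dictionary of secondary scores. The Python sketch below is an assumption-laden illustration: the hit layout, the function name `switch_score`, and the choice to preserve the old primary score as metadata are hypothetical and not a description of the idXML format or of IDScoreSwitcher's exact behavior.

```python
def switch_score(hit: dict, new_score_name: str) -> dict:
    """Promote a secondary score to primary and demote the old primary score.

    hit: {"sequence": ..., "score_type": ..., "score": ..., "meta": {name: value}}
    Returns a new hit; the input is left unchanged.
    """
    meta = dict(hit["meta"])
    new_score = meta.pop(new_score_name)    # raises KeyError if the score is absent
    meta[hit["score_type"]] = hit["score"]  # keep the old primary score as metadata
    return {**hit, "score_type": new_score_name, "score": new_score, "meta": meta}
```

Filtering by two criteria then becomes: filter on the primary score, call `switch_score`, and filter again on the new primary score.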

FidoAdapter

Fido is a protein inference engine that uses a Bayesian probabilistic model to group and score proteins based on PSMs.[31] It is freely available under an open-source license. Because OpenMS was lacking a protein inference tool, we decided to add support for Fido via an adapter. To this end, the Fido source code was adapted to work on all platforms that OpenMS supports; the patched version is available on GitHub (https://github.com/hendrikweisser/Fido). The OpenMS adapter we developed, FidoAdapter, receives peptide identification results in idXML format as input. The scores of the peptide hits must be probabilities and the referenced protein hits must be annotated with target and decoy information. This enables the adapter to generate suitable input files for Fido. These files, together with any user-specified parameters, are used to run the Fido executable (typically “FidoChooseParameter”, which includes parameter optimization). The adapter parses the Fido output file containing inferred protein groups and their posterior probabilities, adding the results to the original ID data. The result can be used, for example, as an auxiliary input for the ProteinQuantifier TOPP tool, enabling quantification of protein groups.

Proteogenomics Pipeline

Integrating existing software with tools that we adapted or developed specifically for this purpose, we have designed a pipeline for proteogenomic data analysis. An overview of our approach is shown in Figure 1. The goal of our pipeline is to confidently identify potential “novel” peptides in LC–MS/MS data. Based on conclusions from previous work,[2] we define the following list of criteria for peptides and for the PSMs from which they are derived to be designated as “novel”. (1) The peptide must be 7–30 amino acids long. (2) It must be fully tryptic, with no more than two missed cleavages. (3) It must be identified (a) as the joint top hit of two different search engines, (b) with a PEP of 1% or better in both search engine results, and (c) without any deamidation modifications. (4) The peptide must not match to a contaminant and must differ by more than two amino acids from any known protein (to exclude potential matches to known genes with variants and mutations).
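Criteria (1) and (2) depend only on the peptide sequence and its flanking residues and can be checked with simple string logic. The Python sketch below assumes the common trypsin rule (cleavage after K or R, but not before P); this rule, the use of "-" to mark a protein terminus, and all function names are assumptions for illustration. In the pipeline itself, cleavage specificity is enforced by the search engine settings.

```python
CLEAVAGE_RESIDUES = {"K", "R"}


def missed_cleavages(peptide: str) -> int:
    """Count internal K/R residues that are not followed by P."""
    return sum(
        1
        for i in range(len(peptide) - 1)
        if peptide[i] in CLEAVAGE_RESIDUES and peptide[i + 1] != "P"
    )


def is_fully_tryptic(peptide: str, prev_residue: str, next_residue: str) -> bool:
    """Check both termini; "-" denotes a protein terminus."""
    n_term_ok = prev_residue == "-" or (
        prev_residue in CLEAVAGE_RESIDUES and peptide[0] != "P"
    )
    c_term_ok = next_residue == "-" or (
        peptide[-1] in CLEAVAGE_RESIDUES and next_residue != "P"
    )
    return n_term_ok and c_term_ok


def passes_sequence_criteria(peptide: str, prev_residue: str, next_residue: str,
                             min_len: int = 7, max_len: int = 30,
                             max_missed: int = 2) -> bool:
    """Criteria (1) and (2): length 7-30, fully tryptic, at most two missed cleavages."""
    return (
        min_len <= len(peptide) <= max_len
        and is_fully_tryptic(peptide, prev_residue, next_residue)
        and missed_cleavages(peptide) <= max_missed
    )
```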

Figure 1

Schematic overview of the OpenMS proteogenomics workflow. Based on a comprehensive sequence database, tandem mass spectra from large proteomic data sets are searched in a competitive target–decoy approach using two search engines, Mascot and MS-GF+. The search results are rescored using Percolator and filtered in multiple stages according to stringent quality criteria. During this process, starting from a large number of spectra and initial PSMs, the set of retained PSMs is refined further and further until in the end, only high-confidence PSMs from novel peptides remain. These are exported and passed on to genome annotators. In a manual review process, novel peptides and other sources of evidence are integrated, in some cases yielding new insights in the form of novel genome annotations.

Figure 2 shows the TOPPAS workflow for our pipeline, which is available at http://openms.org/workflow/proteogenomics. The pipeline works in several stages, which we describe in more detail below: (1) database searching of MS2 spectra; (2) rescoring of search results; (3) filtering on the PSM level; (4) filtering on the protein level; and (5) export of results. Important parameters of the pipeline are listed in Table S1.

Figure 2

Proteogenomics pipeline, as displayed in the TOPPAS workflow editor. The different stages of the pipeline are indicated using colored boxes. Additional output nodes, which would be used in practice to capture intermediate results at different stages, have been omitted for simplicity. The input file nodes 1–5 contain the following data: 1, MS2 spectra (mzML files); 2, combined target–decoy sequences (FASTA); 3, contaminant sequences (FASTA); 4, known protein sequences (FASTA); and 5, presumed noncoding sequences (FASTA).

Stage 1: Database Searching of MS2 Spectra

At the start of the pipeline, MS2 spectra from experimental data, stored in mzML files, are searched against a combined sequence database in a competitive target–decoy approach. A pair of search engines is used for this purpose, Mascot and MS-GF+, which we chose because of their good performance, especially in combination with Percolator. Including additional search engines in the pipeline would not be difficult; in particular, both an OpenMS adapter and a Percolator converter already exist for X! Tandem.[32] Input spectra may have to be centroided, as is required by MS-GF+; for high-resolution spectra in profile mode, centroiding can be performed using the TOPP tool PeakPickerHiRes. The results of this stage are peptide–spectrum matches with associated scores, produced by each search engine.

Stage 2: Rescoring of Search Results

The second stage applies Percolator to the PSMs from stage 1 to enrich for correct matches and obtain statistically meaningful scores for filtering in the next stage. Percolator outputs are converted to the idXML format using IDFileConverter. In the MS-GF+ branch, the corresponding spectral data files are required to associate retention time and precursor mass-to-charge values to PSMs. In the Mascot branch, this is not necessary, as these values are included in the ID column in the Percolator output. Note that Percolator produces separate outputs for target and decoy PSMs. In the present pipeline, only target hits are utilized. However, to enable the calculation of overall false discovery rates, decoy hits could be retrieved via the “ETool:out_decoy” parameters of the GenericWrapper (Percolator/MascotPercolator) nodes, converted with IDFileConverter and merged with the target hits.

Stage 3: Filtering on the PSM Level

In this stage, the rescored PSMs from Mascot and MS-GF+ are merged and filtered according to our stringent criteria. Using the IDFilter tool, PSMs are filtered by peptide sequence length (7–30 amino acids), PEP score (0.01 or better), and modifications (no deamidation). Deamidated PSMs are removed because they have previously been found to be overrepresented among potentially novel peptides[2] and are hence considered unreliable. Finally, the ConsensusID tool is applied to group Mascot and MS-GF+ search hits pertaining to the same spectrum and to filter all cases in which the two search engines did not arrive at the same significant top hit.

Stage 4: Filtering on the Protein Level

During stage 4, PSMs are further filtered based on which proteins (or, more correctly, which entries in the sequence database) match their peptide sequences. The PeptideIndexer tool is repeatedly applied to find matches in each part of the database, followed by IDFilter to remove matching peptides. Initially, all peptides matching contaminant (including HLA) sequences are discarded. Next, peptides matching known proteins are removed, allowing up to two amino acid differences in a peptide–protein match to account for possible unknown variants in the proteins of the biological sample. Because this step involves a large part of the sequence database and the required approximate matching is computationally expensive, it proceeds in two phases. First, exact matches are found between the peptide and protein sequences, and corresponding peptides are removed. This is relatively fast and excludes the vast majority of identified peptides from further consideration. Second, an approximate search is performed for the remaining peptides, removing any additional matches. The resultant peptide hits match only to the “presumed noncoding” part of the sequence database. To ascertain to which sequences they match, PeptideIndexer is used again, this time indexing against the “presumed noncoding” part of the database. (If decoy hits were included in the analysis, they could be removed using IDFilter.)
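The two-phase removal of peptides matching known proteins can be sketched as follows. The Python code treats “amino acid differences” as substitutions only (no insertions or deletions) and uses a brute-force sliding window, both of which are simplifying assumptions; the function names are hypothetical and the sketch says nothing about how PeptideIndexer implements the search.

```python
from typing import List


def matches_within(peptide: str, protein: str, max_mismatches: int = 2) -> bool:
    """True if the peptide aligns somewhere in the protein with at most
    max_mismatches substitutions (no gaps)."""
    n = len(peptide)
    for start in range(len(protein) - n + 1):
        mismatches = 0
        for a, b in zip(peptide, protein[start:start + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences at this position
        else:
            return True  # window finished without exceeding the limit
    return False


def filter_known(peptides: List[str], known_proteins: List[str],
                 max_mismatches: int = 2) -> List[str]:
    """Remove peptides that match a known protein, exactly or approximately."""
    # Phase 1: cheap exact matching removes the bulk of the peptides.
    remaining = [
        p for p in peptides
        if not any(p in protein for protein in known_proteins)
    ]
    # Phase 2: expensive approximate matching on the few peptides left.
    return [
        p for p in remaining
        if not any(matches_within(p, protein, max_mismatches)
                   for protein in known_proteins)
    ]
```

Running the exact phase first does not change the result, since every exact match is also an approximate match; it only reduces the number of peptides that reach the costly second phase.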

Stage 5: Export of Results

At this point in the pipeline, the PSMs of potentially novel peptides, annotated with accessions from the sequence database, are available in idXML format (one file per input mzML file). The final step of the pipeline applies the TextExporter tool to convert the idXML files into tabular text files (.csv) for further analysis in external tools. It is also possible to convert the idXML files from any stage of the pipeline into the HUPO PSI standard formats mzIdentML[33] and mzTab[34] using the TOPP tools IDFileConverter and MzTabExporter, respectively.

Testis Data Analysis

We demonstrate the effectiveness of our proteogenomics pipeline on the analysis of the human testis data set generated for the Chromosome-Centric Human Proteome Project by the Liu lab (“C-HPP testis data set”).[10] The original analysis of this data by Zhang et al. had focused on the detection of “missing proteins”, i.e., known protein-coding genes for which no direct protein evidence had yet been found. In contrast, our analysis focuses on a disjoint set of protein evidence, “novel” peptides that uniquely map to genomic regions not previously known to be protein-coding genes. To find such peptides, we converted the 150 RAW files containing the LC–MS/MS data to mzML format and used the mzML files as inputs for the TOPPAS workflow shown in Figure 2. The resulting data is available from the PRIDE repository under accession PXD004785.

Runtime Considerations

In practice, we did not run the complete workflow at once but rather performed the Mascot and MS-GF+ searches as separate steps, independent of each other. This was done for efficiency reasons, as the performance characteristics of the two search engines differ. On the C-HPP testis data set, MascotAdapterOnline took between 6 and 32 min (an average of 21 min) per file to run Mascot searches. MSGFPlusAdapter took between 1 and 3 h (an average of 2 h) per file for MS-GF+ searches using four parallel threads. However, the number of Mascot searches that can be run in parallel is limited by the number of Mascot licenses (there is no comparable limit for MS-GF+ searches, provided that adequate computational resources are available). After the database searches, the runtime of the remaining pipeline was between 6 and 18 min per file (an average of 11 min) in our analysis.

Data Processing Summary

A summary of the analysis is presented in Figure , which shows the numbers of peptide–spectrum matches generated and retained at each stage of the pipeline. The data set contains 4.2 million MS2 spectra, almost all of which can be assigned peptide sequences by Mascot and MS-GF+. After filtering on the PSM level (1% PEP, peptide length 7–30, no deamidation), we are left with 1.6 million Mascot PSMs (37% of all MS2 spectra) and one million MS-GF+ PSMs (24% of spectra) with a false discovery rate (FDR) of 0.12%. The ConsensusID step further reduces the number of PSMs to 940 000, for which Mascot and MS-GF+ agree in their assignments of the best hit at sufficiently high confidence. The 623 000 and 96 000 PSMs identified exclusively by Mascot and MS-GF+, respectively, are thus removed; at the same time, the fraction of decoy hits decreases, yielding an estimated FDR of 0.006% for the ConsensusID results. Subsequent filtering, which removes matches to contaminants and HLA, reduces the overall number of PSMs to 880 000. However, almost all of these PSMs (more than 99.9%) are exact matches to known proteins and are thus removed in the next filtering step. Of the remaining 455 PSMs matching noncoding sequences, only 210 pass the final filter, which matches approximately against known protein sequences, allowing up to two amino acid differences per peptide. These 210 PSMs are the result of applying our proteogenomic pipeline to the C-HPP testis data set and contain 47 nonredundant “novel” peptide sequences.
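The PSM-level criteria named above (1% PEP, peptide length 7–30, no deamidation) can be written as a single predicate. This is a minimal sketch and not the pipeline's own filtering code: the `Psm` record, its field names, the reading of "1% PEP" as PEP ≤ 0.01, and the convention of writing modifications in parentheses (e.g., "N(Deamidated)") are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Psm:
    sequence: str  # peptide sequence, modifications in parentheses (assumed notation)
    pep: float     # posterior error probability of the match


def plain_sequence(sequence):
    """Strip parenthesised modification names to obtain the bare residues."""
    return re.sub(r"\([^)]*\)", "", sequence)


def passes_psm_filter(psm, max_pep=0.01, min_len=7, max_len=30):
    """Apply the three PSM-level criteria: PEP, deamidation, peptide length."""
    if psm.pep > max_pep:
        return False
    if "(Deamidated)" in psm.sequence:
        return False
    return min_len <= len(plain_sequence(psm.sequence)) <= max_len


if __name__ == "__main__":
    examples = [
        Psm("QEASLFTGPVR", 0.001),
        Psm("PEPTIDEN(Deamidated)K", 0.001),
        Psm("PEPTK", 0.001),
    ]
    for psm in examples:
        print(psm.sequence, passes_psm_filter(psm))
```

Measuring the length on the bare residues keeps modification names from inflating the peptide length.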

Figure 3

Data retention throughout the pipeline. The bars show the numbers of “data elements” (spectra, PSMs, and peptides) under consideration as these numbers decrease from the start (left) to the end (right) of the proteogenomics pipeline. In detail, the bars represent the following (node numbers refer to the TOPPAS workflow in Figure ): “MS2 spectra”, input MS2 spectra in the C-HPP testis data set; “Mascot/MS-GF+ PSMs (all)”, spectra that generated PSMs using either search engine; “Mascot/MS-GF+ (1% PEP)”, PSMs after PSM-level filtering (node 15); “Consensus”, PSMs after ConsensusID (node 16); “Filter: contaminants”, PSMs after filtering for contaminants (node 18); “Filter: known proteins (exact)”, PSMs after filtering for exact matches to known proteins (node 20); “Filter: known proteins (approx.)”, PSMs after filtering for approximate matches to known proteins (final set; node 22); and “Novel peptides”, distinct novel peptides identified by the final set of PSMs.
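The counts reported in the data processing summary can be tabulated to make the retention at each step explicit. The sketch below uses only the (rounded) numbers given in the text and leaves out the per-engine counts, because the Mascot and MS-GF+ branches run in parallel rather than in sequence; the stage labels are shortened for illustration.

```python
# Counts as reported in the text (values in the thousands or millions are rounded).
STAGES = [
    ("MS2 spectra", 4_200_000),
    ("Consensus PSMs", 940_000),
    ("After contaminant filter", 880_000),
    ("After exact known-protein filter", 455),
    ("After approximate known-protein filter", 210),
]


def retention_table(stages):
    """Return (name, count, fraction of previous stage retained) per stage."""
    rows = []
    previous = None
    for name, count in stages:
        fraction = None if previous is None else count / previous
        rows.append((name, count, fraction))
        previous = count
    return rows


if __name__ == "__main__":
    for name, count, fraction in retention_table(STAGES):
        kept = "" if fraction is None else f"{fraction:.4%} of previous stage"
        print(f"{name:40s} {count:>10,d}  {kept}")
```

The exact-match filter is by far the most selective step: fewer than 0.1% of the noncontaminant consensus PSMs survive it, consistent with the statement that more than 99.9% match known proteins.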

Novel Peptides

We identified 47 potential novel peptides in the C-HPP testis data based on a final set of 210 PSMs generated by our analysis pipeline. We re-evaluated these peptides against an updated, more-comprehensive database of known human proteins composed of RefSeq,[35] neXtProt,[36] GENCODE v22, and UniProt sequences, using the PeptideIndexer and IDFilter combination from the pipeline and again allowing up to two amino acid mismatches. This removed a further 12 peptides. Spectra for the remaining 35 peptides were manually inspected, and then these peptides were passed on to the manual genome annotators from the GENCODE project. Table S1 lists the 35 peptides together with the outcomes of the manual annotation process. Importantly, eight peptides were used as a source of evidence (together with RNA expression, sequence conservation, gene structure, and other orthogonal evidence) to annotate five new protein-coding genes. These annotations are publicly available in the VEGA database[37] and will be incorporated into the next release of the GENCODE gene set. One example is shown in Figure . A further 22 peptides were mapped to seven loci that were only recently annotated as new genes;[2] these would otherwise have counted as novel annotations as well. All of these loci were found to be expressed in testis tissue in the previous study.
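The idea behind the approximate re-evaluation, discarding any candidate that matches a known protein with at most two amino acid mismatches, can be illustrated with a naive sliding-window comparison. This is a sketch of the concept only and not how PeptideIndexer is implemented; the function names and the toy sequences are hypothetical, and substitutions are the only differences considered (no insertions or deletions).

```python
def hamming_within(a, b, max_mismatches):
    """True if equal-length strings differ in at most `max_mismatches` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def matches_known_protein(peptide, proteins, max_mismatches=2):
    """Compare the peptide with every same-length window of every protein."""
    k = len(peptide)
    for protein in proteins:
        for start in range(len(protein) - k + 1):
            if hamming_within(peptide, protein[start:start + k], max_mismatches):
                return True
    return False


def still_novel(peptides, proteins, max_mismatches=2):
    """Keep only peptides with no approximate match to a known protein."""
    return [p for p in peptides
            if not matches_known_protein(p, proteins, max_mismatches)]


if __name__ == "__main__":
    known = ["MKTAYIAKQRQISFVKSHFSRQ"]  # toy "known protein"
    candidates = ["AYIAKQR", "AYIGKQK", "AYGGKQK", "WWWWWWW"]
    print(still_novel(candidates, known))
```

In the toy example, the exact match and the peptide with two substitutions are discarded, whereas the peptide with three substitutions and the unrelated peptide are retained as candidates. A production tool would use an indexed search rather than this quadratic scan.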

Figure 4

Reannotation of OTTHUMG00000019887 based on proteogenomic analysis. (A) This locus was present in GENCODE v20 as a lincRNA model, and it is currently categorized in this way by RefSeq (orange model) based on mRNA AK056723.1 (brown model) and given the official HGNC gene symbol LINC00961. Furthermore, an equivalent model was generated and classified as a lncRNA by the RNA-Seq-based PLAR pipeline developed by Hezroni et al.[38] (purple-outlined model). GENCODE have now converted this model to protein coding (UTRs in red; CDS in green) based on proteogenomic evidence in combination with evolutionary conservation. The conserved region is well resolved by PhyloCSF, with this track being taken from genome.ucsc.edu. Peptide [QEASLFTGPVR] is marked (red triangle). (B) The 75 aa human CDS shows conservation in eutherian mammals, although not outside this group based on available genome alignments. “T. Devil” is Tasmanian devil, and “flying fox” is specifically the black flying fox Pteropus alecto.

Known Proteins

As a proof of concept, we carried out protein inference using Fido (via FidoAdapter) on the set of 880 000 noncontaminant consensus PSMs from the whole data set. Filtering for known proteins resulted in 8679 inferred protein groups. This number is roughly in line with the total of 9597 proteins reported by Zhang et al. in their original analysis, given that our filtering criteria for PSMs and proteins were more stringent than theirs, as required for the reliable identification of novel peptides. Using the same data, we performed spectral counting on the peptide level with OpenMS’ ProteinQuantifier and mapped the peptides to the genome. Further analysis of the known proteins and their expressed peptides is beyond the scope of this study, but we are making our results available in a Track Hub,[39] suitable for visualization in genome browsers, at http://ngs.sanger.ac.uk/production/proteogenomics/WTSI_proteomics_CHPP_testis/hub.txt. As an example, our Track Hub can be displayed in the UCSC genome browser, showing tracks for peptides in known proteins, novel peptides, and post-translationally modified peptides, via http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hubUrl=http://ngs.sanger.ac.uk/production/proteogenomics/WTSI_proteomics_CHPP_testis/hub.txt.
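Peptide-level spectral counting amounts to counting how many PSMs support each peptide sequence. The following minimal sketch shows the principle; it is not the ProteinQuantifier implementation, and the function name and the toy input are hypothetical.

```python
from collections import Counter


def spectral_counts(psm_sequences):
    """Count the number of PSMs observed for each peptide sequence."""
    return Counter(psm_sequences)


if __name__ == "__main__":
    # Toy input: one peptide sequence per PSM.
    psms = ["QEASLFTGPVR", "AYIAKQR", "QEASLFTGPVR", "QEASLFTGPVR"]
    for peptide, count in spectral_counts(psms).most_common():
        print(f"{peptide}\t{count}")
```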

Discussion

We present an automated pipeline for proteogenomic data analysis that is implemented within the OpenMS framework for computational mass spectrometry. This pipeline is largely based on existing software (OpenMS, Mascot, MS-GF+, and Percolator) but also benefits from custom extensions and new tools developed in the context of this project. The aim of our pipeline, given LC–MS/MS data and a suitable sequence database, is to confidently identify peptides that can inform novel genome annotations. We have applied this approach to a relevant data set, the human testis tissue data set generated for the Chromosome-Centric Human Proteome Project. On the basis of 4.2 million MS2 spectra from 150 LC–MS/MS runs, we identified 35 “novel” peptides as candidates for genome annotation. A total of eight of these peptides led to the annotation of five new protein-coding genes for the GENCODE gene set, and a further 12 peptides matched five very recent novel annotations. The fact that our set of results was so highly enriched in peptides that gave rise to new genome annotations demonstrates the efficacy of our approach.

Implementing our pipeline within the OpenMS framework not only allowed us to conveniently reuse a large set of existing computational proteomics algorithms for our purposes but also confers a great amount of flexibility for adapting or extending the pipeline to address different research questions. For example, we have focused on the detection of “novel” peptides in this study, thereby excluding 99.9% of our confident peptide IDs from consideration. For a different research objective, such as comparing tissue proteomes, it would be straightforward to adapt our pipeline to focus on these 99.9% of PSMs instead. As we have demonstrated briefly, protein inference can be carried out using Fido via FidoAdapter. One of several available feature detection algorithms, together with the IDMapper and ProteinQuantifier tools, can add label-free quantification;[40] other tools are available for the quantitative analysis of labeled data or for the localization of post-translational modifications. Conversely, because OpenMS is a general-purpose framework for mass-spectrometry-based proteomics and metabolomics, the improvements that we made to its tools will also benefit users in other areas of research.

Adapting our pipeline toward more general-purpose peptide identification would open up avenues for extension that are less suitable for our current focus on novel peptides. For example, error-tolerant searching could help to identify additional post-translational modifications or sequence variants caused by amino acid substitutions.[41] However, such approaches typically restrict the search space to proteins that have already been identified. De novo sequencing could allow us to detect peptides with unexpected sequences;[42] however, we would not consider novel peptides that are based on sequence variants to be credible unless sequencing data show that those variants are clearly present in the sample. Spectral library searching,[43] potentially in combination with spectral clustering,[44] has certain advantages over sequence database searching, but it can only assign peptides to spectra if similar spectra have been identified previously. Finally, approaches for identifying cofragmented peptides from “chimeric” spectra may boost identification rates,[45] but it is questionable whether PSMs of novel peptides involving chimeric spectra would pass manual validation.

Currently, the mapping of peptides to the genome is part of the manual annotation process, which is based on the inclusion of genomic coordinates in the accessions of the sequence database entries. A recently developed tool that performs the genome mapping could be integrated into our pipeline in the future (Schlaffner et al., manuscript in preparation). Moreover, the availability of paired sequencing data for proteomic samples would make it possible to create custom databases containing the exact sequence variants present in each sample; this would allow the further simplification of the filtering pipeline and increase its sensitivity by obviating the need for an approximate matching step.

To date, we have processed data sets composed of over 55 million MS2 spectra with OpenMS-based proteogenomics workflows, demonstrating the scalability and robustness of our pipeline. Even on the most highly curated genome, this has resulted in over 40 new protein-coding gene annotations for the GENCODE human reference. We anticipate that this pipeline will be particularly useful for the analysis of personalized proteomes and integration with other omics technologies.

42 in total

1. The IPD-IMGT/HLA Database - New developments in reporting HLA variation.

Authors: James Robinson; Anup R Soormally; James D Hayhurst; Steven G E Marsh
Journal: Hum Immunol Date: 2016-01-27 Impact factor: 2.850

2. PepNovo: de novo peptide sequencing via probabilistic network modeling.

Authors: Ari Frank; Pavel Pevzner
Journal: Anal Chem Date: 2005-02-15 Impact factor: 6.986

3. Development and validation of a spectral library searching method for peptide identification from MS/MS.

Authors: Henry Lam; Eric W Deutsch; James S Eddes; Jimmy K Eng; Nichole King; Stephen E Stein; Ruedi Aebersold
Journal: Proteomics Date: 2007-03 Impact factor: 3.984

4. TOPPAS: a graphical workflow editor for the analysis of high-throughput proteomics data.

Authors: Johannes Junker; Chris Bielow; Andreas Bertsch; Marc Sturm; Knut Reinert; Oliver Kohlbacher
Journal: J Proteome Res Date: 2012-05-24 Impact factor: 4.466

5. OpenMS: a flexible open-source software platform for mass spectrometry data analysis.

Authors: Hannes L Röst; Timo Sachsenberg; Stephan Aiche; Chris Bielow; Hendrik Weisser; Fabian Aicheler; Sandro Andreotti; Hans-Christian Ehrlich; Petra Gutenbrunner; Erhan Kenar; Xiao Liang; Sven Nahnsen; Lars Nilse; Julianus Pfeuffer; George Rosenberger; Marc Rurik; Uwe Schmitt; Johannes Veit; Mathias Walzer; David Wojnar; Witold E Wolski; Oliver Schilling; Jyoti S Choudhary; Lars Malmström; Ruedi Aebersold; Knut Reinert; Oliver Kohlbacher
Journal: Nat Methods Date: 2016-08-30 Impact factor: 28.547

6. Fast and accurate database searches with MS-GF+Percolator.

Authors: Viktor Granholm; Sangtae Kim; José C F Navarro; Erik Sjölund; Richard D Smith; Lukas Käll
Journal: J Proteome Res Date: 2013-12-23 Impact factor: 4.466

7. A cross-platform toolkit for mass spectrometry and proteomics.

Authors: Matthew C Chambers; Brendan Maclean; Robert Burke; Dario Amodei; Daniel L Ruderman; Steffen Neumann; Laurent Gatto; Bernd Fischer; Brian Pratt; Jarrett Egertson; Katherine Hoff; Darren Kessner; Natalie Tasman; Nicholas Shulman; Barbara Frewen; Tahmina A Baker; Mi-Youn Brusniak; Christopher Paulse; David Creasy; Lisa Flashner; Kian Kani; Chris Moulding; Sean L Seymour; Lydia M Nuwaysir; Brent Lefebvre; Frank Kuhlmann; Joe Roark; Paape Rainer; Suckau Detlev; Tina Hemenway; Andreas Huhmer; James Langridge; Brian Connolly; Trey Chadick; Krisztina Holly; Josh Eckels; Eric W Deutsch; Robert L Moritz; Jonathan E Katz; David B Agus; Michael MacCoss; David L Tabb; Parag Mallick
Journal: Nat Biotechnol Date: 2012-10 Impact factor: 54.908

8. neXtProt: a knowledge platform for human proteins.

Authors: Lydie Lane; Ghislaine Argoud-Puy; Aurore Britan; Isabelle Cusin; Paula D Duek; Olivier Evalet; Alain Gateau; Pascale Gaudet; Anne Gleizes; Alexandre Masselot; Catherine Zwahlen; Amos Bairoch
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

9. The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors: Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal: Mol Cell Proteomics Date: 2012-02-27 Impact factor: 5.911

10. 2016 update of the PRIDE database and its related tools.

Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971
