RHybridFinder: An R package to process immunopeptidomic data for putative hybrid peptide discovery.

Frederic Saab¹, David J Hamelin¹, Qing Ma², Kevin A Kovalchik¹, Isabelle Sirois¹, Pouya Faridi³, Chen Li³, Anthony W Purcell³, Peter Kubiniok¹, Etienne Caron^1,4.

Abstract

Identification of proteasomal spliced peptides (PSPs) by mass spectrometry (MS) is not possible with traditional search engines. Here, we provide a protocol for running RHybridFinder (RHF), an R package for the computational inference of putative PSPs detected by MS. RHF extracts high confidence scored de novo sequenced peptides identified by PEAKS software. Those peptides are then matched to protein databases to infer cis- or trans-spliced major histocompatibility complex (MHC)-associated peptides. RHF is relatively fast and straightforward. PSPs have to be validated experimentally. For complete details on the use and execution of the original protocol, please refer to Faridi et al. (2018).

Keywords: Bioinformatics; Cancer; Computer sciences; Immunology; Mass Spectrometry

Year: 2021 PMID: 34746858 PMCID: PMC8551247 DOI: 10.1016/j.xpro.2021.100875

Source DB: PubMed Journal: STAR Protoc ISSN: 2666-1667

Before you begin

The proteasome is recognized as the core enzymatic machinery of the antigen processing and presentation pathway wherein peptides derived from proteasomal proteolysis are selectively presented on the cell surface by MHC (major histocompatibility complex)-I molecules (Neefjes et al., 2011). In 2004, Hanada et al. discovered that the proteasome could cleave and splice peptide fragments to generate immunogenic epitopes presented by MHC class I molecules (Hanada et al., 2004). Following this groundbreaking discovery, other research groups have uncovered additional spliced T cell epitopes generated by the proteasome, referred to in this protocol as proteasomal spliced peptides (PSPs) (Berkers et al., 2015; Dalet et al., 2011; Ebstein et al., 2016; Michaux et al., 2014; Vigneron et al., 2004). More recently, MS-based immunopeptidomics has been used to expedite the identification of PSPs in a systematic manner, including cis- and trans-spliced peptides (Berkers et al., 2015; Faridi et al., 2018; Liepe et al., 2010, 2016; Rolfs et al., 2019; Specht et al., 2020). However, MS-based studies using different computational approaches have led to a debate around the proportion of those PSPs in the MHC class I immunopeptidome (Lichti, 2021; Mylonas et al., 2018; Wilhelm et al., 2021). Here, we provide a protocol to run RHybridFinder (RHF), an open-access and improved R package built upon the computational workflow developed by Faridi et al. (2018) for the analysis of MS data to systematically identify putative PSPs. The main strengths of RHF are its high speed and the fact that it is relatively straightforward to run. The main limitation is that the PSPs identified by RHF may not be genuinely spliced by the proteasome in vivo. Their source and presentation should therefore be validated experimentally to move the debate forward (Figure 1).

Figure 1

Overview of suggested workflow for the discovery of PSPs

We propose a four-step workflow for the identification of PSPs. The first three steps (blue squares: sample preparation, MS data acquisition and RHybridFinder) enable computational exploration of putative PSPs, followed by experimental validations (green square). A non-exhaustive list of possible experiments is shown for validating/gaining confidence in the identification of MHC-I peptides that are genuinely generated by proteasomal splicing.

RHybridFinder is available on CRAN (https://cran.r-project.org/package=RHybridFinder) to enable more researchers to explore those debated peptides.

Data collection

To demonstrate the output of the different RHybridFinder functions, we have used datasets from the HLA Ligand Atlas (Marcu et al., 2021) deposited in PRIDE (Proteomics IDentification Database) under accession PXD019643. Download the following mzML files: 171002_AM_AUT01-DN17_Liver_W6-32_10%_DDA_3_400-650mz_msms4, 171002_AM_AUT01-DN17_Liver_W6-32_10%_DDA_3_400-650mz_msms5, 171002_AM_AUT01-DN17_Liver_W6-32_10%_DDA_3_400-650mz_msms6. Analyze these files in PEAKS.

Installing RStudio/R

The RHybridFinder package has been developed in RStudio and implemented in the R programming language. Download and install RStudio if it is not already installed (https://www.rstudio.com/products/rstudio/download/).

Installing and loading RHybridFinder

Below are the lines needed to install the RHybridFinder package from CRAN (the Comprehensive R Archive Network) and then load it.

Install RHybridFinder by typing install.packages("RHybridFinder") in the R console.

> install.packages("RHybridFinder")

Load RHybridFinder by typing library(RHybridFinder) in the R console.

> library(RHybridFinder)

CRITICAL: If you copy the lines of code from here, keep in mind that you might have to re-write the quotation marks yourself.
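The installation step can be made safe to re-run by installing only when the package is missing. The following is a minimal sketch; `ensure_package` is a hypothetical helper name, not part of RHybridFinder, and it relies only on base R functions (`requireNamespace`, `install.packages`).

```r
# Hypothetical helper: install a CRAN package only if it is not yet available.
# Returns (invisibly) TRUE if the package can be loaded afterwards.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Intended use for this protocol (commented out so the sketch runs offline):
# ensure_package("RHybridFinder")
# library(RHybridFinder)
```

Calling the helper a second time is expected to skip the download, since `requireNamespace` already finds the package.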

Key resources table

Step-by-step method details

Step 1: Load inputs into R

Timing: 1 min

Before running HybridFinder, the inputs need to be loaded into R. To facilitate the process, we propose loading the files into R as follows (Figure 2).

Figure 2

Recommended folder structure

The parent folder includes two child folders. The child folders include the various files that are necessary for running RHybridFinder. The dotted line (second_run) indicates that the DB search psm.csv file is added after the second DB search.

Create an object (folder_Exp1) for the path to the parent folder (Mel_Exp1); both can be named differently.

> folder_Exp1 <- file.path("/Users/YOURUSERNAME/Desktop/Mel_Exp1")

Import the de novo sequencing results as well as the database search results, both of which are located in the first_run child folder.

De novo sequencing results file:

> denovo_Exp1 <- read.csv(file = file.path(folder_Exp1, "first_run", "all_denovo_candidates.csv"), header = TRUE, sep = ",", stringsAsFactors = FALSE)

Database search results file:

> db_search_Exp1 <- read.csv(file = file.path(folder_Exp1, "first_run", "DB search psm.csv"), header = TRUE, sep = ",", stringsAsFactors = FALSE)

Create an object for the path to the proteome file, located in the parent folder (folder_Exp1) (see refproteome_Exp1 in the example below). The fasta proteome will be imported into R by the HybridFinder function.

> refproteome_Exp1 <- file.path(folder_Exp1, "uniprothuman-20379entries-Nov2019_validated.fasta")

CRITICAL: Please note that if you copy the file access path in Windows, you will need to switch each backslash ("\") to a forward slash ("/").
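Before importing, it can help to confirm that the expected files are where the folder structure says they should be, and to normalize a path copied from Windows. The sketch below uses only base R; `to_forward_slashes` and `check_inputs` are hypothetical helper names, and the file names in the demonstration are placeholders created in a temporary folder.

```r
# Hypothetical helper: replace every backslash in a copied Windows path
# with a forward slash.
to_forward_slashes <- function(path) gsub("\\", "/", path, fixed = TRUE)

# Hypothetical helper: report which expected input files exist.
# Returns a logical vector named by the full file paths.
check_inputs <- function(folder, first_run_files, proteome_file) {
  paths <- c(file.path(folder, "first_run", first_run_files),
             file.path(folder, proteome_file))
  found <- file.exists(paths)
  names(found) <- paths
  found
}

# Demonstration on a temporary folder that contains only the de novo file.
demo_folder <- file.path(tempdir(), "Mel_Exp1_demo")
dir.create(file.path(demo_folder, "first_run"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_folder, "first_run",
                                "all_denovo_candidates.csv")))
status <- check_inputs(demo_folder,
                       c("all_denovo_candidates.csv", "DB search psm.csv"),
                       "proteome.fasta")
print(status)
```

Any entry reported as FALSE points to a file that still needs to be exported from PEAKS or copied into place before read.csv is called.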

Access the datasets included in the R package

The RHybridFinder package also includes demonstration datasets from the HLA Ligand Atlas that have already been analyzed in PEAKS. These datasets include PEAKS de novo sequencing results and PEAKS database search results.

# access denovo dataset
> data(package = "RHybridFinder", "denovo_Human_Liver_AUTD17")

# access database search dataset
> data(package = "RHybridFinder", "db_Human_Liver_AUTD17")

Note that, due to size constraints, the proteome database (.fasta) file is not included in the package. It can be downloaded from the UniProt database. In the environment tab, denovo_Human_Liver_AUTD17 and db_Human_Liver_AUTD17 should appear. Note that if you see <Promise>, the data will appear after clicking on the objects.

Step 2: Run HybridFinder

Timing: 2–5 min (with parallelism, 8 cores); 10–15 min (without parallelism)

To keep the runtime relatively short, we have implemented an option to use parallel computing. However, because parallel computing requires a certain number of processing units to function properly, HybridFinder can also be run without it. With the default parameters of the HybridFinder function and an “all de novo candidates.csv” file containing 16,286 peptide sequences, the runtime with parallelism (8 cores) is 2 min 17 s; ∼5 min are required for double the number of peptides. Without parallelism, the runtime ranged between 10 and 15 min for 16,286 peptide sequences.

Run HybridFinder (please refer to Table 1 for details on the required inputs) and export the results to the parent folder.

Table 1

HybridFinder function parameters

Parameter	Description	Default value
denovo_candidates	the data frame containing the de novo sequencing results	No defaults. Necessary input.
db_search	the data frame containing the database search results	No defaults. Necessary input.
proteome_db	the file path to the proteome used for the database search	No defaults. Necessary input.
(Optional) customALCcutoff	A custom score cutoff that can be set by the user as long as it would be at least 85	NULL. (ALC cutoff calculated automatically as median of matching peptide sequences of assigned spectra). If set manually, minimum is 85.
with_parallel : boolean (True or False)	representing whether parallel computing should be employed for running the function.	TRUE
(Optional) customCores	If with_parallel is set to TRUE and the PC has >5 cores, the user can set a custom amount of cores to be used by the function.	6
(Optional) export_files : boolean (True or False)	by default it is set to False, however, if set to True, then the following input is essential.	FALSE
(Optional) export_dir	file path to the directory where the output files should be stored. This parameter is necessary for the export.	NULL

> HybridFinder_results_Exp1 <- HybridFinder(denovo_candidates = denovo_Exp1, db_search = db_search_Exp1, proteome_db = refproteome_Exp1, customALCcutoff = NULL, with_parallel = TRUE, customCores = 8, export_files = TRUE, export_dir = folder_Exp1)

CRITICAL: If you use the datasets included in the package, please note that they are named differently. The “denovo_candidates” and “db_search” parameters should then be set to the datasets loaded from the package: denovo_Human_Liver_AUTD17 and db_Human_Liver_AUTD17, respectively.

CRITICAL: Make sure to store the HybridFinder results in an object (e.g., HybridFinder_results_Exp1), as the HybridFinder output dataframe is needed by the second function.

The concatenated hybrid fake proteins are placed at the end of the hybrid proteome, with the name pattern ‘sp|denovo_HF_fake_protein_[#]’. Parallel computing is used only if with_parallel is set to TRUE and the PC has more than 5 cores.

CRITICAL: Keep the number of other open windows to a minimum and save any work in other software prior to using HybridFinder with parallelism.

The function outputs a list (Figure 3) containing: (1) the HybridFinder output containing all the de novo peptides along with their potential splice type explanation (cis-/trans-), (2) a list of the step1 hybrid candidate peptides, (3) the hybrid proteome (merged proteome: the original user proteome along with the hybrid proteome composed of the concatenated candidate hybrid peptide sequences).
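Since parallel computing is only used when the PC has more than 5 cores, the parallel settings can be derived from the machine rather than typed by hand. The sketch below is an illustration under assumptions: `choose_parallel_settings` is a hypothetical helper, not part of RHybridFinder, and capping the request at the number of detected cores is a choice made here, not something stated by the protocol.

```r
# Hypothetical helper: derive with_parallel / customCores from a core count.
# `requested` is the number of cores the user would like to use.
choose_parallel_settings <- function(n_cores, requested = 8) {
  # Mirror the documented rule: parallelism needs more than 5 cores.
  use_parallel <- !is.na(n_cores) && n_cores > 5
  list(
    with_parallel = use_parallel,
    # Never ask for more cores than exist; otherwise fall back to the
    # documented default of 6 (Table 1).
    customCores = if (use_parallel) min(requested, n_cores) else 6
  )
}

# detectCores() may return NA on some platforms; the helper treats that
# as "do not use parallelism".
settings <- choose_parallel_settings(parallel::detectCores())
print(settings)

# The values could then be passed on, for example:
# HybridFinder(..., with_parallel = settings$with_parallel,
#              customCores = settings$customCores)
```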

Figure 3

Screenshot of the HybridFinder function results

In the results list you will find 3 items: 1) a dataframe containing the HybridFinder output. 2) a character vector containing the candidate spliced peptides. 3) a list which is in a seqinr class (Charif and Lobry, 2007) containing the merged hybrid proteome.

In the example above, export_files has been set to TRUE and export_dir has been defined, which means that the files are also exported automatically. If these two parameters are not specified, or are set to FALSE and NULL, the results are only stored in the HybridFinder_results_Exp1 object. In this case, you can still use “export_HybridFinder_results” as in the code below, where HybridFinder_results_Exp1 is the object created above for the storage of HybridFinder results.

> export_HybridFinder_results(HybridFinder_results_Exp1, export_dir = folder_Exp1)

Pause point: If you would like to conduct the rest of the protocol at a later time, either use the export functionality and later load the exported HybridFinder output for use in the second step, or save the object in R in an .rda file as follows and load it again when you resume at step 4 (Load checknetMHCpan inputs into R).

> save(HybridFinder_results_Exp1, file = file.path(folder_Exp1, "HybridFinder_results_Exp1.rda"))

> load(file.path(folder_Exp1, "HybridFinder_results_Exp1.rda"))

Step 3: Database search using hybrid Fasta

Timing: 1 h

An essential interim step must follow the HybridFinder function: running a database search in PEAKS with the merged proteome. Now that a merged hybrid proteome has been obtained from the HybridFinder function, it can be used to obtain potential PSPs whose quality is comparable with that of all other database search peptides, because all peptides are filtered at the same FDR (False Discovery Rate) cutoff, which can be adjusted by the user in PEAKS. In the original workflow by Faridi et al. (2018), the database search peptides in both runs were filtered in PEAKS at a 1% FDR.

Perform a database search in PEAKS using the original raw MS file and the same settings as in the first search, but this time using the merged hybrid proteome (.fasta) file generated with the HybridFinder function.

Step 4: Load checknetMHCpan inputs into R

Timing: 1 min

Prior to running checknetMHCpan, please ensure that netMHCpan (version 4.0 or 4.1) is installed. checknetMHCpan is the last step of the HybridFinder workflow. The function uses the database search results from the second PEAKS analysis and provides the binding affinity results of all the peptides along with their categorizations.

Create an object for the location of the netMHCpan executable.

> netmhcpan_dir <- file.path("/usr/local/bin")

Create an object (vector) for storing the HLA-I alleles that you would like to have binding affinity predictions for.

> alleles_Exp1 <- c("HLA-A*02:01", "HLA-A*03:01", "HLA-B*07:02")

Retrieve the HybridFinder output from the HybridFinder function results.

> HF_output_Exp1 <- HybridFinder_results_Exp1[[1]]

Import the database search results (from step 3: Database search using hybrid fasta).

> rerun_db_search_Exp1 <- read.csv(file.path(folder_Exp1, "second_run", "DB search psm.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

Note: If your computer’s operating system is Windows (netMHCpan is not compatible with Windows), the web version of netMHCpan (http://www.cbs.dtu.dk/services/NetMHCpan-4.1/instructions.php) can be used instead. In this case, we propose using a separate function from this package (step2_wo_netMHCpan), which outputs a netMHCpan-ready input of sequences in .pep format.

The demonstration datasets from the HLA Ligand Atlas included in this package also include datasets for the checknetMHCpan/step2_wo_netMHCpan functions. After the HybridFinder function was run and the results were stored in HybridFinder_results_Exp1, PEAKS was run using the merged hybrid proteome. Below is a way to retrieve the second PEAKS run dataset included in the package:

> data(package = "RHybridFinder", "db_rerun_Human_Liver_AUTD17")

The merged proteome used for the second database search is based on the customALCcutoff being set to NULL (default parameter value).

CRITICAL: The merged proteome database changes between samples and whenever the customALCcutoff parameter is changed. The same merged hybrid proteome cannot be used for separate analyses.
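The choice between checknetMHCpan and step2_wo_netMHCpan depends on whether a local netMHCpan can be used. The following minimal sketch makes that decision explicit; `pick_step2_function` is a hypothetical helper, and the executable file name `netMHCpan` inside the installation directory is an assumption that should be checked against your own installation.

```r
# Hypothetical helper: decide which RHybridFinder function to use for the
# final step, based on the operating system and on whether an executable
# named "netMHCpan" (assumed name) is present in the given directory.
pick_step2_function <- function(netmhcpan_dir,
                                os_type = .Platform$OS.type) {
  exe <- file.path(netmhcpan_dir, "netMHCpan")
  if (os_type != "windows" && file.exists(exe)) {
    "checknetMHCpan"
  } else {
    "step2_wo_netMHCpan"
  }
}

# Example: on Windows the web-version route is always suggested.
print(pick_step2_function("/usr/local/bin", os_type = "windows"))
```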

Step 5: Run checknetMHCpan

Timing: ∼1 min

The checknetMHCpan function embodies the second major step of the workflow. The categorizations of the hybrid peptides from the HybridFinder output are retrieved for matched peptides found in the second PEAKS database search results. Then, peptide-MHC class I binding predictions for the entire database search results (for peptides between 9 and 12 amino acids) are computed using netMHCpan and tidied in order to summarize the results.

Run checknetMHCpan using the code below (please refer to Table 2 for details on the required inputs) and export the results to the same folder:

Table 2

checknetMHCpan function parameters

Parameter	Description	Default value
netmhcpan_directory	the directory where netMHCpan is installed (i.e., ‘/usr/bin’ or ‘/usr/local/bin’, depending on where you have it installed)	No defaults. Necessary input.
netmhcpan_alleles	a vector composed of the alleles the peptides will be tested against.	No defaults. Necessary input.
peptide_rerun	the database search results from the second PEAKS run	No defaults. Necessary input.
HF_step1_output	the data frame from the HybridFinder function containing the potential spliced peptide explanations as well as RT, m/z, ALC, Scan & Fraction	No defaults. Necessary input.
(Optional) export_files : boolean (True or False)	by default it is set to False, however, if set to True, then the following input is essential.	FALSE
(Optional) export_dir	file path to the directory where the output files should be stored. This parameter is necessary for the export.	NULL

> checknetMHCpan_results_Exp1 <- checknetMHCpan(netmhcpan_directory = netmhcpan_dir, netmhcpan_alleles = alleles_Exp1, peptide_rerun = rerun_db_search_Exp1, HF_step1_output = HF_output_Exp1, export_files = TRUE, export_dir = folder_Exp1)

checknetMHCpan is compatible with the exports from both netMHCpan 4.0 and netMHCpan 4.1.

CRITICAL: If you use the datasets included in the package, please note that they are named differently. The “peptide_rerun” parameter should then be set to the dataset loaded from the package, db_rerun_Human_Liver_AUTD17.

After running the code above, a results list should be returned (Figure 4).

Figure 4

Screenshot of the checknetMHCpan results list

In the results list you will find 3 items: 1) a dataframe containing the netMHCpan results. 2) a dataframe containing the tidied netMHCpan results. 3) the database search results with the “Potential_spliceType” for the hybrid peptides retrieved from step1.

These results are also exportable with the export_checknetMHCpan_results function.

> export_checknetMHCpan_results(checknetMHCpan_results_Exp1, export_dir = folder_Exp1)

If you intend to use the web version of netMHCpan (especially useful for Windows users) or other software for peptide binding affinity prediction, the step2_wo_netMHCpan function does the same as checknetMHCpan but without running netMHCpan. The function should return a list (Figure 5) containing the updated database search results as well as a list of the peptides, which can be used as input in the web version of netMHCpan.
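For readers preparing input for the web version by hand, a peptide list can be written as a plain text file with one sequence per line. The sketch below is an illustration only: `write_pep_file` is a hypothetical helper, the peptides are made-up placeholders, and applying the 9 to 12 amino acid length window at this point is an assumption borrowed from the range used for the binding predictions.

```r
# Hypothetical helper: keep unique peptides within a length window and
# write them one per line to a .pep file.
write_pep_file <- function(peptides, path, min_len = 9, max_len = 12) {
  peptides <- unique(peptides)
  keep <- peptides[nchar(peptides) >= min_len & nchar(peptides) <= max_len]
  writeLines(keep, path)
  invisible(keep)
}

# Placeholder sequences of length 9, 7, 13, 12 and a duplicate of the first.
example_peptides <- c("ACDEFGHIK", "ACDEFGH", "ACDEFGHIKLMNP",
                      "ACDEFGHIKLMN", "ACDEFGHIK")
pep_path <- file.path(tempdir(), "example_input.pep")
kept <- write_pep_file(example_peptides, pep_path)
print(readLines(pep_path))
```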

Figure 5

Screenshot of the step2_wo_netMHCpan results list

In the results list you will find 2 items: 1) a character vector containing the netMHCpan-ready input. 2) the database search results with the “Potential_spliceType” for the hybrid peptides retrieved from step1.

Expected outcomes

HybridFinder

The HybridFinder function follows the same rationale as indicated in Faridi et al. (2018). After high-confidence de novo peptides are extracted, these are searched sequentially for an exact hit, followed by a search of pair fragments within one protein and then within two proteins (Figure 6). Finally, the sequences of all hybrid peptides are concatenated to create fake proteins, which are added at the bottom of the proteome database in order to constitute a merged hybrid proteome.
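The extraction of high-confidence de novo peptides relies on an ALC cutoff taken as the median ALC of spectra on which the de novo and database results agree. The toy example below sketches that idea on made-up data. The column names (Scan, Peptide, ALC) follow the PEAKS export columns described in this protocol, but the matching key, the use of ">=" for the filter, and all values are assumptions for illustration, not the package's internal code.

```r
# Made-up de novo results: one row per spectrum.
denovo <- data.frame(
  Scan    = c("F1:101", "F1:102", "F1:103", "F1:104", "F1:105"),
  Peptide = c("AAAK", "CCCK", "DDDK", "EEEK", "FFFK"),
  ALC     = c(95, 80, 90, 99, 60),
  stringsAsFactors = FALSE
)

# Made-up database search results: scans 101 and 102 agree with de novo,
# scan 103 was assigned a different sequence.
db_search <- data.frame(
  Scan    = c("F1:101", "F1:102", "F1:103"),
  Peptide = c("AAAK", "CCCK", "XXXK"),
  stringsAsFactors = FALSE
)

# Spectra where both approaches report the same sequence for the same scan.
spectrum_key <- function(d) paste(d$Scan, d$Peptide, sep = "|")
agrees <- spectrum_key(denovo) %in% spectrum_key(db_search)

# Cutoff = median ALC of the agreeing spectra.
alc_cutoff <- median(denovo$ALC[agrees])

# Keep de novo spectra that the database search did not assign and whose
# ALC reaches the cutoff.
unassigned <- denovo[!(denovo$Scan %in% db_search$Scan), ]
high_confidence <- unassigned[unassigned$ALC >= alc_cutoff, ]
print(alc_cutoff)
print(high_confidence)
```

In the package, a manually supplied customALCcutoff replaces the automatic value, subject to the documented minimum of 85.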

Figure 6

HybridFinder function

HybridFinder extracts high-confidence de novo peptides by using an ALC cutoff based on the median ALC of spectrum groups and peptide sequences that are common to the de novo and the database search results. The ALC cutoff is used to filter unassigned de novo spectrum groups in order to obtain high-confidence de novo spectra. All sequences are then searched in the proteome for the entire sequence; those that match are filtered and considered “Linear”. The remaining peptide spectrum groups are “cut” in order to create peptide fragment combinations. These are then searched in the proteome to determine whether fragment combinations exist within the same protein; matches are considered cis-spliced and are further filtered. Finally, fragment combinations are created from those that did not match in the previous step and are searched to determine whether they exist in two proteins. If there is a match, these are considered trans-spliced peptides. The remaining uncategorized spectrum groups are considered not to have a biological explanation (NBE) and are therefore discarded.
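The sequential search (exact hit, then two fragments in one protein, then two fragments in two proteins) can be illustrated with a toy classifier. This is a simplified sketch, not the package's implementation: `classify_peptide` and its `min_frag` parameter (a minimum fragment length, set to 2 here) are hypothetical, the proteome is two made-up sequences, and fragment order and distance within a protein are ignored.

```r
# Toy classifier following the search order described above.
# `proteome` is a named character vector of protein sequences.
classify_peptide <- function(peptide, proteome, min_frag = 2) {
  # 1) Exact hit anywhere in the proteome.
  if (any(grepl(peptide, proteome, fixed = TRUE))) return("Linear")

  n <- nchar(peptide)
  found_trans <- FALSE
  if (n >= 2 * min_frag) {
    # 2) Cut the peptide at every allowed position into two fragments.
    for (pos in min_frag:(n - min_frag)) {
      left  <- substr(peptide, 1, pos)
      right <- substr(peptide, pos + 1, n)
      in_left  <- grepl(left,  proteome, fixed = TRUE)
      in_right <- grepl(right, proteome, fixed = TRUE)
      # Both fragments in the same protein: cis takes priority.
      if (any(in_left & in_right)) return("cis")
      # Both fragments found, but only in different proteins.
      if (any(in_left) && any(in_right)) found_trans <- TRUE
    }
  }
  # 3) trans if any cut was explained by two proteins, otherwise
  #    no biological explanation (NBE).
  if (found_trans) "trans" else "NBE"
}

toy_proteome <- c(P1 = "MKTAYIAKQRQISFVK", P2 = "GGSLLDEVNHHW")
toy_peptides <- c("AYIAKQ", "KTAISFV", "KTALLDE", "WWWWWWW")
toy_calls <- vapply(toy_peptides, classify_peptide, character(1),
                    proteome = toy_proteome)
print(toy_calls)
```

Checking every cut for a cis explanation before accepting a trans one mirrors the order in which the categories are assigned in the description above.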

Typically, when the HybridFinder function is run, 3 messages are printed representing each major stage of the algorithm and finally ‘Done!’ is printed once the processing is finished. The function returns a list containing 3 items: the HybridFinder output (Figure 7) where the predicted splice type is displayed, a character vector containing only the list of hybrid candidates (Figure 8) and finally the merged hybrid proteome (Figure 9) where the hybrid peptide candidates have been concatenated as fake proteins.

Figure 7

Screenshot of the HybridFinder output dataframe

(5 rows). The Fraction column represents the LC-MS run; the Scan column is a number representing a unique index for the tandem mass spectra (F[Fraction#]:Scan#); m/z is the precursor mass-to-charge ratio; RT is the Retention Time (elution time) for the spectrum; Peptide corresponds to the peptide sequences. The Length column represents the number of amino acids in a given peptide. ALC (Average Local Confidence) is a score calculated in PEAKS as the total of the residue local confidence scores in the peptide divided by the peptide length. These columns are not generated by the HybridFinder function; they are found in any PEAKS de novo sequencing export. For more information, please consult the PEAKS user manual. The Potential_spliceType column corresponds to the categorization resulting from the HybridFinder function. Finally, proteome_database_used is the filename of the fasta proteome provided by the user to the HybridFinder function (this column mainly helps the user keep track of the proteome used).
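As a worked example of the ALC definition (total of the residue local confidence scores divided by the peptide length), the sketch below computes the score from a space-separated string of per-residue confidences. The input format and the helper name `alc_from_local_confidence` are assumptions made for illustration; check the column layout of your own PEAKS export.

```r
# Hypothetical helper: ALC = sum of per-residue local confidences divided
# by the number of residues. Input is assumed to be a single string of
# space-separated percentages, one value per residue.
alc_from_local_confidence <- function(local_conf) {
  scores <- as.numeric(strsplit(local_conf, " ", fixed = TRUE)[[1]])
  sum(scores) / length(scores)
}

# Three residues with confidences 90, 100 and 80: (90 + 100 + 80) / 3 = 90
example_alc <- alc_from_local_confidence("90 100 80")
print(example_alc)
```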

Figure 8

Screenshot of the HybridFinder hybrid peptide candidates vector (5 rows)

Figure 9

Screenshot of the bottom of the HybridFinder merged hybrid proteome (5 proteins)

The results might differ if the customALCcutoff score parameter is changed. If the results are exported, they are stored in a folder as .csv files and the merged proteome database is saved as a .fasta file. The peptide sequences predicted as spliced are considered preliminary candidates. Performing the rest of the steps is essential in order to obtain the final list.

checknetMHCpan and step2_wo_netMHCpan

The checknetMHCpan and step2_wo_netMHCpan functions represent the last step in Faridi et al.’s (2018) workflow. These two functions can be used after a database search has been performed using the merged hybrid proteome generated in step 1. Both functions retrieve the potential splice type categorization established in step 1. However, with checknetMHCpan the user can directly obtain MHC-I binding affinity predictions computed for all peptides between 9 and 12 amino acids using netMHCpan (Jurtz et al., 2017; Reynisson et al., 2020). The checknetMHCpan function returns two formats of the netMHCpan results and the updated database search results from the second run with the potential splice type. The first format presents the netMHCpan results as they are (Figure 10). The second format is a tidied version of the netMHCpan results (Figure 11), in which the rows are summarized into different columns to allow quick analysis of the netMHCpan results (especially when more than one HLA-I allele is used); these columns give the number of HLA-I alleles to which a given peptide is a strong or weak binder, as well as the corresponding alleles. Finally, the database search results dataframe (from the second PEAKS run) is updated with the potential splice type determined in the HybridFinder function for each peptide (Figure 12). Additionally, any sequence not identified in the HybridFinder output and solely attributed to the fake proteins created is removed. If exported, these outputs are stored in a folder containing 2 .csv files and a .tsv (tab-separated values) file.
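The tidied format can be pictured as collapsing one row per peptide and allele into one row per peptide. The base R sketch below reproduces that idea on made-up data. The output column names follow those described for the tidied results, but the input layout, the binding-level codes ("SB", "WB", "NB") and the helper name `tidy_binders` are placeholders chosen for illustration rather than the package's actual format.

```r
# Made-up long-format predictions: one row per peptide and allele.
long_results <- data.frame(
  Peptide   = c("AAAAAAAAA", "AAAAAAAAA", "CCCCCCCCC", "CCCCCCCCC"),
  HLA       = c("HLA-A*02:01", "HLA-B*07:02", "HLA-A*02:01", "HLA-B*07:02"),
  BindLevel = c("SB", "WB", "NB", "SB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarize to one row per peptide, listing and
# counting the alleles to which it is a strong or weak binder.
tidy_binders <- function(long) {
  peptides <- unique(long$Peptide)
  alleles_at <- function(p, level) {
    paste(long$HLA[long$Peptide == p & long$BindLevel == level],
          collapse = ",")
  }
  count_at <- function(p, level) {
    sum(long$Peptide == p & long$BindLevel == level)
  }
  data.frame(
    Peptide = peptides,
    strongBinder = vapply(peptides, alleles_at, character(1),
                          level = "SB", USE.NAMES = FALSE),
    weakBinder = vapply(peptides, alleles_at, character(1),
                        level = "WB", USE.NAMES = FALSE),
    strongBinder_count = vapply(peptides, count_at, integer(1),
                                level = "SB", USE.NAMES = FALSE),
    weakBinder_count = vapply(peptides, count_at, integer(1),
                              level = "WB", USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

tidy_results <- tidy_binders(long_results)
print(tidy_results)
```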

Figure 10

Screenshot of the checknetMHCpan netMHCpan results

(5 rows) HLA/MHC is the allele, Peptide is the amino acid sequence of the potential ligand, Core is the minimal 9 amino acid sequence core to enable HLA binding, Of is the starting position of the Core within the peptide, Gp and Gl are the position and the length of the deletions (respectively), if any. Ip and Il are the position and the length of the insertions (respectively), if any. Icore is the interaction core, Identity is PEPLIST (which indicates that peptides were used as input as opposed to proteins in fasta-format). Score is the raw prediction, Aff(nM) is the predicted IC50 value in nanoMolar units, %Rank is the percentile rank of the predicted affinity compared to a set of random natural ligands. BindLevel is designated by 3 qualifiers: Strong binder, Weak binder, None binder. Potential_spliceType is the categorization retrieved from the HybridFinder output on the potential splice type explanation of the peptide (i.e., linear, cis, trans).

Figure 11

Screenshot of the checknetMHCpan tidied netMHCpan results

(5 rows) Peptide is the amino acid sequence of the potential ligand. The strongBinder, weakBinder and noneBinder (this column is not shown in this figure) columns correspond to the alleles to which a given peptide is a strong, weak or none binder, respectively. If there is more than one allele, they are separated by commas. For each peptide, there is one %Rank column per allele (e.g., if 3 alleles were specified in the checknetMHCpan command, then each peptide will have 3 %Rank columns). strongBinder_count, weakBinder_count and noneBinder_count represent the number of alleles to which a peptide is a strong, weak or none binder. Lastly, the Potential_spliceType column is the categorization retrieved from the HybridFinder output on the potential splice type explanation of the peptide (i.e., linear, cis, trans).

Figure 12

Screenshot of the checknetMHCpan database search results updated with the Potential_spliceType column

(5 rows) Peptide is the amino acid sequence of the potential ligand, X.log10P represents the best -10logP identification score for the corresponding peptide. Mass represents the monoisotopic mass of the peptide, Length is the number of amino acid residues that constitute the given peptide, ppm is the precursor mass error, m.z is the precursor mass-to-charge ratio, Z is the precursor charge, RT is the Retention Time (elution time) for the spectrum, Area represents the area under the curve of the peptide feature found at the same m/z and retention time as the MS/MS scan, Fraction is the LC-MS run, id represents the precursor ID associated with the PSM, Scan is a number representing a unique index for tandem mass spectra (F[Fraction#]:Scan#), from.Chimera (this column is not shown in this figure) displays whether the identified peptide is from chimeric spectra, Source.File is the mzML/mzXML file used in the PEAKS analysis, PTM is the type of the post-translational modification, Ascore is the localization score assigned to modifications on the peptide, Found.By represents the analysis (in this case PEAKS DB). Peptide_no_mods represents the peptide sequence without modifications, Potential_spliceType is linear, cis or trans and is retrieved from the HybridFinder function.

Screenshot of the checknetMHCpan netMHCpan results (5 rows) HLA/MHC is the allele, Peptide is the amino acid sequence of the potential ligand, Core is the minimal 9 amino acid sequence core to enable HLA binding, Of is the starting position of the Core within the peptide, Gp and Gl are the position and the length of the deletions (respectively), if any. Ip and Il are the position and the length of the insertions (respectively), if any. Icore is the interaction core, Identity is PEPLIST (which indicates that peptides were used as input as opposed to proteins in fasta-format). Score is the raw prediction, Aff(nM) is the predicted IC50 value in nanoMolar units, %Rank is the percentile rank of the predicted affinity compared to a set of random natural ligands. BindLevel is designated by 3 qualifiers: Strong binder, Weak binder, None binder. Potential_spliceType is the categorization retrieved from the HybridFinder output on the potential splice type explanation of the peptide (i.e., linear, cis, trans). Screenshot of the checknetMHCpan tidied netMHCpan results (5 rows) Peptide is the amino acid sequence of the potential ligand, the strongBinder, weakBinder, noneBinder (this column not shown in this figure) columns correspond to the alleles to which a given peptide is a strong/weak/none binder to, respectively. If more than one allele, these are separated by commas. For each peptide, there will be %Rank columns per allele (e.g., If 3 alleles were specified in the checknetMHCpan command, then each peptide will have 3%Rank columns). strongBinder_count, weakBinder_count, noneBinder_count represent the number of alleles to which a peptide is a strong/weak/none binder to. Lastly, the Potential_spliceType column is the categorization retrieved from the HybridFinder output on the potential splice type explanation of the peptide (i.e., linear, cis, trans). 
The step2_wo_netMHCpan function is the equivalent of checknetMHCpan, except that it does not compute binding affinity. The function returns a netMHCpan-ready list of peptides (Figure 13) as well as the updated database search results (Figure 14). If exported, the results are written to a folder containing a .pep file and a .csv file.

Figure 13

Screenshot of the step2_wo_netMHCpan netMHCpan-ready input (5 rows)

Figure 14

Screenshot of the checknetMHCpan database search results updated with the Potential_spliceType column

(5 rows) The dataframe contains the same columns as in Figure 13

After running checknetMHCpan or step2_wo_netMHCpan, the final list of hybrid candidate peptides should be explored for further experimental validation (Figure 1).
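As a minimal sketch of how such a candidate list might be narrowed down in R, the example below filters a toy data frame for putative spliced peptides that are predicted strong binders. The column names Peptide, strongBinder_count, weakBinder_count and Potential_spliceType follow the tidied output described in the figure legends; the peptides and values are invented for illustration, and the filtering rule is an assumption rather than part of the package.

```r
# Toy stand-in for the tidied checknetMHCpan output.
# Peptides and values are invented for illustration only.
tidied <- data.frame(
  Peptide = c("AAAAAAAAL", "CCCCCCCCL", "DDDDDDDDL", "EEEEEEEEL"),
  strongBinder_count = c(1, 0, 2, 1),
  weakBinder_count = c(0, 1, 0, 0),
  Potential_spliceType = c("linear", "cis", "trans", "cis"),
  stringsAsFactors = FALSE
)

# Keep putative spliced peptides (cis or trans) that are predicted
# strong binders to at least one allele (an assumed, illustrative rule).
is_hybrid <- tidied$Potential_spliceType %in% c("cis", "trans")
is_binder <- tidied$strongBinder_count > 0
candidates <- tidied[is_hybrid & is_binder, ]

# Summarize the retained candidates by splice type.
print(table(candidates$Potential_spliceType))
print(candidates$Peptide)
```

A stricter or looser rule (for example, also accepting weak binders) can be obtained by changing the is_binder condition.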

Limitations

The presented package was developed and optimized for exports from PEAKS software. Results from other search engines or de novo sequencing software might therefore not work with this package. A limitation of the workflow itself is a possible bias towards results containing a higher proportion of leucine residues. This is due to the workaround proposed by Faridi et al. (2018), which is also used in this package: all isoleucines are switched to leucines in the database search results and in the proteome, since de novo sequencing does not differentiate between isoleucine and leucine. As mentioned above, it is also important to emphasize that this protocol does not enable the direct identification of high-confidence PSPs that are genuinely spliced by the proteasome in vivo. Instead, it enables the computational identification of putative PSPs, which should then be validated experimentally in a rigorous manner, as shown in Figure 1.
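The effect of the isoleucine-to-leucine switch can be illustrated with a short base R sketch. The two peptide sequences are hypothetical, and gsub is used here only to mimic the substitution; it is not necessarily how the package implements it.

```r
# Two hypothetical peptides that differ only at one isoleucine/leucine position.
peptides <- c("SLIDEKAAV", "SLLDEKAAV")

# Replace every isoleucine (I) with leucine (L), mimicking the workaround.
normalized <- gsub("I", "L", peptides, fixed = TRUE)

# After the switch, the two sequences can no longer be told apart,
# and the reported sequence contains only leucine at that position.
print(normalized)
print(length(unique(normalized)))
```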

Troubleshooting

Problem 1

While installing the .tar.gz file for the package, you may run into the following error: "Error in install.packages : type == "both" cannot be used with 'repos = NULL'"

Potential solution

The solution is to invoke the install.packages function while specifying where the package is located, setting the repository (repos) to NULL and setting the type to "source" (source package). Note that R argument names are case-sensitive, so the argument must be written type, not Type.

> install.packages("~/Downloads/RHybridFinder_0.1.0.tar", repos = NULL, type = "source")

Problem 2

While running HybridFinder, you may run into the following error: "Error in prepare_input_for_HF(denovo_candidates, db_search): Please make sure you have the right input. N.B: The de novo results data frame should be the first input"

Potential solution

Verify that the de novo data frame has been correctly imported. Since the de novo results file is in .csv format, the separator should be a comma ",", stringsAsFactors should be set to FALSE and the header should be set to TRUE. Please refer to step 1: Loading inputs into R.

Verify that the HybridFinder parameters are properly typed. The de novo sequencing results data frame is given first, followed by the database search results. Alternatively, name the parameters explicitly and assign them their appropriate objects (i.e., denovo_candidates = denovo_results_human_liver_Exp1). Please refer to step 2: Run HybridFinder.
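The import settings described above can be checked with a short base R sketch before calling HybridFinder. The file written here is a tiny stand-in created in a temporary location; its column names and values are hypothetical and do not reproduce the real PEAKS export layout.

```r
# Write a tiny stand-in for a de novo results export to a temporary file.
# Column names and values are hypothetical, not the real PEAKS layout.
denovo_path <- tempfile(fileext = ".csv")
writeLines(
  c("Scan,Peptide,ALC",
    "F1:100,AAAAAAAAL,95",
    "F1:101,CCCCCCCCL,88"),
  denovo_path
)

# Import with the settings recommended in the troubleshooting step.
denovo_results <- read.csv(denovo_path, sep = ",", header = TRUE,
                           stringsAsFactors = FALSE)

# Sanity checks: a non-empty data frame whose peptide column is character
# (it would be a factor if stringsAsFactors were TRUE).
stopifnot(is.data.frame(denovo_results), nrow(denovo_results) > 0)
stopifnot(is.character(denovo_results$Peptide))
str(denovo_results)
```

If any of these checks fails on the real file, the import call is the first thing to revisit.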

Problem 3

While running HybridFinder, you may run into the following error: "Error in `$<-.data.frame`(`*tmp*`, "db_id", value = character(0)) : replacement has 0 rows, data has […]"

Potential solution

Verify that the database search data frame has been correctly imported. Since the database search results file is in .csv format, the separator should be a comma ",", stringsAsFactors should be set to FALSE and the header should be set to TRUE. Please refer to step 1: Loading inputs into R.

Problem 4

While running checknetMHCpan, the following error may be displayed: "Error in checknetMHCpan[…]: Please provide the proper input"

Potential solution

Verify that the de novo and database search data frames are not switched. Please refer to step 5: Run checknetMHCpan.

Problem 5

While running checknetMHCpan, the following error may be displayed: "Please check the input alleles: […]"

Potential solution

Ensure that the alleles are in the right format and that each allele is written correctly (e.g., HLA-A03:01 or HLA-A*03:01). Please refer to step 4: Load checknetMHCpan inputs into R.
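A quick format check on the allele names can catch typing mistakes before the function is run. The helper below is a hypothetical sketch, not part of RHybridFinder, and its regular expression is an assumption that only covers names shaped like the two examples above.

```r
# Hypothetical helper, not part of RHybridFinder: flags allele names that do
# not look like "HLA-A03:01" or "HLA-A*03:01".
is_valid_allele <- function(alleles) {
  grepl("^HLA-[A-Z]\\*?[0-9]{2,3}:[0-9]{2,3}$", alleles)
}

# Two well-formed names, one missing the colon, one missing the "HLA-" prefix.
alleles <- c("HLA-A03:01", "HLA-A*03:01", "HLA-A0301", "A*03:01")
valid <- is_valid_allele(alleles)
print(data.frame(allele = alleles, looks_valid = valid))
```

A name that passes this check can still be rejected if netMHCpan does not support that allele, so the check only rules out formatting errors.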

Problem 6

While running checknetMHCpan, if the path to netMHCpan is not correct, the following error might appear:

sh: 1: [/temporary directory/netMHCpan]: not found
error in running command

Potential solution

The issue could be either that the directory does not contain the netMHCpan file or that the directory path was not written correctly (e.g., 'usr/bin/' vs. '/usr/bin' or '/usr/bin/', where the first example is wrong and the other two are correct). Please refer to step 4: Load checknetMHCpan inputs into R.
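Both causes can be tested from R before running checknetMHCpan. The helper below is a hypothetical sketch, not part of RHybridFinder; it assumes a Unix-like system where absolute paths start with "/" and that the executable is a file named netMHCpan inside the given directory. A temporary directory with an empty placeholder file stands in for a real installation.

```r
# Hypothetical helper, not part of RHybridFinder: checks that a directory is
# given as an absolute path and contains a file named "netMHCpan".
check_netmhcpan_dir <- function(dir) {
  is_absolute <- startsWith(dir, "/")
  # Drop any trailing slash so both "/usr/bin" and "/usr/bin/" are handled.
  has_binary <- file.exists(file.path(sub("/+$", "", dir), "netMHCpan"))
  is_absolute && has_binary
}

# Placeholder installation: a temporary directory with an empty "netMHCpan" file.
demo_dir <- file.path(tempdir(), "netMHCpan_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "netMHCpan")))

print(check_netmhcpan_dir(demo_dir))                 # absolute path
print(check_netmhcpan_dir(paste0(demo_dir, "/")))    # absolute path, trailing slash
print(check_netmhcpan_dir(sub("^/", "", demo_dir)))  # leading slash missing
```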

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Etienne Caron (etienne.caron@umontreal.ca).

Materials availability

This study did not generate new unique reagents.

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

Human liver sample from autologous donor 17 - HLA Ligand Atlas	(Marcu et al., 2021)	PXD019643

Software and algorithms

RStudio (version 1.3.1093)	RStudio website https://www.rstudio.com	SCR_000432
R (>3.5.0)	R statistical software (https://www.r-project.org/)	SCR_001905
PEAKS (PEAKS X studio)	PEAKS website: https://www.bioinfor.com//	N/A
RHybridFinder (v.0.2.0)	https://cran.r-project.org/web/packages/RHybridFinder/index.html	N/A
seqinr (v. 4.2-5)	CRAN - (Charif and Lobry, 2007)	N/A
foreach (v.1.5.1), doParallel (v. 1.0.16)	CRAN	N/A
netMHCpan (v. 4.0 & 4.1)	DTU health tech: https://services.healthtech.dtu.dk (Reynisson et al., 2020)	SCR_018182
hybrid finder	Faridi et al. (2018) (workflow on which the package is based)	N/A

18 in total

Review 1. Towards a systems understanding of MHC class I and MHC class II antigen presentation.

Authors: Jacques Neefjes; Marlieke L M Jongsma; Petra Paul; Oddmund Bakke
Journal: Nat Rev Immunol Date: 2011-11-11 Impact factor: 53.106

2. An antigenic peptide produced by reverse splicing and double asparagine deamidation.

Authors: Alexandre Dalet; Paul F Robbins; Vincent Stroobant; Nathalie Vigneron; Yong F Li; Mona El-Gamil; Ken-ichi Hanada; James C Yang; Steven A Rosenberg; Benoît J Van den Eynde
Journal: Proc Natl Acad Sci U S A Date: 2011-06-13 Impact factor: 11.205

3. Identification of spliced peptides in pancreatic islets uncovers errors leading to false assignments.

Authors: Cheryl F Lichti
Journal: Proteomics Date: 2021-03-05 Impact factor: 3.984

4. The 20S proteasome splicing activity discovered by SpliceMet.

Authors: Juliane Liepe; Michele Mishto; Kathrin Textoris-Taube; Katharina Janek; Christin Keller; Petra Henklein; Peter Michael Kloetzel; Alexey Zaikin
Journal: PLoS Comput Biol Date: 2010-06-24 Impact factor: 4.475

5. A large fraction of HLA class I ligands are proteasome-generated spliced peptides.

Authors: Juliane Liepe; Fabio Marino; John Sidney; Anita Jeko; Daniel E Bunting; Alessandro Sette; Peter M Kloetzel; Michael P H Stumpf; Albert J R Heck; Michele Mishto
Journal: Science Date: 2016-10-20 Impact factor: 47.728

6. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data.

Authors: Birkir Reynisson; Bruno Alvarez; Sinu Paul; Bjoern Peters; Morten Nielsen
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971

7. Estimating the Contribution of Proteasomal Spliced Peptides to the HLA-I Ligandome.

Authors: Roman Mylonas; Ilan Beer; Christian Iseli; Chloe Chong; Hui-Song Pak; David Gfeller; George Coukos; Ioannis Xenarios; Markus Müller; Michal Bassani-Sternberg
Journal: Mol Cell Proteomics Date: 2018-08-31 Impact factor: 5.911

8. HLA Ligand Atlas: a benign reference of HLA-presented peptides to improve T-cell-based cancer immunotherapy.

Authors: Ana Marcu; Leon Bichmann; Leon Kuchenbecker; Daniel Johannes Kowalewski; Lena Katharina Freudenmann; Linus Backert; Lena Mühlenbruch; András Szolek; Maren Lübke; Philipp Wagner; Tobias Engler; Sabine Matovina; Jian Wang; Mathias Hauri-Hohl; Roland Martin; Konstantina Kapolou; Juliane Sarah Walz; Julia Velz; Holger Moch; Luca Regli; Manuela Silginer; Michael Weller; Markus W Löffler; Florian Erhard; Andreas Schlosser; Oliver Kohlbacher; Stefan Stevanović; Hans-Georg Rammensee; Marian Christoph Neidert
Journal: J Immunother Cancer Date: 2021-04 Impact factor: 13.751

9. Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics.

Authors: Mathias Wilhelm; Daniel P Zolg; Michael Graber; Siegfried Gessulat; Tobias Schmidt; Karsten Schnatbaum; Celina Schwencke-Westphal; Philipp Seifert; Niklas de Andrade Krätzig; Johannes Zerweck; Tobias Knaute; Eva Bräunlein; Patroklos Samaras; Ludwig Lautenbacher; Susan Klaeger; Holger Wenschuh; Roland Rad; Bernard Delanghe; Andreas Huhmer; Steven A Carr; Karl R Clauser; Angela M Krackhardt; Ulf Reimer; Bernhard Kuster
Journal: Nat Commun Date: 2021-06-07 Impact factor: 14.919

10. Proteasomes generate spliced epitopes by two different mechanisms and as efficiently as non-spliced epitopes.

Authors: F Ebstein; K Textoris-Taube; C Keller; R Golnik; N Vigneron; B J Van den Eynde; B Schuler-Thurner; D Schadendorf; F K M Lorenz; W Uckert; S Urban; A Lehmann; N Albrecht-Koepke; K Janek; P Henklein; A Niewienda; P M Kloetzel; M Mishto
Journal: Sci Rep Date: 2016-04-06 Impact factor: 4.379