customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search.

Abstract

Database search is the most widely used approach for peptide and protein identification in mass spectrometry-based proteomics studies. Our previous study showed that sample-specific protein databases derived from RNA-Seq data can better approximate the real protein pools in the samples and thus improve protein identification. More importantly, single nucleotide variations, short insertions and deletions and novel junctions identified from RNA-Seq data make the protein database more complete and sample-specific. Here, we report an R package, customProDB, that enables the easy generation of customized databases from RNA-Seq data for proteomics search. This work bridges genomics and proteomics studies and facilitates cross-omics data integration.
AVAILABILITY AND IMPLEMENTATION: customProDB and related documents are freely available at http://bioconductor.org/packages/2.13/bioc/html/customProDB.html.

Year: 2013 PMID: 24058055 PMCID: PMC3842753 DOI: 10.1093/bioinformatics/btt543

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Sequence database search is the primary method for peptide and protein identification in mass spectrometry-based shotgun proteomics (Nesvizhskii, 2010). The completeness and specificity of the sequence database directly affect the search results. We recently showed that sample-specific databases derived from RNA-Seq data can better represent the real protein catalogs in biological samples and thus improve protein identification. In addition, sample-specific databases allow the identification of variant peptides (Wang et al.). With the advancement of both shotgun proteomics and next-generation sequencing (NGS) technologies, many researchers have started to apply both technologies to the same samples in parallel to gain a multi-dimensional understanding of cellular systems (Chen et al.; Nagaraj et al.). Even for proteomics studies without corresponding RNA-Seq data, sequencing data (e.g. whole-genome sequencing, exome sequencing) are likely to be available for similar samples. Here, we report an R package, customProDB, which is dedicated to generating customized databases from NGS data, with a focus on RNA-Seq data, for proteomics search. Based on the assumption that undetected or lowly expressed transcripts are less likely to produce detectable proteins, the package allows users to filter out proteins with undetected or lowly expressed transcripts. It also allows users to incorporate single nucleotide variations, short insertions and deletions and novel junctions identified from RNA-Seq data into the protein database. Figure 1 illustrates the overall structure of the package. Methods and functions implemented in the customProDB package are described in detail in a tutorial available as online Supplementary Material (Supplementary File 1). In Section 2, we briefly present the main functionalities of the customProDB package.

Fig. 1.

Schematic overview of the customProDB package

2 DESCRIPTION

2.1 Preparing annotation files

For model organisms, customProDB allows users to download annotation data from the University of California, Santa Cruz (UCSC) table browser using rtracklayer (Lawrence et al.) or from ENSEMBL using biomaRt (Durinck et al.) and then processes them into a standardized data structure. For non-model organisms, users can manually provide the annotation data in the UCSC or ENSEMBL format.

2.2 Building customized protein databases

2.2.1 Input data

customProDB requires a Binary Alignment/Map (BAM) file and a Variant Call Format (VCF) file as input for each sample of interest. The latter can be generated from a BAM file using single nucleotide polymorphism calling tools such as SAMtools and the Genome Analysis Toolkit (GATK). customProDB also accepts transcript expression estimates when available. For junction analysis, a Browser Extensible Data (BED) file that contains putative splice junctions is needed. This file can be generated by software such as TopHat (Trapnell et al.) during read alignment.
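To make the junction input concrete, the following Python sketch reads one line of a junction BED file. It is an illustration only and not part of customProDB. It assumes the 12-column, two-block layout that TopHat writes to its junctions file, with the score column holding the number of supporting reads. The names Junction and parse_junction_bed_line and the example line are hypothetical.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    left_block_end: int     # 1-based position of the last base before the gap
    right_block_start: int  # 1-based position of the first base after the gap
    strand: str
    reads: int              # assumption: the BED score column is the read count


def parse_junction_bed_line(line: str) -> Junction:
    """Parse one 12-column BED line describing a two-block junction."""
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    score, strand = int(fields[4]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    # BED starts are 0-based, so start + size is the 1-based end of block 1.
    left_block_end = chrom_start + block_sizes[0]
    # start + offset of block 2 is 0-based; add 1 for a 1-based coordinate.
    right_block_start = chrom_start + block_starts[1] + 1
    return Junction(chrom, left_block_end, right_block_start, strand, score)


# Made-up example line, not taken from real data.
example = "chr1\t14829\t15020\tJUNC00000001\t12\t-\t14829\t15020\t255,0,0\t2\t50,58\t0,133"
junction = parse_junction_bed_line(example)
print(junction)
```

The two coordinates are named by genomic position rather than as donor and acceptor, because which side is the donor depends on the strand.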

2.2.2 Expression filter

For a given BAM file, the calculateRPKM function computes the reads per kilobase per million reads sequenced (RPKM) for each transcript based on reads mapped to the exon region. Then the Outputproseq function outputs a FASTA file for proteins with an RPKM value greater than a user-defined cutoff. For generating a consensus database from n (n > 1) related samples, a function Outputsharedpro is provided to output protein sequences for transcripts with an RPKM value greater than a user-defined cutoff in k (1 < k < n) out of the n samples.
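The expression filter can be sketched in a few lines of Python. This is a minimal illustration and not the package's implementation. It uses the usual RPKM definition: reads mapped to the exons of a transcript, multiplied by 10^9, divided by the product of the total exon length in base pairs and the total number of mapped reads. The function names rpkm and shared_transcripts are hypothetical, and reading "k out of the n samples" as "at least k samples" is an assumption.

```python
from typing import Dict, List, Set


def rpkm(exon_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and library size must be positive")
    return exon_reads * 1e9 / (exon_length_bp * total_mapped_reads)


def shared_transcripts(samples: List[Dict[str, float]], cutoff: float, k: int) -> Set[str]:
    """Transcripts whose RPKM exceeds `cutoff` in at least `k` samples.

    Assumption: "k out of n" is interpreted as "at least k".
    """
    counts: Dict[str, int] = {}
    for sample in samples:
        for transcript, value in sample.items():
            if value > cutoff:
                counts[transcript] = counts.get(transcript, 0) + 1
    return {transcript for transcript, n in counts.items() if n >= k}


# 500 reads on 2000 bp of exon in a library of 20 million mapped reads.
value = rpkm(500, 2000, 20_000_000)

# Made-up RPKM tables for three samples.
samples = [
    {"tx1": 12.5, "tx2": 0.4, "tx3": 0.0},
    {"tx1": 3.0, "tx2": 2.1, "tx3": 0.2},
    {"tx1": 0.0, "tx2": 5.0, "tx3": 9.0},
]
kept = shared_transcripts(samples, cutoff=1.0, k=2)
print(value, sorted(kept))
```

In this toy input, tx3 passes the cutoff in only one sample and is therefore excluded from the consensus set.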

2.2.3 Variation annotation

The InputVcf function generates a GRanges (Lawrence et al.) object from a VCF file, which contains variation information from one or multiple samples. The Multiple_VCF function outputs a GRanges object with variations present in multiple samples. For a given GRanges object, the Varlocation function provides an overview of the genomic locations of all variations. Protein-level variations are then identified for both single nucleotide variations and short insertions and deletions. VCF files derived from whole-genome or exome sequencing data can also be used to generate customized databases.
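The step from a nucleotide variation to a protein-level variation can be illustrated for the simplest case, a single nucleotide variation inside a coding sequence. The Python sketch below is not customProDB code. It assumes the variant position has already been mapped to a 1-based coordinate in the coding sequence on the transcript strand, and it uses the standard genetic code. The function name snv_protein_change is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, ordered TCAG at each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def snv_protein_change(cds: str, cds_pos: int, alt: str) -> str:
    """Describe the amino-acid effect of one SNV at a 1-based CDS position."""
    cds, alt = cds.upper(), alt.upper()
    if len(alt) != 1 or alt not in BASES:
        raise ValueError("alt must be a single base from A, C, G, T")
    if not 1 <= cds_pos <= len(cds) - len(cds) % 3:
        raise ValueError("position is outside the complete codons")
    codon_start = (cds_pos - 1) // 3 * 3   # 0-based start of the affected codon
    offset = (cds_pos - 1) % 3             # position of the SNV within the codon
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return f"{ref_aa}{codon_start // 3 + 1}{alt_aa} ({kind})"


# Toy coding sequence ATG GGT GAA (Met-Gly-Glu).
print(snv_protein_change("ATGGGTGAA", 5, "A"))  # GGT -> GAT
print(snv_protein_change("ATGGGTGAA", 6, "C"))  # GGT -> GGC
```

Only non-synonymous changes alter the protein sequence, so only those would need to be written to a variant protein database. Short insertions and deletions need separate handling because they can shift the reading frame.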

2.2.4 Junction analysis

Based on an input BED file that contains splice junctions derived from RNA-Seq data, the function JunctionType classifies all junctions into different categories. Then the function OutputNovelJun can be used to generate three-frame translated peptide sequences for all putative novel junctions.
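Three-frame translation of a junction can be sketched as follows. The example is illustrative Python and not the OutputNovelJun implementation. It assumes the flanking sequences on either side of the junction have already been extracted and oriented on the transcript strand, and it uses the standard genetic code. The names translate and three_frame_translate are hypothetical.

```python
from typing import List

BASES = "TCAG"
# Standard genetic code, ordered TCAG at each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translate(left_flank: str, right_flank: str) -> List[str]:
    """Join the two flanks across the junction and translate frames 0, 1 and 2."""
    joined = left_flank + right_flank
    return [translate(joined[frame:]) for frame in range(3)]


# Toy flanks; the junction lies between ATGGCC and AAGTGA.
frames = three_frame_translate("ATGGCC", "AAGTGA")
print(frames)
```

Three frames are needed because the reading frame at a novel junction is not known in advance. A stop codon is kept as an asterisk here, so downstream code can decide whether to truncate or discard that frame.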

3 APPLICATION

The development of the customProDB package was mainly driven by two demands: (i) to provide a customized protein database from RNA-Seq data for a specific sample, and (ii) to provide a consensus database from a pool of genetically similar samples. Therefore, we provide two integrated functions that accomplish these tasks in a single step (Fig. 1). The value of customized databases for individual samples has already been demonstrated (Wang et al.). Here, we provide an example of a consensus database. A consensus database was generated based on RNA-Seq data from 64 colon cancer samples from The Cancer Genome Atlas project (TCGA, 2012). Previously published proteomics data from three colon cancer patients (Li et al.) were searched against the consensus database (Fig. 2). By including variation and novel junction information in the consensus database, we were able to identify variant peptides and novel junction peptides from the proteomics datasets (Supplementary Files 2 and 3). We did not gain significant improvements in protein identification by applying the transcript expression threshold, possibly because of the high inter-patient heterogeneity. However, compared with the regular RefSeq database search, more peptide-spectrum matches were identified using the consensus database. This example shows the potential of using a consensus database to capture protein features shared by a cohort of samples.

Fig. 2.

Consensus protein database generation and proteomics search results for three independent colon cancer patients (400T, 782T, 823T)

4 CONCLUSION

The huge amount of genomic and transcriptomic data available from NGS experiments has enhanced and will continue to enhance shotgun proteomics studies. However, it is non-trivial for ordinary proteomics researchers to use such data directly. The customProDB package fills this gap by providing an efficient tool to generate customized protein databases using expression and variation information available from NGS data.

Funding: National Institutes of Health (grant U24 CA159988).

Conflict of Interest: none declared.

REFERENCES

1. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics.

Authors: Alexey I Nesvizhskii
Journal: J Proteomics Date: 2010-09-08 Impact factor: 4.044

2. Protein identification using customized protein sequence databases derived from RNA-Seq data.

Authors: Xiaojing Wang; Robbert J C Slebos; Dong Wang; Patrick J Halvey; David L Tabb; Daniel C Liebler; Bing Zhang
Journal: J Proteome Res Date: 2011-12-14 Impact factor: 4.466

3. A bioinformatics workflow for variant peptide detection in shotgun proteomics.

Authors: Jing Li; Zengliu Su; Ze-Qiang Ma; Robbert J C Slebos; Patrick Halvey; David L Tabb; Daniel C Liebler; William Pao; Bing Zhang
Journal: Mol Cell Proteomics Date: 2011-03-09 Impact factor: 5.911

4. rtracklayer: an R package for interfacing with genome browsers.

Authors: Michael Lawrence; Robert Gentleman; Vincent Carey
Journal: Bioinformatics Date: 2009-05-25 Impact factor: 6.937

5. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.

Authors: Steffen Durinck; Paul T Spellman; Ewan Birney; Wolfgang Huber
Journal: Nat Protoc Date: 2009-07-23 Impact factor: 13.491

6. Personal omics profiling reveals dynamic molecular and medical phenotypes.

Authors: Rui Chen; George I Mias; Jennifer Li-Pook-Than; Lihua Jiang; Hugo Y K Lam; Rong Chen; Elana Miriami; Konrad J Karczewski; Manoj Hariharan; Frederick E Dewey; Yong Cheng; Michael J Clark; Hogune Im; Lukas Habegger; Suganthi Balasubramanian; Maeve O'Huallachain; Joel T Dudley; Sara Hillenmeyer; Rajini Haraksingh; Donald Sharon; Ghia Euskirchen; Phil Lacroute; Keith Bettinger; Alan P Boyle; Maya Kasowski; Fabian Grubert; Scott Seki; Marco Garcia; Michelle Whirl-Carrillo; Mercedes Gallardo; Maria A Blasco; Peter L Greenberg; Phyllis Snyder; Teri E Klein; Russ B Altman; Atul J Butte; Euan A Ashley; Mark Gerstein; Kari C Nadeau; Hua Tang; Michael Snyder
Journal: Cell Date: 2012-03-16 Impact factor: 41.582

7. Comprehensive molecular characterization of human colon and rectal cancer.

Authors:
Journal: Nature Date: 2012-07-18 Impact factor: 49.962

8. Software for computing and annotating genomic ranges.

Authors: Michael Lawrence; Wolfgang Huber; Hervé Pagès; Patrick Aboyoun; Marc Carlson; Robert Gentleman; Martin T Morgan; Vincent J Carey
Journal: PLoS Comput Biol Date: 2013-08-08 Impact factor: 4.475

9. Deep proteome and transcriptome mapping of a human cancer cell line.

Authors: Nagarjuna Nagaraj; Jacek R Wisniewski; Tamar Geiger; Juergen Cox; Martin Kircher; Janet Kelso; Svante Pääbo; Matthias Mann
Journal: Mol Syst Biol Date: 2011-11-08 Impact factor: 11.429

10. TopHat: discovering splice junctions with RNA-Seq.

Authors: Cole Trapnell; Lior Pachter; Steven L Salzberg
Journal: Bioinformatics Date: 2009-03-16 Impact factor: 6.937
