Integrating genomic, transcriptomic, and interactome data to improve peptide and protein identification in shotgun proteomics.

Abstract

Mass spectrometry (MS)-based shotgun proteomics is an effective technology for global proteome profiling. The ultimate goal is to assign tandem MS spectra to peptides and subsequently infer proteins and their abundance. In addition to database searching and protein assembly algorithms, computational approaches have been developed to integrate genomic, transcriptomic, and interactome information to improve peptide and protein identification. Earlier efforts focus primarily on making databases more comprehensive using publicly available genomic and transcriptomic data. More recently, with the increasing affordability of the Next Generation Sequencing (NGS) technologies, personalized protein databases derived from sample-specific genomic and transcriptomic data have emerged as an attractive strategy. In addition, incorporating interactome data not only improves protein identification but also puts identified proteins into their functional context and thus facilitates data interpretation. In this paper, we survey the major integrative bioinformatics approaches that have been developed during the past decade and discuss their merits and demerits.

Substances：
Proteome

Year: 2014 PMID: 24792918 PMCID: PMC4059263 DOI: 10.1021/pr500194t

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Proteins are key functional molecules in cells and serve as a link between genotype and phenotype. Global proteomic analysis allows direct measurements of proteins, and when integrated with genomic and transcriptomic studies, provides a great opportunity to understand the information flow from DNA to protein to phenotype. Among different high-throughput proteomic technologies, mass spectrometry (MS)-based shotgun proteomics has had the greatest impact in biological and biomedical research. Recent technology advances have made this approach increasingly applicable for global profiling of cell and tissue proteomes, with the capacity to detect more than 10000 proteins from a single biological sample.[1] Figure 1 illustrates the typical workflow of a shotgun proteomics study. In the experimental phase, proteins are enzymatically digested into peptides, which are fractionated and analyzed by liquid chromatography–tandem mass spectrometry (LC–MS/MS). In the data analysis phase, tandem mass spectra are interpreted to peptides by computational algorithms and then assembled into proteins. The most widely used method for peptide identification is database searching by computational tools such as SEQUEST, Mascot,[2] X!Tandem,[3] or MyriMatch.[4] These tools first perform an in silico digestion of all proteins in a reference protein database to enumerate all candidate peptide sequences and then construct a theoretical spectrum for each candidate peptide sequence. Experimentally observed fragment ion spectra are compared to the theoretical spectra and then linked to corresponding peptides if a comparison produces a statistically significant peptide-spectrum match (PSM) score. Finally, identified peptides are transformed into a list of identified proteins through protein assembly tools such as IDPicker,[5] MassSieve,[6] or ProteinProphet,[7] among others.
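The in silico digestion step described above can be illustrated with a minimal sketch. It assumes the common trypsin rule (cleave after K or R unless the next residue is P); the function name `tryptic_digest` and the parameters `missed_cleavages` and `min_len` are illustrative choices and are not taken from SEQUEST, Mascot, X!Tandem, or MyriMatch.

```python
def tryptic_digest(protein, missed_cleavages=0, min_len=6):
    """Enumerate candidate tryptic peptides from one protein sequence.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    """
    # Step 1: split the protein into fully cleaved fragments.
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1:i + 2]  # empty string at the C-terminus
        if aa in "KR" and next_aa != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])

    # Step 2: join neighbouring fragments to allow missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_len:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    for pep in tryptic_digest("AAAAAAKPGGGGGRCCCCCCK", missed_cleavages=1):
        print(pep)
```

A search engine would then compute a theoretical fragment spectrum for each candidate peptide; that scoring step is engine-specific and is not sketched here.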

Figure 1

A typical workflow of shotgun proteomics.

Although this strategy has been successful, several critical challenges cannot be fully addressed by simply improving database searching and protein assembly algorithms. On one hand, all database searching algorithms rely on a reference protein database, which is typically incomplete. First, novel protein-coding genes are still being identified.[8] Second, a single gene locus can produce multiple transcript and protein isoforms through alternative splicing, and it remains difficult to completely catalogue all protein-coding transcripts that can be generated from known gene loci.[9] Moreover, sequence variants including single nucleotide polymorphisms (SNPs), somatic mutations, insertions, deletions, and gene fusions are often neglected in commonly used reference protein databases. On the other hand, despite substantial improvements, reliable identification of low-abundant proteins remains challenging. During the past decade, various computational methods have been developed to integrate orthogonal data sources to improve peptide and protein identification in shotgun proteomics studies. These approaches take advantage of the rapidly growing volumes of genomic, transcriptomic, and interactome data. Here we review different integrative bioinformatics strategies that have been used to address the above-mentioned challenges. Relevant studies are summarized in Table 1, and the list continues to grow. This review is limited to human and mouse studies; studies focusing on microbes or plants are not included.

Table 1

List of Published Orthogonal Data Assisted Proteomics Studies

Study	Strategy	Ref.
Genomic Information
Choudhary et al.	six-frame translation using the draft of human genome	(13)
Fermin et al.	six-frame translation of whole human genome	(19)
Sevinsky et al.	six-frame translation of whole human genome; peptide isoelectric point (pI)	(18)
Bitton et al.	prescreening searches on databases translated from individual chromosomes; matched entries were then combined with the Celera database entries and used for a second search	(12)
Mo et al.	exon–exon junction database	(29)
Power et al.	noncontiguous junction peptides in a “full length transcript”	(30)
Gatlin et al.	generating dynamically all possible SNPs	(40)
Roth et al.	creating a highly annotated database, including splicing, PTMs, and SNPs	(47)
Bunger et al.	reference protein database; tryptic peptide database created from dbSNP; peptide pI	(41)
Schandorff et al.	elongating IPI sequences with theoretical N-terminal peptides, variant peptides from cSNP, variant peptides from conflict annotation in Swiss-Prot, and proteolytic enzyme and keratin sequences	(37)
Xi et al.	human disease-related variants from OMIM, PMD, and Swiss-Prot	(44)
Nijveen et al.	20-mer variant peptides generated by three-frame translation from mRNA sequences including SNPs in dbSNP	(35)
Li et al.	combined database of normal proteins and variant peptides; modified FDR estimation	(46)
Su et al.	a pipeline of nontargeted proteomics for identifying SAP peptides in human plasma and quantifying them using targeted proteomics	(43)
Khatun et al.	whole genome proteogenomic mapping to identify novel protein coding regions for ENCODE cell line proteomics data	(8)

Improving Peptide Identification

Bottom-up proteomics technologies rely on peptide identification to infer protein presence. Integrating genomic and transcriptomic information allows the identification of peptides derived from novel protein coding genes, splice variants, and sequence variations, leading to a more comprehensive proteomic characterization of biological and clinical samples.

Novel Protein-Coding Genes

The database searching strategy relies on complete genomes and thorough protein coding gene annotations. Although whole genome sequencing data for human and other model organisms have been available for a decade, genome annotation remains incomplete even for the human genome. Several studies have demonstrated the potential of shotgun proteomics in the discovery of novel protein-coding genes in human and mouse using a variety of approaches.[8,10−12] The most intuitive approach to enable the identification of novel protein-coding genes by shotgun proteomics is to use a database containing a six-frame translation of the whole genome. Right after the release of the initial human genome draft sequence, Choudhary and colleagues searched an LC–MS/MS data set containing peptides from at least 22 human proteins against a curated protein database, an expressed sequence tag (EST)-derived database, and a genome-derived database.[13] Although the data set was small and the majority of proteins were found much more rapidly using the curated protein database, the study pioneered the use of genome translated databases for shotgun proteomics. A six-frame translation database does not depend on gene models and therefore contains all possible protein forms except for peptides spanning the exon junction regions. The strategy has been widely used in microbial studies because microbial genomes are small and lack alternative splicing.[14,15] Since 98% of the human genome is not protein coding,[16] this method dramatically increases the searching space and computational time. Tools have been developed to automate this strategy and make it practical for mammalian genomes.[17] However, a major concern is the large amount of background noise introduced by this strategy. Therefore, extra efforts are needed when applying this method to large, complex mammalian genomes. Methods have been developed to constrain the database size and complexity before the search. 
For experiments that use immobilized pH gradient strips to fractionate peptides, each fraction only contains peptides of a narrow isoelectric point (pI) range. This information has been used for the development of GENQUEST,[18] a method that restricts the peptide search space based on the pI range. Specifically, after the six-frame translation of the genome, each putative protein is digested in silico with trypsin and the pI is calculated for each peptide. Peptides are then grouped together based on their pIs. Spectra generated from a specific peptide fraction are only searched against the subset of peptides with pIs in the same range. This method has been shown to produce accurate and sensitive results comparable to searching a curated protein database. Another method utilizes a series of prescreening searches against databases translated from individual chromosomes to identify and eliminate nonmatching entries, and then a second search is performed against all the matched entries combined with a curated protein database. This method dramatically reduces the database size for individual searches and has been successfully used to identify novel peptides in two human cell lines.[12] Methods have also been developed to control the peptide false discovery rate (FDR) after the search. Fermin et al. searched a data set from the Human Proteome Organization Plasma Proteome Project against a six-frame translation of the entire human genome to identify novel blood proteins.[19] They used a Poisson model, which takes into account the number of spectra searched, the score threshold applied to accepting a match, the size of the target sequence database, and the length of the matched protein sequence, to estimate the confidence of peptide identifications.
A detailed analysis showed that among the 2309 high quality intragenic peptides, 73% were completely contained within annotated exons, 6% partially overlapped with annotated exons, and 21% were aligned to nonexonic regions. Since the emergence of RNA sequencing (RNA-Seq) technology, RNA-Seq data have been widely used to facilitate proteomics studies of nonmodel organisms that do not have a fully sequenced and well-annotated genome, such as many microorganisms and plants.[20−23] In human and model organism studies, RNA-Seq has revealed a large number of transcribed unannotated regions,[24,25] and some of them may represent novel protein-coding regions. Because gene expression changes over time and conditions, and each data set is associated with sequencing errors and mapping errors, combining different RNA-Seq data sets from an organism can lead to a more comprehensive and accurate reference protein database for the organism. A recent study in Caenorhabditis elegans generated an aggregated database from public C. elegans RNA-Seq data sets, allowing the identification of hundreds of novel genes in an MS/MS data set from 11 developmental stages of C. elegans.[26]
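As an illustration of the six-frame translation strategy discussed in this section, the following minimal sketch translates a DNA sequence in all three frames of both strands and keeps the stop-to-stop segments as candidate database entries. It assumes the standard genetic code; the function names and the `min_len` threshold are hypothetical and do not correspond to any of the tools cited above.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from position 0; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_segments(dna, min_len=7):
    """Return (strand, frame, segment) for stop-free segments of min_len or more."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    segments = []
    for strand, seq in strands.items():
        for frame in range(3):
            protein = translate(seq[frame:])
            for segment in protein.split("*"):  # stop codons delimit entries
                if len(segment) >= min_len:
                    segments.append((strand, frame, segment))
    return segments


if __name__ == "__main__":
    for entry in six_frame_segments("ATGGCCAAGTAA", min_len=3):
        print(entry)
```

Because every frame of both strands is retained, most entries come from noncoding sequence, which is consistent with the search-space and background-noise concerns raised above.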

Novel Splice Variants

The incompleteness of genome annotation can also arise from unknown isoforms. Alternative splicing isoforms amplify the coding diversity and thus expand the functional repertoire of genes. A typical exon in the human genome is short, with more than three-quarters of the exons having a length less than 200 bp,[27] which means a relatively large number of peptides span the exon boundary. Because of incomplete genome annotation, many splice junctions might be missing in the public databases. A more comprehensive splicing annotation will certainly improve peptide identification in proteomics, as exemplified by a study demonstrating a 7% increase in peptide identification when using the ENSEMBL database with explicit isoform entries rather than the nonredundant Swiss-Prot database.[28] As mentioned above, one major limitation of the six-frame genome translation method is the failure to detect junction-spanning peptides. This limitation can be partially overcome by the generation of an exon–exon junction database. Mo et al. designed a theoretical exon–exon junction protein database to account for all possible combinations of exons for each gene in the ENSEMBL database while keeping the frame of translation.[29] They only took 25 amino acid residues from each exon and used X!Tandem and SEQUEST to identify exon junctions in a human liver secretome MS/MS data set. By combining search results from the two tools, they identified 488 nonredundant peptides corresponding to 395 ENSEMBL genes. Another study by Power et al. used a similar method to construct a database harboring peptide sequences derived from all hypothetical exon–exon junctions in the human genome.[30] The strategy, named SkipE, employs two main steps for database construction. First, it includes a “full-length transcript”, which is the longest predicted exon sequence, for each gene. Overlapping exons are merged into a longer one.
Second, entirely noncontiguous junction peptides are created from exon–exon junction-spanning sequences by cleaving the trypsin sites on both faces. Compared to the database generated by Mo et al. (873024 peptides), this method helped reduce the database size by more than half (307030 peptides). One intrinsic limitation of using only genomic data (exon model) to generate exon–exon junction databases is that many predicted alternative splicing events do not occur at the transcriptional level, and therefore a large amount of noise is introduced. To address this limitation, some studies have used EST data to reduce the size of a putative junction database. ESTs are short sequences from complementary DNA (cDNA) sequences and can indicate gene expression. Tanner et al. have developed algorithms that combine genomic data and EST data to construct an exon graph, which is a compact representation of all putative exons, splice variants, and polymorphisms. By searching a large collection of 18.5 million tandem MS spectra from human proteomic samples against the database, they confirmed the translation of 224 hypothetical human proteins and over 40 alternative splicing events.[31] Other studies use three-frame translation of mRNA sequences from ECgene, a comprehensive alternative splicing sequence database with splice variants predicted by EST clustering,[32] to generate databases for integrating with the ENSEMBL database.[33,34] Since alternative splice variants contribute to a number of diseases including cancer, these studies have been performed to identify both novel and known splice variants in cancer samples. Using EST data could largely reduce the number of putative junctions and introduce novel proteins. 
However, this approach is limited by the (1) large and redundant data size; (2) inability to cover all genes; and (3) presence of unprocessed and truncated transcripts as well as genomic contaminants.[31,35,36] Because of these limitations, some researchers even argue against using EST data for proteomics studies.[37] Further efforts are required to overcome these limitations. Edwards et al. have introduced several sequence database compression strategies to maintain the high quality ESTs, thus reducing database size by approximately 35-fold. These strategies include: (1) limiting EST sequences to those mapping to the vicinity of known genes; (2) requiring a minimum peptide length of 30-mer; and (3) including only peptides supported by at least two ESTs. This approach brings the database size closer to the commonly used protein sequence databases and allows the discovery of novel peptides in a variety of public data sets.[36] The GENQUEST method mentioned earlier can also be used to reduce the complexity of EST databases.[18] Although very helpful, these approaches cannot overcome other above-mentioned limitations. Compared to EST libraries, RNA-Seq provides a more advanced way to comprehensively identify alternative splicing events. Ning et al. performed a preliminary analysis using RNA-Seq data to derive a six-frame translated novel junction sequence database for MS/MS data search, with a focus on the identification of novel alternative splicing forms.[38] Although the study only provided proteomic evidence for a few novel alternative splicing forms, it helped demonstrate the feasibility of using RNA-Seq data to facilitate the identification of junction peptides. In a more recent study, Sheynkman et al. 
built an unannotated splice-junction peptide database with more than 30000 peptides based on RNA-Seq data, allowing the identification of 57 novel splice junction peptides.[39] Neither of these studies identified as many novel junction peptides as one would expect, which might be explained by the low expression level of the novel transcripts and the limited sequence coverage of proteomics data.
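The idea of a theoretical exon–exon junction database can be sketched as follows. This is a simplified illustration, not the implementation of Mo et al. or SkipE: it assumes each exon is already given as an in-frame amino acid string (real pipelines work at the nucleotide level and must track the reading frame), and the function name `junction_sequences` is hypothetical. The default flank of 25 residues follows the description of the Mo et al. database above.

```python
from itertools import combinations


def junction_sequences(exons, flank=25):
    """Build one junction-spanning sequence for every ordered exon pair (i < j).

    exons: list of in-frame amino acid strings, in genomic order (assumption).
    flank: residues taken from each side of the junction.
    """
    junctions = {}
    for i, j in combinations(range(len(exons)), 2):
        upstream = exons[i][-flank:]    # C-terminal end of the upstream exon
        downstream = exons[j][:flank]   # N-terminal start of the downstream exon
        junctions[(i, j)] = upstream + downstream
    return junctions


if __name__ == "__main__":
    toy_exons = ["AAAAK", "CCCCR", "DDDDE"]
    for pair, seq in junction_sequences(toy_exons, flank=2).items():
        print(pair, seq)
```

A gene with n exons yields n(n−1)/2 hypothetical junctions under this scheme, which illustrates why purely genome-based junction databases grow quickly and why EST or RNA-Seq evidence is used to prune them.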

Sequence Variations

Tremendous progress has been made in the identification of disease or drug-response associated DNA sequence variations over the past decade. Validation of these variations at the protein level may lead to novel opportunities for disease diagnosis, prognosis, and treatment. Shotgun proteomics provides a high-throughput solution for the protein-level validation of genomic variations if such information is included in the sequence database used for the search. An early study by Gatlin et al. used SEQUEST-SNP to identify sequence variations in human hemoglobin proteins.[40] Their algorithm dynamically generates all possible SNPs and translates them into peptides for proteomics search. This strategy is only possible for data sets with one or several genes because the number of dynamically introduced variations can grow exponentially with increased number of genes. Several other studies incorporated SNPs derived from EST data into protein databases.[31,36] More efforts have been made to enable the identification of protein sequence variations through incorporating genomic variation information from databases such as dbSNP and COSMIC. These works address two key challenges: how to include possible variations into a database and how to control the FDR in the search results with expanded databases. Bunger et al. presented a refined two-step approach.[41] First, LC–MS/MS data are searched against the reference protein database and a separate SNP database created from dbSNP. Next, search results are compared to get reliable SNP-containing peptides. They pointed out that searching for SNP peptides carries a high risk of false positives due to small mass changes and to post-translational or other peptide modifications that result in mass shifts similar to those of amino acid substitutions. To control false positives, they proposed two strategies. First, a decoy database can be created by random substitution of reference peptides with similar size.
Second, a more stringent match score cutoff can be applied for identifying SNP peptides. The score cutoff can be empirically identified to balance false positives and false negatives. Their study identified 36 alternative SNP alleles which were not included in the reference IPI database. Nijveen et al. designed a Human Short Peptide Variation Database (HSPVdb) dedicated to minor histocompatibility antigens (MiHAs) and demonstrated the value of the database by identifying the majority of published polymorphic SNP or alternative reading frames (ARFs)-derived epitopes in a proteomics study.[35] They generated the database by introducing SNPs into corresponding mRNA sequence fragments from RefSeq and then translated them using three reading frames. The database consists of 20-mer peptides. Further improvements were made to remove nonpolymorphic SNPs in dbSNP, which improved the elucidation of MiHAs. A primary drawback of searching the normal database and the variant database separately is the loss of competition between normal and variant peptides. A single combined database is preferred because a spectrum that matches well to a peptide in one database may have a better match to a different peptide in another database. This cannot be resolved unless all candidate sequences are considered in a single database.[42] Therefore, Su et al. added a “validation” phase after searching spectra against a variation database from SNPs.[43] To build a combined database, Schandorff et al. developed MSIPI, in which each IPI protein sequence entry is appended with additional peptide sequences such as theoretical N-terminal peptides and variant peptides from coding SNPs.[37] MSIPI allows the identification of N-terminal peptides and of cSNPs in proteomic samples, with only a 10% increase in database size. Along the same line, Xi et al. 
built a database named SysPIMP that adds human disease-related mutated proteins from OMIM, PMD, and Swiss-Prot to a reference database.[44] More recently, we have developed CanProVar, which comprehensively integrates information on protein sequence variations from various public resources, with a focus on cancer-related variations.[45] We have also developed a bioinformatics workflow to address several critical challenges in using such databases for identifying variant peptides from shotgun proteomics data, including FDR estimation, efficient storage of variation information, compatibility with different search engines, and result interpretation.[46] Applying CanProVar and this workflow to proteomics data sets of human cancer cell lines and tumor samples identified hundreds of variant peptides. More importantly, genomic sequencing confirmed around 90% of the variant peptides randomly selected from the identified ones. With the aid of the Next Generation Sequencing (NGS) technologies, large amounts of new SNPs and mutations are continually being identified, and the above-mentioned methods are both blessed and cursed. Significantly expanded databases inevitably lead to higher requirements on data storage, longer search time, and higher risk in false identifications. One particularly promising approach is to derive personalized databases for individual samples based on matching DNA or RNA sequencing data. 
In an integrative personal omics profiling (iPOP) study, expanding a protein database with variations identified from DNA and RNA sequencing data allowed the identification of variant peptides resulting from single nucleotide variants (SNVs) and RNA edits.[48] Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines, we showed that customized protein sequence databases derived from RNA-Seq data can enable the detection of known and novel peptide variants.[49] In an integrated genomics and proteomics analysis of rat liver, variants derived from genome and transcriptome variation were appended to the ENSEMBL rat database, allowing the detection of variant peptides in the proteomic data.[50] Evans et al. used RNA-Seq reads generated from adenovirus-infected human HeLa cells for the de novo assembly of the entire (host and virus) transcriptome and then built a protein database by six-frame translation of the predicted transcripts for proteomics search.[51] The proteomics informed by transcriptomics (PIT) technique identified more than 99% of the proteins identified using a traditional protein database with annotated human and adenovirus proteins. These studies demonstrate the great potential of integrative proteogenomic studies for the accurate and comprehensive characterization of individual proteomes.
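One way to picture how variant peptides can be added to a reference database is sketched below. It is a minimal illustration under stated assumptions, not the MSIPI, CanProVar, or customProDB implementation: variants are given as hypothetical `(position, ref, alt)` tuples with 1-based positions, only single amino acid substitutions are handled, and the trypsin rule (cleave after K or R unless followed by P) is assumed.

```python
def tryptic_spans(protein):
    """Return (start, end) index pairs of fully cleaved tryptic fragments."""
    spans, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and protein[i + 1:i + 2] != "P":
            spans.append((start, i + 1))
            start = i + 1
    if start < len(protein):
        spans.append((start, len(protein)))
    return spans


def variant_peptides(protein, variants):
    """Return (label, peptide) for the tryptic peptide carrying each variant.

    variants: iterable of (position, ref, alt), 1-based (illustrative format).
    Variants whose reference residue does not match the protein are skipped.
    """
    results = []
    for pos, ref, alt in variants:
        if pos < 1 or pos > len(protein) or protein[pos - 1] != ref:
            continue
        mutated = protein[:pos - 1] + alt + protein[pos:]
        # Digest the mutated protein, since a substitution to or from K/R
        # can create or remove a cleavage site.
        for start, end in tryptic_spans(mutated):
            if start <= pos - 1 < end:
                results.append((f"{ref}{pos}{alt}", mutated[start:end]))
    return results


if __name__ == "__main__":
    reference = "MAGSKLLTERPEPK"
    for label, peptide in variant_peptides(reference, [(3, "G", "R"), (7, "L", "V")]):
        print(label, peptide)
```

Appending such peptides to the reference entries, rather than searching them as a separate database, keeps normal and variant peptides in competition for each spectrum, which is the rationale given above for a single combined database.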

Improving Protein Identification

Inferring proteins from identified peptides is a critical step in shotgun proteomics. Methods have been developed to enhance protein inference by integrating mRNA expression or protein–protein interaction data.

mRNA Expression

Most protein assembly tools assume that all proteins are equally likely to be present in a sample, even though this assumption is overly simplistic. Ramakrishnan et al. incorporated mRNA abundance estimated from microarray gene expression profiling as prior knowledge of protein presence to improve protein identification in shotgun proteomics experiments.[52] Their approach, MSpresso, calculates a protein identification probability by combining the direct measure of protein presence from proteomics data and the inferential evidence from microarray data. In their study, the method improved protein identification by ∼40% at a fixed error rate. This work clearly demonstrated the value of incorporating mRNA expression data as prior knowledge in protein identification. An underlying assumption of the MSpresso approach is a good correlation between mRNA and protein abundance. However, recent studies have shown that mRNA and protein abundance are only moderately correlated. On the basis of a more realistic assumption that mRNA expression is a prerequisite for protein expression, we proposed an alternative method that refines the proteomics search space based on RNA-Seq data from the same sample. Specifically, a transcript abundance cutoff is set to remove unexpressed transcripts or lowly expressed transcripts that are unlikely to be detected at the protein level. Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines, we showed that this approach increases not only the number of identified protein groups but also the number of identifiable spectra,[49] and the latter can help enhance spectral counting-based protein quantification.
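The abundance-cutoff idea described above reduces to a simple filter over the database. The sketch below is illustrative only: the function name, the dictionary-based inputs, and the default cutoff of 1.0 (in whatever abundance unit the RNA-Seq pipeline reports, for example FPKM) are assumptions, not values from the cited study.

```python
def filter_database_by_expression(proteins, abundance, cutoff=1.0):
    """Keep only proteins whose transcript abundance reaches the cutoff.

    proteins:  dict mapping protein ID -> amino acid sequence
    abundance: dict mapping the same IDs -> transcript abundance
    cutoff:    illustrative threshold; it should be tuned per data set
    Proteins with no abundance value are treated as unexpressed.
    """
    return {
        pid: seq
        for pid, seq in proteins.items()
        if abundance.get(pid, 0.0) >= cutoff
    }


if __name__ == "__main__":
    db = {"P1": "MAGSK", "P2": "MLLTER", "P3": "MPEPK"}
    expression = {"P1": 25.3, "P2": 0.2}
    print(filter_database_by_expression(db, expression))
```

The cutoff trades sensitivity against search-space size: a higher value removes more unlikely candidates but risks excluding proteins whose transcripts are weakly expressed.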

Protein–Protein Interaction

Most biological functions arise from interactions among proteins; however, traditional protein assembly pipelines treat proteins as independent entities. To ensure the reliability of protein identification, these pipelines usually eliminate a large number of possible but nonconfident proteins, including many low-abundant proteins that may be vital for the understanding of biological systems. On the basis of the observation that proteins involved in the same biological process or pathway tend to lie close to one another in the protein–protein interaction network,[53] several methods have been developed to improve protein identification by incorporating protein–protein interaction network data. These methods can be broadly classified into three categories: module-based approach, direct neighborhood approach, and diffusion-based approach. A representative implementation of the module-based approach is the clique-enrichment approach (CEA) developed by our group.[54] After protein assembly, all identified proteins are grouped into confident proteins and nonconfident proteins and mapped to a protein–protein interaction network. Network modules defined as fully connected subnetworks (or cliques) are enumerated from the network and evaluated for the enrichment of confident proteins. Nonconfident proteins that coexist in a network module enriched with confident proteins are rescued. In several data sets tested, CEA increased protein identification by 8–23% with an estimated accuracy of 85%.[54] Although clique enumeration is used in CEA for the identification of network modules, other network clustering algorithms can be similarly used in the module-based approach. The direct neighborhood approach considers all direct neighbors of a protein as the neighborhood of the protein. One representative implementation is Software for Network Inference of Proteomics Experiments (SNIPE).[55] In this method, spectral counts for all proteins are mapped to their nodes in a network. 
An updated score for each protein is re-estimated by adding up the scores of the protein and all its immediate neighbors. Permutation is then applied to assess the statistical significance of the updated scores for all proteins. Applying SNIPE to a tooth development data set correctly highlights several proteins that are not normally detected by shotgun proteomics analysis of complex protein samples from whole tissues.[55] The diffusion-based approach takes into consideration the global network topological structure. This approach is closely related to Google’s PageRank algorithm. One representative implementation is MSNet.[56] The MSNet score for a protein is the convex combination of two terms: the probability that the protein is present in the sample given evidence from a MS experiment, and the weighted average of MSNet scores of the protein’s immediate network neighbors. This is very similar to SNIPE, but in MSNet, the scores are updated iteratively so that evidence from indirect neighbors can be included. Applying MSNet to yeast and human samples increased protein identification by 8–29% and 37%, respectively.[56] Previously, we compared the performance of these three approaches through cross-validation using a yeast cell culture data set.[54] Our results suggest that the module-based approach is more effective and more robust. As a large number of proteomics data sets are available now, it is worth re-evaluating these methods using multiple data sets. All network-based approaches depend on the network coverage and quality. To increase coverage, one may consider functional association networks instead of protein–protein interaction networks, so that different types of functional relationships can be included in the network. These approaches may also be improved by using condition-specific networks, such as the tissue-specific protein–protein interaction networks. 
Moreover, in the module-based approach, functional modules can be more broadly defined by Gene Ontology, pathways in different databases, and known protein complexes, etc. A recent study by Goh et al. showed that these functional modules can also be used to improve protein identification in proteomics studies.[57] These network and pathway-based approaches not only improve protein identification but also put identified proteins into their functional context.[54] In comparative studies, this approach enables comparisons at the network level instead of individual protein level, allowing a systems level understanding of the difference between the samples.
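The diffusion-based scoring described above (a convex combination of the MS-derived probability and the weighted average of neighbor scores, updated iteratively) can be sketched as follows. This is a sketch in the spirit of that description, not the published MSNet code: the mixing parameter `gamma`, the fixed iteration count, and the handling of isolated proteins are assumptions made for illustration.

```python
def diffuse_scores(ms_prob, network, gamma=0.5, n_iter=50):
    """Iteratively combine MS evidence with network neighbor scores.

    ms_prob: dict protein -> probability of presence from the MS experiment
    network: dict protein -> dict of neighbor -> edge weight (symmetric)
    gamma:   weight on the MS evidence (assumed value, 0 <= gamma <= 1)
    """
    scores = dict(ms_prob)
    for _ in range(n_iter):
        updated = {}
        for protein, prob in ms_prob.items():
            neighbors = network.get(protein, {})
            total_weight = sum(neighbors.values())
            if total_weight > 0:
                neighbor_avg = sum(
                    w * scores.get(n, 0.0) for n, w in neighbors.items()
                ) / total_weight
            else:
                # Assumption: an isolated protein keeps its MS evidence only.
                neighbor_avg = prob
            updated[protein] = gamma * prob + (1 - gamma) * neighbor_avg
        scores = updated
    return scores


if __name__ == "__main__":
    ms = {"A": 0.9, "B": 0.9, "C": 0.2, "D": 0.2}
    ppi = {
        "A": {"B": 1.0, "C": 1.0},
        "B": {"A": 1.0, "C": 1.0},
        "C": {"A": 1.0, "B": 1.0},
    }
    for protein, score in sorted(diffuse_scores(ms, ppi).items()):
        print(protein, round(score, 3))
```

In this toy network, C and D have the same MS evidence, but C is connected to two confidently identified proteins and ends with a higher score, whereas the isolated protein D keeps its original value. Because the update is repeated, evidence can also propagate from indirect neighbors, which is the distinction from the single-step neighborhood sum drawn above.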

Conclusion and Perspectives

A major goal in proteomics is to comprehensively identify all proteins in biological and clinical samples. Following the information flow from DNA to RNA to protein and functional networks, genomic, transcriptomic, and interactome data can be applied to improve peptide and protein identification in shotgun proteomics (Figure 2).

Figure 2

Orthogonal data assisted proteomics studies.

Despite substantial improvements in MS/MS data analysis, there remain a large number of unassigned spectra in a typical proteomics study, indicating a large unknown proteome territory.[58] This unknown territory can be partly explained by unknown protein-coding genes and different types of variations of known protein-coding genes. Earlier efforts focused primarily on making databases more comprehensive using publicly available genomic and transcriptomic data. More recently, personalized protein databases derived from sample-specific genomic and transcriptomic data have emerged as an attractive approach. Figure 3 summarizes major computational approaches to increasing database completeness using publicly available genomic and transcriptomic data. Combining six-frame translation and exon–exon junction predictions can theoretically enumerate all coding potentials of the genome. Further integrating sequence variation data from databases such as dbSNP and COSMIC allows the identification of variant peptides and proteins. Transcriptomics data from EST or RNA-Seq can be used to refine exon–exon junction predictions and filter for sequence variations with transcriptional evidence. Although these approaches can largely increase the completeness of protein databases, a significantly expanded search space may introduce enormous background noise, reducing specificity and sensitivity in peptide identification. In a recent study on ENCODE cell lines,[8] shotgun proteomics data from two human cell lines, K562 and GM12878, were searched against the GENCODE v7 protein database, the GENCODE v7 transcript-derived protein database, and the six-frame translation of the whole human genome. The GENCODE v7 protein search identified the largest number of peptides despite having the smallest database size. In contrast, the whole genome search identified the smallest number of peptides. 
It is worth noting that each search identified a significant number of peptides that were missed by the other two searches, indicating different database constructing strategies are complementary and could be used in a joint way.[8]
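To make the six-frame translation strategy discussed above concrete, the following is a minimal Python sketch, not the pipeline used in any of the cited studies. It assumes the standard genetic code and an unambiguous DNA string; the function names (`six_frame_translation`, `stop_free_segments`) and the minimum segment length are hypothetical choices made for illustration only.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate a DNA sequence in three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def stop_free_segments(protein, min_len=7):
    """Split a translated frame at stop codons and keep segments that are
    long enough to be searched (min_len is an arbitrary illustrative cutoff)."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCAAGTGA").items():
        print(name, protein)
```

Because every genomic position contributes to six candidate frames, the resulting database is far larger than an annotated protein database, which is consistent with the loss of sensitivity described above for whole-genome searches.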

Figure 3

Methods for increasing database completeness using publicly available genomic and transcriptomic data.

With the recent advancements in DNA and RNA-sequencing technologies, deriving personalized protein databases from sample-specific genomic and transcriptomic data has become a very attractive strategy. RNA-Seq is of particular interest because of its affordable cost and high information content, including information on novel transcribed regions, novel alternative splicing events, sequence variations resulting from genomic alterations and RNA editing events, and transcript presence and abundance. A sample-specific database that takes all of this information into account can better approximate the real protein pool in the sample and thus improve peptide and protein identification, and tools facilitating such integration, such as customProDB,[59] have emerged. Although this review focuses on using orthogonal data to improve shotgun proteomics studies, these integrative approaches are mutually beneficial. For example, proteomics can help refine genome annotations[10,60] and confirm novel alternative splicing events predicted from RNA-Seq data. Comprehensive identification of all proteins in biological samples can facilitate the reconstruction of sample-specific interactomes. The ability to identify sample-specific protein forms is critical for the emerging field of personalized proteomics, which could complement personalized genomics and lead to novel protein biomarkers and therapeutic targets. More importantly, comprehensive integration of information at the DNA, RNA, protein, and network levels, including post-translational modification information that is not discussed in this review, will eventually lead to a better understanding of cellular systems, a comprehensive catalogue of disease-associated molecular alterations, and novel approaches to correct these alterations. 
A key to success is the continuous development of computational algorithms and tools that can help translate the large amount of multidimensional data into new knowledge that will eventually improve human health.
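As an illustration of how a sample-specific database can expose variant peptides, the sketch below applies single amino acid substitutions (as might be called from RNA-Seq data) to a reference protein and reports the tryptic peptides that are absent from the reference digest. This is a simplified sketch under stated assumptions and does not reproduce the customProDB interface: the function names, the example sequence, the substitution tuples, and the minimum peptide length are all hypothetical, and the digestion rule is the common simplification of cleaving after K or R unless the next residue is P, with no missed cleavages.

```python
def tryptic_digest(protein, min_len=6):
    """Fully tryptic in silico digestion: cleave after K or R unless the
    next residue is P; peptides shorter than min_len are discarded."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return [p for p in peptides if len(p) >= min_len]


def apply_substitution(protein, pos, ref, alt):
    """Apply one amino acid substitution; pos is 1-based."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]


def variant_peptides(protein, substitutions, min_len=6):
    """Return tryptic peptides that appear only after applying each
    substitution individually, i.e. peptides a reference database would miss."""
    reference = set(tryptic_digest(protein, min_len))
    novel = set()
    for pos, ref, alt in substitutions:
        mutated = apply_substitution(protein, pos, ref, alt)
        novel.update(set(tryptic_digest(mutated, min_len)) - reference)
    return sorted(novel)


if __name__ == "__main__":
    # Hypothetical reference sequence and hypothetical sample-specific variants.
    reference_protein = "MKTAYIAKQRQISFVKSHFSR"
    sample_variants = [(5, "Y", "C"), (8, "K", "N")]
    print(variant_peptides(reference_protein, sample_variants))
```

Appending only such variant peptides, rather than every variant catalogued in a public database, keeps the search space close to the size of the reference database while still covering the protein forms expected in the sample.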

63 in total

1. RNA-sequencing reveals previously unannotated protein- and microRNA-coding genes expressed in aleurone cells of rice seeds.

Authors: Kenneth A Watanabe; Patricia Ringler; Lingkun Gu; Qingxi J Shen
Journal: Genomics Date: 2013-11-04 Impact factor: 5.736

2. Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences.

Authors: Gloria M Sheynkman; Michael R Shortreed; Brian L Frey; Mark Scalf; Lloyd M Smith
Journal: J Proteome Res Date: 2013-11-11 Impact factor: 4.466

3. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events.

Authors: Gerben Menschaert; Wim Van Criekinge; Tineke Notelaers; Alexander Koch; Jeroen Crappé; Kris Gevaert; Petra Van Damme
Journal: Mol Cell Proteomics Date: 2013-02-21 Impact factor: 5.911

4. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq.

Authors: Gloria M Sheynkman; Michael R Shortreed; Brian L Frey; Lloyd M Smith
Journal: Mol Cell Proteomics Date: 2013-04-29 Impact factor: 5.911

5. Proteogenomic database construction driven from large scale RNA-seq data.

Authors: Sunghee Woo; Seong Won Cha; Gennifer Merrihew; Yupeng He; Natalie Castellana; Clark Guest; Michael MacCoss; Vineet Bafna
Journal: J Proteome Res Date: 2013-07-17 Impact factor: 4.466

6. Quantitative and qualitative proteome characteristics extracted from in-depth integrated genomics and proteomics analysis.

Authors: Teck Yew Low; Sebastiaan van Heesch; Henk van den Toorn; Piero Giansanti; Alba Cristobal; Pim Toonen; Sebastian Schafer; Norbert Hübner; Bas van Breukelen; Shabaz Mohammed; Edwin Cuppen; Albert J R Heck; Victor Guryev
Journal: Cell Rep Date: 2013-11-27 Impact factor: 9.423

7. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search.

Authors: Xiaojing Wang; Bing Zhang
Journal: Bioinformatics Date: 2013-09-20 Impact factor: 6.937

8. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions.

Authors: Jainab Khatun; Yanbao Yu; John A Wrobel; Brian A Risk; Harsha P Gunawardena; Ashley Secrest; Wendy J Spitzer; Ling Xie; Li Wang; Xian Chen; Morgan C Giddings
Journal: BMC Genomics Date: 2013-02-28 Impact factor: 3.969

9. A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data.

Authors: Fan Mo; Xu Hong; Feng Gao; Lin Du; Jun Wang; Gilbert S Omenn; Biaoyang Lin
Journal: BMC Bioinformatics Date: 2008-12-16 Impact factor: 3.169

10. Comparative network-based recovery analysis and proteomic profiling of neurological changes in valproic acid-treated mice.

Authors: Wilson Wen Bin Goh; Marek J Sergot; Judy C G Sng; Limsoon Wong
Journal: J Proteome Res Date: 2013-04-17 Impact factor: 4.466
