Jun Fan, Shyamasree Saha, Gary Barker, Kate J Heesom, Fawaz Ghali, Andrew R Jones, David A Matthews, Conrad Bessant.
Abstract
With the recent advent of RNA-seq technology, the proteomics community has begun to generate sample-specific protein databases for peptide and protein identification, an approach we call proteomics informed by transcriptomics (PIT). This approach has gained a lot of interest, particularly among researchers who work with nonmodel organisms or with particularly dynamic proteomes such as those observed in developmental biology and host-pathogen studies. PIT has been shown to improve coverage of known proteins and to reveal potential novel gene products. However, many groups are impeded in their use of PIT by the complexity of the required data analysis. Necessarily, this analysis requires complex integration of a number of different software tools from at least two different communities, and because PIT has a range of biological applications, a single software pipeline is not suitable for all use cases. To overcome these problems, we have created GIO, a software system that uses the well-established Galaxy platform to make PIT analysis available to the typical bench scientist via a simple web interface. Within GIO we provide workflows for four common use cases: a standard search against a reference proteome; PIT protein identification without a reference genome; PIT protein identification using a genome guide; and PIT genome annotation. These workflows comprise individual tools that can be reconfigured and rearranged within the web interface to create new workflows to support additional use cases.
Year: 2015 PMID: 26269333 PMCID: PMC4638048 DOI: 10.1074/mcp.O115.048777
Source DB: PubMed Journal: Mol Cell Proteomics ISSN: 1535-9476 Impact factor: 5.911
Fig. 1. Annotated PIT workflow without reference genome. The inputs to the workflow are a file containing raw spectral data and a list of transcripts assembled from RNA-seq data (assembled in a separate workflow using Trinity). In the first step of analysis, the longest open reading frames (ORFs) in all six frames for each transcript are determined using the in-house ORFall tool. MS-GF+ then finds peptide spectrum matches, which are post-processed and grouped to proteins using tools from mzIdentML-lib. This results in a list of confidently identified ORFs in mzIdentML format, which is then BLASTed against protein databases to add context to the results.
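The first step of the workflow in Fig. 1, finding the longest ORF in each of the six reading frames of a transcript, can be illustrated with a minimal sketch. This is not the ORFall tool, whose implementation is not described here. The sketch assumes the standard genetic code and treats an ORF as the longest stop-codon-free stretch in a frame, without requiring a start codon; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the sample uses the standard code; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf_per_frame(transcript):
    """Map each frame label (+1..+3, -1..-3) to its longest stop-free stretch.

    Hypothetical helper: the ORF definition here (stop to stop, no start
    codon required) is an assumption, not the documented ORFall behaviour.
    """
    seq = transcript.upper()
    orfs = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(strand_seq[offset:])
            orfs[f"{strand}{offset + 1}"] = max(protein.split("*"), key=len)
    return orfs


if __name__ == "__main__":
    for frame, orf in longest_orf_per_frame("ATGGCCTAAGG").items():
        print(frame, orf)
```

In a PIT setting, the resulting ORF sequences would be written to a FASTA file and used as the search database in place of a reference proteome.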
Summary of RNA-seq data, database sizes, and results from de novo PIT, genome-guided PIT, and standard searches
| | Standard (human and adenovirus) | PIT | Standard (human only) | Genome guided assembly |
|---|---|---|---|---|
| Transcripts | n/a | 103,431 | n/a | 37,024 |
| Average transcript length | n/a | 670.03 | n/a | 1,007.21 |
| Total components/genes | n/a | 92,661 | n/a | 33,747 |
| Transcript N50 | n/a | 1,040 | n/a | 1,764 |
| Proteins/ORFs | 88,701 | 78,493 | 88,665 | 54,184 |
| Identified peptides | 26,401 | 25,907 | 25,970 | 21,601 |
| Identified PAGs | 3,015 | 2,926 | 2,986 | 2,713 |
| Peptide level overlap | 88% | | 77% | |
| PAG level overlap | 87% | | 75% | |
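The table reports a transcript N50 for each assembly without defining it. As a reminder of what the statistic usually means, here is a minimal sketch of the conventional definition: the length L such that transcripts of length at least L account for at least half of all assembled bases. How the values in the table were actually computed is not stated, so this is an assumption, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Conventional definition (assumed here): walk the lengths from longest
    to shortest and return the length at which the running total first
    reaches half of the total number of bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    # Toy lengths, not taken from the study.
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 weights long sequences more heavily than the mean does, it can sit well above the average transcript length, as it does in both assemblies in the table.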
Fig. 2. Comparison of peptides and protein groups identified by a standard protein identification workflow in which mass spectra were searched against UniProt, and a PIT workflow without reference genome, for HeLa cells infected with adenovirus.
Summary of search-specific peptide identifications and the proportion due to absence in the opposing database
| Method | Peptides derived from database | Peptide identifications unique to this search | Unique identifications due to absence of peptide in other database |
|---|---|---|---|
| Standard (human and adenovirus) | 11,636,910 | 1,883 | 1,878 (99.7%) |
| PIT– | 7,843,921 | 1,389 | 813 (58.5%) |
| Standard (human only) | 11,627,809 | 5,215 | 5,187 (99.4%) |
| PIT–genome guided assembly | 7,095,081 | 846 | 552 (65.2%) |
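The last two columns of this table can be read as set operations: the peptides identified by one search but not the other, and the subset of those that the other search could never have found because the peptide is absent from its database. The sketch below shows that logic on toy data. The peptide strings and the function name are hypothetical, and the sketch assumes peptides are compared as exact strings, which the source does not specify.

```python
def unique_identifications(identified_a, identified_b, database_b):
    """Classify the peptides that only search A identified.

    identified_a / identified_b: sets of peptides identified by each search.
    database_b: set of all peptides derivable from search B's database.
    Returns (unique_to_a, absent_from_b), where absent_from_b is the subset
    of unique_to_a that does not occur in database B at all.
    """
    unique_to_a = identified_a - identified_b
    absent_from_b = unique_to_a - database_b
    return unique_to_a, absent_from_b


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    standard_ids = {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"}
    pit_ids = {"PEPTIDEA", "PEPTIDED"}
    pit_database = {"PEPTIDEA", "PEPTIDEB", "PEPTIDED"}

    unique, absent = unique_identifications(standard_ids, pit_ids, pit_database)
    print(sorted(unique), sorted(absent), len(absent) / len(unique))
```

Unique identifications that are present in the other database but still missed there cannot be explained by database content alone, which is the distinction the percentage column draws.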
Fig. 3. Comparison of peptides and protein groups identified by a standard protein identification workflow in which mass spectra were searched against UniProt, and a PIT workflow using a human reference genome, for HeLa cells infected with adenovirus. Because a human reference genome was used, only human protein identifications are included in these results.