PepSeeker: a database of proteome peptide identifications for investigating fragmentation patterns.

Thomas McLaughlin, Jennifer A Siepen, Julian Selley, Jennifer A Lynch, King Wai Lau, Hujun Yin, Simon J Gaskell, Simon J Hubbard.

Abstract

Proteome science relies on bioinformatics tools to characterize proteins via their proteolytic peptides which are identified via characteristic mass spectra generated after their ions undergo fragmentation in the gas phase within the mass spectrometer. The resulting secondary ion mass spectra are compared with protein sequence databases in order to identify the amino acid sequence. Although these search tools (e.g. SEQUEST, Mascot, X!Tandem, Phenyx) are frequently successful, much is still not understood about the amino acid sequence patterns which promote/protect particular fragmentation pathways, and hence lead to the presence/absence of particular ions from different ion series. In order to advance this area, we have developed a database, PepSeeker (http://nwsr.smith.man.ac.uk/pepseeker), which captures this peptide identification and ion information from proteome experiments. The database currently contains >185,000 peptides and associated database search information. Users may query this resource to retrieve peptide, protein and spectral information based on protein or peptide information, including the amino acid sequence itself represented by regular expressions coupled with ion series information. We believe this database will be useful to proteome researchers wishing to understand gas phase peptide ion chemistry in order to improve peptide identification strategies. Questions can be addressed to j.selley@manchester.ac.uk.

Year: 2006 PMID: 16381951 PMCID: PMC1347429 DOI: 10.1093/nar/gkj066

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Proteomics is growing rapidly as a technique in functional genomics. Driven by advances in mass spectrometry and analytical chemistry, coupled with the expanding number of completely sequenced genomes, proteomics is becoming a widely exploited technology for characterizing the proteins found in living systems. A growing number of proteome databases are appearing on the internet (1–5), along with maturing data standards in proteomics driven by the Proteomics Standards Initiative (PSI) (6–8). Existing databases cover a wide range of mass spectrometry-based proteomics data, including basic identifications of proteins and peptides, the samples studied, the instruments used and the software search tools employed. Notable examples include the PeptideAtlas database developed by Aebersold and colleagues (1), the Global Proteome Machine (GPM) from Beavis and co-workers (2), the Open Proteomics Database (3), the PEDRo proteome repository developed locally in Manchester (4), and the PRIDE database at the EBI (5). This growing list of resources offers a range of approaches for the capture, storage and dissemination of proteomic experimental data and reflects the fact that proteomics has now come of age in the post-genomic era and is delivering large, complex datasets which are rich in information. These advances in proteomics are supported by bioinformatics search tools which allow the mass spectra generated to be compared with protein sequence databases in order to identify the protein. Typically, the protein is identified from peptides produced by hydrolysing the polypeptide chain with a proteolytic enzyme such as trypsin. The tryptic peptides are then separated and analysed in the mass spectrometer.
Proteins can then be characterized either from the mass-to-charge values of the peptide ions themselves (known as Peptide Mass Fingerprinting), or increasingly by tandem mass spectrometry (MS), where the peptide ion is itself induced to fragment via energetic collision with a gas in the instrument (Peptide Fragment Fingerprinting). This latter technique is part of the popular MudPIT (Multidimensional Protein Identification Technology) approach originating from the Yates lab (9,10), where thousands of peptides are separated via liquid chromatography and fed directly into a mass spectrometer, yielding many thousands of spectra for bioinformatic analysis. Search tools such as Mascot (11), SEQUEST (12), X!Tandem (13) and Phenyx (14) are then employed to determine the most probable matching peptide in a sequence database. The success of this matching depends on the quality of the spectrum, the ions observed and many other factors. Despite several excellent studies (15–17), the fragmentation of peptide ions in the gas phase is still only partially understood, and these algorithms primarily exploit the differences in the mass-to-charge values of the ions in order to identify candidate peptides which match the experimental spectra. A more complete understanding of how different amino acid sequences promote or influence fragmentation pathways will lead to improvements in our ability to predict the relative presence/absence of particular peaks from different ion series in the tandem MS spectra. This in turn can be exploited in these software search tools to make better peptide identifications, both in terms of the number of peptides which can be identified and the overall confidence which can be placed in them. This is important since a large fraction of the spectra currently analysed do not lead to confident peptide identifications (18) and proteomics still does not offer truly genome-wide coverage.
This requirement for better search tools in tandem MS for automatic peptide identification is important both for searches against known protein sequence databases, and the more challenging de novo search problem. To facilitate these efforts, we have designed and implemented a database system, PepSeeker, to capture and store this information, and allow users to query it to help mine rules and explore fragmentation patterns observed in peptide sequences studied in the mass spectrometer. This has been in part motivated by a local project seeking to mine the data using machine learning methods to discover rules to model peptide spectra including the relative peak heights of the fragment ions. To this end, our PepSeeker database contains both peptide identifications and the associated fragment ion details used to identify that amino acid sequence. It is intended to complement the more holistic proteome databases, with the primary focus on the identification itself allied to the peptide sequence data, coupled to the underlying ion series. To this extent, PepSeeker supports novel searches not available via other databases and tools, where spectra and specific ion information can be retrieved with respect to amino acid patterns.

DATA CAPTURE STRATEGY

The current implementation of PepSeeker has been developed using a MySQL platform with a simple schema designed to capture data obtained primarily from a local Mascot-based proteomics pipeline. This strategy enables data to be captured from a range of instruments and vendors. The database schema is shown in Figure 1, which shows the data captured at present, including basic file searching parameters relating to the spectra, instrument and database, as well as protein and peptide hit information from the identifications. Rather than base this around the PEDRo or PSI XML schema currently under development, we elected to design a simplified model targeted directly at the identification stage of the proteomics pipeline, and do not capture all the information associated with the mzData/mzXML standards concerning full spectra and processing details or comprehensive instrument parameters. Our model is largely equivalent in scope to the mzIdent/analysisXML standard currently under development by the PSI group (6). Results from the Mascot searches are parsed and loaded into the database directly from Mascot ‘.dat’ format flatfiles, although we are able to parse and accept other formats such as SEQUEST's ‘.dta’ and ‘.out’ files in a semi-automated fashion. A web-based submission form is under development to accept all formats. Direct parsing of these formats provides us with an interim solution whilst mzIdent/analysisXML matures as a standard, since it is expected that all search engines will be tailored to deliver this as output in the near future. Data is captured based on simple filtering criteria concerning the user/email/database, so that only the desired data is captured from groups willing to share data.
The following data tables are used: SearchMasses (containing precursor and product ions, their intensities and charges), Fileparameters (containing information about file, instrument and software settings), Proteinhit (containing information about all of the top protein hits), Proteinscore (containing protein score and information on number of peptide queries matched), Peptidehit (containing information about matched peptides) and IonTable (containing information on ions matching peptide ion fragments). Currently, the database contains peptide identifications from species including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Escherichia coli, Plasmodium falciparum, mouse and human. Statistics relating to the number of peptides in the database are shown in Table 1.
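
As an illustration of how these six tables might relate to one another, the following Python sketch builds a toy version in an in-memory SQLite database. Only the table names are taken from the description above; every column name, every key and the use of SQLite in place of MySQL are assumptions made for illustration, and this should not be read as the actual PepSeeker schema.

```python
import sqlite3

# Table names follow the text; ALL columns and keys are hypothetical.
SCHEMA = """
CREATE TABLE Fileparameters (
    file_id INTEGER PRIMARY KEY, filename TEXT, instrument TEXT, search_db TEXT);
CREATE TABLE SearchMasses (
    query_id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES Fileparameters(file_id),
    precursor_mz REAL, charge INTEGER);
CREATE TABLE Proteinhit (
    protein_id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES Fileparameters(file_id),
    accession TEXT, description TEXT);
CREATE TABLE Proteinscore (
    protein_id INTEGER REFERENCES Proteinhit(protein_id),
    score REAL, queries_matched INTEGER);
CREATE TABLE Peptidehit (
    peptide_id INTEGER PRIMARY KEY,
    protein_id INTEGER REFERENCES Proteinhit(protein_id),
    query_id INTEGER REFERENCES SearchMasses(query_id),
    sequence TEXT, ion_score REAL, expect REAL);
CREATE TABLE IonTable (
    peptide_id INTEGER REFERENCES Peptidehit(peptide_id),
    ion_type TEXT, position INTEGER, mz REAL);
"""


def build_demo_db():
    """Create an empty in-memory database with the illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def peptides_with_ion(conn, ion_type, position):
    """Return sequences of peptides that have a matched ion of the given
    series (e.g. 'y') and position (e.g. 5) recorded in IonTable."""
    rows = conn.execute(
        "SELECT DISTINCT p.sequence FROM Peptidehit AS p "
        "JOIN IonTable AS i ON i.peptide_id = p.peptide_id "
        "WHERE i.ion_type = ? AND i.position = ? "
        "ORDER BY p.sequence",
        (ion_type, position),
    ).fetchall()
    return [row[0] for row in rows]
```

In this sketch the join between Peptidehit and IonTable is what links an identified sequence to the individual fragment ions that supported it, which is the relationship the ion-based searches described later rely on.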

Figure 1

PepSeeker database schema, showing the relationship between tables.

Table 1

PepSeeker database statistics

Viewable spectra	1 397 159
Proteins	49 537
Peptides (total)	186 873
Unique peptides	47 732
Average peptide length	11.6 amino acids
Range of peptide lengths	3–66 amino acids

PepSeeker INTERFACE

The interface is built using Perl CGI and DBI to interact with and query the MySQL database, providing a variety of entry points depending on a particular researcher's search parameters. The first entry point is via the ‘Filename’ search form. This supports searches by user or by search title, and is aimed mainly at contributors to the dataset, allowing researchers to track their experiments and results. The ‘Protein’ search form relates to the putative parent proteins of identified peptides, allowing queries to retrieve identified peptides from proteins based on keywords, protein mass ranges, taxonomies and/or specific flatfile databases that were searched. We capture version information for publicly available databanks such as Swiss-Prot, UniProt and MSDB, and allow non-standard databases to be downloaded separately via ftp. Given the inherent problems associated with unambiguously assigning peptides to specific proteins, we simply capture and store all reported matches listed in Mascot output. This is deliberate, so that all putative protein–peptide relationships are captured, bearing in mind that the focus of this resource is very much on the peptide identifications rather than the protein identifications. The ‘Peptide’ search form is useful for finding specific amino acid sequence patterns, occurring in isolation or coupled with other patterns. Searches support regular expressions, allowing quite complex queries, which may be coupled with restriction by protein accession/identifier and quality control on the peptide confidence (via expectation values or Mascot ion scores). The supported regular expression pattern matching is explained in the online help, and the matching pattern is highlighted in the output from a query. The ‘Ions (simple)’ search provides a means to search for specific ion types identified by the search engines associated with particular amino acids and locations within a peptide sequence.
As in the peptide search, this can be anchored to a specific protein. A more advanced search can also be performed, where many specific ions can be associated with specific positions in a peptide sequence, each subtended by a selected amino acid. In addition, the C-terminal amino acid can be specified (any, arginine or lysine) as the differing basicity of their sidechains can produce differing fragmentation patterns. The ion type query forms [both simple sequence search ‘Ions (simple)’ and advanced ion search ‘Ions (advanced)’] are particularly important to this project, to provide users with a means to examine the presence or absence of given ion types for given peptide sequence patterns which may relate to specific ion fragmentation pathways, or be useful to examine spectra which exemplify given trends identified by machine learning.
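
The kind of pattern search described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the PepSeeker implementation (which is built on Perl CGI and MySQL): the function names, the bracket-style highlighting and the use of Python's `re` syntax are all assumptions, and the regular expression dialect actually supported is the one documented in the online help.

```python
import re

# Mapping for the C-terminal restriction described in the text
# (any, arginine or lysine). The option names are hypothetical.
C_TERMINAL = {"any": None, "arginine": "R", "lysine": "K"}


def search_peptides(peptides, pattern, c_term="any"):
    """Return (peptide, start, end) for each peptide whose sequence matches
    the regular expression and satisfies the C-terminal restriction.
    Only the first match in each peptide is reported."""
    if c_term not in C_TERMINAL:
        raise ValueError("c_term must be 'any', 'arginine' or 'lysine'")
    residue = C_TERMINAL[c_term]
    regex = re.compile(pattern)
    hits = []
    for pep in peptides:
        if residue is not None and not pep.endswith(residue):
            continue
        match = regex.search(pep)
        if match:
            hits.append((pep, match.start(), match.end()))
    return hits


def highlight(peptide, start, end):
    """Mark the matched region with square brackets, mimicking the
    highlighted pattern in the query output."""
    return peptide[:start] + "[" + peptide[start:end] + "]" + peptide[end:]


if __name__ == "__main__":
    # The first peptide appears in Figure 2; the others are made-up examples.
    demo = ["SQGPPPPGKPQGPPPQGGSK", "LVNELTEFAK", "AGPPR"]
    for pep, start, end in search_peptides(demo, "PPPP", c_term="lysine"):
        print(highlight(pep, start, end))
```

A pattern such as `P.{4}$` would, under the same assumptions, select peptides whose fifth residue from the C-terminus is proline, showing how sequence patterns and positions can be combined in one expression.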

EXAMPLE DATABASE QUERIES

To illustrate the utility of this resource, an example query is shown in Figure 2. This relates to a Peptide search for a specific fragmentation pattern. Most peptide ion fragmentation yields ions in one of two ion series, a b-series and a y-series resulting from fragmentation at the peptide bond. Proline residues are well known to promote fragmentation in the gas phase, and hence for example, we can query the database to show all examples of a proline cleavage where it has generated a y5 ion. We can query the database for examples where there is more than one proline present in the peptide and use the information there to compare with results from machine learning experiments, or indeed to discover unusual or novel patterns which may be investigated further by designing a series of peptides for further study.
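
To make the b- and y-series concrete, the following Python sketch computes singly charged fragment m/z values for an unmodified peptide from standard monoisotopic residue masses, and checks whether a given y ion begins with proline. The function names are hypothetical, and reading "proline cleavage" as cleavage N-terminal to proline (so that the y fragment starts with the proline residue) is an assumption made for this example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, retained by the C-terminal (y) fragment
PROTON = 1.007276   # charge carrier for a singly protonated ion


def b_ions(seq):
    """m/z of singly charged b1..b(n-1): N-terminal residues plus a proton."""
    masses = [MONO[aa] for aa in seq]
    return [sum(masses[:i]) + PROTON for i in range(1, len(seq))]


def y_ions(seq):
    """m/z of singly charged y1..y(n-1): C-terminal residues plus water
    and a proton."""
    masses = [MONO[aa] for aa in seq]
    return [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(seq))]


def y_starts_with_proline(seq, i):
    """True if the y_i fragment (last i residues) begins with proline,
    i.e. the cleavage producing y_i is N-terminal to a proline."""
    return 1 <= i < len(seq) and seq[-i] == "P"


if __name__ == "__main__":
    peptide = "SQGPPPPGKPQGPPPQGGSK"  # peptide shown in Figure 2
    for i, mz in enumerate(y_ions(peptide), start=1):
        flag = " (N-terminal to P)" if y_starts_with_proline(peptide, i) else ""
        print(f"y{i}: {mz:.4f}{flag}")
```

Because each b ion and its complementary y ion together account for the whole peptide, b_i + y_(n-i) equals the neutral peptide mass plus two protons, which provides a quick consistency check on any ion table.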

Figure 2

Screen-shot of the PepSeeker front-end, showing an example of the navigation from a simple ion search to a list of the matching peptides through to a graphical representation of the spectra and associated ion information. An example PepSeeker query is shown searching for all peptides within the database containing the sequence PPPP. The first window is the query entry, the second window is the output and the third window displays the spectrum and table associated to the peptide SQGPPPPGKPQGPPPQGGSK.

The example in Figure 2 shows how the user can navigate the database from the initial query. Here, a search is conducted in the ‘Ions (simple)’ search query form using the regular expression ‘PPPP’, which returns all peptides in the database containing this pattern. A further click on the ‘Ion table’ link brings up a simple peptide secondary ion spectrum in which the user can view the ions present, along with an ion table showing all possible ions with those actually present highlighted. The user can then zoom in on the spectrum to see whether any putative low-intensity ions were left unassigned by the search tool.

DATA SUBMISSION AND FUTURE DIRECTIONS

The database is publicly available at http://nwsr.smith.man.ac.uk/pepseeker. Our principal aim is to make peptide identification data available to the community to provide datasets for groups developing tools for peptide identification. In addition, we provide a means to search the data for characteristic ion patterns linked to amino acid sequence. Currently, contributors can send us Mascot ‘.dat’ files or similar formats from other vendors (‘.dta’ and ‘.out’ files from SEQUEST), but we plan to operate an upload site in the near future. Likewise, we plan to support PSI-compliant XML formats such as mzIdent/analysisXML when they become available, permitting both upload and download in such a format. Groups wishing to contribute data should contact J.Selley@manchester.ac.uk.

17 in total

1. OLAV: towards high-throughput tandem mass spectrometry data identification.

Authors: Jacques Colinge; Alexandre Masselot; Marc Giron; Thierry Dessingy; Jérôme Magnin
Journal: Proteomics Date: 2003-08 Impact factor: 3.984

2. Automatic quality assessment of peptide tandem mass spectra.

Authors: Marshall Bern; David Goldberg; W Hayes McDonald; John R Yates
Journal: Bioinformatics Date: 2004-08-04 Impact factor: 6.937

3. The need for a public proteomics repository.

Authors: John T Prince; Mark W Carlson; Rong Wang; Peng Lu; Edward M Marcotte
Journal: Nat Biotechnol Date: 2004-04 Impact factor: 54.908

4. TANDEM: matching proteins with tandem mass spectra.

Authors: Robertson Craig; Ronald C Beavis
Journal: Bioinformatics Date: 2004-02-19 Impact factor: 6.937

5. Open source system for analyzing, validating, and storing protein identification data.

Authors: Robertson Craig; John P Cortens; Ronald C Beavis
Journal: J Proteome Res Date: 2004 Nov-Dec Impact factor: 4.466

6. Cleavage N-terminal to proline: analysis of a database of peptide tandem mass spectra.

Authors: Linda A Breci; David L Tabb; John R Yates; Vicki H Wysocki
Journal: Anal Chem Date: 2003-05-01 Impact factor: 6.986

7. Influence of basic residue content on fragment ion peak intensities in low-energy collision-induced dissociation spectra of peptides.

Authors: David L Tabb; Yingying Huang; Vicki H Wysocki; John R Yates
Journal: Anal Chem Date: 2004-03-01 Impact factor: 6.986

8. The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data.

Authors: Henning Hermjakob; Luisa Montecchi-Palazzi; Gary Bader; Jérôme Wojcik; Lukasz Salwinski; Arnaud Ceol; Susan Moore; Sandra Orchard; Ugis Sarkans; Christian von Mering; Bernd Roechert; Sylvain Poux; Eva Jung; Henning Mersch; Paul Kersey; Michael Lappe; Yixue Li; Rong Zeng; Debashis Rana; Macha Nikolski; Holger Husi; Christine Brun; K Shanker; Seth G N Grant; Chris Sander; Peer Bork; Weimin Zhu; Akhilesh Pandey; Alvis Brazma; Bernard Jacq; Marc Vidal; David Sherman; Pierre Legrain; Gianni Cesareni; Ioannis Xenarios; David Eisenberg; Boris Steipe; Chris Hogue; Rolf Apweiler
Journal: Nat Biotechnol Date: 2004-02 Impact factor: 54.908

9. A systematic approach to modeling, capturing, and disseminating proteomics experimental data.

Authors: Chris F Taylor; Norman W Paton; Kevin L Garwood; Paul D Kirby; David A Stead; Zhikang Yin; Eric W Deutsch; Laura Selway; Janet Walker; Isabel Riba-Garcia; Shabaz Mohammed; Michael J Deery; Julie A Howard; Tom Dunkley; Ruedi Aebersold; Douglas B Kell; Kathryn S Lilley; Peter Roepstorff; John R Yates; Andy Brass; Alistair J P Brown; Phil Cash; Simon J Gaskell; Simon J Hubbard; Stephen G Oliver
Journal: Nat Biotechnol Date: 2003-03 Impact factor: 54.908

10. Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry.

Authors: Frank Desiere; Eric W Deutsch; Alexey I Nesvizhskii; Parag Mallick; Nichole L King; Jimmy K Eng; Alan Aderem; Rose Boyle; Erich Brunner; Samuel Donohoe; Nelson Fausto; Ernst Hafen; Lee Hood; Michael G Katze; Kathleen A Kennedy; Floyd Kregenow; Hookeun Lee; Biaoyang Lin; Dan Martin; Jeffrey A Ranish; David J Rawlings; Lawrence E Samelson; Yuzuru Shiio; Julian D Watts; Bernd Wollscheid; Michael E Wright; Wei Yan; Lihong Yang; Eugene C Yi; Hui Zhang; Ruedi Aebersold
Journal: Genome Biol Date: 2004-12-10 Impact factor: 13.583
