Chen Zhou, Hao Chi, Le-Heng Wang, You Li, Yan-Jie Wu, Yan Fu, Rui-Xiang Sun, Si-Min He.
Abstract
BACKGROUND: Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification. One of the key challenges in database searching is the remarkable increase in computational demand, brought about by the expansion of protein databases, semi- or non-specific enzymatic digestion, post-translational modifications and other factors. Some software tools use peptide indexing to accelerate processing. However, a peptide index requires a large amount of time and space to construct, especially for non-specific digestion. In addition, it is inflexible to use.
Year: 2010 PMID: 21108792 PMCID: PMC3000425 DOI: 10.1186/1471-2105-11-577
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The parameters of database searching experiments
| Experiment | Parameter | Value |
|---|---|---|
| Exp1 | Instrument | LTQ |
| Exp1 | Spectra | 107666 spectra extracted from ten raw files |
| Exp1 | Database | 18 purified proteins with the Uniprot/SwissProt protein sequence database (1020188 protein sequences, target + reversed), total around 460 MB |
| Exp1 | Digestion way | Site-specific digestion |
| Exp1 | Tolerance | Precursor: +/- 3 Da; Fragment: +/- 0.5 Da |
| Exp1 | Modifications | Fixed: Carbamidomethylation (C); Variable: Oxidation (M) |
| Exp2 | Instrument | LTQ-FT or LTQ-Orbitrap |
| Exp2 | Spectra | 13816 spectra extracted from two raw files |
| Exp2 | Database | IPI-Mouse protein sequence database (113914 protein sequences, target + reversed), total around 66 MB |
| Exp2 | Digestion way | Non-specific digestion |
| Exp2 | Tolerance | Precursor: +/- 12 ppm; Fragment: +/- 0.5 Da |
| Exp2 | Modifications | Fixed: Carbamidomethylation (C); Variable: Phosphorylation (S, T, Y), Oxidation (M) |
Peptide and protein identification time for the three workflows
| Workflow | Experiment 1 | Experiment 2 |
|---|---|---|
| Workflow-1 | 3679 | 8278 |
| Workflow-2 | 2236 | 4430 |
| Workflow-3 | 2228 | 4111 |
Workflow-1 uses no special data structure; workflow-2 uses peptide indexing; workflow-3 uses ABLCP. The unit of time is minutes.
The peptide redundancy ratio of two protein sequence databases
| Database | Peptide Number | Full-specific | Semi-specific | Non-specific |
|---|---|---|---|---|
| Human | Non-redundant | 3549956 | 55908454 | 626871441 |
| Human | Redundant | 8022636 | 128308391 | 1401160777 |
| Human | Redundancy | 55.7% | 56.4% | 55.2% |
| SwissProt | Non-redundant | 24915278 | 395305609 | 4525189544 |
| SwissProt | Redundant | 37646081 | 601652577 | 6554527058 |
| SwissProt | Redundancy | 33.8% | 34.2% | 31.0% |
The peptide redundancy ratio of the IPI-Human V3.65 and Uniprot/SwissProt V56.2 protein sequence databases. The length of the peptides is limited from 6 to 60 amino acids.
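The redundancy rows can be checked with a few lines of arithmetic. The sketch below assumes that redundancy is defined as one minus the ratio of non-redundant to redundant peptide counts. The source does not state this definition, but it reproduces the tabulated percentages to within about 0.1 percentage point. The function name `redundancy` is illustrative only.

```python
# Peptide counts copied from the table above: (non-redundant, redundant).
counts = {
    ("Human", "Full-specific"): (3549956, 8022636),
    ("Human", "Semi-specific"): (55908454, 128308391),
    ("Human", "Non-specific"): (626871441, 1401160777),
    ("SwissProt", "Full-specific"): (24915278, 37646081),
    ("SwissProt", "Semi-specific"): (395305609, 601652577),
    ("SwissProt", "Non-specific"): (4525189544, 6554527058),
}


def redundancy(non_redundant: int, redundant: int) -> float:
    """Fraction of generated peptides that are duplicates (assumed definition)."""
    return 1.0 - non_redundant / redundant


for (database, digestion), (nr, r) in counts.items():
    print(f"{database:10s} {digestion:14s} {100 * redundancy(nr, r):.2f}%")
```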
The additional storage space needed for ABLCP and peptide indexing
| Database | Workflow | Full-specific | Semi-specific | Non-specific |
|---|---|---|---|---|
| Human | ABLCP | 30 | 60 | 30 |
| Human | Peptide Index | 67 | 939 | 9799 |
| SwissProt | ABLCP | 137 | 274 | 137 |
| SwissProt | Peptide Index | 424 | 6081 | 65122 |
The experiments were performed on the IPI-Human V3.65 and Uniprot/SwissProt V56.2 protein sequence databases for full-, semi- and non-specific digestion. The unit of storage space is MB.
The time needed to construct ABLCP and peptide indexing
| Database | Workflow | Full-specific | Semi-specific | Non-specific |
|---|---|---|---|---|
| Human | ABLCP | 41 | 41 | 41 |
| Human | Peptide Index | 50 | 603 | 6475 |
| SwissProt | ABLCP | 196 | 196 | 196 |
| SwissProt | Peptide Index | 242 | 2919 | 24828 |
The experiments were performed on the IPI-Human V3.65 and Uniprot/SwissProt V56.2 protein sequence databases for full-, semi- and non-specific digestion. The unit of time is seconds.
The time needed to read peptides from the disk
| Database | Workflow | Full-specific | Semi-specific | Non-specific |
|---|---|---|---|---|
| Human | ABLCP | 3 | 108 | 144 |
| Human | Peptide Index | 20 | 317 | 3588 |
| SwissProt | ABLCP | 16 | 441 | 1032 |
| SwissProt | Peptide Index | 144 | 2301 | 25283 |
The time needed to read peptides from the disk for peptide indexing or online digestion for ABLCP. The experiments were performed on the IPI-Human V3.65 and Uniprot/SwissProt V56.2 protein sequence databases for full-, semi- and non-specific digestion. The unit of time is seconds.
Figure 1. An example of the suffix array and LCP array for an input text string. The first row is the input text string T = T[0...n) = MSQVQVQV$, where n is 8 and indexing begins at 0. The second and third rows are the corresponding LCP and suffix array. Note that SA is indexed by lexicographical rank, whereas LCP is indexed by text position. Take index 2 as an example. SA[2] is 6, which means that the third suffix in ascending lexicographical order is Suffix[6], namely "QV$". LCP[2] is 4, which means that the longest common prefix between Suffix[2] "QVQVQV$" and its predecessor in lexicographical order (Suffix[4], "QVQV$") has length 4.
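The arrays in this example can be reproduced with a short sketch. It uses naive suffix sorting as a stand-in, not any of the linear-time construction algorithms compared in this record, and the function names `suffix_array` and `lcp_by_position` are illustrative. It relies on '$' sorting before the uppercase letters, which holds for ASCII.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_by_position(text: str, sa: list[int]) -> list[int]:
    """lcp[i] = length of the common prefix shared by the suffix starting at
    text position i and its predecessor in lexicographical order."""
    lcp = [0] * len(text)
    for rank in range(1, len(sa)):
        cur, prev = sa[rank], sa[rank - 1]
        k = 0
        while (cur + k < len(text) and prev + k < len(text)
               and text[cur + k] == text[prev + k]):
            k += 1
        lcp[cur] = k
    return lcp


T = "MSQVQVQV$"
SA = suffix_array(T)
LCP = lcp_by_position(T, SA)
print(SA)   # [8, 0, 6, 4, 2, 1, 7, 5, 3]
print(LCP)  # [0, 0, 4, 3, 2, 1, 0, 0, 0]
```

Here `SA[2]` is 6 and `LCP[2]` is 4, matching the values discussed in the caption.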
The construction time of the four algorithms for the two databases
| Algorithm | IPI-Human | Uniprot/SwissProt |
|---|---|---|
| Prefix-doubling algorithm LS | 290.7 | -- |
| Recursive algorithm DC3 | 109.6 | -- |
| Induced algorithm MF | 24.3 | 111.5 |
| Induced algorithm MP | 17.9 | 92.0 |
The table gives no times for algorithms LS and DC3 on the Uniprot/SwissProt V56.2 database, because both algorithms required too much memory and the suffix array could not be constructed in memory. The unit of time is seconds.
Figure 2. Shared processes of the three workflows. Spectra are sorted by their precursor masses and the candidate peptides are obtained from a protein sequence database. Then all of the spectra within the specified mass tolerance window are found for the candidate peptide, and each peptide-spectrum match is scored. Evaluation and protein inference occur at the end of the matching and scoring stage.
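The matching step described in this caption can be sketched with a binary search over the sorted precursor masses. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function name `spectra_in_window`, its parameters, and the example masses are all hypothetical, and the tolerance is taken to be symmetric around the peptide mass.

```python
from bisect import bisect_left, bisect_right


def spectra_in_window(sorted_masses, peptide_mass, tol, unit="Da"):
    """Indices of spectra whose precursor mass lies within +/- tol of
    peptide_mass. sorted_masses must be in ascending order."""
    if unit == "Da":
        delta = tol
    elif unit == "ppm":
        delta = peptide_mass * tol * 1e-6
    else:
        raise ValueError("unit must be 'Da' or 'ppm'")
    lo = bisect_left(sorted_masses, peptide_mass - delta)
    hi = bisect_right(sorted_masses, peptide_mass + delta)
    return range(lo, hi)


# Hypothetical precursor masses, sorted once up front.
masses = sorted([1500.7, 1000.2, 799.4, 1003.5, 1001.9])

# A +/- 3 Da window, as in Exp1, around a hypothetical peptide mass.
print(list(spectra_in_window(masses, 1001.0, 3.0, "Da")))    # [1, 2, 3]
# A +/- 12 ppm window, as in Exp2, is far narrower.
print(list(spectra_in_window(masses, 1001.0, 12.0, "ppm")))  # []
```

Sorting the spectra once means each candidate peptide costs only two binary searches, regardless of how many spectra fall outside its window.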
Figure 3. An example of the algorithm GetAllSubStrings. The original string T is {MSQVQVQV$}, and the LCP is {0, 0, 4, 3, 2, 1, 0, 0, 0}. Because '$' does not belong to the protein sequence database, it is omitted from the For loop. Take the suffix "VQVQV" as an example. Its LCP is 3, so it generates only substrings of length 4 (LCP plus one) or more, namely "VQVQ" and "VQVQV".
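The idea in this caption can be sketched as follows. The code is an illustrative reading of the caption rather than the published GetAllSubStrings routine: the name `get_all_substrings` is hypothetical, the LCP values are hard-coded from the figure, and digestion rules and peptide length limits are left out.

```python
def get_all_substrings(text: str, lcp: list[int]):
    """Yield every distinct substring of text exactly once, skipping the
    trailing '$'. For the suffix starting at position i, prefixes of length
    <= lcp[i] were already produced by its lexicographical predecessor, so
    only lengths from lcp[i] + 1 upward are emitted."""
    n = len(text) - 1  # exclude the '$' terminator
    for i in range(n):
        for length in range(lcp[i] + 1, n - i + 1):
            yield text[i:i + length]


T = "MSQVQVQV$"
LCP = [0, 0, 4, 3, 2, 1, 0, 0, 0]  # indexed by text position, from the figure

subs = list(get_all_substrings(T, LCP))
print([s for s in subs if s.startswith("VQVQ")])  # ['VQVQ', 'VQVQV']
print(len(subs), len(set(subs)))                  # 26 26
```

The two counts agree, so no substring is generated twice, which is how the LCP array removes redundant peptides without a separate deduplication pass.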