Zhen-Lin Chen, Jia-Ming Meng, Yong Cao, Ji-Li Yin, Run-Qian Fang, Sheng-Bo Fan, Chao Liu, Wen-Feng Zeng, Yue-He Ding, Dan Tan, Long Wu, Wen-Jing Zhou, Hao Chi, Rui-Xiang Sun, Meng-Qiu Dong, Si-Min He.
Abstract
We describe pLink 2, a search engine with higher speed and reliability for proteome-scale identification of cross-linked peptides. With a two-stage open search strategy facilitated by fragment indexing, pLink 2 is ~40 times faster than pLink 1 and 3 to 10 times faster than Kojak. Furthermore, using simulated datasets, synthetic datasets, 15N metabolically labeled datasets, and entrapment databases, four analysis methods were designed to evaluate the credibility of ten state-of-the-art search engines. This systematic evaluation shows that pLink 2 outperforms the other search engines in precision and sensitivity, especially at proteome scales. Lastly, re-analysis of four published proteome-scale cross-linking datasets with pLink 2 required only a fraction of the time used by pLink 1, with up to 27% more cross-linked residue pairs identified. pLink 2 is therefore an efficient and reliable tool for cross-linking mass spectrometry analysis, and the systematic evaluation methods described here will be useful for future software development.
Year: 2019 PMID: 31363125 PMCID: PMC6667459 DOI: 10.1038/s41467-019-11337-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1. pLink 2 workflow. a The general workflow. Step 1, MS1 scans are preprocessed by pParse to extract precursor candidates. Step 2, for each MS2 spectrum, α-peptide candidates are retrieved from the fragment index using query peaks generated from the spectrum. Step 3, β-peptide candidates are retrieved from the peptide index using the complementary masses of α-peptides. Step 4, α- and β-peptide candidates are paired and fine-scored with the MS2 spectrum. Step 5, all top scored PSMs are re-ranked and filtered after FDR control. b The sub-workflow of α-peptide retrieval. For each MS2 spectrum, the peaks are converted into regular b, y ions to query the fragment index. Only those peptides with at least two matched ions are coarse-scored with the spectrum and the top-5 coarse-scored α-peptide candidates are kept. c The sub-workflow of β-peptide retrieval. For each α-peptide candidate, the open mass is first calculated by subtracting the α-peptide mass and the cross-linker mass from the precursor mass, and this mass is used to retrieve β-peptide candidates from the peptide index. Then, each of the five α-peptide candidates is paired with each of its complementary β-peptide candidates and these pairs are fine-scored with the spectrum. Finally, the highest fine-scored peptide pair is kept. d The re-ranking algorithm. PSMs are grouped into intra-protein, inter-protein, loop-linked, mono-linked, and regular groups, and a semi-supervised learning algorithm is used to re-score and re-rank them in each group.
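The α-peptide and β-peptide retrieval steps in panels b and c can be illustrated with a minimal Python sketch. It is not pLink 2's implementation: the bin width `BIN`, the use of a plain matched-ion count as the coarse score, the mass tolerance, the `BS3_LINKER` value, and all function names are assumptions made for illustration, and only a few residue masses are included. The sketch builds a fragment index over regular b and y ions, keeps up to five α-peptide candidates that have at least two matched ions, and then retrieves β-peptide candidates by the open mass (precursor mass minus α-peptide mass minus cross-linker mass).

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict

# Monoisotopic residue masses in Da (small subset, for illustration only).
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "L": 113.08406,
           "K": 128.09496, "E": 129.04259, "R": 156.10111}
WATER, PROTON = 18.01056, 1.00728
BS3_LINKER = 138.06808  # assumed mass added by one BS3 cross-link (Da)
BIN = 0.02              # hypothetical fragment bin width (Da)

def peptide_mass(seq):
    """Neutral monoisotopic mass of a linear peptide."""
    return sum(RESIDUE[aa] for aa in seq) + WATER

def regular_by_ions(seq):
    """Singly charged regular b and y ion m/z values of a peptide."""
    total = sum(RESIDUE[aa] for aa in seq)
    ions, prefix = [], 0.0
    for aa in seq[:-1]:
        prefix += RESIDUE[aa]
        ions.append(prefix + PROTON)                  # b ion
        ions.append(total - prefix + WATER + PROTON)  # y ion
    return ions

def build_fragment_index(peptides):
    """Map each fragment m/z bin to the ids of peptides producing it."""
    index = defaultdict(set)
    for pid, seq in enumerate(peptides):
        for mz in regular_by_ions(seq):
            index[round(mz / BIN)].add(pid)
    return index

def retrieve_alpha(peaks, index, top_k=5, min_matched=2):
    """Return up to top_k peptide ids ranked by number of matched peaks."""
    matched = Counter()
    for mz in peaks:
        b = round(mz / BIN)
        # Look in the neighbouring bins too, so bin edges do not hide a match.
        hits = index.get(b - 1, set()) | index.get(b, set()) | index.get(b + 1, set())
        for pid in hits:
            matched[pid] += 1
    ranked = sorted((-n, pid) for pid, n in matched.items() if n >= min_matched)
    return [pid for _, pid in ranked[:top_k]]

def build_peptide_index(peptides):
    """Peptide index: (mass, id) pairs sorted by mass."""
    return sorted((peptide_mass(seq), pid) for pid, seq in enumerate(peptides))

def retrieve_beta(precursor_mass, alpha_mass, linker_mass, peptide_index, tol=0.01):
    """Retrieve peptide ids whose mass equals the open mass within tol."""
    open_mass = precursor_mass - alpha_mass - linker_mass
    masses = [m for m, _ in peptide_index]
    lo = bisect_left(masses, open_mass - tol)
    hi = bisect_right(masses, open_mass + tol)
    return [pid for _, pid in peptide_index[lo:hi]]
```

In this sketch each returned α-peptide candidate would then be paired with every β-peptide candidate returned for it, and the pairs would be fine-scored against the spectrum; the fine-scoring function itself is not described in this excerpt and is therefore omitted.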
The performance of ten search engines on the Simulated-BS3 datasetᵃ
| Search engine | Search strategy | Sensitivity (%) | Precision (%) | Run time (min) | Selected |
|---|---|---|---|---|---|
| xQuestᵇ | Exhaustive | – | – | – | No |
| Xilmassᶜ | Exhaustive | – | – | – | No |
| Xolik | Exhaustive | 40.7 | 93.7 | 0.7 | No |
| StavroX | Exhaustive | 50.4 | 78.6 | 363.9 | No |
| Xi | Open | 62.5 | 59.1 | 9.3 | No |
| MetaMorpheusXL | Open | 71.0 | 97.7 | 0.3 | No |
| Protein Prospector | Open | 78.6 | 97.2 | 16.9 | No |
| Kojak | Open | 85.3 | 97.8 | 1.7 | Yes |
| pLink 1 | Open | 99.8 | 99.8 | 12.3 | Yes |
| pLink 2 | Open | 99.9 | 100.0 | 0.5 | Yes |
ᵃ For sensitivity, precision, and run time, the average values obtained using three randomly generated Simulated-BS3 datasets are shown
ᵇ xQuest threw an exception “Illegal division by zero at /home/xqxp/xquest/V2_1_1/xquest/bin/compare_peaks3.pl line 2246” and did not report any results
ᶜ Xilmass threw an exception “java.lang.OutOfMemoryError: GC overhead limit exceeded” and did not report any results
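The sensitivity and precision columns can be read with the usual definitions in mind. This excerpt does not spell out the formulas, so the following Python sketch is an assumption: sensitivity is taken as the share of ground-truth spectra that were identified correctly, and precision as the share of reported identifications that are correct. The function name and the dictionary-based input format are hypothetical.

```python
def sensitivity_precision(truth, reported):
    """Return (sensitivity %, precision %) for one search result.

    truth and reported map a spectrum id to the identified peptide pair.
    Assumed definitions: sensitivity = correct / all ground-truth spectra,
    precision = correct / all reported identifications.
    """
    correct = sum(1 for sid, ident in reported.items() if truth.get(sid) == ident)
    sensitivity = 100.0 * correct / len(truth) if truth else 0.0
    precision = 100.0 * correct / len(reported) if reported else 0.0
    return sensitivity, precision
```

On a simulated dataset the ground truth is known for every spectrum, which is what makes both quantities directly computable.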
Fig. 2. Performance evaluation on the Synthetic-BS3 dataset. a Venn diagram for the results of Kojak, pLink 1, pLink 2, and the benchmark. A total of 904 PSMs were correctly identified by all three engines; these were used as a new, fair standard dataset. b The numbers of PSMs correctly identified by each search engine. c The percentage of correct α-peptides ranked in the top-k in the open search stage of pLink 2. d Similar to c, but for β-peptides. The “Original” database contains only the sequences of the 38 synthetic peptides, the “+ E. coli” database contains the sequences from the “Original” database plus the E. coli whole proteome database, and “+ Worm” and “+ Human” are constructed in the same way as “+ E. coli”.
Fig. 3. Performance evaluation on the E.coli-Leiker-15N dataset. a Experimental design of the E.coli-Leiker-15N dataset. The unlabeled and 15N metabolically labeled E. coli lysates were cross-linked separately, mixed at a 1:1 ratio, digested with trypsin, and analyzed by LC-MS/MS. The dataset was searched only for the unlabeled peptides using different search engines, and the identification results were passed to pQuant to quantify the intensity ratio of the 15N-labeled precursor to the unlabeled precursor. Lastly, the precision of identifications was investigated by checking the percentage of NaN-ratio PSMs and peptides. b Analyses of the identified cross-linked PSMs. c Analyses of the identified cross-linked peptide pairs. The histograms denote the total numbers of (b) PSMs or (c) peptide pairs identified by each search engine under separate FDR control of intra-protein and inter-protein results, and the curves denote the percentage of NaN-ratio (b) PSMs or (c) peptide pairs in the corresponding histograms. d For intra-protein PSMs, more results were reported under separate FDR control and its percentage of NaN ratios was slightly higher than that under global FDR control. e For inter-protein PSMs, many fewer results were reported under separate FDR control and its percentage of NaN ratios decreased, especially for pLink 1.
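The NaN-ratio check described in this caption reduces to a simple percentage. The Python sketch below assumes that the quantification step yields one 15N-to-unlabeled intensity ratio per identification and that a ratio which could not be computed is stored as NaN; the function name and the plain list input are hypothetical and do not reflect pQuant's actual output format.

```python
import math

def nan_ratio_percentage(ratios):
    """Percentage of identifications whose 15N/unlabeled ratio is NaN.

    ratios: one float per identified PSM or peptide pair, with
    float("nan") where no ratio could be quantified (assumed convention).
    """
    if not ratios:
        return 0.0
    n_nan = sum(1 for r in ratios if math.isnan(r))
    return 100.0 * n_nan / len(ratios)
```

Because the labeled and unlabeled samples are mixed at a 1:1 ratio, a lower NaN-ratio percentage is read in the caption as an indication of higher identification precision.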
Fig. 4. Performance evaluation on the SCF(FBXL3)-BS3 dataset. a A real-world protein complex sample was searched using Kojak, pLink 1, and pLink 2. A total of 850 cross-linked PSMs were identified by all three engines; these were used as a new, fair standard dataset. b The sensitivities and precisions of the three engines. The “+ E. coli” database contains the sequences of the 146 target proteins plus the E. coli whole proteome database, and “+ Worm” and “+ Human” are constructed in the same way as “+ E. coli”. pLink 1 did not finish searching against the worm or the human entrapment databases within 1 week on a single computer when five variable modifications were set.
The normalized computing times of the three search engines on eight datasets
| Dataset | pLink 1 | Kojak | pLink 2 | Real time (min)ᵃ |
|---|---|---|---|---|
| Simulated-BS3 | 24.6 | 3.4 | 1.0 | 0.5 |
| Simulated-SS | 12.7 | 1.7 | 1.0 | 0.7 |
| Synthetic-BS3 + Original | 35.5 | 2.3 | 1.0 | 0.1 |
| Synthetic-BS3 + E. coli | 45.4 | 1.7 | 1.0 | 0.5 |
| Synthetic-BS3 + Worm | 50.8 | 2.4 | 1.0 | 3.5 |
| Synthetic-BS3 + Human | 38.3 | 2.1 | 1.0 | 6.4 |
| Synthetic-SS + Original | 32.7 | 2.6 | 1.0 | 0.2 |
| Synthetic-SS + E. coli | 60.2 | 3.3 | 1.0 | 1.2 |
| Synthetic-SS + Worm | 64.3 | 3.8 | 1.0 | 17.7 |
| Synthetic-SS + Human | 69.7 | 4.0 | 1.0 | 31.9 |
| E.coli-Leiker-15N | 31.4 | 1.6 | 1.0 | 142.9 |
| E.coli-SS-15N | 62.6 | 4.2 | 1.0 | 200.7 |
| SCF(FBXL3)-BS3 + Original | 20.0 | 1.7 | 1.0 | 260.6 |
| SCF(FBXL3)-BS3 + E. coli | 46.0 | 3.8 | 1.0 | 22.0 |
| SCF(FBXL3)-BS3 + Worm | –ᵇ | 4.5 | 1.0 | 275.7 |
| SCF(FBXL3)-BS3 + Human | –ᵇ | 4.2 | 1.0 | 385.1 |
| Cav1.1-SS + Original | 20.4 | 2.7 | 1.0 | 2.9 |
| Cav1.1-SS + E. coli | 31.8 | 3.4 | 1.0 | 1.8 |
| Cav1.1-SS + Worm | 36.1 | 4.0 | 1.0 | 34.5 |
| Cav1.1-SS + Human | 40.2 | 5.1 | 1.0 | 45.7 |
| Average | 40.2 | 3.1 | 1.0 | – |
ᵃ The real search times for pLink 2 are shown in minutes
ᵇ pLink 1 did not finish searching against the worm or the human entrapment databases within 1 week when five variable modifications were set
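Since every time in this table is normalized to pLink 2 and the last column gives the real pLink 2 time in minutes, the approximate real time of another engine can be recovered by multiplication. For the Simulated-BS3 row this gives 24.6 × 0.5 = 12.3 min for pLink 1 and 3.4 × 0.5 = 1.7 min for Kojak, in agreement with the run times reported for the Simulated-BS3 dataset. A small Python sketch of the same arithmetic follows; the function name is hypothetical.

```python
def real_time_minutes(normalized_time, plink2_minutes):
    """Convert a pLink 2-normalized time back to minutes.

    normalized_time: engine time divided by the pLink 2 time (table value).
    plink2_minutes: real pLink 2 search time for the same dataset.
    """
    return normalized_time * plink2_minutes

# Simulated-BS3 row: pLink 1 = 24.6, Kojak = 3.4, pLink 2 real time = 0.5 min.
plink1_minutes = real_time_minutes(24.6, 0.5)
kojak_minutes = real_time_minutes(3.4, 0.5)
```

Because the table values are rounded to one decimal place, times recovered this way are approximate.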
Fig. 5. Increased speed of pLink 2 over Kojak on the Synthetic-BS3 dataset. a pLink 2 achieved a 3.9-fold speed-up when searching against the E. coli entrapment database. The horizontal axis is the number of top-k scored single peptides kept in Kojak, starting from its default value of 250. Speed-up was measured at the point where the sensitivity of Kojak remained steady. b, c Similar to a, but against (b) the worm and (c) the human entrapment databases, respectively.