Ngoc Hieu Tran, M Ziaur Rahman, Lin He, Lei Xin, Baozhen Shan, Ming Li.
Abstract
De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies, for which genome information is often limited or unavailable. However, due to limitations in peptide fragmentation and coverage, as well as ambiguities in spectrum interpretation, complete de novo assembly of unknown protein sequences remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequenced peptides, their quality scores, and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated the performance of ALPS on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216-441 AA, at 100% coverage and 96.64-100% accuracy.
Year: 2016 PMID: 27562653 PMCID: PMC4999880 DOI: 10.1038/srep31730
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. ALPS system for automated and complete de novo assembly of monoclonal antibody sequences.
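The abstract describes combining peptides and their quality scores in a weighted de Bruijn graph. The following is a minimal sketch of that general idea, not the ALPS implementation: the function names, the toy peptides, the use of the minimum per-residue score as a k-mer weight, and the greedy path walk are all assumptions made for illustration.

```python
from collections import defaultdict


def build_weighted_graph(peptides, k):
    """Build a weighted de Bruijn graph from (sequence, per-residue scores) pairs.

    Nodes are (k-1)-mers; each k-mer adds weight to the edge prefix -> suffix.
    Assumption: a k-mer's weight is the minimum residue score inside it.
    """
    edges = defaultdict(dict)
    for seq, scores in peptides:
        seq = seq.replace("I", "L")  # I and L are isobaric, so treat them as one letter
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            weight = min(scores[i:i + k])
            prefix, suffix = kmer[:-1], kmer[1:]
            edges[prefix][suffix] = edges[prefix].get(suffix, 0.0) + weight
    return dict(edges)


def greedy_contig(edges, start):
    """Extend a contig from `start`, always following the heaviest outgoing edge."""
    contig, node, visited = start, start, {start}
    while node in edges:
        nxt = max(edges[node], key=edges[node].get)
        if nxt in visited:  # stop instead of looping through a cycle
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical toy input: three overlapping peptides plus one low-scoring erroneous read.
toy_peptides = [
    ("DIVMTQSP", [0.9] * 8),
    ("VMTQSPAS", [0.8] * 8),
    ("TQSPASLA", [0.9] * 8),
    ("VMTQSPGG", [0.2] * 8),  # conflicting tail with low confidence
]
graph = build_weighted_graph(toy_peptides, k=4)
contig = greedy_contig(graph, "DLV")
print(contig)
```

In this toy case the branch supported by two confident peptides outweighs the low-scoring conflicting tail, which is the intuition behind weighting edges by quality rather than by raw k-mer frequency.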
Summary of ALPS De Novo Assembly Results on Antibody Datasets.
| | WIgG1–Light (219 AA) | WIgG1–Heavy (441 AA) | Human–Light (216 AA) | Human–Heavy (446 AA) |
|---|---|---|---|---|
| Assembly Results | Full-length contig from de Bruijn assembler | Full-length contig from de Bruijn assembler | Full-length contig from de Bruijn assembler | 3 contigs (lengths 346, 92, 67) from de Bruijn assembler; Complete sequence merged from 3 contigs |
| Target Sequence Coverage (%) | 100.00 | 100.00 | 100.00 | 100.00 |
| Target Sequence Accuracy (%) | 100.00 | 99.09 | 100.00 | 96.64 |
The target sequence coverage was calculated as the percentage of amino acids of the target sequence that were covered by at least one contig. The target sequence accuracy was calculated as the percentage of matched amino acids. Isoleucine-to-leucine (I-to-L) differences were not counted as mismatches.
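The two metrics defined above can be sketched as follows. This is an illustrative reading, not the authors' evaluation code. It assumes that contigs are already placed on the target as ungapped `(start, sequence)` pairs, that accuracy uses the target length as its denominator, and that a position counts as matched if any contig matches it. The function name and toy sequences are hypothetical.

```python
def coverage_and_accuracy(target, placements):
    """Return (coverage %, accuracy %) of a target given ungapped contig placements.

    placements: list of (start_index, contig_sequence) on the target.
    I and L are treated as the same residue, as stated in the table note.
    """
    def normalize(seq):
        return seq.replace("I", "L")

    t = normalize(target)
    covered = [False] * len(t)
    matched = [False] * len(t)
    for start, contig in placements:
        for offset, residue in enumerate(normalize(contig)):
            pos = start + offset
            if 0 <= pos < len(t):
                covered[pos] = True
                # Assumption: one matching contig is enough to call the position matched.
                if residue == t[pos]:
                    matched[pos] = True
    coverage = 100 * sum(covered) / len(t)
    accuracy = 100 * sum(matched) / len(t)
    return coverage, accuracy


# Hypothetical toy target and contigs: an I/L difference at index 3, a real mismatch at index 6.
toy_target = "ACDIFGHK"
toy_placements = [(0, "ACDLF"), (4, "FGYK")]
print(coverage_and_accuracy(toy_target, toy_placements))
```

Here every target position is covered, the I/L difference is ignored, and the single Y-for-H substitution is the only mismatch.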
Length (AA), Number of Amino Acids Recovered (AA), Target Sequence Coverage (%), and Contig Assembly Accuracy (%) of the Longest Contigs for Antibody Datasets.
| | WIgG1–Light (219 AA) | WIgG1–Heavy (441 AA) | Human–Light (216 AA) | Human–Heavy (446 AA) |
|---|---|---|---|---|
| PSM-DN with frequencies | 114; 109; 49.77; 95.61 | 143; 129; 29.25; 90.21 | 175; 170; 78.70; 97.14 | 98; 74; 16.59; 75.51 |
| PSM-DN with weights | 109; 109; 49.77; 100.00 | 219; 194; 43.99; 88.58 | 175; 170; 78.70; 97.14 | 154; 121; 27.13; 78.57 |
| PSM-DD | 219; 219; 100.00; 100.00 | 453; 441; 100.00; 97.35 | 216; 216; 100.00; 100.00 | 346; 344; 77.13; 99.42 |
| PSM-DDS | 219; 219; 100.00; 100.00 | 442; 441; 100.00; 99.77 | 216; 216; 100.00; 100.00 | 346; 344; 77.13; 99.42 |
The target sequence coverage was calculated as the percentage of amino acids of the target sequence that were covered by the respective longest contig. The contig assembly accuracy was calculated as the percentage of correct amino acids of the longest contig that were aligned to the respective target sequence.
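Each cell in the table above lists four values in the order given by the caption: contig length, amino acids recovered, coverage, and accuracy. A short arithmetic check, assuming coverage is recovered residues divided by target length and accuracy is recovered residues divided by contig length, reproduces the reported percentages for the cells tried below. The function name is hypothetical.

```python
def longest_contig_metrics(contig_len, recovered, target_len):
    """Return (coverage %, accuracy %) of a longest contig, rounded to two decimals."""
    coverage = round(100 * recovered / target_len, 2)  # share of the target recovered
    accuracy = round(100 * recovered / contig_len, 2)  # share of the contig that is correct
    return coverage, accuracy


# (contig length, residues recovered, target length) taken from the table cells and headers.
cells = {
    "PSM-DN with frequencies, WIgG1-Light": (114, 109, 219),
    "PSM-DD, WIgG1-Heavy": (453, 441, 441),
    "PSM-DDS, WIgG1-Heavy": (442, 441, 441),
    "PSM-DDS, Human-Heavy": (346, 344, 446),
}
for name, (contig_len, recovered, target_len) in cells.items():
    print(name, longest_contig_metrics(contig_len, recovered, target_len))
```

For example, the PSM-DD heavy-chain contig covers the whole target but is 12 residues longer than it, which is why its accuracy is below 100% while its coverage is 100%.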
Figure 2. Assembly results for the WIgG1 light chain.
(A) BLAST alignment of the top assembled contigs from list PSM-DN against the target light chain. (B) Zoom-in details of the alignment in (A). (C) BLAST alignment of the full-length contig assembled from list PSM-DD against the target light chain. (D) Details of the alignment in (C).
Figure 3. Assembly results for the WIgG1 heavy chain.
(A) BLAST alignment of the top assembled contigs from list PSM-DN against the target heavy chain. (B) BLAST alignment of the full-length contig assembled from list PSM-DDS against the target heavy chain. (C) Details of the alignment in (B).
Figure 4. Assembly results for the HUMAN heavy chain.
(A) BLAST alignment of the top assembled contigs from list PSM-DDS against the target heavy chain. (B) BLAST alignment of the template-alignment-based merging of PSM-DDS contigs against the target heavy chain. (C) Details of the alignment in (B).
Summary of De Novo Assembly Results on 6-protein Mixture Dataset.
| | leptin (167 AA) | kallikrein (261 AA) | groEL (548 AA) | myoglobin (154 AA) | aprotinin (100 AA) | peroxidase (353 AA) |
|---|---|---|---|---|---|---|
| Meta-SPS (with κ ≥ 1) | | | | | | |
| Longest Contig (AA) | 93 | 134 | 194 | 80 | 59 | 58 |
| Sequencing Coverage (%) | 86.2 | 87.7 | 92.5 | 92.2 | 64.0 | 67.4 |
| Sequencing Accuracy (%) | 100.0 | 98.5 | 97.7 | 99.3 | 80.0 | 100.0 |
| ALPS (with list PSM-DDS, 7-mers) | | | | | | |
| Longest Contig (AA) | 131 | 77 | 444 | 118 | 65 | 92 |
| Sequencing Coverage (%) | 87.4 | 83.5 | 99.1 | 99.4 | 65.0 | 66.6 |
| Sequencing Accuracy (%) | 98.6 | 96.8 | 99.8 | 98.0 | 95.4 | 95.3 |
Sequencing coverage was calculated as the percentage of amino acids of the protein sequence that were covered by at least one contig. Sequencing accuracy was calculated as the percentage of all annotated sequence calls that were labeled correct. The Meta-SPS results were reported in ref. 16.