Joon-Yong Lee, Hugh D Mitchell, Meagan C Burnet, Ruonan Wu, Sarah C Jenson, Eric D Merkley, Ernesto S Nakayasu, Carrie D Nicora, Janet K Jansson, Kristin E Burnum-Johnson, Samuel H Payne.
Abstract
Metaproteomics is increasingly used for high-throughput characterization of proteins in complex environments and has been shown to provide insights into microbial composition and functional roles. However, significant challenges remain in metaproteomic data analysis, including the creation of a sample-specific protein sequence database. A well-matched database is a requirement for successful metaproteomics analysis, and the accuracy and sensitivity of peptide-spectrum match (PSM) identification algorithms suffer when the database is incomplete or contains extraneous sequences. When matched DNA sequencing data for the sample are unavailable or incomplete, creating a proteome database that accurately represents the organisms in the sample is a challenge. Here, we leverage a de novo peptide sequencing approach to identify the sample composition directly from metaproteomic data. First, we created a deep learning model, Kaiko, to predict peptide sequences from mass spectrometry data and trained it on 5 million peptide-spectrum matches from 55 phylogenetically diverse bacteria. After training, Kaiko successfully identified organisms from soil isolates and synthetic communities directly from proteomics data. Finally, we created a pipeline for metaproteome database generation using Kaiko. We tested the pipeline on native soils collected in Kansas, showing that the de novo sequencing model can be employed as an alternative and complementary method to construct the sample-specific protein database instead of relying on (un)matched metagenomes. Our pipeline identified all highly abundant taxa from 16S rRNA sequencing of the soil samples and uncovered several additional species that were strongly represented only in the proteomic data.
Keywords: de novo sequencing; deep learning model; metaproteomics; soil microbiome
Year: 2022 PMID: 35793793 PMCID: PMC9361346 DOI: 10.1021/acs.jproteome.2c00334
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 5.370
Figure 1. Training, validation, and testing of a new de novo peptide identification algorithm. (A) Bacteria represented in the training (white nodes) and testing (red nodes) data sets, shown in a phylogenetic tree built from the multiple sequence alignment of rplB. The size of each node is scaled to the number of spectra used. (B) Accuracy of spectrum annotation for four de novo spectrum annotation tools. (C) Accuracy of spectrum annotation for each of the four algorithms as a function of peptide sequence length.
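The per-length accuracy described for panel C can be computed by grouping spectra by the length of the reference peptide. The following is a minimal sketch, assuming that a prediction counts as correct only on an exact string match; the study's criterion may differ (for instance, isobaric residues might be treated as equivalent). The function name and the toy peptides are hypothetical.

```python
from collections import defaultdict


def accuracy_by_length(pairs):
    """Return {peptide length: fraction of exactly matching predictions}.

    pairs: iterable of (predicted, reference) peptide strings.
    Exact string equality is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, reference in pairs:
        length = len(reference)
        total[length] += 1
        if predicted == reference:
            correct[length] += 1
    # Sort by length so the result reads like the x-axis of a plot.
    return {length: correct[length] / total[length] for length in sorted(total)}


# Hypothetical toy data: three spectra of length 8, one of length 4.
toy_pairs = [
    ("PEPTIDEK", "PEPTIDEK"),
    ("PEPTLDEK", "PEPTIDEK"),  # one wrong residue counts as a miss
    ("ACDEFGHK", "ACDEFGHK"),
    ("LLNK", "LLNK"),
]
by_length = accuracy_by_length(toy_pairs)
```

Running the same function once per tool on a shared set of reference peptides would give one curve per algorithm.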
Figure 2. Overview of the metaproteomics data analysis leveraging de novo spectrum identification based on the Kaiko model. Peptides are identified using Kaiko and used to infer community composition (steps 1–3). In step 4, the spectra are reanalyzed using a database search algorithm, e.g., MSGF+, and the protein sequence database created in step 3. This yields a final list of peptide identifications which can be used for functional analysis.
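Steps 1 to 3 of this workflow can be illustrated with a small sketch. It is not the authors' implementation: it assumes a precomputed lookup from peptide to taxa (in practice this would come from matching de novo peptides against a reference protein collection) and a simple top-N cutoff for choosing taxa. All names, the cutoff, and the toy sequences are hypothetical.

```python
from collections import Counter


def rank_taxa(denovo_peptides, peptide_to_taxa):
    """Count, per taxon, how many de novo peptides map to it."""
    counts = Counter()
    for peptide in denovo_peptides:
        # Peptides with no match in the lookup contribute nothing.
        for taxon in peptide_to_taxa.get(peptide, ()):
            counts[taxon] += 1
    return counts


def build_database(counts, proteomes, top_n):
    """Concatenate FASTA records for the top_n best-supported taxa."""
    selected = [taxon for taxon, _ in counts.most_common(top_n)]
    records = []
    for taxon in selected:
        for protein_id, sequence in proteomes.get(taxon, {}).items():
            records.append(f">{taxon}|{protein_id}\n{sequence}")
    return selected, "\n".join(records)


# Hypothetical inputs standing in for de novo output and reference proteomes.
denovo_peptides = ["AAK", "CCR", "DDK", "EEK", "ZZZ"]
peptide_to_taxa = {
    "AAK": ["taxon_A"],
    "CCR": ["taxon_A", "taxon_B"],
    "DDK": ["taxon_A"],
    "EEK": ["taxon_C"],
}
proteomes = {
    "taxon_A": {"p1": "MAAKDDK"},
    "taxon_B": {"p2": "MCCR"},
    "taxon_C": {"p3": "MEEK"},
}

counts = rank_taxa(denovo_peptides, peptide_to_taxa)
selected, fasta = build_database(counts, proteomes, top_n=1)
```

The resulting FASTA text would then serve as the sample-specific database for the step 4 database search.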
Relative Abundance of the Top 20 Bacterial Phyla Detected from 16S and Kaiko
| Phylum | Read counts by 16S | Peptide counts by Kaiko | Relative read counts by 16S (% of total reads at the phylum level) | Relative peptide counts by Kaiko (% of total peptides at the phylum level) |
|---|---|---|---|---|
| Proteobacteria* | 40778 | 4903 | 34.6 | 38.1 |
| Actinobacteria* | 16501 | 3949 | 14.0 | 30.7 |
| Acidobacteria* | 18562 | 1010 | 15.7 | 7.8 |
| Firmicutes* | 6761 | 634 | 5.7 | 4.9 |
| Chloroflexi* | 767 | 479 | 0.7 | 3.7 |
| Bacteroidetes* | 9712 | 467 | 8.2 | 3.6 |
| Planctomycetes* | 11427 | 321 | 9.7 | 2.5 |
| [name missing] | - | 266 | - | 2.1 |
| Verrucomicrobia* | 11841 | 237 | 10.0 | 1.8 |
| Cyanobacteria | 489 | 162 | 0.4 | 1.3 |
| Gemmatimonadetes* | 869 | 61 | 0.7 | 0.5 |
| Nitrospirae* | 18 | 44 | - | 0.3 |
| [name missing] | - | 43 | - | 0.3 |
| [name missing] | - | 32 | - | 0.2 |
| [name missing] | - | 32 | - | 0.2 |
| [name missing] | - | 15 | - | 0.1 |
| Tenericutes | 99 | 15 | 0.1 | 0.1 |
| Armatimonadetes | 75 | 13 | 0.1 | 0.1 |
| Ignavibacteriae | 16 | 6 | 0.01 | 0.05 |
| Chlamydiae | 2 | 4 | 0.00 | 0.03 |
A dash in the table indicates that the corresponding phylum was not detected. An asterisk (*) indicates that some taxa in the corresponding phylum were used to construct the protein DB.
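The relative read counts in the table are each phylum's share of a total. A minimal sketch of that calculation follows, assuming the denominator is the sum of the listed 16S read counts; the study's totals may include phyla outside the top 20, so small differences from the published percentages are possible. The function name is hypothetical.

```python
def relative_abundance(counts):
    """Convert raw counts to percentages of their sum."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# 16S read counts copied from the table (phyla with a dash are omitted).
reads_16s = {
    "Proteobacteria": 40778,
    "Actinobacteria": 16501,
    "Acidobacteria": 18562,
    "Firmicutes": 6761,
    "Chloroflexi": 767,
    "Bacteroidetes": 9712,
    "Planctomycetes": 11427,
    "Verrucomicrobia": 11841,
    "Cyanobacteria": 489,
    "Gemmatimonadetes": 869,
    "Nitrospirae": 18,
    "Tenericutes": 99,
    "Armatimonadetes": 75,
    "Ignavibacteriae": 16,
    "Chlamydiae": 2,
}

percent_16s = relative_abundance(reads_16s)
```

The same function applies to the peptide counts, provided the appropriate total is used as the denominator.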
Figure 3. Distribution of bacterial functions in the metabolic pathway map. Several metabolic steps are shared among multiple phyla (dark gray). Other colors indicate unique EC numbers and their associated metabolic function found only in a specific phylum.
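The split between shared and phylum-specific functions can be expressed as a simple set computation over EC numbers. The sketch below is illustrative only: the function name is hypothetical, and the assignment of EC numbers to placeholder phyla is made up rather than taken from the study.

```python
from collections import Counter


def split_shared_unique(ec_by_phylum):
    """Separate EC numbers seen in several phyla from those seen in one.

    ec_by_phylum: {phylum name: iterable of EC number strings}.
    Returns (shared EC numbers, {phylum: EC numbers unique to it}).
    """
    # Count in how many phyla each EC number occurs (duplicates ignored).
    occurrences = Counter(
        ec for ecs in ec_by_phylum.values() for ec in set(ecs)
    )
    shared = {ec for ec, n in occurrences.items() if n > 1}
    unique = {
        phylum: {ec for ec in ecs if occurrences[ec] == 1}
        for phylum, ecs in ec_by_phylum.items()
    }
    return shared, unique


# Made-up assignments for illustration; not results from the study.
ec_by_phylum = {
    "phylum_A": ["1.1.1.1", "2.7.1.1"],
    "phylum_B": ["1.1.1.1", "4.2.1.11"],
    "phylum_C": ["1.1.1.1"],
}
shared, unique = split_shared_unique(ec_by_phylum)
```

In a pathway map, the shared set would correspond to the dark gray steps and each per-phylum set to one of the other colors.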