Christian H Ahrens, Joseph T Wade, Matthew M Champion, Julian D Langer.
Abstract
Small proteins of up to ∼50 amino acids are an abundant class of biomolecules across all domains of life. Yet due to the challenges inherent in their size, they are often missed in genome annotations, and are difficult to identify and characterize using standard experimental approaches. Consequently, we still know few small proteins even in well-studied prokaryotic model organisms. Mass spectrometry (MS) has great potential for the discovery, validation, and functional characterization of small proteins. However, standard MS approaches are poorly suited to the identification of both known and novel small proteins due to limitations at each step of a typical proteomics workflow, i.e., sample preparation, protease digestion, liquid chromatography, MS data acquisition, and data analysis. Here, we outline the major MS-based workflows and bioinformatic pipelines used for small protein discovery and validation. Special emphasis is placed on highlighting the adjustments required to improve detection and data quality for small proteins. We discuss both the unbiased detection of small proteins and the targeted analysis of small proteins of interest. Finally, we provide guidelines to prioritize novel small proteins, and an outlook on methods with particular potential to further improve comprehensive discovery and characterization of small proteins.
Keywords: LC-MS/MS; SEP; genome annotation; microprotein; proteomics; sample preparation; shotgun proteomics; small protein; sproteins; top-down proteomics
Year: 2021 PMID: 34748388 PMCID: PMC8765459 DOI: 10.1128/JB.00353-21
Source DB: PubMed Journal: J Bacteriol ISSN: 0021-9193 Impact factor: 3.476
FIG 1 Overview of the main mass spectrometry-based workflows for small protein discovery, analysis, and characterization. The large majority of studies have relied on a shotgun proteomics discovery approach (bottom-up) to identify small proteins. Top-down approaches are slowly gaining momentum but are not yet widely accessible from core facilities. Bioinformatics is important to assemble complete genomes de novo (at times using genomic DNA extracted from the same sample) and to integrate small protein predictions with experimental RNA-seq and Ribo-seq data into custom databases that allow the identification of novel small proteins by MS-based proteomics. Validation and prioritization facilitate focusing on the elucidation of function(s) of the most promising novel small proteins (yellow shading; see asterisk), an aspect that is described in more detail in the accompanying article “Small Proteins; Big Questions” (124). Shading matches that in Fig. 2. Corresponding text sections are indicated by white circles, as follows: 1B, “Sample Preparation and Data Collection—Preparation and enrichment for small proteins”; 1C, “Sample Preparation and Data Collection—Protease digestion”; 1D, “Sample Preparation and Data Collection—Liquid chromatography”; 1E, “Sample Preparation and Data Collection—Ionization and data acquisition”; 2A, “Data Analysis—Overview: the relevance of genome sequences for proteogenomics”; 2B, “Data Analysis—Creation of custom search databases”; 2C, “Data Analysis—Stringent FDR control”; 3, “Validation of Novel Small Protein Candidates”; 4, “Prioritization/Selection of Novel Small Proteins.”
FIG 2 Overview of the major steps of the most common MS-based workflows for discovery/identification of small proteins, their targeted analysis (for quantification), and for the functional characterization of novel and known small proteins. The numbering of the steps is aligned with Fig. 1, with corresponding text sections indicated by white circles, as follows: 1B, “Sample Preparation and Data Collection—Preparation and enrichment for small proteins”; 1C, “Sample Preparation and Data Collection—Protease digestion”; 1D, “Sample Preparation and Data Collection—Liquid chromatography”; 1E, “Sample Preparation and Data Collection—Ionization and data acquisition”; 2A, “Data Analysis—Overview: the relevance of genome sequences for proteogenomics”; 2B, “Data Analysis—Creation of custom search databases”; 2C, “Data Analysis—Stringent FDR control”; 3, “Validation of Novel Small Protein Candidates”; 4, “Prioritization/Selection of Novel Small Proteins.” Alternative approaches are listed and selected references provided.
A selection of small protein discovery studies for bacteria and archaea using shotgun (bottom-up) proteomics and top-down approaches without proteolytic digest
| Organism(s) | Taxonomy | Approach | Notes | Sample | Reference(s) |
|---|---|---|---|---|---|
|  | Bacteria | Shotgun | Term proteogenomics introduced; search six-frame translated genome | Whole cell lysate |  |
|  | Bacteria | Shotgun | Used proteomics data in initial genome annotation of an organism | Whole cell lysate |  |
|  | Bacteria | Shotgun | MS-based proteomics to improve genome annotation, used PTM data, studied several conditions | Whole cell lysate; subcellular fractionation |  |
|  | Archaea | Top down | Top-down approach identified five unannotated small proteins (40–76 aa) | Whole cell lysate |  |
|  | Bacteria | Shotgun | Effort to analyze the entire expressed proteome, combining different conditions and proteomics approaches | Subcellular fractionation |  |
|  | Bacteria | Shotgun | Required 2 peptides to identify a novel protein | Subcellular fractionation |  |
|  | Bacteria | Shotgun | Custom database approach that merges information from different strains | Various |  |
| 46 species (bacteria and archaea) | Bacteria, archaea | Shotgun | Large proteogenomic study; use of stringent PSM level FDR advocated | Various |  |
|  | Bacteria | Shotgun | High FDR among peptides implying novel proteins; trypsin + Lys-C | Whole cell lysate |  |
|  | Bacteria | Shotgun | Required 2 peptides to identify a novel protein; also used size exclusion chromatography | Whole cell lysate |  |
| 57 bacterial species | Bacteria | Shotgun | Large proteogenomics study on N-terminal methionine excision and PTM (N-terminal acetylation) | Various |  |
|  | Bacteria | Shotgun | Reannotated genome of organism with high GC content (transcriptomics, shotgun proteomics) | Whole cell lysate |  |
|  | Bacteria | Shotgun | Custom databases to find longer ( | Whole cell lysate |  |
|  | Bacteria | Shotgun | Proteogenomic study of model cyanobacterium (8 conditions); global profiling for PTMs (1% PSM FDR) | Whole cell lysate |  |
|  | Bacteria | Shotgun | N terminomics combined with six-frame translation database to validate/correct N termini (+ alternative proteases; Glu-C, chymotrypsin) | Whole cell lysate |  |
|  | Bacteria | Shotgun | Integrated analysis to reannotate genome (>300 transcriptome, >70 proteome datasets); evidence for internal starts, new small proteins | Various |  |
|  | Bacteria | Top down (MALDI) | Small transmembrane subunit of cbb3 oxidase | Purified protein complex |  |
| Metagenomic study of grassland soil | Bacteria, archaea | Shotgun | Metagenome-assembled genomes as basis for meta-proteomics (custom database); integrate metabolomics; beyond culturable strains | Soil extract |  |
|  | Bacteria | Shotgun | N-terminal enrichment (COFRADIC approach) ( | Whole cell lysate |  |
|  | Bacteria | Shotgun | Reannotation of a plant pathogen with shotgun data; confirmed expression of 5 novel proteins with immunoblot (c-Myc tag) | Whole cell lysate |  |
|  | Bacteria | Shotgun | Broadly applicable proteogenomic approach, custom databases validated 107/138 peptides with PRM ( | Subcellular fractionation |  |
|  | Bacteria | Shotgun | N terminome study of a strain used for microbial rehabilitation and degradation of industrial pollutants | Whole cell lysate |  |
|  | Bacteria | Top down (MALDI) | Small transmembrane subunit of bd oxidase | Purified protein complex |  |
|  | Bacteria | Shotgun | Explored small protein enrichment strategies, different proteases, database searches; validation by PRM and spectral matching | Small protein enrichment |  |
|  | Bacteria | Shotgun | Broadly applicable custom peptide DB; integrated Ribo-seq data and peptide fragmentation prediction | Whole cell lysates |  |
|  | Bacteria | Shotgun | Integrated small protein prediction with Ribo-seq, shotgun, and other OMICS data; elucidated function of novel small proteins | Various |  |
| Prokaryotes from human microbiota | Bacteria, archaea | Shotgun | Prediction of ∼4500 small protein families <50 aa; experimental evidence for selected examples (meta-transcriptomics/proteomics) | Various |  |
| Intestinal microbiota model system | Bacteria | Shotgun | Extended custom iPtgxDB to multispecies model (8 strains); meta-transcriptomics + meta-proteomics; spectral matching | Small protein enrichment |  |
|  | Bacteria | Top down (MALDI) | Small transmembrane subunit of photosystem II | Purified protein complex |  |
|  | Bacteria | Shotgun | Broadly applicable approach; integrated shotgun and Ribo-seq data; automated evaluation of MS/MS spectra quality | Cytoplasmic extract |  |
|  | Archaea | Shotgun/top down | Characterization of 36 proteoforms mapping to 12 small proteins with top down (2D-LC-MS) | Small protein enrichment |  |
|  | Archaea | Shotgun | Multiprotease approach (SDS-PAGE) | Small protein enrichment |  |
We apologize to authors of the many important studies we could not reference due to space restrictions. Preference was given to more recent studies, many of which have used higher accuracy MS instruments and small protein enrichment strategies, carried out validation of novel small proteins (e.g., by PRM or spectral matching), and put a larger emphasis on integration of RNA-seq, Ribo-seq or computational small protein prediction algorithms. Proteogenomic studies prior to 2014 are more thoroughly listed by Kucharova and Wiker (125).
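Several of the studies listed above searched MS data against a six-frame translation of the genome so that unannotated small proteins could be matched. The sketch below illustrates that general idea only; it is not the pipeline of any cited study. It makes simplifying assumptions that are ours: ATG is the only start codon considered (bacteria also use alternatives such as GTG and TTG), the sequence is treated as linear, the codon assignments of the standard genetic code are used, and the function names and the 10 to 50 aa length window are hypothetical illustration choices.

```python
from typing import Iterator, Tuple

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def small_orfs(
    genome: str, min_len: int = 10, max_len: int = 50
) -> Iterator[Tuple[str, int, str]]:
    """Yield (strand, frame, peptide) for candidate small ORFs in all six frames.

    Assumption (illustrative only): an ORF runs from the first ATG in a
    stop-to-stop segment to the stop codon that terminates that segment.
    """
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Drop the last piece: it is not terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start == -1:
                    continue
                peptide = segment[start:]
                if min_len <= len(peptide) <= max_len:
                    yield strand, frame, peptide


if __name__ == "__main__":
    # Toy sequence (made up for illustration) encoding a 4-aa ORF.
    for strand, frame, peptide in small_orfs("ATGAAATTTGGGTAA", min_len=3):
        print(f">orf_{strand}{frame}_len{len(peptide)}\n{peptide}")
```

In practice, such translated candidates would be written to a FASTA file and combined with annotated proteins before the database search; because six-frame databases are much larger than annotation-based ones, stringent FDR control becomes correspondingly more important.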