Nadia G D'Lima1,2, Alexandra Khitun1,2, Aaron D Rosenbloom1, Peijia Yuan1,2, Brandon M Gassaway3,4, Karl W Barber3,4, Jesse Rinehart3,4, Sarah A Slavoff1,2,5. 1. Department of Chemistry, Yale University, New Haven, Connecticut 06520, United States. 2. Chemical Biology Institute, Yale University, West Haven, Connecticut 06516, United States. 3. Department of Cellular and Molecular Physiology, Yale University, New Haven, Connecticut 06520, United States. 4. Systems Biology Institute, Yale University, West Haven, Connecticut 06511, United States. 5. Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06529, United States.
Abstract
Recent advances in mass spectrometry-based proteomics have revealed translation of previously nonannotated microproteins from thousands of small open reading frames (smORFs) in prokaryotic and eukaryotic genomes. Facile methods to determine cellular functions of these newly discovered microproteins are now needed. Here, we couple semiquantitative comparative proteomics with whole-genome database searching to identify two nonannotated, homologous cold shock-regulated microproteins in Escherichia coli K12 substr. MG1655, as well as two additional constitutively expressed microproteins. We apply molecular genetic approaches to confirm expression of these cold shock proteins (YmcF and YnfQ) at reduced temperatures and identify the noncanonical ATT start codons that initiate their translation. These proteins are conserved in related Gram-negative bacteria and are predicted to be structured, which, in combination with their cold shock upregulation, suggests that they are likely to have biological roles in the cell. These results reveal that previously unknown factors are involved in the response of E. coli to lowered temperatures and suggest that further nonannotated, stress-regulated E. coli microproteins may remain to be found. More broadly, comparative proteomics may enable discovery of regulated, and therefore potentially functional, products of smORF translation across many different organisms and conditions.
Keywords:
E. coli; cold shock; genomics; label-free quantitation; microprotein; non-AUG start codon; proteogenomics; proteomics; small open reading frame; stress response
Small
open reading frames (smORFs) of <100 amino acids are widespread
in all genomes, but they remain largely nonannotated because they
have been under-detected by computational genome annotation algorithms
and proteomics protocols.[1] In recent years,
new technologies including smORF-focused computational genome analysis,[1−4] liquid chromatography/tandem mass spectrometry (LC–MS/MS)-based
proteomics coupled with deep sequencing,[5−8] and ribosome footprinting/deep sequencing
(RIBO-seq)[9,10] have revealed thousands of translated smORFs
in prokaryotic and eukaryotic genomes. While it has become clear that
many smORF-encoded microproteins play important roles in biology,[11] there remains a need to determine what fraction
of newly discovered microproteins are functional, especially because
many exhibit low sequence conservation with known proteins.[6,11]
Methods to couple discovery of nonannotated microproteins
to quantitative
analysis of their expression regulation may provide insights into
their potential biological functions. For example, Storz and colleagues
demonstrated that expression of some smORFs in bacteria is stress-inducible,[2] leading to the hypothesis that smORF-encoded
microproteins may function in stress responses. However, while efforts
toward quantitative proteogenomics have been reported,[12−17] LC–MS/MS proteogenomics has generally lagged behind RIBO-seq
in differential analysis of nonannotated microprotein expression.[2,9] To address this need, we have applied a label-free quantitative
proteogenomic workflow to identify novel microproteins that exhibit
stress-regulated expression in Escherichia coli.
We chose the cold shock response in E. coli as a model system. Cold shock is a condition under which bacteria
are abruptly exposed to low temperatures (in practice, 10 °C).
This causes arrest in global protein synthesis while inducing expression
of a subset of proteins known as cold shock proteins. The most profoundly
cold-inducible proteins are the homologues of CspA, which generally
act as nucleic acid chaperones to restore transcription and protein
translation at low temperatures.[18] All
of the nine known CspA homologues (CspA–CspI) in E. coli K12 are less than 80 amino acids in length.
Therefore, we hypothesized that nonannotated small proteins could
also be induced during cold shock. In this work, we compared nonannotated
small protein expression in E. coli cells growing at normal and reduced temperatures. We identified
four nonannotated sequences, two of which were found downstream of cspG and cspI and were upregulated by cold
shock. We further characterized the noncanonical ATT start codon that
initiates translation of these genes and demonstrated their conservation
in closely related bacteria.
Methods
Strains and Constructs
E. coli K12 substr. MG1655 and pKD46
plasmids were a gift from Jason Crawford
(Yale University). For generation of SPA tagged proteins, the tag
was introduced at the C-terminal end using the method described by
Uzzau et al. using bacteriophage λ recombination.[19,20] Colonies on LB plates with kanamycin were screened for recombination,
and the presence of the SPA tag at the C-terminus of the respective
genes was verified by PCR and confirmed by sequencing. Primers for
genomic tagging and integration check PCR are provided in Table S2.
For recombinant expression, the
genetic region encompassing cspG–ymcF or cspI–ynfQ was PCR amplified from an E. coli K12 substr. MG1655 colony and cloned into
pET 28b using restriction sites NcoI and XhoI (New England Biolabs)
to yield a His6 tag at the C-terminal end of and in frame
with YmcF and YnfQ proteins. All mutations were introduced by site-directed
mutagenesis using inverse PCR.[21]
Stress
Conditions for Mass Spectrometry
Stress conditions
were adapted from Hemm et al.[2] as follows:
Approximately 500 mL of LB was inoculated with a 1:100 dilution of
an overnight culture of MG1655 cells. The cells were grown at approximately
37 °C in a flask with a stir bar until they reached an OD600 between 0.4 and 0.5. The cells were split into two fractions.
The control remained at 37 °C, and the cold shock sample was
incubated at 10 °C for 1 h (starting from the time that the culture
reached 10 °C). All cells were pelleted at 4000g for 10 min at 4 °C. The cells were resuspended in a smaller
volume and transferred to a 50 mL conical tube. The cells were again
pelleted at 4000g for 10 min at 4 °C. The supernatant
was removed, and the pellets were flash frozen and stored at −80
°C.
Cell Lysis and Protein Size Selection
Lysis and size
selection were adapted from Ma et al.[5] as
follows: Frozen cells from the stress conditions were resuspended
in lysis buffer (50 mM HCl and 0.1% β-mercaptoethanol). The
resuspension was sonicated at 35% amplitude with eighteen 10 s bursts
with a 20 s rest on a Fisher Scientific model 120 sonic dismembrator.
Triton X-100 was added to the sample to a final concentration of 0.05%.
The sample was heated for 10 min at greater than 95 °C, allowed
to cool on ice for 10 min, and then pelleted by centrifugation for
30 min at 21 100g at 4 °C. The supernatant
was removed, and the pellet was discarded. The supernatant was filtered
through a 5 μm filter.
A Bond Elut C8 column (Agilent)
preconditioned with 1 column volume of methanol followed by 2 column
volumes of triethylammonium formate (TEAF) pH 3.0 was loaded with
approximately 10 mg of protein per 100 mg of bed resin and washed
with 2 column volumes of TEAF pH 3.0. Size-selected proteins were
eluted with two column volumes of 3:1 acetonitrile/TEAF pH 3.0 and
concentrated on a Savant SPD1010 SpeedVac concentrator (Thermo Scientific).
Digestion of Samples for Mass Spectrometry
The concentrated
sample was redissolved in water. The resuspension was precipitated
with a methanol/chloroform extraction. The precipitate was resuspended
in 31 μL of a solution of 8 M urea, 0.4 M Tris-HCl, and 20 mM
calcium chloride; 3 μL of 45 mM dithiothreitol (DTT) was added
to the solution, and the sample was incubated at 60 °C for 10
min. The reaction was placed on ice for 30 s and then incubated at
room temperature for 3 min; 3 μL of 100 mM iodoacetamide was
added, and the reaction was incubated at room temperature in the dark
for 30 min. The reaction was quenched with 0.67 μL of DTT; 16
μL of 1 M Tris-HCl pH 8.0 was added. Trypsin (Promega) was added
at a ratio of 1:50 trypsin/protein. Water was added to bring the urea
concentration to 1 M. The digest was incubated at 37 °C overnight.
The following day, the reaction was brought to 1% trifluoroacetic
acid (TFA). The peptides were desalted using Nest Group MicroSpin
columns (C18, 300 Å) and eluted in 80% acetonitrile/0.1% TFA.
The elution was concentrated on a Savant SPD1010 SpeedVac concentrator
(Thermo Scientific).
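As a rough check on the dilution step, the water volume needed to reach 1 M urea can be estimated from the volumes listed above. The following is a minimal sketch, assuming that the 31 μL of 8 M urea buffer is the only source of urea and ignoring the volume of the trypsin addition, which is not stated; the function name is illustrative.

```python
def water_to_dilute(stock_molar, stock_ul, other_additions_ul, target_molar):
    """Return the water volume (uL) that brings urea to target_molar."""
    urea_amount = stock_molar * stock_ul        # M * uL, proportional to moles of urea
    final_volume = urea_amount / target_molar   # total volume at the target concentration
    current_volume = stock_ul + sum(other_additions_ul)
    return final_volume - current_volume

# Volumes from the protocol: DTT, iodoacetamide, DTT quench, 1 M Tris-HCl pH 8.0
additions = [3.0, 3.0, 0.67, 16.0]
water_ul = water_to_dilute(8.0, 31.0, additions, 1.0)
print(f"water to add: {water_ul:.2f} uL (before accounting for the trypsin volume)")
```

Under these assumptions the final digest volume is 248 μL, so any trypsin volume would be subtracted from the water added.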
Offline Fractionation of Peptides
Peptides were fractionated
prior to LC–MS/MS via electrostatic repulsion–hydrophilic
interaction chromatography (ERLIC).[22] Desalted
samples were redissolved in 50 μL of 85% acetonitrile/0.1% formic
acid and loaded on a polyWAX LP column (150 × 1.0 mm; 5 μm
300 Å; PolyLC) attached to an Agilent 1100 HPLC at a 0.05 mL/min
flow rate. The samples were separated over an 80 min gradient as follows
(solvent A: 80% acetonitrile, 0.1% formic acid; solvent B: 30% acetonitrile,
0.1% formic acid). Isocratic flow was maintained at 100% A at a flow
rate of 0.3 mL/min for 5 min, followed by a 17 min linear gradient
to 8% B and a 25 min linear gradient to 45% B. Finally, a 10 min gradient
to 100% B was followed by a 5 min hold at 100% B before a 10 min linear
gradient back to 100% A, followed by an 8 min hold at 100% A. Fractions
were collected every several minutes, resulting in 15–17 samples
for further LC–MS/MS analysis. Each fraction was vacuum-dried
using a Savant SPD1010 SpeedVac concentrator (Thermo Scientific).
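The gradient steps described above can be laid out as a cumulative timeline to confirm that they account for the stated 80 min run. This is only a bookkeeping sketch; the step descriptions are transcribed from the text, and the helper name is an assumption.

```python
# (duration in min, description) for each step, transcribed from the text above
steps = [
    (5, "hold at 100% A"),
    (17, "linear gradient to 8% B"),
    (25, "linear gradient to 45% B"),
    (10, "gradient to 100% B"),
    (5, "hold at 100% B"),
    (10, "linear gradient back to 100% A"),
    (8, "hold at 100% A"),
]

def timeline(steps):
    """Return (start, end, description) tuples with cumulative times in minutes."""
    out, t = [], 0
    for duration, description in steps:
        out.append((t, t + duration, description))
        t += duration
    return out

schedule = timeline(steps)
for start, end, description in schedule:
    print(f"{start:>2}-{end:>2} min  {description}")
total_minutes = schedule[-1][1]
```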
LC–MS/MS Analysis
LC–MS/MS methods were
based on a previous report.[23] The fractionated
samples were resuspended in approximately 7 μL of 3:8 70% formic
acid/0.1% TFA. Approximately 5 μL of each sample was injected
onto a 150 μm × 3 cm trap column packed in-house with ReproSil-Pur
120 Å C18 resin (Dr. Maisch). Separation was carried out on a
75 μm × 20 cm PicoFrit analytical column packed in-house
using 1.9 μm ReproSil-Pur 120 Å C18 resin (Dr. Maisch).
Solvents A and B (0.1% formic acid and acetonitrile/0.1% formic acid,
respectively) were delivered using a Nano Acquity UPLC (Waters) in-line
with an LTQ Orbitrap Velos (Thermo Scientific). Samples were trapped
for 6 min at a flow rate of 2.5 μL/min at 98% A. Isocratic flow
was maintained at 0.3 μL/min at 2% B for 10 min, followed by
linear gradients from 2 to 10% B over 2 min, 10 to 25% B over 58 min,
25 to 40% B over 10 min, and 40 to 95% B over 2 min. Isocratic flow
at 95% B was maintained for 5 min, followed by a gradient from 95
to 2% B over 10 min. Acquisition parameters were as follows: MS, 30 000 resolution, 298–1750 m/z scan range; dd-MS2, top10 method, 7500
resolution, 1.0 m/z isolation window,
35 NCE.
Data Analysis
ProteoWizard MS Convert[24] was used for peak picking, and files were analyzed using
Mascot Version 2.5.1 (Matrix Science, Inc., London, UK).[25] Carbamidomethyl (C) was set as a fixed modification.
Variable modifications included carbamyl (K and N-term), oxidation
(M), and phospho (STY). The peptide mass error tolerance was 20 ppm.
The parameters were set to a semitryptic digest with a maximum of
three missed cleavages and peptide charge states limited to +2, +3,
and +4. A six-frame translation of the MG1655 genome (accession number NC_000913.3 in NCBI) and the common contaminant database were searched, and
the false discovery rate was adjusted to 1% using the homology threshold.
Peptides shorter than 8 amino acids were excluded. Identified
peptides were checked for annotation against the RefSeq database for
MG1655. Putative nonannotated hits were BLASTed, and those that contained
only one amino acid mismatch relative to annotated proteins were discarded.
Protein identifications were made on the basis of unique peptide matches
that had Mascot ions scores greater than 45, with a minimum of one
ion in both b and y series and at least four consecutive ions in a
series or multiple unique peptides that mapped to the same ORF.
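To illustrate what a six-frame translation database contains, the sketch below translates a DNA string in the three forward and three reverse-complement reading frames. It is a minimal illustration, not the pipeline used here: it uses the standard codon assignments (which the bacterial genetic code shares), writes stop codons as `*`, and does not split the output into ORFs; all function names are assumptions.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order pair one-to-one with the amino acid string
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

# Toy input, invented for illustration
frames = six_frame_translation("ATGAAATAG")
for label, protein in frames.items():
    print(label, protein)
```

In practice each frame would be split at stop codons into candidate ORF entries before searching, and the whole genome sequence would replace the toy input.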
Protein Expression
To test nonannotated protein expression,
10 mL of LB was inoculated with 200 μL of the genomically SPA-tagged
cultures grown overnight to saturation at 37 °C. Wild-type E. coli K12 MG1655 was used as a control. The cultures
were grown at 37 °C to log phase on a shaker and split into three
tubes, each containing 2.5 mL of the culture. One tube each was transferred
to a water bath at 10 °C for 1 h (cold shock), 45 °C for
20 min (heat shock), or 37 °C for 1 h. In order to assess protein
expression, an aliquot from each tube corresponding to 0.2 OD600 units was taken, trichloroacetic acid (TCA) was added immediately
to a final concentration of 8%, and the samples were centrifuged at
14 000g for 15 min at 4 °C. Pellets were
washed with acetone, air-dried, and resuspended in SDS gel loading
buffer. Samples were heated at 90 °C for 2 min, and 10 μL
of each sample was loaded on a 15% SDS-PAGE gel.
To test expression
of proteins from the pET vector, a single colony of E. coli BL21 (DE3) Gold cells containing the plasmid
construct was inoculated into 5 mL of LB with 40 μg/mL of kanamycin
and grown overnight at 37 °C. 100 μL of this culture was
used to inoculate 5 mL of LB/kanamycin and grown to log phase at 37
°C on a shaker. Isopropyl β-D-1-thiogalactopyranoside
(IPTG) was added to the cultures at a final concentration of 1 mM,
and growth was continued at 37 °C for 1 h, after which 0.2 OD600 units was taken and subjected to TCA precipitation followed
by SDS PAGE as described above. All gels were run in duplicate so
one could be stained with Coomassie and the other could be subjected
to western blotting. At least three biological replicates were carried
out for each experiment reported.
Western Blotting
Gels were transferred to BioTrace
nitrocellulose membranes (VWR) at 30 V for 16 h or at 100 V for 1
h. Blots were blocked in 3% BSA for 1 h at room temperature on a shaker.
To probe for SPA-tagged proteins, 1:1000 dilution of mouse monoclonal
anti-FLAG M2 (Sigma) primary antibody was incubated with the blot
for 1 h, followed by washing with Tris buffered saline containing
0.1% Tween 20 (TBS-T). Goat anti-mouse secondary antibody (Rockland)
at a dilution of 1:10 000 was incubated for 1 h, followed by
washing with TBS-T. Blots were developed using Clarity ECL western
blotting substrate (Bio-Rad) and imaged using a ChemiDoc imaging system
(Bio-Rad) and Image Lab software (Bio-Rad). For His6-tagged
proteins, His tag antibody conjugated to biotin was used. For detection,
streptavidin conjugated to Alexa Fluor 488 was incubated with the blot
for 30 min, followed by washing and analysis of the blot by a Typhoon
imaging system and ImageQuant software (GE Life Sciences).
Changes in protein expression of SPA-tagged proteins were assessed
by quantifying the bands using Image Lab software (Bio-Rad). After
background subtraction, the fold change in expression was calculated
by dividing the intensity of bands at 10 °C by those at 37 °C.
At least three biological replicates were carried out for each protein
as well as the wild-type E. coli K12
MG1655 control.
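The fold-change calculation described above can be sketched as follows. The band intensities are invented placeholders, the pairing of cold shock and control bands within each replicate is an assumption, and the function names are illustrative; the standard error of the mean is computed as the sample standard deviation divided by the square root of the number of replicates.

```python
from math import sqrt
from statistics import mean, stdev

def fold_changes(cold, control, background=0.0):
    """Per-replicate fold change: background-subtracted 10 C band / 37 C band."""
    return [(c - background) / (k - background) for c, k in zip(cold, control)]

def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))

# Hypothetical band intensities (arbitrary units) for three biological replicates
cold_shock = [900.0, 1100.0, 1000.0]
control = [100.0, 100.0, 100.0]
fc = fold_changes(cold_shock, control)
avg, sem = mean_and_sem(fc)
print(f"fold change = {avg:.2f} +/- {sem:.2f} (SEM, n = {len(fc)})")
```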
Results
Development of a Proteomics
Workflow for Discovery of Nonannotated,
Cold Shock-Inducible Proteins in E. coli
Figure 1 summarizes our comparative microprotein discovery platform. For
high-sensitivity microprotein detection, we enriched the E. coli small proteome using a modification of previously
reported workflows.[5,6] First, we prepared stress and
control samples by subjecting E. coli K12 substr. MG1655 cells growing at 37 °C in log phase to cold
shock conditions (10 °C) for an hour, whereas control cells were
maintained at 37 °C. Cells were lysed, and the small proteome
was isolated using a C8 column that selectively retains microproteins
and peptides.[26] After trypsin digestion,
peptides were separated by ERLIC, and each fraction was then analyzed
by liquid chromatography and tandem mass spectrometry. We performed
two biological replicates of the cold shock and control samples. We
subsequently analyzed two additional biological replicates of the
cold shock sample to assess reproducibility of protein identifications.
Figure 1
Quantitative
proteomic gene discovery in E. coli. (A) Schematic overview of the quantitative proteomics protocol.
(B) Comparative analysis of nonannotated gene expression begins with
parallel preparation of size-selected small proteome samples from
control and experimental (cold shock) cells. (C) Nonannotated peptides
are sequenced by searching their tandem mass spectrometry (MS/MS)
spectra against a six-frame translation of the E. coli genome and excluding sequences matching known proteins. (D) Analysis
of the peak area for the respective peptides in the extracted ion
chromatograms (EICs) from MS1 spectra is used to quantify
the level of upregulation relative to the control.
In order to identify all peptides in this sample,
including those
derived from nonannotated genomic regions, we searched these peptide
fragmentation spectra against a six-frame translation of the E. coli K12 substr. MG1655 genome using MASCOT. Annotated
proteins were then excluded using a string-matching algorithm[6] with reference to the current E. coli K12 proteome, and, in order to conservatively
exclude possible point mutants in our laboratory strain, we retained
only those tryptic peptides that are at least two amino acids different
from any annotated protein. Only search results yielding peptides
having at least four consecutive b or y ions were considered for validation.
These parameters not only greatly reduced the number of candidate
peptides but also eliminated false positives. BLAST searches were
performed on the candidate peptides to verify that they were unique
in the E. coli genome. While single-peptide
protein identifications were retained for confirmation, since many
smORF-encoded microproteins yield only one detectable tryptic fragment,[6] we note that two independent tryptic peptides
support identification of two of our nonannotated protein hits (Table S1 and Figure S3). Tandem mass spectra
for peptides that met our stringent criteria are shown in Figures 2 and S3, and peptide scores and related information
are provided in Table S1.
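The two-residue exclusion rule can be illustrated with a simple sliding-window comparison. This is a sketch under stated assumptions, not the string-matching algorithm cited above: it considers substitutions only (no insertions or deletions), and the sequences and function names are invented for illustration.

```python
def min_mismatches(peptide, protein):
    """Smallest Hamming distance between peptide and any same-length window of protein."""
    n = len(peptide)
    if len(protein) < n:
        return n  # no full-length window exists; treat as fully mismatched
    return min(
        sum(a != b for a, b in zip(peptide, protein[i:i + n]))
        for i in range(len(protein) - n + 1)
    )

def is_candidate(peptide, annotated_proteins, min_difference=2):
    """True if peptide differs by >= min_difference residues from every annotated protein."""
    return all(min_mismatches(peptide, p) >= min_difference for p in annotated_proteins)

# Toy sequences, invented for illustration only
annotated = ["MKTAYIAKQRQISFVK", "MSDKPLTGHW"]
exact = is_candidate("TAYIAKQR", annotated)       # exact match to an annotated protein
one_off = is_candidate("TAYLAKQR", annotated)     # one substitution: possible point mutant
unrelated = is_candidate("GWCNEVHD", annotated)   # unrelated sequence
print(exact, one_off, unrelated)
```

Only the third peptide would be carried forward, since the first two lie within one residue of an annotated sequence.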
Figure 2
Detection and semiquantitative
analysis of nonannotated E. coli proteins.
(A) Extracted ion chromatograms
(EICs) from MS1 spectra corresponding to (B) MS/MS spectra
of nonannotated tryptic peptides identified in our shotgun profiling
experiments. The EIC intensity at the same retention time for a 1
Da window around the parent ion mass was compared for the control
(red) vs cold shock (blue) samples. Each matched EIC pair is presented
on the same y-axis scale. Because the analysis is
semiquantitative, substantial intensity in both samples was taken
to indicate similar expression. MS/MS spectra (right) presented correspond
to the experimental EICs shown (left). Y- and b-ions are shown in
red and indicated on the matched peptide scores above each spectrum. m/z, mass to charge ratio. Additional peptides
corresponding to each protein as well as scores, precursor mass errors,
and charge states corresponding to the MS/MS spectra in this figure
can be found in Table S1.
In order to identify differential expression, we
utilized label-free
quantitation.[26−28] Briefly, we first selected nonannotated proteins that Mascot identified
in only the control or only the stress condition. We then
compared the area under the MS1 peak in the extracted ion
chromatogram (EIC)[29] for each of these
peptides (Figure 2),
providing quantitative confirmation of differential expression. As
a control, we confirmed that proteins that do not change under the
experimental cold shock condition exhibited constant MS1 ion intensity (Figure 2) and that upregulated MS1 intensity was observed for a peptide derived
from a known cold shock protein (Figure S1). We also confirmed that, for each fraction analyzed, E. coli proteins known to be unresponsive to cold
temperatures, such as ribosomal proteins, do not change in abundance
in their MS1 peptide ion intensities, demonstrating that
the changes we attribute to novel cold shock proteins are specific
(Figure S1).
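The EIC comparison can be sketched as follows: for each MS1 scan, intensity within a 1 Da window around the precursor m/z is summed, and the resulting trace is integrated over the retention-time range of the peak. This is an illustrative sketch only; the scan data are invented, the window is assumed to be centered on the precursor, and the trapezoidal integration and function names are assumptions rather than the software settings used here.

```python
def extract_eic(scans, target_mz, window_da=1.0):
    """Sum intensity within a window_da-wide m/z window for each MS1 scan.

    scans: list of (retention_time_min, [(mz, intensity), ...]) tuples.
    """
    half = window_da / 2.0
    return [
        (rt, sum(inten for mz, inten in peaks if abs(mz - target_mz) <= half))
        for rt, peaks in scans
    ]

def peak_area(eic, rt_start, rt_end):
    """Trapezoidal area under the EIC between two retention times."""
    pts = [(rt, y) for rt, y in eic if rt_start <= rt <= rt_end]
    return sum((t1 - t0) * (y0 + y1) / 2.0 for (t0, y0), (t1, y1) in zip(pts, pts[1:]))

# Invented toy scans; real data would come from the converted raw files
control = [(10.0, [(500.2, 0.0)]), (10.1, [(500.3, 10.0)]), (10.2, [(500.2, 0.0)])]
cold = [(10.0, [(500.2, 0.0)]), (10.1, [(500.3, 200.0), (620.0, 50.0)]), (10.2, [(500.2, 0.0)])]
area_control = peak_area(extract_eic(control, 500.25), 10.0, 10.2)
area_cold = peak_area(extract_eic(cold, 500.25), 10.0, 10.2)
ratio = area_cold / area_control
print(f"cold shock / control peak area ratio: {ratio:.1f}")
```

When a peptide is undetectable in the control, the control area is zero and a ratio is undefined, which is one reason such comparisons are best treated as semiquantitative.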
Identification of Genomic
Loci Putatively Encoding Nonannotated
Microproteins in E. coli
The
genomic sequences corresponding to these candidate peptides were identified
in order to define their full-length sequences. Our proteomics search
results yielded peptides that map to four candidate nonannotated proteins
in coding sequences currently annotated as intergenic. We propose
to name these proteins YmcF, YnfQ, YnaL, and YhiY per convention for
proteins of unknown function (Figure 3). We also identified a peptide putatively corresponding
to the predicted protein YpaA (Figure S2). Comparative analysis of the EICs revealed that peptides from three
of these smORFs (ynaL, yhiY, and ypaA) were present in both control and
cold shocked cells (Figures 2 and S2). In contrast, peptides
derived from ymcF and ynfQ were
either not present in the control cells or dramatically enriched in
the cold shocked cells compared to the control (Figure 2). We subsequently analyzed two cold shock
sample replicates, demonstrating reproducible detection and sequencing
of YmcF, YnfQ, and YnaL as well as two independent tryptic fragments
supporting identification of YmcF and YnfQ, providing strong evidence
for the reproducibility of their identifications (Table S1 and Figure S3). Taken together, these results suggest
that comparative proteomics has the potential to identify both constitutive
and regulated expression of nonannotated bacterial microproteins.
Figure 3
Gene locus
diagrams for nonannotated E. coli proteins
YmcF (A), YnfQ (B), YnaL (C), and YhiY (D). Line represents
chromosomal DNA, annotated protein-coding sequences are represented
by gray boxes, and newly reported coding sequences are represented
by blue boxes. Arrows indicate 5′–3′ directionality
of the coding sequence. Sizes are proportional to length, and genomic
coordinates of novel protein sequences are provided. Sizes of novel
proteins were calculated either from the first in-frame ATG to stop
codon (C, D) or from experimentally determined non-ATG start codons
(A, B, vide infra).
Confirmation of Microprotein Expression and Cold-Shock Inducibility
via Genomic Tagging
While bottom-up proteomics has proved
powerful in identification of novel peptide sequences, full protein
sequence coverage is rarely obtained. Therefore, this approach is
insufficient to confirm assignment of observed peptides to genomic
loci. Furthermore, since several of our novel protein identifications
were based on single tryptic peptide-spectral matches, rigorous molecular
confirmation of protein expression was required. In order to verify
the smORFs encoding our putative microproteins, we generated epitope-tagged
knock-in strains. The peptides identified by LC–MS/MS helped
define the reading frame and stop codons for the genes that encode
these proteins. For each locus, a C-terminal sequential epitope tag
(SPA tag[2]) was added to the chromosomal copy of the candidate
genes to report on expression without perturbing translation initiation
(Figure S4). Protein expression under conditions
of normal growth (37 °C) and cold shock (10 °C), with heat
shock (42 °C) as an additional control for specificity of the
cold shock response, was monitored by subjecting the respective cell
lysates to SDS-PAGE followed by western blotting with an antibody
against the FLAG tag that constitutes a portion of the SPA sequence.
We were able to detect robust expression of the YmcF, YnfQ, and
YhiY proteins (Figure 4). Band densitometry showed that YmcF and YnfQ were significantly
upregulated upon cold shock, whereas YhiY was expressed essentially
equally under all conditions tested. Regarding the migration of these
proteins in SDS-PAGE, the SPA tag adds 70 amino acids, or approximately
8 kDa, to the proteins of interest. Even so, YmcF, YnfQ, and YhiY
migrate at slightly higher apparent molecular weights than would be
expected based on their sizes, as determined by start codon mutagenesis
(approximately 5–7 kDa, vide infra). This
anomalous SDS-PAGE mobility has been observed for several other well-characterized
microproteins[6,30,31] and may be attributable to their high charge density and de-enrichment
in aromatic residues.[32] Despite repeated
attempts, we were unable to detect expression of epitope-tagged YnaL
and YpaA under any conditions. We concluded that these proteins may
be post-translationally proteolyzed, so we did not consider them further.
These results, combined with our proteomics analysis, confirmed that
proteins YmcF, YnfQ, and YhiY are translated and that YmcF and YnfQ
are upregulated during cold shock stress in E. coli.
Figure 4
Confirmation of expression and cold shock upregulation of novel
small proteins. (A) E. coli MG1655
strains with SPA epitope tags added to the C-termini of YmcF, YnfQ,
and YhiY were generated. Cell lysates of strains expressing genomically
tagged YmcF, YnfQ, and YhiY proteins at 37 °C, 10 °C (cold
shock), and 42 °C (heat shock) were separated on a 4–20%
SDS gel and stained with Coomassie blue (right). The same samples
were also subjected to western blotting (left) and probed with an
anti-FLAG antibody. The bands indicated by a red asterisk correspond
to YmcF, YnfQ, and YhiY. (B) Bands from the blot were quantified by
densitometry, and results are plotted to represent the fold change
in expression for the three proteins at 10 °C (cold shock) relative
to 37 °C. Error bars were calculated from three biological replicates
and represent the standard error of the mean.
Identification of the Translation Initiation Sites for ymcF and ynfQ
YmcF and YnfQ map
to intergenic sequences downstream of the known cold shock proteins cspG and cspI, respectively. Although ymcF and ynfQ are not currently annotated
in this E. coli strain, they have been
predicted based on sequence conservation (RefSeq accession WP_077248232.1).
A closer look at the ymcF and ynfQ genes revealed that they must initiate at a noncanonical sequence
due to the lack of an ATG start codon upstream of the region that
produced the peptides we detected by mass spectrometry. In order to
identify the translation initiation sites for ymcF and ynfQ, they were amplified along with their
upstream genes and cloned into a pET expression vector to allow for
the expression of a C-terminal hexa-histidine tag (His6 tag) in-frame with ymcF and ynfQ. For cspG–ymcF, the start codon for cspG and potential start codons in its vicinity were substituted
with codons that would not allow for the initiation of cspG (CspG(ds)–YmcF). When expression of this construct was tested,
robust translation of a small product could be observed by SDS-PAGE
and subsequent blotting against the His6 tag (Figure 5A). This verified
that translation of the small protein downstream of cspG occurs independently and is not a result of stop codon read-through
or frame shifting during cspG translation. (Although
we observed several higher molecular-weight translation products from
the heterologous expression construct, these are not likely to be
physiologically relevant, as they are not detectably produced from
the genomically tagged strain.)
Figure 5
Translation of YmcF initiates at an ATT
start codon. (A) To verify
the translation initiation site for ymcF, a cspG(ds) ymcF plasmid
was cloned with a His6 tag in-frame at the C-terminus of
the ymcF coding sequence. In this construct, the cspG start codon
was mutated (delete start, ds) to abolish initiation of CspG. We then
individually mutated candidate near-cognate ymcF start codons to stop
codons. As a negative control, a stop codon was inserted before the
His6 tag in the cspG(ds) ymcF plasmid. Nucleotide numbering
starts immediately after the stop codon of cspG, and the sequence
is provided in Figure S5. To observe expression,
these constructs were introduced into BL21 cells, and lysates of IPTG-induced
cells were subjected to SDS-PAGE followed by blotting with
an antibody against the His6 tag. The YmcF protein band (caret)
does not appear when the A88TT codon or the next downstream near-cognate
start codon (G156TG) is mutated to a stop codon. Nucleotide sequence
and numbering for ymcF are provided in Figure S5, and additional ymcF mutagenesis experiments are presented
in Figure S6. (B) Mutating the A88TT codon
in ymcF to ATG results in increased expression of the same major protein
product (caret).
We then sought to identify
the start site for ymcF. Since there was no in-frame
ATG start codon that could lead to the
translation of a small YmcF protein, every near-cognate start codon
downstream of cspG was mutated to a stop codon and
expression of YmcF was inspected (see Figure S5 for sequence and numbering). We observed that mutations after T64TG caused a significant decrease in translation, whereas
mutating A88TT to a stop codon completely abolished translation
(Figures 5A and S6). Mutations of codons downstream of this position also
abolished translation of the major product (Figure S6), suggesting that A88TT is the translation initiation
site for ymcF. Further mutation of A88TT to ATG significantly increased translation of the same product,
as expected for a more efficient start codon (Figure 5B). These data are consistent with initiation
of YmcF translation at A88TT.
Analysis of the genetic
loci for ymcF and ynfQ revealed
similar organization (Figure ). Further, amino acid sequence alignment
of YmcF and YnfQ reveals that the two proteins share 66% sequence
identity (Figure 6A),
suggesting that they may have arisen from a gene duplication event.
On the basis of nucleotide sequence alignment of ymcF and ynfQ, we predicted the initiation site of ynfQ would be A22TT. When the preceding codon
A19TT was mutated to a stop codon, YnfQ was still translated.
However, when A22TT was substituted with TAG, translation
of YnfQ was completely abolished, consistent with initiation of ynfQ at A22TT (Figure S7).
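The mutagenesis strategy described above (enumerate in-frame candidate start codons, then replace each with a stop codon) can be sketched in Python. This is an illustrative sketch only: it assumes that "near-cognate" means any codon differing from ATG at exactly one position, it uses 1-based numbering as in the figure captions, and the example sequence and function names are hypothetical rather than taken from ymcF or ynfQ.

```python
# Near-cognate codons: assumed here to be codons with exactly one mismatch to ATG.
NEAR_COGNATE = {
    a + b + c
    for a in "ACGT" for b in "ACGT" for c in "ACGT"
    if sum(x != y for x, y in zip(a + b + c, "ATG")) == 1
}


def candidate_starts(seq, frame_pos):
    """List (1-based position, codon) for ATG or near-cognate codons in frame.

    frame_pos is the 1-based position of the first nucleotide of any codon
    in the reading frame of interest.
    """
    seq = seq.upper()
    first = (frame_pos - 1) % 3
    hits = []
    for i in range(first, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "ATG" or codon in NEAR_COGNATE:
            hits.append((i + 1, codon))
    return hits


def mutate_to_stop(seq, pos, stop="TAG"):
    """Replace the codon whose first nucleotide is at 1-based pos with a stop codon."""
    i = pos - 1
    return seq[:i] + stop + seq[i + 3:]


# Hypothetical 18-nt sequence for illustration; not the ymcF or ynfQ sequence.
toy = "GCAATTCCCGTGAAATAA"
starts = candidate_starts(toy, frame_pos=1)
mutant = mutate_to_stop(toy, 4)
print(starts)
print(mutant)
```

In the experiment, each such mutant would be expressed and blotted separately; loss of the tagged product is informative only when interpreted together with the upstream mutants, because any in-frame stop codon downstream of the true start also truncates the protein before the C-terminal tag.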
Figure 6
Homology and conservation of YmcF and YnfQ. (A) Amino acid sequence
alignment of YmcF and YnfQ proteins using Clustal Omega. The two proteins
exhibit 66% sequence identity. (B) Nucleotide sequence alignments
of ymcF (starting from position −3 relative
to the first coding nucleotide) to homologous sequences from Shigella sonnei strain FORC_011 and Salmonella enterica subsp. enterica serovar Anatum
str. USDA-ARS-USMARC-1677, which were identified using NCBI BLAST.
The ATT start codon is underlined in red.
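Percent identity between two aligned sequences, as reported for YmcF and YnfQ, can be computed from a pairwise alignment along the following lines. This is a minimal sketch under a stated assumption: columns containing a gap in either sequence are excluded from the denominator, which may differ from the convention used by a given alignment tool. The aligned strings below are hypothetical and are not the YmcF or YnfQ sequences.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over aligned, ungapped columns of a pairwise alignment.

    Assumption: columns with a gap ('-') in either sequence are skipped.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned peptides for illustration only.
pid = percent_identity("MKCALT-A", "MKCGLTSA")
print(f"{pid:.1f}% identity")
```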
A BLAST search revealed the presence of ymcF homologues
in some Salmonella and Shigella species,
as well as conservation of the putative ATT start codon (Figure 6B). Taken together,
these observations of cold-inducible synthesis and conservation in Enterobacteriaceae suggest that the YmcF and YnfQ proteins
may be functional.
Discussion
While elegant genetic
approaches have improved our ability to identify
small proteins missed by traditional genome annotation algorithms,[4] it is becoming clear that additional classes
of genes have been under-annotated. For example, increasing numbers
of reports have identified proteins translated by unconventional mechanisms
such as initiation at noncanonical start codons,[33,34] internal translation initiation sites,[35,36] programmed frame-shifting,[37,38] and stop codon read-through.[39,40] Our comparative proteomic analysis revealed four novel E. coli proteins, all of which were previously nonannotated
for at least one of the above-mentioned reasons: the encoded proteins
are small, transiently expressed during stress, and/or initiate with
noncanonical start codons.
The proximity of ymcF and ynfQ to known genes, in addition to their regulated
expression and conservation,
supports the hypothesis that they may encode functional proteins.
Both are downstream of cold shock genes (cspG and cspI, respectively). The coding region of ymcF also overlaps the ymcE gene, which itself overlaps
the downstream gnsA gene (Figure ). ymcE is a suppressor
of fabA6, whose gene product, FabA, catalyzes a dehydrase
reaction in the synthesis of unsaturated fatty acids.[41,42] Mutations in fabA6 result in a temperature-sensitive
unsaturated fatty acid auxotroph phenotype which can be alleviated
by overexpression of YmcE.[42] ynfQ is also located upstream of a homologue of gnsA named gnsB (Figure ), which is another suppressor of the fabA6 mutant.[43] The biochemical roles of YmcE,
GnsA, and GnsB are yet to be determined, as these proteins remain
largely uncharacterized at the molecular level. However, since our
newly identified proteins are proximal in sequence space both to upstream
cold shock proteins and downstream suppressors of fabA6 mutations, it is reasonable to hypothesize that these proteins may
also play a role in regulating lipid synthesis during cold shock.
Both YmcF and YnfQ are predicted to be structured (Figure S8A), but they have no known sequence or structural
homologues. YmcF exhibits predicted structural homology to zinc-binding
domains in proteins such as aspartate transcarbamoylase, largely based
on five cysteine residues present in both proteins (and in YnfQ) (Figure S8B,C). Future work will focus on characterizing
these proteins and testing these structural and functional hypotheses.
We utilized a molecular mutagenesis approach to identify the initiation
codons for ymcF and ynfQ as A88TT and A22TT, respectively. The ATT start codon
has long been known to initiate protein synthesis in bacteria, but
it is thought to be rare, with only two ATT-initiating E. coli genes currently annotated: pcnB and infC.[44−46] The enzyme PAP I (poly(A) polymerase
I), which catalyzes RNA 3′ polyadenylation, is encoded by pcnB. Elevated levels of PAP I may be toxic to cells, and
initiation at the noncanonical ATT start codon is proposed to be a
regulatory mechanism to control PAP I production at low levels.[45] Similarly, the prokaryotic translation initiation
factor 3 (IF3), which is crucial for selecting the initiation codon
for general protein translation, negatively regulates its own synthesis
by initiating at an ATT start codon.[33,46,47] It is possible that YmcF and YnfQ translation is
also regulated via noncanonical start codon recognition. Our results
further suggest that many more genes may remain to be identified that
initiate with rare near-cognate start codons, even in eukaryotic genomes,
where ATT start codons govern translation initiation of human beta-globin
and frataxin.[48,49]
In conclusion, even though
the E. coli genome has been extensively
explored, our results suggest that more
genes may remain to be discovered. These cryptic genes are likely
to be short, may only be expressed under specific conditions, and
may utilize noncanonical translation initiation mechanisms. Our quantitative
proteomic workflow provides a roadmap for the discovery and characterization
of these yet nonannotated genes. More broadly, we anticipate that
comparative analysis of regulated smORF expression via LC/MS-based
proteomics will enable the coupling of microprotein discovery to functional
hypothesis generation.