A functional understanding of the human body requires structure-function studies of proteins at scale. The chemical structure of proteins is controlled at the transcriptional, translational, and post-translational levels, creating a variety of products with modulated functions within the cell. The term "proteoform" encapsulates this complexity at the level of chemical composition. Comprehensive mapping of the proteoform landscape in human tissues necessitates analytical techniques with increased sensitivity and depth of coverage. Here, we took a top-down proteomics approach, combining data generated using capillary zone electrophoresis (CZE) and nanoflow reversed-phase liquid chromatography (RPLC) hyphenated to mass spectrometry to identify and characterize proteoforms from the human lungs, heart, spleen, small intestine, and kidneys. CZE and RPLC provided complementary post-translational modification and proteoform selectivity, thereby enhancing the overall proteome coverage when used in combination. Of the 11,466 proteoforms identified in this study, 7373 (64%) were not reported previously. Large differences at the protein and proteoform levels were readily quantified, with initial inferences about proteoform biology operative in the analyzed organs. Differential proteoform regulation of defensins, glutathione transferases, and sarcomeric proteins across tissues generates hypotheses about how they function and are regulated in human health and disease.
Keywords:
capillary zone electrophoresis; heart; kidney; lung; proteomics; small intestine; spleen; top-down proteomics
Mapping
the human body is critical to improve our understanding
by setting definitive reference points for organs, tissues, and cells
of diverse types. In proteomics, a complete understanding of the proteoform[1] diversity requires measurements that systematically
capture protein-level complexity. In projects such as the Human Biomolecular
Atlas Program (HuBMAP)[2] and Human Cell
Atlas,[3] the resolution of mapping can handle
single cells in tissues, with several highly multiplexed methods enabled
by antibody-based affinity reagents: CODEX,[4] Immuno-SABER,[5] CyTOF,[6] and MIBI,[7,8] among others. These methods measure
the expression of particular epitopes on proteins, although they still
fail to capture the full complexity of the proteoforms present. Proteoform-level
measurements are more specific for a particular biological state compared
to the measurements on the gene or even protein level.[9,10] While our long-term goal is to develop new technologies that deliver
spatial proteoform analysis and build a comprehensive atlas of human
proteoforms,[11] our goal here is to identify
proteoforms present in primary human tissues and provide an initial
assessment of their post-translational modifications (PTMs) across
tissue types.

Top-down proteomics (TDP), where intact proteins
are isolated and
fragmented by mass spectrometry (MS), is well suited for the identification
and characterization of tissue-specific proteoforms. For the analysis
of complex proteome samples, upfront separation and/or fractionation
represents a crucial part in TDP workflows to reduce complexity prior
to MS. Reversed-phase liquid chromatography (RPLC) is traditionally
employed as the method of choice in TDP, which is due to its reproducibility,
separation capacity, and MS compatibility, although capillary zone
electrophoresis (CZE) represents an alternative for online MS. In
particular, the separation principle of CZE is based on differences
in electrophoretic mobilities (charge-to-size ratio)
and is considered largely “orthogonal” to RPLC, where
separation is driven by the hydrophobicity of analyte molecules. For
this reason, the combination of information generated by both techniques
is anticipated to increase the number of identified proteins and proteoforms.

Here, we report results from two workflows for mapping the proteoform
landscape of solid tissues and present the first iteration with five
commonly studied human tissues (heart, lungs, kidneys, small intestines,
and spleen). Initially, the extracted proteoforms were prefractionated
using gel-eluted liquid fraction entrapment electrophoresis (GELFrEE),[12] followed by subsequent CZE-MS and nano-RPLC-MS
analysis. This study contributes 7373 proteoforms to the Human Proteoform
Atlas (HPfA), a FAIR[13] knowledge base that
now contains approximately 60,000 unique proteoforms linked to their
biological context.[14]
Experimental Procedures
Reagents
All reagents were purchased from Thermo Fisher
Scientific at the highest available purity unless otherwise specified.
Tissue Lysate Preparation
Fresh-frozen tissue samples
of the human heart, lungs, small intestine, and spleen were obtained
from HuBMAP Tissue Mapping Centers (Table S1). The tissue samples were collected under IRB-approved protocols
at each institution. Kidney samples were received as 10 μm microtome
scrolls embedded in methylcellulose (each ∼5 mg). All other
tissue types were cut into small pieces (∼5 mm) by the specimen
preparer at Mapping Centers. The kidney scrolls were cryopulverized
in 2 mL Eppendorf Protein Lo-Bind tubes containing a 5 mm stainless-steel
ball (Qiagen, cat. no. 69989) with a CryoMill (Retsch, cat. no. 20.749.001)
equipped with a tube adaptor. Nonkidney tissue specimens (50–100
mg) were cryopulverized using the CryoMill equipped with a 25 mL grinding
jar containing a 1 inch stainless-steel ball. Three cycles of precooling
with liquid nitrogen at 1 Hz for 3 min and grinding at 30 Hz for 1
min were performed. The pulverized tissue was transferred to a 15
mL conical tube and resuspended in 2 mL of cold radioimmunoprecipitation
assay lysis buffer [50 mM Tris, 150 mM NaCl, 1% NP-40 (v/v), 0.5%
sodium deoxycholate (w/v), 0.1% sodium dodecyl sulfate (w/v), pH 7.4,
1× Halt protease and phosphatase inhibitor cocktail (Thermo Scientific)].
The suspension was further disrupted by sonication on ice (40% power,
cycle 2 s on, 3 s off, for 30 s total) using a probe sonicator (FisherBrand
model 120 with a 1/8 inch probe) and then clarified by centrifugation
(3234 × g, 30 min, 4 °C).
Sample Prefractionation and Preparation for MS
The
kidney lysates were studied using a 5 × 4 × 1 × 2 design:
five biospecimens from separate donors were GELFrEE-fractionated into
four fractions, analyzed by RPLC-MS/MS, and injected in duplicate.
The lung lysates were studied in a 3 × 6 × 1 × 3 design:
three samples from a single donor, six fractions, only RPLC, and three
injections. The heart lysates were studied in a 2 × 6 ×
2 × 3 design: two donors, six fractions, both CZE and RPLC, and
three injections. The small intestine and spleen were studied in a
1 × 6 × 2 × 3 design: one sample, six fractions, both
CZE and RPLC, and three injections. The lysates were fractionated
and prepared for MS, as described previously.[15] In brief, the lysates were precipitated by adding four volumes of
cold acetone and incubating them at −80 °C for 1 h. The
precipitate was collected by centrifugation (20,000 × g, 30 min, 4 °C), and proteins were resolubilized in 1% sodium
dodecyl sulfate (w/v). The total protein content was determined by
the BCA assay (Thermo Scientific). The samples were fractionated using
the GELFrEE 8100 fractionation station (Expedeon). The protein samples
(300 μg in 150 μL) were combined with 30 μL of the
GELFrEE running buffer and 8 μL of 1 M DTT. The samples were
incubated at 95 °C for 5 min, cooled to room temperature, and
separated using a 10% GELFrEE cartridge following the manufacturer’s
protocol. Six (four in the case of kidney samples) 150 μL fractions
were collected and stored at −80 °C until immediately
prior to analysis. On the day of analysis, the fractions were thawed
on ice and precipitated with methanol–chloroform–water
as described.[16] Based on previous experience,
each fraction was expected to contain about 5 μg of protein
material. The pellets were resuspended in 10 μL of 0.3% acetic
acid (HAc) (v/v) and subjected to CZE-MS/MS. When CZE-MS/MS analysis
was completed, the samples were diluted with 20 μL of buffer
A (5% acetonitrile, 94.8% water, and 0.2% formic acid) and subjected
to RPLC-MS/MS analysis. If only RPLC-MS/MS was conducted, the pellets
were resuspended directly in 30 μL of buffer A.
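The four-factor designs described above can be read as full factorial products (samples × fractions × separation methods × injections). The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and not part of the published analysis code. The MS/MS run counts reported in Table 1 (for example, 42 for kidneys and 49 for lungs) differ from these products for some tissues, so the values below describe the nominal design only.

```python
from math import prod

# Nominal designs as stated in the text:
# (biological samples, GELFrEE fractions, separation methods, injections)
DESIGNS = {
    "kidneys": (5, 4, 1, 2),
    "lungs": (3, 6, 1, 3),
    "heart": (2, 6, 2, 3),
    "small intestine": (1, 6, 2, 3),
    "spleen": (1, 6, 2, 3),
}


def planned_runs(design):
    """Number of MS/MS runs implied by a full factorial design."""
    return prod(design)


# Nominal run count per tissue (hypothetical helper table for illustration)
PLANNED = {tissue: planned_runs(d) for tissue, d in DESIGNS.items()}

if __name__ == "__main__":
    for tissue, n in PLANNED.items():
        print(f"{tissue}: {n} planned runs")
```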
Capillary Zone Electrophoresis
CZE was performed using
a CESI 8000 Plus (Sciex) equipped with a Neutral OptiMS capillary
cartridge (30 μm ID, L = 90 cm), neutrally
coated. The cartridge was washed and conditioned according to the
manufacturer’s protocols. Separation conditions: cartridge
temperature: 15 °C, sample tray temperature: 4 °C, background
electrolyte: 3% HAc, conductive liquid: 3% HAc, hydrodynamic injection:
2.5 psi for 60 s (corresponds to ∼20 nL or ∼10 ng of
the protein material). The individual separation method steps are
listed in Table S2. Overnight, the capillary
was rinsed alternating between high flow (100 psi, 2 min) and low
flow (10 psi, 120 min) steps with water. For long-term storage, both
separation and conductive lines were each rinsed (100 psi) with water for
5 min, and the cartridge was stored at 4 °C.
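The stated injection volume (∼20 nL for 2.5 psi over 60 s) can be checked with the Hagen-Poiseuille equation for pressure-driven flow through a capillary. The sketch below assumes a background electrolyte viscosity close to that of water at 15 °C (about 1.14 mPa·s), which is an assumption and not a value given in the text; the function and constant names are illustrative.

```python
import math

PSI_TO_PA = 6894.757  # pascals per psi


def injected_volume_nl(pressure_psi, time_s, inner_diameter_um,
                       length_cm, viscosity_pa_s=1.14e-3):
    """Hydrodynamic injection volume in nL (Hagen-Poiseuille).

    The default viscosity is an assumed value for water near 15 °C;
    3% acetic acid will differ slightly.
    """
    dp = pressure_psi * PSI_TO_PA          # pressure drop in Pa
    d = inner_diameter_um * 1e-6           # inner diameter in m
    length = length_cm * 1e-2              # capillary length in m
    volume_m3 = dp * math.pi * d**4 * time_s / (128 * viscosity_pa_s * length)
    return volume_m3 * 1e12                # 1 m^3 = 1e12 nL


if __name__ == "__main__":
    v = injected_volume_nl(2.5, 60, 30, 90)
    # About 5 ug of protein in 10 uL corresponds to 0.5 ng/nL.
    print(f"volume = {v:.1f} nL, load = {v * 0.5:.1f} ng")
```

With the capillary dimensions given above (30 μm ID, 90 cm), the estimate lands close to the ∼20 nL and ∼10 ng quoted in the text, assuming each fraction contains about 5 μg of protein in 10 μL.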
Reversed-Phase Liquid Chromatography
RPLC was performed
using an UltiMate 3000 RSLCnano system (Thermo Fisher Scientific)
as described previously.[17] In brief, a
self-packed trap column (150 μm × 2.5 cm, PLRP-S 5 μm
1000 Å pore size) and analytical column (75 μm × 25
cm, PLRP-S 5 μm 1000 Å pore size) were configured in a
vented T setup. The trap and column were kept at 55 °C. Buffer
A: 94.8% water, 5% acetonitrile, 0.2% formic acid; buffer B: 94.8%
acetonitrile, 5% water, 0.2% formic acid. The samples were injected
(6 μL, ∼1 μg total protein) onto the trap column
and washed with 5% buffer B at 3 μL/min for 10 min. Following
a valve switch, the proteins were separated on the analytical column
according to the following gradient: 5% B at 10 min, 15% B at 13 min,
45% B at 70 min, 95% B at 72 min, 95% B at 76 min, 5% B at 80 min,
and 5% B from 80 to 90 min. For fractions 5 and 6, the proteins were
separated according to the following gradient: 5% B at 10 min, 15%
B at 13 min, 50% B at 70 min, 95% B at 72 min, 95% B at 76 min, 5%
B at 80 min, and 5% B from 80 to 90 min. The eluted proteins were
ionized in positive ion-mode nanoelectrospray ionization using a pulled-tip
nanospray emitter (15 μm i.d. × 125 mm, New Objective)
packed with 1 mm of PLRP-S 5 μm 1000 Å pore size with a
custom nanosource.
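The gradient set points listed above can be turned into a function that returns the percentage of buffer B at any time. This is a minimal sketch that assumes linear ramps between consecutive set points and 5% B during the initial 10 min trap wash; both are assumptions about how the pump program interpolates, and the names are illustrative.

```python
from bisect import bisect_right

# (time in min, % buffer B) set points taken from the text.
GRADIENT_DEFAULT = [(0, 5), (10, 5), (13, 15), (70, 45),
                    (72, 95), (76, 95), (80, 5), (90, 5)]
GRADIENT_FRACTIONS_5_6 = [(0, 5), (10, 5), (13, 15), (70, 50),
                          (72, 95), (76, 95), (80, 5), (90, 5)]


def percent_b(t_min, gradient=GRADIENT_DEFAULT):
    """Percent buffer B at time t_min, assuming linear ramps between points."""
    times = [t for t, _ in gradient]
    if not times[0] <= t_min <= times[-1]:
        raise ValueError(f"time {t_min} min is outside the gradient program")
    i = bisect_right(times, t_min) - 1
    if i == len(gradient) - 1:             # exactly at the final set point
        return float(gradient[-1][1])
    (t0, b0), (t1, b1) = gradient[i], gradient[i + 1]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (5, 13, 41.5, 70, 74, 85):
        print(t, percent_b(t), percent_b(t, GRADIENT_FRACTIONS_5_6))
```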
Top-Down MS
MS was performed either
using a Thermo
Scientific Orbitrap Eclipse Tribrid mass spectrometer or a Thermo
Scientific Fusion Lumos Orbitrap Tribrid mass spectrometer. For analysis
on the Eclipse MS, data was acquired using the following global parameters:
spray voltage: 1600 V, sweep gas: 0, ion transfer tube temperature:
320 °C, application mode: intact protein, pressure mode: low
pressure (2 mTorr), advanced peak determination: true, default charge
state: 15, S-lens RF: 30%, source collision-induced dissociation:
15 eV. The precursor spectra were acquired at a 120,000 resolving
power, detector type: Orbitrap, scan range: 600–2000 m/z, mass range: normal, AGC target: 2E6,
normalized AGC target: 500%, max injection time: 50 ms, microscans:
1. The mass spectrometer was operated using a TopN 3 s data-dependent
acquisition mode. The precursor ions were filtered by intensity, charge
state, and dynamic exclusion. Intensity minimum: 5E3, intensity maximum:
1E20, include charge states: 4–60, include undetermined
charge states: false, dynamic exclusion after n times: 1, dynamic
exclusion duration: 60 s, mass tolerance: 0.5 m/z, exclude isotopes: true. The ions for fragmentation were
isolated and fragmented via higher-energy collisional dissociation (HCD). Detector
type: Orbitrap, isolation mode: quadrupole, resolving power: 60,000,
scan range: 350–2000 m/z,
AGC target: 1E6, normalized AGC target: 2000%, max injection time:
600 ms, microscans: 1, isolation window: 3 m/z, activation type: HCD, collision energy: 32, collision
energy mode: fixed.

For analysis on an Orbitrap Fusion Lumos
mass spectrometer, data was acquired with the following global parameters:
spray voltage: 1600 V, sweep gas: 0, ion transfer tube temperature:
320 °C, application mode: intact protein, pressure mode: low
pressure (2 mTorr), advanced peak determination: true, default charge
state: 15, S-lens RF: 30%, source collision-induced dissociation:
15 eV. The precursor spectra were acquired at a 120,000 resolving
power (at 200 m/z), mass range:
normal, detector type: Orbitrap, scan range: 600–2000 m/z, AGC target: 1E6, normalized AGC target:
250%, max injection time: 100 ms, microscans: 4. The mass spectrometer
was operated using a Top2 data-dependent acquisition mode. The precursor
ions were filtered by intensity, charge state, and dynamic exclusion.
Intensity minimum: 2E4, intensity maximum: 1E20, include charge states:
6–60, include undetermined charge states: false, dynamic exclusion
after n times: 1, dynamic exclusion duration: 60 s, mass tolerance:
1.5 m/z, exclude isotopes: true.
The ions for fragmentation were isolated and fragmented via HCD. Detector
type: Orbitrap, isolation mode: quadrupole, resolving power: 60,000
(at 200 m/z), scan range: 400–2000 m/z, AGC target: 1E6, normalized AGC target:
2000%, max injection time: 400 ms, microscans: 4, isolation window:
3 m/z, activation type: HCD, collision
energy: 27, collision energy mode: fixed.
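The precursor filters listed above (intensity threshold, charge-state range, and dynamic exclusion) can be illustrated with a simplified selection routine. This is only a sketch of the general logic of data-dependent acquisition using the Lumos values as defaults; it is not the instrument vendor's algorithm, it treats the mass tolerance as a symmetric ± window (an assumption), and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precursor:
    mz: float
    charge: int
    intensity: float


def select_precursors(candidates, recently_fragmented, now_s, *, top_n=2,
                      min_intensity=2e4, charge_range=(6, 60),
                      exclusion_s=60.0, mz_tol=1.5):
    """Pick up to top_n precursors that pass all filters.

    recently_fragmented is a list of (mz, time_s) pairs for ions already
    selected; an ion is excluded if it lies within mz_tol of one of them
    and less than exclusion_s seconds have passed.
    """
    def excluded(p):
        return any(abs(p.mz - mz) <= mz_tol and now_s - t < exclusion_s
                   for mz, t in recently_fragmented)

    eligible = [p for p in candidates
                if p.intensity >= min_intensity
                and charge_range[0] <= p.charge <= charge_range[1]
                and not excluded(p)]
    # Most intense eligible ions are fragmented first.
    return sorted(eligible, key=lambda p: p.intensity, reverse=True)[:top_n]


if __name__ == "__main__":
    scan = [Precursor(800.0, 10, 5e4), Precursor(900.0, 12, 1e5),
            Precursor(1000.0, 3, 1e6), Precursor(1100.0, 20, 1e4),
            Precursor(1200.0, 15, 3e4)]
    print(select_precursors(scan, [(900.5, 10.0)], now_s=30.0))
```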
Protein and Proteoform Identification
The raw data
files were processed with the publicly available workflow on TDPortal
(https://portal.nrtdp.northwestern.edu, Code Set 4.0.0) that performed mass inference, searched a database
of human proteoforms derived from Swiss-Prot (June 2020) with curated
histones, and estimated conservative, context-dependent 1% false discovery
rate (FDR) at the protein, isoform, and proteoform levels.[18] Each tissue type was searched separately with
its own FDR context. Aggregated search results were used in further
data analysis.
Code and Data Availability
Raw files,
mzIdentML, and
tdReport files were deposited in MassIVE (Accession MSV000088565).
The search results in the tdReport format are viewable using TDViewer—a
freeware from Northwestern University (http://topdownviewer.northwestern.edu). The search results were further analyzed, and figures were generated
with custom code written in R 4.1.0. The source code for data analysis
is available at https://github.com/bdrown/rplc-cze-tissues.
Results and Discussion
The samples were obtained from HuBMAP Tissue Mapping Centers from
10 human donors. The tissue was cryopulverized and lysed, and the
proteins were precipitated (Figure 1). To increase the depth of proteome coverage, the
proteins were fractionated using GELFrEE prior to MS analysis. Since
we intended to analyze each sample by both CZE and RPLC, we set up
two Orbitrap tribrid MS instruments configured with either CZE or
RPLC, acquired data for a sample on one system, and immediately acquired
data for the same sample on the second one. CZE substantially benefits
from a higher scan rate due to generally narrower peak widths. Consequently,
the CESI 8000 Plus was hyphenated to the Orbitrap Eclipse, while a
Dionex nanoLC was coupled to the Orbitrap Fusion Lumos. Three tissue
types (heart, small intestine, and spleen) were analyzed by this paired
analysis, while two tissues (lungs and kidneys) were analyzed solely
by RPLC-MS on the Orbitrap Eclipse (Table 1).
Figure 1
TDP of healthy human tissues. Tissues were obtained
from HuBMAP
Tissue Mapping Centers. Fresh-frozen tissue was cryogenically pulverized,
lysed, and precipitated. Intact proteins were prefractionated using
GELFrEE. Each sample was analyzed by CZE-MS/MS and RPLC-MS/MS, respectively.
Table 1
Proteins and Proteoforms Identified from Sampling Five Human Tissue Types

| tissue type | biological replicates (a) | separation | MS/MS runs | proteins, 1% FDR (b) | unique proteins, 1% FDR (c) | proteoforms, 1% FDR (C-score >30) | unique proteoforms (C-score >30) |
|---|---|---|---|---|---|---|---|
| lungs | 3 | RPLC | 49 | 437 | 132 | 5566 (2940) | 3601 (1462) |
| kidneys | 5 | RPLC | 42 | 307 | 62 | 2278 (988) | 641 (306) |
| heart | 2 | CZE, RPLC | 72 | 305 | 70 | 2897 (1346) | 1623 (772) |
| small intestine | 1 | CZE, RPLC | 36 | 305 | 43 | 3101 (1214) | 2049 (643) |
| spleen | 1 | CZE, RPLC | 35 | 213 | 36 | 1869 (972) | 870 (589) |
| total | 12 | | 234 | 1567 | 343 | 15,711 (7460) | 8784 (3772) |
| total nonredundant (d) | 12 | | 234 | 740 | 343 | 11,466 (4906) | 8784 (3772) |

(a) Biological replicate refers to a sample from a single human being. Sample descriptions and metadata are shown in Table S1.
(b) The term “protein” refers to the SwissProt entry mapping to a single human gene.
(c) Unique identifications refer to proteins or proteoforms that were only identified in the tissue type indicated.
(d) Proteins and proteoforms that were observed in more than one human tissue type are counted once in nonredundant totals.
Identification of Human Proteoforms in Solid Tissues
By searching the TDP data against
a database of human proteoforms
using TDPortal and 1% conservative FDR, a total of 11,466 proteoforms
from 740 proteins were identified (Table 1). Of these annotations, 8784 proteoforms
and 343 proteins were unique to a single tissue type (Table 1 and Figure 2A). The lung tissue contained the highest
number of proteoforms and proteins (overall and unique), while the
kidney tissue contained the fewest unique proteoforms (Figure S1). Despite having the lowest number
of proteins identified, the spleen tissue had a high number of proteoforms
per protein (Figure S1). While histones
and hemoglobin generated the highest number of proteoforms per protein
in most tissues, several other proteins populated the top 15 proteins
(Figure S2). Among the shared proteins
and proteoforms, histones, ribosomal proteins, ATP synthase subunits,
and other housekeeping proteins were most frequently observed (Supporting Information Data 1). Overall, CZE-MS/MS
resulted in a higher number of protein and proteoform identifications
than RPLC-MS/MS (Figure 2B). However, the difference in MS instrument performance likely contributed
to the increased number of identifications obtained with the CZE-MS/MS
workflow.
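The distinction between redundant totals, nonredundant totals, and tissue-unique identifications can be expressed with simple set operations. The sketch below uses made-up accession strings purely for illustration; it is not the published analysis code, and the function name is hypothetical.

```python
from collections import Counter


def unique_and_nonredundant(ids_by_tissue):
    """Return (IDs unique to each tissue, nonredundant set of all IDs)."""
    # Count in how many tissues each identifier was observed.
    counts = Counter(pid for ids in ids_by_tissue.values() for pid in set(ids))
    unique = {tissue: {pid for pid in ids if counts[pid] == 1}
              for tissue, ids in ids_by_tissue.items()}
    return unique, set(counts)


# Toy data with invented identifiers (not real proteoform accessions).
TOY = {
    "heart": {"PF_A", "PF_B", "PF_C"},
    "spleen": {"PF_B", "PF_D"},
    "lungs": {"PF_B", "PF_C", "PF_E", "PF_F"},
}

if __name__ == "__main__":
    unique, nonredundant = unique_and_nonredundant(TOY)
    redundant_total = sum(len(ids) for ids in TOY.values())
    print("redundant total:", redundant_total)
    print("nonredundant total:", len(nonredundant))
    print({tissue: sorted(ids) for tissue, ids in unique.items()})
```

In this toy case an identifier seen in several tissues contributes to each tissue's count in the redundant total but only once to the nonredundant total, which mirrors how the two total rows of Table 1 relate to each other.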
Figure 2
Systematic discovery of unique proteoforms across human tissues.
(A) Venn diagrams of shared and unique proteins and proteoforms identified
in each tissue. 1% FDR filtering was applied at the PrSM, proteoform,
and protein levels for each tissue. (B) Venn diagrams of shared and
unique proteins and proteoforms identified in the heart, small intestine,
and/or spleen tissues by either CZE or RPLC. (C) Pie charts representing
the rediscovery of proteoforms and proteins previously deposited in
the HPfA (red) or only this study (New, blue). HPfA was accessed on
8/18/2021. (D) Heat map showing the presence (yellow) and absence
(purple) of proteoforms in each tissue sample with hierarchical clustering.
(E) Bar graph of top 20 enriched terms from genes associated with
proteoforms uniquely identified in the heart tissue using Metascape.
We also sought to compare the proteoforms identified in this work
to those reported in prior studies. The Human Proteoform Atlas (HPfA, http://human-proteoform-atlas.org/) is the most comprehensive collection of characterized proteoforms.[14] The HPfA consists of 49 datasets, which include
numerous studies on immortalized cell lines, one study on healthy
human solid tissues,[19] two studies on human
cancer tissues,[20,21] and the Blood Proteoform Atlas
(http://blood-proteoform-atlas.org/).[22] Of the 11,466 proteoforms identified
in this study, 7373 (64.3%) were
not previously reported in the HPfA, while 4093 (35.7%)
were present in this database (Figure 2C). The frequency of rediscovery was higher at the
protein level, with 198 (26.8%) proteins first reported here and 542
(73.2%) proteins already included in the HPfA database (Figure 2C). Thus, while some proteins were identified
for the first time in this study, the majority of new proteoforms
are differently modified forms of proteins that were previously
detected by TDP. Presence and absence matrices showed clear clustering
of tissues at the proteoform level (Figure 2D), demonstrating that proteoform identifications
are more characteristic of the tissues under study.

A “bird’s-eye”
view of the physicochemical
properties of proteoforms identified in the five different tissue
types, including hydrophobicity, monoisotopic mass, and pI value,
can be found in Figures 3A and S3. While the kidney, lung, and
spleen tissue proteoforms show similar distributions in their violin
plots regarding all three investigated characteristics, distinct differences
for the heart and especially small intestine tissue were detected.
For example, in the case of the small intestine, a high number of
proteoforms in the pI range of 10.5–12.0 was observed, which
can be explained by a relative increase in histone proteoforms compared
to those in the other analyzed tissue types. This is also supported
by the negative GRAVY score, showing a large distribution at around
−0.6. On the other hand, proteoforms observed in the heart
tissue exhibit a relatively broad distribution of pI values.
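The GRAVY (grand average of hydropathy) score referred to above is conventionally the mean Kyte-Doolittle hydropathy of all residues in a sequence. The following is a minimal sketch assuming that standard definition; the text does not state which implementation was used, and the function name is illustrative.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy of a protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    try:
        return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    except KeyError as exc:
        raise ValueError(f"non-standard residue: {exc.args[0]}") from exc


if __name__ == "__main__":
    # Lysine- and arginine-rich sequences score strongly negative.
    print(gravy("KKRR"), gravy("ILVA"))
```

Negative values indicate hydrophilic sequences, which is consistent with basic, lysine- and arginine-rich proteins such as histones showing negative GRAVY scores.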
Figure 3
Complementary
separation of intact proteins by CZE and RP-nanoLC.
(A) Violin plots of proteoform physicochemical properties by the tissue
and separation technique. (B) Scatter plots relating the migration/retention
time to the monoisotopic mass
of proteoforms from the heart, small intestine, and spleen samples
subdivided by the separation method and GELFrEE fraction. (C) Scatter
plots relating the migration/retention time to the GRAVY score of
proteoforms from the heart, small intestine, and spleen samples subdivided
by the separation method and GELFrEE fraction. Corresponding correlation
coefficients of data presented in panels B and C are listed in Table S3.
Influence of the Separation Technique
While the performances
of CZE and RPLC have been compared in numerous contexts,[23−27] the paired analysis of the heart, small intestine, and spleen provides
an opportunity to explore how proteoforms behave regarding these two
separation techniques. Despite requiring similarly long acquisition
times, the window of separation for CZE was smaller than that for
RPLC. The difference in the separation principle was evident in the
relationship between proteoform retention/migration times and mass
(Figure 3B), as well
as time and hydrophobicity (Figure 3C). While there is a strong correlation between mass
and retention time with RPLC, no significant correlation was observed
between mass and migration time with CZE (Table S3). Both separation methods demonstrate a correlation between
hydrophobicity and time, but RPLC has a stronger correlation. While
CZE was performed with an acidic background electrolyte (pH 2.4),
we observed a positive correlation between the proteoform hydrophobicity
and mass-to-charge ratio (Figure S3I),
which helps to explain the increase in hydrophobicity with migration
time (fewer “ionizable” amino acids available
per size).

In addition to the differences in the physicochemical properties of proteoforms
identified using CZE and RPLC, the distribution of PTMs
was similarly asymmetrical. Twelve PTM categories were identified
(Table 2), and their
identifications differed significantly (Pearson’s χ-squared
test, χ² = 196, p-value < 2 ×
10^-16) depending on the separation method. Two-by-two
χ-squared tests were performed to determine which PTMs had significant
deviations in their identification rates (observed PTM/the sum of
all other PTMs), as described previously.[28] Monomethylation, half-cystines, and oxygenation were elevated on
CZE-MS/MS, while on RPLC-MS/MS, the detection of monoacetylated and
trimethylated proteoforms was enhanced. PTM observation frequencies
at the proteoform spectral match (PrSM) level followed the same trends
in observation biases (Table S4). The elevation
of half-cystines and oxygenated residues in the CZE-MS/MS data suggests
that the electrophoretic process can oxidize some sensitive residues.
While the rate of observing oxidized proteoforms is still low overall,
this trend should be considered when performing CZE-MS/MS acquisition.
The differential rates of methylation and acetylation led us to see
if histones were more highly characterized by one separation method.
Indeed, the number of histone proteoforms identified and the number
of histone PrSMs were elevated in the CZE-MS/MS data compared to those
in paired LC–MS/MS data (Figure S4A,C). This trend was maintained even when normalizing for total proteoforms
and spectral matches (Figure S4B,D). In summary,
these observations substantiate the benefit of combining
CZE- and RPLC-derived data to increase the coverage of the proteoform
discovery workflow.
Table 2
Frequency of Observation for Different Types of PTMs on Identified Proteoforms Categorized by the Separation Technique Used in TDP

| PTM type | CZE observed (a) | CZE freq. (b) | RPLC observed (a) | RPLC freq. (b) | χ² | p-value (c) |
|---|---|---|---|---|---|---|
| monoacetylation (d) | 2723 | 0.26 | 1984 | 0.31 | 54 | 2.6 × 10^-12 |
| unmodified (d) | 2298 | 0.22 | 1123 | 0.18 | 44 | 4.3 × 10^-10 |
| phosphorylation | 1644 | 0.16 | 1006 | 0.16 | 0.057 | >1 |
| monomethylation (d) | 1201 | 0.11 | 556 | 0.088 | 31 | 3.6 × 10^-7 |
| trimethylation (d) | 920 | 0.088 | 667 | 0.11 | 14 | 2.8 × 10^-3 |
| dimethylation | 919 | 0.088 | 642 | 0.10 | 8.3 | 4.9 × 10^-2 |
| half-cystine (d) | 360 | 0.034 | 118 | 0.019 | 35 | 3.8 × 10^-8 |
| nitrosylation | 239 | 0.023 | 165 | 0.026 | 1.6 | >1 |
| oxygenated (d) | 72 | 0.0069 | 5 | 7.9 × 10^-4 | 31 | 3.4 × 10^-7 |
| pyruvic acid iminylated residue | 48 | 0.0046 | 41 | 0.0065 | 2.3 | >1 |
| deamidated L-asparagine | 42 | 0.0040 | 38 | 0.0060 | 2.9 | >1 |
| S-palmitoylation | 14 | 0.0013 | 7 | 0.0011 | 0.037 | >1 |
| total | 10,480 | | 6352 | | | |

(a) Number of modifications observed on proteoforms at 1% FDR; count does not include N-terminal and C-terminal modifications; multiple PTMs on the same proteoform are counted multiple times.
(b) Number of observations/sum of PTM observations for each separation technique.
(c) Bonferroni-corrected p-value (n = 12).
(d) Statistically significant difference (α < 0.01) in the frequency of observation.
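The two-by-two χ-squared tests summarized in Table 2 compare, for each PTM, the count of that PTM against the sum of all other PTM observations for CZE and RPLC. The sketch below reproduces that calculation with the standard library only. It assumes that Yates' continuity correction was applied (the default for 2 × 2 tables in R's `chisq.test`), which is an inference from the tabulated values and is not stated in the text. The Bonferroni-adjusted p-value is capped at 1 here, whereas Table 2 reports such cases as ">1". Function names are illustrative.

```python
import math


def chi2_2x2(a, b, c, d, yates=True):
    """Chi-squared statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction (assumed; see lead-in).
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def ptm_test(cze_obs, cze_total, rplc_obs, rplc_total, n_tests=12):
    """Compare one PTM against all other PTMs, with Bonferroni adjustment."""
    chi2, p = chi2_2x2(cze_obs, cze_total - cze_obs,
                       rplc_obs, rplc_total - rplc_obs)
    return chi2, min(1.0, p * n_tests)


if __name__ == "__main__":
    # Counts taken from Table 2 (totals: 10,480 for CZE and 6352 for RPLC).
    for name, cze, rplc in [("monoacetylation", 2723, 1984),
                            ("phosphorylation", 1644, 1006)]:
        chi2, p_adj = ptm_test(cze, 10480, rplc, 6352)
        print(f"{name}: chi2 = {chi2:.3g}, adjusted p = {p_adj:.2g}")
```

Under these assumptions the statistics for monoacetylation and phosphorylation agree with the rounded values listed in Table 2.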
Tissue-Specific Proteoforms and Handling of PTM Ambiguity
Uncertainty in the exact position of a PTM
on a proteoform can
arise in cases where SwissProt entries have many recorded modifications
and amino acid variants and fragmentation data are too incomplete to assert
an unambiguous level 1 proteoform.[29] This
phenomenon is exemplified by cardiac troponin C (cTnC), which was
identified in its canonical form (full length, N-terminal acetylated,
PFR55232) as a level 1 proteoform (Figure 4A). Nine additional proteoforms had sufficiently
high proteoform-level Q-scores to pass FDR cutoffs due to excellent
sequence coverage in regions without modifications, and they were
classified as level 3 proteoforms with some PTM site ambiguity (Figure 4A). The case of
cTnC is not isolated; the majority of proteoforms identified in this
study are either chemically modified or bear a sequence variant, as
only 33% are unmodified (Figure 4B). While filtering by C-score can help triage level
3 proteoforms for which PTM localization is ambiguous, the C-score
does not help in cases where there is only one possible site of modification.[30]
Figure 4
Selection of tissue-specific proteoforms. (A) Cigar depiction
of
cTnC proteoforms identified in the human heart tissue. Red, blue,
and purple marks on the bottom of cigars indicate b, y, and both b
and y fragment ions. Tan marks on top of cigars indicate the presence
of a PTM or sequence variant. (B) Distribution of proteoforms identified
with PTMs or sequence variants. Proteolytic cleavage and N-terminal
acetylation are excluded from consideration as PTMs in this panel.
(C) Histogram of proteoforms and the number of matching fragment ions
that support the presence of a sequence variant (e.g., a polymorphism).
(D) Histogram of proteoforms and the number of matching fragment ions
that support the presence of a PTM. (E) Sequential filtering of proteoforms
to identify high-confidence tissue-specific proteoforms. (F) Identification
of tissue-specific defensin proteoforms.
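The sequential filtering summarized in panel E can be sketched in code. The following is a minimal Python sketch, assuming the thresholds stated in the accompanying text (C-score > 30 for every proteoform, plus ≥3 matching fragment ions bearing the PTM or variant for modified proteoforms) and assuming that "tissue-specific" means observed in exactly one tissue. The `Hit` record, its field names, the accessions in the example, and the order of the steps are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass

C_SCORE_MIN = 30       # from the text: C-score > 30
MIN_SUPPORT_IONS = 3   # from the text: >= 3 ions supporting the PTM/variant


@dataclass(frozen=True)
class Hit:
    """One proteoform observation in one tissue (hypothetical layout)."""
    proteoform: str    # accession-like identifier
    tissue: str
    c_score: float
    modified: bool     # carries a PTM or sequence variant
    support_ions: int  # matching fragment ions that bear the PTM/variant


def passes_quality(hit: Hit) -> bool:
    """C-score cutoff for all hits; ion-support cutoff for modified ones."""
    if hit.c_score <= C_SCORE_MIN:
        return False
    if hit.modified and hit.support_ions < MIN_SUPPORT_IONS:
        return False
    return True


def tissue_specific(hits: list[Hit]) -> dict[str, str]:
    """Map each proteoform seen in exactly one tissue to that tissue,
    keeping it only if at least one of its hits passes the quality cutoffs."""
    tissues: dict[str, set[str]] = {}
    good: set[str] = set()
    for h in hits:
        tissues.setdefault(h.proteoform, set()).add(h.tissue)
        if passes_quality(h):
            good.add(h.proteoform)
    return {
        p: next(iter(t))
        for p, t in tissues.items()
        if len(t) == 1 and p in good
    }


if __name__ == "__main__":
    demo = [
        Hit("PF_A", "heart", 45.0, True, 5),
        Hit("PF_B", "heart", 45.0, True, 2),    # too few supporting ions
        Hit("PF_C", "heart", 50.0, False, 0),
        Hit("PF_C", "spleen", 50.0, False, 0),  # seen in two tissues
    ]
    print(tissue_specific(demo))
```

Tissue membership is computed here from all hits before the quality cutoffs are applied, so a proteoform seen at low confidence in a second tissue is still treated as shared. That is a conservative choice made for this sketch only.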
To curate a core set of proteoforms uniquely expressed in the five
individual tissue types, we implemented a conservative process to
select those proteoforms with PTMs with direct fragment ion support
(level 1 proteoforms[29]). To this end, the
number of matching fragment ions that bear a PTM (or amino acid variant)
was counted for each PrSM. While many mutated and modified proteoforms
have supporting fragment ions (level 1), a disproportionate number
of modified proteoforms were level 3 with two or fewer ions (Figure 4C,D). Consequently,
a requirement of ≥3 supporting fragment ions for modified
proteoforms was added to the C-score >30 cutoff. This process
culled the set of 8784 unique proteoforms in Table down to 2843 level 1 tissue-specific proteoforms
(Figure 4E and Supporting Information Data 1).
More level 1 tissue-specific proteoforms were identified in a subsequence
search (previously called BioMarker search, which identifies portions
of full-length proteoforms[31,32]) than in absolute mass
searches. Specifically, 2548 proteoforms were identified in subsequence
searching compared to 295 proteoforms identified in absolute mass
searches. Subsequence searches identify proteolytic fragments that
often arise from endogenous proteolytic events and can serve as significant
biomarkers.[21] While a portion of these
proteoforms may be the product of nonspecific proteolysis, the consensus
sequence of cleavage sites varied across tissues (Figure S5). Truncated proteoforms from the heart, kidneys,
and small intestine showed enrichment of F, Y, W, and L at P1, which
suggests chymotrypsin activity. The spleen proteoforms demonstrated
enrichment of hydrophobic residues but no apparent sequence specificity.
This lack of specificity combined with a high proteoform-to-protein
ratio agrees well with the role of the spleen for scavenging senescent
blood cells.[33] Lung proteoforms had a higher
propensity of cysteine at P1, which is not commonly observed for specific
proteases. This enrichment was driven by 24 of the 715 lung-specific
proteoforms with N-terminal cleavage. Nine of these 24 proteoforms originate
from collapsin response mediator protein 2 (CRMP-2; Q16555), with
cleavage occurring at C439 (Figure S6).
CRMP-2 has largely been studied in the context of neurological diseases
due to its role in microtubule assembly and axon growth.[34] Indeed, C-terminal truncation of CRMP-2 has
been linked to neurodegeneration,[35] and
the cleavage site was later localized to S517.[36] As the function of CRMP-2 in the lung tissue has only recently
begun to be characterized,[37] this novel
truncation at C439 may assist in elucidating its role.
Subsequence searching also identified a proteolytic cleavage site
in CDGSH iron–sulfur domain-containing protein 1 (mitoNEET;
Q9NZ45) at L47 (Figure S7). MitoNEET
is a mitochondrial outer membrane protein that was initially discovered
as an off-target interactor of the PPAR-γ agonist pioglitazone.[38] With its iron–sulfur cluster oriented
toward the cytosol, mitoNEET acts as a redox sensor and regulator
of mitochondrial iron.[39−41] Downregulation of mitoNEET has been associated with
aging and increased risk of heart failure.[42] The canonical proteoform of mitoNEET was observed in both the small
intestine and heart tissue, while both proteolytic products were observed
solely in the heart tissue (Figure S7).
Cleavage at L47 does not disrupt the iron–sulfur cluster binding
site but does separate this reactive center from the protein’s
transmembrane domain. Thus, proteolytic cleavage may act as a means
for regulating mitoNEET or a mechanism by which full-length mitoNEET
abundance declines in aging cardiomyocytes.
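The cleavage-site reasoning in this section (enrichment of F, Y, W, and L at P1 suggesting chymotrypsin-like activity) can be illustrated with a short Python sketch. It assumes that each truncated proteoform is described by the 1-based position at which it begins in the parent sequence, so that P1 is the residue immediately before that position. The function names, the residue classes, and the toy sequence are illustrative assumptions and are not taken from the study's data.

```python
from collections import Counter

# Residue classes used for illustration only.
TRYPSIN_LIKE = set("KR")         # trypsin cleaves C-terminal to K or R
CHYMOTRYPSIN_LIKE = set("FYWL")  # residues named in the text for P1


def p1_residue(parent_seq: str, first_residue: int) -> str:
    """Return the P1 residue of an N-terminally truncated proteoform.

    first_residue is the 1-based position in the parent sequence where the
    truncated proteoform begins; P1 is the residue just before it.
    """
    if not 2 <= first_residue <= len(parent_seq):
        raise ValueError("truncation must start inside the parent sequence")
    return parent_seq[first_residue - 2]


def classify_p1(residue: str) -> str:
    """Assign a P1 residue to a coarse protease-specificity class."""
    if residue in TRYPSIN_LIKE:
        return "trypsin-like"
    if residue in CHYMOTRYPSIN_LIKE:
        return "chymotrypsin-like"
    return "other"


def p1_frequencies(residues) -> dict[str, float]:
    """Fraction of cleavage events observed with each P1 residue."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


if __name__ == "__main__":
    toy = "MARKFLCDE"     # made-up sequence, not a real protein
    starts = [5, 7, 8]    # made-up truncation start positions
    p1s = [p1_residue(toy, s) for s in starts]
    print([(aa, classify_p1(aa)) for aa in p1s])
    print(p1_frequencies(p1s))
```

A per-tissue tally of this kind only gives raw frequencies. Judging enrichment would also require a background amino acid distribution, which is omitted here.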
Unique Proteoforms Are Reflective of Tissue Central Functions
Many of the tissue-specific
proteoforms originate from genes involved
in the core function of these tissues, as indicated by gene ontology
enrichment (Figures 4E and S8). The subsequence proteoform
search identified a series of proteoforms associated with defensins
with distinct expression patterns (Figures 4F and S9). Defensins
are a family of small cationic host defense proteins characterized
by three conserved intramolecular disulfide bonds.[43] Six human α-defensins have been identified to date
and are subdivided into human neutrophil peptides 1–4 (HNP1–4)
and human (enteric) defensins (HD5–6). HNPs are stored as mature
peptides in granules of neutrophils and released upon activation by
exocytosis.[44] HNP1 (PFR69106) was identified
in both lung and spleen tissues, as expected for tissues with high
neutrophil content. HNP2 (PFR69109), HNP3 (PFR69079), HNP4 (PFR65983),
and truncation products of HNP2 (PFR165182 and PFR165183) were observed
exclusively in the spleen tissue. No β-defensin proteoforms
were identified. HD5 and HD6 are produced in Paneth cells at the base
of small-intestinal crypts.[45] Accordingly,
HD5 and HD6 were detected exclusively in the small-intestinal tissue.
Unlike other defensins, HD5 is stored as a propeptide, and the fully
mature peptides are thought to be produced by intracellular trypsin.[46] Consequently, the HD5 propeptide (PFR165815)
and several truncated products were observed. Several of these truncated
proteoforms (PFR5737351, PFR97759, and PFR97755) correspond to trypsin
cleavage sites (R25, R55, and R62), while others (PFR5741069, PFR5737454,
and PFR5737363) seem to correspond to other mechanisms of cleavage
considering the residues at the P1 positions (D41, F46, and A61).
Defensins are important components of the host innate immunity, so
observing new proteoforms on mucosal surfaces is important in understanding
their regulation and design of therapeutic mimetics.[47,48] Furthermore, these findings showcase the capability
of the presented setup to address tissue-specific proteoform-related
questions.
Glutathione S-transferases are a
family of proteins involved in inflammation and the cellular defense
against toxic and carcinogenic compounds.[49,50] Proteoforms from this protein family were broadly observed but with
distinct tissue distributions (Figure S10). Glutathione S-transferase A1 (P08263) and A2
(P09210) were observed primarily in the small intestine and kidneys,
respectively. The polymorphism E210A (rs6577) was observed in a single
kidney sample (Biorep 3), which was derived from a 53-year-old African
American male (Table S1). This coding SNP
occurs with much higher frequency in African Americans (56.5%) compared
to the global population (9.9%).[51] Microsomal
glutathione S-transferases (MGSTs) 1, 2, and 3 were
observed in the small intestine and lungs (1), small intestine and
kidneys (2), and heart tissue (3), respectively (Figure S10C,D). These glutathione transferases are polytopic
membrane proteins located in the endoplasmic reticulum membrane with
both glutathione conjugation and peroxidase activity.[52,53] A novel MGST3 proteoform (PFR5719232) that lacks the C-terminal
cysteine necessary for S-palmitoylation was the predominant
form observed in the heart tissue.[54]
Enrichment of functionally relevant genes from the identified proteoforms
was particularly notable for the heart tissue, with terms associated
with ATP synthesis and muscle contraction leading the list (Figure 4E). Six proteoforms
of cardiac phospholamban (PLN), a key regulator of cardiac contraction
via inhibition of the sarcoplasmic reticulum calcium pump (SERCA),
were identified by RPLC-MS/MS (Figure 5A).[55] While unmodified PLN
and palmitoylated PLN have both been reported previously,[56] this study is the first report of phosphorylated
PLN and of combined phosphorylation and palmitoylation. Phosphorylation
and palmitoylation of PLN have both been shown to affect the
localization, complexation, and inhibition of SERCA, so accurate measurement
of their combination will help clarify PLN’s role in health
and disease.[57]
Figure 5
Unique cardio-proteoforms
identified in paired RPLC- and CZE-MS/MS
analysis. (A) Phosphorylated and palmitoylated proteoforms of PLN
(P26678) were observed by RPLC-MS/MS late in the chromatogram. (B)
Phosphorylation of the ventricular myosin regulatory light chain (RLCV;
P10916). HCD fragmentation precisely localized the phosphorylation
to S15. (C) cTnI (P19429) was observed by CZE- and RPLC-MS/MS as three
phosphoproteoforms, which correlate with enlargement of the heart
in a model of hypertrophic cardiomyopathy (ref (60)). Both CZE- and RPLC-TDPs
successfully resolved and quantified all three proteoforms.
We also present evidence for phosphorylation of
the ventricular myosin
regulatory light chain (RLCV). Prior reports by the Ge
group have established N-terminal trimethylation of RLCV and phosphorylation of swine RLCV, but phosphorylation
of human RLCV was unlocalized.[58,59] From the area under the curve of the extracted ion chromatograms
of each proteoform, phosphorylated RLCV is estimated to
be at 9% relative abundance. The removal of N-terminal methionine
and trimethylation was confirmed by tandem HCD fragmentation, and
the site of phosphorylation was localized to S15, which is analogous
to the site identified on swine RLCV (Figure 5B). As a final analytical note,
phosphoproteoforms of cardiac troponin I (cTnI)[60] were not resolved by RPLC but were baseline-separated by
CZE (Figure 5C); proteoform
quantitation by both techniques showed <10% coefficient of variation
between them. Better separation of charge variants such as phospho-troponin
by CZE should translate into better on-the-fly sequence coverage and
proteoform characterization with tandem MS scan speeds.
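The quantitation described above (relative abundance from the area under extracted ion chromatograms, and a coefficient of variation between the two separation techniques) can be sketched as follows. This is a minimal Python sketch under stated assumptions: trapezoidal integration of a pre-extracted trace, relative abundance as each proteoform's share of the summed areas, and the sample standard deviation for the coefficient of variation. The function names and all numbers in the example are made up for illustration and do not reproduce the study's measurements.

```python
from statistics import mean, stdev


def trapezoid_area(times, intensities):
    """Area under an extracted ion chromatogram by the trapezoidal rule."""
    if len(times) != len(intensities) or len(times) < 2:
        raise ValueError("need two equally long series with >= 2 points")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def relative_abundance(areas: dict[str, float]) -> dict[str, float]:
    """Each proteoform's area as a fraction of the summed areas."""
    total = sum(areas.values())
    if total <= 0:
        raise ValueError("total area must be positive")
    return {name: a / total for name, a in areas.items()}


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


if __name__ == "__main__":
    # Made-up traces for two proteoforms of one protein.
    t = [0.0, 1.0, 2.0, 3.0]
    areas = {
        "unmodified": trapezoid_area(t, [0.0, 6.0, 6.0, 0.0]),
        "phosphorylated": trapezoid_area(t, [0.0, 2.0, 2.0, 0.0]),
    }
    print(relative_abundance(areas))
    # Made-up fractions of one proteoform measured by two techniques.
    print(coefficient_of_variation([0.30, 0.33]))
```

Summing areas over all charge states of a proteoform before computing the ratio would be a natural extension, but it is omitted to keep the sketch short.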
Conclusions
We have described the combination of TDP data collected with online
separation by RPLC and CZE to expand the depth of human proteome coverage.
All proteomics methods face the challenge of measuring low-abundance
analytes, so robust approaches that introduce new proteoform
selectivity are highly sought after. RPLC and CZE were shown to possess differential
proteoform selectivity that manifests as different physicochemical
properties and PTM profiles. In a TDP study of five human tissues,
we dramatically expanded the number of proteoforms associated with
these tissues by combining the two methods.
Confident assignment
of proteoforms bearing PTMs or sequence variations
becomes more challenging as query proteoforms get larger and the search
databases contain more candidate PTM sites. Unambiguous level 1 proteoform
assignments are particularly troublesome when seeking proteoforms
specific to a particular biological context (e.g., tissue types),
but this can be significantly mitigated with the inclusion of fragment-ion
data quality standards. Even at the current levels of proteoform characterization
quality, organ-specific proteoforms achieve robust tissue type identification.
The genes from the tissue-specific proteoforms identified in this
study were tied to the core function of the tissues, as broadly indicated
by GO analysis. This is further supported by specific examples such
as proteins that regulate muscle contractility (PLN, RLCV, and cardiac
troponins), host–pathogen interaction (defensins), cytoskeletal
reorganization (CRMP-2), and metabolic detoxification (family of glutathione
transferases). In many cases, these unique proteoforms were detected
with only one of the upfront separation methods. Thus, proper exploration
of our hypothesis that proteoform-level measurements more fully capture
biological context than protein-level measurement requires an increased
depth of proteome coverage.