
Elucidating Proteoform Families from Proteoform Intact-Mass and Lysine-Count Measurements.

Michael R Shortreed1, Brian L Frey1, Mark Scalf1, Rachel A Knoener1, Anthony J Cesnik1, Lloyd M Smith1,2.   

Abstract

Proteomics is presently dominated by the "bottom-up" strategy, in which proteins are enzymatically digested into peptides for mass spectrometric identification. Although this approach is highly effective at identifying large numbers of proteins present in complex samples, the digestion into peptides renders it impossible to identify the proteoforms from which they were derived. We present here a powerful new strategy for the identification of proteoforms and the elucidation of proteoform families (groups of related proteoforms) from the experimental determination of the accurate proteoform mass and number of lysine residues contained. Accurate proteoform masses are determined by standard LC-MS analysis of undigested protein mixtures in an Orbitrap mass spectrometer, and the lysine count is determined using the NeuCode isotopic tagging method. We demonstrate the approach in analysis of the yeast proteome, revealing 8637 unique proteoforms and 1178 proteoform families. The elucidation of proteoforms and proteoform families afforded here provides an unprecedented new perspective upon proteome complexity and dynamics.

Keywords:  NeuCode; PTM; database search; proteoform; proteoform family; proteomics; top-down

Year: 2016; PMID: 26941048; PMCID: PMC4917391; DOI: 10.1021/acs.jproteome.5b01090

Source DB: PubMed; Journal: J Proteome Res; ISSN: 1535-3893; Impact factor: 4.466


Introduction

The dominant means for identification of proteins in complex mixtures is bottom-up proteomics.[1] In this approach, a mixture of proteins from the sample of interest is cleaved into peptides, typically using trypsin, and analyzed by liquid chromatography–mass spectrometry (LC–MS). Fragmentation of the peptides within the mass spectrometer yields product-ion mass spectra, which are compared to theoretical mass spectra produced in silico based upon a generic reference protein database of the organism under study. Statistical analysis of the results provides a list of peptides identified in the sample, subject to a specified false discovery rate (FDR).[2] Proteins present in the sample are then inferred from the identified peptides in a process referred to as protein inference.[3,4] Implementations of this approach are routinely able to identify thousands of proteins in yeast,[5] human,[6] or other organisms.[7] The strategy can reveal differences in protein expression in different cell types or in response to cellular growth conditions or treatment with drugs.[8] While the bottom-up strategy is powerful and widely practiced, it does suffer from major shortcomings. Proteins produced from the same gene can vary substantially in their molecular structure: genetic variations, splice variants, RNA edits, and post-translational modifications (PTMs) all give rise to different forms of the proteins, referred to as “proteoforms”.[9] Knowledge of the proteoforms that are present in a system under study is absolutely essential to understanding that system, as the different proteoforms often have dramatically different functional behavior,[10] and regulation of their production is a central aspect of pathway control. One recent example is the finding that intact and clipped human histones differ in post-translational modification patterns[11] and that these combinations of sequence-length and PTM differences have functional consequences. 
Bottom-up strategies are unable to identify proteoforms for two reasons: first, the digestion of the proteins into peptides means that information is lost as to the protein context within which that peptide is found, making impossible the identification of the parent proteoform from which each peptide is derived; and second, the databases used for peptide identification do not generally contain information regarding amino-acid variant or modified peptides, causing such peptides to be effectively invisible in the absence of specialized search strategies,[12−14] which can introduce problems with search time and false identifications. One way in which these issues have been addressed is through the alternative strategy of “top-down” proteomics.[15] In top-down proteomics, the entire intact protein is subjected to fragmentation, and the proteoform is identified from the parent mass and fragmentation products; in the ideal case, this can yield a precise identification of the proteoform, including the nature and positions of PTMs. While historically such efforts have largely been limited to the exhaustive study of individual purified protein species,[16] recent work has extended the approach to highly complex samples such as yeast[17] and human[18] cell lysates. However, the highly complex nature and voluminous quantities of the data produced, as well as the need for long MS analysis times to produce data of sufficient sensitivity and resolution for proteoform identification, make this approach, at present, a highly specialized endeavor. We present here an alternative proteomic approach that utilizes proteoform intact mass and lysine count determinations, not tandem MS, to reveal proteoforms and proteoform families. A proteoform family, a concept we introduce here, is a set of proteoforms derived from a single gene. 
Individual proteoforms in a proteoform family frequently differ from one another by single post-translational modifications or amino acid differences but can also differ by larger changes due to splice variation or protein truncation. For example, all of the many different post-translational variants of histone H4 are members of a single proteoform family. A complex proteomic sample such as a cell lysate may contain thousands of proteoform families. We devised a computational process for the determination of the proteoform families present in a complex sample. The families are constructed from knowledge of just two pieces of information for each proteoform, the accurate proteoform mass and the number of lysine residues it contains. Proteoforms are considered to be related, and thus members of the same family, if their lysine counts are identical and their intact masses differ by the mass of known modifications or amino acid changes (in this initial study, we have not yet attempted to include larger changes such as splice variation or protein truncation). Identification of any given member of the family then identifies all members of the family. This initial identification is obtained by matching the accurate mass and lysine count of the experimentally observed proteoform to values calculated from a protein reference database and looking for exact matches within a small mass tolerance. This strategy of using the identification of one proteoform to leverage the identification of many related proteoforms distinguishes this approach from both top-down and bottom-up proteomics, which are based solely upon the identification of individual proteins. In addition, because all members of a family are identified and visualized together, the relative abundances of the related forms are easily compared (see below). 
The representation and visualization of proteoform families described here nicely parallels related work in the field[19] that connects individual proteoforms and proteoform interaction networks to PTM and disease metadata. This provides an incredible bridge between the experimental process of proteoform identification and the relationship between proteoform observations and the presence of particular disease states.

Experimental Procedures

The experimental workflow for identifying proteoforms is straightforward (see the Materials and Methods section in the Supplemental Text). Briefly, accurate proteoform masses are determined by standard LC–MS analysis of undigested protein mixtures in an Orbitrap mass spectrometer, and the lysine count is determined using the NeuCode stable isotope labeling by amino acids in cell culture (SILAC) isotopic tagging method.[20] We note that the use of intact mass determination and amino acid count for protein identification has been previously reported.[21−23] We cultured yeast with media containing either of two isotopically heavy forms of lysine: ¹³C₆¹⁵N₂-lysine (+8.0142 Da) or ²H₈-lysine (+8.0502 Da). These two isotopologues of lysine differ in mass by 36 mDa. Pairs of identical proteoforms produced upon mixing and lysing cells from both cultures have a monoisotopic mass difference equal to 36 mDa times the number of lysines in the proteoform. For these experiments, cells grown in media enriched with “NeuCode Light” and “NeuCode Heavy” lysine are combined in a 2:1 ratio. Experiments here were limited to analysis of proteins below 30 000 Da because of the mass range limitation of the mass spectrometer employed (Thermo LTQ Orbitrap Velos). Cells are lysed, and a soluble protein cleared lysate is prepared, followed by gel electrophoretic separation into 12 molecular weight fractions, which are analyzed by LC–MS.[24] The resultant mass spectra (28 847 in the present study) are processed in a multistep data-analysis pipeline to provide proteoform identifications (an example is shown in Figure 1). The first step in the pipeline is charge-state deconvolution and deisotoping to yield proteoform monoisotopic intact mass values. Protein mass spectra produced by electrospray ionization are highly complex. Each individual protein is observed in multiple different charge states. 
In addition, the natural abundance C, H, N, O, and S atoms in each proteoform yield multiple different isotopologues. Therefore, mass spectra must be deconvoluted to eliminate the charge-state differences and deisotoped to eliminate the isotopologue effect to obtain a single monoisotopic mass for each proteoform. Next, we paired together mass values that were NeuCode-Light and NeuCode-Heavy isotopologues of one another. The stringent pairing criteria include: a small mass difference of <6 Da, an intensity ratio between 1.4:1 and 6:1 (based on the expected mixing ratio of 2:1), and also observation in the same spectrum and the same charge states (see the Materials and Methods section in the Supplemental Text). This pairing serves two purposes. First, it greatly increases the confidence that the mass values correspond to actual proteoforms from the sample. Second, the number of lysines present in each protein is determined from the mass difference between the doublet peaks for the two proteoform isotopologues using the 36 mDa per lysine conversion factor. Overall, this yielded a set of 70 564 intact masses with associated lysine counts (Supplemental Table S-1), of which 8637 were nonredundant and thus likely to correspond to unique proteoforms (Supplemental Table S-2).
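The lysine-count arithmetic and two of the pairing criteria described above can be illustrated with a minimal Python sketch. The function names, argument order, and the reading of the intensity ratio as lighter-to-heavier are assumptions made for illustration; the same-spectrum and same-charge-state criteria are not modeled, and this is not the authors' pipeline.

```python
# Illustrative sketch only; names and conventions are hypothetical.
LYSINE_SHIFT_DA = 0.036  # 36 mDa between the two heavy-lysine isotopologues


def lysine_count(light_mass, heavy_mass):
    """Estimate the number of lysines from the spacing of a NeuCode doublet."""
    return round((heavy_mass - light_mass) / LYSINE_SHIFT_DA)


def is_neucode_pair(light_mass, heavy_mass, light_intensity, heavy_intensity,
                    max_delta=6.0, min_ratio=1.4, max_ratio=6.0):
    """Apply the mass-difference (<6 Da) and intensity-ratio (1.4:1 to 6:1) criteria.

    Assumes the ratio is lighter-to-heavier intensity, following the 2:1 mixing ratio.
    """
    delta = heavy_mass - light_mass
    if not (0 < delta < max_delta):
        return False
    ratio = light_intensity / heavy_intensity
    return min_ratio <= ratio <= max_ratio


# Hypothetical doublet: masses 0.36 Da apart, lighter peak twice as intense.
if is_neucode_pair(10000.00, 10000.36, 2.0, 1.0):
    print(lysine_count(10000.00, 10000.36))
```

In this made-up doublet, a spacing of 0.36 Da corresponds to ten lysines at 36 mDa per lysine.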
Figure 1

Example intact protein chromatogram and spectrum. (A) An LC–MS chromatogram for one gel electrophoresis fraction of NeuCode SILAC yeast. (B) A full-scan mass spectrum obtained at a resolution of 100 000. (C) An expanded view of the mass spectrum showing one charge-state envelope containing multiple isotope peaks for each of the two isotopologues. (D) A further expanded view displaying the “Light” (left peak) and “Heavy” (right peak) isotopologues; the spacing between these two peaks is used to determine the number of lysines in this proteoform.

Results and Discussion

We sought to identify the known yeast proteins to which these 8637 proteoforms correspond. This is not possible to achieve by direct comparison of the UniProt database entries with the experimental data because of the wide variety of possible post-translational modifications, which change the intact proteoform masses. We devised a three-stage strategy to address this problem (Figure 2). In stage 1, experimentally observed proteoforms are identified by pairing them with their theoretical counterparts (experimental–theoretical (ET) pairs); in stage 2, pairs of proteoforms that differ from one another by the mass of well-known protein modifications are identified by pairing them with one another (experimental–experimental (EE) pairs); and in stage 3, all ET and EE pairs sharing a common proteoform are joined together to form proteoform families.
Figure 2

Three-stage strategy for elucidating proteoform families. In the first stage, experimental intact masses, En, are compared to theoretical masses, Tn, (having the same lysine count) to create ET pairs for certain mass differences (e.g., 42 Da). In the second stage, EE pairs are similarly generated. In the third stage, the pairs are clustered together to produce proteoform families, two examples of which are shown here.

In stage 1 of the strategy, ET pairs are identified by comparing experimental masses with theoretical masses from the UniProt entries having the same lysine count. For each of the 8637 observed proteoforms, we determined the UniProt entries (including single annotated PTMs when present) falling within 500 Da and calculated the differences between the experimental and theoretical proteoform masses. Figure 3A shows a histogram of the results out to 200 Da. The most intense peaks in the histogram correspond to the mass differences associated with frequent protein modifications. Note that several of the major peaks have satellites within one or two Da, which are likely due to well-known challenges in the deisotoping of mass spectra of intact proteins.[25] We selected 13 of these mass differences that met an average false discovery rate (FDR) of 21%, ranging from 8 to 35% (see below for a discussion of FDR). This threshold was selected because it captured the major peaks in the histogram corresponding to known prevalent modifications. There were 550 ET pairs identified by this process.
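The stage 1 comparison can be sketched roughly as follows: group theoretical entries by lysine count, then record the mass difference for every experimental mass that falls within 500 Da of a theoretical mass with the same lysine count. The data layout, function name, accession labels, and masses are all hypothetical, and peak selection from the resulting histogram is not shown.

```python
from collections import defaultdict


def et_mass_differences(experimental, theoretical, max_delta=500.0):
    """Return (experimental_mass, accession, delta) for same-lysine-count matches.

    experimental: iterable of (mass, lysine_count)              -- assumed layout
    theoretical:  iterable of (accession, mass, lysine_count)   -- assumed layout
    """
    by_lysine_count = defaultdict(list)
    for accession, mass, k in theoretical:
        by_lysine_count[k].append((accession, mass))

    differences = []
    for e_mass, k in experimental:
        for accession, t_mass in by_lysine_count.get(k, []):
            delta = e_mass - t_mass
            if abs(delta) <= max_delta:
                differences.append((e_mass, accession, delta))
    return differences


# Made-up values for illustration only.
theoretical = [("ACC1", 10000.00, 5), ("ACC2", 10000.00, 6), ("ACC3", 9800.00, 3)]
experimental = [(10042.01, 5), (9000.00, 3)]
for e_mass, accession, delta in et_mass_differences(experimental, theoretical):
    print(e_mass, accession, round(delta, 2))
```

Only the first experimental mass yields a candidate here: it shares a lysine count of 5 with ACC1 and lies about 42 Da above it, whereas the second lies 800 Da from its only same-count entry.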
Figure 3

Histograms of observed mass differences. (A) Mass differences between experimental masses and theoretical ones calculated from UniProt entries, which have the same number of lysines. (B) Mass differences between pairs of experimental observations, again stipulating the same lysine count. The most frequently observed mass differences correspond to common PTM or amino acid masses. A total of 31 of the 88 mass differences (highlighted in pink) were directly attributable to known modifications (e.g., oxidation, methylation, and acetylation) or amino acid losses at one of the proteoform termini. Another 34 peaks were adjacent to these 31 primary mass shifts, up to 2 Da away, and were attributed to misassignment of the proteoform monoisotopic mass. Several other mass shifts (e.g., 46 and 72 Da) were included in the construction of proteoform families because they exceeded the threshold but they remain unidentified. These two mass shifts in particular are absent in the compendium of modifications at unimod.org. They could arise from a combination of modifications. An FDR threshold is shown (green line). See also the Supplemental Tables S-9 and S-11 for the complete list of ET and EE mass differences to 500 Da.

In stage 2 of the strategy, EE pairs are identified by comparing all experimental masses of the same lysine count with one another. For each of the 8637 observed proteoforms, we identified the sets of observed proteoforms having the same lysine count and then calculated all pairwise mass differences within each set. A histogram of the aggregated results for all mass differences below 200 Da is shown in Figure 3B. Peaks highlighted in the histogram include PTMs, amino acid losses, and other protein modifications commonly observed in protein mass spectrometry. We selected the 88 mass differences that met an average false discovery rate (FDR) of 22% (ranging from 5 to 36%; see below for a discussion of FDR). 
The larger number of significant mass differences observed for EE pairs (88) than for ET pairs (13) is due in part to the multiplicative effect of the monoisotopic errors. For example, we may see a proteoform with a monoisotopic mass of 10 000 Da and a missed monoisotopic mass for that same proteoform at 10 001 Da. The oxidized version of these two forms would have monoisotopic masses of 10 016 and 10 017 Da, respectively. The EE mass differences for all four species would be 1, 15, 16, and 17 Da, with relative intensities of 1:1:2:1. Thus, two actual proteoforms produce four separate peaks in the EE histogram. Stage 2 yielded 11 213 EE pairs.

In the third stage of analysis, proteoform families are formed by joining together all ET and EE pairs sharing a common proteoform. Each pair consists of two nodes (masses of the two proteoforms) and one edge (the mass difference between the two proteoforms). All pairs having a common node are joined together to form discrete proteoform families. This process yielded 1178 proteoform families ranging in size from 2 to 150 members, as displayed in Figure 4A. The proteoform families are represented as collections of nodes and edges, where each node corresponds to a particular proteoform with an associated intact mass and lysine count, and the edges correspond to the mass differences between related proteoforms. The red nodes represent the mass and lysine count of an unmodified (base) protein from a protein reference database (UniProt), the green nodes represent the mass and lysine count of a UniProt-curated post-translational modification of the base reference protein entry (base + PTM), and the blue nodes represent experimental mass and lysine count observations from the yeast lysate sample. The area of each blue node is proportional in size to the number of times that proteoform was observed experimentally, providing a crude measure of abundance. 
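Joining all pairs that share a common node amounts to finding connected components in a graph. A minimal sketch using a union-find structure is given below; the choice of union-find, the function name, and the node labels are assumptions for illustration, as the text does not specify how the clustering was implemented.

```python
def build_families(pairs):
    """Group nodes into families: connected components of the pair graph.

    pairs: iterable of (node_a, node_b), where each node is any hashable label
    (for example, an experimental mass ID or a theoretical accession).
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    families = {}
    for node in list(parent):
        families.setdefault(find(node), set()).add(node)
    return list(families.values())


# Hypothetical ET and EE pairs: T = theoretical node, E = experimental node.
pairs = [("T1", "E1"), ("E1", "E2"), ("E3", "E4")]
for family in build_families(pairs):
    print(sorted(family))
```

With these made-up pairs, T1, E1, and E2 form one family anchored by a theoretical entry, while E3 and E4 form a second family with no theoretical member.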
In the simple proteoform family shown in Figure 4B for Negative cofactor 2 complex subunit β, for example, there are four nodes and three edges. The red and green nodes represent the UniProt entries for the base and phosphorylated protein, respectively, and the two blue nodes correspond to the experimentally observed mass and lysine count pairs for both proteoforms. There are two zero Da mass difference edges shown, connecting the UniProt entries with the experimental observations for those proteoforms, and one 80 Da mass difference edge connecting the two experimentally observed proteoforms, corresponding to the mass added upon phosphorylation. Figure 4C–E shows three other proteoform families of increasing complexity, showing multiple methylations of 60S ribosomal protein L12-A, multiple acetylations of Histone H2B.1, and a pattern of amino acid losses from the N-terminal degradation of 60S ribosomal protein L40. The ability shown here to identify and visualize the members of proteoform families provides a powerful and unprecedented new view of proteome complexity at the intact proteoform level, information that is critical to understanding biological systems and pathways.
Figure 4

Proteoform families. (A) Display of 1178 proteoform families discovered in this work. (B–E) Expanded views of four example proteoform families. Theoretical unmodified proteins (red nodes) are labeled with their UniProt accession number. Theoretical modified proteins (green nodes) are labeled with their accession number and a PTM known to occur on that protein. Experimentally observed proteoforms (blue nodes) are labeled with their intact mass and the number of times it was detected. The area of each blue node is proportional in size to the number of times that proteoform was observed experimentally; however, to facilitate visualization, all nodes corresponding to 1–10 observations were given the same (minimum) size. Proteoforms are connected by select mass differences (edges) indicated by black lines with orange mass-difference values.

Figure 5 summarizes the proteoforms and proteoform families identified. Of the total 8637 proteoforms observed, 2378 were not associated with any other proteoform or a UniProt accession number and hence are not members of families (orphans). The remaining proteoforms formed 1178 proteoform families: 1460 proteoforms in 199 families that correspond to a known protein (i.e., are associated with a single UniProt accession number); 802 proteoforms in 27 families that leave some ambiguity in identification in that they were associated with two or more accession numbers; and 3997 proteoforms in 952 families that remain unidentified. Of the 70 564 total experimental proteoform observations, 92% belong to one of the 1178 proteoform families. Of the 8637 proteoforms observed, 1216 (14%) had masses below 5000 Da, as did 253 (11%) of the 2262 that were also identified; these might be considered peptides rather than proteins (see Supplemental Table S-2 for a list of all observed and identified proteoform masses, along with histograms showing their distribution as a function of mass). 
The size distribution of the families is plotted in Supplemental Figure S-1 and shows a roughly exponential decrease in frequency with increasing size. This plot reveals for the first time the number of different proteoforms for a given base protein, providing a new way of assessing the complexity of the entire yeast proteome.
Figure 5

Distribution of observed proteoforms in various types of proteoform families. Most of the observed proteoforms clustered with theoretical or other experimental proteoforms to make families, although some did not (i.e., “orphans”). The proteoform families are categorized as identified, ambiguous, or not yet identified based on containing one, two or more, or zero theoretical accession numbers, respectively. The term “Observation” here refers to each detection of a proteoform intact mass and lysine count in any of the 29 847 mass spectra collected in this study.

To assess the statistical confidence associated with the identifications, we estimated the false discovery rate (FDR) for the ET and EE pairs. FDR is an estimate of the fraction of false positive identifications in a group of identifications. The strategies for assessing FDR for each pair type are described briefly below and provided in greater detail in the Supporting Information. FDRs for the ET pairs were determined using a target-decoy strategy, analogous to the widely employed estimation of FDR in bottom-up proteomics.[26] In bottom-up proteomics, the most common method of creating a decoy database for all proteins in an organism of interest is to reverse all of the amino acid sequences. However, this method of creating a decoy database is not useful here because all of the decoy entries would have the same masses and lysine counts as the true target database. We accordingly developed an alternative strategy for the construction of the decoy database. We first concatenated all yeast protein sequences in random order into a single continuous string and then divided the string into substrings with lengths equal to each of the known yeast proteins. This yields a decoy database, in which the number and length of decoy protein sequences matches exactly to the known set of yeast proteins, but the masses and lysine counts differ. 
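The decoy construction just described (concatenate in random order, then cut back to the original lengths) can be sketched in a few lines of Python. The function name, the `seed` parameter, and the toy sequences are hypothetical; the authors' own implementation may differ in detail.

```python
import random


def make_decoy_database(sequences, seed=None):
    """Build decoys that preserve the number and lengths of the target sequences.

    The target sequences are concatenated in random order into one string,
    which is then cut into consecutive pieces whose lengths match the
    original sequences in their original order.
    """
    rng = random.Random(seed)
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    concatenated = "".join(shuffled)

    decoys, start = [], 0
    for seq in sequences:
        decoys.append(concatenated[start:start + len(seq)])
        start += len(seq)
    return decoys


# Toy sequences for illustration; a real run would use the full proteome.
targets = ["MKTAYK", "GK", "AAAAKKKW"]
decoys = make_decoy_database(targets, seed=1)
print([len(d) for d in decoys])        # lengths match the targets
print([d.count("K") for d in decoys])  # per-entry lysine counts generally differ
```

Because residues are only redistributed across entry boundaries, the overall amino acid composition is preserved while individual decoy masses and lysine counts generally differ from their targets. With a very small input such as this toy set, a shuffle can occasionally reproduce the original order, so distinct decoys are only expected for realistically sized databases.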
The database was further expanded to include proteoforms with single post-translational modifications, one for each modification annotated in the UniProt yeast protein database. We created 10 such decoy databases and employed each of them for the stage 1 ET identification to determine the number of experimental–decoy (ED) pairs, which represent false ET connections. The FDR at each mass difference is the ratio of the median number of ED pairs to the number of ET pairs and ranged from 8 to 35%, with an average of 21%. The primary factor driving this high FDR for ET pairs (and for EE pairs below) is mass accuracy, which is limited by the instability and drift in the measurement of intact mass that occurs over the course of the experiment, which requires several days of instrument operation. The false discovery rate is expected to drop with improvements to instrument stability, such as the utilization of a lock-mass standard in the chromatographic buffer for continuous mass calibration.[27]

The target-decoy strategy just described for the estimation of FDR for ET pairs is not applicable to estimation of FDR for EE pairs because no theoretical database is utilized for identifying EE pairs. A different method was needed to estimate the number of false EE pairs at each of the 88 mass differences. We hypothesized that because all true EE pairs are between proteoforms having the same lysine count, we could use mass differences between experimental values having unequal lysine count as a proxy for false-positive connections. To implement this approach, we calculated the mass differences between all experimentally observed proteoforms differing in lysine count by two or more lysines. Because this set of mass differences is vastly larger than the set created when considering only experimental values with the same lysine count, we selected a random subset of size equal to the number of mass differences produced in the EE comparison of Figure 3B. 
We counted the number of mass differences in this subset in a small window (± 0.04 Da) around each of the selected EE peaks. This count provides an estimate of the number of false EE connections (experimental–false lysine count (EF) pairs) in each peak. The FDR at each mass difference is the ratio of the number of EF pairs to the number of EE pairs and ranged from 5 to 36%, with an average of 22%. We note that the modest FDR values reported here (21% for ET and 22% for EE) do not compare favorably with either bottom-up proteomics, which commonly reports 1% FDR values for protein identification, or top-down proteomics, which commonly reports 1–5% FDR values for protein identifications. These FDR values for the intact mass and lysine count approach are highly dependent on instrumental factors that can be improved, and therefore, they should not detract from the importance of this new approach to proteoform and proteoform family identification. We compared the identifications obtained from the intact mass and lysine count strategy with those obtained by top-down proteomics. Briefly, we aggregated yeast top-down search results (Supplemental Table S-3) obtained in our own laboratory (Supplemental Tables S-4 and S-5) with those reported by the Kelleher laboratory in the most comprehensive study published to date.[17] A detailed explanation of this comparison is provided in the Supplemental Text and further supported by additional data found in Supplemental Tables S-6 through S-17. We found 75% agreement between the proteoforms identified by top-down proteomics and the ones identified by the intact mass and lysine count strategy. It is of interest to note several current limitations of the intact-mass approach to identification of proteoforms and proteoform families, which offer interesting paths forward for the further development of the strategy. The method does not localize PTMs. 
Localization could possibly be achieved using either bottom-up or top-down mass spectrometry, but neither method guarantees sequence coverage over the region containing the PTM. The strategy will be necessarily more difficult to implement on samples from more complex organisms such as plant and mammalian species because they have larger proteomes and include genetic variation among individuals. Thus, it will be necessary to characterize the sequence variation of the individual under study using large-scale genomic or transcriptomic sequence data to inform and improve the proteomic analysis. Efforts to accomplish this are currently an active area of research, referred to as “proteogenomics”.[28] We have used this approach successfully to improve bottom-up proteomic analyses in a variety of mammalian cell lines[29,30] and anticipate that it will be similarly useful for proteoform family analysis. The NeuCode SILAC isotopic tagging strategy employed in this study to provide lysine counts for each proteoform was extremely useful but also limits the approach, as it is not applicable to tissue samples. However, it may be that as comprehensive proteoform databases are established in higher organisms, the lysine-count parameter will become less critical to the identifications and can be replaced by other readily measured or calculated parameters such as chromatographic retention time.[31] Only the rudimentary quantification of proteoforms was accomplished in the current work based on the number of times each mass and lysine count was observed. The accuracy and precision will be greatly improved by using intensity-based measurements or isotopic tagging strategies such as NeuCode SILAC for relative quantification.[20,32] Finally, we believe that there is room for much improvement over the first-generation bioinformatic and biostatistical approaches presented here. 
For instance, we are devising descriptive statistical approaches that will provide confidence intervals for the likelihood that each individual node (proteoform) is included in the correct family. See the Supporting Information for more in-depth discussion of these current limitations.

We encountered a few interesting phenomena having potential, yet currently unknown, biological significance. First, the process used for determining ET and EE pairs, which involves making a histogram showing the frequency of mass difference values, revealed several frequent but previously unknown differences. These peaks, like those revealed in similar work by ourselves and others,[12] suggest the possibility of unknown protein modifications. Two examples are the peaks at 46 and 72 Da in the EE plot (Figure B). We have also observed these mass differences in mass-tolerant bottom-up proteomics analyses of yeast, and these two particular cases have been reported elsewhere.[12] We are currently working to interpret them. Second, we observe a considerable number of proteoforms that are missing one or more amino acids from the N-terminus, the C-terminus, or both. Proteoforms displaying this behavior were also identified by us and by Kelleher’s group[17] using top-down proteomics.

This new strategy of identifying proteoforms from intact mass and lysine count, and clustering them into proteoform families, serves to complement rather than replace top-down and bottom-up proteomic approaches. We found 1460 proteoforms associated with 199 single accession numbers and an additional 802 proteoforms associated with two or more accession numbers. These numbers compare reasonably well with the most extensive top-down study in yeast to date, which reported 1103 proteoforms associated with 530 accession numbers at 5% FDR, from the same type of sample and gel fractionation.[17] We also compared our results with bottom-up analyses of the same samples, which yielded 2651 protein identifications. 
We found that the frequency of intact proteoform identifications correlated strongly with the bottom-up protein abundance as determined by spectral counting (Supplemental Figure S-2), indicating that the more abundant proteins are more readily detected in both strategies. Although it is clear that bottom-up analyses are able to identify far more proteins than either intact mass or top-down analyses, they are not able to reveal proteoforms. The intact mass and lysine count strategy could potentially identify more proteoforms than top-down proteomics within a given amount of instrument time due to the intrinsically simpler nature of the data. The intact mass approach is capable of identifying several proteoforms from each high-resolution full spectrum scan, and there are no fragmentation spectra to acquire. By contrast, in top-down mass spectrometry, each identification comes from a high-resolution fragmentation spectrum obtained for a single selected and isolated precursor (intact proteoform). On the other hand, top-down analysis can yield invaluable data that cannot be obtained from intact mass measurements, namely the positional localization of modifications or sequence variations. Furthermore, the proteoform family concept introduced here is not exclusive to intact mass analyses but could easily be applied to top-down proteomics data to identify additional proteoforms. It is thus apparent that the three proteomic approaches are complementary to one another rather than competitive because each is characterized by differing strengths and weaknesses.

Another interesting way of comparing top-down and intact mass approaches is to consider “discovery” versus “scoring” strategies for proteomics. 
During the human genome project, the initial phase of single nucleotide polymorphism (SNP) analysis was a discovery phase: as the DNA sequence was generated from different individuals, sequence differences were discovered and catalogued, leading over time to vast databases containing millions of genetic variations. Once these variations were known, the need for additional discovery was diminished, and instead, platforms were developed to query samples for already known SNPs,[33] an approach referred to as “scoring”. We envision a similar transition developing for proteoform analysis, with a “discovery” phase during which proteoforms are identified and catalogued, populating databases that then enable simpler, less expensive, and higher-throughput proteoform “scoring” approaches to be utilized for most biological studies. An early effort at establishing such proteoform databases has recently been initiated by the Consortium for Top-Down Proteomics.[34,35] We posit that the intact-mass approach will function particularly well for scoring proteoforms, and that the proteoform family concept will greatly benefit both proteoform discovery and scoring.
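
As an illustration of the per-peak FDR estimate described earlier in this section (the ratio of EF pairs to EE pairs within ±0.04 Da of a selected EE peak), the following is a minimal sketch in Python. It assumes that the EE and EF mass differences are already available as plain lists of values in Da. The function names, the example mass differences, and the peak position are hypothetical and are not taken from the software used in this work.

```python
from bisect import bisect_left, bisect_right


def count_in_window(sorted_deltas, center, half_width=0.04):
    """Count mass differences within center +/- half_width (Da).

    `sorted_deltas` must be sorted in ascending order.
    """
    lo = bisect_left(sorted_deltas, center - half_width)
    hi = bisect_right(sorted_deltas, center + half_width)
    return hi - lo


def peak_fdr(ee_deltas, ef_deltas, peak_center, half_width=0.04):
    """Estimate the FDR at one EE peak as (#EF pairs) / (#EE pairs).

    Returns None when no EE pairs fall inside the window, since the
    ratio is undefined in that case.
    """
    n_ee = count_in_window(sorted(ee_deltas), peak_center, half_width)
    n_ef = count_in_window(sorted(ef_deltas), peak_center, half_width)
    if n_ee == 0:
        return None
    return n_ef / n_ee


if __name__ == "__main__":
    # Hypothetical mass differences (Da); illustrative values only.
    ee = [41.99, 42.00, 42.01, 42.02, 42.03, 42.20, 79.97]
    ef = [41.50, 42.015, 42.30]
    print(peak_fdr(ee, ef, peak_center=42.01))
```

Applying `peak_fdr` to each selected peak and averaging the results would give a summary figure analogous to the average reported above, under the same assumptions about how the EE and EF pair lists were built.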