Application of a Bioinformatics-Based Approach to Identify Novel Putative in vivo BACE1 Substrates.

Joseph L Johnson1, Emily Chambers1, Keerthi Jayasundera1.   

Abstract

BACE1, a membrane-bound aspartyl protease that is implicated in Alzheimer's disease, is the first protease to cut the amyloid precursor protein resulting in the generation of amyloid-β and its aggregation to form senile plaques, a hallmark feature of the disease. Few other native BACE1 substrates have been identified despite its relatively loose substrate specificity. We report a bioinformatics approach identifying several putative BACE1 substrates. Using our algorithm, we successfully predicted the cleavage sites for 70% of known BACE1 substrates and further validated our algorithm output against substrates identified in a recent BACE1 proteomics study that also showed a 70% success rate. Having validated our approach with known substrates, we report putative cleavage recognition sequences within 962 proteins, which can be explored using in vivo methods. Approximately 900 of these proteins have not been identified or implicated as BACE1 substrates. Gene ontology cluster analysis of the putative substrates identified enrichment in proteins involved in immune system processes and in cell surface protein-protein interactions.

Keywords:  Alzheimer’s disease; BACE1; bioinformatics; protease; protease substrates

Year:  2013        PMID: 25288897      PMCID: PMC4147752          DOI: 10.4137/BECB.S8383

Source DB:  PubMed          Journal:  Biomed Eng Comput Biol        ISSN: 1179-5972


Introduction

BACE1 (memapsin 2, β-secretase, Asp 2 protease) is a Type I membrane-bound aspartyl protease. It is highly expressed in the brain and pancreas, and the bulk of the enzyme, including the catalytic domain, is extracytoplasmic (extracellular or luminal), with a short C-terminal tail containing a cell trafficking domain that directs it to the trans-Golgi network and endosomes.1 Just over ten years ago it was identified by several groups as the protease responsible for the initial cleavage of the amyloid precursor protein (APP, also a Type I membrane protein) in the brain.2–6 Subsequent cleavage of APP within its transmembrane domain by γ-secretase, a novel aspartyl protease protein complex with multiple membrane-spanning α-helices, yields short peptide fragments primarily consisting of 40 or 42 amino acids termed amyloid-β (Aβ). Aggregation of the Aβ peptides forms plaques in the brain, which are one of the hallmark pathological features of Alzheimer’s disease (AD). The precise mechanisms by which these Aβ peptides exert their pathogenic effects in the brain are unknown, but soluble oligomers of Aβ have been shown to be involved in the synaptic dysfunction associated with AD.7 Due to its association with the production of Aβ and with AD, BACE1 has gained significant attention as an attractive AD therapeutic target for at least two reasons. Firstly, since it is the first protease to cleave APP on the pathway leading to Aβ formation, inhibiting it precludes γ-secretase cleavage, leaving APP to be processed via the non-pathogenic α-secretase pathway. 
Secondly, BACE1 knockout mice showed a mild, albeit complex, phenotype and no detectable Aβ in the brain, whereas knocking out γ-secretase was embryonic lethal.8–13 As is the case with many other aspartyl proteases, BACE1 has a relatively open active site and fairly loose specificity. Turner et al initially reported the subsite specificity for BACE1 by measuring the second order rate constant for peptide hydrolysis within pools of octapeptide libraries, in which seven residues were held constant while substituting one of the 19 standard amino acids (cysteine omitted) for the remaining residue.14 This was initially done for each of the P4 to P1 and P1’ to P4’ residues. Subsequent studies expanded the peptide substrates tested to include changes in residues P8 to P5.15,16 These studies of BACE1 subsite specificity provide a cleavage sequence profile that can be adapted for bioinformatic studies. Though the precise physiological function of BACE1 remains elusive, some have suggested that it acts as a sheddase.17 Despite its relatively loose specificity, only a handful of in vivo BACE1 substrates have been identified, primarily through top-down approaches. As mentioned above, APP is a known physiological BACE1 substrate. Another extensively characterized BACE1 substrate is the growth factor Neuregulin-1 (NRG1), a Type I membrane protein expressed on the surface of axons that interacts with the ErbB family of receptor tyrosine kinases. 
NRG1 is involved in the stimulation of Schwann cell proliferation and ultimately myelination.18,19 This connection between BACE1 and NRG1 is borne out in the observation of hypomyelination in BACE1−/− knockout mice.20 Another set of proteins identified as BACE1 substrates are the beta-subunits of voltage-gated sodium channels (VGSCβ).21,22 Wong et al demonstrated that BACE1 knockout cell lines showed a 50% reduction in the proteolytic processing responsible for the generation of the C-terminal fragment (CTF) of β1, β2, β3, and β4 VGSC subunits, but the residual 40%–50% activity suggests that other proteases are also involved in CTF formation.22 Although the VGSCβ4 subunit has been predicted to be a better BACE1 substrate than β2, VGSCβ2 appears to be the only subunit that acts as a substrate in the brain cortex.21 Other documented BACE1 substrates include beta-galactoside alpha-2,6-sialyltransferase 1 (ST6Gal I),23,24 P-selectin glycoprotein ligand 1 (PSGL-1),25 the APP-like proteins 1 and 2 (APLP1, APLP2),26,27 low-density lipoprotein related receptor (LRP1),28 interleukin-1 receptor type 2 (IL-1R-2),29 the anti-aging protein Klotho,30 and most recently membrane-bound prostaglandin E2 synthase-2 (mPGES-2).31 These substrates are all Type I membrane proteins with the exception of ST6Gal I, which is Type II. Since BACE1 remains an attractive target for AD therapeutics, knowing its in vivo substrates would be valuable for predicting and/or suggesting possible side effects to be aware of during clinical trials and beyond. Successful elucidation of the native substrates of any protease often requires a multifaceted approach. 
Proteomics studies can yield a less biased accounting of proteins cleaved upon overexpression of a given protease, but a potential drawback of this approach is that overexpression can alter the native properties of the protease, such as its subcellular location, so that observed “hits” are not necessarily reflective of the native physiological activity. Another potential problem is that it can be difficult to definitively prove whether an observed proteolytic event was directly or indirectly associated with the overexpressed protease. An alternate approach already mentioned is to investigate subsite specificity using synthetic peptide libraries to give a systematic view of a protease’s activity and specificity, but by necessity only a small subset of the possible peptide substrates can be synthesized and tested. For example, an aspartyl protease that binds eight amino acids in its active site would require the impossible feat of synthesizing 20⁸ (about 2.6 × 10¹⁰) peptides to completely define its subsite preferences. Another approach that has been successfully employed in testing whether individual proteins are substrates for a given protease involves co-expressing the protease and its potential substrate in cell culture. This approach is only feasible if there is some a priori result or hypothesis suggesting that a protein is a substrate of a given protease. Finally, animal models can confirm that a protease-substrate pair does indeed give rise to a particular phenotype, but as was seen with BACE1/NRG1, sometimes the phenotype is not noticed until after the substrate has been identified by other means, which provides suggestions on where to search.20 Though the identification of native protease substrates can seem unwieldy, the combined results of the experimental approaches discussed can lead to success and ultimately positively impact the design of therapeutic agents. 
An underutilized method in the case of BACE1 is the use of bioinformatics to leverage the wealth of information contained in proteome databases. As with other methods, the goal with bioinformatics-based methods is to distil the vast amount of data to a point that minimizes false positives and false negatives while not missing the true substrates. We report here an approach that uses published in vitro subsite specificity data to drive a bioinformatics-based search of the human proteome for BACE1 in vivo substrates. We validated our approach by comparing our results to data for known in vivo BACE1 substrates and subsequently tested the method against a recently reported whole cell proteomics study aimed at elucidating putative in vivo BACE1 substrates by monitoring for proteins cleaved upon BACE1 overexpression in HeLa and HEK cell lines.32

Methods

Database of protein sequences from complete human proteome

We obtained 20,300 human protein sequences from the Universal Protein Resource (UniProt, http://www.uniprot.org) complete proteome set (July 2010 release).33,34 The dataset contained manually annotated and reviewed protein sequences comprising only the full-length isoforms.

Transmembrane domain prediction

Human protein sequences in FASTA format were submitted to the web-based transmembrane domain prediction server TMHMM v. 2.0 that is available from the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/TMHMM).35 The short output format returned the number of TM domains, the predicted residue numbers of the TM domains identified, and the topology of the TM domains. Proteins were grouped according to the number of transmembrane domains, and the subset of proteins that had a single TM domain were evaluated for their potential as BACE1 substrates as outlined below. GPI anchored proteins are potential membrane-bound substrates for BACE1 as well. During GPI-anchored protein maturation, the C-terminal domain is removed and replaced by a GPI anchor. These proteins were included as single TM domain proteins with the site of the GPI anchor being numbered as though it were the first amino acid in the TM domain.
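The grouping step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' workflow (which used the web server and Microsoft Excel): it assumes the one-line-per-protein layout of the TMHMM short output (`len=`, `ExpAA=`, `First60=`, `PredHel=`, `Topology=` fields), and the protein IDs, the record type, and the function names are hypothetical. Treating a topology string that starts with `o` as "N-terminus extracytoplasmic" is a simplification for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class TmhmmRecord(NamedTuple):
    protein_id: str
    length: int
    n_helices: int                   # value of the PredHel field
    helices: List[Tuple[int, int]]   # (first, last) residue of each predicted TM helix
    n_term_outside: bool             # True when the topology string starts with "o"


def parse_tmhmm_short(line: str) -> TmhmmRecord:
    """Parse one whitespace-separated line of TMHMM short-format output."""
    fields = line.split()
    protein_id = fields[0]
    values = dict(field.split("=", 1) for field in fields[1:])
    topology = values["Topology"]
    helices = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    return TmhmmRecord(
        protein_id=protein_id,
        length=int(values["len"]),
        n_helices=int(values["PredHel"]),
        helices=helices,
        n_term_outside=topology.startswith("o"),
    )


def single_tm(records: List[TmhmmRecord]) -> List[TmhmmRecord]:
    """Keep only proteins predicted to have exactly one TM helix."""
    return [record for record in records if record.n_helices == 1]


# Hypothetical example lines (IDs and numbers are made up for illustration).
example_lines = [
    "protA len=770 ExpAA=22.91 First60=0.05 PredHel=1 Topology=o700-722i",
    "protB len=300 ExpAA=45.20 First60=20.10 PredHel=2 Topology=i5-27o250-272i",
]
records = [parse_tmhmm_short(line) for line in example_lines]
candidates = single_tm(records)
```

GPI-anchored proteins would need to be added to `candidates` separately, since they have no predicted helix; the text treats the anchor site as though it were the first TM residue.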

Signal peptide sequence prediction

Most proteins containing transmembrane domains also have signal peptides that target them to the ER and the secretory pathway. These hydrophobic sequences, which are removed as part of the transport process, tend to be misidentified by certain algorithms as TM domains. To prevent these sequences from being identified as potential BACE1 cleavage sequences, we sought to identify and annotate them according to their function as distinct from other protein regions. The human protein sequences in FASTA format were submitted to the signal peptide sequence prediction server SignalP v. 3.0 (http://www.cbs.dtu.dk/services/SignalP).36 The server used both neural network and Hidden Markov models trained on eukaryotic signal peptide sequences. The data output from the TMHMM and SignalP prediction servers were imported into the Microsoft Excel matrix described below.

Scoring matrix and Microsoft Excel macro

The final data required for the bioinformatics analysis were experimental measurements for the cleavage of various peptide sequences by BACE1. As mentioned previously, Turner et al performed such a study shortly after BACE1 was identified, in which they synthesized octapeptide libraries based on the Swedish mutant human APP sequence (EVNLDAEF) that randomized a single position with all of the standard amino acids except for cysteine (because of its potential for disulfide bonding), while holding the amino acids in the other 7 positions constant.14 These eight libraries were incubated with BACE1 and the resulting peptide fragments were quantified by MALDI-TOF mass spectrometry. Based on these results, the second order rate constant for each peptide was calculated and reported as a “preference index” for each subsite. These reported preference indices for each amino acid at each subsite were converted to numerical values that were then weighted by the coefficient of variation (CV) for that subsite, calculated by dividing the standard deviation of the preference indices for a given subsite by the mean of those same values. The CV is a measure of the dispersion of a given set of data; therefore, subsites that show more selectivity by preferring fewer amino acids at that subsite will have a higher weighting factor. The weighting factors for the P4-P4’ sites were 0.84, 1.06, 1.14, 1.77, 1.15, 0.99, 0.61, and 0.58, respectively. These factors agree with the recent observation by Li et al that the P3-P2’ sites of BACE1 are most critical in determining substrate reactivity.15 Using these values, a score for each octapeptide was calculated by multiplying the weighted preference indices for all of the subsites together and was reported as the “score” for a given octapeptide. The preference indices with a value of zero were assigned a minimal value of 0.001. 
This reflected the lack of activity for a given amino acid at a particular subsite, while preventing potential “hits” from being missed after multiplication by zero, essentially allowing for the possibility of some error in the original mass spectrometric measurements of the second order rate constant. We wrote a macro in Microsoft Excel Visual Basic to import and analyze the protein sequences, to calculate the score for each sequence, and to sort them according to their location in each protein sequence. For the proteins with a single TM domain, text files containing the UniProt ID, protein sequence in FASTA format, TM domain residue numbering, SignalP signal peptide prediction data, and orientation of membrane protein were imported into an Excel spreadsheet. Proteins with an undefined orientation in the membrane were identified by manually comparing the UniProt annotations to the TMHMM prediction and were included in the database in both orientations. The macro returned the score for each octapeptide sequence, and sequences that had a score above the threshold value of 1.0 × 10⁻⁵ were retained in a matrix. This threshold value was selected relative to the score of 1.0 × 10⁻³ for the native APP sequence (EVKMDAEF) known to be cleaved by BACE1. We reasoned that an additional two orders of magnitude below this value was a reasonable range to reduce false negative results while minimizing the total number of sequences returned. Scores based on the sum of the weighted preference indices were measured but not used because, as expected, they did not correlate well due to their inability to distinguish sequences with acceptable preference indices at each subsite from those that had mixtures of very poor and very good preference indices. For protein sequences reaching the threshold, the results were sorted based on their position relative to the TM domain, according to which side of the membrane they were on, and whether they were Type I or Type II proteins. 
Octapeptide sequences less than eight residues away from the TM were rejected because one or more residues were part of the TM domain. BACE1 cleavage of proteins as far as 50 amino acids away from their TM domain has been reported and therefore the upper limit was set at 52, which allowed for some flexibility due to imprecise prediction of the exact beginning and end of TM domains. Predicted substrate sequences that fell within the TM domain itself or within a signal peptide sequence were removed and not considered further.
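The scoring and filtering logic described in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' Excel macro. The weights, the zero floor of 0.001, the score threshold, and the 8–52 window come from the text. Everything else is an assumption made for illustration: that "weighted" means multiplying each preference index by its subsite weight, that the CV uses the population standard deviation, that unmeasured residues such as cysteine are treated like zeros, and that the preference table is supplied as a list of eight dictionaries. The function names are hypothetical, and no real preference indices are reproduced here.

```python
import math
from statistics import mean, pstdev

# Weighting factors for subsites P4..P4' as reported in the text.
WEIGHTS = [0.84, 1.06, 1.14, 1.77, 1.15, 0.99, 0.61, 0.58]
ZERO_FLOOR = 0.001           # value substituted for preference indices of zero
THRESHOLD = 1.0e-5           # score cutoff used in the text
MIN_SITE, MAX_SITE = 8, 52   # allowed distance of the P4 residue from the TM domain


def coefficient_of_variation(values):
    """Standard deviation divided by mean (population SD is an assumption)."""
    return pstdev(values) / mean(values)


def score_octapeptide(peptide, prefs):
    """Product of weighted preference indices over the eight subsites.

    prefs is a list of eight dicts, one per subsite (P4..P4'),
    mapping an amino acid letter to its preference index.
    """
    if len(peptide) != 8:
        raise ValueError("expected an octapeptide")
    score = 1.0
    for weight, table, residue in zip(WEIGHTS, prefs, peptide):
        index = table.get(residue, 0.0)   # unmeasured residue treated as zero (assumption)
        if index <= 0.0:
            index = ZERO_FLOOR
        score *= weight * index
    return score


def candidate_sites_type1(sequence, tm_start, prefs):
    """Return (site, peptide, score) hits for a Type I protein.

    tm_start is the 1-based position of the first TM residue; site is the
    number of residues between the P4 residue and the TM domain, counted so
    that the residue immediately before the TM domain is site 1.
    """
    hits = []
    for p4 in range(1, len(sequence) - 6):      # 1-based position of the P4 residue
        site = tm_start - p4
        if not MIN_SITE <= site <= MAX_SITE:
            continue
        peptide = sequence[p4 - 1:p4 + 7]
        score = score_octapeptide(peptide, prefs)
        if score > THRESHOLD:
            hits.append((site, peptide, score))
    return hits
```

Multiplying rather than summing the weighted indices means a single very poor residue pulls the whole score down, which is the behaviour the text gives as the reason for rejecting sum-based scores. Removal of hits that overlap a predicted signal peptide is not shown.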

Gene ontology analysis

Hits returned by the algorithm were analyzed and grouped according to gene ontology (GO) terms. The UniProt IDs were submitted to the Gene Functional Classification algorithm that is part of the DAVID Bioinformatics Resources (http://david.abcc.ncifcrf.gov/home.jsp).37 Sequences for the 962 proteins were submitted, and 33 of these were not found in the database because they have unknown functions and therefore no GO terms associated with them. This list was analyzed by the Functional Annotation Tool, which generated the list of terms that appeared more often than would be predicted by chance in the human proteome and then grouped them into clusters of overlapping or synonymous terms. The scores are reported as P-values, but they are actually a relative measure.

Results

Generation of the single TM domain subset of the complete human proteome

Submission of the complete human proteome set of protein sequences to the TMHMM prediction server yielded 2364 proteins (∼11.5%) with 1 TM domain. Approximately 77% had 0 TM domains, while there were about 2% each of proteins containing 2, 6, or 7 TM domains. The remaining 6% was scattered among proteins with 3–5 or 8–23 TM domains. To determine how well the TMHMM prediction server performed relative to the annotations contained in the UniProt database, the UniProt entries for the 0 TM domain subset were searched for the term “transmembrane”, which returned 171 proteins (0.8% of all sequences) whose TM domains TMHMM had missed. These proteins were added to the 1 TM domain dataset using the annotations from UniProt. The 2 TM and 3 TM subsets were then analyzed for instances where TMHMM overpredicted the number of TM domains. There were 220 proteins (1.1%) in the 2 TM subset which, according to UniProt annotations, had only 1 TM domain. For the large majority of these proteins, one of the TM domains was predicted by SignalP and annotated by UniProt as being a signal peptide sequence. This was not surprising when considering that signal peptide sequences tend to be rather hydrophobic. Only 7 proteins predicted by TMHMM to have 3 TM domains had 1 TM according to UniProt. Overall, the TMHMM algorithm categorized approximately 13% of the 1 TM proteins differently than UniProt. Roughly half of these were identified as 2 TM proteins which were actually 1 TM proteins with signal sequences. The remaining 6% discrepancy likely represents minor differences in how the 1 TM proteins are identified, with each method having its own minor sources of error. Ninety-seven 1 TM proteins had an ambiguous orientation in the membrane according to UniProt. These protein sequences were analyzed as both Type I and Type II proteins. 
Amazingly, none of these proteins returned peptide sequences that exceeded the threshold limits when analyzed by the macro as Type II proteins. GPI anchored proteins were the last to be included in the single TM subset. Although these proteins do not have a transmembrane α-helix, they are associated with the membrane through a GPI anchor attached to the C-terminus of the protein. This GPI anchor is added with concomitant removal of a C-terminal protein domain. As mentioned in the Methods, the distance from the TM domain was counted from the residue attached to the GPI anchor.

Summary of results

There were over 11,000,000 amino acids in the 20,300 proteins from the complete human proteome and more than 10,860,000 octapeptide sequences to analyze for their predicted ability to serve as BACE1 substrates. The initial stage of screening, done to identify proteins with a single TM domain, reduced the number of proteins to analyze down to 3085 protein sequences, 97 of which were duplicated because of their ambiguous orientation in the membrane. A total of 39,864 octapeptide sequences (of the approximately 1,600,000 possible) had scores exceeding the threshold of 1.0 × 10⁻⁵. Of these, 10.8% were within the TM domain, 12.2% fell within the signal peptide sequence, 20.0% were cytoplasmic, and 56.9% were extracytoplasmic (extracellular or luminal). Of the 56.9% of sequences that were extracytoplasmic, 7.7% (4.4% of the total) met both threshold requirements, having a score > 1.0 × 10⁻⁵ and being within 8–52 residues of the TM domain. This equated to 1748 octapeptide sequences of the roughly 1,600,000 possible (~0.11%) contained within 962 different proteins, a significant reduction in the number of sequences to consider.
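The percentages quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts and fractions stated in the paragraph; the variable names are hypothetical.

```python
# Counts and fractions taken from the paragraph above.
above_threshold = 39_864        # octapeptides scoring above the threshold
possible_octapeptides = 1_600_000
final_hits = 1_748              # above threshold and 8-52 residues from the TM domain
extracytoplasmic_fraction = 0.569
within_window_fraction = 0.077  # share of extracytoplasmic hits inside the window

# Share of above-threshold octapeptides that pass both criteria, two ways.
share_from_counts = 100 * final_hits / above_threshold
share_from_fractions = 100 * extracytoplasmic_fraction * within_window_fraction

# Share of all possible octapeptides in single-TM proteins that remain.
share_of_possible = 100 * final_hits / possible_octapeptides

print(f"{share_from_counts:.1f}% of above-threshold octapeptides (from counts)")
print(f"{share_from_fractions:.1f}% of above-threshold octapeptides (from fractions)")
print(f"{share_of_possible:.2f}% of possible octapeptides")
```

Both routes give about 4.4% of the above-threshold sequences, and the final set is about 0.11% of the possible octapeptides, in agreement with the text.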

Hits among known BACE1 substrates

Once the data collection and sorting were completed, the results were surveyed to evaluate how well the algorithm had predicted the known BACE1 substrates as hits. As shown in Table 1, the macro correctly identified 9 Type I substrates out of the 13 known in vivo substrates. Each of these had a score over the threshold and at least one predicted cut site in the extracytoplasmic juxtamembrane domain. A cleavage recognition site of 13 for a Type I membrane protein, for example, means that the 13th amino acid from the transmembrane domain is the P4 residue and that the octapeptide sequence would span the range 13–6 with the protein cleavage occurring between residues 10 and 9. APP and APLP2 were each identified with three potential cut sites, while the closely related APLP1 was not identified as having any predicted cut sites. The BACE1 cleavage sequences for APP at sites 13 and 33 were LVFFAEDV and EVKMDAEF, respectively. These are recognition sequences that have been described previously,8 the second corresponding to the canonical site for the generation of Aβ. The sequence for the mutant Swedish APP protein was not included in the standard proteome database. Three of the four beta subunits of the voltage-gated sodium channels (β1, β3, and β4) were successfully identified; VGSCβ2, however, was not. NRG1, IL-1R-2, and PSGL-1 did have predicted recognition sequences while mPGES-2 and the Type II protein ST6Gal I did not. The octapeptide recognition sequences for all of the hits can be found in Table S1.
Table 1

Predicted BACE1 cut sites for known substrates.

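UniProt ID | Protein | Topology | Recognition site | Recognition sequence | Score
P05067 | APP | Type I | 13 | LVFFAEDV | 8.44E-03
P05067 | APP | Type I | 33 | EVKMDAEF | 1.02E-03
P05067 | APP | Type I | 41 | NIKTEEIS | 6.04E-05
Q06481 | APLP2 | Type I | 9 | REDFSLSS | 1.20E-03
Q06481 | APLP2 | Type I | 30 | MIFNAERV | 7.23E-05
Q06481 | APLP2 | Type I | 44 | DENMVIDE | 3.55E-03
P27930 | IL-1R-2 | Type I | 16 | TLSFQTLR | 1.02E-03
Q02297 | NRG1 | Type I | 11 | QEKAEELY | 6.14E-05
P56975 | NRG3 | Type I | 11 | FMESEEVY | 2.07E-05
P56975 | NRG3 | Type I | 13 | IEFMESEE | 2.52E-04
P56975 | NRG3 | Type I | 14 | GIEFMESE | 4.50E-02
Q8IWT1 | VGSCβ4 | Type I | 15 | TIFLQVVD | 3.58E-01
Q9NY72 | VGSCβ3 | Type I | 44 | EFEFEAHR | 1.09E-05
Q07699 | VGSCβ1 | Type I | 21 | EHNTSVVK | 1.03E-04
Q07699 | VGSCβ1 | Type I | 28 | LLFFENYE | 1.09E-05
Q07699 | VGSCβ1 | Type I | 29 | RLLFFENY | 1.73E-05
Q14242 | PSGL-1 | Type I | 21 | ASNLSVNY | 8.30E-05
O60939 | VGSCβ2 | Type I | None | - | -
Q9H7Z7 | mPGES-2 | Type I | None | - | -
P51693 | APLP1 | Type I | None | - | -
Q07954 | LRP1 | Type I | None | - | -
P15907 | ST6Gal I | Type II | None | - | -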
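
The site numbering used in Table 1 can be made concrete with a short sketch. It follows the convention stated in the text for a Type I protein (the site number is the distance of the P4 residue from the TM domain, and cleavage occurs between P1 and P1’); the function names are hypothetical.

```python
# Subsites in order from P4 to P4'; BACE1 cleaves between P1 and P1'.
SUBSITES = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def subsite_positions(site):
    """Map each subsite to its distance from the TM domain.

    `site` is the recognition site as listed in Table 1, i.e. the
    distance of the P4 residue from the TM domain of a Type I protein.
    """
    return {name: site - offset for offset, name in enumerate(SUBSITES)}


def scissile_bond(site):
    """Return the distances of the P1 and P1' residues from the TM domain."""
    positions = subsite_positions(site)
    return positions["P1"], positions["P1'"]


# Example from the text: a recognition site of 13 spans 13-6,
# with cleavage between residues 10 and 9.
positions = subsite_positions(13)
bond = scissile_bond(13)
```

The same arithmetic explains the lower limit of 8 in the Methods: a site of 8 places P4’ at position 1, immediately next to the TM domain, and any smaller site would put part of the octapeptide inside the membrane.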

Validation of the algorithm for BACE1 substrates identified by proteomics

Hemming et al recently reported a quantitative proteomics study utilizing two human epithelial cell lines overexpressing BACE1.32 This study reported 68 putative substrates, many of which had not been identified previously. This provided an excellent opportunity to evaluate the validity of the substrate prediction algorithm beyond the more well-characterized BACE1 substrates with a larger dataset. The macro successfully predicted 70% of the BACE1 protein substrates reported. One of these, Glypican-3, was a GPI anchored protein and the remainder were Type I membrane proteins. No Type II membrane proteins were positively identified, but this is not surprising given that only a very small percentage of BACE1 substrates have been identified to date using quantitative proteomics or other methods. For the remaining 30% of proteomics-based substrates that were not identified, two were GPI anchored proteins, one was a Type II membrane protein, and the rest were Type I membrane proteins. As was the case with the known BACE1 substrates, the predicted cleavage recognition sequences did not show a clear consensus in their scores or in their distance from the TM domain. Others have reported this observation as well and suggested that at least some of this variability could be attributed to the fact that both enzyme and substrate are membrane-bound and so the energetics and properties of recognition, binding, and cleavage would be different from those of non-membrane associated enzymes and substrates.15 It is not likely that all of the BACE1 substrates identified by quantitative proteomics will prove to be native substrates, a point that was made by the authors themselves.32 For example, although BACE1 is listed as a substrate for itself, further work showed that there was not a direct correlation and that proteolysis of BACE1 was catalyzed by a different protease.

Novel BACE1 substrates predicted by bioinformatics

As mentioned earlier, our study returned 1748 potential octapeptide recognition sequences in 962 different protein sequences (Table S1). Table 3 gives the results for those sequences which had a score greater than 0.01 and were not listed previously. The only sequence with a score greater than 1 came from the T cell immunoreceptor with Ig and ITIM domains protein. The next seven peptide sequences, with scores between 1 and 0.1, come from proteins involved in immune response, calcium-dependent exocytosis, disulfide formation, cytokine signaling, and trafficking. As an example from peptides scoring between 0.1 and 0.01, a conserved sequence (PLDLAVFW) in the family of nine UDP-glucuronosyltransferase 1 proteins is predicted to be a strong BACE1 substrate. Many of the top scoring sequences are composed of negatively charged and hydrophobic amino acids, consistent with the preference table values. As is the case for the known BACE1 substrates, there were a variety of predicted cleavage recognition sites, ranging from 8–50 in Table 3 and 8–52 in Table S1.
Table 3

Predicted cut sites and scores for novel putative BACE1 substrates from the human proteome.

UniProt ID | Protein | Recognition site | Recognition sequence | Score
Q495A1 | T cell immunoreceptor with Ig and ITIM domains | 21 | RIFLEVLE | 1.03E+00
O95470 | Sphingosine-1-phosphate lyase 1 | 28 | EPYLEILE | 8.08E-01
P12314 | High affinity immunoglobulin gamma Fc receptor I | 14 | ELELQVLG | 5.50E-01
Q9BZM6 | NKG2D ligand 1 | 22 | EEFLMYWE | 4.48E-01
Q9NP60 | X-linked interleukin-1 receptor accessory protein-like 2 | 46 | EVELALIF | 2.07E-01
Q13445 | Transmembrane emp24 domain-containing protein 1 | 49 | EEMLDVKM | 1.58E-01
Q5T7P8 | Synaptotagmin-6 | 46 | QEALAVLA | 1.16E-01
Q6ZRP7 | Sulfhydryl oxidase 2 | 8 | GVDFSSLD | 1.09E-01
A0PJX4 | Protein shisa-3 homolog | 50 | PEDFDTLD | 9.03E-02
Q96A26 | Protein FAM162A | 17 | TVSLEMLD | 7.63E-02
- | UDP-glucuronosyltransferase 1 family (combined) | 37 | PLDLAVFW | 7.42E-02
P60509 | HERV-R(b)_3p24.3 provirus ancestral env polyprotein | 40 | NISLALED | 7.41E-02
Q4ADV7 | Protein RIC1 homolog | 35 | DENFSTLS | 6.68E-02
Q3SXP7 | Uncharacterized protein KIAA1644 | 26 | ETEFQAVM | 6.15E-02
O95140 | Mitofusin-2 | 32 | QEEFMVSM | 6.07E-02
Q96FB5 | UPF0431 protein C1orf66 | 16 | PLNLAALQ | 6.01E-02
O75578 | Integrin alpha-10 | 15 | ESLLEVVQ | 5.55E-02
Q15363 | Transmembrane emp24 domain-containing protein 2 | 21 | QEYMEVRE | 4.86E-02
Q5DX21 | Immunoglobulin superfamily member 11 | 19 | LLDLQVIS | 4.74E-02
O43699 | Sialic acid-binding Ig-like lectin 6 | 17 | QISLSLFV | 4.58E-02
O95971 | CD160 antigen | 35 | GHFFSILF | 4.32E-02
O60499 | Syntaxin-10 | 37 | GIMLDAFA | 4.31E-02
Q6ZNB6 | NF-X1-type zinc finger protein NFXL1 | 35 | QAELEAFE | 3.98E-02
O95866 | Protein G6b | 48 | ELLLSAGD | 3.68E-02
Q86UW2 | Organic solute transporter subunit beta | 16 | QELLEEML | 3.62E-02
P26006 | Integrin alpha-3 | 15 | DIDSELVE | 3.44E-02
Q9Y639 | Neuroplastin | 36 | IVNLQITE | 3.32E-02
Q6UWI2 | Prostate androgen-regulated mucin-like protein 1 | 25 | LIDMETTT | 3.01E-02
A2A2Y4 | FERM domain-containing protein 3 | 45 | FEDLEADE | 3.00E-02
Q6P7N7 | Transmembrane protein 81 | 21 | EVNLDSYS | 2.88E-02
A6NFR6 | Putative uncharacterized protein C5orf60 | 24 | AVDMDILF | 2.81E-02
Q8N386 | Leucine-rich repeat-containing protein 25 | 20 | QHNLSAFL | 2.76E-02
Q9HBW1 | Leucine-rich repeat-containing protein 4 | 12 | QTSLDEVM | 2.68E-02
Q9Y5Y7 | Lymphatic vessel endothelial hyaluronic acid receptor 1 | 32 | EVFMETST | 2.65E-02
P0C6S8 | Leucine-rich repeat neuronal protein 2 | 40 | DTYFATLT | 2.56E-02
Q6NUS6 | Tectonic-3 | 43 | EVSLTTLV | 2.56E-02
Q8IYS5 | Osteoclast-associated immunoglobulin-like receptor | 48 | EFFLEEVT | 2.47E-02
Q9H5V8 | CUB domain-containing protein 1 | 16 | DLLFSVTL | 2.34E-02
Q15399 | Toll-like receptor 1 | 41 | QVSSEVLE | 2.29E-02
Q9Y2C9 | Toll-like receptor 6 | 41 | QVSSEVLE | 2.29E-02
Q13651 | Interleukin-10 receptor subunit alpha | 47 | HENFSLLT | 2.28E-02
Q9Y5I0 | Protocadherin alpha-13 | 34 | TVLLSLVE | 2.09E-02
Q68DV7 | RING finger protein 43 | 28 | EKLMEFVY | 2.08E-02
Q6UX41 | Butyrophilin-like protein 8 | 47 | EISLTVQE | 1.86E-02
Q15262 | Receptor-type tyrosine-protein phosphatase kappa | 45 | NIYFQAMS | 1.85E-02
Q5TH69 | Brefeldin A-inhibited guanine nucleotide-exchange protein 3 | 14 | DLLFELLR | 1.76E-02
Q9Y5F3 | Protocadherin beta-1 | 21 | EPYLQFQD | 1.63E-02
P29376 | Leukocyte tyrosine kinase receptor | 34 | QAELQLAE | 1.60E-02
Q86XX4 | Extracellular matrix protein FRAS1 | 17 | NLEMQELA | 1.56E-02
P60507 | HERV-F(c)1_Xq21.33 provirus ancestral Env polyprotein | 34 | ETSLLTLD | 1.40E-02
Q5SWX8 | Protein odr-4 homolog | 47 | IEDLEIAE | 1.37E-02
Q9H4D0 | Calsyntenin-2 | 49 | EFNLEVSI | 1.35E-02
Q9P246 | Stromal interaction molecule 2 | 44 | EPSFMISQ | 1.27E-02
A6BM72 | Multiple epidermal growth factor-like domains protein 11 | 25 | QAALMMEE | 1.22E-02
Q9UQV4 | Lysosome-associated membrane glycoprotein 3 | 23 | DVQLQAFD | 1.17E-02
Q6IEE7 | Transmembrane protein 132E | 8 | LTDLEIGM | 1.13E-02
Q96KV6 | Butyrophilin subfamily 2 member A3 | 50 | DSLFMVTT | 1.11E-02
Q96MU8 | Kremen protein 1 | 48 | QANLSVSA | 1.08E-02
P13598 | Intercellular adhesion molecule 2 | 15 | PKMLEIYE | 1.06E-02
Q13421 | Mesothelin | 31 | QDDLDTLG | 1.05E-02
Q01638 | Interleukin-1 receptor-like 1 | 34 | EEDLLLQY | 1.04E-02

Gene ontology (GO) analysis

GO analysis of the complete set of 962 proteins identified by the prediction algorithm as BACE1 substrates was performed using DAVID bioinformatics resources from the NIAID at NIH to search and then cluster GO terms to identify the enrichment of biological themes within a list of genes or proteins.37 As expected based on the predicted BACE1 substrates dataset, the terms “membrane protein” and “transmembrane” were associated with almost all of the proteins. The other common clusters that were returned are shown in Table 4 with their enrichment score and representative terms that were included in a given cluster. The enrichment score for a group is based on the combination of the EASE scores (a modified Fisher exact P-value) from the members of the group, with a higher score indicating a greater enrichment. Processes involved in cell-surface protein-protein or small molecule interactions, such as immunoglobulins, integrins, leucine-rich repeat proteins, and receptors, were the most highly enriched terms in the list of predicted BACE1 substrates.
Table 4

Gene ontology cluster analysis of putative BACE1 substrates from the bioinformatics analysis.

Enrichment score | Annotation cluster terms
85.1 | Immunoglobulin domain (230)
72.8 | Receptor (302), signal transducer (314)
61.5 | Cell adhesion (209), cadherin (73), cation binding (186)
35.6 | Fibronectin type III (76)
24.2 | Immune response (108), immune system process (145), response to stimulus (232)
13.8 | Integrin mediated signaling (27), regulation of actin cytoskeleton (31)
13.3 | Cytokine binding (35), cytokine-cytokine receptor interactions (48), growth factor binding (32)
11.7 | Leucine-rich repeat (51)
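
How an enrichment score of this kind can arise from per-term P-values is sketched below. This is a hedged illustration of our reading of the DAVID documentation, not DAVID's implementation. It assumes the EASE score is a one-sided Fisher exact (hypergeometric) P-value computed after removing one hit from the list, and that the group enrichment score is the geometric mean of the member EASE scores on a −log10 scale. The function names and the example counts are hypothetical.

```python
import math


def upper_tail(hits, list_size, pop_hits, pop_size):
    """P(X >= hits) for a hypergeometric draw (one-sided Fisher exact test)."""
    total = math.comb(pop_size, list_size)
    top = min(list_size, pop_hits)
    favourable = sum(
        math.comb(pop_hits, k) * math.comb(pop_size - pop_hits, list_size - k)
        for k in range(hits, top + 1)
    )
    return favourable / total


def ease_score(hits, list_size, pop_hits, pop_size):
    """More conservative P-value: one list hit is removed before testing (assumption)."""
    return upper_tail(max(hits - 1, 0), list_size, pop_hits, pop_size)


def enrichment_score(ease_scores):
    """Geometric mean of the member P-values, expressed as -log10."""
    return -sum(math.log10(p) for p in ease_scores) / len(ease_scores)


# Hypothetical example: 3 of 300 listed genes carry a term that
# annotates 40 of 30,000 genes in the background.
p_fisher = upper_tail(3, 300, 40, 30_000)
p_ease = ease_score(3, 300, 40, 30_000)
cluster_score = enrichment_score([1e-10, 1e-20])   # equals 15 on the -log10 scale
```

Under these assumptions, a cluster score of 85.1 would correspond to member P-values with a geometric mean near 10⁻⁸⁵, which is why the text treats the scores as a relative ranking rather than as literal probabilities.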

Discussion

Identification of in vivo substrates for proteases is a difficult task, especially for those that have relatively loose subsite specificity and/or a large active site that accommodates a longer peptide chain. Both of these conditions apply to BACE1.14,38 In addition to these challenges, subcellular localization also determines whether proteins with the potential to be substrates are actually proteolyzed in vivo. Because of its promising potential as a therapeutic target for Alzheimer’s disease, BACE1 has been studied extensively to elucidate its subsite specificity as well as its ability to cleave proteins in cell-based proteomics assays. Very recently, Turner et al extended their analysis of the subsite specificity of BACE1 from eight subsites (P4-P4’) to twelve (P8-P4’).14,15 Both studies utilized synthetic peptide libraries in which one position of the peptide was randomized with each of the standard amino acids (except cysteine) while holding the other positions constant. These libraries were then incubated with BACE1 and analyzed by mass spectrometry to determine a relative second order rate constant normalized to the Swedish APP sequence (EVNLDAEF). Inherent in this approach was the assumption that neighboring peptide residues did not significantly interact with one another. The fact that they and we have used these preference indices to successfully identify a significant number of known BACE1 substrates and, even more importantly, to make specific predictions about the location of cut sites using computational methods, supports the validity and utility of their data. Because attempting to identify the in vivo substrates for a protease with loose substrate specificity is difficult, a combination of approaches such as proteomics, bioinformatics, and in vitro biochemical measurements can drive, and indeed has driven, the ultimate identification of native substrates. 
The cleavage of APP at the β-site was known for several years before the discovery that the novel membrane-bound aspartyl protease BACE1 was responsible for the observed β-secretase activity. Although several BACE1 substrates have been identified through careful observation, phenotypes arising from BACE1 activity can be subtle or nonexistent because some actual BACE1 substrates can also be proteolyzed by other proteases such as BACE2 or α-secretase. Alternative strategies are needed to focus and inform in vivo studies. The complete kinetic assessment of BACE1 subsite specificity using synthetic peptide libraries provides a powerful opportunity to extend its application to protein sequences as well. These data have demonstrated the promise of this approach, but it is apparent that further refinement is required. For example, despite the success of the algorithm in predicting the most likely cleavage sites for APLP2, it did not identify any for APLP1, which is known to be cleaved by BACE1.32 Li et al made predictions for the BACE1 cleavage sites in two other known substrates, mPGES-2 and ST6Gal I, but how cleavage would occur at the proposed sites is not clear.15 mPGES-2, a Type I membrane protein with a short extracytoplasmic domain and a large cytoplasmic domain, is known to be cut between amino acids 87 and 88 to release it from the membrane, but this cleavage site is on the cytoplasmic side of the lipid bilayer.39 Although a BACE1 cleavage site was predicted, it does not match the known site, and how BACE1 could cleave at this intracellular peptide sequence is unclear. For the Type II membrane protein ST6Gal I, the original peptide sequence identified as a BACE1 substrate is actually from rat.24 Surprisingly, this cleavage recognition sequence is not conserved between rat and human, and according to our algorithm the changes in the human sequence would make it a worse substrate. The predicted cleavage site is 11 residues away from the transmembrane domain. 
Because the orientation of the peptide sequence is reversed for a Type II protein, it is unclear how BACE1 could cut so close to the membrane and have the peptide sit in its active site in the proposed orientation. In addition to the in vitro studies characterizing BACE1 activity with short peptide substrates, proteomic methods have also been used to guide the search for in vivo substrates.32 One strength of the proteomic approach is that it does not bias the choice of peptide sequences to test for BACE1 activity, a simplification that was necessary when utilizing synthetic peptide libraries. Another advantage is that BACE1 is in its membrane-bound form and presumably exposed primarily to substrates that are membrane-bound as well. However, one drawback of this approach is the need to limit the analysis to a few cell lines, some of which may not typically express BACE1. In addition, in vivo data can only be generated for those proteins expressed in the particular cell line(s) chosen. This may explain why some of the known BACE1 substrates, such as VGSCβ subunits, IL-1R-2, PSGL-1, LRP1, and NRG1, were not identified, giving false negative results. Overexpression of BACE1 is another potential source of error, one that could yield false positive results. Lee et al showed that BACE1 overexpression shifted the subcellular localization of APP cleavage to earlier points in the secretory pathway.40 Since this happens for APP upon BACE1 overexpression, caution should be used when interpreting the results for other substrates identified by proteomics. Because the purpose of proteomic and in vitro studies is to narrow the list of potential proteins to investigate further for their in vivo activity, these drawbacks do not present insurmountable problems: both the proteomics and in vitro approaches successfully identified known BACE1 substrates. 
Combining bioinformatics with existing proteomics and in vitro data should give a more robust prediction of BACE1 in vivo substrates. This report adds to the BACE1 in vivo substrate discussion by using a bioinformatics approach both to predict the BACE1 cleavage sites for a large number of known substrates and to identify potential novel BACE1 substrates by extending the analysis to the entire human proteome. We first compared our results to the known BACE1 in vivo substrates. Nine of the thirteen substrates in Table 1 were positively identified using our algorithm. The predicted recognition cleavage sites and the cut sites for these nine proteins match the published data. Four of the proteins had three sites that met our criteria. In the case of APP, multiple BACE1 cleavage sites are known to be present.8 Our method did not return positive identifications for mPGES-2, ST6Gal I, VGSCβ2, and APLP1. Our proposed explanation for not identifying mPGES-2 and ST6Gal I as potential substrates has been described above. For VGSCβ2, the score for the cleavage site reported by Li et al was 3.3 × 10⁻⁶, just below our threshold. This result may necessitate changing the threshold, but we are currently investigating other methods that will reduce rather than increase the number of hits returned while capturing all of the known substrates. From both the proteomics and the in vitro studies, one would predict that APLP2 is a better substrate than APLP1. This is also the case with our bioinformatics data, which is not surprising since the preference indices from Turner et al were used in our scoring matrix as well. APLP1 was identified via proteomics, but it is not apparent why our method did not identify it as a substrate. One explanation could be the large number of cysteine residues in the juxtamembrane region of APLP1. Since cysteine was left out of the octapeptide substrate libraries, scoring cysteine-rich sequences is not possible with our algorithm. 
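The threshold and cysteine limitations described above can be made concrete with a sliding-window scan. The following is a minimal sketch, not the authors' implementation: `scan_for_sites`, `toy_score`, and the cutoff value `THRESHOLD = 1e-5` are assumptions for illustration (the text states only that a score of 3.3e-6 falls just below the cutoff). Windows containing cysteine are reported separately rather than scored, mirroring the absence of cysteine from the peptide libraries.

```python
from typing import Callable, List, Tuple

WINDOW = 8        # length of a P4-P4' octapeptide window
THRESHOLD = 1e-5  # assumed placeholder cutoff; the actual value is not given in this section


def scan_for_sites(
    sequence: str,
    score_fn: Callable[[str], float],
    threshold: float = THRESHOLD,
) -> Tuple[List[Tuple[int, str, float]], List[Tuple[int, str]]]:
    """Slide an 8-residue window along a sequence.

    Returns (hits, unscorable): hits are (start, window, score) with
    score >= threshold; unscorable are (start, window) containing cysteine,
    for which no preference index exists.
    """
    hits, unscorable = [], []
    for start in range(len(sequence) - WINDOW + 1):
        window = sequence[start:start + WINDOW]
        if "C" in window:
            unscorable.append((start, window))
            continue
        score = score_fn(window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits, unscorable


def toy_score(window: str) -> float:
    """Stand-in scorer with made-up values; a real scorer would use measured indices."""
    return 1e-3 if window == "EVKMDAEF" else 1e-6


# Hypothetical test sequence: one favorable window followed by a cysteine.
hits, unscorable = scan_for_sites("AAEVKMDAEFCAA", toy_score)
print(hits)
print(unscorable)
```

Keeping the unscorable windows visible, instead of silently treating them as misses, makes it easier to see when a negative result (as for APLP1) may reflect missing data rather than a poor substrate.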
The proteomics data of Hemming et al were used to validate the efficacy of our method.32 Approximately 70% of their reported substrates were correctly identified, and importantly, we report the predicted recognition sites for those cleavages. Because of the way the algorithm is currently written, no Type II protein would be identified as a substrate. Though the data for rat ST6Gal I are convincing, exactly how BACE1 recognizes and cleaves this sequence, which is in the opposite orientation, is not clear. Additionally, the rat sequence in which proteolysis by BACE1 was described is not conserved in human ST6Gal I. The fact that proteins such as BACE1 itself were identified as BACE1 substrates by proteomics, but upon further analysis were shown to be associated with a protease other than BACE1, highlights the need for complementary information about substrates predicted via proteomics, whether from further biochemical or bioinformatics studies. With the solid foundation provided by this study, further refinement of our substrate prediction algorithm is underway to address the lack of identification of the remaining 30% of the proteomics-identified and known BACE1 substrates. Some of the substrates identified by proteomics may not turn out to be actual in vivo BACE1 substrates, and definitive conclusions about the relative value of the bioinformatics and proteomics methods must await further studies. Each method has value and unique strengths and weaknesses in guiding the search for native BACE1 substrates. As is the case with the known substrates identified by in vivo and proteomics methods, the distances from the membrane of the cut recognition sites span the entire range between 8 and 52 residues. 
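As a check on the two validation rates quoted in the text, the arithmetic can be written out. The counts below are taken from this article: nine of thirteen known substrates, and, counting the rows of Table 2 as listed here, 47 proteins with at least one predicted site against 21 with none. The function name `recovery_rate` is illustrative.

```python
def recovery_rate(identified: int, total: int) -> float:
    """Fraction of substrates for which at least one site was predicted."""
    if total <= 0 or not 0 <= identified <= total:
        raise ValueError("counts must satisfy 0 <= identified <= total and total > 0")
    return identified / total


# Known in vivo substrates: nine of thirteen were identified.
known = recovery_rate(9, 13)

# Proteomics substrates in Table 2: 47 entries with predicted sites, 21 with "None".
proteomics = recovery_rate(47, 47 + 21)

print(f"known substrates: {known:.1%}")            # known substrates: 69.2%
print(f"proteomics substrates: {proteomics:.1%}")  # proteomics substrates: 69.1%
```

Both values round to the approximately 70% reported for each validation set.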
Across Table 3 and the summary of the GO analysis in Table 4, the annotation clusters placed a significant number of proteins in relatively few categories: a large number (230 of 962) contained immunoglobulin domains or were involved in immune response or immune system processes; just over 300 had functions related to receptors and signal transduction; proteins involved in protein-protein interactions, including cell adhesion proteins, accounted for 209 proteins, some further subcategorized as cadherins, cation binding proteins, integrin proteins, and leucine-rich repeat proteins; and finally, cytokines and their receptors, which are involved in processes such as growth factor binding. Efforts to refine the algorithm to improve its accuracy are underway, and though experiments to evaluate these putative BACE1 substrates in vivo are planned, they are beyond the scope of the present study.

Table S1.xls
Table 2

Predicted BACE1 cut sites for substrates identified in the proteomics study of Hemming et al.32

| UniProt ID | Protein | Topology | Predicted cleavage recognition: Site | Sequence | Score |
| P05067 | APP | Type I | 13 | LVFFAEDV | 8.44E-03 |
|  |  |  | 33 | EVKMDAEF | 1.02E-03 |
|  |  |  | 41 | NIKTEEIS | 6.04E-05 |
| Q06481 | APLP2 | Type I | 9 | REDFSLSS | 1.20E-03 |
|  |  |  | 30 | MIFNAERV | 7.23E-05 |
|  |  |  | 44 | DENMVIDE | 3.55E-03 |
| P40189 | Interleukin-6 receptor beta chain | Type I | 17 | GPEFTFTT | 9.00E-05 |
|  |  |  | 35 | DTLYMVRM | 2.17E-03 |
| P08581 | Hepatocyte growth factor receptor | Type I | 29 | NSELNIEW | 1.22E-05 |
| O75976 | Carboxypeptidase D | Type I | 22 | DAASSVVI | 4.17E-05 |
| P29317 | Ephrin type A receptor 2 | Type I | 15 | VHEFQTLS | 2.32E-03 |
|  |  |  | 28 | QALTQEGQ | 1.43E-04 |
| P54764 | Ephrin type A receptor 4 | Type I | 44 | NPLTSYVF | 6.06E-05 |
| Q15375 | Ephrin type A receptor 7 | Type I | 16 | GKMFEATA | 5.55E-03 |
|  |  |  | 25 | DVATLEEA | 2.89E-05 |
|  |  |  | 40 | RAFTAAGY | 2.89E-05 |
| P54760 | Receptor protein tyrosine kinase variant EPHB4V1 | Type I | 16 | QTQLDESE | 6.70E-04 |
|  |  |  | 41 | GASYLVQV | 1.20E-05 |
| Q92823 | Neuronal cell adhesion molecule 1 | Type I | 14 | GPAMASRQ | 2.46E-05 |
| P32004 | Neuronal cell adhesion molecule L1 | Type I | 24 | RHQMAVKT | 5.75E-05 |
|  |  |  | 38 | DTDYEIHL | 2.83E-04 |
|  |  |  | 40 | QPDTDYEI | 2.96E-04 |
| Q9NPR2 | Semaphorin-4B | Type I | 39 | GVADQTDE | 7.20E-05 |
| Q9C0C4 | Semaphorin-4C | Type I | 25 | EGYLVAVV | 1.17E-05 |
| Q9H2E6 | Semaphorin-6A | Type I | 31 | DPLGAVSS | 2.07E-05 |
| Q96JA1 | Leucine-rich repeats and immunoglobulin-like domains protein 1 | Type I | 51 | TPDNQLLV | 5.72E-05 |
| O94898 | Leucine-rich repeats and immunoglobulin-like domains protein 2 | Type I | 28 | HIYLNVIS | 1.28E-04 |
| Q6UXM1 | Leucine-rich repeats and immunoglobulin-like domains protein 3 | Type I | 51 | IVDSDVSD | 7.11E-05 |
| Q9Y6N7 | Roundabout homolog 1 | Type I | 9 | QISDVVKQ | 2.36E-05 |
|  |  |  | 15 | QVSLAQQI | 4.11E-04 |
|  |  |  | 47 | EVAASTGA | 1.99E-05 |
| Q9HCK4 | Roundabout homolog 2 | Type I | 47 | EVAASTSA | 1.75E-05 |
| Q7Z5N4 | Sidekick-1 | Type I | 17 | NPSTAVSA | 3.82E-05 |
| Q58EX2 | Sidekick-2 | Type I | 38 | GVSYDFRV | 3.74E-04 |
|  |  |  | 52 | EVSSYTFS | 3.77E-05 |
| P15151 | Poliovirus receptor | Type I | 23 | QAELTVQV | 5.00E-04 |
| Q92673 | Sortilin-related receptor | Type I | 14 | GADASATQ | 2.07E-05 |
|  |  |  | 22 | LLYDELGS | 1.02E-05 |
|  |  |  | 23 | ILLYDELG | 1.89E-04 |
|  |  |  | 46 | GHNYTFTV | 8.20E-05 |
| Q96JP9 | Protocadherin 21 (cadherin-related family member 1) | Type I | 15 | MAAFLIQT | 6.23E-05 |
|  |  |  | 17 | SPMAAFLI | 1.45E-05 |
|  |  |  | 26 | ITDAETLS | 2.20E-05 |
|  |  |  | 39 | SPSFSTTA | 5.71E-05 |
| Q9Y5H2 | Protocadherin gamma A11 | Type I | 11 | LANSETSD | 3.08E-05 |
|  |  |  | 20 | LADLGSLE | 3.89E-05 |
|  |  |  | 22 | EVLADLGS | 9.31E-05 |
|  |  |  | 40 | PPLSATVT | 1.54E-05 |
| Q9Y5G8 | Protocadherin gamma A5 | Type I | 8 | PEDLDLTL | 1.03E-02 |
|  |  |  | 22 | DILADLGS | 7.29E-05 |
| Q9Y5G5 | Protocadherin gamma A8 | Type I | 9 | DPNDSSLT | 6.06E-05 |
|  |  |  | 22 | EVLTELGS | 1.67E-03 |
|  |  |  | 40 | PPLSATVT | 1.54E-05 |
| Q9UN70 | Protocadherin gamma C3 | Type I | 40 | EPSLSTTA | 3.88E-03 |
| Q86VZ4 | Low-density lipoprotein receptor-related protein 11 | Type I | 23 | EESYIFES | 3.20E-05 |
| O75096 | Low-density lipoprotein receptor-related protein 4 | Type I | 37 | RTSLEEVE | 9.63E-03 |
|  |  |  | 47 | TTLYSSTT | 1.08E-05 |
| P31431 | Syndecan-4 | Type I | 43 | PKKLEENE | 1.67E-05 |
| MULTIPLE | HLA class I histocompatibility antigen (Combined) | Type I | 9 | EPSSQSTV | 3.00E-05 |
| Q13332 | Receptor-type tyrosine protein phosphatase S | Type I | 8 | IVDGEEGL | 2.82E-05 |
| Q13740 | CD166 antigen | Type I | 19 | DEADEISD | 1.29E-04 |
| Q12907 | Vesicular integral-membrane protein VIP36 | Type I | 52 | MKLFQLMV | 1.20E-03 |
| Q5VU97 | Cache domain containing 1 | Type I | 19 | DDMGAIGD | 2.22E-05 |
| Q9BYH1 | Seizure 6-like protein 2 | Type I | 12 | EAAAETSL | 1.25E-05 |
|  |  |  | 19 | EHALEVAE | 5.97E-02 |
|  |  |  | 51 | ELMGEVTI | 3.82E-03 |
| Q92859 | Neogenin | Type I | 45 | MPNDQASG | 1.60E-05 |
| Q6UVK1 | Chondroitin sulfate proteoglycan 4 | Type I | 9 | LSFLEANM | 3.03E-04 |
|  |  |  | 12 | GGFLSFLE | 9.84E-05 |
| Q24JP5 | Transmembrane protein 132A | Type I | 8 | VTELELGM | 4.24E-04 |
| Q13145 | BMP and activin membrane-bound inhibitor homolog | Type I | 14 | QELTSSKE | 1.42E-04 |
| Q14126 | Desmoglein 2 | Type I | 10 | QHDSYVGL | 9.29E-05 |
|  |  |  | 46 | EIQFLISD | 2.81E-03 |
| Q9NZV1 | Cysteine-rich motor neuron 1 protein | Type I | 45 | EVDLEVPL | 1.12E-03 |
| Q92896 | Golgi apparatus protein 1 | Type I | 13 | DLAMQVMT | 4.21E-03 |
|  |  |  | 15 | FSDLAMQV | 1.88E-04 |
| Q9NR96 | Toll-like receptor 9 | Type I | 47 | DFLLEVQA | 1.55E-03 |
|  |  |  | 48 | MDFLLEVQ | 8.73E-05 |
|  |  |  | 49 | FMDFLLEV | 1.41E-04 |
|  |  |  | 51 | AAFMDFLL | 3.58E-04 |
| O75509 | Tumor necrosis factor receptor superfamily member 21 | Type I | 37 | LPSMEATG | 3.14E-04 |
| P51654 | Glypican-3 | GPI | 31 | AYDLDVDD | 2.48E-05 |
|  |  |  | 33 | ELAYDLDV | 1.30E-03 |
|  |  |  | 35 | LAELAYDL | 3.35E-04 |
| P51693 | APLP1 | Type I | None |
| Q99523 | Sortilin | Type I | None |
| Q5ZPR3 | CD276 antigen | Type I | None |
| P19021 | Peptidyl-glycine alpha-amidating monooxygenase | Type I | None |
| Q6UX71 | Plexin domain-containing protein 2 | Type I | None |
| P35613 | Basigin | Type I | None |
| O95185 | Netrin receptor UNC5C | Type I | None |
| Q8TB96 | T-cell immunomodulatory protein | Type I | None |
| O14672 | Disintegrin and metalloproteinase domain-containing protein 10 | Type I | None |
| O43291 | Kunitz-type protease inhibitor 2 | Type I | None |
| O43493 | Trans-golgi network integral membrane protein 2 | Type I | None |
| Q12860 | Contactin-1 | GPI | None |
| Q8NFY4 | Semaphorin-6D | Type I | None |
| O00592 | Podocalyxin-like protein 1 | Type I | None |
| P56817 | Beta-secretase 1 | Type I | None |
| Q2VWP7 | Protogenin | Type I | None |
| P78504 | Jagged-1 | Type I | None |
| P11717 | Cation-independent mannose-6-phosphate receptor | Type I | None |
| Q86YC3 | Leucine-rich repeat-containing protein 33 | Type I | None |
| P52803 | Ephrin-A5 | GPI | None |
| O00461 | Golgi phosphoprotein 4 | Type II | None |

1.  Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors:  A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal:  J Mol Biol       Date:  2001-01-19       Impact factor: 5.469

Review 2.  Neuregulin, a factor with many functions in the life of a schwann cell.

Authors:  A N Garratt; S Britsch; C Birchmeier
Journal:  Bioessays       Date:  2000-11       Impact factor: 4.345

3.  Purification and cloning of amyloid precursor protein beta-secretase from human brain.

Authors:  S Sinha; J P Anderson; R Barbour; G S Basi; R Caccavello; D Davis; M Doan; H F Dovey; N Frigon; J Hong; K Jacobson-Croak; N Jewett; P Keim; J Knops; I Lieberburg; M Power; H Tan; G Tatsuno; J Tung; D Schenk; P Seubert; S M Suomensaari; S Wang; D Walker; J Zhao; L McConlogue; V John
Journal:  Nature       Date:  1999-12-02       Impact factor: 49.962

4.  Locating proteins in the cell using TargetP, SignalP and related tools.

Authors:  Olof Emanuelsson; Søren Brunak; Gunnar von Heijne; Henrik Nielsen
Journal:  Nat Protoc       Date:  2007       Impact factor: 13.491

5.  Sheddases and intramembrane-cleaving proteases: RIPpers of the membrane. Symposium on regulated intramembrane proteolysis.

Authors:  Stefan F Lichtenthaler; Harald Steiner
Journal:  EMBO Rep       Date:  2007-05-11       Impact factor: 8.807

6.  Characterization of Alzheimer's beta -secretase protein BACE. A pepsin family member with unusual properties.

Authors:  M Haniu; P Denis; Y Young; E A Mendiaz; J Fuller; J O Hui; B D Bennett; S Kahn; S Ross; T Burgess; V Katta; G Rogers; R Vassar; M Citron
Journal:  J Biol Chem       Date:  2000-07-14       Impact factor: 5.157

7.  Phenotypic and biochemical analyses of BACE1- and BACE2-deficient mice.

Authors:  Diana Dominguez; Jos Tournoy; Dieter Hartmann; Tobias Huth; Kim Cryns; Siska Deforce; Lutgarde Serneels; Ira Espuny Camacho; Els Marjaux; Katleen Craessaerts; Anton J M Roebroek; Michael Schwake; Rudi D'Hooge; Patricia Bach; Ulrich Kalinke; Dieder Moechars; Christian Alzheimer; Karina Reiss; Paul Saftig; Bart De Strooper
Journal:  J Biol Chem       Date:  2005-06-29       Impact factor: 5.157

8.  The low density lipoprotein receptor-related protein (LRP) is a novel beta-secretase (BACE1) substrate.

Authors:  Christine A F von Arnim; Ayae Kinoshita; Ithan D Peltan; Michelle M Tangredi; Lauren Herl; Bonny M Lee; Robert Spoelgen; Tammy T Hshieh; Sripriya Ranganathan; Frances D Battey; Chun-Xiang Liu; Brian J Bacskai; Sanja Sever; Michael C Irizarry; Dudley K Strickland; Bradley T Hyman
Journal:  J Biol Chem       Date:  2005-03-04       Impact factor: 5.157

9.  Cellular prostaglandin E2 production by membrane-bound prostaglandin E synthase-2 via both cyclooxygenases-1 and -2.

Authors:  Makoto Murakami; Karin Nakashima; Daisuke Kamei; Seiko Masuda; Yukio Ishikawa; Toshiharu Ishii; Yoshihiro Ohmiya; Kikuko Watanabe; Ichiro Kudo
Journal:  J Biol Chem       Date:  2003-06-30       Impact factor: 5.157

10.  Infrastructure for the life sciences: design and implementation of the UniProt website.

Authors:  Eric Jain; Amos Bairoch; Severine Duvaud; Isabelle Phan; Nicole Redaschi; Baris E Suzek; Maria J Martin; Peter McGarvey; Elisabeth Gasteiger
Journal:  BMC Bioinformatics       Date:  2009-05-08       Impact factor: 3.169
