Michal Krassowski1,2, Marta Paczkowska1, Kim Cullion1, Tina Huang1, Irakli Dzneladze1,3, B F Francis Ouellette1,4, Joseph T Yamada1, Amelie Fradet-Turcotte5, Jüri Reimand1,3. 1. Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada. 2. Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland. 3. Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada. 4. Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada. 5. Department of Molecular Biology, Medical Biochemistry and Pathology, Université Laval, Québec, Québec, Canada.
Abstract
Interpretation of genetic variation is needed for deciphering genotype-phenotype associations, mechanisms of inherited disease, and cancer driver mutations. Millions of single nucleotide variants (SNVs) in human genomes are known and thousands are associated with disease. An estimated 21% of disease-associated amino acid substitutions corresponding to missense SNVs are located in protein sites of post-translational modifications (PTMs), chemical modifications of amino acids that extend protein function. ActiveDriverDB is a comprehensive human proteo-genomics database that annotates disease mutations and population variants through the lens of PTMs. We integrated >385,000 published PTM sites with ∼3.6 million substitutions from The Cancer Genome Atlas (TCGA), the ClinVar database of disease genes, and human genome sequencing projects. The database includes site-specific interaction networks of proteins, upstream enzymes such as kinases, and drugs targeting these enzymes. We also predicted network-rewiring impact of mutations by analyzing gains and losses of kinase-bound sequence motifs. ActiveDriverDB provides detailed visualization, filtering, browsing and searching options for studying PTM-associated mutations. Users can upload mutation datasets interactively and use our application programming interface in pipelines. Integrative analysis of mutations and PTMs may help decipher molecular mechanisms of phenotypes and disease, as exemplified by case studies of TP53, BRCA2 and VHL. The open-source database is available at https://www.ActiveDriverDB.org.
DNA sequencing studies have enabled large-scale characterization of human genomes and revealed millions of single nucleotide variants (SNVs), copy number alterations, and other types of genetic variants. Identifying genotype-phenotype associations, molecular mechanisms, causal disease variants and cancer driver mutations remains a major challenge of current biomedical research (1,2). Large catalogues of genetic variation comprising tens of thousands of individual and tumour genomes are now available from projects such as The Cancer Genome Atlas (TCGA) (3), the International Cancer Genome Consortium (ICGC) (4), the 1000 Genomes Project (5), the Exome Aggregation Consortium (ExAC) (6), and others. Open-access databases such as ClinVar (7) collect disease genes and mutations. Prediction of functional impact and prioritization of candidate variants primarily relies on evolutionary sequence conservation and other genomic features (8–10); however, information about protein interactions and signaling is not routinely applied in such analyses.
Post-translational modifications (PTMs) include more than 400 kinds of chemical modifications of amino acids that act as molecular switches and expand the functional repertoire of proteins (11,12). PTMs are carried out by modular reader–writer–eraser networks where specific enzymes induce PTMs in target proteins, remove PTMs, and interact with modified sites (13). Phosphorylation, ubiquitination, acetylation, and methylation are the most commonly characterized PTMs with nearly 400 000 experimentally determined sites in human proteins (14–16). PTMs are involved in various aspects of cellular organization including protein activation and degradation, protein-protein interactions, chromatin organization, development, and signaling pathways associated with disease (17–20). Further, PTMs are increasingly drug targetable and used in precision cancer therapies (21–23). Thus PTM information helps interpret genetic variation, genotype-phenotype associations, and molecular disease mechanisms.
PTM sites are enriched in disease mutations and rare variants in the population (24–29). Such mutations often alter sequence motifs bound by PTM enzymes and may cause rewiring of signaling networks (27,30). Importantly, the functional impact of PTM mutations is often underestimated in standard annotation pipelines. We found that 15–30% of disease mutations in PTM sites are considered benign by tools such as PolyPhen2 (8), SIFT (9) and CADD (10), likely because PTM sites are located in disordered protein regions with lower evolutionary conservation (25). Thus PTMs remain understudied in the context of genetic variation and disease. The PhosphoSitePlus database maintains downloadable datasets with PTM site variation (14); however, to our knowledge a dedicated comprehensive database of genetic variation in PTM sites does not exist.
To address this limitation, we developed ActiveDriverDB, a proteo-genomics resource for interpreting human genome variation using PTM sites (24–27). The database integrates experimentally determined PTM sites with large genomics resources: cancer exomes from TCGA (3,31), known disease genes and mutations from the ClinVar database (7) and population variation from the 1000 Genomes Project and ESP6500 (5,32). We also display the network context of PTM mutations by analyzing PTM-specific protein-protein interactions and the drugs targeting PTM enzymes that regulate the protein (33). Hundreds of thousands of amino acid substitutions in PTM sites are available in the database for browsing, visualization and interpretation. Datasets can be downloaded or analyzed using our application programming interface (API). Users can also interactively upload, store and analyze their own custom datasets of mutations. Our open-source database can be downloaded for local use.
MATERIALS AND METHODS
Genomic and proteomic data in ActiveDriverDB
ActiveDriverDB includes two major types of human -omics data: genomics data on missense SNVs and proteomics data on PTMs (Figure 1, Table 1). Human genome variation datasets include disease-associated SNVs and those apparent in the human population. First, ActiveDriverDB includes somatic cancer mutations of nearly 9000 tumor samples of 34 types from exome sequencing by TCGA, compiled in the recent PanCanAtlas MC3 release (3,31). The TCGA dataset was further filtered to exclude non-passed mutations and hyper-mutated samples. Second, inherited disease mutations from the ClinVar database (7) are also available in ActiveDriverDB. Third, inter-individual genome variation of the human population includes the 1000 Genomes Project (5) with >2500 genomes and the ESP6500 project (32) with >6500 exomes. Experimentally determined human PTM sites are retrieved from the public databases PhosphoSitePlus (14), Phospho.ELM (15) and HPRD (16) and include primarily proteomics data on the four most frequently characterized PTM types (phosphorylation, acetylation, ubiquitination, methylation). Site-specific protein-protein interactions of substrate proteins and upstream enzymes (primarily kinases) are also included from these databases. We also integrate drugs that target upstream enzymes of PTMs using data from the DrugBank database (33).
Figure 1.
Overview and workflow of ActiveDriverDB. Our database integrates genomic and proteomic data for interpreting disease mutations and human inter-individual variation with PTMs. Genomics datasets include cancer exome sequencing studies (TCGA), disease genes and mutations (ClinVar), and human genome variation studies (ESP6500, the 1000 Genomes Project) (top-left panel). Proteomics datasets include PTM sites of four commonly studied PTM types, site-specific interactions of PTM enzymes and target proteins, and drug interactions with PTM enzymes (top-right panel). Our systematic analysis pipeline aligns PTM sites with missense SNVs, predicts the impact of amino acid substitutions on kinase-bound sequence motifs using the MIMP method, and organizes site-specific interaction networks of PTMs, upstream enzymes and drugs (middle panel). The protein sequence view shows the distribution of PTMs and substitutions along the protein sequence (bottom left panel), while the interaction network view shows site-specific interactions of mutated proteins with upstream PTM enzymes and their associated drugs (bottom right panel). The database also provides exporting, visualization, searching and automated analysis tools (bottom middle panel).
Table 1.
Overview of genome variation datasets and post-translational modifications included in ActiveDriverDB

| | TCGA PanCanAtlas | ClinVar | 1000 Genomes | ESP6500 | Total |
|---|---|---|---|---|---|
| Dataset | | | | | |
| Size | 8 856 exomes | 494 059 records | 2504 genomes | 6503 exomes | - |
| Description | Cancer (somatic) | Inherited disease | General population | General population | - |
| Mutations | | | | | |
| Total | 1 595 400 | 137 860 | 1 066 906 | 1 318 972 | 3 588 280 |
| in PTM sites | 221 472 | 27 305 | 143 489 | 185 982 | 506 974 |
| with network-rewiring effect | 30 882 | 2 869 | 20 525 | 26 498 | 70 518 |
| Annotated nucleotides (hg19/GRCh37) | 1 865 173 | 155 824 | 1 206 968 | 1 486 067 | 4 124 041 |
| PTM sites affected by mutations* | | | | | |
| Total | 214 362 | 16 597 | 157 420 | 186 800 | 303 401 |
| Phosphorylation sites | 169 632 | 12 999 | 128 097 | 150 505 | 239 509 |
| Acetylation sites | 11 581 | 1 258 | 7 216 | 8 988 | 16 384 |
| Ubiquitination sites | 35 058 | 2 659 | 22 678 | 28 226 | 50 081 |
| Methylation sites | 3 377 | 272 | 2 293 | 2 577 | 4 416 |
| Proteins with mutations affecting PTM sites | | | | | |
| Total | 27 316 | 3 317 | 25 132 | 26 202 | 29 462 |
| Kinases & PTM enzymes | 613 | 115 | 580 | 594 | 624 |
| Kinase families | 127 | 58 | 127 | 126 | 127 |
| PTM sites | | | | | |
| Total | - | - | - | - | 385 185 |
| Phosphorylation sites | - | - | - | - | 299 241 |
| Acetylation sites | - | - | - | - | 21 670 |
| Ubiquitination sites | - | - | - | - | 67 933 |
| Methylation sites | - | - | - | - | 5 666 |

*Counts of PTMs and amino acid substitutions reflect all high-confidence protein isoforms collected in the database.
Mapping mutations and PTM sites
Substitutions (SNVs) in PTM sites were mapped using our previously designed pipelines (24–27). Genomic coordinates of SNVs were mapped to protein amino acid substitutions using the Annovar software (34) and RefSeq genes (hg19/GRCh37). Peptide sequences corresponding to PTM sites were mapped to RefSeq proteins using exact sequence matching, permitting multiple matches per sequence. PTM sites were extended by seven amino acids before and after the modified protein residue, and multiple clustered PTM sites were merged into consecutive regions. Protein domains from the InterPro database (35) were mapped into non-redundant regions and combined with disorder predictions of the DISOPRED2 software (36). ActiveDriverDB provides information for 39 159 high-confidence isoforms of 19 062 human genes. We identify these by HGNC gene symbols (37) or RefSeq transcript IDs and show primary isoforms according to the UniProt database (38) by default.
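The site-mapping steps described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names are hypothetical, the protein length used in the example is only an illustrative value, and treating directly adjacent windows as one region is an assumption (the text only states that clustered sites are merged).

```python
from typing import List, Tuple

FLANK = 7  # residues on each side of the modified residue, as stated in the text


def find_peptide(protein: str, peptide: str) -> List[int]:
    """Return 0-based start offsets of every exact match of peptide in protein."""
    starts = []
    pos = protein.find(peptide)
    while pos != -1:
        starts.append(pos)
        pos = protein.find(peptide, pos + 1)  # allow overlapping, multiple matches
    return starts


def site_window(position: int, protein_length: int, flank: int = FLANK) -> Tuple[int, int]:
    """1-based inclusive window around a modified residue, clipped to the protein."""
    return max(1, position - flank), min(protein_length, position + flank)


def merge_windows(windows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (or, by assumption, directly adjacent) windows into regions."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Three phosphosites discussed later in the article; the length is illustrative.
    sites = [3291, 3319, 3323]
    print(merge_windows([site_window(p, 3418) for p in sites]))
    # [(3284, 3298), (3312, 3330)]
```

In this example the windows around positions 3319 and 3323 overlap and collapse into one region, while the window around 3291 stays separate.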
Impact of mutations on PTM sites
Amino acid substitutions (SNVs) in PTM sites are further annotated regarding their position relative to PTMs and potential impact on signaling networks. PTM mutations are considered direct if they substitute the central PTM amino acid residue while indirect mutations are classified either as proximal or distal (1–2 or 3–7 amino acid residues to the nearest PTM site, respectively). We distinguish variants that affect different types of PTM sites and mutations affecting multiple adjacent PTMs. To estimate the network impact of mutations, we performed sequence motif analysis with our machine learning method MIMP (27) using 476 models of sequence motifs of 322 kinases and families (14–16,39,40). MIMP analyses substitutions in PTM sites and predicts whether these cause loss of existing kinase-bound motifs or create new motifs, suggesting the impact of mutations on the rewiring of cellular signaling networks.
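The positional classification above reduces to a distance rule, sketched below in Python. The function name and return labels are hypothetical. The sketch covers only the direct, proximal and distal categories; the network-rewiring label depends on MIMP motif predictions and is not modelled here.

```python
from typing import Iterable, Optional


def classify_ptm_impact(mutation_pos: int, ptm_positions: Iterable[int]) -> Optional[str]:
    """Classify a substitution by its distance to the nearest modified residue.

    Returns "direct" (distance 0), "proximal" (1-2 residues), "distal"
    (3-7 residues), or None when no PTM site lies within 7 residues.
    """
    distances = [abs(mutation_pos - site) for site in ptm_positions]
    if not distances:
        return None
    nearest = min(distances)
    if nearest == 0:
        return "direct"
    if nearest <= 2:
        return "proximal"
    if nearest <= 7:
        return "distal"
    return None


if __name__ == "__main__":
    # Positions around a single phosphosite at residue 284.
    for pos in (280, 283, 284, 286, 292):
        print(pos, classify_ptm_impact(pos, [284]))
```

For a phosphosite at residue 284, this rule labels position 280 as distal, 283 and 286 as proximal, 284 as direct, and 292 as outside the site.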
Software design and availability
The ActiveDriverDB website uses the Flask micro-framework and two relational databases: the first for constant biological data and the second for dynamic content and user-provided data. An additional key-value Berkeley DB database allows mapping of all potential missense SNVs in the human genome. Visualizations are implemented in the d3.js framework. Our needleplots are inspired by the muts-needle-plot library (https://zenodo.org/record/14561). All code is available under the terms of the LGPL 2.1 license. Documentation is available at https://github.com/reimandlab/ActiveDriverDB.
RESULTS
Thousands of disease mutations are enriched in PTM sites
The database is available at https://ActiveDriverDB.org. In total, ActiveDriverDB characterises 506 974 unique amino acid substitutions in PTM sites across high-confidence protein isoforms, including 221 472 in cancer genomes, 27 305 in inherited diseases and 143 489 and 185 982 in human genomes from the population sequencing projects 1000 Genomes and ESP6500, respectively. These substitutions affect the four types of most frequently characterized PTMs: phosphorylation sites (299 241), ubiquitination sites (67 933), acetylation sites (21 670) and methylation sites (5666), with 385 185 distinct sites in total across all protein isoforms. Among 558 high-confidence cancer genes of the Cancer Gene Census database (41), 9542 unique substitutions in the TCGA dataset (25%) are associated with PTM sites when considering primary isoforms of proteins (5773 expected from sampling of substitutions from the 1000 Genomes dataset, empirical P < 10⁻⁵). Among disease genes annotated in the ClinVar database, 11 041 unique substitutions (21%) are associated with PTM sites (7963 PTM SNVs expected, P < 10⁻⁵). Enrichment of disease-associated mutations in PTM sites is in agreement with our earlier studies (24–27). These statistics suggest that a large fraction of germline and somatic disease mutations can be interpreted using PTM information.
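A sampling-based enrichment test of the kind reported above could look like the following Python sketch. It is an illustration under stated assumptions, not the authors' procedure: the function name is hypothetical, background substitutions are drawn with replacement, and the p-value uses the common add-one correction. The number of draws bounds the smallest p-value the estimate can report.

```python
import random
from typing import Sequence, Tuple


def empirical_enrichment(observed_hits: int,
                         n_substitutions: int,
                         background_in_ptm: Sequence[bool],
                         n_draws: int = 1000,
                         seed: int = 0) -> Tuple[float, float]:
    """Estimate the expected number of PTM-associated substitutions and an
    empirical p-value by repeatedly sampling from a background set.

    background_in_ptm holds one flag per background substitution (True when
    the substitution falls in a PTM site). Sampling with replacement and the
    add-one correction are assumptions of this sketch.
    """
    rng = random.Random(seed)
    total_hits = 0
    at_least_observed = 0
    for _ in range(n_draws):
        hits = sum(rng.choices(background_in_ptm, k=n_substitutions))
        total_hits += hits
        if hits >= observed_hits:
            at_least_observed += 1
    expected = total_hits / n_draws
    p_value = (at_least_observed + 1) / (n_draws + 1)
    return expected, p_value


if __name__ == "__main__":
    # Toy background in which 15% of substitutions fall in PTM sites.
    background = [True] * 15 + [False] * 85
    expected, p = empirical_enrichment(observed_hits=40, n_substitutions=100,
                                       background_in_ptm=background)
    print(f"expected={expected:.1f}, empirical p={p:.4f}")
```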
Visualization and analysis of mutations in PTM sites
The two primary pages of the database, the protein sequence view and the interaction network view, are focused on individual proteins (genes). Both views provide detailed visualizations of PTM-associated amino acid substitutions, tables with additional information, protein descriptions, and external links. The views permit filtering of mutations by dataset (inherited disease mutations, somatic cancer mutations, or inter-individual genome variation), disease types, pathogenicity and PTM types. All non-PTM mutations can be filtered as well. Both views display the primary isoform as default, while alternative isoforms can be selected.
Protein sequence view: mutation impact on PTMs, sequence features and network rewiring
The main components of this view include a needleplot with mutations and impact on PTM sites, sequence tracks with protein domains (35) and disorder predictions (36), and a detailed table of mutations. The needleplot represents the protein sequence and its PTM sites horizontally, while mutations extend vertically from the sequence according to their frequency (Figure 2A). Colored circles on top of needles represent mutation impact on PTM sites, and mouse-over motion shows information about the mutation, disease annotations, known PTM enzymes such as bound kinases, predictions of network rewiring with mutation-induced gains and losses of sequence motifs (Figure 2B), and drugs targeting the upstream PTM enzymes. The needleplot can be zoomed and searched by amino acid position. Mutations are also described in the table below (Figure 2C). The needleplot can be exported as a high-resolution PDF (Portable Document Format file) and the mutation table can be exported as a spreadsheet.
Figure 2.
PTM-associated cancer mutations in the tumor suppressor protein TP53. (A) Needleplot in the protein sequence view shows the distribution of missense cancer mutations from TCGA (vertical bars) and their associations with PTM sites (blue boxes), with protein sequence on the x-axis and number of mutations on the y-axis. (B) The substitution R282W disrupts the sequence motif bound by Aurora kinase B (AURKB). (C) Detailed table of mutations with disease associations and impact of substitutions on PTMs. (D) Experimentally determined interaction network shows the TP53 protein (middle node) and its PTM site-specific interactions with upstream enzymes, as well as approved drugs targeting these enzymes. Node shapes indicate types of interacting molecules and sites: protein of interest (oval), PTM sites (squares), enzymes interacting with PTM site (circles), and drugs targeting the enzyme (triangles). Arrows indicate the interaction of TP53 and Aurora kinase B at phosphosite T284 and the associated substitution R282W.
Interaction network view: PTM site-specific interactions, upstream enzymes and drug targets
This view displays the selected protein in a site-specific interaction network with upstream enzymes and associated drugs (Figure 2D). Two interaction networks are available: the high-confidence experimental network includes experimentally determined kinase–substrate interactions, and the computationally predicted network includes gained and lost kinase–substrate interactions derived from sequence motif analysis with MIMP. Most interactions involve phosphorylation sites and their associated kinases, for which the largest body of experimental data is available. The network view uses an automatic layout algorithm that emphasizes the hierarchical network structure. It can be zoomed and arranged for clarity and exported as a high-resolution PDF.
Searching and browsing PTM mutations in proteins
The database provides a flexible graphical user interface for finding, visualizing and interpreting mutations in PTM sites and their potential impact on signaling networks.
Searching for genes, pathways, and diseases
The main search bar supports several options. First, the user can identify a gene (protein) of interest by either its HGNC symbol, which retrieves the primary isoform (e.g. TP53), or a RefSeq transcript ID, which retrieves a specific isoform (e.g. NM_000546). Second, genes associated with biological processes of Gene Ontology (42), molecular pathways of Reactome (43) (e.g. Wnt signaling pathway), and diseases in the ClinVar database (e.g. Noonan syndrome) can be looked up. These search options benefit researchers who are interested in specific genes, pathways, or disease mutations.
Searching for mutations
The user can search for a gene or protein using amino acid substitutions (e.g. TP53 R282W) or coordinates of missense mutations (e.g. chr17 7577094 C T) through our rapid indexing system that covers all potential missense SNVs in the human genome. Searching for mutations is especially beneficial for genetics researchers who have identified interesting missense SNVs in a genome-wide association or DNA sequencing study.
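The two query styles shown above (a gene with an amino acid substitution, or genomic coordinates with reference and alternative alleles) can be told apart with simple pattern matching. The Python sketch below is a hypothetical parser written for illustration. It does not reproduce the database's own search code, and the accepted syntax is an assumption based only on the two examples in the text.

```python
import re
from typing import Dict, Optional

# Gene symbol or transcript ID, then a substitution such as R282W.
PROTEIN_QUERY = re.compile(
    r"^(?P<gene>[A-Za-z0-9_.-]+)\s+(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z])$"
)
# Chromosome (optional "chr" prefix), position, reference and alternative base.
GENOMIC_QUERY = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9]{1,2}|X|Y|MT?)\s+(?P<pos>\d+)\s+"
    r"(?P<ref>[ACGT])\s+(?P<alt>[ACGT])$",
    re.IGNORECASE,
)


def parse_query(text: str) -> Optional[Dict[str, object]]:
    """Parse a mutation query into a dictionary, or return None if unrecognised."""
    text = text.strip()
    match = GENOMIC_QUERY.match(text)
    if match:
        return {
            "kind": "genomic",
            "chrom": match["chrom"].upper(),
            "pos": int(match["pos"]),
            "ref": match["ref"].upper(),
            "alt": match["alt"].upper(),
        }
    match = PROTEIN_QUERY.match(text)
    if match:
        return {
            "kind": "protein",
            "gene": match["gene"],
            "ref": match["ref"],
            "pos": int(match["pos"]),
            "alt": match["alt"],
        }
    return None


if __name__ == "__main__":
    print(parse_query("TP53 R282W"))
    print(parse_query("chr17 7577094 C T"))
```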
Browsing top disease-associated genes and pathways
Users may browse sets of disease-associated genes and pathways with unexpectedly frequent mutations in PTM sites. Candidate gene lists are available for cancer mutations from TCGA and inherited disease mutations from ClinVar. Genes are ranked according to statistical significance of PTM mutations computed using our ActiveDriver method (26). These lists are useful for novice users of the database who are looking for an overview of the database through examples of genes and pathways.
Analysing custom datasets of mutations
Users can upload VCF or tab-separated files to analyze their datasets of variants with PTM information, using chromosomal or protein coordinates. A password-protected area is available for uploaded data.
Application programming interface (API)
ActiveDriverDB includes an API allowing access to the database with programming languages like R and Python using the Representational State Transfer (REST) pattern. The API accepts chromosomal or protein coordinates of mutations, converts these appropriately and returns PTM annotations. Filters for mutation types (cancer, inherited disease, population), querying of mutations by gene symbol or RefSeq ID, and other options are also supported. Datasets of PTM sites and associated mutations are also available for download for advanced computational biology studies. We provide up-to-date input datasets for the ActiveDriver method (26) that reveals proteins with statistical enrichment of mutations in PTM sites.
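As a rough illustration of using a REST-style interface from a pipeline, the Python sketch below only assembles a request URL and performs no network access. The endpoint path and filter names are placeholders invented for this example and are not documented routes or parameters of ActiveDriverDB; the project documentation should be consulted for the real ones.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://www.ActiveDriverDB.org"   # address given in the article
ENDPOINT = "/api/EXAMPLE-ENDPOINT"            # placeholder, NOT a documented route


def build_query_url(query: str,
                    filters: Optional[Dict[str, str]] = None,
                    base_url: str = BASE_URL,
                    endpoint: str = ENDPOINT) -> str:
    """Assemble a GET URL for a mutation query such as 'TP53 R282W'.

    The query is percent-encoded into the path; optional filters (hypothetical
    names) are appended as a sorted query string for reproducible URLs.
    """
    url = f"{base_url.rstrip('/')}/{endpoint.strip('/')}/{quote(query, safe='')}"
    if filters:
        url += "?" + urlencode(sorted(filters.items()))
    return url


if __name__ == "__main__":
    # 'dataset' is a hypothetical filter name used only for illustration.
    print(build_query_url("TP53 R282W", {"dataset": "cancer"}))
```

Keeping URL construction separate from the HTTP call makes the query encoding easy to test before a pipeline sends any requests.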
Case studies of PTM-associated disease mutations
Frequent cancer mutations in PTM sites in the tumor suppressor protein TP53
Mutated in 50% of cancers, the transcription factor TP53 relies on its DNA-binding activity to perform its function as a tumour suppressor (44). Consistently, most mutations are found in the DNA-binding domain (DBD) of the protein, with a third clustered in six hotspot residues with high-frequency mutations (R175, G245, R248, R249, R273, R282) (Figure 2A) (45). Although most of the mutations in TP53 are associated with loss of function, mutations such as R282W lead to gain of function, giving TP53 distinct oncogenic properties (reviewed in (45,46)). The mechanisms by which R282W leads to this transformation are still unclear (47). MIMP analysis of sequence motifs in TP53 predicts that the mutations R282W/G/Q rewire the phosphosite T284 by abolishing the sequence motif of AURKB in the DBD of TP53 (Figure 2B and C) (27). This phosphosite is bound by the AURKB kinase in vitro and in cells with AURKB ectopic expression (Figure 2D) (27,48,49). The substitution T284E inhibits the ability of TP53 to promote CDKN1A expression (48), highlighting a role of AURKB in TP53 signaling. More than 200 mutations in the TCGA dataset potentially interfere with the phosphosite T284 (Table 2), suggesting that these regulate a common function of TP53. As R282W is associated with early cancer development (50), the mutations should be further studied regarding their impact on the gene regulatory and tumour suppressive roles of TP53 (46). By highlighting clusters of PTM-associated mutations, our database helps design experiments to understand the post-translational regulation of TP53.
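Table 2.
PTM-associated cancer mutations affecting the phosphosite T284 in the tumor suppressor protein TP53

| Phosphosite | Reference a.a. | Mutated a.a. | Number of mutations (TCGA; ClinVar) | Impact on PTM |
|---|---|---|---|---|
| T284 | R280 | T | 19; 3 | Distal |
| T284 | R280 | I | 4; 3 | Distal |
| T284 | R280 | K | 14 | Distal |
| T284 | R280 | S | 3 | Distal |
| T284 | R280 | G | 5 | Distal |
| T284 | D281 | V | 4; 2 | Distal |
| T284 | D281 | G | 2 | Distal |
| T284 | D281 | A | 2 | Distal |
| T284 | D281 | H | 3 | Distal |
| T284 | D281 | N | 2 | Distal |
| T284 | D281 | Y | 7 | Distal |
| T284 | D281 | E | 6 | Distal |
| T284 | R282 | W | 74; 5 | Network-rewiring |
| T284 | R282 | G | 2; 5 | Network-rewiring |
| T284 | R282 | Q | 1; 2 | Network-rewiring |
| T284 | R282 | L | 2 | Network-rewiring |
| T284 | R283 | H | 1; 2 | Proximal |
| T284 | R283 | P | 8 | Proximal |
| T284 | R283 | C | 3 | Proximal |
| T284 | R283 | S | 3 | Proximal |
| T284 | T284 | I | 1 | Direct |
| T284 | T284 | P | 1 | Direct |
| T284 | T284 | A | | Direct |
| T284 | T284 | E | | Direct |
| T284 | E285 | V | 4; 2 | Proximal |
| T284 | E285 | K | 23 | Proximal |
| T284 | E285 | Q | 1 | Network-rewiring |
| T284 | E286 | K | 11; 1 | Network-rewiring |
| T284 | E286 | A | 1 | Proximal |
| T284 | E286 | G | 4 | Proximal |
| T284 | E286 | V | 1 | Proximal |
| T284 | E286 | Q | 1 | Proximal |
| Total | - | - | 201; 37 | - |

Somatic and germline disease mutations that potentially affect the phosphosite of Aurora kinase B.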
PTM-associated disease mutations of BRCA2 and the DNA damage response pathway
Mutations in the tumor suppressor BRCA2 are associated with elevated risk of breast and ovarian cancers as well as Fanconi anemia, a rare chromosome instability syndrome characterized by aplastic anemia and susceptibility to childhood cancer (51,52). Consistently, disease-associated SNVs in BRCA2 reported in the ClinVar database are associated with familial breast cancer and hereditary cancer-predisposing syndrome. BRCA2 is essential for DNA double-strand break (DSB) repair by homologous recombination and protects the stalled replication fork (53). To prevent genomic instability, BRCA2 relies on interactions with RAD51 mediated by cell cycle-dependent kinases (CDKs) (54–57). Using the ActiveDriverDB database, we found that a significant number of somatic and inherited cancer mutations of BRCA2 coincide with phosphosites (29 SNVs in ClinVar, FDR = 10⁻⁴⁷; 15 SNVs in TCGA, FDR = 10⁻⁶) (Figure 3A). Interestingly, three phosphosites S3291, S3319 and T3323 occur in the C-terminus of BRCA2, whose deletion is associated with increased radiation sensitivity and early-onset breast and ovarian cancer (58–60). The C-terminal TR2 domain at residues 3265–3330 mediates the interaction of BRCA2 with nucleofilaments of RAD51 (55,57), and its phosphorylation by CDKs inhibits this interaction and is essential for mitotic entry (54,55,57,61). Substitutions that abolish either these phosphosites or the CDK consensus sites (P3292L/S, P3320H and P3324L) (Figure 3B–D) are associated with familial breast cancer and hereditary cancer-predisposing syndrome, suggesting that the mutations interfere with maintenance of genomic stability. Consistently, the substitution S3291A inhibits the interaction of BRCA2 with RAD51 filaments, a phenotype that abrogates the replication fork protection without affecting DNA repair (62). Whether the phosphorylation of S3319 and T3323 regulates BRCA2 is unknown; however, mutant BRCA2 with glutamate substitutions in these amino acids still interacts with RAD51 filaments (54). This example illustrates how integrating PTM information with germline disease mutations can generate novel, experimentally testable hypotheses of disease mechanisms.
Figure 3.
PTM-associated mutations in the BRCA2 protein involved in DNA repair and breast cancer. (A) Zoomed needleplot shows germline disease mutations located in three phosphorylation sites in the protein sequence at residues 3200–3400. Only PTM-associated mutations are shown. (B) Mutations P3292L and P3292S are predicted to disrupt the sequence motif of the CDK2 kinase. (C) Table shows additional information on the two mutations. (D) The computationally derived PTM interaction network of BRCA2, kinases predicted to interact with mutant BRCA2, and drugs targeting the kinases. Arrows point to the mutations P3292L/S.
Figure 4.
PTM-associated mutations in the tumor suppressor protein VHL. (A) Needleplot of mutations from the TCGA dataset with the VHL mutation L169P near two phosphosites. (B) The mutation L169P is predicted to alter kinase-bound sequence motifs involving the CDK1 kinase. (C) Table shows additional information on the mutation. (D) The computationally predicted interaction network of VHL, its PTM sites, kinases predicted to interact with mutant VHL according to the MIMP method, and drugs targeting the kinases. Arrows indicate the mutation L169P.
Network-rewiring mutations in the tumour suppressor VHL alter putative CDK binding sites
The tumour suppressor VHL encodes a member of an E3 ubiquitin ligase complex that inhibits oncogenic substrates such as protein kinase C, retinol binding protein 1, and hypoxia-inducible transcription factors (HIF) (63,64). VHL is frequently inactivated in cancer, and clear cell renal cell carcinomas (ccRCCs) harbour gene-silencing mutations including L169P and others in the p.157–172 subdomain (64) (Figure 4A). Phosphorylation of VHL at S168 by NIMA-related kinase 1 (NEK1) has been associated with VHL degradation and ciliary homeostasis (65). The mutation L169P may impact VHL signaling as it flanks the phosphosites S168 and Y175 bound by the NEK1 kinase. ActiveDriverDB analysis suggests that three L169P substitutions observed in TCGA kidney cancers may induce gains of phosphosites of the cyclin-dependent kinase 1 (CDK1) or related CDK and MAPK kinases (Figure 4B and C). While little is known about the interactions of VHL and CDKs, VHL inactivation has been linked to increased levels of CDK1 and CDK2 (66), and CDK1 stabilizes HIF transcription factors that are targets of VHL (67). Studying the L169P mutation in the context of VHL phosphorylation and upstream kinases may reveal details about disease mechanisms (Figure 4D). Since CDK1 is pharmaceutically targetable, drug assays using alsterpaullone and alvocidib (33) may also advance development of targeted therapies.
DISCUSSION
ActiveDriverDB is a comprehensive human proteo-genomics resource that uses PTMs to interpret disease mutations and inter-individual variation. Although PTMs are important regulators of protein function and signaling pathways, genetic variant impact analysis pipelines usually neglect this information. Our database aims to advance analysis of missense mutations using PTMs. Novice users of our database can start from example queries of well-annotated genes and browse gene lists with PTM-enriched disease mutations. Basic and translational researchers can look up their favourite genes, upload candidate variants from DNA sequencing experiments, and export tables and publication-quality figures. Computational biologists can use the API to automatically analyze variants and download entire datasets for advanced studies.
Our collection of PTM sites is derived from many published studies that represent diverse cells and experimental conditions. While this large dataset allows us to maximise coverage of disease mutations and genetic variation, not all PTM sites may be directly comparable with one another. Proteins carrying PTM sites observed in certain cell types may not be expressed, or the sites may be excluded due to alternative splicing, in cells relevant to a disease of interest. Although we processed PTM sites uniformly across the entire collected dataset, false positives may have emerged from analyses in original studies, the databases reporting the data, or our multi-database integration pipeline. We recommend that users validate top PTM sites by looking them up in databases such as PhosphoSitePlus and in the publications and supplementary materials that originally reported the PTM site.
We plan several important future developments of the database. Maintaining timely biomedical data resources is essential as new datasets accumulate rapidly and outdated resources hamper scientific advances (68). Thus we aim to provide at least annual updates of our database to include recent large-scale genomics and proteomics studies and molecular interaction networks. Recent proteomics technologies enable large-scale characterization of other PTM types such as glycosylation (69) and SUMOylation (70), and such datasets will be included in future releases. Additional species and genomes will also be considered, such as the most recent human genome assembly (GRCh38) and model organisms such as mouse and Arabidopsis with abundant genome variation and proteomics data (14,71).
Interpreting inter-individual genetic variation will become an increasingly important challenge as we enter the era of personal genomics. Integration of proteomic and genomic information for deciphering the impact of variation on cellular systems and organism-level phenotypes is a powerful approach that will improve with future datasets of increased magnitude and complexity. We provide an integrated database resource to the research community to enable future discoveries.