Yu Fan, Yu Hu, Cheng Yan, Radoslav Goldman, Yang Pan, Raja Mazumder, Hayley M Dingerdissen.
Abstract
Despite the availability of site-specific sequence information resulting from years of sequencing and sequence feature curation, there have been few efforts to integrate and annotate this information. In this study, we update the number of human N-linked glycosylation sequons (NLGs), and we investigate the cancer-relatedness of glycosylation-impacting somatic nonsynonymous single-nucleotide variations (nsSNVs) by mapping human NLGs to cancer variation data and reporting the expected loss or gain of glycosylation sequons. We find that 75.8% of all human proteins have at least one NLG, for a total of 59,341 unique NLGs (both predicted and experimentally validated). Only 27.4% of all NLGs are experimentally validated sites on 4,412 glycoproteins. With respect to cancer, 8,895 somatic-only nsSNVs abolish NLGs in 5,204 proteins and 12,939 somatic-only nsSNVs create NLGs in 7,356 proteins in cancer samples. nsSNVs causing loss of 24 NLGs on 23 glycoproteins and nsSNVs creating 41 NLGs on 40 glycoproteins are identified in three or more cancers. Of all identified cancer somatic variants causing potential loss or gain of glycosylation, only 36 have previously known disease associations. Although this work is computational, it builds on existing genomics and glycobiology research to promote the identification and ranking of potential cancer nsSNV biomarkers for experimental validation.
Year: 2018 PMID: 29531238 PMCID: PMC5847511 DOI: 10.1038/s41598-018-22345-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Numbers of sequons and proteins identified from three methods.

Of the 20,199 proteins in the human proteome, 3,452 have signal peptides and 14,921 have at least one “Cellular component” keyword (“Secreted,” “Membrane,” “Cytoplasm,” or “Nucleus”); the same protein can belong to multiple “Cellular component” categories if it is observed in more than one cellular location. 8,772 proteins either have signal peptides or are annotated with the keyword(s) “Secreted” or “Membrane.” The string search results include almost all the NLGs from the other two methods; the exceptions are 121 atypical cases that do not follow the NX(S/T) (X ≠ P) consensus but meet the high-confidence criterion, 61 of which are reported in UniProt FT lines. In total, the three methods yield 59,341 non-redundant NLGs from 15,318 proteins. Of these proteins, 7,017 either have signal peptides or are annotated with the keyword(s) “Secreted” or “Membrane.”
| | High-confidence<sup>a</sup> NLGs from databases | Predicted NLGs by NetNGlyc | String search of NX(S/T) (X ≠ P) sequons |
|---|---|---|---|
| Sequons | 16,253 | 43,139 | 59,220 |
| Proteins | 4,412 | 14,114 | 15,314 |
| Proteins that either have signal peptides or are annotated with keyword(s) “Secreted” or “Membrane” | 4,373 | 6,597 | 7,014 |
<sup>a</sup> Annotated NLGs from UniProtKB/Swiss-Prot, HPRD 9.0, dbPTM 3.0, neXtProt and NCBI-CDD were treated as high-confidence results.
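The string search for NX(S/T) (X ≠ P) sequons can be illustrated with a short regular expression. This is a minimal sketch, not the authors' implementation: the function name `find_sequons` and the example sequence are hypothetical, and the lookahead is an assumption made here so that overlapping motifs are each reported.

```python
import re

# Lookahead so that overlapping motifs (e.g. "NNTS") are each reported.
# N, then any residue except P, then S or T.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return the 1-based start positions of NX(S/T) (X != P) motifs."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


# Hypothetical toy sequence, not a real protein.
print(find_sequons("MNGSANPTNNTS"))
```

In the toy sequence, the motif starting at position 6 (NPT) is skipped because X is proline, while the overlapping motifs at positions 9 and 10 are both kept.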
Figure 1. Contribution of data sources to NLG identification. This diagram shows the distinct and overlapping contribution of different resources and methods of identification of NLGs throughout the human proteome. The contributions of specific databases (those data entries composing the high-confidence subset) are detailed in the Venn diagram on the left.
Figure 2. Density of sequons per protein. Density here is calculated as the total number of positions identified to start an NXS/T motif in a protein divided by the corresponding length of the protein. The majority of proteins are annotated with smaller numbers of NLGs, and therefore the average density is less than 1 NLG per protein, when normalized by unit length. The distribution of NLGs per protein is plotted as the count of proteins with a given density of NLGs for (A) all NLGs in the human proteome, (B) LOG-causing NLGs in the somatic subset, and (C) GOG-causing NLGs in the somatic subset.
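The density defined in the caption (motif start positions divided by protein length) can be sketched as follows. The function name `sequon_density` is hypothetical, and excluding proline at the X position is an assumption carried over from the NX(S/T) (X ≠ P) consensus used elsewhere in the text.

```python
import re


def sequon_density(sequence):
    """Number of NX(S/T) (X != P) start positions divided by sequence length."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # Lookahead counts overlapping motifs as separate start positions.
    starts = [m.start() for m in re.finditer(r"(?=N[^P][ST])", sequence.upper())]
    return len(starts) / len(sequence)


# Hypothetical toy sequence: 3 motif starts over 12 residues.
print(sequon_density("MNGSANPTNNTS"))
```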
Spacing of sequons and real N-glycosylation sites. All values are reported as numbers of amino acids.
| | Sequons | Real N-glycosylation sites |
|---|---|---|
| Minimum | 1 | 1 |
| Maximum | 4,277 | 2,970 |
| Mean | 117.98 | 93.84 |
| Median | 71 | 52 |
| S.D. | 145.02 | 134.71 |
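Summary statistics of this kind can be computed from per-protein sequon positions. The sketch below assumes that spacing means the difference between consecutive start positions within one protein, and that the standard deviation is the sample standard deviation; neither choice is stated in the text, and the function names are hypothetical.

```python
import statistics


def sequon_spacings(positions):
    """Differences (in residues) between consecutive positions in one protein."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def summarize(spacings):
    """Minimum, maximum, mean, median and sample SD of a list of spacings."""
    return {
        "min": min(spacings),
        "max": max(spacings),
        "mean": statistics.mean(spacings),
        "median": statistics.median(spacings),
        "sd": statistics.stdev(spacings),  # sample SD; needs at least 2 values
    }


# Hypothetical positions for a single protein.
print(summarize(sequon_spacings([2, 9, 10, 14])))
```

Under this definition a spacing of 1 arises when two motifs overlap, as in NNTS.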
Numbers of variations, affected sequons, and affected proteins from germline and cancer somatic LOG and GOG variation.
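| | Somatic: variations | Somatic: sequons<sup>b</sup> | Somatic: proteins | Somatic-only<sup>a</sup>: variations | Somatic-only<sup>a</sup>: sequons<sup>b</sup> | Somatic-only<sup>a</sup>: proteins | Germline: variations | Germline: sequons<sup>b</sup> | Germline: proteins |
|---|---|---|---|---|---|---|---|---|---|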
| LOG due to variations within the NX(S/T) (X ≠ P) sequon | 10,807 | 9,665 | 5,893 | 8,895 | 8,061 | 5,204 | 37,498 | 26,909 | 10,586 |
| GOG due to variations within the NX(S/T) (X ≠ P) sequon<sup>c</sup> | 15,675 | 15,503 | 8,257 | 12,939 | 12,815 | 7,356 | 39,455 | 38,711 | 12,667 |
| GOG predicted by NetNGlyc due to variations<sup>d</sup> | 11,018 | 10,918 | 6,808 | 9,835 | 9,294 | 5,365 | 27,704 | 27,226 | 11,307 |
<sup>a</sup> Variations from cancer genomics databases (somatic origin), but not in dbSNP (germline origin).
<sup>b</sup> Sequons for GOG sets are reported by unique positions. Note that the number of unique motifs per position identified is equal to the total number of nsSNVs for that set.
<sup>c</sup> GOG predicted by string search alone.
<sup>d</sup> Overlap between string search and NetNGlyc is 100% of NetNGlyc results.
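The string-search notion of loss or gain of a sequon can be made concrete by comparing the motif positions before and after a substitution. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the 1-based position convention and the toy sequence are all hypothetical.

```python
import re

SEQUON = re.compile(r"(?=N[^P][ST])")  # NX(S/T), X != P, overlaps allowed


def sequon_starts(sequence):
    """Set of 1-based start positions of sequons in the sequence."""
    return {m.start() + 1 for m in SEQUON.finditer(sequence)}


def classify_nssnv(sequence, position, ref, alt):
    """Report sequons lost (LOG) or gained (GOG) by one amino acid substitution."""
    if sequence[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    mutated = sequence[:position - 1] + alt + sequence[position:]
    before, after = sequon_starts(sequence), sequon_starts(mutated)
    return {"LOG": sorted(before - after), "GOG": sorted(after - before)}


seq = "MNGSANPTNNTS"  # hypothetical toy sequence
print(classify_nssnv(seq, 4, "S", "A"))  # removes the S of the motif at position 2
print(classify_nssnv(seq, 7, "P", "A"))  # replaces the blocking proline at X
```

Comparing whole sets of positions also covers the case where a single substitution abolishes one sequon and creates another.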
Figure 3. Position of affected residue in sequon in LOG and GOG cancer subset. LOG and GOG variants are linked to the position in the sequence affected by the underlying nsSNV, where one corresponds to N, two corresponds to X, and three corresponds to S/T in the NXS/T consensus sequon. Variants are also linked to the specific cancers in which they have been identified.
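The one/two/three numbering in the caption follows directly from the sequon start position and the variant position. A small sketch, assuming both positions are 1-based protein coordinates; the function names and the example pairs are hypothetical.

```python
from collections import Counter

LABELS = {1: "N", 2: "X", 3: "S/T"}


def position_in_sequon(sequon_start, variant_position):
    """Return 1, 2 or 3 for a variant hitting N, X or S/T of the sequon."""
    offset = variant_position - sequon_start + 1
    if offset not in LABELS:
        raise ValueError("variant lies outside the three-residue sequon")
    return offset


def tally_positions(pairs):
    """Count variants per sequon position from (sequon_start, variant_position) pairs."""
    return Counter(position_in_sequon(start, pos) for start, pos in pairs)


# Hypothetical (sequon start, variant position) pairs.
counts = tally_positions([(2, 4), (2, 2), (6, 7), (9, 11)])
print({LABELS[k]: v for k, v in sorted(counts.items())})
```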
Figure 4. Schematic of biologically relevant LOG and GOG variations in at least three cancers. Red circles are proteins in the LOG dataset; green circles are proteins in the GOG dataset. Proteins completely within the lipid bilayer are tagged with the cellular localization term “Membrane,” while proteins spanning both the membrane and the adjacent cytoplasm or extracellular environment are tagged with both “Membrane” and “Cytoplasm” or “Secreted,” respectively. Note that position within the membrane is not indicative of status as an integral or peripheral protein. Dotted lines represent proteins that also have the cellular localization term “Nucleus.” Black stars signify the presence of a signal peptide on that protein. Also note that Mast/stem cell growth factor receptor Kit (KIT, P10721) is the only protein to appear with mutations that could cause both loss and gain of glycosylation in three or more cancers each.
High-confidence LOG variations associated with three or more cancer types.
| UniProt AC | Gene Name | Protein Name | Position | Reference | Variation | Cancer Types |
|---|---|---|---|---|---|---|
| O15244 | SLC22A2 | Solute carrier family 22 member 2 | 74 | T | M | DOID:1793/pancreatic cancer;DOID:1319/brain cancer;DOID:219/colon cancer |
| O15438 | ABCC3 | Canalicular multispecific organic anion transporter 2 | 1009 | S | F | DOID:4159/skin cancer;DOID:10534/stomach cancer;DOID:3571/liver cancer |
| O75473 | LGR5 | Leucine-rich repeat-containing G-protein coupled receptor 5 | 77 | N | S | DOID:219/colon cancer;DOID:1324/lung cancer;DOID:5672/large intestine cancer |
| P10163 | PRB4 | Basic salivary proline-rich protein 4 | 110 | S | P | DOID:1319/brain cancer;DOID:3070/malignant glioma;DOID:11934/head and neck cancer |
| P10721 | KIT | Mast/stem cell growth factor receptor Kit | 486 | N | D | DOID:184/bone cancer;DOID:1993/rectum cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| P28472 | GABRB3 | Gamma-aminobutyric acid receptor subunit beta-3 | 107 | T | M | DOID:0060119/pharynx cancer;DOID:10534/stomach cancer;DOID:11934/head and neck cancer |
| Q96S37 | SLC22A12 | Solute carrier family 22 member 12 | 104 | T | M | DOID:1319/brain cancer;DOID:10534/stomach cancer;DOID:219/colon cancer |
| Q9NTG1 | PKDREJ | Polycystic kidney disease and receptor for egg jelly-related protein | 297 | S | L | DOID:4159/skin cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q9NTG1 | PKDREJ | Polycystic kidney disease and receptor for egg jelly-related protein | 925 | S | L | DOID:10534/stomach cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| Q9NUN5 | LMBRD1 | Probable lysosomal cobalamin transporter | 349 | S | C | DOID:1319/brain cancer;DOID:1324/lung cancer;DOID:3070/malignant glioma |
| Q9P121 | NTM | Neurotrimin | 46 | T | M | DOID:0060119/pharynx cancer;DOID:5041/esophageal cancer;DOID:11934/head and neck cancer |
| Q9UKZ4 | TENM1 | Teneurin-1 | 1759 | S | L | DOID:4362/cervical cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q9UPZ6 | THSD7A | Thrombospondin type-1 domain-containing protein 7A | 236 | T | M | DOID:0060119/pharynx cancer;DOID:1993/rectum cancer;DOID:5041/esophageal cancer;DOID:11934/head and neck cancer |
Somatic-only GOG variations associated with three or more cancer types.
| UniProt AC | Gene Name | Protein Name | Position | Reference | Variation | Cancer Types |
|---|---|---|---|---|---|---|
| O14522 | PTPRT | Receptor-type tyrosine-protein phosphatase T | 281 | A | T | DOID:1319/brain cancer;DOID:10534/stomach cancer;DOID:3070/malignant glioma |
| O43526 | KCNQ2 | Potassium voltage-gated channel subfamily KQT member 2 | 785 | D | N | DOID:1793/pancreatic cancer;DOID:1612/breast cancer;DOID:10283/prostate cancer |
| O43699 | SIGLEC6 | Sialic acid-binding Ig-like lectin 6 | 251 | A | T | DOID:1319/brain cancer;DOID:219/colon cancer;DOID:3070/malignant glioma |
| O60469 | DSCAM | Down syndrome cell adhesion molecule | 213 | A | T | DOID:363/uterine cancer;DOID:1612/breast cancer;DOID:1793/pancreatic cancer;DOID:5041/esophageal cancer |
| O94973 | AP2A2 | AP-2 complex subunit alpha-2 | 115 | A | T | DOID:1612/breast cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| P04264 | KRT1 | Keratin, type II cytoskeletal 1 | 248 | D | N | DOID:4159/skin cancer;DOID:2394/ovarian cancer;DOID:5672/large intestine cancer |
| P07357 | C8A | Complement component C8 alpha chain | 169 | D | N | DOID:4159/skin cancer;DOID:363/uterine cancer;DOID:1909/melanoma |
| P10721 | KIT | Mast/stem cell growth factor receptor Kit | 566 | N | S | DOID:1993/rectum cancer;DOID:263/kidney cancer;DOID:5672/large intestine cancer |
| P21802 | FGFR2 | Fibroblast growth factor receptor 2 | 659 | K | N | DOID:363/uterine cancer;DOID:1612/breast cancer;DOID:1324/lung cancer |
| P21817 | RYR1 | Ryanodine receptor 1 | 2861 | D | N | DOID:8618/oral cavity cancer;DOID:1793/pancreatic cancer;DOID:363/uterine cancer;DOID:11934/head and neck cancer |
| P46531 | NOTCH1 | Neurogenic locus notch homolog protein 1 | 465 | A | T | DOID:219/colon cancer;DOID:3070/malignant glioma;DOID:5041/esophageal cancer;DOID:11934/head and neck cancer;DOID:0060119/pharynx cancer;DOID:1319/brain cancer |
| P54852 | EMP3 | Epithelial membrane protein 3 | 42 | D | N | DOID:8618/oral cavity cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q02817 | MUC2 | Mucin-2 | 1750 | T | N | DOID:1612/breast cancer;DOID:3571/liver cancer;DOID:1781/thyroid cancer;DOID:2394/ovarian cancer;DOID:10283/prostate cancer;DOID:1319/brain cancer;DOID:1993/rectum cancer |
| Q13002 | GRIK2 | Glutamate receptor ionotropic, kainate 2 | 528 | D | N | DOID:8618/oral cavity cancer;DOID:1612/breast cancer;DOID:11054/urinary bladder cancer |
| Q13349 | ITGAD | Integrin alpha-D | 1070 | D | N | DOID:363/uterine cancer;DOID:1993/rectum cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| Q4ZHG4 | FNDC1 | Fibronectin type III domain-containing protein 1 | 253 | K | N | DOID:363/uterine cancer;DOID:1793/pancreatic cancer;DOID:1993/rectum cancer |
| Q685J3 | MUC17 | Mucin-17 | 2784 | T | N | DOID:2394/ovarian cancer;DOID:11934/head and neck cancer;DOID:0060119/pharynx cancer |
| Q6P1J6 | PLB1 | Phospholipase B1, membrane-associated | 645 | D | N | DOID:363/uterine cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q6UWW8 | CES3 | Carboxylesterase 3 | 161 | D | N | DOID:4159/skin cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| Q6UX06 | OLFM4 | Olfactomedin-4 | 372 | R | S | DOID:1793/pancreatic cancer;DOID:1612/breast cancer;DOID:5041/esophageal cancer |
| Q7Z304 | MAMDC2 | MAM domain-containing protein 2 | 319 | D | N | DOID:4159/skin cancer;DOID:1612/breast cancer;DOID:11054/urinary bladder cancer |
| Q7Z5H5 | VN1R4 | Vomeronasal type-1 receptor 4 | 265 | L | S | DOID:4362/cervical cancer;DOID:1793/pancreatic cancer;DOID:11934/head and neck cancer |
| Q86TH1 | ADAMTSL2 | ADAMTS-like protein 2 | 44 | D | N | DOID:10534/stomach cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| Q8IZD2 | KMT2E | Histone-lysine N-methyltransferase 2E | 902 | D | N | DOID:363/uterine cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| Q8N158 | GPC2 | Glypican-2 | 200 | D | N | DOID:1612/breast cancer;DOID:5041/esophageal cancer;DOID:5672/large intestine cancer |
| Q8N8F6 | YIPF7 | Protein YIPF7 | 141 | D | N | DOID:4362/cervical cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q8NFZ4 | NLGN2 | Neuroligin-2 | 542 | A | T | DOID:8618/oral cavity cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q8NGZ4 | OR2G3 | Olfactory receptor 2G3 | 159 | H | N | DOID:363/uterine cancer;DOID:219/colon cancer;DOID:1324/lung cancer |
| Q8TC71 | SPATA18 | Mitochondria-eating protein | 404 | K | N | DOID:363/uterine cancer;DOID:1319/brain cancer;DOID:1793/pancreatic cancer;DOID:1324/lung cancer |
| Q8TDM6 | DLG5 | Disks large homolog 5 | 1799 | D | N | DOID:363/uterine cancer;DOID:0060119/pharynx cancer;DOID:11934/head and neck cancer |
| Q92556 | ELMO1 | Engulfment and cell motility protein 1 | 55 | D | N | DOID:4159/skin cancer;DOID:1319/brain cancer;DOID:3070/malignant glioma |
| Q92736 | RYR2 | Ryanodine receptor 2 | 898 | I | T | DOID:0060119/pharynx cancer;DOID:1324/lung cancer;DOID:11934/head and neck cancer |
| Q96PZ7 | CSMD1 | CUB and sushi domain-containing protein 1 | 3053 | D | N | DOID:4159/skin cancer;DOID:10534/stomach cancer;DOID:219/colon cancer |
| Q9BXX0 | EMILIN2 | EMILIN-2 | 759 | K | N | DOID:363/uterine cancer;DOID:1324/lung cancer;DOID:11054/urinary bladder cancer |
| Q9H2B2 | SYT4 | Synaptotagmin-4 | 89 | K | N | DOID:363/uterine cancer;DOID:1612/breast cancer;DOID:219/colon cancer |
| Q9H9P2 | CHODL | Chondrolectin | 186 | P | S | DOID:4159/skin cancer;DOID:1612/breast cancer;DOID:1324/lung cancer |
| Q9UKJ8 | ADAM21 | Disintegrin and metalloproteinase domain-containing protein 21 | 278 | D | N | DOID:4159/skin cancer;DOID:219/colon cancer;DOID:5672/large intestine cancer |
| Q9UQP3 | TNN | Tenascin-N | 1091 | D | N | DOID:0060119/pharynx cancer;DOID:1324/lung cancer;DOID:11934/head and neck cancer |
| Q9Y5F1 | PCDHB12 | Protocadherin beta-12 | 556 | D | N | DOID:363/uterine cancer;DOID:4159/skin cancer;DOID:10534/stomach cancer |
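The "Cancer Types" column in both tables uses a `DOID:<id>/<name>` format with entries separated by semicolons, so the "three or more cancer types" criterion can be checked by parsing that field. This is a minimal sketch; the function names are hypothetical and counting distinct DOID identifiers is an assumption about how cancer types were tallied.

```python
def parse_cancer_types(field):
    """Split 'DOID:219/colon cancer;DOID:1324/lung cancer' into (id, name) pairs."""
    pairs = []
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        doid, _, name = item.partition("/")
        pairs.append((doid.strip(), name.strip()))
    return pairs


def in_three_or_more_cancers(field, minimum=3):
    """True if the field lists at least `minimum` distinct DOID identifiers."""
    return len({doid for doid, _ in parse_cancer_types(field)}) >= minimum


# "Cancer Types" value of the KIT N486D row in the LOG table.
kit = ("DOID:184/bone cancer;DOID:1993/rectum cancer;"
       "DOID:219/colon cancer;DOID:5672/large intestine cancer")
print(parse_cancer_types(kit))
print(in_three_or_more_cancers(kit))
```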
Figure 5. Flowchart of the identification of LOG and GOG. The complete human proteome was retrieved from UniProtKB/Swiss-Prot, and sequences of included proteins were analyzed by string search and by NetNGlyc to identify all potential NLGs. High-confidence annotations of NLGs were also retrieved from the specified databases and incorporated into the comprehensive NLG dataset. NLGs were then mapped to somatic nsSNVs reported by cancer genomics databases and germline variations reported by dbSNP. The impact of variation on NLGs was analyzed, and for the subset resulting in loss or gain of NLG (LOG and GOG, respectively), presence in cancer samples was reported.