Rajaram Gana, Sona Vasudevan.
Abstract
BACKGROUND: To date, no one has claimed to have found a consensus sequon for O-glycosylation. Predicting the likelihood of O-glycosylation from sequence and structural information using classical regression analysis is therefore quite difficult. In particular, if a binary response is used to distinguish between O-glycosylated and non-O-glycosylated sequences, an appropriate set of non-O-glycosylatable sequences is hard to find.
Keywords: N-glycosylation; O-glycosylation; consensus sequon; linear; phosphorylation; probability model; ridge regression
Year: 2019 PMID: 31253080 PMCID: PMC6599295 DOI: 10.1186/s12860-019-0200-9
Source DB: PubMed Journal: BMC Mol Cell Biol ISSN: 2661-8850
Percentage distribution of amino acids grouped according to their physicochemical properties around the O-GlcNAc glycosylated S/T-site in the data
| Amino Acid Position | Positively charged | Negatively charged | Polar uncharged | Cysteine | Hydrophobic |
|---|---|---|---|---|---|
| –8 | 14.71 | 8.53 | 40.88 | 0.59 | 35.29 |
| –7 | 13.82 | 8.82 | 36.47 | 1.47 | 39.41 |
| –6 | 14.12 | 7.94 | 42.35 | 0.59 | 35.00 |
| –5 | 18.53 | 8.82 | 32.35 | 0.59 | 39.71 |
| –4 | 18.82 | 8.24 | 35.59 | 0 | 37.35 |
| –3 | 10.88 | 5.00 | 30.00 | 0.88 | 53.24 |
| –2 | 14.12 | 4.41 | 28.82 | 0.29 | 52.35 |
| –1 | 8.24 | 3.82 | 38.24 | 0.59 | 49.12 |
| 0 ( | |||||
| +1 | 10.59 | 8.24 | 52.35 | 0.29 | 28.53 |
| +2 | 6.47 | 6.47 | 45.29 | 0.29 | 41.47 |
| +3 | 8.82 | 4.71 | 44.71 | 0.59 | 41.18 |
| +4 | 10.88 | 6.18 | 48.24 | 0 | 34.71 |
| +5 | 16.47 | 5.00 | 32.35 | 0.59 | 45.59 |
| +6 | 16.47 | 6.18 | 37.06 | 0.29 | 40.00 |
| +7 | 15.88 | 6.76 | 44.41 | 0.29 | 32.65 |
| +8 | 14.41 | 6.47 | 42.94 | 0.59 | 35.59 |
Marginal distributions (%) of amino acids by position relative to the S/T-site: eight positions to the left/right of the S/T-site for O-GlcNAc glycosylated sequences
| Amino Acid | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | +8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 9.1 | 12.7 | 7.1 | 12.1 | 11.5 | 8.8 | 11.8 | 10.3 | 9.1 | 17.4 | 12.1 | 8.5 | 10.3 | 10.6 | 7.4 | 7.1 |
| C | 0.6 | 1.5 | 0.6 | 0.6 | 0 | 0.9 | 0.3 | 0.6 | 0.3 | 0.3 | 0.6 | 0 | 0.6 | 0.3 | 0.3 | 0.6 |
| D | 4.4 | 2.7 | 2.7 | 3.8 | 4.1 | 1.5 | 2.4 | 2.7 | 2.9 | 3.5 | 2.1 | 2.4 | 3.2 | 2.4 | 3.8 | 2.4 |
| E | 4.1 | 6.2 | 5.3 | 5.0 | 4.1 | 3.5 | 2.1 | 1.2 | 5.3 | 2.9 | 2.7 | 3.8 | 1.8 | 3.8 | 2.9 | 4.1 |
| F | 1.5 | 1.8 | 2.9 | 1.5 | 2.4 | 2.4 | 1.8 | 1.8 | 2.4 | 1.2 | 1.8 | 1.5 | 2.7 | 2.4 | 1.2 | 1.8 |
| G | 8.5 | 6.2 | 5.9 | 5.9 | 5.3 | 6.8 | 6.8 | 5.6 | 9.4 | 8.5 | 7.1 | 9.1 | 5.6 | 6.5 | 8.8 | 5.9 |
| H | 3.2 | 0.9 | 2.1 | 0.6 | 3.5 | 2.7 | 1.8 | 1.5 | 1.5 | 2.7 | 1.5 | 1.2 | 2.1 | 3.2 | 1.5 | 3.2 |
| I | 4.4 | 4.7 | 3.2 | 4.1 | 3.5 | 4.4 | 2.9 | 4.4 | 2.9 | 2.1 | 3.5 | 3.2 | 5.9 | 3.8 | 1.8 | 2.9 |
| K | 7.1 | 6.5 | 5.3 | 8.2 | 7.1 | 3.8 | 7.1 | 3.8 | 4.4 | 1.2 | 2.4 | 5.0 | 7.4 | 5.9 | 7.4 | 5.9 |
| L | 7.4 | 5.6 | 6.5 | 10.0 | 5.9 | 5.6 | 7.1 | 5.0 | 5.6 | 4.1 | 7.7 | 5.3 | 7.7 | 6.2 | 5.9 | 6.8 |
| M | 1.2 | 1.5 | 1.8 | 1.5 | 2.1 | 0.3 | 1.2 | 0.9 | 0.6 | 1.2 | 0.3 | 1.8 | 2.4 | 0.9 | 1.5 | 1.2 |
| N | 2.9 | 1.8 | 3.5 | 0.9 | 1.5 | 2.1 | 1.8 | 0.3 | 1.5 | 2.1 | 3.8 | 1.8 | 1.2 | 2.1 | 1.8 | 2.4 |
| P | 6.2 | 8.2 | 8.5 | 6.5 | 7.7 | 17.9 | 19.4 | 6.8 | 2.7 | 8.8 | 9.1 | 6.5 | 7.9 | 11.8 | 7.7 | 8.8 |
| Q | 4.7 | 2.4 | 5.6 | 3.2 | 4.7 | 2.7 | 5.6 | 2.7 | 9.1 | 5.6 | 7.9 | 2.7 | 3.8 | 5.6 | 6.2 | 6.2 |
| R | 4.4 | 6.5 | 6.8 | 9.7 | 8.2 | 4.4 | 5.3 | 2.9 | 4.7 | 2.7 | 5.0 | 4.7 | 7.1 | 7.4 | 7.1 | 5.3 |
| S | 14.4 | 13.2 | 14.7 | 10.0 | 12.9 | 6.8 | 8.8 | 12.1 | 16.5 | 19.4 | 14.7 | 19.1 | 8.5 | 11.8 | 13.2 | 15.0 |
| T | 6.8 | 8.8 | 9.7 | 8.8 | 8.2 | 8.5 | 4.7 | 13.8 | 13.8 | 8.8 | 8.8 | 13.5 | 10.0 | 7.1 | 11.8 | 10.0 |
| V | 5.6 | 5.0 | 5.0 | 4.1 | 4.4 | 13.8 | 8.2 | 20.0 | 5.3 | 6.8 | 6.8 | 7.9 | 8.8 | 4.4 | 7.4 | 7.1 |
| W | 0.6 | 0.9 | 1.2 | 0.3 | 0.3 | 0.3 | 0 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0.9 | 0.6 |
| Y | 2.9 | 3.2 | 1.8 | 3.2 | 2.7 | 2.9 | 1.2 | 3.8 | 2.1 | 0.9 | 2.1 | 2.1 | 3.2 | 4.1 | 1.8 | 2.9 |
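A marginal distribution table of this kind can be produced by counting residues column by column over aligned windows centered on the modified site. The sketch below is a minimal illustration, not the authors' code: the function name `marginal_distributions` is hypothetical, and the six 17-mers used as toy input are the sequences listed in the in-sample misprediction table of this document, chosen only because they are available here.

```python
from collections import Counter


def marginal_distributions(windows, flank=8):
    """Return {offset: {amino_acid: percent}} for offsets -flank..+flank, excluding 0."""
    expected = 2 * flank + 1
    for w in windows:
        if len(w) != expected:
            raise ValueError(f"expected {expected}-mers, got {w!r}")
    n = len(windows)
    dist = {}
    for offset in range(-flank, flank + 1):
        if offset == 0:
            continue  # the modified S/T residue itself is not tabulated
        counts = Counter(w[flank + offset] for w in windows)
        dist[offset] = {aa: 100.0 * c / n for aa, c in counts.items()}
    return dist


# Toy input (assumption: used for illustration only, not the 340-sequence dataset).
windows = [
    "ILEKIPPDSEATLVLVG",
    "ESELGRGATSIVYRCKQ",
    "LKVVTDETSFVLVSDKH",
    "RTKQTARKSTGGKAPRK",
    "IKDGATMKTFCGTPEYL",
    "IPELNGKLTGMAFRVPT",
]
dist = marginal_distributions(windows)
print(round(dist[-1]["K"], 1))  # percentage of lysine at position -1
```

Each column of the resulting dictionary sums to 100, as each column of the table does up to rounding.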
Marginal distributions (%) of amino acids by position relative to the S/T-site: eight positions to the left/right of the S/T-site for O-GalNAc glycosylated sequences
| Amino Acid | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | +8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 7.7 | 7.2 | 7.0 | 7.2 | 6.6 | 7.0 | 9.1 | 8.5 | 7.7 | 10.2 | 7.9 | 6.8 | 8.0 | 8.2 | 7.9 | 7.8 |
| C | 0.9 | 0.8 | 0.9 | 0.4 | 0.8 | 0.4 | 0.4 | 0.1 | 0.9 | 0.1 | 0.3 | 0.4 | 0.7 | 0.6 | 0.7 | 1.4 |
| D | 4.3 | 4.5 | 4.1 | 3.8 | 4.1 | 3.4 | 4.2 | 1.3 | 3.9 | 4.0 | 4.3 | 3.7 | 5.2 | 4.6 | 4.6 | 5.0 |
| E | 6.4 | 6.2 | 6.6 | 6.8 | 6.4 | 8.3 | 6.6 | 4.4 | 7.0 | 6.2 | 5.2 | 7.0 | 7.8 | 6.6 | 7.1 | 8.0 |
| F | 2.4 | 2.4 | 2.3 | 2.2 | 2.3 | 2.4 | 2.7 | 2.8 | 1.6 | 2.0 | 2.7 | 2.0 | 2.3 | 2.3 | 2.0 | 2.2 |
| G | 6.9 | 6.7 | 6.9 | 6.8 | 6.2 | 5.2 | 7.6 | 5.3 | 5.6 | 7.2 | 6.0 | 6.8 | 6.1 | 7.6 | 7.8 | 6.8 |
| H | 3.6 | 3.8 | 2.9 | 4.3 | 4.5 | 3.6 | 3.6 | 4.5 | 3.8 | 4.0 | 2.8 | 3.5 | 3.5 | 3.2 | 3.4 | 3.0 |
| I | 2.9 | 3.2 | 2.7 | 3.0 | 3.6 | 3.4 | 3.0 | 3.2 | 2.4 | 2.7 | 2.3 | 2.7 | 3.2 | 2.1 | 2.0 | 2.3 |
| K | 5.7 | 5.4 | 5.7 | 5.8 | 5.1 | 5.6 | 5.5 | 3.3 | 4.6 | 5.6 | 3.6 | 6.0 | 5.3 | 5.6 | 6.6 | 7.0 |
| L | 6.8 | 7.1 | 6.7 | 7.2 | 6.8 | 8.0 | 7.1 | 5.6 | 7.5 | 7.1 | 6.1 | 6.7 | 7.2 | 7.7 | 7.1 | 7.0 |
| M | 1.7 | 1.3 | 1.0 | 0.8 | 0.8 | 1.0 | 1.1 | 0.9 | 0.7 | 1.0 | 0.9 | 1.0 | 1.1 | 1.0 | 1.1 | 1.1 |
| N | 2.1 | 2.0 | 2.5 | 2.4 | 2.1 | 2.0 | 1.2 | 1.7 | 1.7 | 2.6 | 2.1 | 2.0 | 2.2 | 2.2 | 2.3 | 2.6 |
| P | 8.3 | 9.6 | 9.4 | 10.2 | 9.4 | 11.0 | 11.5 | 15.8 | 11.6 | 12.2 | 17.8 | 12.3 | 9.7 | 10.6 | 8.9 | 8.2 |
| Q | 4.7 | 4.3 | 4.4 | 4.4 | 4.7 | 4.2 | 4.5 | 3.9 | 4.9 | 3.9 | 4.2 | 3.9 | 3.6 | 4.0 | 4.9 | 4.2 |
| R | 7.5 | 6.5 | 7.5 | 7.2 | 7.5 | 6.8 | 6.2 | 5.1 | 5.9 | 5.8 | 4.7 | 6.5 | 6.9 | 7.5 | 6.7 | 6.8 |
| S | 9.4 | 10.6 | 11.5 | 10.1 | 11.8 | 10.3 | 9.0 | 10.3 | 10.4 | 11.3 | 10.8 | 10.3 | 9.6 | 10.1 | 10.5 | 9.1 |
| T | 11.5 | 10.8 | 10.6 | 10.2 | 10.1 | 8.3 | 8.0 | 11.2 | 12.0 | 6.8 | 10.9 | 10.2 | 9.5 | 9.7 | 10.0 | 10.4 |
| V | 5.2 | 5.8 | 5.4 | 5.2 | 5.5 | 6.7 | 6.0 | 9.7 | 6.2 | 5.1 | 5.4 | 6.1 | 6.0 | 4.5 | 4.9 | 4.8 |
| W | 1.0 | 0.6 | 0.5 | 0.8 | 0.5 | 0.9 | 0.8 | 0.5 | 0.6 | 0.7 | 0.5 | 1.0 | 0.6 | 0.5 | 0.5 | 1.3 |
| Y | 1.3 | 1.4 | 1.2 | 1.4 | 1.5 | 1.6 | 1.8 | 1.7 | 1.2 | 1.5 | 1.7 | 1.4 | 1.4 | 1.5 | 1.3 | 1.3 |
Marginal distributions (%) of amino acids by position relative to the S/T/Y/H-site: seven positions^a to the left/right of the S/T-site for phosphorylated sequences
| Amino Acid | -7 | -6 | -5 | -4 | -3 | -2 | -1 | +1 | +2 | +3 | +4 | +5 | +6 | +7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 7 | 6.9 | 6.8 | 6.6 | 6.4 | 6.8 | 6.7 | 5.7 | 6.4 | 6.4 | 6.4 | 6.6 | 6.9 | 6.7 |
| C | 1.4 | 1.4 | 1.3 | 1.2 | 1.2 | 1.3 | 1.3 | 1.2 | 1.4 | 1.2 | 1.3 | 1.3 | 1.3 | 2.5 |
| D | 4.9 | 4.9 | 4.9 | 5 | 5 | 5.1 | 6.1 | 5.4 | 5.9 | 5.6 | 5.2 | 5.3 | 5.1 | 5.1 |
| E | 7.3 | 7.3 | 7 | 7.2 | 6.8 | 6.6 | 6 | 6.6 | 9.5 | 8.5 | 7.8 | 7.6 | 7.8 | 7.5 |
| F | 2.6 | 2.6 | 2.8 | 2.6 | 2.4 | 2.5 | 2.8 | 3.1 | 2.3 | 2.5 | 2.6 | 2.7 | 2.7 | 2.7 |
| G | 6.9 | 6.7 | 6.6 | 7.1 | 6.8 | 6.7 | 8.1 | 7.6 | 6.9 | 7.1 | 6.5 | 6.6 | 6.6 | 6.5 |
| H | 2.2 | 2.2 | 3.3 | 2.2 | 2.5 | 2.1 | 3.7 | 2 | 2 | 2.1 | 2.1 | 2.3 | 2.2 | 2.3 |
| I | 3.6 | 3.7 | 3.6 | 3.5 | 3.4 | 4.1 | 3.9 | 3.6 | 3.5 | 3.5 | 3.8 | 3.5 | 3.6 | 3.5 |
| K | 7.4 | 7.3 | 7 | 7.3 | 7.2 | 6.1 | 6 | 4.5 | 6 | 7.6 | 6.8 | 7.3 | 7.5 | 7.2 |
| L | 7.9 | 8.3 | 8.8 | 7.7 | 8 | 8 | 9.1 | 9.1 | 7.9 | 7.8 | 8.7 | 7.9 | 8 | 8.1 |
| M | 2.1 | 1.8 | 1.8 | 1.8 | 1.8 | 1.9 | 1.7 | 1.8 | 1.7 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 |
| N | 3.4 | 3.4 | 3.4 | 3.5 | 3.3 | 3.8 | 3.8 | 2.6 | 3.3 | 3.4 | 3.3 | 3.3 | 3.3 | 3.4 |
| P | 7.1 | 7 | 7 | 6.9 | 6.7 | 7.9 | 7.3 | 15.5 | 8 | 7.3 | 8.1 | 7.6 | 7.4 | 7 |
| Q | 4.5 | 4.4 | 4.3 | 4.8 | 4.1 | 4.4 | 3.8 | 4.6 | 3.9 | 4.3 | 4.2 | 4.3 | 4.4 | 4.3 |
| R | 7.4 | 7.5 | 7.7 | 7.3 | 11 | 7.3 | 6.8 | 5.1 | 6.1 | 6.8 | 6.4 | 6.9 | 7 | 7 |
| S | 10.6 | 10.7 | 10.3 | 11.7 | 10.9 | 11.7 | 9.8 | 9.2 | 11.6 | 11.1 | 11.3 | 10.6 | 10.8 | 10.5 |
| T | 5.7 | 5.5 | 5.3 | 5.4 | 5 | 5.7 | 4.4 | 4.1 | 5.9 | 5.1 | 5.7 | 5.5 | 5.6 | 5.5 |
| V | 5.2 | 5.2 | 5.2 | 5.1 | 4.8 | 5.4 | 5.5 | 5.3 | 5.2 | 5 | 5.3 | 5 | 5.3 | 5.1 |
| W | 0.8 | 0.7 | 0.7 | 0.7 | 0.6 | 0.6 | 0.6 | 0.8 | 0.5 | 0.7 | 0.7 | 0.7 | 0.7 | 0.8 |
| Y | 2.2 | 2.2 | 2.2 | 2.1 | 2 | 2.1 | 2.3 | 2.1 | 2 | 2.1 | 2.1 | 3.2 | 2.1 | 2.5 |
a Only seven positions to the left/right of the phosphorylation site were collected
Testing whether the proportions of an amino acid over the positions it occupies are pairwise different across PTMs of proteins^a
| Comparison | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O-GlcNAc vs O-GalNAc glycosylation | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| O-GlcNAc glycosylation vs phosphorylation | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| O-GalNAc glycosylation vs phosphorylation | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
a 1/0 indicates that the difference is/is not statistically significant at the 5% level. The first comparison uses 16 position pairs, while the other two use 14 pairs
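The test behind these 1/0 flags is not named in this extract. As one plausible reading, the sketch below applies a paired t-test to the 16 position-wise percentages of alanine (A) in the O-GlcNAc and O-GalNAc tables of this document. The choice of a paired t-test, the helper name `paired_t`, and the tabulated critical value are assumptions made for illustration; the authors may have used a different procedure.

```python
import math
import statistics

# Alanine (A) percentages at positions -8..-1, +1..+8, copied from the two
# marginal distribution tables (O-GlcNAc first, O-GalNAc second).
glcnac_A = [9.1, 12.7, 7.1, 12.1, 11.5, 8.8, 11.8, 10.3,
            9.1, 17.4, 12.1, 8.5, 10.3, 10.6, 7.4, 7.1]
galnac_A = [7.7, 7.2, 7.0, 7.2, 6.6, 7.0, 9.1, 8.5,
            7.7, 10.2, 7.9, 6.8, 8.0, 8.2, 7.9, 7.8]


def paired_t(x, y):
    """Return (mean difference, t statistic) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err


mean_diff, t_stat = paired_t(glcnac_A, galnac_A)

# Two-sided 5% critical value of Student's t with 16 - 1 = 15 degrees of freedom.
T_CRIT_DF15 = 2.131
significant = abs(t_stat) > T_CRIT_DF15
print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, significant = {significant}")
```

Under this assumed test the alanine difference is flagged as significant, which agrees with the 1 reported for A in the first row of the table.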
UniProt Accession No. and position pair counts that are common between PTMs of proteins
| Sequences in the empirical data that are: | N-glycosylated | O-GlcNAc glycosylated | O-GalNAc glycosylated | Phosphorylated |
|---|---|---|---|---|
| N-glycosylated | 361 | 0 | 0 | 13 |
| O-GlcNAc glycosylated | | 638 | 0 | 144 |
| O-GalNAc glycosylated | | | 2,079 | 224 |
| Phosphorylated | | | | 227,810 |
N-glycosylated sequences that are also phosphorylated
| UniProt Accession No. | Protein Name | Phosphorylated at UniProt position (residue) | |
|---|---|---|---|
| O00206 | Toll-like receptor 4 | 175 ( | 173 |
| O15455 | Toll-like receptor 3 | 72 ( | 70 |
| P00533 | Epidermal growth factor receptor | 354 ( | 352 |
| P02675 | Fibrinogen beta chain | 396 ( | 394 |
| P02788 | Lactotransferrin | 499 ( | 497 |
| P04629 | High affinity nerve growth factor receptor | 123 ( | 121 |
| P05187 | Alkaline phosphatase | 273 ( | 271 |
| P06213 | Insulin receptor | 366 ( | 364 |
| P06213 | Insulin receptor | 447 ( | 445 |
| P07711 | Cathepsin L1 | 223 ( | 221 |
| P12821 | Angiotensin-converting enzyme | 716 ( | 714 |
| Q96FE5 | Leucine-rich repeat and immunoglobulin-like domain-containing Nogo receptor-interacting protein 1 | 295 ( | 293 |
| Q9NPH3 | Interleukin-1 receptor accessory protein | 113 ( | 111 |
Description of the collected data^a
| Dataset name given | Data description | Identified in | Sample Size | Source |
|---|---|---|---|---|
|
| Oglycos_status = yes | 1,105. Where 998 are human with unique PDB-IDs; of these, only 16 are inferred from known | dbOGAP | |
|
| Human | Ogly_only_seq = yes | 376. These are unique UniProt Accession No. and position pairs. | dbOGAP |
|
| Extract of human sequences from | Not identified as it is derivable using software like | 39. Of these, 28 are experimentally validated and the remaining 11 are inferred. These 39 sequences become 998 in | N/A |
|
| Merge | Not identified as it is derivable | 340 | N/A |
|
| Additional extract of | Ogly_21 = yes | 411. Of these, 59.12% are glycosylated at | PhosphoSitePlus |
|
| Merge | GLCNAC_s1 = yes | 259. This is used as out-of-sample data. Note, 152 of the 340 sequences in | N/A |
|
| GALNAC_s1 = yes | 2,079. This is used as out-of-sample data. Of these, 60.27% are glycosylated at | PhosphoSitePlus | |
|
| glyco_status = yes | 6,328. Of these, 2,422 are “Homo sapiens (Human)”. Of the 2,422, the count of sequences with more than one sugar bound is 1,083. These 1,083 sequences are in-sample data for the proteins with sequence | Gana et al.[ | |
|
| [ | Not identified. This is archived in a separate file: | 363,256. Of these, 227,810 are human with amino acids in ±7 positions of the | PhosphoSitePlus |
|
| Human sequences with the | wstw = yes | 236. This extract is unique in terms of Uniprot Accession No. & position pairs | UniProt |
a The columns describe the dataset name, counts of the sequences collected, description of the data and its source. For example, 1,105 O-GlcNAc glycosylated proteins with sequence and structural data are collected and stored as dataset dbogap-str. This data is identified in glycos_public.xlsx by “yes” in column Oglycos_status. In terms of unique PDB-IDs, there are 998 sequences in this data. The last column cites the source of the collected data, dbOGAP
Empirical occurrence rate of the identified sequon in PTMs of proteins (% of sequences in the collected data)
| Sequon^a (viewed from S/T/Y/H as “center”) | O-GlcNAc glycosylated | O-GalNAc glycosylated | Phosphorylated |
|---|---|---|---|
|
| 0.24% | 0.43% | 0.145% |
|
| 1.22% (0.5%, 2.8%) b | 0.77% (0.5%, 1.2%) b | 2.858% (2.8%, 2.9%) b |
|
| 98.54% (96.9%, 99.3%) b | 98.80% (98.2%, 99.2%) b | 80.453% |
|
| Not applicable | Not applicable | 15.748% |
|
| Not applicable | Not applicable | 0.006% |
|
| Not applicable | Not applicable | 96.2% (96.1%, 96.3%) b |
|
| 0% | 0% | 0.0066% (0.004%, 0.011%) b |
|
| 100% | 98.94% (98.4%, 99.3%) b | 82.3133% |
|
| 0% | 0.58% (0.33%, 1.0%) b | 0.6422% |
|
| 0% | 0.48% (0.26%, 0.88%) b | 0.4930% |
|
| Not applicable | Not applicable | 0.0004% |
|
| Not applicable | Not applicable | 16.2587% |
|
| Not applicable | Not applicable | 0.1365% |
|
| Not applicable | Not applicable | 0.1462% |
|
a X denotes any amino acid. b 95% confidence interval [44]
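The interval method cited as [44] is not reproduced in this extract. The sketch below assumes a Wilson score interval and assumes that the 1.22% entry for O-GlcNAc corresponds to 5 of 411 sequences (411 being the size of the PhosphoSitePlus O-GlcNAc extract described earlier). Both are inferences made for illustration; under them the reported bounds of (0.5%, 2.8%) are recovered after rounding.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts: 5 of 411 sequences (5 / 411 = 1.22%).
lo, hi = wilson_interval(5, 411)
print(f"{100 * 5 / 411:.2f}% ({100 * lo:.1f}%, {100 * hi:.1f}%)")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] for proportions this close to zero, which is consistent with the asymmetric bounds in the table.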
Summary of the modeling/validation strategy
| Row | Data used | What the data is used for | Outcome |
|---|---|---|---|
| 1 | The 998 sequences in | Predicting the likelihood of | Table |
| 2 | The 340 sequences in | Predicting the likelihood of | Table |
| 3 | The 259 sequences in | Calculating the out-of-sample mispredictions rate with the LPM estimated for the exercise outlined in Row 2 of this Table | 54 of the 259 sequences (≈ 21%) are mispredicted as not being |
| 4 | The 2,079 sequences in | Calculating the out-of-sample mispredictions rate with the LPM estimated for the exercise outlined in Row 2 of this Table | 656 of the 2,079 (≈ 31.6%) are mispredicted as not being |
| 5 | The 236 sequences in | To see if any of these are | None are |
LPM predicting O-glycosylation probabilities given structural and sequence data
| Variable^a | β | |t| | Variable^a | β | |t| | Variable^a | β | |t| |
|---|---|---|---|---|---|---|---|---|
| intercept | -0.1309 | 9.71 | m8N | -0.0921 | 4.18 | p7T | -0.1431 | 9.24 |
| m1D | 0.1878 | 10.74 | m8P | -0.1516 | 11.05 | p7V | -0.1309 | 8.14 |
| m1L | 0.0861 | 7.26 | m8V | -0.2180 | 12.08 | p7Y | 0.1290 | 6.75 |
| m1P | 0.7885 | 11.63 | p1A | 0.1778 | 9.83 | p8H | 0.1164 | 5.76 |
| m1R | -0.0757 | 4.66 | p1D | 0.0915 | 4.67 | p8K | 0.0570 | 3.34 |
| m3A | 0.2162 | 14.35 | p1F | 0.0858 | 5.44 | p8N | 0.1003 | 5.51 |
| m3C | -0.1315 | 6.45 | p1S | 0.0835 | 5.33 | p8Q | 0.1493 | 8.53 |
| m3L | -0.1139 | 9.49 | p1T | 0.1673 | 10.15 | pos | 0.2821 | 15.95 |
| m3N | -0.2245 | 5.86 | p1V | 0.0669 | 4.97 | ASA_zero | 0.8057 | 12.02 |
| m3T | -0.0881 | 5.14 | p2A | 0.0590 | 2.92 | II | 0.1649 | 6.95 |
| m4E | 0.1296 | 5.3 | p2H | 0.3143 | 11.32 | II´ | -0.2828 | 7.58 |
| m4F | -0.0848 | 5.56 | p2P | 0.2652 | 18.99 | Helix | 0.1610 | 14.88 |
| m4N | 0.1419 | 7.3 | p2Q | -0.1532 | 7.89 | Beta Bridges | 0.6077 | 6.53 |
| m4R | 0.1410 | 7.04 | p2Y | -0.2434 | 8.12 | Beta Hairpin | 0.7884 | 22.3 |
| m4V | -0.0545 | 4.18 | p3N | -0.1615 | 7.69 | Beta Hairpin Strand | -0.1044 | 9.83 |
| m5D | 0.2521 | 13.74 | p3W | -0.1086 | 3.41 | Phi angle | -0.0004 | 5.31 |
| m5F | -0.1257 | 6 | p4A | 0.1629 | 9.62 | |||
| m5G | 0.0700 | 3.21 | p4E | 0.1757 | 12.53 | |||
| m5I | -0.0981 | 4.85 | p4P | 0.1921 | 13.35 | |||
| m5Y | -0.1468 | 6.9 | p5C | -0.1957 | 9.63 | |||
| m6E | 0.0954 | 7.08 | p5E | 0.1718 | 7.98 | |||
| m6H | -0.1245 | 4.73 | p5H | 0.1938 | 7.88 | |||
| m6V | 0.1528 | 9.82 | p5I | -0.0760 | 4.33 | |||
| m6W | 0.1166 | 4.9 | p5Q | 0.1060 | 4.93 | |||
| m6Y | -0.1265 | 6.31 | p5T | 0.0722 | 4.61 | |||
| m7A | 0.1576 | 10.18 | p5Y | -0.1876 | 7.45 | |||
| m7E | -0.1132 | 6.87 | p6F | -0.1155 | 5.43 | |||
| m7G | -0.0685 | 4.89 | p6G | -0.1696 | 10.07 | |||
| m7H | 0.2427 | 11.33 | p6M | 0.3118 | 9.57 | |||
| m7I | 0.1222 | 7.67 | p6N | -0.1209 | 6.67 | |||
| m7K | 0.1538 | 8.77 | p6Q | -0.2641 | 12.26 | |||
| m7S | 0.0826 | 4.44 | p7A | 0.1752 | 7.04 | |||
| m8G | 0.1032 | 6.68 | p7C | -0.1281 | 5.86 | |||
| m8L | -0.1541 | 10.43 | p7E | -0.1161 | 7.36 | |||
| m8M | -0.1234 | 5.34 | p7G | -0.0951 | 5.81 | |||
a miα and piα denote amino acid α at position i to the left (minus) and right (plus) of the S/T-site, respectively. The coefficient standard error is |β|÷|t|
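The footnote's relation can be applied directly to recover standard errors from the table. The sketch below does this for three rows copied from the table; the helper name `standard_error` is hypothetical, and the rough 95% interval uses a normal approximation that is an assumption of this example rather than something stated in the source.

```python
def standard_error(beta, abs_t):
    """Standard error implied by a coefficient and its absolute t-statistic."""
    return abs(beta) / abs_t


# (variable, beta, |t|) rows copied from the table.
rows = [("m1P", 0.7885, 11.63), ("pos", 0.2821, 15.95), ("Phi angle", -0.0004, 5.31)]

for name, beta, abs_t in rows:
    se = standard_error(beta, abs_t)
    # Rough 95% interval under a normal approximation (assumption of this sketch).
    low, high = beta - 1.96 * se, beta + 1.96 * se
    print(f"{name}: SE = {se:.4f}, approx. 95% CI = ({low:.4f}, {high:.4f})")
```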
LPM mispredictions in-sample, using 50% as the cutoff probability
| UniProt Accession No. | UniProt Position | Protein Length | Sequence | Secondary Structure | Turn Type | Phi angle | Psi angle | ASA | LPM prediction |
|---|---|---|---|---|---|---|---|---|---|
| Proteins with LPM predictions < 50%, but that are deemed to be O-glycosylated in the data | |||||||||
| P02730 | 224 | 911 | ILEKIPPDSEATLVLVG | Beta Turn | IV | -113 | 168 | 18.8 | 19.3% |
| Q16566 | 57 | 473 | ESELGRGATSIVYRCKQ | Beta Turn | I | -83 | -35 | 69.9 | 34.6% |
| P16157 | 794 | 1881 | LKVVTDETSFVLVSDKH | Gamma Turn, Beta Turn | Inverse, IV | -54 | 11 | 94.7 | 36.3% |
| P68431 | 11 | 136 | RTKQTARKSTGGKAPRK | Loop | N/A | -81 | 145 | 44.4 | 36.3% |
| P31749 | 308 | 480 | IKDGATMKTFCGTPEYL | Loop | N/A | -115 | -3 | 47.6 | 44.5% |
| P04406 | 229 | 335 | IPELNGKLTGMAFRVPT | Strand | N/A | -157 | 171 | 55.5 | 44.5% |
| Proteins with LPM predictions > 50%, but that are deemed to be N-glycosylated in the data | |||||||||
| P06756 | 488 SA | 1048 | SILNQDNKTCSLPGTAL | Gamma Turn, Beta Turn | Inverse, IV | -69 | 93 | 63.5 | 51.6% |
| P04629 | 329 | 796 | THVNNGNYTLLAANPFG | Beta Hairpin Strand | N/A | -93 | 119 | 30.7 | 62.7% |
| Q92854 | 329 | 862 | SAVCAYNLSTAEEVFSH | Beta Hairpin Strand | N/A | -76 | 124 | 35.5 | 70.1% |
Sequences in the LPM estimation dataset for which ASA values are zero
| UniProt Accession No. | UniProt Position | Protein Length | Sequence | Secondary Structure | Phi angle | Psi angle | PDB ID | LPM prediction |
|---|---|---|---|---|---|---|---|---|
| P02730 | 162 | 911 | ELLRALLLKH | Strand | -68.8 | 160.7 | 1HYN | 89.3% |
| P32119 | 112 | 198 | PLLADVTRRL | Helix | -60.1 | 40.0 | 1QMV | 110.8% |
Fig. 1 Empirical CDFs of ASA values
WLS estimated LPM for predicting the probability of O-glycosylation given only sequence data
| Variable^a | β | |t| | Variable^a | β | |t| |
|---|---|---|---|---|---|
| intercept | 0.1379 | 3.27 | p1S | 0.2446 | 5.33 |
| m1C | -0.2020 | 2.11 | p1T | 0.2268 | 4.48 |
| m1F | -0.1373 | 1.96 | p2A | 0.2120 | 4.44 |
| m1P | 0.4708 | 5.44 | p2D | 0.2029 | 2.73 |
| m1T | 0.1535 | 3.25 | p2F | -0.2292 | 2.43 |
| m1V | 0.1348 | 3.41 | p2G | 0.1706 | 3.01 |
| m3F | -0.1950 | 2.92 | p2L | -0.1342 | 2.64 |
| m3L | -0.1367 | 2.65 | p2S | 0.2068 | 4.69 |
| m3P | 0.1852 | 4 | p3Q | 0.1914 | 3.12 |
| m3V | 0.1103 | 2.3 | p3S | 0.1041 | 2.29 |
| m4C | -0.6595 | 4.9 | p4A | 0.2149 | 3.66 |
| m4F | -0.2551 | 4.1 | p4G | 0.2195 | 3.93 |
| m4L | -0.1180 | 2.28 | p4S | 0.1277 | 3.07 |
| m4Q | -0.2802 | 4.52 | p4T | 0.1867 | 3.81 |
| m4V | -0.2253 | 4.37 | p5A | 0.0934 | 1.77 |
| m4Y | -0.2037 | 3.03 | p5C | -0.3927 | 4.63 |
| m5A | 0.1224 | 2.28 | p5L | -0.1029 | 2.04 |
| m5C | -0.2515 | 2.6 | p5N | -0.3141 | 3.6 |
| m5H | -0.3195 | 3.45 | p5W | -0.7183 | 3.72 |
| m5N | -0.2492 | 2.9 | p6C | -0.3787 | 4.2 |
| m5R | 0.1419 | 2.57 | p7A | 0.1763 | 2.71 |
| m5W | -0.2752 | 1.96 | p7C | -0.2791 | 3.06 |
| m5Y | -0.1507 | 2.15 | p7G | 0.1366 | 2.5 |
| m6R | 0.1362 | 2.31 | p7K | 0.1578 | 2.73 |
| m6Y | -0.2011 | 2.49 | p7T | 0.1968 | 3.98 |
| m7L | -0.0910 | 1.86 | p8F | -0.1867 | 2.86 |
| m7M | -0.2490 | 2.83 | p8S | 0.1625 | 3.5 |
| m7R | 0.2907 | 4.38 | pos | 0.2406 | 4.35 |
| m8A | 0.1563 | 2.7 | |||
| m8K | 0.1726 | 2.9 | |||
| p1A | 0.2092 | 3.78 | |||
| p1G | 0.1706 | 3.23 | |||
| p1K | 0.2524 | 3.37 | |||
| p1P | 0.4464 | 3.26 | |||
| p1Q | 0.1728 | 3.05 | |||
a miα and piα denote amino acid α at position i to the left (minus) and right (plus) of the S/T-site, respectively. The coefficient standard error is |β|÷|t|
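Because an LPM is linear in its dummy variables, a score is obtained by adding the intercept to the coefficients of whichever dummies are switched on for a given window. The sketch below shows that bookkeeping with a subset of the coefficients from the sequence-only table. The helper names are hypothetical, the example window is taken from the in-sample misprediction table, and the `pos` term is left out because its definition does not appear in this extract. The result is therefore a partial linear score, not the published prediction.

```python
# Subset of coefficients copied from the sequence-only WLS table.
# The `pos` variable is deliberately omitted (its definition is not given here).
COEF = {
    "intercept": 0.1379,
    "m1P": 0.4708, "m1T": 0.1535,
    "p1T": 0.2268, "p2G": 0.1706, "p5A": 0.0934, "p7K": 0.1578,
}


def dummy_names(window, flank=8):
    """Names such as 'm1K' or 'p2G' for each flanking residue of a centered window."""
    names = []
    for offset in range(-flank, flank + 1):
        if offset == 0:
            continue  # skip the S/T-site itself
        side = "m" if offset < 0 else "p"
        names.append(f"{side}{abs(offset)}{window[flank + offset]}")
    return names


def partial_score(window, coef=COEF):
    """Intercept plus the coefficients of the dummies present in the window."""
    return coef["intercept"] + sum(coef.get(name, 0.0) for name in dummy_names(window))


score = partial_score("RTKQTARKSTGGKAPRK")
print(dummy_names("RTKQTARKSTGGKAPRK"))
print(round(score, 4))
```

For this window only `p1T`, `p2G`, and `p5A` match tabulated variables, so the partial score is the intercept plus those three coefficients.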
Fig. 2 Distribution of (unobservable) errors for the LPM in Table 11
Fig. 3 Cook’s distances for the LPM in Table 11
Fig. 4 LPM residuals vs. pos and phi angle
LS estimated LPM predicting O-glycosylation probabilities given structural and sequence data
| Variable^a | β | White’s |t| | Variable^a | β | White’s |t| | Variable^a | β | White’s |t| |
|---|---|---|---|---|---|---|---|---|
| intercept | -0.1177 | 5.17 | m8N | -0.0842 | 3.18 | p7T | -0.1371 | 6.97 |
| m1D | 0.1827 | 5.30 | m8P | -0.1387 | 6.82 | p7V | -0.1173 | 4.68 |
| m1L | 0.0930 | 3.71 | m8V | -0.2103 | 7.60 | p7Y | 0.1188 | 4.05 |
| m1P | 0.7844 | 13.86 | p1A | 0.1754 | 5.59 | p8H | 0.1194 | 2.51 |
| m1R | -0.0619 | 2.34 | p1D | 0.0984 | 2.65 | p8K | 0.0484 | 1.52 |
| m3A | 0.2046 | 8.22 | p1F | 0.0831 | 2.43 | p8N | 0.1052 | 3.22 |
| m3C | -0.1245 | 4.37 | p1S | 0.0884 | 3.27 | p8Q | 0.1466 | 4.52 |
| m3L | -0.1099 | 6.18 | p1T | 0.1567 | 4.59 | pos | 0.2509 | 8.51 |
| m3N | -0.2151 | 3.23 | p1V | 0.0623 | 3.34 | ASA_zero | 0.7952 | 9.89 |
| m3T | -0.0743 | 2.90 | p2A | 0.0697 | 1.33 | II | 0.1387 | 3.12 |
| m4E | 0.1109 | 2.61 | p2H | 0.2885 | 4.65 | II´ | -0.2684 | 4.04 |
| m4F | -0.0881 | 3.43 | p2P | 0.2730 | 9.84 | Helix | 0.1508 | 7.66 |
| m4N | 0.1405 | 3.27 | p2Q | -0.1379 | 4.70 | Beta Bridges | 0.6330 | 9.35 |
| m4R | 0.1408 | 3.71 | p2Y | -0.1906 | 3.98 | Beta Hairpin | 0.8016 | 13.87 |
| m4V | -0.0560 | 2.91 | p3N | -0.1483 | 5.13 | Beta Hairpin Strand | -0.0916 | 5.34 |
| m5D | 0.2333 | 6.89 | p3W | -0.1152 | 3.37 | Phi angle | -0.0003 | 3.04 |
| m5F | -0.0941 | 2.85 | p4A | 0.1650 | 4.77 | |||
| m5G | 0.0560 | 1.75 | p4E | 0.1711 | 6.75 | |||
| m5I | -0.0872 | 2.61 | p4P | 0.1946 | 8.26 | |||
| m5Y | -0.1326 | 4.60 | p5C | -0.1742 | 5.23 | |||
| m6E | 0.0916 | 3.22 | p5E | 0.1512 | 4.93 | |||
| m6H | -0.1270 | 5.34 | p5H | 0.2173 | 4.08 | |||
| m6V | 0.1567 | 3.99 | p5I | -0.0783 | 3.52 | |||
| m6W | 0.1063 | 2.81 | p5Q | 0.1209 | 3.16 | |||
| m6Y | -0.1267 | 4.11 | p5T | 0.0651 | 2.79 | |||
| m7A | 0.1622 | 5.51 | p5Y | -0.1818 | 4.60 | |||
| m7E | -0.1072 | 4.70 | p6F | -0.0883 | 1.97 | |||
| m7G | -0.0703 | 3.42 | p6G | -0.1561 | 5.14 | |||
| m7H | 0.2640 | 5.12 | p6M | 0.3066 | 4.87 | |||
| m7I | 0.1141 | 5.62 | p6N | -0.1144 | 5.29 | |||
| m7K | 0.1366 | 3.55 | p6Q | -0.2447 | 7.92 | |||
| m7S | 0.0859 | 2.72 | p7A | 0.1803 | 3.10 | |||
| m8G | 0.0964 | 4.51 | p7C | -0.1207 | 3.99 | |||
| m8L | -0.1518 | 5.28 | p7E | -0.1148 | 4.67 | |||
| m8M | -0.1187 | 4.49 | p7G | -0.0945 | 3.50 | |||
a miα and piα denote amino acid α at position i to the left (minus) and right (plus) of the S/T-site, respectively. The coefficient standard error is |β|÷|t|
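To read the robust |t| values as significance levels, one option is a large-sample normal approximation. The sketch below converts a few |t| values copied from the table into approximate two-sided p-values. The normal approximation and the helper name `two_sided_p_normal` are assumptions of this example; the source does not state how significance was assessed for this table.

```python
import math


def two_sided_p_normal(abs_t):
    """Two-sided p-value for |t| under a standard normal approximation."""
    return math.erfc(abs_t / math.sqrt(2.0))


# (variable, White's |t|) pairs copied from the table.
examples = [("p2A", 1.33), ("p8K", 1.52), ("m5G", 1.75), ("p6F", 1.97), ("m1P", 13.86)]

for name, abs_t in examples:
    print(f"{name}: |t| = {abs_t:.2f}, approx. p = {two_sided_p_normal(abs_t):.4g}")
```

Under this approximation the first three variables listed would not reach the 5% level with robust standard errors, while `p6F` sits just below it.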
Fig. 5 Actual vs. predicted outcomes of the LPM in Table 11
RR estimated LPM predicting O-glycosylation probabilities given structural and sequence data
| Variable | β | Standard error | Variable | β | Standard error | Variable | β | Standard error |
|---|---|---|---|---|---|---|---|---|
| intercept | 0.3124 | 0.0322 | m8N | -0.0595 | 0.0496 | p7T | -0.0660 | 0.0364 |
| m1D | 0.0343 | 0.0415 | m8P | -0.0638 | 0.0336 | p7V | -0.0556 | 0.0386 |
| m1L | 0.0560 | 0.0307 | m8V | -0.0687 | 0.0433 | p7Y | 0.0463 | 0.0439 |
| m1P | 0.1345 | 0.0687 | p1A | -0.0119 | 0.0435 | p8H | 0.0487 | 0.0459 |
| m1R | -0.0671 | 0.0391 | p1D | 0.0614 | 0.0474 | p8K | 0.0037 | 0.0409 |
| m3A | 0.0127 | 0.0368 | p1F | 0.0657 | 0.0396 | p8N | 0.0638 | 0.0428 |
| m3C | -0.0633 | 0.0473 | p1S | -0.0007 | 0.0383 | p8Q | 0.0082 | 0.0414 |
| m3L | 0.0228 | 0.0316 | p1T | 0.0664 | 0.0391 | pos | 0.0338 | 0.0417 |
| m3N | -0.0528 | 0.0703 | p1V | -0.0329 | 0.0323 | ASA_zero | 0.1204 | 0.0690 |
| m3T | -0.0515 | 0.0406 | p2A | 0.0554 | 0.0465 | II | -0.0266 | 0.0547 |
| m4E | -0.0219 | 0.0543 | p2H | 0.0169 | 0.0592 | II´ | -0.0616 | 0.0678 |
| m4F | 0.0251 | 0.0377 | p2P | 0.0788 | 0.0349 | Helix | 0.0603 | 0.0271 |
| m4N | 0.0548 | 0.0464 | p2Q | -0.0658 | 0.0444 | Beta Bridges | 0.1283 | 0.0588 |
| m4R | 0.0669 | 0.0476 | p2Y | -0.0701 | 0.0602 | Beta Hairpin | 0.1353 | 0.0689 |
| m4V | -0.0619 | 0.0309 | p3N | -0.0574 | 0.0469 | Beta Hairpin Strand | -0.0641 | 0.0264 |
| m5D | 0.0553 | 0.0444 | p3W | -0.0584 | 0.0619 | Phi angle | 0.0001 | 0.0002 |
| m5F | -0.0751 | 0.0465 | p4A | 0.0642 | 0.0406 | |||
| m5G | -0.0351 | 0.0482 | p4E | 0.0139 | 0.0350 | |||
| m5I | -0.0408 | 0.0478 | p4P | 0.0384 | 0.0359 | |||
| m5Y | -0.0577 | 0.0482 | p5C | -0.0702 | 0.0473 | |||
| m6E | 0.0249 | 0.0338 | p5E | -0.0222 | 0.0471 | |||
| m6H | -0.0693 | 0.0579 | p5H | 0.0701 | 0.0540 | |||
| m6V | 0.0545 | 0.0395 | p5I | -0.0552 | 0.0419 | |||
| m6W | 0.0067 | 0.0525 | p5Q | 0.0310 | 0.0511 | |||
| m6Y | -0.0589 | 0.0450 | p5T | -0.0380 | 0.0371 | |||
| m7A | 0.0418 | 0.0381 | p5Y | -0.0697 | 0.0546 | |||
| m7E | -0.0655 | 0.0398 | p6F | -0.0644 | 0.0482 | |||
| m7G | -0.0569 | 0.0348 | p6G | -0.0540 | 0.0418 | |||
| m7H | 0.0707 | 0.0491 | p6M | 0.0608 | 0.0620 | |||
| m7I | -0.0053 | 0.0385 | p6N | -0.0695 | 0.0438 | |||
| m7K | 0.0396 | 0.0409 | p6Q | -0.0782 | 0.0484 | |||
| m7S | -0.0403 | 0.0433 | p7A | 0.0765 | 0.0562 | |||
| m8G | 0.0512 | 0.0390 | p7C | -0.0576 | 0.0493 | |||
| m8L | -0.0587 | 0.0351 | p7E | -0.0513 | 0.0393 | |||
| m8M | -0.0693 | 0.0525 | p7G | 0.0306 | 0.0383 | |||
RR estimated LPM ANOVA with k = 3.77895
| Source | Degrees of Freedom | Sum of Squares | Mean Square | Approximate F | Angle |
|---|---|---|---|---|---|
| Regression | 4.2906 | 121.1983 | 28.2472 | 341.4085 | 37.820 |
| Residual | 2043.1994 | 169.0489 | 0.0827 | | |
| Nonorthogonal component (NON) | 21.5099 | 226.1397 | | | |
| Total | 2069 | 516.387 | | | |
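The entries of this ANOVA table can be cross-checked with a few lines of arithmetic: the three components should add up to the totals, each mean square is a sum of squares divided by its degrees of freedom, and the ratio of the two mean squares gives the fifth column. The sketch below performs these checks on the tabulated values. Reading the fifth column as an approximate F statistic is an assumption of this example, made because the ratio matches the tabulated 341.4085.

```python
# (degrees of freedom, sum of squares) copied from the ANOVA table.
components = {
    "Regression": (4.2906, 121.1983),
    "Residual": (2043.1994, 169.0489),
    "Nonorthogonal component (NON)": (21.5099, 226.1397),
}

df_total = sum(df for df, _ in components.values())
ss_total = sum(ss for _, ss in components.values())

df_reg, ss_reg = components["Regression"]
df_res, ss_res = components["Residual"]
ms_reg = ss_reg / df_reg          # regression mean square
ms_res = ss_res / df_res          # residual mean square
f_approx = ms_reg / ms_res        # assumed meaning of the "Approximate" column

print(f"total df = {df_total:.4f}, total SS = {ss_total:.4f}")
print(f"MS regression = {ms_reg:.4f}, MS residual = {ms_res:.4f}, F = {f_approx:.2f}")
```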
Summary of estimated LPLMs as ρ varies
| Sample | ρ | 0.05 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.81 | 0.815 | 0.82 | 0.85 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| In sample | SBC | 1710 | 1318 | 1257 | 1133 | 1036 | 931 | 878 | 798 | 761 | 749 | 737 | 751 | 723 |
| In sample | Variables | 12 | 39 | 92 | 91 | 73 | 34 | 50 | 41 | 38 | 54 | 51 | 51 | 39 |
| In sample | KS (%) | 78.0 | 89.4 | 94.1 | 95.7 | 95.2 | 91.3 | 95.2 | 95.5 | 95.7 | 96.8 | 96.7 | 96.6 | 95.8 |
| In sample | Brier Score | 0.115 | 0.062 | 0.031 | 0.023 | 0.025 | 0.038 | 0.026 | 0.026 | 0.025 | 0.017 | 0.018 | 0.019 | 0.022 |
| In sample | –2×L(·) | 1610 | 1013 | 547 | 431 | 471 | 664 | 489 | 478 | 463 | 329 | 340 | 354 | 418 |
| Out of sample | Brier Score SNR | 1.76 | 2.55 | 1.47 | 1.86 | 2.18 | 1.85 | 3.42 | 1.59 | 4.32 | 3.75 | 2.00 | 7.78 | 3.06 |
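The SBC row can be related to the other in-sample rows through the usual Schwarz criterion, SBC = –2×L + (number of parameters)×ln(n). The sketch below checks this against the table. Two inputs are assumptions inferred for this example rather than stated in the extract: the sample size n = 2070 (one more than the total degrees of freedom in the ANOVA table) and a parameter count equal to the number of variables plus one for the intercept. With these, every tabulated SBC is matched to within the rounding of the reported figures.

```python
import math

N_OBS = 2070  # assumption: total degrees of freedom (2069) plus one

# Columns of the summary table, in order of increasing rho.
rho = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.81, 0.815, 0.82, 0.85]
neg2_loglik = [1610, 1013, 547, 431, 471, 664, 489, 478, 463, 329, 340, 354, 418]
n_variables = [12, 39, 92, 91, 73, 34, 50, 41, 38, 54, 51, 51, 39]
sbc_reported = [1710, 1318, 1257, 1133, 1036, 931, 878, 798, 761, 749, 737, 751, 723]


def sbc(neg2ll, k, n=N_OBS):
    """Schwarz Bayesian criterion with k variables plus an intercept."""
    return neg2ll + (k + 1) * math.log(n)


for r, l, k, reported in zip(rho, neg2_loglik, n_variables, sbc_reported):
    print(f"rho = {r}: computed SBC = {sbc(l, k):.1f}, reported = {reported}")
```

The penalty term explains why the models with roughly 90 variables (ρ = 0.2 and 0.3) have a higher SBC than sparser models with a somewhat worse likelihood.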
The selected LPLM with ρ = 0.82
| Variable | β | Variable | β | Variable | β | Variable | β |
|---|---|---|---|---|---|---|---|
| intercept | –3.7350 | m6S ☼ | –0.4411 | p2I^a | –0.3424 | p7T | –1.0516 |
| m1D^a | 0.0982 | m6V | 0.5139 | p2L | –0.0053 | p8G^a ☼ | –0.0074 |
| m1L^a | 1.0127 | m7H | 2.3444 | p2P | 2.6960 | p8K ☼ | 0.5589 |
| m1R^a | –0.1413 | m7K | 0.8149 | p2V^a ☼ | 0.0130 | p8N^a | 0.5041 |
| m1S | –0.1913 | m7L^a ☼ | –0.4651 | p3T^a | 0.4597 | p8Q^a | 0.3987 |
| m3A^a | 0.5288 | m8A | 0.5662 | p4A^a | 0.9764 | pos | 0.3384 |
| m3G | 1.1410 | m8V | –0.0763 | p4E | 1.3507 | ASA^a ☼ | –0.0144 |
| m4N ☼ | 0.1992 | p1D^a | 0.4975 | p4H | –0.1231 | I^a | –0.2885 |
| m4R | 0.0413 | p1E^a ☼ | –0.3194 | p4I^a ☼ | –0.0779 | Helix | 1.7757 |
| m4V^a | –0.2451 | p1F^a | 0.9501 | p5H | 0.1778 | BH | 0.3832 |
| m4Y^a ☼ | –0.0003 | p1L | –0.7827 | p6L | 1.2946 | BH_strand | –0.8759 |
| m5D | 1.1952 | p1S | 0.1261 | p6T^a | 1.0123 | Phi angle^a | –0.0047 |
| m6E | 0.0586 | p1T | 2.3710 | p7A | 1.4453 | Psi angle^a | –0.0011 |
a / ☼: not significant at 10% in the “equivalent” classical logit model / LPM
Mispredictions rate and fit statistics for selected models
| Model | Mispredictions rate under a 50% cutoff probability | Fit statistics | ||
|---|---|---|---|---|
| In the set of non-O-glycosylated sequences (Y=0), the percentage of those that have estimated probabilities of O-glycosylation greater than 50% ( | In the set of O-glycosylated sequences (Y=1), the percentage of those that have estimated probabilities of O-glycosylation less than or equal to 50% ( | KS | Brier Score | |
| Ordinary LS estimated LPM in Table | 0.37 | 0.61 | 99.1% | 0.009 |
| RR estimated LPM (used for estimating the weights for the WLS estimated LPM in Table | 0.28 | 7.90 | 96.7% | 0.084 |
| LPM in Table | 0.28 | 0.61 | 99.2% | 0.009 |
| LPLM with | 0.83 | 3.55 | 96.6% | 0.019 |
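The quantities in this table can be computed from a vector of predicted probabilities and the observed 0/1 outcomes. The sketch below follows the column definitions given in the header for the two misprediction rates, uses the mean squared difference between prediction and outcome for the Brier score, and takes KS to be the largest gap between the empirical CDFs of the predictions in the two classes. The KS definition, the function names, and the toy data are assumptions of this example, since the extract does not spell them out.

```python
def misprediction_rates(probs, labels, cutoff=0.5):
    """Return (% of Y=0 with p > cutoff, % of Y=1 with p <= cutoff)."""
    neg = [p for p, y in zip(probs, labels) if y == 0]
    pos = [p for p, y in zip(probs, labels) if y == 1]
    rate_y0 = 100.0 * sum(p > cutoff for p in neg) / len(neg)
    rate_y1 = 100.0 * sum(p <= cutoff for p in pos) / len(pos)
    return rate_y0, rate_y1


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def ks_statistic(probs, labels):
    """Largest gap between the empirical CDFs of predictions for Y=0 and Y=1."""
    neg = [p for p, y in zip(probs, labels) if y == 0]
    pos = [p for p, y in zip(probs, labels) if y == 1]
    gaps = []
    for t in sorted(set(probs)):
        cdf_neg = sum(p <= t for p in neg) / len(neg)
        cdf_pos = sum(p <= t for p in pos) / len(pos)
        gaps.append(abs(cdf_neg - cdf_pos))
    return max(gaps)


# Hypothetical toy data, not taken from the paper.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.6, 0.3, 0.9, 0.5, 0.7, 0.8]

rate_y0, rate_y1 = misprediction_rates(probs, labels)
brier = brier_score(probs, labels)
ks = ks_statistic(probs, labels)
print(rate_y0, rate_y1, round(brier, 5), ks)
```

Note that a prediction of exactly 50% counts as a misprediction for an O-glycosylated sequence under the "less than or equal to" rule in the header, which is why the toy value 0.5 contributes to the second rate.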
Regressing psi against sequence information
| Variable | β | White’s |t| | Variable | β | White’s |t| | Variable | β | White’s |t| |
|---|---|---|---|---|---|---|---|---|
| intercept | 164.7095 | 15.76 | m6T | -27.7795 | 2.28 | p4P | -63.4680 | 6.26 |
| m1E | -41.8628 | 2.97 | m7E | 43.9542 | 5.47 | p5C | -73.2859 | 4.80 |
| m1F | 13.3555 | 1.67 | m7K | -32.2923 | 3.01 | p5E | -59.8629 | 6.27 |
| m1G | -39.8688 | 5.61 | m7L | 29.8517 | 3.96 | p5F | -46.9835 | 3.43 |
| m1I | 39.5770 | 4.07 | m7V | -47.2830 | 5.75 | p5H | -103.8328 | 6.47 |
| m1P | -88.6389 | 2.43 | m7Y | -21.9592 | 1.70 | p5K | -46.4349 | 5.08 |
| m1S | 22.2152 | 2.37 | m8G | 26.2222 | 2.74 | p5M | -55.0010 | 2.58 |
| m1V | 22.1718 | 2.53 | m8I | -26.1921 | 1.96 | p5R | -49.2451 | 3.17 |
| m1W | -56.8472 | 1.73 | m8K | 39.6707 | 3.18 | p5T | -26.0844 | 3.06 |
| m3C | -107.3697 | 7.69 | m8L | -59.0523 | 7.31 | p5W | -92.6251 | 6.31 |
| m3H | -34.3322 | 2.59 | m8M | -79.0357 | 3.70 | p6C | 45.0324 | 4.28 |
| m3I | 55.7918 | 3.38 | m8Q | -48.2257 | 3.54 | p6M | -103.3734 | 6.33 |
| m3P | -60.7537 | 6.03 | m8V | 20.4134 | 1.93 | p6R | 42.4221 | 4.59 |
| m3Q | -91.6187 | 6.83 | p1E | 21.3198 | 2.71 | p6S | 33.0045 | 3.96 |
| m3R | -78.6022 | 7.07 | p1I | 57.7033 | 3.69 | p6W | 78.9544 | 2.14 |
| m3S | -58.6653 | 4.59 | p1P | 168.9937 | 4.72 | p7M | -30.9223 | 2.37 |
| m3Y | -43.6250 | 4.75 | p1T | 25.9053 | 2.81 | p7Q | -74.3226 | 5.24 |
| m4A | -22.2813 | 3.06 | p2A | -32.4674 | 2.98 | p7S | -27.8055 | 2.90 |
| m4D | 34.2970 | 2.86 | p2I | -51.3731 | 6.62 | p7V | -25.1988 | 2.55 |
| m4F | -30.2667 | 2.51 | p2Q | 22.8051 | 2.34 | p7W | -126.1256 | 5.32 |
| m4H | 42.5802 | 2.67 | p2R | -73.2310 | 6.60 | p7Y | -33.3375 | 3.33 |
| m4I | 52.7538 | 6.23 | p2T | -24.7715 | 2.04 | p8C | -32.8242 | 2.11 |
| m4S | -17.8569 | 2.04 | p2V | 28.9007 | 3.16 | p8G | 24.3492 | 3.24 |
| m5C | -61.3798 | 5.19 | p3A | -33.2607 | 3.75 | p8K | -26.5478 | 3.16 |
| m5M | -50.2581 | 2.77 | p3L | -32.1617 | 3.65 | p8Q | 30.6688 | 2.49 |
| m5N | -45.0953 | 2.62 | p3Q | 44.6698 | 2.94 | p8S | 47.6755 | 4.08 |
| m5Q | 36.6723 | 4.03 | p3T | -34.4732 | 3.71 | p8V | 24.9615 | 3.24 |
| m5R | 36.6709 | 4.25 | p4A | -22.9135 | 2.38 | p8Y | 29.0752 | 2.56 |
| m6A | -51.1224 | 3.88 | p4C | -54.2233 | 3.33 | pos | -45.3919 | 4.67 |
| m6D | -31.7667 | 2.24 | p4D | 35.4777 | 3.29 | ASA | -0.6246 | 5.84 |
| m6K | -16.0558 | 2.11 | p4E | 53.4754 | 6.01 | |||
| m6L | -30.2475 | 3.66 | p4F | 41.2684 | 3.29 | |||
| m6P | -62.2979 | 6.04 | p4G | 19.1452 | 2.11 | |||
| m6R | -22.4489 | 2.12 | p4L | -21.9163 | 2.72 | |||
| m6S | 30.1454 | 3.69 | p4N | 31.5863 | 1.94 | |||
Fig. 6 Q-Q plot of residuals for the model in Table 21
Fig. 7 Ramachandran Plot
Fig. 8 Increases in CV PRESS when the model in Table 21 is subject to 5-fold CV 50 times
Frequencies of variables selected by the LPLM as ρ varies^a
a Variables in the LPM of Table 11 are shaded in green.
Fig. 9 KS statistic and Brier Score out-of-sample variations for the LPM
Regressing log(ASA) against sequence information
| Variable | β | White’s |t| | Variable | β | White’s |t| | Variable | β | White’s |t| |
|---|---|---|---|---|---|---|---|---|
| intercept | 3.8265 | 79.93 | m6G | 0.2437 | 3.04 | p4A | -0.1698 | 2.96 |
| m1C | 0.4005 | 4.78 | m6K | 0.2509 | 4.17 | p4C | -0.4837 | 2.44 |
| m1E | -0.1849 | 2.76 | m6M | -0.2652 | 2.11 | p4D | -0.4424 | 6.50 |
| m1F | -0.3802 | 9.02 | m6W | 0.2150 | 2.77 | p4F | -0.2666 | 6.08 |
| m1G | -0.1734 | 4.07 | m7I | -0.2036 | 3.95 | p4G | -0.5605 | 8.13 |
| m1I | -0.4253 | 7.95 | m7K | 0.4807 | 7.38 | p4H | -0.2247 | 3.50 |
| m1K | 0.2688 | 3.63 | m7P | 0.1391 | 2.64 | p4I | 0.3184 | 7.64 |
| m1M | -0.1687 | 2.12 | m7Q | 0.3848 | 5.22 | p4S | -0.1250 | 2.75 |
| m1Q | 0.4901 | 6.87 | m7R | -0.2030 | 2.36 | p4W | -0.1784 | 2.80 |
| m1S | 0.1580 | 2.89 | m7S | 0.2185 | 3.14 | p5A | 0.1559 | 3.37 |
| m1V | 0.0999 | 2.51 | m7V | 0.2153 | 4.68 | p5C | -0.1432 | 2.21 |
| m1Y | -0.2066 | 3.66 | m7W | 0.2831 | 2.64 | p5D | -0.1959 | 3.46 |
| m3F | -0.4134 | 7.04 | m8G | -0.5119 | 11.04 | p5K | 0.3154 | 6.66 |
| m3G | 0.1102 | 2.47 | m8S | -0.1997 | 4.90 | p5V | 0.2375 | 4.07 |
| m3I | -0.4745 | 6.72 | m8Y | 0.6461 | 11.36 | p5W | 0.3914 | 3.95 |
| m3L | -0.2031 | 4.83 | p1C | 0.3582 | 5.47 | p6A | 0.1402 | 3.83 |
| m3M | -0.2393 | 3.21 | p1K | 0.3864 | 5.90 | p6E | -0.2923 | 6.89 |
| m3S | 0.1590 | 2.64 | p1L | 0.1875 | 4.96 | p6H | 0.1935 | 3.36 |
| m3V | -0.2821 | 4.05 | p1Q | -0.1655 | 2.41 | p6I | 0.3046 | 4.23 |
| m3W | -0.2581 | 4.16 | p1V | 0.1280 | 3.00 | p6K | -0.1276 | 1.92 |
| m3Y | -0.1855 | 3.34 | p2E | 0.1752 | 3.76 | p7C | 0.1051 | 1.73 |
| m4C | 0.4717 | 3.35 | p2F | 0.1018 | 1.72 | p7H | 0.2926 | 4.07 |
| m4D | 0.3255 | 3.79 | p2G | -0.2754 | 4.66 | p7I | -0.2492 | 3.18 |
| m4F | 0.3545 | 7.04 | p2K | -0.1480 | 2.10 | p7P | 0.1066 | 2.49 |
| m4K | 0.1518 | 2.83 | p2N | -0.1830 | 2.88 | p7Q | -0.1690 | 2.20 |
| m4L | 0.1450 | 3.83 | p2V | -0.1858 | 3.33 | p7T | 0.0952 | 1.96 |
| m4P | 0.4735 | 8.35 | p3A | -0.2814 | 5.72 | p7W | -0.2539 | 2.93 |
| m4Q | 0.3059 | 5.66 | p3C | 0.2717 | 3.90 | p7Y | 0.2480 | 3.36 |
| m4S | 0.1391 | 2.63 | p3D | -0.2040 | 4.27 | p8G | 0.1935 | 3.69 |
| m4T | -0.1328 | 2.2 | p3H | -0.2118 | 2.19 | p8H | 0.2042 | 3.94 |
| m5I | -0.4287 | 5.15 | p3I | -0.1469 | 3.65 | p8K | 0.3261 | 7.20 |
| m5P | 0.1189 | 1.72 | p3K | -0.4302 | 5.66 | p8L | 0.1804 | 2.63 |
| m5R | -0.2278 | 3.95 | p3L | -0.2711 | 6.92 | p8R | 0.2939 | 5.12 |
| m5Y | -0.3245 | 5.27 | p3R | -0.1781 | 2.87 | pos | 0.1555 | 2.65 |
| m6E | 0.1342 | 2.47 | p3Y | -0.2788 | 2.90 | |||
Fig. 10 Distribution of studentized residuals for the model in Table 23
Fig. 11 Variations in ASE and KS over 100 CV trials generated by the model in Table 23
Fig. 12 Superposition of ±10 positions around the S-site in protein structures. In this example, the PDB-IDs are: 1AZM, 1BZM, 1CRM, 1CZM, 1HCB, 1HUG, 1HUH, 1J9W, 1JV0, 2CAB, and 2FOY. The figure was generated using PyMOL software. The serine residue is colored in yellow.