Alessia David, Suhail Islam, Evgeny Tankhilevich, Michael J E Sternberg.
Abstract
AlphaFold, the deep learning algorithm developed by DeepMind, recently released the three-dimensional models of the whole human proteome to the scientific community. Here we discuss the advantages, limitations and the still unsolved challenges of the AlphaFold models from the perspective of a biologist, who may not be an expert in structural biology.
Keywords: AlphaFold; human proteome; inter-domain accuracy; three-dimensional model
Year: 2021 PMID: 34757056 PMCID: PMC8783046 DOI: 10.1016/j.jmb.2021.167336
Source DB: PubMed Journal: J Mol Biol ISSN: 0022-2836 Impact factor: 5.469
Figure 1. The challenges of protein structure prediction. A) AlphaFold model of the growth hormone receptor (GHR, UniProt P10912). The long, unstructured intracellular tail of the growth hormone receptor (residues 289–638) is presented in magenta as a long filament and is wrongly placed next to the extracellular domain. The extracellular domain (residues 19–264) is presented in blue and the transmembrane domain (residues 265–288) in cyan. B) On the left, AlphaFold model of the PIK3R1 protein (in magenta, UniProt P27986). The main domains of PIK3R1 are highlighted with dotted lines. On the right, the AlphaFold model of PIK3R1 (in magenta) is superposed on the experimental structure of PIK3R1 (in cyan) in complex with PIK3CD (in green; PDB 5M6U). The PIK3R1 interdomain placement would result in a steric clash with PIK3CD. PI3K-P85-iSH2, Phosphatidylinositol 3-kinase regulatory subunit P85 inter-SH2 domain.
AlphaFold database coverage compared to the experimental coverage and the coverage obtained using standard homology-based methods exemplified by our
The three-dimensional coordinate files were extracted from the Protein Data Bank (PDB). Phyre2 was used as a representative of homology-based methods. Only Phyre2 models with a confidence score >98% and sequence identity >30% were selected. For AlphaFold models, the residue coverage is presented according to the per-residue pLDDT score.
| Gene | UniProt ID | Protein length | Experimental coverage: residues, n | Experimental coverage: residues, % | AlphaFold (pLDDT ≥ 70): residues, n | AlphaFold (pLDDT ≥ 70): residues, % | Phyre2 (Confidence > 98%; Seq ID > 30%): residues, n | Phyre2 (Confidence > 98%; Seq ID > 30%): residues, % |
|---|---|---|---|---|---|---|---|---|
| | Q5SW96 | 308 | 16 | 5.2 | 159 | 51.6 | 144 | 46.8 |
| | Q9BYW2 | 2564 | 424 | 16.5 | 513 | 20.0 | 345 | 13.5 |
| | Q92793 | 2442 | 556 | 22.8 | 823 | 33.7 | 829 | 33.9 |
| | O14497 | 2285 | 586 | 25.6 | 554 | 24.2 | 647 | 28.3 |
| | P46531 | 2555 | 797 | 31.2 | 602 | 23.6 | 551 | 21.6 |
| | P51532 | 1647 | 682 | 41.4 | 831 | 50.5 | 945 | 57.4 |
| | Q86U86 | 1689 | 879 | 52.0 | 1126 | 66.7 | 373 | 22.1 |
| | P15056 | 766 | 447 | 58.4 | 421 | 55.0 | 295 | 38.5 |
| | Q969H0 | 707 | 444 | 62.8 | 471 | 66.6 | 443 | 62.7 |
| | P40337 | 213 | 160 | 75.1 | 155 | 72.8 | 150 | 70.4 |
| | P06400 | 928 | 698 | 75.2 | 592 | 63.8 | 763 | 82.2 |
| | P01130 | 860 | 705 | 82.0 | 643 | 74.8 | 650 | 75.6 |
| | P60484 | 403 | 334 | 82.9 | 315 | 78.2 | 353 | 87.6 |
| | P00533 | 1210 | 1010 | 83.5 | 860 | 71.1 | 914 | 75.5 |
| | P04637 | 393 | 340 | 86.5 | 227 | 57.8 | 357 | 90.8 |
| | P01116 | 189 | 171 | 90.5 | 175 | 92.6 | 189 | 100.0 |
| | Q8NBP7 | 692 | 642 | 92.8 | 563 | 81.4 | 622 | 89.9 |
| | P42345 | 2549 | 2370 | 93.0 | 2074 | 81.4 | 2533 | 99.4 |
| | P02649 | 317 | 298 | 94.0 | 218 | 68.8 | 299 | 94.3 |
| | P27986 | 724 | 683 | 94.3 | 621 | 85.8 | 596 | 82.3 |
| | P42336 | 1068 | 1061 | 99.3 | 1002 | 93.8 | 1060 | 99.3 |
| | P42771 | 156 | 156 | 100.0 | 114 | 73.1 | 156 | 100.0 |
| Total | | | | | 13,059 | | 13,214 | |
| | P21359 | 2839 | | | NA | NA | 595 | 21.0 |
| | P25054 | 2843 | | | NA | NA | 571 | 20.1 |
| | Q13315 | 3056 | | | NA | NA | 3053 | 99.9 |
| | Q96T58 | 3664 | | | NA | NA | 456 | 12.4 |
| | P04114 | 4563 | | | NA | NA | 0 | 0.0 |
| | Q14517 | 4588 | | | NA | NA | 518 | 11.3 |
| | Q8NEZ4 | 4911 | | | NA | NA | 156 | 3.2 |
| | O14686 | 5537 | | | NA | NA | 309 | 5.6 |
NA, AlphaFold model not available from the EBI website. However, the predicted overlapping segments for these long proteins can be downloaded from https://alphafold.ebi.ac.uk/download.
LDLR, APOB, APOE, PCSK9 and LDLRAP1 cause Familial Hypercholesterolemia. The remaining 25 genes are the top 25 genes from PanCan (4742 patients) in the TumorPortal.
Seq ID, sequence identity between query and template.