Stephen S-T Yau, Wei-Guang Mao, Max Benson, Rong Lucy He.
Abstract
What kinds of amino acid sequences could possibly be protein sequences? In all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Moreover, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate that our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe.
Year: 2015 PMID: 25609314 PMCID: PMC4302309 DOI: 10.1038/srep07972
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Alanine convex hull computed from the Uniprot 2013_03 dataset.
Blue points in each of the four subfigures stand for vectors corresponding to proteins. (A) shows the picture in the (n, μ) coordinate plane and (C) shows the picture coordinate plane. (B) is the enlarged view of the protein area in (A), and (D) is the enlarged view of the protein area in (C). In (B) and (D), the black lines stand for the boundaries of the convex hull for the protein area.
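The membership test that the caption describes can be illustrated in two dimensions. The sketch below is not the authors' code: it builds a convex hull with Andrew's monotone chain algorithm and checks whether a query point lies inside it. It assumes that n is the count of an amino acid in a sequence and μ is the mean position of its occurrences; the function names and the toy points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], q: Point, eps: float = 1e-12) -> bool:
    """True if q is inside or on a counter-clockwise hull with >= 3 vertices."""
    m = len(hull)
    return all(cross(hull[i], hull[(i + 1) % m], q) >= -eps for i in range(m))


# Toy (n, mu) points standing in for known proteins (made-up values).
known = [(1.0, 1.0), (1.0, 5.0), (4.0, 2.5), (4.0, 6.0), (2.0, 3.0), (3.0, 4.0)]
hull = convex_hull(known)
candidate_in = inside_hull(hull, (2.0, 3.0))   # lies within the hull
candidate_out = inside_hull(hull, (3.0, 1.0))  # lies below the lower boundary
```

A point that falls outside the hull in this sense is the two-dimensional analogue of a sequence that fails the test for one amino acid.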
Detailed counts of sequences in the three snapshots of UniprotKB
| Quantity | Count |
|---|---|
| Number of Distinct Sequences in | 392455 |
| Number of Distinct Sequences in | 395514 |
| Number of Distinct Sequences in | 397348 |
| Number of Distinct Sequences | 391704 |
| Number of Distinct Sequences Contained in all three | 391528 |
| Number of Sequences in | 751 |
| Number of Sequences in | 3810 |
| Number of Sequences in | 176 |
| Number of Sequences in | 5820 |
Correspondence between sequence counts and natural vectors
| Number of Sequences Before Normalization | Number of Sequences | Number of Distinct Sequences | Number of Natural Vectors | Number of Distinct Natural Vectors |
|---|---|---|---|---|
| 472284 | 471313 | 392455 | 471313 | 392455 |
| 475547 | 474537 | 395514 | 474537 | 395514 |
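In the table above, the number of distinct natural vectors equals the number of distinct sequences in each row. The sketch below shows how such a count could be checked on a toy list. It assumes the usual natural vector definition, which is not spelled out in this excerpt: for each of the 20 amino acids, the count n_k, the mean position μ_k, and the normalized second central moment D2_k = Σ(s_i − μ_k)² / (n_k · n), where n is the sequence length. The function name, the per-residue ordering of the 60 components, and the toy sequences are hypothetical.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def natural_vector(seq: str) -> Tuple[float, ...]:
    """Return a 60-component vector: (n_k, mu_k, D2_k) for each amino acid.

    Assumed definition; positions are 1-based. Residues absent from the
    sequence contribute (0, 0, 0).
    """
    n = len(seq)
    positions: Dict[str, List[int]] = {a: [] for a in AMINO_ACIDS}
    for i, ch in enumerate(seq, start=1):
        if ch not in positions:
            raise ValueError(f"non-standard residue {ch!r} at position {i}")
        positions[ch].append(i)

    vec: List[float] = []
    for a in AMINO_ACIDS:
        pos = positions[a]
        n_k = len(pos)
        if n_k == 0:
            vec.extend((0.0, 0.0, 0.0))
            continue
        mu = sum(pos) / n_k
        d2 = sum((p - mu) ** 2 for p in pos) / (n_k * n)
        vec.extend((float(n_k), mu, d2))
    return tuple(vec)


# Toy data (made up): four sequences, one of them a duplicate.
sequences = ["ACA", "CAA", "ACA", "AAC"]
vectors = [natural_vector(s) for s in sequences]
num_distinct_sequences = len(set(sequences))
num_distinct_vectors = len(set(vectors))
```

On this toy list both counts are 3, mirroring the pattern in the table; the check is only as strong as the data it is run on.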
The 14 protein sequence outliers in Uniprot 2014_03 and their distances from the convex hulls
| No. | Sequence Length | Access ID | Convex hull(s) the sequences fall outside | Distance from point to convex hull |
|---|---|---|---|---|
| 1 | 11 | P85817 | Asparagine (N) | 0.0177 |
| 2 | 16 | P81071 | Aspartic acid (D) | 0.0110 |
| 3 | 19 | P68116 | Aspartic acid (D) | 0.0018 |
| 4 | 20 | P14469 | Isoleucine (I) | 0.003 |
| 5 | 199 | Q9ZVZ9 | Histidine (H) | 0.0000208 |
| 6 | 211 | P33191 | Tyrosine (Y) | 0.0027 |
| 7 | 237 | Q6M923 | Glutamine (Q) | 0.0362 |
| 8 | 287 | P50751 | Proline (P) | 0.0044 |
| 9 | 392 | Q5A8I8 | Proline (P) | 33.9023 |
| 10 | 1086 | Q59XL0 | Methionine (M) | 0.4508 |
| 11 | 1129 | Q9QR71 | Glutamic acid (E) | 5.5955 |
| | | | Glutamine (Q) | 1.4455 |
| 12 | 1404 | Q59SG9 | Serine (S) | 0.2427 |
| 13 | 2346 | A1Z8P9 | Glycine (G) | 0.2179 |
| 14 | 3461 | P62288 | Arginine (R) | 1.8593 |
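The distances in the last column measure how far an outlier lies from a hull. As a rough illustration only, the sketch below computes the Euclidean distance from a point to a convex polygon in two dimensions, returning 0 for points inside or on the boundary. The hulls in the source are three-dimensional, so this is a simplified analogue rather than the authors' computation; the function names and the square used as a hull are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_segment_distance(q: Point, a: Point, b: Point) -> float:
    """Euclidean distance from q to the closed segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(q[0] - a[0], q[1] - a[1])
    # Project q onto the line through a and b, then clamp to the segment.
    t = ((q[0] - a[0]) * dx + (q[1] - a[1]) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(q[0] - (a[0] + t * dx), q[1] - (a[1] + t * dy))


def distance_to_hull(hull: List[Point], q: Point) -> float:
    """Distance from q to a convex polygon given in counter-clockwise order."""
    m = len(hull)
    edges = [(hull[i], hull[(i + 1) % m]) for i in range(m)]
    inside = all(
        (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0]) >= 0
        for a, b in edges
    )
    if inside:
        return 0.0
    return min(point_segment_distance(q, a, b) for a, b in edges)


# Toy hull (made up): an axis-aligned square, counter-clockwise.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
d_inside = distance_to_hull(square, (2.0, 2.0))   # 0.0, point is inside
d_edge = distance_to_hull(square, (5.0, 2.0))     # 1.0, nearest to the right edge
d_corner = distance_to_hull(square, (7.0, 8.0))   # 5.0, nearest to corner (4, 4)
```

In three dimensions the same idea applies with faces in place of edges, and a library routine for convex hulls or quadratic programming would normally be used instead of hand-written geometry.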
Figure 2. Protein sequence (Access ID P85817) lying outside the Asparagine convex hull.
In subfigure (A), the cyan surfaces stand for the surfaces of convex hulls in 3-dimensional space. The red point stands for the coordinates of this sequence. Subfigure (B) is an enlarged view of subfigure (A), showing that the point does fall outside the convex hull.
The 18 protein sequence outliers from Uniprot 2014_06 and their distances from the convex hulls
| No. | Sequence Length | Access ID | Convex hull(s) the sequences fall outside | Distance from point to convex hull |
|---|---|---|---|---|
| 1 | 20 | P82867 | Aspartic acid (D) | 0.0045 |
| 2 | 19 | P68214 | Aspartic acid (D) | 0.0309 |
| 3 | 15 | P80612 | Alanine (A) | 0.3850 |
| 4 | 267 | P14918 | Arginine (R) | 0.0000042949 |
| 5 | 150 | P27787 | Phenylalanine (F) | 0.00025511 |
| 6 | 372 | Q5AKU5 | Histidine (H) | 0.00013742 |
| 7 | 105 | Q2RB28 | Leucine (L) | 0.000053612 |
| 8 | 105 | B9GBM3 | Leucine (L) | 0.000053612 |
| 9 | 94 | Q5G8Z3 | Aspartic acid (D) | 0.0022 |
| 10 | 391 | P46525 | Serine (S) | 0.1044 |
| 11 | 838 | P08489 | Methionine (M) | 0.0840 |
| | | | Valine (V) | 3.0627 |
| 12 | 848 | P10388 | Glutamic acid (E) | 0.1269 |
| | | | Methionine (M) | 0.1408 |
| | | | Valine (V) | 3.1465 |
| 13 | 240 | P04702 | Glycine (G) | 0.000036147 |
| 14 | 240 | P06677 | Glycine (G) | 0.000036147 |
| 15 | 240 | P04703 | Glycine (G) | 0.000036147 |
| 16 | 240 | P06676 | Glycine (G) | 0.000036147 |
| 17 | 267 | P04698 | Glycine (G) | 0.000070559 |
| 18 | 187 | B6U769 | Proline (P) | 0.0037 |