Donald Adjeroh, Yue Jiang, Bing-Hua Jiang, Jie Lin.
Abstract
Various studies have implicated different multidomain proteins in cancer. However, there has been little or no detailed study on the role of circular multidomain proteins in the general problem of cancer or in specific cancer types. This work represents an initial attempt at investigating the potential for predicting linkages between known cancer-associated proteins and uncharacterized or hypothetical multidomain proteins, based primarily on circular permutation (CP) relationships. First, we propose an efficient algorithm for rapid identification of both exact and approximate CPs in multidomain proteins. Using the circular relations identified, we construct networks between multidomain proteins, based on which we perform functional annotation of multidomain proteins. We then extend the method to construct subnetworks for selected cancer subtypes and predict potential linkages between uncharacterized multidomain proteins and the selected cancer types. We include practical results showing the performance of the proposed methods.
Keywords: cancer; circular patterns; functional annotation; multidomain proteins
Year: 2015 PMID: 25741177 PMCID: PMC4338801 DOI: 10.4137/CIN.S14059
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Example of multidomain proteins that are related by CP. Multidomain protein Q5RFP4 (zinc finger ZNF146 from Pongo abelii) with domain block sequence ABBBBBBBBB occurs as an exact CP inside Q8NC79 (ZNF680, Homo sapiens) with domain sequence CBBBBBBBBAB. Notice also that both proteins form a k-approximate CP (with k = 1). Codes inside the blocks denote protein domain IDs as used in the protein domain database (ProDom) [28]. Key: A: PD057131, B: PD000003, C: PD915601. Schematic for linear domain block structures generated from the ProDom website.
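The relationship described in Figure 1 can be checked mechanically. The following is a naive sketch for illustration only, not the efficient algorithm proposed in the paper. It assumes that "exact CP inside" means that some rotation of the shorter domain string occurs as a contiguous substring of the longer one, and that "k-approximate CP" means that some rotation lies within edit distance k of the other sequence; the second reading is an assumption, and all function names are hypothetical.

```python
def rotations(pattern):
    """All circular rotations of a domain string."""
    return {pattern[i:] + pattern[:i] for i in range(len(pattern))}


def occurs_as_exact_cp(pattern, text):
    """True if some rotation of `pattern` is a contiguous substring of `text`.

    Every length-m substring of pattern + pattern is a rotation of pattern,
    so each length-m window of `text` is tested against the doubled pattern.
    """
    m = len(pattern)
    if m == 0 or m > len(text):
        return False
    doubled = pattern + pattern
    return any(text[i:i + m] in doubled for i in range(len(text) - m + 1))


def edit_distance(a, b):
    """Standard Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def is_k_approximate_cp(pattern, text, k):
    """Assumed definition: some rotation of `pattern` is within edit distance k of `text`."""
    return any(edit_distance(rot, text) <= k for rot in rotations(pattern))


# Domain block sequences quoted in the Figure 1 caption.
q5rfp4 = "ABBBBBBBBB"
q8nc79 = "CBBBBBBBBAB"
print(occurs_as_exact_cp(q5rfp4, q8nc79), is_k_approximate_cp(q5rfp4, q8nc79, 1))
```

The window scan costs time proportional to the product of the two lengths, which is fine for a pair of proteins but would not scale to a database-wide search.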
Generic approximate pattern matching using LIS
| APM-via-LIS(…) |
| 1 Build the mapping table … |
| 2 seq ← NULL, … |
| 3 … |
| 4 seq ← seq ○ … |
| 5 … |
| 6 Generate LIS from seq |
| 7 Calculate LCS between … |
| 8–10 … |
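
The steps that survive in this listing (build a mapping table, concatenate mapped values into seq, generate an LIS, derive the LCS) resemble the classical reduction of longest common subsequence (LCS) to longest increasing subsequence (LIS). The sketch below shows that classical reduction under the assumption that this is what the listing does; it is not a reconstruction of the authors' pseudocode, and the function name is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_via_lis(pattern, text):
    """Length of the LCS of `pattern` and `text`, computed through an LIS."""
    # Mapping table: symbol -> positions where it occurs in the pattern.
    positions = defaultdict(list)
    for idx, symbol in enumerate(pattern):
        positions[symbol].append(idx)

    # Replace every text symbol by its pattern positions in decreasing order,
    # so that a strictly increasing subsequence can use at most one pattern
    # position per text symbol.
    seq = []
    for symbol in text:
        seq.extend(reversed(positions.get(symbol, [])))

    # Patience-style strictly increasing LIS: tails[i] holds the smallest
    # possible last value of an increasing subsequence of length i + 1.
    tails = []
    for value in seq:
        slot = bisect_left(tails, value)
        if slot == len(tails):
            tails.append(value)
        else:
            tails[slot] = value
    return len(tails)
```

Under this reading, a pattern of length m could be accepted as an approximate match when `m - lcs_length_via_lis(pattern, text)` does not exceed the allowed error k; that acceptance rule is likewise an assumption.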
ACPM with Greedy Algorithm
| ACPM-… |
| 1–4 … |
| 5 APM-via-LIS(…) |
| 6–7 … |
Figure 2. Variation of the number of hypotheses generated using q-grams.
ACPM with q-grams and Suffix Array
| ACPM-… |
| 1 seq ← NULL, pos ← NULL, s ← 1 |
| 2 … |
| 3 seq ← seq ○ SeqDB[i] |
| 4 … |
| 5 pos[…] … |
| 6–7 … |
| 8 <SA,lcp> ← BuildSA(seq) |
| 9 … |
| 10 Candidates ← {} |
| 11 … |
| 12 Candidates ← Candidates ∪ {…} |
| 13–19 … |
| 20 APM-via-LIS(…) |
| 21–24 … |
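
The listing concatenates the sequence database, indexes it, collects candidates, and then verifies them with APM-via-LIS. The filtering idea can be sketched as follows. Note the substitution: a hash-based q-gram index stands in for the suffix array of the listing, purely to keep the example short. The wrap-around q-gram extraction and all function names are assumptions, not the authors' procedure.

```python
from collections import defaultdict


def circular_qgrams(pattern, q):
    """q-grams of `pattern` read circularly, so grams spanning the wrap-around are included."""
    if q < 1 or len(pattern) < q:
        return set()
    wrapped = pattern + pattern[:q - 1]
    return {wrapped[i:i + q] for i in range(len(pattern))}


def build_qgram_index(seq_db, q):
    """Map each linear q-gram to the ids of the database sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(seq_db):
        for i in range(len(seq) - q + 1):
            index[seq[i:i + q]].add(seq_id)
    return index


def candidate_ids(pattern, index, q):
    """Ids of sequences sharing at least one q-gram with some rotation of `pattern`."""
    hits = set()
    for gram in circular_qgrams(pattern, q):
        hits |= index.get(gram, set())
    return hits


# Toy database of domain strings; only the first entry comes from Figure 1.
seq_db = ["CBBBBBBBBAB", "XYZXYZ", "BAB"]
index = build_qgram_index(seq_db, q=3)
print(sorted(candidate_ids("ABBB", index, q=3)))
```

Only the surviving candidates would then be passed to the more expensive verification step, which is the purpose of the filter.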
Top 20 highest degree proteins with GO function.
| RANK | COUNT | AC NUMBER | GO DESCRIPTION |
|---|---|---|---|
| 1 | 23353 | Q7VMZ1 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 2 | 23344 | Q9CPC5 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 3 | 23338 | Q3EG14 | Protein not found in GO |
| 4 | 20508 | Q33HH1 | Protein not found in GO |
| 5 | 20446 | Q47AY9 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 6 | 20446 | Q4UQ62 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 7 | 20446 | Q8P4K7 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 8 | 20446 | Q8PG73 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 9 | 20415 | Q426Q5 | Protein not found in GO |
| 10 | 20398 | Q3BNR9 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 11 | 20393 | Q73PA3 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 12 | 20273 | Q50XK7 | Protein not found in GO |
| 13 | 20246 | Q66C16 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 14 | 20244 | Q5NU40 | No function in GO |
| 15 | 20244 | O32748 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 16 | 20199 | Q30UI1 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 17 | 20177 | Q5NU41 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 18 | 20150 | Q3MAZ4 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 19 | 20133 | Q5VLQ9 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
| 20 | 20118 | Q6NEY3 | nucleotide binding; ATP binding; ATPase activity; nucleoside-triphosphatase activity |
Figure 3. Execution time for the proposed ACPM algorithms.
Circular patterns found in the ProDom database.
| MATCH TYPE | CP-MATCHES | NON-CP MATCHES | TOTAL |
|---|---|---|---|
| exact PM | 1706800 | 2626323 | 4333123 |
| 1-approx PM | 24679013 | 613602 | 25292615 |
| Total | 26385813 | 3239925 | 29625738 |
Figure 4. Degree distributions in the network of multidomain proteins constructed based on the circular patterns they contain. (A) Log degree distribution. (B) Log degree distribution for Top-100 degree nodes.
Figure 5. Number of directly connected pairs in Top K highest degree proteins. (A) Number of directly connected pairs. (B) The ratio ρ for increasing values of K.
Predicted functions for nine sample multidomain proteins using the union of functions for known proteins in the in-edge and out-edge sets.
| PROTEIN AC NUMBER | FUNCTION (GROUND TRUTH) | PREDICTED FUNCTION ( | PREDICTED FUNCTION ( | PREDICTED FUNCTION ( |
|---|---|---|---|---|
| Q7VMZ1 | GO:0000166 | GO:0000166 | GO:0000166 | GO:0000166 |
| GO:0005524 | GO:0005524 | GO:0005524 | GO:0005215 | |
| GO:0016887 | GO:0016887 | GO:0016887 | GO:0005524 | |
| GO:0017111 | GO:0017111 | GO:0017111 | GO:0016787 | |
| GO:0042626 | GO:0016887 | |||
| GO:0017111 | ||||
| GO:0042626 | ||||
| O32184 | GO:0003824 | GO:0003824 | GO:0003824 | GO:0003824 |
| GO:0005488 | GO:0005488 | GO:0005488 | GO:0004316 | |
| GO:0016491 | GO:0016491 | GO:0016491 | GO:0005488 | |
| GO:0016491 | ||||
| Q2Y7W6 | GO:0000156 | GO:0000155 | GO:0000155 | GO:0000155 |
| GO:0004871 | GO:0004871 | GO:0004871 | ||
| Q33CH5 | GO:0003723 | GO:0003723 | GO:0003723 | GO:0003723 |
| GO:0003968 | GO:0003968 | GO:0003968 | GO:0003968 | |
| Q30U32 | GO:0000156 | GO:0000156 | GO:0000155 | GO:0000155 |
| GO:0004871 | GO:0000156 | GO:0000156 | ||
| GO:0004871 | GO:0004871 | |||
| O93828 | GO:0004585 | GO:0004585 | GO:0004585 | GO:0004585 |
| GO:0016597 | GO:0016597 | GO:0016597 | GO:0016597 | |
| GO:0016740 | GO:0016740 | GO:0016740 | GO:0016740 | |
| GO:0016743 | GO:0016743 | GO:0016743 | GO:0016743 | |
| Q30SN9 | GO:0003824 | GO:0003824 | ||
| GO:0004252 | GO:0004252 | |||
| GO:0005515 | ||||
| Q2YTY7 | GO:0003723 | GO:0003723 | ||
| GO:0009982 | GO:0009982 | |||
| O78911 | GO:0008137 | GO:0008137 | GO:0008137 | GO:0008137 |
| GO:0016491 | GO:0016491 | GO:0016491 | GO:0016491 |
Predicted functions for nine sample multidomain proteins using the intersection of functions for known proteins in the in-edge and out-edge sets.
| PROTEIN AC NUMBER | FUNCTION (GROUND TRUTH) | PREDICTED FUNCTION ( | PREDICTED FUNCTION ( | PREDICTED FUNCTION ( |
|---|---|---|---|---|
| Q7VMZ1 | GO:0000166 | GO:0000166 | GO:0000166 | GO:0000166 |
| GO:0005524 | GO:0005524 | GO:0005524 | GO:0005524 | |
| GO:0016887 | GO:0016887 | GO:0016887 | GO:0016887 | |
| GO:0017111 | GO:0017111 | GO:0017111 | GO:0017111 | |
| O32184 | GO:0003824 | |||
| GO:0005488 | ||||
| GO:0016491 | ||||
| Q2Y7W6 | GO:0000156 | GO:0004871 | GO:0004871 | GO:0000155 |
| GO:0004871 | ||||
| Q33CH5 | GO:0003723 | GO:0003723 | GO:0003723 | |
| GO:0003968 | GO:0003968 | GO:0003968 | ||
| Q30U32 | GO:0000156 | GO:0004871 | GO:0000155 | |
| GO:0004871 | ||||
| O93828 | GO:0004585 | GO:0016597 | ||
| GO:0016597 | GO:0016740 | |||
| GO:0016740 | GO:0016743 | |||
| GO:0016743 | ||||
| Q30SN9 | GO:0003824 | |||
| GO:0004252 | ||||
| Q2YTY7 | GO:0003723 | |||
| GO:0009982 | ||||
| O78911 | GO:0008137 | GO:0008137 | GO:0008137 | GO:0008137 |
| GO:0016491 | GO:0016491 | GO:0016491 | GO:0016491 |
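
The two tables above differ only in how the GO terms of annotated neighbours are combined. A minimal sketch of that combination step is given below. It assumes the network is stored as two dictionaries of neighbour sets (one for in-edges, one for out-edges) and that neighbours without annotation are ignored; the thresholds z1 and z2 used in the paper are not modelled, and the function and parameter names are hypothetical.

```python
def predict_functions(protein, in_edges, out_edges, known_go, mode="union"):
    """Predict GO terms for `protein` from its annotated network neighbours.

    in_edges / out_edges: dict mapping a protein to a set of neighbour ids.
    known_go: dict mapping a protein to its set of known GO terms.
    mode: "union" keeps every term seen on any annotated neighbour,
          "intersection" keeps only terms shared by all annotated neighbours.
    """
    neighbours = in_edges.get(protein, set()) | out_edges.get(protein, set())
    # Skip neighbours that are missing from known_go or have no terms.
    annotated = [set(known_go[n]) for n in neighbours if known_go.get(n)]
    if not annotated:
        return set()
    if mode == "union":
        return set().union(*annotated)
    if mode == "intersection":
        return set.intersection(*annotated)
    raise ValueError("mode must be 'union' or 'intersection'")


# Toy example with made-up neighbours A, B, C of a query protein P.
in_edges = {"P": {"A", "B"}}
out_edges = {"P": {"C"}}
known_go = {
    "A": {"GO:0000166", "GO:0005524"},
    "B": {"GO:0005524", "GO:0016887"},
    "C": set(),  # unannotated neighbour, ignored
}
print(sorted(predict_functions("P", in_edges, out_edges, known_go, "union")))
print(sorted(predict_functions("P", in_edges, out_edges, known_go, "intersection")))
```

The union tends to over-predict and the intersection to under-predict, which is consistent with the longer prediction lists in the union table and the empty cells in the intersection table.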
Figure 6. Performance in function prediction based on circular permutations for the top k highest degree proteins, with relationships defined based on circular permutations between pairs of multidomain proteins. (A) Using thresholds z1 ≥ 3, z2 ≥ 0.5. (B) Using thresholds z1 ≥ 3, z2 ≥ 1.
Figure 7. The subnetwork of colon cancer proteins. Red nodes denote known cancer proteins; yellow nodes are the proteins predicted to be associated with colon cancer.
Figure 8. The subnetwork of skin cancer proteins, showing only the Top 200 nodes (ranked by betweenness centrality).
Summary statistics of the circular relationship subnetworks for five cancer types (bone, colon, lung, skin, and breast).
| CANCER TYPE | CANCER PROTEINS | CANCER PROTEINS WITH CIRCULAR RELATIONSHIP(S) TO OTHER CANCER PROTEINS | NON-CANCER PROTEINS | TOTAL NODES | TOTAL EDGES |
|---|---|---|---|---|---|
| bone | 130 | 28 | 3578 | 3708 | 66505 |
| colon | 168 | 43 | 3252 | 3420 | 55630 |
| lung | 131 | 38 | 3400 | 3531 | 53462 |
| skin | 254 | 117 | 13698 | 13952 | 905412 |
| breast | 441 | 143 | 11301 | 11742 | 713063 |
Columns, left to right: number of cancer proteins in the subnetwork; number of cancer proteins that have circular relationship(s) with other cancer proteins; number of non-cancer proteins in the subnetwork; total number of nodes in the subnetwork; total number of edges.
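
The five quantities reported in this table can be computed from an edge list and a set of known cancer proteins. The sketch below is one plausible construction, not necessarily the one used in the paper: it assumes the subnetwork keeps every circular-relationship edge that touches at least one known cancer protein, treats edges as undirected, and drops self-loops. The function name and dictionary keys are hypothetical.

```python
def subnetwork_stats(edges, cancer_proteins):
    """Summary counts for the subnetwork induced around known cancer proteins."""
    seeds = set(cancer_proteins)

    # Keep undirected, de-duplicated edges with at least one cancer endpoint.
    kept = set()
    for u, v in edges:
        if u != v and (u in seeds or v in seeds):
            kept.add(frozenset((u, v)))

    nodes = set().union(*kept) if kept else set()

    # Cancer proteins that are linked to at least one other cancer protein.
    linked = set()
    for edge in kept:
        if edge <= seeds:
            linked |= edge

    return {
        "cancer_proteins": len(nodes & seeds),
        "cancer_with_cancer_link": len(linked),
        "non_cancer_proteins": len(nodes - seeds),
        "total_nodes": len(nodes),
        "total_edges": len(kept),
    }
```

By construction the total node count equals the sum of cancer and non-cancer proteins, which matches the rows of the table (for example, 130 + 3578 = 3708 for bone).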
Top 10 proteins with the highest betweenness centrality values in each of the five cancer subnetworks (bone, colon, lung, skin, and breast cancer).
| RANK | ACCESSION NO | SHORT NAME | PROTEIN NAME (UniProt) | BC (×10⁶) | Z-SCORE | NCCP | IN ATLAS | IN Pub1 | IN Pub2 | IN CT | U |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P53356 | HTK16 | Tyrosine-protein kinase | 0.245 | 15.859 | 4 | No | 0 | 0 | * | |
| 2 | Q3TQJ7 | PAK7 | Serine/threonine-protein kinase PAK 7 | 0.213 | 13.754 | 4 | Yes | 1 | 20 | ||
| 3 | Q7RYZ3 | stk-42 | Serine/threonine protein kinase-42 | 0.185 | 11.972 | 4 | No | 0 | 1 | ||
| 4 | Q5R8U2 | DKFZp469F0413 | Putative uncharacterized protein | 0.167 | 10.781 | 5 | No | 0 | 0 | ||
| 5 | O77440 | HTK98 | Tyrosine-protein kinase | 0.159 | 10.211 | 20 | No | 0 | 0 | ||
| 6 | Q6GQ43 | pik3r1 | MGC80357 protein | 0.159 | 10.211 | 18 | Yes | 0 | 76 | ||
| 7 | Q9N597 | | Deleted (obsolete) | 0.159 | 10.211 | 18 | No | 0 | 0 | ||
| 8 | Q9NHC3 | ced-2 | Cell death abnormality protein 2 | 0.159 | 10.211 | 18 | No | 0 | 1 | ||
| 9 | O62272 | CELE_F58G1.3 | Hypothetical protein | 0.151 | 9.747 | 8 | No | 0 | 0 | ||
| 10 | Q34QW6 | | Deleted (obsolete) | 0.147 | 9.461 | 4 | No | 0 | 0 | ||
| 1 | Q3TQJ7 | PAK7 | Serine/threonine-protein kinase PAK 7 | 0.237 | 25.054 | 4 | Yes | 0 | 20 | * | |
| 2 | Q7RYZ3 | stk-42 | Serine/threonine protein kinase-42 | 0.149 | 15.727 | 4 | No | 0 | 1 | ||
| 3 | O62272 | | Serine/threonine-protein phosphatase | 0.103 | 10.813 | 8 | No | 0 | 0 | ||
| 4 | O14428 | ppt-1 | Serine/threonine-protein phosphatase | 0.094 | 9.882 | 8 | No | 0 | 2 | ||
| 5 | Q6QUV9 | | | 0.081 | 8.432 | 9 | No | 0 | 0 | ||
| 6 | Q4ZTM7 | | Short-chain dehydrogenase/reductase SDR | 0.069 | 7.173 | 9 | No | 0 | 0 | ||
| 7 | Q3TXD4 | Clpb | Putative uncharacterized protein | 0.042 | 4.288 | 22 | No | 0 | 10 | ||
| 8 | O77008 | | Casein kinase II alpha subunit | 0.040 | 4.059 | 21 | No | 0 | 0 | ||
| 9 | Q4DHP2 | | Mitogen-activated protein kinase, putative | 0.040 | 4.059 | 30 | No | 0 | 0 | ||
| 10 | Q5DHJ0 | | SJCHGC09514 protein | 0.040 | 4.059 | 30 | No | 0 | 0 | ||
| 1 | P42686 | SRK1 | Tyrosine-protein kinase isoform | 0.343 | 21.250 | 4 | No | 0 | 1 | ||
| 2 | Q9IAX8 | CYP2P1 | Cytochrome P450 2P1 | 0.282 | 17.469 | 1 | No | 0 | 0 | ||
| 3 | P53356 | HTK16 | Tyrosine-protein kinase | 0.192 | 11.835 | 4 | No | 0 | 0 | ||
| 4 | O77440 | HTK98 | Tyrosine-protein kinase | 0.178 | 10.970 | 18 | No | 0 | 0 | ||
| 5 | Q6GQ43 | pik3r1 | MGC80357 protein | 0.178 | 10.970 | 18 | Yes | 5 | 76 | ||
| 6 | Q9N597 | | Deleted (obsolete) | 0.178 | 10.970 | 18 | No | 0 | 0 | ||
| 7 | Q9NHC3 | ced-2 | Cell death abnormality protein 2 | 0.178 | 10.970 | 18 | No | 0 | 1 | ||
| 8 | Q61125 | Bdkrb1 | B1 bradykinin receptor | 0.138 | 8.487 | 1 | Yes | 1 | 2 | ||
| 9 | P35409 | | Probable glycoprotein hormone G-protein coupled receptor | 0.135 | 8.258 | 1 | No | 0 | 0 | ||
| 10 | O17136 | srx-21 | Protein SRX-21 | 0.135 | 8.258 | 1 | No | 0 | 0 | ||
| 1 | P54591 | yhcG | Uncharacterized ABC transporter ATP-binding protein | 3.292 | 33.507 | 4 | No | 0 | 0 | ||
| 2 | Q7SYD8 | xpnpep2 | Zgc:63528 | 1.958 | 19.849 | 4 | No | 0 | 1 | ||
| 3 | Q8DSW8 | mets | Methionine–tRNA ligase | 1.915 | 19.407 | 1 | Yes | 48 | 340 | * | |
| 4 | Q4TMZ8 | | Deleted (obsolete) | 1.906 | 19.321 | 1 | No | 0 | 0 | ||
| 5 | Q3QD47 | | Deleted (obsolete) | 1.839 | 18.628 | 1 | No | 0 | 0 | ||
| 6 | O74634 | MSM1 | Methionine–tRNA ligase, mitochondrial | 1.802 | 18.252 | 1 | No | 0 | 0 | * | |
| 7 | Q9HMN5 | srp54 | Signal recognition particle 54 kDa protein | 1.792 | 18.154 | 1 | Yes | 0 | 3 | * | |
| 8 | Q62ZT7 | cysK | Cysteine synthase | 1.629 | 16.484 | 1 | No | 0 | 0 | ||
| 9 | Q63KP6 | iles2 | Isoleucine–tRNA ligase 2 | 1.594 | 16.125 | 2 | No | 0 | 0 | ||
| 10 | Q72D59 | DVU_1070 | Branched chain amino acid ABC transporter | 1.566 | 15.838 | 1 | No | 0 | 0 | ||
| 1 | Q5SMW4 | P0568D10.9 | Putative uncharacterized protein P0568D10.9 | 1.872 | 16.366 | 49 | No | 0 | 0 | ||
| 2 | Q5Y2C4 | CDC2 | Cdc2 protein kinase | 1.803 | 15.756 | 49 | Yes | 68 | 3177 | ||
| 3 | Q6XKY3 | hog1 | Mitogen-activated protein kinase hog1 | 1.803 | 15.756 | 49 | No | 0 | 32 | ||
| 4 | Q80YP0 | Cdk3 | Cyclin-dependent kinase 3 | 1.803 | 15.756 | 49 | Yes | 0 | 29 | ||
| 5 | Q53PY9 | Os11g0150700 | Protein kinase domain containing protein, expressed | 1.796 | 15.697 | 49 | No | 0 | 0 | ||
| 6 | Q54QD5 | nek1 | Probable serine/threonine-protein kinase nek1 | 1.796 | 15.697 | 49 | Yes | 0 | 8 | ||
| 7 | Q5AI03 | SPS1 | Likely protein kinase | 1.796 | 15.697 | 49 | Yes | 0 | 7 | ||
| 8 | O04099 | Bcpk1 | Putative serine/threonine protein kinase | 1.787 | 15.619 | 49 | No | 0 | 0 | ||
| 9 | P51956 | NEK3 | Serine/threonine-protein kinase Nek3 | 1.787 | 15.619 | 49 | Yes | 0 | 6 | ||
| 10 | P51957 | NEK4 | Serine/threonine-protein kinase Nek4 | 1.787 | 15.619 | 49 | Yes | 0 | 4 | ||
Abbreviations: BC, betweenness centrality; NCCP, number of connected cancer proteins (of the given cancer type); In Pub1, number of times published in PubMed (with the indicated cancer type); In Pub2, number of times published in PubMed (with any cancer type); In CT, also found for cancer type (B, bone; C, colon; L, lung; R, breast; S, skin); U, described as "unknown", "uncharacterized", "putative", "hypothetical", or "probable" in UniProt.
Note: “*” indicates that the protein was in the list of known cancer proteins from the Cancer Resource dataset.
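
The BC and Z-SCORE columns rank proteins by betweenness centrality. For readers who want to reproduce such a ranking on their own network, here is a compact sketch of Brandes' algorithm for unweighted graphs. It treats the subnetwork as undirected and standardizes scores with the population standard deviation; both choices are assumptions, since the table does not state how the paper handles edge direction or computes the z-score. Function names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adj: dict mapping every node to a set of neighbours (symmetric).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both endpoints.
    return {v: score / 2.0 for v, score in bc.items()}


def z_scores(values):
    """Standardize a dict of scores; returns zeros if all scores are equal."""
    data = list(values.values())
    mu, sd = mean(data), pstdev(data)
    if sd == 0:
        return {k: 0.0 for k in values}
    return {k: (v - mu) / sd for k, v in values.items()}


# Toy path graph a - b - c: only b lies on a shortest path between other nodes.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
scores = betweenness_centrality(adj)
print(scores, z_scores(scores))
```

For networks of the size reported here (thousands of nodes and up to several hundred thousand edges), an optimized graph library would be a more practical choice than this pure-Python loop.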
Figure 9. A 49-node dense subgraph at the center of the skin cancer subnetwork. See also Figure 8.