Qicheng Ma, Gung-Wei Chirn, Richard Cai, Joseph D Szustakowski, N R Nirmala.
Abstract
BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30,000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12,000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.
Year: 2005 PMID: 16202129 PMCID: PMC1261163 DOI: 10.1186/1471-2105-6-242
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic view of family-based clustering. A typical example of the clustering of three protein families, denoted by the three oval outlines. Family I consists of protein sequences 1 and 2. Family II consists of protein sequences 3, 4, and 5. Family III consists of protein sequences 6 and 7. Domain A is common to families I and II, while Domain B is common to families II and III.
Figure 2. Distribution of InterPro family size. The distribution of the InterPro families used in the benchmarking dataset, based on the number of members in each family. There are 102 singleton InterPro families, and the largest InterPro family in the benchmarking dataset is the Rhodopsin-like GPCR superfamily, with 1058 protein sequences.
Figure 3. Definition of various clustering parameters. The mapping of three generated clusters, denoted by oval outlines in different colors, onto an InterPro family, denoted by a rectangle. A cluster can be mapped to an InterPro family only if more than 50% of the cluster members belong to that InterPro family; otherwise it is declared an orphan cluster. Protein sequences within an oval outline but outside the rectangle are false positives. Protein sequences within both the oval outline and the rectangle are true positives. Protein sequences wholly within the grey rectangle are false negatives.
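The mapping rule described for Figure 3 can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `map_cluster` and `count_outcomes`, the set-based inputs, and the toy identifiers are all hypothetical. It also assumes that a cluster is matched against the family with the largest overlap, and that family members left in orphan clusters count as false negatives.

```python
def map_cluster(cluster, families):
    """Return the ID of the family holding more than 50% of the cluster, else None.

    cluster  -- set of protein identifiers in one generated cluster
    families -- dict mapping family ID -> set of member protein identifiers
    """
    best_id, best_overlap = None, 0
    for fam_id, members in families.items():
        overlap = len(cluster & members)
        if overlap > best_overlap:
            best_id, best_overlap = fam_id, overlap
    # "More than 50%" is strict: exactly half is not enough (assumed reading).
    if best_id is None or 2 * best_overlap <= len(cluster):
        return None  # orphan cluster
    return best_id


def count_outcomes(clusters, families):
    """Count true positives, false positives and false negatives per family."""
    counts = {fam: {"tp": 0, "fp": 0, "fn": 0} for fam in families}
    covered = {fam: set() for fam in families}
    orphans = 0
    for cluster in clusters:
        fam = map_cluster(cluster, families)
        if fam is None:
            orphans += 1
            continue
        hits = cluster & families[fam]
        counts[fam]["tp"] += len(hits)                    # inside oval and rectangle
        counts[fam]["fp"] += len(cluster - families[fam])  # inside oval, outside rectangle
        covered[fam] |= hits
    for fam, members in families.items():
        # Family members not captured by any mapped cluster.
        counts[fam]["fn"] = len(members - covered[fam])
    return counts, orphans


# Toy data (hypothetical identifiers, not taken from the benchmark).
families = {"F1": {"p1", "p2", "p3", "p4", "p5"}}
clusters = [{"p1", "p2", "x1"}, {"p3", "x2", "x3"}, {"p4", "x4"}]
counts, orphans = count_outcomes(clusters, families)
print(counts, orphans)
```

In the toy data only the first cluster maps to `F1` (two of its three members belong to the family); the other two clusters are orphans because at most half of their members belong to it.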
Specificity, sensitivity, goodness, cluster number, and orphan cluster values at different cutoff values on the benchmarking dataset.
| Cutoff | Specificity | Sensitivity | Goodness | Number of clusters | Number of orphan clusters |
| 0.20 | 97.11% | 98.60% | 78.20% | 1706 | 201 |
| 0.22 | 97.37% | 98.70% | 78.00% | 1742 | 180 |
| 0.25 | 97.61% | 98.70% | 77.60% | 1786 | 157 |
| 0.29 | 97.85% | 98.70% | 76.90% | 1837 | 133 |
| 0.33 | 98.06% | 98.90% | 76.30% | 1896 | 107 |
| 0.40 | 98.43% | 99.00% | 75.00% | 1972 | 79 |
| 0.50 | 98.70% | 99.10% | 72.60% | 2073 | 59 |
Figure 4. Specificity, sensitivity, and goodness on the benchmarking dataset. Sensitivity and goodness for CLUGEN and MCL at various specificities. At higher specificities, the sensitivity of both methods increases, whereas the goodness of both methods decreases. This is expected because higher specificities are achieved via stricter parameter thresholds that yield more clusters overall and fewer large clusters. Performance for both methods is comparable in this range, with CLUGEN performing better at lower specificities and MCL performing better at higher specificities.
Figure 5. The number of generated clusters and orphan clusters on the benchmarking dataset. Total clusters and orphan clusters for CLUGEN and MCL at various specificities. With stricter parameter thresholds, overall specificity and the total number of clusters increase for both methods. The larger number of small clusters at higher specificities leads to a reduction in the number of orphan clusters in both methods.
Top 50 InterPro superfamilies/domains that have been mapped to clusters with one-to-one correspondence.
| InterPro family/Domain ID | Type | Number of proteins in the benchmark dataset | Description |
| IPR001128 | Family | 507 | Cytochrome P450 |
| IPR000685 | Family | 398 | Ribulose bisphosphate carboxylase, large chain |
| IPR002198 | Family | 290 | Short-chain dehydrogenase/reductase SDR |
| IPR004000 | Family | 255 | Actin/actin-like |
| IPR002423 | Family | 226 | Chaperonin Cpn60/TCP-1 |
| IPR001023 | Family | 221 | Heat shock protein Hsp70 |
| IPR002085 | Family | 181 | Zinc-containing alcohol dehydrogenase superfamily |
| IPR000173 | Family | 177 | Glyceraldehyde 3-phosphate dehydrogenase |
| IPR001175 | Family | 169 | Neurotransmitter-gated ion-channel |
| IPR000910 | Family | 169 | HMG1/2 (high mobility group) box |
| IPR001353 | Family | 147 | 20S proteasome, A and B subunits |
| IPR000894 | Family | 141 | Ribulose bisphosphate carboxylase, small chain |
| IPR000298 | Family | 135 | Cytochrome c oxidase, subunit III |
| IPR001019 | Family | 135 | Guanine nucleotide binding protein (G-protein), alpha subunit |
| IPR000568 | Family | 134 | H+-transporting two-sector ATPase, A subunit |
| IPR001400 | Family | 133 | Somatotropin hormone |
| IPR000883 | Family | 131 | Cytochrome c oxidase, subunit I |
| IPR001364 | Family | 131 | Hemagglutinin, HA1/HA2 chain |
| IPR00970 | Family | 130 | Secreted growth factor Wnt protein |
| IPR001664 | Family | 127 | Intermediate filament protein |
| IPR000847 | Domain | 127 | Bacterial regulatory protein, LysR |
| IPR001659 | Family | 124 | Phycobilisome protein |
| IPR001694 | Family | 123 | Respiratory-chain NADH dehydrogenase, subunit 1 |
| IPR001811 | Family | 119 | Small chemokine, interleukin-8 like |
| IPR000215 | Family | 118 | Proteinase inhibitor I4, serpin |
| IPR001926 | Family | 114 | Pyridoxal-5'-phosphate-dependent enzyme, beta subunit |
| IPR000515 | Family | 113 | Binding-protein-dependent transport systems inner membrane component |
| IPR001424 | Family | 112 | Copper/Zinc superoxide dismutase |
| IPR001804 | Family | 111 | Isocitrate/isopropylmalate dehydrogenase |
| IPR001691 | Domain | 109 | Glutamine synthetase, catalytic domain |
| IPR000934 | Domain | 105 | Metallophosphoesterase |
| IPR001189 | Family | 105 | Manganese and iron superoxide dismutase |
| IPR001041 | Domain | 105 | Ferredoxin |
| IPR001099 | Family | 104 | Naringenin-chalcone synthase |
| IPR001450 | Domain | 102 | 4Fe-4S ferredoxin, iron-sulfur binding domain |
| IPR001427 | Family | 102 | Pancreatic ribonuclease |
| IPR000484 | Family | 100 | Photosynthetic reaction centre protein |
| IPR000954 | Family | 98 | Aminotransferase class-III |
| IPR001576 | Family | 93 | Phosphoglycerate kinase |
| IPR000230 | Family | 93 | Ribosomal protein S12, bacterial and chloroplast form |
| IPR002068 | Domain | 91 | Heat shock protein Hsp20 |
| IPR001750 | Domain | 90 | NADH/Ubiquinone/plastoquinone (complex I) |
| IPR000836 | Domain | 90 | Phosphoribosyltransferase |
| IPR001993 | Family | 90 | Mitochondrial substrate carrier |
| IPR001236 | Family | 85 | Lactate/malate dehydrogenase |
| IPR002210 | Family | 83 | Papillomavirus major capsid L1 (late) protein |
| IPR001395 | Family | 81 | Aldo/keto reductase |
| IPR000943 | Family | 80 | Sigma-70 factor |
| IPR002226 | Family | 80 | Catalase |
| IPR001766 | Domain | 80 | Fork head transcription factor |
Figure 6. A schematic view of a pairwise alignment. A pairwise alignment between two sequences. The aligned regions of the two sequences are highlighted, and their boundaries are marked by the arrows.
Figure 7. The architecture of the neural network. The neural network has three layers and is fully connected, although not all connections are drawn in the figure, for simplicity. The first layer is the input layer, consisting of 25 input features. The hidden layer in the middle has 4 nodes. The output layer has one output node.
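The 25-4-1 layout described for Figure 7 can be illustrated with a short forward pass in Python. This is only a sketch: the caption gives the layer sizes but not the activation function, the use of bias terms, or the trained weights, so the sigmoid activation, the biases, and the randomly initialised weights below are all assumptions, and the names `forward`, `w_hidden`, and `w_out` are hypothetical.

```python
import math
import random

N_INPUTS, N_HIDDEN = 25, 4  # layer sizes from the Figure 7 caption


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Fully connected forward pass: 25 inputs -> 4 hidden nodes -> 1 output."""
    if len(features) != N_INPUTS:
        raise ValueError(f"expected {N_INPUTS} features, got {len(features)}")
    # Every hidden node receives every input feature (fully connected).
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # The single output node receives every hidden activation.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Placeholder weights (assumed; the trained values are not given in the source).
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b_out = 0.0

score = forward([0.5] * N_INPUTS, w_hidden, b_hidden, w_out, b_out)
print(score)
```

With a sigmoid on the output node the result lies strictly between 0 and 1, which fits a probability-like score, but that interpretation is part of the assumption rather than something the caption states.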
Figure 8. Transitive homology between sequence a and sequence b through a third sequence c. The homology between sequence a and sequence b can be detected with P(a,b) = 0.72 by the transitive sequence homology.