Rajkumar Sasidharan, Tamás Nepusz, David Swarbreck, Eva Huala, Alberto Paccanaro.
Abstract
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and propagate functional annotation seamlessly across periodic genome updates. GFam is a hybrid approach: it first uses a greedy algorithm to chain component domains from the InterPro annotation provided by its 12 member resources, then applies a sequence-based connected component analysis to un-annotated sequence regions. Together, these steps yield a consensus domain architecture for each sequence, and families are subsequently generated from common architectures. For the Arabidopsis proteome, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing the annotation provided by the different InterPro data sources, which offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
Year: 2012 PMID: 22790981 PMCID: PMC3479161 DOI: 10.1093/nar/gks631
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic of the various steps in the GFam pipeline.
Coverage statistics for TAIR9 assignment
| InterPro data source | Total annotated sequences (27 379) | Sequence coverage | Residue coverage |
|---|---|---|---|
| BlastProDom | 425 | 0.0155 | 0.0031 |
| FPrintScan | 3686 | 0.1346 | 0.0285 |
| Gene3D | 12 293 | 0.4490 | 0.2799 |
| HAMAP | 133 | 0.0049 | 0.0035 |
| HMMPIR | 1228 | 0.0449 | 0.0469 |
| HMMPANTHER | 14 973 | 0.5469 | 0.4687 |
| HMMPfam | 20 859 | 0.7619 | 0.4120 |
| HMMSMART | 7809 | 0.2852 | 0.1120 |
| HMMTIGR | 3105 | 0.1134 | 0.0874 |
| PatternScan | 5221 | 0.1907 | 0.0116 |
| ProfileScan | 8798 | 0.3213 | 0.1466 |
| SUPERFAMILY | 15 399 | 0.5624 | 0.4174 |
| All | 22 591 | 0.8251 | 0.6932 |
Sequence coverage from GFam output for the TAIR9 proteome was calculated as the number of sequences having at least one domain divided by the total number of sequences (the number in parentheses in the table header). Residue coverage was calculated as the number of residues covered by at least one domain divided by the total number of residues in all the sequences. GFam_NoFilter describes the coverage provided by GFam when the domain annotation from member resources is taken as is, including coverage provided by novel domains. GFam_NoFilter_No-novel is the same as GFam_NoFilter but excludes coverage from novel domains. GFam_WithFilter describes coverage calculated after applying filters (described in the text). GFam_WithFilter_No-novel is the same as GFam_WithFilter but excludes coverage from novel domains.
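The two coverage definitions in this caption can be illustrated with a short sketch. This is not GFam's own code: the function name `coverage` and the input layout (a mapping from sequence ID to length, and a mapping from sequence ID to 1-based inclusive domain intervals) are assumptions made for the example, and the toy sequences are invented.

```python
def coverage(lengths, domains):
    """Return (sequence_coverage, residue_coverage).

    lengths: dict of sequence ID -> sequence length (hypothetical layout)
    domains: dict of sequence ID -> list of (start, end) domain intervals,
             1-based and inclusive (hypothetical layout)
    """
    total_sequences = len(lengths)
    total_residues = sum(lengths.values())
    annotated_sequences = 0
    covered_residues = 0
    for seq_id, length in lengths.items():
        intervals = domains.get(seq_id, [])
        if not intervals:
            continue  # sequence has no domain at all
        annotated_sequences += 1
        # A set of positions counts each residue once, even where domains overlap.
        covered = set()
        for start, end in intervals:
            covered.update(range(max(start, 1), min(end, length) + 1))
        covered_residues += len(covered)
    return annotated_sequences / total_sequences, covered_residues / total_residues


# Toy data (invented): three sequences, two of which carry domains.
lengths = {"seq1": 100, "seq2": 50, "seq3": 50}
domains = {"seq1": [(1, 40), (31, 60)], "seq2": [(11, 20)]}
seq_cov, res_cov = coverage(lengths, domains)
print(f"sequence coverage = {seq_cov:.4f}, residue coverage = {res_cov:.4f}")
```

Overlapping domains on `seq1` contribute 60 covered residues rather than 70, which is the point of counting residues "covered by at least one domain".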
Coverage statistics for TAIR10 assignment
| InterPro data source | Total annotated sequences (27 416) | Sequence coverage | Residue coverage |
|---|---|---|---|
| BlastProDom | 425 | 0.0155 | 0.0031 |
| FPrintScan | 3687 | 0.1345 | 0.0283 |
| Gene3D | 12 308 | 0.4489 | 0.2796 |
| HAMAP | 145 | 0.0053 | 0.0039 |
| HMMPIR | 1238 | 0.0452 | 0.0472 |
| HMMPanther | 14 998 | 0.5471 | 0.4684 |
| HMMPfam | 20 889 | 0.7619 | 0.4113 |
| HMMSMART | 7828 | 0.2855 | 0.1123 |
| HMMTIGR | 3102 | 0.1131 | 0.0871 |
| PatternScan | 5216 | 0.1903 | 0.0115 |
| ProfileScan | 8821 | 0.3217 | 0.1464 |
| SUPERFAMILY | 15 420 | 0.5624 | 0.4170 |
| All | 22 622 | 0.8251 | 0.6924 |
Sequence coverage from GFam output for the TAIR10 proteome was calculated as the number of sequences having at least one domain divided by the total number of sequences (the number in parentheses in the table header). Residue coverage was calculated as the number of residues covered by at least one domain divided by the total number of residues in all the sequences. GFam_NoFilter describes the coverage provided by GFam when the domain annotation from member resources is taken as is, including coverage provided by novel domains. GFam_NoFilter_No-novel is the same as GFam_NoFilter but excludes coverage from novel domains. GFam_WithFilter describes coverage calculated after applying filters (described in the text). GFam_WithFilter_No-novel is the same as GFam_WithFilter but excludes coverage from novel domains.
Contribution of individual resources to GFam residue coverage for TAIR9 proteome
| InterPro data source | Total domains from InterProScan output | Total domains after GFam | Total residues from domains after GFam | Residue coverage |
|---|---|---|---|---|
| BlastProDom | 434 | 139 | 13 448 | 0.0019 |
| FPrintScan | 19 462 | 483 | 8702 | 0.0013 |
| Gene3D | 17 619 | 26 | 2414 | 0.0003 |
| HAMAP | 133 | 55 | 14 600 | 0.0021 |
| HMMPIR | 1228 | 1163 | 498 641 | 0.0718 |
| HMMPANTHER | 25 216 | 467 | 109 131 | 0.0157 |
| HMMPfam | 36 617 | 8939 | 1 466 666 | 0.2113 |
| HMMSMART | 15 630 | 1965 | 163 298 | 0.0235 |
| HMMTIGR | 7430 | 1625 | 459 902 | 0.0663 |
| PatternScan | 7323 | 56 | 870 | 0.0001 |
| ProfileScan | 19 072 | 4576 | 328 557 | 0.0473 |
| SUPERFAMILY | 22 405 | 16 382 | 3 607 647 | 0.5197 |
| Novel | NA | 1530 | 267 824 | 0.0386 |
| Total | 172 569 | 37 406 | 6 941 700 | |
The number of domains from InterProScan output for each of the 12 resources, the number of domains that were incorporated into the final GFam assignment and their residue coverage.
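As a reading aid, the residue coverage column in the TAIR9 table above appears to equal each source's residue count divided by the Total row (6 941 700), so the column sums to roughly 1. The source does not state this explicitly; it is an inference from the numbers. The sketch below recomputes the column from the residue counts copied from the table, and the variable names are illustrative only.

```python
# Residues assigned to each source after GFam, copied from the TAIR9 table above.
residues_after_gfam = {
    "BlastProDom": 13448,
    "FPrintScan": 8702,
    "Gene3D": 2414,
    "HAMAP": 14600,
    "HMMPIR": 498641,
    "HMMPANTHER": 109131,
    "HMMPfam": 1466666,
    "HMMSMART": 163298,
    "HMMTIGR": 459902,
    "PatternScan": 870,
    "ProfileScan": 328557,
    "SUPERFAMILY": 3607647,
    "Novel": 267824,
}

# Denominator: all residues covered by the final GFam assignment (the Total row).
total = sum(residues_after_gfam.values())

# Each source's share of the GFam-covered residues, rounded as in the table.
shares = {
    source: round(count / total, 4)
    for source, count in residues_after_gfam.items()
}

for source, share in shares.items():
    print(f"{source:12s} {share:.4f}")
```

Under this reading, SUPERFAMILY and HMMPfam together account for most of the residues in the final assignment.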
Contribution of individual resources to GFam residue coverage for TAIR10 proteome
| InterPro data source | Total domains from InterProScan output | Total domains after GFam | Total residues from domains after GFam | Residue coverage |
|---|---|---|---|---|
| BlastProDom | 519 | 139 | 13 468 | 0.0019 |
| FPrintScan | 24 917 | 475 | 8585 | 0.0012 |
| Gene3D | 23 290 | 26 | 2414 | 0.0003 |
| HAMAP | 191 | 59 | 16 304 | 0.0023 |
| HMMPIR | 1700 | 1172 | 503 950 | 0.0726 |
| HMMPANTHER | 33 878 | 472 | 109 076 | 0.0157 |
| HMMPfam | 46 991 | 8933 | 1 467 716 | 0.2114 |
| HMMSMART | 20 682 | 1962 | 164 732 | 0.0237 |
| HMMTIGR | 8660 | 1610 | 459 488 | 0.0662 |
| PatternScan | 9662 | 56 | 867 | 0.0001 |
| ProfileScan | 24 147 | 4630 | 328 655 | 0.0473 |
| SUPERFAMILY | 29 568 | 16 394 | 3 610 190 | 0.5201 |
| Novel | NA | 1546 | 270 894 | 0.0390 |
| Total | 224 205 | 37 474 | 6 956 339 | |
The number of domains from InterProScan output for each of the 12 resources, the number of domains that were incorporated into the final GFam assignment and their residue coverage.
GFam sequence and residue coverage for model genomes
| Species | Sequence coverage (A) | Sequence coverage (B) | Sequence coverage (C) | Residue coverage (A) | Residue coverage (B) | Residue coverage (C) |
|---|---|---|---|---|---|---|
| | 0.6790 | 0.6367 | 0.5915 (HMMPanther) | 0.6497 | 0.6233 | 0.5861 (HMMPanther) |
| | 0.8904 | 0.8767 | 0.8440 (HMMPanther) | 0.8229 | 0.8053 | 0.7672 (HMMPanther) |
| | 0.6979 | 0.6975 | 0.6310 (HMMPfam) | 0.5977 | 0.5974 | 0.5023 (HMMPanther) |
| | 0.6831 | 0.6742 | 0.5884 (HMMPfam) | 0.6638 | 0.6574 | 0.6050 (HMMPanther) |
| | 0.7235 | 0.6955 | 0.6437 (HMMPanther) | 0.6739 | 0.6193 | 0.5851 (HMMPanther) |
| | 0.7020 | 0.6698 | 0.5400 (HMMPfam) | 0.6673 | 0.6370 | 0.5674 (HMMPanther) |
| | 0.7737 | 0.7596 | 0.7276 (HMMPanther) | 0.8089 | 0.8012 | 0.7687 (HMMPanther) |
| | 0.5109 | 0.5109 | 0.4230 (HMMPfam) | 0.5041 | 0.4873 | 0.4165 (HMMPanther) |
Sequence and residue coverage for several model genomes using GFam and the best single-constituent InterPro resource. A, GFam_NoFilter; B, GFam_NoFilter_No-novel and C, best single-constituent resource within InterPro.
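To make the comparison between columns A and C concrete, the following sketch computes the gain of GFam_NoFilter (A) over the best single-constituent resource (C) in percentage points for each row of the table above. The values are copied from the table; species names do not appear in the table as given here, so rows are identified by position only. The helper name `gain_in_points` is illustrative.

```python
# (sequence A, sequence C, residue A, residue C) for each row of the table above.
# Species names are not shown in the table, so rows are identified by position only.
rows = [
    (0.6790, 0.5915, 0.6497, 0.5861),
    (0.8904, 0.8440, 0.8229, 0.7672),
    (0.6979, 0.6310, 0.5977, 0.5023),
    (0.6831, 0.5884, 0.6638, 0.6050),
    (0.7235, 0.6437, 0.6739, 0.5851),
    (0.7020, 0.5400, 0.6673, 0.5674),
    (0.7737, 0.7276, 0.8089, 0.7687),
    (0.5109, 0.4230, 0.5041, 0.4165),
]


def gain_in_points(gfam, best_single):
    """Difference between two coverage fractions, in percentage points."""
    return round((gfam - best_single) * 100, 2)


gains = [
    (gain_in_points(seq_a, seq_c), gain_in_points(res_a, res_c))
    for seq_a, seq_c, res_a, res_c in rows
]

for index, (seq_gain, res_gain) in enumerate(gains, start=1):
    print(f"row {index}: sequence +{seq_gain:.2f} points, residue +{res_gain:.2f} points")
```

In every row, column A exceeds column C for both sequence and residue coverage.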
Figure 2. Schematic of the workflow adopted to transfer curated labels from TAIR9 GFam families to TAIR10 GFam families.