Shibu Yooseph, Weizhong Li, Granger Sutton.
Abstract
BACKGROUND: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets contain organisms with varying GC composition, codon usage biases, etc., which makes gene identification challenging. The vast amount of sequence data also requires faster protein family classification tools.
Year: 2008 PMID: 18402669 PMCID: PMC2362130 DOI: 10.1186/1471-2105-9-182
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow chart of the incremental clustering method.
Sensitivity and Specificity of gene identification using the incremental clustering method.
| Genome | Domain | GC% | Sn | Sp |
|---|---|---|---|---|
| Acaryochloris marina MBIC11017 | B | 47.2 | 70 | 96.4 |
| Acidobacteria bacterium Ellin345 | B | 58.3 | 68.3 | 95.9 |
| Acidiphilium cryptum JF-5 | B | 67.9 | 80.9 | 84 |
| Acinetobacter baumannii ATCC 17978 | B | 38.9 | 80.6 | 95.5 |
| Alcanivorax borkumensis SK2 | B | 54.7 | 84.7 | 97.7 |
| Bacteroides vulgatus ATCC 8482 | B | 42.2 | 73.3 | 97 |
| Burkholderia thailandensis E264 | B | 67.2 | 81.1 | 87.5 |
| Caldivirga maquilingensis IC-167 | A | 43 | 67.3 | 97.8 |
| Candidatus Methanoregula boonei 6A8 | A | 54.5 | 67.1 | 95.6 |
| Candidatus Pelagibacter ubique HTCC1062 | B | 29.6 | 98.1 | 98 |
| Fervidobacterium nodosum Rt17-B1 | B | 34.9 | 76.3 | 97.1 |
| Francisella tularensis subsp. holarctica | B | 32.1 | 83.2 | 87.7 |
| Hyperthermus butylicus DSM 5456 | A | 53.7 | 61.3 | 94 |
| Lactobacillus salivarius UCC118 | B | 32.9 | 78.4 | 93 |
| Methanococcus aeolicus Nankai-3 | A | 30 | 73.8 | 97.8 |
| Staphylothermus marinus F1 | A | 35.7 | 63.8 | 96.7 |
| Thermofilum pendens Hrk 5 | A | 57.6 | 63.9 | 97.4 |
| Average | | | 74.8 | 94.7 |
A-Archaea, B-Bacteria, Sn-Sensitivity, Sp-Specificity.
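The final row of the table (74.8, 94.7) appears to be the arithmetic mean of the Sn and Sp columns over the 17 genomes. The following minimal sketch recomputes it from the values transcribed above; the names `ROWS` and `column_means` are illustrative and not from the source.

```python
# (Sn, Sp) pairs transcribed from the table rows above, in row order.
ROWS = [
    (70.0, 96.4), (68.3, 95.9), (80.9, 84.0), (80.6, 95.5), (84.7, 97.7),
    (73.3, 97.0), (81.1, 87.5), (67.3, 97.8), (67.1, 95.6), (98.1, 98.0),
    (76.3, 97.1), (83.2, 87.7), (61.3, 94.0), (78.4, 93.0), (73.8, 97.8),
    (63.8, 96.7), (63.9, 97.4),
]


def column_means(rows):
    """Return the mean sensitivity and mean specificity, rounded to one decimal."""
    n = len(rows)
    mean_sn = sum(sn for sn, _ in rows) / n
    mean_sp = sum(sp for _, sp in rows) / n
    return round(mean_sn, 1), round(mean_sp, 1)


print(column_means(ROWS))  # (74.8, 94.7)
```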
Figure 2. Percentage of Unrelated Pairs in Clusters. For all clusters, only Pfam match-containing sequences were considered. For the top curve (labeled All), all clusters with at least two Pfam match-containing sequences were considered, whereas for the second curve (labeled At least 5), only those clusters with at least five Pfam match-containing sequences were considered. For the latter curve, 94% of the reported clusters have no unrelated pairs. The bottom two curves show the trends for the "strict" version of unrelatedness.
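As a rough illustration of the quantity plotted in this figure, the sketch below computes the percentage of unrelated pairs among the annotated sequences of one cluster. The caption does not define when two sequences count as related, so the relatedness test is passed in by the caller; the `share_a_domain` predicate and the domain labels `D1`, `D2`, `D3` are purely hypothetical stand-ins, not the authors' criterion.

```python
from itertools import combinations


def percent_unrelated_pairs(sequences, related):
    """Percentage of sequence pairs in a cluster for which related(a, b) is False.

    sequences: annotated (Pfam match-containing) members of one cluster.
    related:   caller-supplied predicate; its definition is an assumption here.
    """
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0  # fewer than two annotated sequences: no pairs to judge
    unrelated = sum(1 for a, b in pairs if not related(a, b))
    return 100.0 * unrelated / len(pairs)


def share_a_domain(a, b):
    """Hypothetical relatedness rule: two sequences share at least one domain."""
    return bool(set(a) & set(b))


# Toy cluster: each sequence is a tuple of placeholder domain labels.
toy_cluster = [("D1",), ("D1", "D2"), ("D3",)]
print(percent_unrelated_pairs(toy_cluster, share_a_domain))
```

In the toy cluster, two of the three pairs share no domain, so about 66.7% of pairs are unrelated under this stand-in rule.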
Figure 3. Number of clusters that domain architectures appear in. For the bottom curve (labeled All), all domain architectures were considered, whereas for the top curve, a domain architecture is considered as appearing in a cluster only if it has at least five instances in that cluster. In both cases, nearly 61% of domain architectures appear in a single cluster, and over 80% of domain architectures appear in at most 3 clusters.
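The counting described in this caption can be sketched as follows, assuming each cluster is given as a list with one domain architecture per sequence. The function names, the `min_instances` parameter, and the toy architectures are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter, defaultdict


def clusters_per_architecture(cluster_archs, min_instances=1):
    """Map each domain architecture to the number of clusters it appears in.

    cluster_archs: dict of cluster id -> list of architectures (one per sequence).
    min_instances: an architecture counts as appearing in a cluster only if it
                   occurs at least this many times there (5 for the top curve).
    """
    counts = defaultdict(int)
    for archs in cluster_archs.values():
        for arch, n in Counter(archs).items():
            if n >= min_instances:
                counts[arch] += 1
    return dict(counts)


def fraction_in_at_most(counts, k):
    """Fraction of architectures that appear in at most k clusters."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c <= k) / len(counts)


# Toy input with placeholder architectures (tuples of domain labels).
toy = {
    "c1": [("A",)] * 5 + [("B", "C")],
    "c2": [("A",), ("B", "C")],
    "c3": [("D",)] * 2,
}
all_counts = clusters_per_architecture(toy)
strict_counts = clusters_per_architecture(toy, min_instances=5)
print(all_counts, strict_counts, fraction_in_at_most(all_counts, 1))
```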
Cluster size distribution and the distribution of sequences in these clusters
| Cluster size | Clusters | Sequences | Non-redundant sequences |
|---|---|---|---|
| 2–4 | 208,096 | 794,592 | 521,898 |
| 5–9 | 43,453 | 428,469 | 273,694 |
| 10–19 | 15,584 | 346,415 | 206,188 |
| 20–49 | 4,053 | 234,338 | 143,438 |
| 50–99 | 4,641 | 547,862 | 331,773 |
| 100–199 | 3,546 | 870,406 | 491,229 |
| 200–499 | 2,600 | 1,381,135 | 806,560 |
| 500–999 | 961 | 1,133,749 | 669,420 |
| 1,000–1,999 | 698 | 1,768,532 | 1,002,815 |
| ≥2,000 | 665 | 5,220,484 | 2,909,845 |
| Total | 284,297 | 12,725,982 | 7,356,860 |
The size of a cluster is defined as the number of non-redundant sequences in it.
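The Total row can be checked by summing each column, and the same rows give the number of clusters at or above each size threshold, which is the quantity on the y-axis of a cumulative size plot. This is a minimal sketch; it assumes the first numeric column is the number of clusters in each size bin, and the names `BINS`, `TOTAL`, `column_sums`, and `clusters_at_least` are illustrative.

```python
# (bin label, lower size bound, column 2, column 3, column 4) from the table above.
BINS = [
    ("2-4", 2, 208_096, 794_592, 521_898),
    ("5-9", 5, 43_453, 428_469, 273_694),
    ("10-19", 10, 15_584, 346_415, 206_188),
    ("20-49", 20, 4_053, 234_338, 143_438),
    ("50-99", 50, 4_641, 547_862, 331_773),
    ("100-199", 100, 3_546, 870_406, 491_229),
    ("200-499", 200, 2_600, 1_381_135, 806_560),
    ("500-999", 500, 961, 1_133_749, 669_420),
    ("1,000-1,999", 1000, 698, 1_768_532, 1_002_815),
    (">=2,000", 2000, 665, 5_220_484, 2_909_845),
]
TOTAL = (284_297, 12_725_982, 7_356_860)


def column_sums(bins):
    """Sum the three numeric columns over all size bins."""
    return tuple(sum(row[i] for row in bins) for i in (2, 3, 4))


def clusters_at_least(bins):
    """Map each bin's lower bound to the number of clusters of at least that size.

    Assumes column 2 holds the per-bin cluster count.
    """
    result = {}
    running = 0
    for row in reversed(bins):
        running += row[2]
        result[row[1]] = running
    return result


print(column_sums(BINS) == TOTAL)
print(clusters_at_least(BINS))
```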
Figure 4. Log-Log Plot of Cluster Size Distribution. The x-axis is the logarithm of the cluster size C and the y-axis is the logarithm of the number of clusters of size ≥C. Logarithms are in base 10. The blue curve is the observed data, which is consistent with a power law. There is an inflection point around C = 2500 (a value of 3.4 on the x-axis). The two red lines are the least-squares fits to C ≤ 2500 and C > 2500, respectively. The former line is y = -0.733*x + 5.517, with R² = 0.995, and the latter line is y = -1.686*x + 8.813, with R² = 0.992.
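Because both axes are base-10 logarithms, a fitted line y = a*x + b corresponds to a power law N(≥C) = 10^b * C^a. The sketch below evaluates the two fitted lines quoted in the caption as a piecewise function and computes where they cross; the function names are illustrative, and switching lines exactly at C = 2500 is an assumption taken from the caption's stated fit ranges.

```python
import math

BREAK_C = 2500             # fit boundary quoted in the caption
LOW_FIT = (-0.733, 5.517)  # (slope, intercept) for C <= 2500
HIGH_FIT = (-1.686, 8.813) # (slope, intercept) for C > 2500


def fitted_log_count(cluster_size):
    """log10 of the fitted number of clusters with size >= cluster_size."""
    slope, intercept = LOW_FIT if cluster_size <= BREAK_C else HIGH_FIT
    return slope * math.log10(cluster_size) + intercept


def fitted_count(cluster_size):
    """Fitted number of clusters with size >= cluster_size (power-law form)."""
    return 10 ** fitted_log_count(cluster_size)


def line_intersection():
    """Return (x, C) where the two fitted lines cross, with x = log10(C)."""
    x = (HIGH_FIT[1] - LOW_FIT[1]) / (LOW_FIT[0] - HIGH_FIT[0])
    return x, 10 ** x


print(fitted_log_count(100), fitted_count(100))
print(line_intersection())
```

The two lines cross at x of roughly 3.46, close to the inflection point of about 3.4 reported in the caption.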
Clusters recruiting largest number of PANDA sequences
| CAM_CL_2057 | 20,508 | 24 | Reverse transcriptase (HIV) |
| CAM_CL_1132 | 18,882 | 1,406 | Cytochrome c oxidase subunit I |
| CAM_CL_2568 | 15,405 | 6,091 | ABC transporter |
| CAM_CL_4367 | 15,228 | 771 | Cytochrome b |
| CAM_CL_49 | 14,751 | 7,389 | Short-chain dehydrogenase |
| CAM_CL_3510 | 13,255 | 5,173 | Immunoglobulin |
| CAM_CL_2630 | 13,140 | 3,297 | Envelope glycoprotein |
| CAM_CL_160 | 13,054 | 3,897 | Kinases |
| CAM_CL_4556 | 12,403 | 6,345 | Response regulator |
| CAM_CL_481 | 12,078 | 5,477 | Transcription regulator |
Column 3 hints at the extent of redundancy in the PANDA set.
Clusters recruiting largest number of HOT/ALOHA sequences
| Cluster ID | HOT/ALOHA sequences | Description |
|---|---|---|
| CAM_CL_49 | 562 | Metabolism, short chain dehydrogenase |
| CAM_CL_399 | 368 | Metabolism, Sulfatase |
| CAM_CL_26 | 338 | electron transport, Acyl-CoA dehydrogenase |
| CAM_CL_1239 | 314 | metabolism, AMP-binding enzyme |
| CAM_CL_2568 | 312 | transport, ABC transporter |
| CAM_CL_1581 | 274 | bioluminescence, methanogenesis, Luciferase-like monooxygenase |
| CAM_CL_4294 | 240 | nucleotide-sugar metabolism, NAD dependent epimerase/dehydratase family |
| CAM_CL_1593 | 235 | metabolism, CoA-transferase family III |
| CAM_CL_357 | 227 | Tetratricopeptide repeat |
| CAM_CL_333 | 225 | lignin biosynthesis, Zinc-binding dehydrogenase |
Recent genome projects with protein predictions that fall in Group II clusters.
| Genome project | Protein predictions in Group II clusters |
|---|---|
| Psychroflexus torquis ATCC 700755ᵃ | 110 |
| Cellulophaga sp. MED134ᵃ | 38 |
| Flavobacteriales bacterium HTCC2170ᵃ | 36 |
| Robiginitalea biformata HTCC2501ᵃ | 32 |
| Croceibacter atlanticus HTCC2559ᵃ | 31 |
| Gramella forsetii KT0803 | 31 |
| Leeuwenhoekiella blandensis MED217ᵃ | 31 |
| Flavobacterium johnsoniae UW101 | 29 |
| Polaribacter irgensii 23-Pᵃ | 26 |
| Tenacibaculum sp. MED152ᵃ | 25 |
| Flavobacteria bacterium BBFL7ᵃ | 22 |
| Bacteriophage Syn9 | 18 |
| Microscilla marina ATCC 23134ᵃ | 18 |
| Marine gamma proteobacterium HTCC2080ᵃ | 17 |
| Candidatus Pelagibacter ubique HTCC1002ᵃ | 12 |
| Magnetospirillum magneticum AMB-1 | 12 |
| Marine gamma proteobacterium HTCC2143ᵃ | 10 |
| Prochlorococcus marinus str. MIT 9312 | 10 |
| Alpha proteobacterium HTCC2255ᵃ | 9 |
| Marine gamma proteobacterium HTCC2207ᵃ | 9 |
ᵃ Marine Microbial Genome projects funded by the Gordon and Betty Moore Foundation.