Leonid Zaslavsky, Stacy Ciufo, Boris Fedorov, Tatiana Tatusova.
Abstract
BACKGROUND: Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. Several complexities are associated with the data: sampling density varies greatly, since human pathogens are densely sampled while other bacteria are less well represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. To extract useful information from these complex data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy.
Keywords: Cluster; Clustering; Core-periphery; Data mining; Knowledge discovery; Microbial; Multiresolution; Multiscale; Parallel computing; Parallel processing; Procaryotic; Protein
Year: 2016 PMID: 27586436 PMCID: PMC5009818 DOI: 10.1186/s12859-016-1112-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Local genomic neighborhood of the protein cluster containing the GTP-binding protein LepA (elongation factor) in Salmonella
Fig. 2 Parts of the clade tree around Salmonella, Bacillus and Streptococcus
Statistics for the most abundant clades. The information for all 131 abundant clades is provided in Additional file 1: Table S1
| Clade Id | Taxonomic content | No. annotated genomes | No. nonclonal annotated genomes | No. protein coding regions | No. protein sequences | No. conservative in-clade clusters |
|---|---|---|---|---|---|---|
| 19668 | Escherichia, Shigella | 2277 | 929 | 3303114 | 310023 | 3894 |
| 19507 | Acinetobacter | 749 | 280 | 774670 | 133653 | 3034 |
| 19252 | Helicobacter pylori | 309 | 216 | 254806 | 191419 | 1244 |
| 20139 | Enterococcus genus | 242 | 155 | 306721 | 33249 | 2106 |
| 20104 | Streptococcus genus | 347 | 139 | 163066 | 61589 | 1394 |
| 20137 | Enterococcus genus | 300 | 139 | 309061 | 45809 | 2314 |
| 19669 | Salmonella, Citrobacter | 638 | 134 | 478093 | 112833 | 3940 |
| 19672 | Enterobacter, Escherichia, Klebsiella | 350 | 132 | 593750 | 84168 | 4726 |
| 19537 | Pseudomonas | 229 | 118 | 622138 | 100992 | 5511 |
| 21194 | Vibrio | 271 | 118 | 433416 | 150390 | 4015 |
| 19400 | Neisseria genus | 204 | 109 | 162808 | 29688 | 1596 |
| 19988 | Staphylococcus aureus | 3827 | 108 | 235562 | 43260 | 2309 |
| 20122 | Streptococcus agalactiae | 285 | 103 | 165898 | 17943 | 1704 |
| 19671 | Enterobacter, Lelliottia | 80 | 70 | 229896 | 102783 | 3476 |
| 20021 | Bacillus | 101 | 70 | 250224 | 101171 | 3919 |
| 20103 | Streptococcus suis | 92 | 69 | 97200 | 48055 | 1541 |
| 19543 | Pseudomonas | 108 | 68 | 219354 | 114229 | 3551 |
| 19270 | Campylobacter jejuni | 97 | 63 | 85618 | 29112 | 1444 |
| 20116 | Streptococcus mutans | 165 | 62 | 100740 | 28671 | 1672 |
| 19993 | Staphylococcus genus | 92 | 59 | 114655 | 23197 | 2014 |
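The uneven sampling density mentioned in the abstract can be read directly off the table by comparing the nonclonal genome count with the annotated genome count for each clade. The following is a minimal sketch using a few rows copied from the table; the `nonclonal_fraction` helper and the ratio itself are illustrative assumptions, not a metric defined in the source.

```python
# Rows copied from the table above:
# (clade id, taxonomic content, annotated genomes, nonclonal annotated genomes)
clades = [
    (19668, "Escherichia, Shigella", 2277, 929),
    (19252, "Helicobacter pylori", 309, 216),
    (19988, "Staphylococcus aureus", 3827, 108),
    (20021, "Bacillus", 101, 70),
]


def nonclonal_fraction(annotated, nonclonal):
    """Share of annotated genomes counted as nonclonal.

    Hypothetical helper for illustration; the source does not define this ratio.
    """
    return nonclonal / annotated


fractions = {name: nonclonal_fraction(a, n) for _, name, a, n in clades}

# Print clades from the most redundant sampling to the least.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1]):
    print(f"{name}: {frac:.1%}")
```

A low fraction suggests a densely sampled clade in which many assemblies are near-duplicates; in these rows, Staphylococcus aureus stands out with 108 nonclonal genomes out of 3827 annotated.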
Summary of in-clade clustering for abundant clades
| Statistic | Value |
|---|---|
| No. abundant clades | 131 |
| No. protein coding regions encoding complete proteins | 19,740,968 |
| No. non-identical protein sequences | 7,604,425 |
| No. clustroids | 1,566,371 |
| No. clustroids of conservative in-clade clusters | 351,881 |
| No. protein coding regions encoding complete proteins represented by clustroids of conservative in-clade clusters | 14,612,418 |
| No. seed global clusters | 144,415 |
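The summary counts are easier to interpret as ratios. The sketch below, which uses only the numbers from the summary table, computes how much of the data the conservative in-clade clusters account for; the dictionary keys and variable names are chosen here for illustration and do not come from the source.

```python
# Counts copied from the summary table above.
summary = {
    "coding_regions": 19_740_968,              # coding regions encoding complete proteins
    "nonidentical_proteins": 7_604_425,
    "clustroids": 1_566_371,
    "conservative_clustroids": 351_881,
    "coding_regions_conservative": 14_612_418,  # represented by conservative clustroids
}

# Share of coding regions represented by clustroids of conservative in-clade clusters.
coverage = summary["coding_regions_conservative"] / summary["coding_regions"]

# Share of all clustroids that belong to conservative in-clade clusters.
conservative_share = summary["conservative_clustroids"] / summary["clustroids"]

# Average number of coding regions per conservative clustroid.
regions_per_clustroid = (
    summary["coding_regions_conservative"] / summary["conservative_clustroids"]
)

print(f"coverage: {coverage:.1%}")
print(f"conservative share of clustroids: {conservative_share:.1%}")
print(f"coding regions per conservative clustroid: {regions_per_clustroid:.1f}")
```

By these counts, roughly 22% of the clustroids (those of conservative in-clade clusters) represent about 74% of the protein coding regions.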