Bas E Dutilh, Berend Snel, Thijs J G Ettema, Martijn A Huynen.
Abstract
Gene content has been shown to contain a strong phylogenetic signal, yet its use for phylogenetic questions is hampered by horizontal gene transfer and parallel gene loss and has until now required completely sequenced genomes. Here, we introduce an approach that allows the phylogenetic signal in gene content to be applied to any set of sequences, using signature genes for phylogenetic classification. The hundreds of publicly available genomes allow us to identify signature genes at various taxonomic depths, and we show how the presence of signature genes in an unspecified sample can be used to characterize its taxonomic composition. We identify 8,362 signature genes specific for 112 prokaryotic taxa. We show that these signature genes can be used to address phylogenetic questions on the basis of gene content in cases where classic gene content or sequence analyses provide an ambiguous answer, such as for Nanoarchaeum equitans, and even in cases where complete genomes are not available, such as for metagenomics data. Cross-validation experiments leaving out up to 30% of the species show that approximately 92% of the signature genes correctly place the species in a related clade. Analyses of metagenomics data sets with the signature gene approach are in good agreement with the previously reported species distributions based on phylogenetic analysis of marker genes. In summary, signature genes can complement traditional sequence-based methods in addressing taxonomic questions.
Year: 2008 PMID: 18492663 PMCID: PMC2464742 DOI: 10.1093/molbev/msn115
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Figure. Definition of signature genes based on a partially unresolved phylogeny. For every species, presence (1) or absence (0) of 3 genes (OGs) is indicated. In this example, only OG1 is a signature for clade A, as it is present in clade A1, clade A2, and clade A3, but not in clade B. Although OG2 and OG3 are present in more species within clade A, neither is a signature for clade A: OG2 is not present in clade A1, and OG3 is present outside of clade A.
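The rule illustrated in this caption can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the species names, the presence sets, and the function name `is_signature` are invented for illustration, and "present in a daughter clade" is assumed to mean present in at least one of its species.

```python
# Hypothetical toy data mirroring the caption: clade A has three daughter
# clades (A1, A2, A3); clade B lies outside clade A. Species names are invented.
daughters_of_A = {
    "A1": {"sp1", "sp2"},
    "A2": {"sp3", "sp4", "sp5"},
    "A3": {"sp6"},
}
outside_A = {"sp7", "sp8"}  # clade B

# Species in which each orthologous group (OG) is present.
presence = {
    "OG1": {"sp1", "sp3", "sp6"},                       # one hit per daughter clade
    "OG2": {"sp3", "sp4", "sp5", "sp6"},                # missing from A1
    "OG3": {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6", "sp7"},  # also in clade B
}

def is_signature(species_with_og, daughters, outside):
    """True if the OG occurs in every daughter clade and nowhere outside."""
    in_every_daughter = all(species_with_og & members
                            for members in daughters.values())
    absent_outside = not (species_with_og & outside)
    return in_every_daughter and absent_outside

signatures = [og for og, species in presence.items()
              if is_signature(species, daughters_of_A, outside_A)]
print(signatures)
```

Note that OG2 and OG3 each occur in more species of clade A than OG1 does, yet only OG1 passes both conditions.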
Figure. Numbers of signature genes identified in prokaryotic taxa. The unresolved phylogeny is based on a superalignment tree (Ciccarelli et al. 2006) where we collapsed nodes with a bootstrap value lower than 80% and removed the Eukaryota. Several node names used in this paper are indicated with gray boxes. Branch widths and colors indicate the number of signature genes found for each node (see legend).
Statistics of All Signature Genes Identified, the Signature Genes with a Coverage Score Cutoff of 0.75 and Perfect Signature Genes (coverage = 1.00)
| | Taxa with Signatures | Number of Signatures | Average Coverage Score |
| Signatures | 112 | 8,362 | 0.80 |
| Signatures (coverage ≥0.75) | 106 | 6,177 | 0.94 |
| Perfect signatures | 98 | 4,342 | 1.00 |
Sensitivity, Specificity, Precision, and Accuracy of the Signature Gene Method
| Number of Species Left Out | 1 | 16 (10%) | 32 (20%) | 48 (30%) |
| Sensitivity | 77.6 | 76.3 | 74.1 | 71.5 |
| Specificity | 98.9 | 98.9 | 98.7 | 98.7 |
| Precision | 93.1 | 92.9 | 92.0 | 91.7 |
| Accuracy (true/all) | 95.6 | 95.3 | 94.7 | 94.1 |
NOTE.—Results are based on several cross-validation analyses, leaving out 1 or 10%, 20%, or 30% of the species (averages of 100 experiments) from the data set and identifying signature genes in the removed genomes.
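For reference, the four measures in this table are conventionally computed from confusion-matrix counts as sketched below. The sketch assumes the standard textbook definitions (the accuracy formula matches the surviving "true/all" label); the function name and the counts are invented and are not the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard definitions, returned as percentages.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives.
    """
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "precision": 100 * tp / (tp + fp),
        "accuracy": 100 * (tp + tn) / (tp + fp + tn + fn),
    }

# Invented counts, for illustration only.
metrics = classification_metrics(tp=80, fp=5, tn=900, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

Under these definitions, a method can combine high precision with lower sensitivity, as in the table: most reported placements are correct, while some true placements are missed.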
Figure. The number of signature genes, perfect signature genes (coverage score 1), and signature genes with a coverage score cutoff of 0.75 found with increasing numbers of completely sequenced genomes. The genomes are added one by one, in order of appearance (according to www.ncbi.nlm.nih.gov/genomes). Initially, the number of signature genes increases almost linearly with the appearance of more genomes. The 60th genome, that of Streptomyces avermitilis, completes the signature-rich Streptomyces clade (Streptomyces coelicolor was the fourth genome) and causes a sharp jump in the number of both perfect and normal signature genes.
Signature Genes Shared by Several Species and Potential Sister Clades
| Species | Clade | o/e ratio | Shared Signature Genes |
| | Bacteria | 60/0 | 60 COGs |
| | Acidobacteria/Proteobacteria | 1/0 | COG3034 |
| | Alpha-/Beta-/Gamma-/Epsilonproteobacteria | 4.36 | COG3302, NOG13261, NOG09591–NOG17096 |
| | Alpha-/Beta-/Gammaproteobacteria | 0.56 | COG4618, COG5611 |
| | Helicobacteraceae (Epsilonproteobacteria) | 500 | NOG18902 |
| | Rickettsiales (Alphaproteobacteria) | 1/0 | NOG07928 |
| | Beta-/Gammaproteobacteria | 1,000 | COG4969 |
| | Archaea | 368.42 | COG1423, COG1458, COG1503, COG1517, COG1730, COG2112, COG4831 |
| | Crenarchaeota | 1/0 | COG4353 |
| | Sulfolobus (Crenarchaeota) | 1,000 | NOG18904 |
| | Methanosarcina (Euryarchaeota) | 500 | NOG09683 |
| | Bacteria | 33,500 | 67 COGs |
| | Lactobacillales (Firmicutes) | 83.33 | NOG17664 |
| | Mycoplasmataceae ex. | 1/0 | NOG19254–NOG36375 |
| | Treponema (Spirochaetales) | 1/0 | NOG17678 |
| | Alpha-/Beta-/Gamma-/Epsilonproteobacteria | 3.51 | COG2992, COG3713, NOG11181 |
| | Alpha-/Beta-/Gammaproteobacteria | 0.47 | COG4797, NOG18514 |
| | Pasteurellaceae ex. | 1,000 | NOG09881 |
| | Vibrionaceae/Pasteurellaceae/Enterobacteriaceae (Gammaproteobacteria) | 1.20 | COG2926 |
| | Methanosarcina (Euryarchaeota) | 1,000 | NOG22419 |
| | Archaea | 9,000 | 114 COGs, COG1591–NOG14885, COG3353–NOG29648, COG4023–NOG17603, NOG39364–NOG10118 |
| | Euryarchaeota | 5/0 | COG1422, COG1777, COG2150, COG3390, COG1711–NOG33052 |
| | Archaeoglobus/Methanosarcina (Euryarchaeota) | 1,500 | COG4749, COG4885, COG5427 |
| | Methanosarcina (Euryarchaeota) | 1,000 | NOG06067, NOG17658, NOG15033 |
| | Methanococcales/ | 1/0 | COG3363 |
| | Pyrococcus ex. | 1/0 | NOG24228 |
| | Leptospira (Spirochaetaceae) | 1/0 | NOG15034 |
| | Actinobacteridae | 76.92 | COG5282 |
| | Mycobacterium (Actinobacteridae) | 166.67 | NOG20057 |
| | Streptomyces (Actinobacteridae) | 83.33 | NOG36090, NOG15774 |
| | Cyanobacteria | 181.82 | COG4250, COG5524 |
| | Alpha-/Beta-/Gammaproteobacteria | 0.56 | COG3205, COG4538 |
| | | 43.48 | COG3743 |
| | Archaea | 67/0 | 66 COGs, NOG21880 |
| | Euryarchaeota | 2/0 | COG1311, COG1933 |
| | Methanosarcina (Euryarchaeota) | 1/0 | NOG11162 |
| | Pyrococcus (Euryarchaeota) | 1/0 | NOG17563 |
| | Bacteria | 60/0 | 60 COGs |
| | Clostridia (Firmicutes) | 200 | NOG22606 |
| | Archaea | 352.94 | COG1031, COG1184, COG1635, COG1992, COG3374, COG5014 |
| | Pyrococcus (Euryarchaeota) | 1/0 | NOG13536 |
| | Pyrococcus ex. | 1/0 | NOG23777 |
NOTE.—In some cases, no shared signature genes were found in the 1,000 randomized genome sets (e.g., o/e ratio 1/0). OGs that are linked with a hyphen were merged because they are homologous and have a nonoverlapping taxon distribution (see Methods). For the species names and clades see fig. 2.
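One way to read the o/e column is as the observed number of shared signature genes divided by the number expected from the randomized genome sets. The sketch below assumes that the expectation is the mean count over the 1,000 randomizations; this is an interpretation of the note, not a restatement of the Methods, and the function name and example counts are hypothetical.

```python
def oe_ratio(observed, random_counts):
    """Observed/expected ratio as a string.

    Assumption: expected = mean number of shared signature genes over the
    randomized genome sets. When no randomized set yields a shared signature
    gene, the ratio is undefined and is written as 'observed/0'.
    """
    expected = sum(random_counts) / len(random_counts)
    if expected == 0:
        return f"{observed}/0"
    return f"{observed / expected:,.2f}"

# Hypothetical counts over 1,000 randomized genome sets.
never_seen = [0] * 1000             # no random set shares a signature gene
seen_twice = [1, 1] + [0] * 998     # expected = 2 / 1000 = 0.002

print(oe_ratio(1, never_seen))
print(oe_ratio(1, seen_twice))
```

Under this reading, ratios well above 1 indicate more shared signature genes than the randomizations produce, and ratios below 1 indicate fewer.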
Figure. Phylogenetic distribution of 3 metagenomics data sets (Venter et al. 2004; Tringe et al. 2005). Pies (a–c) show the total numbers of signature genes found for each clade (including subclades); pies (d–f) show the percentages of the total number of signature genes that exist for each clade; pies (g–i) show the percentages of sequences found with several phylogenetic markers in the original publications (averages of all measurements; taxa that were not in the reference tree are not shown). According to the phylogenetic marker-based analyses, all 3 metagenomics data sets were highly dominated by bacterial signature genes (farm soil: 72%; sea: 78%; and whale fall: 70%), whereas archaeal signature genes were present in much lower percentages (farm soil: 0.05%; sea: 0.6%; and whale fall: 0.1%). These phylogenetically less informative clades are not shown in the charts. This analysis is based on STRING 6.3 OGs because the mapping of the metagenomics data sets was only available for that version (kindly provided by C. von Mering).
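The difference between the raw counts in pies (a–c) and the normalized values in pies (d–f) can be illustrated as follows. This is a sketch under the assumption that normalization means dividing the signature genes found for a clade by the number of signature genes that exist for that clade; the clade names and all numbers are invented.

```python
# Invented numbers: signature genes detected in a sample, and the total
# number of signature genes that exist for each (hypothetical) clade.
found = {"CladeX": 120, "CladeY": 30, "CladeZ": 6}
existing = {"CladeX": 600, "CladeY": 50, "CladeZ": 60}

def percent_of_existing(found, existing):
    """Per clade: signature genes found as a percentage of those that exist."""
    return {clade: 100 * found[clade] / existing[clade] for clade in found}

pct = percent_of_existing(found, existing)

# Rank clades by normalized value rather than by raw count.
for clade, value in sorted(pct.items(), key=lambda kv: -kv[1]):
    print(f"{clade}: {found[clade]} found, {value:.0f}% of existing")
```

In this toy example the clade with the most raw hits is not the one with the highest normalized value, which is why the two sets of pies can look different for the same sample.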