Dongying Wu, Martin Wu, Aaron Halpern, Douglas B Rusch, Shibu Yooseph, Marvin Frazier, J Craig Venter, Jonathan A Eisen.
Abstract
BACKGROUND: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species. METHODOLOGY/PRINCIPAL
Year: 2011 PMID: 21437252 PMCID: PMC3060911 DOI: 10.1371/journal.pone.0018011
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. RecA superfamily clusters.
| Cluster ID | Corresponding Subfamily | Corresponding Group in Lin et al. [53] | Comments | GOS Only | Number of GOS Sequences |
| 1 | RecA | RecA | | | 2830 |
| 11 | RecA-like SAR1 | n/a | Novel | + | 10 |
| 5 | Phage SAR2 | n/a | Novel | + | 68 |
| 4 | Phage UvsX | n/a | | | 73 |
| 2 | Phage SAR1 | n/a | Found in cyanophage by subsequent sequencing | + | 824 |
| 15 | Unknown 1 | | Novel | + | 6 |
| 14 | XRCC3/SpB | Radb-XRCC3 | | | 0 |
| 20 | XRCC3/SpB | Radb-XRCC3 | | | 0 |
| 22 | Rad57 | Radb-XRCC2 | | | 0 |
| 6 | Rad51C | Radb-Rad51C | | | 1 |
| 8 | Rad51B | Radb-Rad51B | | | 2 |
| 10 | Rad51D | Radb-Rad51D | | | 0 |
| 16 | RadB | Radb-RadB | | | 0 |
| 17 | RadB | Radb-RadB | | | 0 |
| 21 | RadB | Radb-RadB | | | 0 |
| 12 | RadB | Radb-RadB | | | 0 |
| 3 | RadA/DMC1/Rad51 | Rada | | | 101 |
| 13 | RadA/DMC1/Rad51 | Rada | | | 0 |
| 9** | Unknown 2 | n/a | Representatives found in Archaea by subsequent sequencing | + | 19 |
| 18 | XRCC2 | Radb-XRCC2 | | | 0 |
| * | RecA | RecA | RecA fragment | + | 29 |
| * | RecA | RecA | RecA fragment | + | 5 |
| * | RecA | RecA | RecA fragment | + | 3 |
A Lek protein clustering method was applied to all RecA superfamily members retrieved from the NRAA database, microbial genomes, and the GOS data set. The 23 clusters containing more than two sequences are listed. Clusters that contain only sequences from the GOS data set are noted as “GOS only.” When a cluster can be mapped to a RecA subfamily identified by Lin et al. [53], the family designation from that paper is shown in column 3.
*These clusters of RecA fragments from the GOS data set were not included in the phylogenetic tree (Figure 1).
**Although cluster 9 contained only GOS sequences at the time of the initial analysis, it was subsequently found to include marine archaeal homologs from more recent genome sequencing projects.
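The filtering described in the table notes (keep clusters with more than two sequences, then flag those made up entirely of GOS sequences) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `summarize_clusters`, the tuple layout, and the source labels (`"GOS"`, `"NRAA"`, `"genome"`) are assumptions, and the toy input is invented for demonstration rather than taken from the paper's data.

```python
from collections import defaultdict


def summarize_clusters(assignments, min_size=3):
    """Summarize cluster membership.

    assignments: iterable of (cluster_id, accession, source) tuples, where
    source is a hypothetical label such as "GOS", "NRAA", or "genome".
    Returns a dict keyed by cluster_id for clusters with at least
    min_size members ("more than two sequences" means min_size=3).
    """
    sources_by_cluster = defaultdict(list)
    for cluster_id, _accession, source in assignments:
        sources_by_cluster[cluster_id].append(source)

    summary = {}
    for cluster_id, sources in sources_by_cluster.items():
        if len(sources) < min_size:
            continue  # too small to report
        gos_count = sum(1 for s in sources if s == "GOS")
        summary[cluster_id] = {
            "size": len(sources),
            "gos_count": gos_count,
            # "GOS only" means every member came from the GOS data set
            "gos_only": gos_count == len(sources),
        }
    return summary


# Invented toy input (not data from the paper)
toy = [
    ("A", "seq1", "GOS"), ("A", "seq2", "GOS"), ("A", "seq3", "GOS"),
    ("B", "seq4", "NRAA"), ("B", "seq5", "GOS"), ("B", "seq6", "genome"),
    ("C", "seq7", "GOS"), ("C", "seq8", "GOS"),
]
result = summarize_clusters(toy)
```

In this toy input, cluster `A` is reported as GOS only, cluster `B` is reported with one GOS sequence, and cluster `C` is dropped because it has only two members.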
Figure 1. Phylogenetic tree of the RecA superfamily.
All RecA sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial analysis (RecA-like SAR1, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in the text, sequences from two Archaea that were released after our initial analysis group within the Unknown 2 subfamily.
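For readers unfamiliar with bootstrap support values, the resampling step behind them can be sketched as below. This is only an illustration of the general nonparametric bootstrap (alignment columns drawn with replacement); PHYML performs its own resampling internally, and the function name `bootstrap_alignment` and the toy alignment are hypothetical, not taken from the paper.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence string.
    Columns are sampled with replacement until the replicate has the
    same number of columns as the original alignment.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Draw column indices with replacement
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


# Invented toy alignment (not data from the paper)
toy_alignment = {"seqA": "MKV-LA", "seqB": "MRVQLA", "seqC": "MKI-LG"}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
```

A tree would then be inferred from each replicate, and the support for a clade is the fraction of replicate trees in which that clade appears.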
Table 2. Genes linked to sequences in the novel RecA subfamilies.
| Subfamily | RecA Accession | Accession of Linked Gene | Assembly ID | Neighboring Gene Description | Taxonomy Assignment |
| Phage-SAR1 | 1096700853217 | 1096700853219 | 1096627374158 | gp43 | Viruses/Phages |
| Phage-SAR1 | 1096701673303 | 1096701673301 | 1096627382978 | T4-like DNA polymerase | Viruses/Phages |
| Phage-SAR1 | 1096701673303 | 1096701673305 | 1096627382978 | T4-like DNA primase-helicase | Viruses/Phages |
| Phage-SAR2 | 1096697847133 | 1096697847135 | 1096627014936 | GDP-mannose 4,6-dehydratase | Bacteria |
| Phage-SAR2 | 1096697847133 | 1096697847149 | 1096627014936 | methyltransferase FkbM | Bacteria |
| Unknown2 | 1096695533559 | 1096695533561 | 1096528150039 | ATP-dependent helicase | Archaea |
| Unknown2 | 1096698308433 | 1096698308421 | 1096627021375 | ATP-dependent RNA helicase | Archaea |
| Unknown2 | 1096698308433 | 1096698308423 | 1096627021375 | replication factor A | Archaea |
| Unknown2 | 1096698308433 | 1096698308425 | 1096627021375 | S-adenosylmethionine synthetase | Bacteria |
| Unknown2 | 1096698308433 | 1096698308427 | 1096627021375 | cobalt-precorrin-6A synthase | Archaea |
| Unknown2 | 1096698308433 | 1096698308429 | 1096627021375 | NADH ubiquinone dehydrogenase | Bacteria |
| Unknown2 | 1096698308433 | 1096698308431 | 1096627021375 | CbiG protein | Bacteria |
| Unknown2 | 1096698308433 | 1096698308443 | 1096627021375 | ATP-binding protein of ABC transporter | Bacteria |
| Unknown2 | 1096698308433 | 1096698308435 | 1096627021375 | chaperone protein dnaJ | Eukaryota |
| Unknown2 | 1096698308433 | 1096698308445 | 1096627021375 | small nuclear riboprotein protein snRNP | Archaea |
| Unknown2 | 1096699819041 | 1096699819039 | 1096627295379 | S-adenosylmethionine synthetase | Bacteria |
| Unknown2 | 1096699819041 | 1096699819043 | 1096627295379 | replication factor A | Bacteria |
| Unknown2 | 1096699819041 | 1096699819047 | 1096627295379 | snRNP Sm-like protein | Archaea |
| Unknown2 | 1096686533379 | 1096686533339 | 1096627390330 | ATP-dependent helicase | Archaea |
| Unknown2 | 1096686533379 | 1096686533341 | 1096627390330 | deoxyribodipyrimidine photolyase-related | Bacteria |
| Unknown2 | 1096686533379 | 1096686533343 | 1096627390330 | Glycyl-tRNA synthetase alpha2 dimer | Archaea |
| Unknown2 | 1096686533379 | 1096686533345 | 1096627390330 | RNA-binding protein | Bacteria |
| Unknown2 | 1096686533379 | 1096686533347 | 1096627390330 | cobyrinic acid a,c-diamide synthase | Archaea |
| Unknown2 | 1096686533379 | 1096686533349 | 1096627390330 | deoxyribodipyrimidine photolyase | Archaea |
| Unknown2 | 1096686533379 | 1096686533351 | 1096627390330 | DNA primase small subunit | Archaea |
| Unknown2 | 1096686533379 | 1096686533353 | 1096627390330 | cobalt-precorrin-6A synthase | Archaea |
| Unknown2 | 1096686533379 | 1096686533355 | 1096627390330 | cobalamin biosynthesis CbiG | Bacteria |
| Unknown2 | 1096686533379 | 1096686533359 | 1096627390330 | DNA primase large subunit | Archaea |
| Unknown2 | 1096686533379 | 1096686533361 | 1096627390330 | aldo/keto reductase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533365 | 1096627390330 | AP endonuclease | Archaea |
| Unknown2 | 1096686533379 | 1096686533369 | 1096627390330 | ATP-dependent helicase | Archaea |
| Unknown2 | 1096686533379 | 1096686533371 | 1096627390330 | translation initiation factor 2 alpha subunit | Archaea |
| Unknown2 | 1096686533379 | 1096686533373 | 1096627390330 | translation initiation factor 2 alpha subunit | Archaea |
| Unknown2 | 1096686533379 | 1096686533375 | 1096627390330 | sirohydrochlorin cobaltochelatase CbiXL | Bacteria |
| Unknown2 | 1096686533379 | 1096686533377 | 1096627390330 | glutamate racemase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533383 | 1096627390330 | glycosyl transferase | Eukaryota |
| Unknown2 | 1096686533379 | 1096686533387 | 1096627390330 | deoxyribodipyrimidine photolyase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533389 | 1096627390330 | AP endonuclease | Archaea |
| Unknown2 | 1096686533379 | 1096686533393 | 1096627390330 | cbiC protein | Archaea |
| Unknown2 | 1096686533379 | 1096686533399 | 1096627390330 | deoxyribodipyrimidine photolyase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533405 | 1096627390330 | cob(I)alamin adenosyltransferase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533407 | 1096627390330 | Phosphohydrolase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533409 | 1096627390330 | glycyl-tRNA synthetase | Archaea |
| Unknown2 | 1096686533379 | 1096686533415 | 1096627390330 | 30S ribosomal protein S6 | Archaea |
| Unknown2 | 1096686533379 | 1096686533421 | 1096627390330 | nuclease | Archaea |
| Unknown2 | 1096686533379 | 1096686533423 | 1096627390330 | phosphohydrolase | Bacteria |
| Unknown2 | 1096686533379 | 1096686533427 | 1096627390330 | cobalt-precorrin-3 methylase | Archaea |
| Unknown2 | 1096686533379 | 1096686533429 | 1096627390330 | universal stress family protein | Bacteria |
| Unknown2 | 1096686533379 | 1096686533473 | 1096627390330 | aryl-alcohol dehydrogenases related oxidoreductases | Eukaryota |
| Unknown2 | 1096686533379 | 1096686533505 | 1096627390330 | snRNP Sm-like protein Chain A | Eukaryota |
| Unknown2 | 1096689280551 | 1096689280549 | 1096627650434 | S-adenosylmethionine synthetase | Bacteria |
| RecA-like SAR1 | 1096683378299 | 1096683378297 | 1096627289467 | DNA polymerase III alpha subunit | Bacteria |
| Unknown1 | 1096694953057 | 1096694953059 | 1096520459783 | FKBP-type peptidyl-prolyl cis-trans isomerase | Archaea |
| Unknown1 | 1096665977449 | 1096665977451 | 1096627520210 | single-stranded DNA binding protein | Viruses/Phages |
| Unknown1 | 1096682182125 | 1096682182127 | 1096628394294 | DNA polymerase I | Bacteria |
Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits.
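One way to read the table above is to tally the taxonomy assignments of the neighboring genes on each assembly. The sketch below does this for two assemblies, using rows copied from the table; the function name `tally_taxonomy` and the tuple layout are assumptions made for illustration, and the tally is not an analysis reported by the authors.

```python
from collections import Counter, defaultdict

# (assembly ID, neighboring gene description, taxonomy assignment),
# copied from the table above for two assemblies
rows = [
    ("1096627021375", "ATP-dependent RNA helicase", "Archaea"),
    ("1096627021375", "replication factor A", "Archaea"),
    ("1096627021375", "S-adenosylmethionine synthetase", "Bacteria"),
    ("1096627021375", "cobalt-precorrin-6A synthase", "Archaea"),
    ("1096627021375", "NADH ubiquinone dehydrogenase", "Bacteria"),
    ("1096627021375", "CbiG protein", "Bacteria"),
    ("1096627021375", "ATP-binding protein of ABC transporter", "Bacteria"),
    ("1096627021375", "chaperone protein dnaJ", "Eukaryota"),
    ("1096627021375", "small nuclear riboprotein protein snRNP", "Archaea"),
    ("1096627014936", "GDP-mannose 4,6-dehydratase", "Bacteria"),
    ("1096627014936", "methyltransferase FkbM", "Bacteria"),
]


def tally_taxonomy(rows):
    """Count taxonomy assignments of neighboring genes per assembly."""
    tallies = defaultdict(Counter)
    for assembly_id, _description, taxonomy in rows:
        tallies[assembly_id][taxonomy] += 1
    return tallies


tallies = tally_taxonomy(rows)
for assembly_id, counts in tallies.items():
    print(assembly_id, dict(counts))
```

Mixed tallies such as the first assembly's (Archaea and Bacteria in equal numbers, plus one Eukaryota) show why a single neighbor's closest hit is weak evidence of origin on its own.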
Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily Unknown 2).
This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel recA homolog from the Unknown 2 subfamily (cluster ID 9).
Figure 3. Phylogenetic tree of the RpoB superfamily.
All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored panels.
Table 3. RpoB subfamilies.
| Cluster ID | Corresponding Subfamily | Comments | GOS Only? | Number of GOS Sequences |
| 7 | Bacteria and Plastids | | | 1602 |
| 4 | Bacteria and Plastids | | | 0 |
| 12 | Bacteria and Plastids | | | 0 |
| 8 | Unknown 1 | | + | 4 |
| 6 | “Killer Plasmids” | | | 0 |
| 17 | Rpa2/Rpb2/Rpc2/Archaea | Includes most eukaryotic (nuclear) and archaeal superfamily members | | 181 |
| 2 | Rpa2 | | | 0 |
| 14 | Archaea | | | 0 |
| 3 | Unknown 2 | | + | 3 |
| 13 | Pox Viruses | | | 0 |
| * | n/a | Partial sequences likely from bacteria | + | 6 |
| * | n/a | Partial sequences likely from bacteria | + | 2 |
| * | n/a | Partial sequences likely from eukaryotes | + | 4 |
| * | n/a | Partial sequences likely from eukaryotes | + | 4 |
| * | n/a | Partial sequences likely from eukaryotes | + | 3 |
| * | n/a | Partial sequences likely from eukaryotes | + | 5 |
| 5** | n/a | Not analyzed further because only two representatives identified | + | 2 |
A Lek clustering method was applied to all RpoB superfamily members retrieved from the NRAA database, microbial genome projects, and the GOS data set. Clusters that contain only sequences from the GOS data set are marked with a “+” in the “GOS Only?” column.
*Clusters 1, 9, 10, 11, 15, and 16 contain only sequence fragments from the GOS data set; though possibly novel they were omitted from further analysis.
**Cluster 5 contains only two sequences. Though both are from the GOS (IDs 1096695464231 and 1096681823525) and may represent a novel RpoB subfamily, this group was excluded from further analysis because we restricted analyses to groups with three or more sequences.