Literature DB >> 24030592

Reconstructing an ancestral genotype of two hexachlorocyclohexane-degrading Sphingobium species using metagenomic sequence data.

Naseer Sangwan¹, Helianthous Verma¹, Roshan Kumar¹, Vivek Negi¹, Simon Lax², Paramjit Khurana³, Jitendra P Khurana³, Jack A Gilbert², Rup Lal¹.

Abstract

Over the last 60 years, the use of hexachlorocyclohexane (HCH) as a pesticide has resulted in the production of >4 million tons of HCH waste, which has been dumped in open sinks across the globe. Here, the combination of the genomes of two genetic subspecies (Sphingobium japonicum UT26 and Sphingobium indicum B90A; isolated from two discrete geographical locations, Japan and India, respectively) capable of degrading HCH, with metagenomic data from an HCH dumpsite (∼450 mg HCH per g soil), enabled the reconstruction and validation of the last-common ancestor (LCA) genotype. Mapping the LCA genotype (3128 genes) to the subspecies genomes demonstrated that >20% of the genes in each subspecies were absent in the LCA. This includes two enzymes from the 'upper' HCH degradation pathway, suggesting that the ancestor was unable to degrade HCH isomers, but descendants acquired lin genes by transposon-mediated lateral gene transfer. In addition, anthranilate and homogentisate degradation traits were found to be strain (selectively retained only by UT26) and environment (absent in the LCA and subspecies, but prevalent in the metagenome) specific, respectively. One draft secondary chromosome, two near complete plasmids and eight complete lin transposons were assembled from the metagenomic DNA. Collectively, these results reinforce the elastic nature of the genus Sphingobium, and describe the evolutionary acquisition mechanism of a xenobiotic degradation phenotype in response to environmental pollution. This also demonstrates for the first time the use of metagenomic data in ancestral genotype reconstruction, highlighting its potential to provide significant insight into the development of such phenotypes.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2013 PMID： 24030592 PMCID： PMC3906814 DOI： 10.1038/ismej.2013.153

Source DB: PubMed Journal: ISME J ISSN： 1751-7362 Impact factor: 10.302

Introduction

Hexachlorocyclohexane (HCH) was one of the most popular pesticides for the control of agricultural pests and vector borne diseases during the latter half of the 20th century, and has subsequently become a key global pollutant culminating in the creation of open sinks with extremely high concentrations of HCH in the soil (for example, ∼450 mg g−1) (Jit ). HCH is prepared by the chlorination of benzene in the presence of UV, resulting in the formation of a technical HCH mixture, primarily containing five stable HCH isomers: α (60–70%), β (5–12%), γ (10–12%), δ (6–10%) and ɛ (3–4%) (Lal ). Among these, only γ–HCH, also called lindane, has insecticidal properties and purification of one ton of γ–HCH results in 9–12 tons of HCH waste consisting primarily of α- and β-HCH. During the past six decades, ∼600 000 tons of γ-HCH had been generated; yielding 4–7 million tons of HCH waste (Lal ). Apart from being carcinogenic, endocrine disruptors and neurotoxins (Kalantzi ), these HCH isomers especially β and δ-HCH are highly persistent in the environment (Vega ). The first aerobic bacterium that could degrade lindane (γ-HCH), Sphingobium japonicum UT26, was isolated in 1989 from an HCH contaminated upland experimental field in Japan (Senoo and Wada, 1989). This was followed, in 1990, by the isolation of yet another sphingomonad, S. indicum B90A, from Indian rice rhizosphere soil repeatedly treated with technical HCH; this strain could degrade even the most recalcitrant HCH-isomer, β-HCH (Sahu ). Over the past two decades both UT26 and B90A have served as models to explore the genetics and biochemistry of HCH-isomer degradation (Kumari ; Dogra ; Suar ; Sharma ; Nagata ), and it is now well established that the degradation of HCH isomers in sphingomonads is mediated through enzymes encoded by lin genes (Lal ). Among the lin genes, linA, linB and linC, which encode HCH dehydrochlorinase (LinA), haloalkane dehalogenase (LinB) and dehydrogenase (LinC), are responsible for the dehydrochlorination, hydroxylation and dehydrogenation of HCH isomers or their intermediates, respectively. A recent metagenomic survey of a highly contaminated HCH dumpsite revealed (Supplementary Figures S1 and S2) the expected enrichment of Sphingomonadaceae, as well as lin genes, plasmids, and transposons over a gradient of increasing HCH contamination (Sangwan ). To elucidate the ecology and evolution of the HCH-degrading phenotype in this ecosystem, we have reconstructed an in situ validated ancestral minimal genotype and provide evidence for the lateral gene transfer of xenobiotic degradative genes under the selective pressure of HCH pollution. This demonstrates for the first time the application of metagenomic data to ancestral genotype reconstruction, and highlights the potential of this approach to elucidate ecological genotype development, including demarcation between environmental and strain-specific traits.

Materials and methods

Soil and strain

Soil samples were collected from an open sink of HCH situated at Ummari village, Lucknow, India (27° 00′ 24.7′′ N, 81° 08′ 57.8′′ E) (Jit ). As described earlier (Sangwan ), subsamples (500 g soil per subsample) were collected at a depth range of 10–20 cm on 17 September 2011. Soil composition, sampling coordinates, methods and physicochemical analysis were performed as explained earlier (Sangwan ). Laboratory strain of Sphingobium indicum B90A was used for the experiment.

DNA extraction and sequencing

Total DNA was isolated from each soil subsample (50 subsamples) using the PowerMax Soil DNA Isolation Kit (MO-BIO, Carlsbad, CA, USA). Equal concentrations (5 μg) of the environmental DNA from each subsample was mixed to form a homogenous composite genetic pool. DNA concentrations were quantified using NanoDrop spectrophotometer (NanoDrop Technologies Inc, Wilmington, DE, USA). Cell pellets from 5 ml pure culture of S. indicum B90A were subjected to genomic DNA isolation using QIAamp DNA Mini Kit (Qiagen, Venlo, Netherlands). DNA concentrations were quantified using NanoDrop spectrophotometer (NanoDrop Technologies Inc). Paired-end reads were generated for genomic (3 μg μl−1) and metagenomic DNA (100 ng μl−1) using the 454 GS FLX Titanium (insert length=2000 bp±10, average read length: 320 bp±49) and the Illumina Genome Analyzer (San Diego, CA, USA) (insert length: 140 bp±6 and average read length, 75 bp±5) platforms, respectively. Illumina raw sequence data were subjected to various quality measures using the Seq-trim (Falgueras ) pipeline and pyrosequencing output was analyzed using Roche 454 Analysis software version 2.0 (Branford, CT, USA). Reads below average score Q20, minimum length=50 bp -Illumina and 250 bp -pyrosequencing, and ambiguous bases (including N) were not included in any downstream analysis.

De-novo assembly of genome and metagenome

S. indicum B90A genome sequence reads (paired-end) were assembled into contigs using ABySS (Simpson ) set at k-mer length of 41. Metagenomic reads (Table 1) were assembled by Metavelvet (Namiki ) at k-mer length of 31, insert length=200 bp and expected coverage=auto. BAMBUS-2 (Koren ) was used to generate metagenomic scaffolds. Assemblies (genome and metagenome) were validated and coverage was calculated by aligning the raw data (paired-end) back to the contigs using RTG Investigator (http://www.realtimemetagenomics.com). Any consensus mismatch between contig sequence and overlapping reads was considered as misassembly. Contigs from the previous metagenomic survey (Sangwan ) were also used (minimum identity=95% and minimum overlap=50%) to bridge gaps between paired-end assembled metagenome contigs from this study.

Table 1

Sequencing and assembly statistics

Category	Sphingobium indicum B90A	Metagenome (Illumina data)
Sequence data	450 Mbp (320 bp)	1.6 Gbp (75 bp)
Reads after quality filtration	42 044 290	20 111 630
No. of contigs	149 (>500 bp)	1 216 300 (>200 bp)
Average contig coveragea	80	3
Reads used in assembly	33 635 432 (80%)	1 608 930(8%)
N50b	95 Kbp	253 bp
Max. contig size	∼253 Kbp	∼3 Kbp
GC content	65	61

Abbreviation: N50 is the length of the smallest contig in a bin that contains largest contigs representing at least half of the total assembly length.

Calculated by mapping of reads to contigs with the criteria of 95% identity over 90% of the read length.

Based on minimum contig length criteria, that is, genome ⩾500 bp and metagenome ⩾200 bp.

Gene calling and annotations

Genes were predicted for S. indicum B90A draft genome using Glimmer-3 (Delcher ). Metagenome contigs (minimum length=200 bp) were annotated at various hierarchy levels (individual enzymes, protein families, subsystem and cellular processes) using BLASTX (Altschul ) (E-value=10−5) against COG (Tatusov ), Pfam (Bateman ) and KEGG (Kanehisa ) databases. KEGG annotations were used in comparison against previous metagenomic predictions and hierarchical clustering was performed (n=1000) on the resultant matrix (sample versus category) using the pvclust package in R (R Development Core team, 2009). To taxonomically characterize and quantify the present genotypes in our metagenome data, individual metagenome reads were mapped over RefSeq (Pruitt ) database (release 45, complete microbe) using RTG Investigator (http://www.realtimemetagenomics.com) (sequence identity cutoff=80%). Metagenomic 16S rRNAs were extracted (minimum length ∼150 bp) from contigs using BLASTN (Altschul ) and compared against SILVA SSU rRNA database (Pruesse ) using RTG investigator. SSU-align (Nawrocki and Eddy, 2010) was used to construct 16S rRNA domain (V3, V4 and V6) specific models corresponding to the bacterial 16S rRNA secondary structures and implementation of these domains in soil metagenome analysis (Vasileiadis ). Metagenomic 16S rRNA sequences were aligned against these models and alignments were manually checked for any sequence repetition. Alignment files from RefSeq analysis were processed with SAMtools (Li ) and total number of hits per reference sequence was calculated. Mapping results were further quantified into genera specific relative abundance using taxonomical information from their accession numbers. Pseudogenes were predicted in S. indicum B90A draft genome using Psi-Fi perl script (Guindon and Gascuel, 2003) and by manually checking (minimum identity mismatch=20%) the alignments for frame shift mutations. Pseudogene predictions were validated for sequencing errors by aligning the sequence reads back to the coding sequences (CDSs).

ANI and tetra nucleotide frequency calculations

Average nucleotide identity (ANI) values were calculated as explained in Konstantinidis and Tiedje (2005). All possible pairwise comparisons were performed between the S. indicum B90A draft genome (Anand ) and the available Sphingobium genomes (chromosomes and plasmids). S. japonicum UT26 (Nagata ), S. chlorophenolicum L-1 (Copley ) and Sphingobium sp. SYK-6 (Masai ) are the reference genomes of the organisms included in this study. Whole-genome-based tetranucleotide correlations were calculated using TETRA (Teeling ). Euclidian distance matrix was constructed from ANI and tetranucleotide comparisons and hierarchical clustering was performed on the resultant matrix. These analyses were performed in R (R Development Core team, 2009) using ade (Dray and Dufour, 2007), pvclust and gclus packages.

Metagenomic recruitment of the genus Sphingobium

Metagenomic recruitment plots were generated for all available four genomes (chromosomes and plasmids) of the genus Sphingobium with MUMmer (Kurtz ). Genomic coordinates covered (sequence identity and query coverage cutoff; 25% and 25% of the region, respectively) by metagenome reads tilling were selected as metagenomic islands (here after described as MGIs) (Steffen ). Contigs from this study and previous dumpsite metagenome survey (Sangwan ) were reassembled using Minimus (Sommer, 2007) at default parameter. We binned the metagenome contigs corresponding to reference genotypes (genus Sphingobium) using tetranucleotide frequencies (correlation cutoff (R2)=0.9) and %GC criterion (Tyson ). BLASTN (Altschul ) and MEGAN (Huson ) were used to confirm the genetic identities of the contigs. ANI and tetranucleotide correlation values were also calculated for the ‘meta-Sphingobium' assembly (contigs) against complete Sphingobium reference genomes used in this study.

Identification and analysis of genomic and MGIs

Following the two step process to predict the complete genomic island profile, SIGI-HMM (Waack ) algorithm (sensitivity value=0.7) was used to predict the potential genomic islands in ‘MGI free' (without MGIs) regions of the two genetic subspecies (ANI >98%). Genomic Islands were further annotated (BLASTX; E-value=10−5) against COG (Tatusov ), KEGG (Kanehisa ) and a local database created using protein CDSs from all Sphingobium genomes used in this study.

Reconstruction and comparative genomics of the ancestor genotype

To reconstruct the ancestor genotype of HCH-degrading subspecies, we used the complete genomes (chromosomes and plasmids) of S. indicum B90A and S. japonicum UT26 (ANI value of subspecies level ≥98% and both can degrade HCH isomers). As various Sphingobium spp. are known to carry the catabolic genes (lin genes) involved in HCH degradation, on their plasmids and secondary chromosomes (Dogra ; Nagata ; Tabata ) we included the same of reference organisms for ancestor genotype estimation. Core genomic regions of S. indicum B90A and S. japonicum UT26 were separated from genomic islands using coordinates from metagenomic recruitment plots and SIGI-HMM predictions. Homologous anchors were computed in two subspecies using Murasaki (Popendorf ) with seed weight-30 and seed length-40. Pre-computed homologous anchors were treated as input to predict orthologous segments using OSfinder algorithm (Hachiya ). OSfinder predicts orthologous segments (syntenic) using Markov chain models and machine learning techniques. The advantage of using OSfinder lies in its automatic optimization of parameters to build markov models that increases the accuracy via minimizing the errors usually caused by setting the parameters manually (Pevzner and Tesler, 2003). Orthologous regions (minimum ancestor genotype) were re-annotated with same strategy as explained above. Hypothetical proteins in the ancestor genotype were compared against the protein database of all available Sphingobium genomes. MEGABLAST (Zhang ) was used to check the homologs of predicted foreign gene sequences (horizontally transferred) in ancestor genotype. Ancestral genotype (gene content) was mapped (BLASTN, E-value=10−10) against genome sequences of S. indicum B90A (Anand ), S. japonicum UT26 (Nagata ) and ‘meta-Sphingobium' assembly. Hierarchical clustering (minimum relative abundance=0.8% and s.d. cutoff=0.4%) was also performed on metagenome BLASTX results (Altschul ) (E-value=10−5) analysis against KEGG (Kanehisa ) database, using MeV 4.4 (Saeed ) with Euclidean distance and Kendall's tau matrices and average linkage clustering.

Detection of genes under positive selection

Orthologous proteins from S. indicum B90A and S. japonicum UT26 were subsequently aligned, in a pairwise fashion using the CLUSTAL W algorithm (Thompson ). The corresponding nucleotide sequences of these alignments were later aligned, codon by codon, using the pal2nal script (Suyama ). Yn00 module of the PAML package (Yang, 2007) was used to calculate the dN/dS ratio for each pair of proteins. For proteins potentially involve in the degradation of phenol, toluene, chlorophenol, anthranilate, homogentisate and HCH, quality processed (quality score cutoff: Q15) dumpsite metagenome (pyrosequence data) reads were used to calculate dN/dS ratio using methodology as explained earlier (Tai ).

Graph-based clustering and characterization of repetitive elements in genome and metagenome

Paired-end reads from this study, S. indicum B90A genome (insert size=2000±10 bp) dumpsite metagenome (insert size=140±6 bp) and pyrosequence reads from previous metagenome survey (Sangwan ) were processed separately for de-novo identification and characterization of repetitive elements using RepeatExplorer pipeline (http://galaxy.umbr.cas.cz:8080). Briefly, assembly was performed using CAP3 (Huang and Madan, 1999) program with minimum overlap length for clustering=40% of the length (140 bp for genome and pyrosequence metagenome and 33 bp for Illumina metagenome) and minimum percentage identity criterion was set at 80. Results were analyzed using R program SeqGrapheR (Novak ). Graph-based clustering (Novak ) predicted 2223 (68% reads used) clusters in the whole genome and 817 870 (19% reads used) clusters in the metagenome. Clusters with similar annotations (minimum 40% percent overlap and 80% identity threshold) were merged and re-analyzed. Contigs (contributing in clusters) were annotated (BLASTX; E-value=10−5) against a custom database constructed using nucleotide sequences of insertion sequence (IS)-elements, integrases and putative tranposases present in the available genomes from genus Sphingobium and INTEGRALL database (Moura ). To predict the genetic characteristics of transposon-mediated lateral transfer upon HCH contamination (insert size, expected sequences of insertion sites), we classified the predicted transposons to the ISfinder database (Siguier ) using BLASTN (E-value=10−10, minimum identity 80% over 80% query length). Genetic fingerprints of the de-novo reconstructed transposons were traced back on the draft assembly of S. indicum B90A, the complete genome of S. japonicum UT26 and modern lineages using BLASTN (E-value=10−10).

Sequence availability

Draft assembly of S. indicum B90A is available in Genbank database with accession number AJXQ00000000. Metagenome sequence data have been deposited at DDBJ/EMBL/GenBank under the study accession number of ERP001726 (http://www.ebi.ac.uk/ena/data/view/ERP001726).

Results and discussion

Phylogenomics of the genus Sphingobium: species to ecotypes

A non-redundant database was created (14.98 Mb, 107 735 reads; clustering cutoff at pan-genome level 97% identity) containing HCH dumpsite metagenomic reads that mapped to four Sphingobium genomes. The relative abundance of HCH-degrading genotypes, S. indicum B90A and S. japonicum UT26, was collectively much higher (91 275 reads; 0.8% of the total metagenome data with ANI >98%) than the non-HCH-degrading genotypes of S. chlorophenolicum L-1 and Sphingobium sp. SYK-6 (16 460 reads; 0.3% of the total metagenome data with ANI >98%). The four Sphingobium species, demonstrated close phylogenetic similarity (16S rRNA nucleotide identity of 97–99.6%) (Pal ), but to further establish the evolutionary kinship among these isolates, ANI values were computed with the secondary genetic elements, which often carry the lin genes (Nagata ; Nagata ; Tabata ) (Supplementary Figure S3a). Phylogenomic sequence analysis of S. indicum B90A and S. japonicum UT26 revealed that they are genetic subspecies (ANI=98.04%); whereas Sphingobium sp. SYK-6 and S. chlorophenolicum L-1 are considerably less closely related (ANI=89%) as previously reported using 16S rRNA gene sequence similarity (Copley ). Genetic relatedness was further validated using tetranucleotide profiling, which showed a similar relationship as defined by ANI analysis (Supplementary Figure S3b). As the abundance of populations closely related to S. indicum B90A and S. japonicum UT26 (ANI >98%) was high (>0.8%) at the dumpsite, we hypothesized that their close relatives would have inherited a minimal ancestor gene complement. In order to validate this hypothesis, metagenomic sequence fragments were assembled, and the resulting contigs were characterized using tetranucleotide frequency correlation and %GC into Sphingobium related bins, which were then reassembled. This ‘meta-Sphingobium' assembly was, as expected, genetically (tetranucleotide frequencies and ANI) close to S. indicum B90A and S. japonicum UT26 (Supplementary Figure S3b). However, mapping the ‘meta-Sphingobium' assembly to the non-HCH-degrading Sphingobium strains (Sphingobium chlorophenolicum L1 and Sphingobium sp. SYK-6), clearly highlighted their genetic heterogeneity, corresponding to the higher relative abundance of their secondary genetic elements in comparison with the primary chromosome (Supplementary Figure S3a). The high relative pangenomic abundance (at species level identity cutoff of 97%) of plasmids and secondary chromosome genotypes within the dumpsite metagenome, suggests that both HCH-degrading and non-degrading Sphingobium strains have been selected for these secondary genetic elements in comparison with the gene content present on the primary chromosome, which highlights the selection pressure of the HCH contamination.

Recursive identification of ‘foreign genes'

The individual metagenomic reads (20 111 630 reads) from the dumpsite were mapped on to the draft genome of S. indicum B90A, as well as the two chromosomes (chromosome 1=3.5 Mb and chromosome 2=681.8 Kb) and three plasmids (pCHQ1=190.9 Kb, pUT1=31.7 Kb and pUT2=5.3 Kb) of S. japonicum UT26 (Supplementary Figure S3a). Regions with >90% sequence identity and a mapping coverage of ∼8 × were selected for ancestor genotype predictions (see next section), and those with low sequence identity (less than ∼25%) or low coverage (less than ∼25% of the region) were characterized as MGIs and further annotated to determine their strain or environment specific functional traits. By using this approach, we identified two MGI profiles; 631 genes (∼312 Kb) in the genome of S. indicum B90A and 822 genes (∼559 Kb) in the complete genome sequence of S. japonicum UT26. Although, at a fine-scale (for example, KEGG enzyme) the functional profiles of the two MGIs were not very similar (R2=0.69 and P<0.0001; Fishers two-sided exact test with FDR correction; Supplementary Figure S3c), the majority of the genes still encoded for transposition, recombination and repair (B90A=15.094%±0.13 and UT=11.051%±0.01), as well as inorganic ions and amino-acid transport and metabolism (B90A=10.09%±0.18 and UT26=9.08%±1.4) (Supplementary Figure S3d). As, both the strains have been isolated from different locations (India and Japan), the differences in the functional potential of MGIs profiles could be attributed to the genetic adaptation of these strains in their microhabitats. Besides the MGIs in both strains using the genomes were further analyzed for foreign genes using DNA composition characteristics via SIGI-HMM (Waack ). The existing MGIs for each strain (predicted above) were validated using this algorithm. Although no further foreign genetic material was predicted for S. japonicum UT26 (Supplementary Table S2), the genome (excluding the MGI) of S. indicum B90A was predicted to contain 51 further foreign protein CDSs, totaling 31 Kb. These genes coded mostly for transposase, phage integrase, recombination and repair, and hypothetical proteins (Supplementary Figure S3d), the majority (89%±0.56) of which had greatest phylogenetic similarity to genes belonging to class Alphaproteobacteria. This demonstrates that metagenomic read mapping combined with DNA composition based algorithms (SIGI-HMM), provide a robust methodology for predicting a ‘complete' genomic island profile.

Reconstruction of the ancestral genotype and implications for legacy to recalcitrant compound degradation across modern lineages

The minimal ancestral gene content inherited into each subspecies was reconstructed and its genetic fingerprint traced within the modern lineages. The metagenomic data (Table 1) was mapped (Supplementary Figure S3a) on to the genomes and plasmids of the two abundant subspecies (S. indicum B90A and S. japonicum UT26), and regions with >90% sequence identity and mapping coverage of >8 × were selected. The analysis was based on the assumption that a gene is ancestral if it was present in orthologous segments of the genetic subspecies (genomic regions without MGIs and SIGI-HMM predictions) and has abundant coverage in the metagenome. In all, a total of 3128 protein CDSs (3.04 Mb) were predicted for the minimal ancestral gene content inherited into sub-species from their immediate ancestor (Figures 1a and b). Putative CDSs from the predicted ancestral genotype were compared and mapped (BLASTN, E-value 10−8; Figure 1b) against the genomes of S. japonicum UT26 and S. indicum B90A and the available dumpsite metagenomic sequence data, to determine the relative genetic rearrangements across each discrete subspecies, and to trace the in situ relative abundance of these rearrangements in the metagenome (Figure 1b). Strikingly, although the lower HCH degradation pathway (linD, E, R, F, G, and H) was present in the ancestor, and abundant across all metagenomic data sets, there was a complete absence of two enzymes from the ‘core' upper HCH degradation pathway (linA, C genes) in the predicted ancestral genotype (Figure 1b). This suggests that the linA and linC were laterally acquired by both S. indicum B90A and S. japonicum UT26 (mean pairwise dN/dS >1, Supplementary Figure S4a).

Figure 1

Predicted ancestral genotype, its metagenomic fingerprints and genetic rearrangements across S. indicum B90A and S. japonicum UT26. (a) Metagenomic fingerprints of ancestral gene content; Pyrosequence reads from the previous dumpsite metagenome survey (Sangwan ) and illumina reads from this study were plotted on the predicted ancestral gene content (yellow shade) and genome sequences of the S. indicum B90A (gray shade) and S. japonicum UT26 (pink shade). Each dot on the graph represents an individual sequence read aligned along its reference genotypes. x and y axis represents the genomic coordinate (Mbp) and percentage sequence identity, respectively. (b) Genetic breakpoints between ancestor genotype and HCH-degrading genetic subspecies from outside towards centre; outermost circle: metagenomic contigs, circle 2: draft assembly of S. indicum B90A, circle 3: complete genome of S. japonicum UT26 mapped (BLASTN) over ancestral genotype (black color intensity represent the percentage identity, that is, darker the shade higher is the sequence identity), circle 4: lin genes homology between ancestor and subspecies, circle 5: metagenomic read coverage plot of ancestral gene content (5 × as coverage cutoff), outermost circle: GC plot of the predicted ancestral gene content.

Ecologically homogenous strains showing 1–5% genome-aggregate nucleotide divergence are already known to support the core genome hypothesis (Caro-Quintero and Konstantinidis, 2012) and sequence divergence within their genotypes corresponds to the evolution from the last-common ancestor (LCA). Thus, the fact that two major upper-pathway genes (linA and linC) of aerobic HCH degradation were not detected in the ancestral genotype, and that the HCH-degrading subspecies (Figure 1a and Supplementary Figure S4b) and the predicted ancestor demonstrate sequence-discrete population status (95–100% sequence similarity), suggests that the lower HCH degradation pathway (linD, linE and linR genes) is an evolutionary ‘long-lived' event (Caro-Quintero and Konstantinidis, 2012) in this ecosystem. This ‘long-lived' hypothesis is supported by the fact that the lower pathway genes were found in all metagenomic data sets (dumpsite, as well as 1 and 5 km from the dumpsite (Sangwan )), which is potentially explained by the close homology of linD, linE and linR to other catabolic genes also involved in biodegradation of other recalcitrant chemical compounds and present among several bacterial strains (Copley ). The estimated size of the ancestor genotype is 3 Mb (Figure 1a), which is smaller than that of the two subspecies (S. indicum B90A, 4.08 Mb and S. japonicum UT26, 4.4 Mb) and the estimated in situ bacterial specific average genome size (4.03 Mb±0.17) as predicted from the metagenomic data (Raes ). The smaller size of an ancestral genotype has been reported to be a reliable indicator for large-scale lateral gene transfer in bacterial evolution (Dagan and Martin, 2007). The core-metabolism of the ancestor (Supplementary Figure S4c) was similar to that of the sub-species (complete genomes), but strain-specific functional divergence (Supplementary Figure S4b) was also observed (ancestor versus S. sphingobium B90A, R2=0.99, P>0.0001; ancestor versus S. japonicum UT26, R2=0.78, P>0.0001). Raw reads from the genome of S. indicum B90A, the HCH dumpsite metagenome and the predicted ancestor's CDSs were all mapped on to the recalcitrant compound degradation pathway genes of S. japonicum UT26 (phenol/toluene, chlorophenol, anthranilate, homogentisate and HCH) (Nagata ) (Supplementary Figure S4b). The data suggest that anthranilate degradation is a strain-specific character for S. japonicum UT26, inherited from an immediate ancestor, but completely absent in its genetic sub-species (S. indicum B90A) genome and in the in situ metagenome. On the contrary, homogentisate degradation was found to be an environment specific trait (laterally acquired) as it was completely absent in the ancestor genotype and the S. indicum B90A (subspecies) genome (Supplementary Figure S5), but was present in the dumpsite metagenome. Sequence mapping analysis (identity >97%) (Supplementary Figure S4b), clearly revealed that phenol and chlorophenol degradation pathway genes were present in the ancestral genotype, the reference species, and their as-of-yet-uncultivated relatives present in the HCH dumpsite metagenome. This suggests the selective maintenance of this trait in the evolution of the sphingomonads exposed to HCH isomers. As the ‘geographically distant' isolation sites of S. indicum B90A and S. japonicum UT26 did not share the physicochemical history of the analyzed site except amendment with HCH, these traits appear to be purely ancestral. However, these traits are under strong natural selection pressure (mean pairwise dN/dS >1) and the codon usage patterns are in agreement with the core usage of the reference strains. Following the ‘genome streamlining theory' (Giovannoni ) genetic maintenance of these traits at high evolutionary cost still needs to be explored.

De-novo identification and characterization of mobile genetic elements in situ

The absence of the upper HCH degradation pathway genes linA and linC in the ancestor generates the hypothesis that the sphingomonads were under strong selection pressure to acquire HCH degradation genes in HCH contaminated environments, and thereby increase their genome size. Therefore, both S. indicum B90A, S. japonicum UT26 and the dumpsite metagenome should have the appropriate number of mobile genetic elements, IS and integrases to support these predicted lateral gene transfer events (Aziz ). To determine this, a de-novo approach was used to identify these elements in the genome of S. indicum B90A, and in the HCH dumpsite metagenome. A total of 19 (8.5%) clusters (Supplementary Figure S5a) from the genome and 267 (0.03%) from (Supplementary Figure S5b) the metagenome had significant similarities to the putative transposases and integrases. After merging overlapping clusters (contigs with minimum 40% percent overlap and 80% identity threshold), re-analysis and validation based on the metagenomic coverage cutoff (minimum coverage ∼8 × ), 8 unique clusters (representing complete transposases) were identified (Figure 2). All of the eight putative transposons were found to be from the IS6 family (IS6100).

Figure 2

Metagenomic validation of de-novo assembled strain-specific mobile genetic elements. Total 8 transposons (a–h) were selected after reconstructed from de-novo assembly and graph-based clustering of genome (S. indicum B90A) and HCH dumpsite metagenomes. For each transposon; purple color dots represents the genomic reads of contigs (from graph-based clustering) assigned to a transposon and overlapping yellow reads represents the metagenomic validation with minimum sequence identity of 97%.

These results suggest that the eight transposons are preferentially environment specific, ecologically selected (1% to 5% pangenomic nucleotide divergence) and highly active (based on their abundance in the metagenomic data) mobile genetic elements and have been enriched in the bacterial lineages present at the dumpsite. Interestingly, on tracing these transposases back to the genomes of S. indicum B90A and S. japonicum UT26, four of them (Figures 2a) were found to be associated with linA, linA1, linC and linDER genes, respectively (Dogra ; Nagata ). These findings clearly explain the strong lateral transfer potential (Tn3 family) of lin genes in the reference (genomic) and as-yet-uncultured (metagenomic) HCH-degrading bacterial lineages. It also provides further mechanistic support for the hypothesis that linA and linC genes are environment specific genetic imports that were not present in the LCA of the two HCH-degrading subspecies.

Metagenomic recovery of enriched plasmids and a secondary chromosome

Whole-genome sequences of various Sphingobium species have revealed that most of the catabolic genes responsible for the degradation of xenobiotics are generally present on their plasmids (Nagata ; Tabata ; Copley ). Comparative metagenomic analysis (BLASTN; E-value=10−8) against the NCBI plasmid database (ftp.ncbi.nlm.nih.gov/genomes/Plasmids/) revealed the enrichment of plasmids (P<0.001 Fisher's exact test with FDR correction) at the dumpsite (Supplementary Figure S5c), and increasing relative abundance across the increasing HCH contamination gradient (5 km <1 km <454 and Illumina dumpsite). For 37 enriched plasmid genotypes (Supplementary Figure S5c), a linear increase (P<0.0001) in metagenomic abundance was observed; 19 of which were more abundant in the Illumina data set compared with the 454-pyrosequenced metagenome (Supplementary Figure S5c), potentially as a result of the increased sequencing depth (Supplementary Table S1). Three draft genotypes were assembled from the metagenomic data (Figure 3 and Supplementary Figure S3a) corresponding to plasmids, pSPHCH01 (Copley ), pSLGP (Masai ), and chromosome 2 of S. japonicum UT26 (Lal ; Dogra , Nagata ). The relative enrichment of previously assembled plasmid variants (plasmid pLB1, pISP3 and pISP4; 12) were also validated. Nucleotide composition- (%GC and tetranucleotide frequency) based clustered contigs (species level structural variants) were arranged using paired end information and reference mapping (minimum percentage identity=85). Relatively similar nucleotide composition patterns were observed in pairwise comparisons between corresponding secondary genetic elements, their primary chromosomes and reconstructed genotypes (Copley ).

Figure 3

Genotypic synteny and metagenomic recruitment of reference genotypes. Whole-genome alignments between reference genomes and their reconstructed genotypes: (a) pSLPG plasmid of Sphingobium sp. SYK-6, (b) pSPHCH01 plasmid of Sphingobium chlorophenolicum L1 and (c) chromosome 2 of S. japonicum UT26. Overlaid the synteny plots are the metagenomic read coverage on a numerical scale. Mc, metagenome contigs.

Conclusions

The ‘genome streamlining theory' (Giovannoni ), posits that selection pressure is linearly correlated with the effective genome size. Here, we have invoked this theory to explain the evolutionary increase in genome size as a result of selective pressure from xenobiotic pollution for microbial lineages present at an HCH dumpsite. Using nucleotide composition patterns, we have reanalysed the genetic relatedness among cultured representatives and in situ cohorts of the genus Sphingobium, and have shown that use of metagenomic data and combinatorial bioinformatic algorithms can provide a robust methodology for accurate ‘foreign gene' identification. Coherent genetic relatedness within ecotypes and their meta-pangenomic abundance has enabled us to predict the ancestral gene content inherited into their genotypes (strains). Furthermore, this study also proposes that metagenomics has the potential to clearly demarcate between environment and/or strain-specific functional traits acquired by closely related, sequenced reference organisms. These results suggest that we were successful in reconstructing an LCA genotype (as opposed to a core genome) as we were able to (a) clearly demark gene regions that were not part of the two reference genomes, (b) include secondary genetic elements and (c) identify and remove ‘forigen' genes using an LCA-specific algorithm. Strikingly, we found that despite the absence of various xenobiotic chemicals, for example, phenol, toluene, homogentisate and anthranilate, at HCH dumpsite, there is an evolutionary maintenance (from last ancestor to the cultured and yet-uncultivated cohorts) and positive selection of their biodegradation traits. The reasons for this are unclear, but it represents the powerful potential of metagenome-based ancestral genotype reconstruction for uncovering potentially important cross-linked genotype traits in such processes. Further genome sequencing of relevant isolates, and metagenomic sequencing and genome reconstruction from related sites, combined with knock-out mutants can now be targeted towards these processes and genes to help further elucidate the genetic mechanism.

57 in total

1. Accurate identification of orthologous segments among multiple genomes.

Authors: Tsuyoshi Hachiya; Yasunori Osana; Kris Popendorf; Yasubumi Sakakibara
Journal: Bioinformatics Date: 2009-02-02 Impact factor: 6.937

2. Complete genome sequence of Sphingobium sp. strain SYK-6, a degrader of lignin-derived biaryls and monoaryls.

Authors: Eiji Masai; Naofumi Kamimura; Daisuke Kasai; Akio Oguchi; Akiho Ankai; Shigehiro Fukui; Mikio Takahashi; Isao Yashiro; Hiroki Sasaki; Takeshi Harada; Sanae Nakamura; Yoko Katano; Sachiko Narita-Yamada; Hidekazu Nakazawa; Hirofumi Hara; Yoshihiro Katayama; Masao Fukuda; Shuji Yamazaki; Nobuyuki Fujita
Journal: J Bacteriol Date: 2012-01 Impact factor: 3.490

3. Evaluation of hexachlorocyclohexane contamination from the last lindane production plant operating in India.

Authors: Simran Jit; Mandeep Dadhwal; Hansi Kumari; Swati Jindal; Jasvinder Kaur; Pushp Lata; Neha Niharika; Devi Lal; Nidhi Garg; Sanjay Kumar Gupta; Pooja Sharma; Kiran Bala; Ajaib Singh; John Vijgen; Roland Weber; Rup Lal
Journal: Environ Sci Pollut Res Int Date: 2010-10-22 Impact factor: 4.223

4. Low dose induction of micronuclei by lindane.

Authors: Olga I Kalantzi; Rebecca Hewitt; Kirstie J Ford; Lee Cooper; Ruth E Alcock; Gareth O Thomas; James A Morris; Trevor J McMillan; Kevin C Jones; Francis L Martin
Journal: Carcinogenesis Date: 2003-12-19 Impact factor: 4.944

5. Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes.

Authors: Kris Popendorf; Hachiya Tsuyoshi; Yasunori Osana; Yasubumi Sakakibara
Journal: PLoS One Date: 2010-09-24 Impact factor: 3.240

6. Organization of lin genes and IS6100 among different strains of hexachlorocyclohexane-degrading Sphingomonas paucimobilis: evidence for horizontal gene transfer.

Authors: Charu Dogra; Vishakha Raina; Rinku Pal; Mrutyunjay Suar; Sukanya Lal; Karl-Heinz Gartemann; Christof Holliger; Jan Roelof van der Meer; Rup Lal
Journal: J Bacteriol Date: 2004-04 Impact factor: 3.490

7. Comparative metagenomic analysis of soil microbial communities across three hexachlorocyclohexane contamination levels.

Authors: Naseer Sangwan; Pushp Lata; Vatsala Dwivedi; Amit Singh; Neha Niharika; Jasvinder Kaur; Shailly Anand; Jaya Malhotra; Swati Jindal; Aeshna Nigam; Devi Lal; Ankita Dua; Anjali Saxena; Nidhi Garg; Mansi Verma; Jaspreet Kaur; Udita Mukherjee; Jack A Gilbert; Scot E Dowd; Rajagopal Raman; Paramjit Khurana; Jitendra P Khurana; Rup Lal
Journal: PLoS One Date: 2012-09-28 Impact factor: 3.240

8. ISfinder: the reference centre for bacterial insertion sequences.

Authors: P Siguier; J Perochon; L Lestrade; J Mahillon; M Chandler
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. Minimus: a fast, lightweight genome assembler.

Authors: Daniel D Sommer; Arthur L Delcher; Steven L Salzberg; Mihai Pop
Journal: BMC Bioinformatics Date: 2007-02-26 Impact factor: 3.169

10. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.

Authors: Hanno Teeling; Jost Waldmann; Thierry Lombardot; Margarete Bauer; Frank Oliver Glöckner
Journal: BMC Bioinformatics Date: 2004-10-26 Impact factor: 3.169

22 in total

Review 1. Survey of (Meta)genomic Approaches for Understanding Microbial Community Dynamics.

Authors: Anukriti Sharma; Rup Lal
Journal: Indian J Microbiol Date: 2016-11-11 Impact factor: 2.461

Review 2. Molecular perspectives and recent advances in microbial remediation of persistent organic pollutants.

Authors: Jaya Chakraborty; Surajit Das
Journal: Environ Sci Pollut Res Int Date: 2016-05-28 Impact factor: 4.223

Review 3. Taxonomically Characterized and Validated Bacterial Species Based on 16S rRNA Gene Sequences from India During the Last Decade.

Authors: Princy Hira; Priya Singh; Anil Kumar Pinnaka; Suresh Korpole; Rup Lal
Journal: Indian J Microbiol Date: 2019-12-03 Impact factor: 2.461

4. Biodegradation of dichlorodiphenyltrichloroethanes (DDTs) and hexachlorocyclohexanes (HCHs) with plant and nutrients and their effects on the microbial ecological kinetics.

Authors: Guangdong Sun; Xu Zhang; Qing Hu; Heqing Zhang; Dayi Zhang; Guanghe Li
Journal: Microb Ecol Date: 2014-09-12 Impact factor: 4.552

5. Genome Sequence of Sphingobium yanoikuyae B1, a Polycyclic Aromatic Hydrocarbon-Degrading Strain.

Authors: Qiang Zhao; Hongbo Hu; Wei Wang; Huasong Peng; Xuehong Zhang
Journal: Genome Announc Date: 2015-02-05

6. Pan-genome dynamics of Pseudomonas gene complements enriched across hexachlorocyclohexane dumpsite.

Authors: Anukriti Sharma; Naseer Sangwan; Vivek Negi; Puneet Kohli; Jitendra Paul Khurana; Desiraju Lakshmi Narsimha Rao; Rup Lal
Journal: BMC Genomics Date: 2015-04-18 Impact factor: 3.969

7. Insights into Ongoing Evolution of the Hexachlorocyclohexane Catabolic Pathway from Comparative Genomics of Ten Sphingomonadaceae Strains.

Authors: Stephen L Pearce; John G Oakeshott; Gunjan Pandey
Journal: G3 (Bethesda) Date: 2015-04-07 Impact factor: 3.154