
Comparative genomics reveals conserved positioning of essential genomic clusters in highly rearranged Thermococcales chromosomes.

Matteo Cossu1, Violette Da Cunha1, Claire Toffano-Nioche1, Patrick Forterre1, Jacques Oberto1.   

Abstract

The genomes of the 21 completely sequenced Thermococcales display a characteristic high level of rearrangements. As a result, the prediction of their origin and termination of replication on the sole basis of chromosomal DNA composition or skew is inoperative. Using a different approach based on biologically relevant sequences, we were able to determine the oriC position in all 21 genomes. The position of dif, the site where chromosome dimers are resolved before DNA segregation, could be predicted in 19 genomes. Computation of the core genome uncovered a number of essential gene clusters with a remarkably stable chromosomal position across species, in sharp contrast with the scrambled nature of their genomes. The active chromosomal reorganization of numerous genes acquired by horizontal transfer, mainly from mobile elements, could explain this phenomenon.
Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.

Keywords:  Archaea; Bioinformatics; Chromosomal landmarks; Genome evolution; Mobile elements; Thermococcales

Year:  2015        PMID: 26166067      PMCID: PMC4640148          DOI: 10.1016/j.biochi.2015.07.008

Source DB:  PubMed          Journal:  Biochimie        ISSN: 0300-9084            Impact factor:   4.079


Introduction

The discovery of anaerobic hyperthermophilic microbes by Karl Stetter and Wolfram Zillig extended the limits of life beyond environmental barriers commonly considered insuperable. Inhospitable habitats such as saline thermal pools and deep-sea hydrothermal vents have been remarkably colonized by these extremophilic life forms. Organisms whose optimal growth temperature approaches or exceeds that of boiling water belong exclusively to the third domain of life: the Archaea. A significant proportion of microorganisms thriving at the fringe of life in terms of temperature belong to the taxonomic order Thermococcales, ranked in the Euryarchaeota phylum [1]. Thermococcales are divided into three principal genera: Pyrococcus, Thermococcus and Palaeococcus, and grow chemoorganoheterotrophically at temperatures ranging from 80 °C to 100 °C [2]. They require a source of protein and present variable amino acid requirements; several species such as Pyrococcus furiosus and Thermococcus kodakarensis are able to use chitin as a carbon source [3]. Thermococcales grow easily in the laboratory in complete or synthetic media under strict anoxia. To produce energy, these Archaea prefer anaerobic respiration using elemental sulfur (S⁰) as terminal electron acceptor to produce hydrogen sulfide. Alternatively, they are able to ferment pyruvate to produce hydrogen [2]. Such unique growth parameters prompted several teams to investigate biosynthetic pathways in Thermococcales. The central metabolism differs quite notably from previously known pathways. The pentose pathway is absent, the TCA cycle is incomplete and glycolysis uses a number of enzymes remarkably different from the canonical view [2]. Even if the net energy balance is still subject to debate, it appears that these Archaea are geared towards an extremely conservative use of energy [2].
Despite their extreme growth conditions, low energetic efficiency and simplified biochemistry, Thermococcales display a very short generation time as low as 23 min [4]. This doubling interval is remarkably similar to that of the fast growing model microbe Escherichia coli, grown under the much more favorable conditions of aerobic respiration [5]. Growth efficiency of Thermococcales is in sharp contrast with an apparent disorganization of their chromosome. Indeed it has been reported that these genomes are subjected to a shuffling-driven evolution [6]. This apparent paradox prompted us to investigate, in this work, the process of fast cell growth and rapid chromosome replication by analyzing genomic organization and replication patterns of the completely sequenced Thermococcales.

Material and methods

Genomic data files retrieval and formatting

GenBank genomic data files corresponding to the 21 Thermococcales species were retrieved locally from the NCBI repository using four sequential commands from the NCBI Entrez Programming Utilities (E-Utilities). This redundant procedure was defined in order to guarantee retrieval of the main chromosome of complete genomes exclusively.
The first command retrieves the species-specific bioproject: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=[speciesname]
The second command is used to examine the 'Sequencing_Status' flag for completeness: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=bioproject&id=[bioproject]
The third command retrieves the unique and chromosome-specific GenBank Identification (GI) number: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=[bioproject]
The fourth command retrieves locally the organism-specific data file in GenBank format: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=[GI]&rettype=gbwithparts
The Thermococcales protein sequences were extracted in FASTA format from these GenBank files using an in-house C# parsing script retaining only the actual amino acid sequence and the unique genomic identification number (GI). All proteins were merged into a single database which was converted to binary format using the NCBI executable 'makeblastdb'. The same script generated a separate indexed file where each individual protein was represented using the following fields: ORF genomic orientation, ORF starting and ending coordinates, gene name, unique protein GI identifier, protein function and source organism name.
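The four-step retrieval can be illustrated with a minimal Python sketch. It only assembles the four URLs quoted above and performs no network access; the function names `eutils_url` and `retrieval_urls` are hypothetical, and passing the bioproject identifier and GI as arguments (rather than parsing them from the previous responses) is a simplifying assumption. The authors' own implementation is not shown in the text.

```python
from urllib.parse import urlencode

# Base URL exactly as written in the text.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool, **params):
    """Build an E-utilities URL for `tool` (esearch, esummary or efetch)."""
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(params)}"


def retrieval_urls(species_name, bioproject, gi):
    """Return the four URLs of the retrieval procedure, in order.

    Hypothetical helper: in a real pipeline, `bioproject` would be read
    from the response to the first query and `gi` from the third.
    """
    return [
        eutils_url("esearch", db="bioproject", term=species_name),
        eutils_url("esummary", db="bioproject", id=bioproject),
        eutils_url("esearch", db="nuccore", term=bioproject),
        eutils_url("efetch", db="nucleotide", id=gi, rettype="gbwithparts"),
    ]
```

Each URL could then be fetched in turn, checking the 'Sequencing_Status' flag in the second response before proceeding to the download.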

Thermococcales phylogenetic tree

DNA sequences corresponding to the 16S ribosomal RNA genes were retrieved using the BAGET web service at http://archaea.u-psud.fr/bin/baget.dll [7]. The PhyML phylogeny was computed using the web service at http://phylogeny.lirmm.fr/ [8].

Thermococcales origin of replication prediction

Replication origin predictions with GC skew or Z-curve methods were performed using the software Ori-Finder 2, available at http://tubic.tju.edu.cn/Ori-Finder2/ [9]. In a second predictive method, we used the mini-ORB sequences identified in Pyrococcus abyssi by Matsunaga et al. [10] as a matrix for oriC prediction using FITBAR, available at http://archaea.u-psud.fr/fitbar [11]. In this case, the search algorithm parameters were a log-odds PSSM, with a local Markov model to compute the p-value of each newly predicted ORB site, and the search was restricted to intergenic regions. We considered as putative replication origins the intergenic regions in which more than four mini-ORBs were predicted by FITBAR with p-values < 0.005. These results were compared to those obtained with Ori-Finder 2 using as ORB sequences the three motifs predicted for Thermococcaceae. These three conserved ORB motifs were derived from the Thermococcales replication origins recorded in the DoriC database [12], using the MEME tool (Multiple EM for Motif Elicitation), which discovers conserved patterns in related DNA sequences [13].
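To make the log-odds PSSM idea concrete, the following Python sketch builds a matrix from aligned sites and scans a sequence with it. This is not FITBAR's implementation: the uniform background, the pseudocount, the fixed score threshold (in place of the Markov-model p-value) and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pssm(sites, background=None, pseudocount=1.0):
    """Log-odds PSSM (one dict per column) from equal-length aligned sites."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def score(pssm, window):
    """Sum of column log-odds; bases outside ACGT contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(pssm, window))


def scan(pssm, sequence, threshold):
    """Return (position, score) for windows scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(sequence) - width + 1):
        s = score(pssm, sequence[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits
```

Counting how many hits fall within the same intergenic region would then reproduce, in spirit, the "more than four mini-ORBs" criterion described above.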

Thermococcales dif site prediction

The identification of dif sites on the 21 sequenced Thermococcales chromosomes was performed using a consensus sequence deduced from the alignment of predicted dif sites in P. abyssi, Pyrococcus horikoshii, P. furiosus and Thermococcus kodakarensis [14]. This consensus was then used to perform dif site prediction using FITBAR with the same search algorithm parameters as described above for ORB prediction, but on the whole chromosome. Each newly predicted sequence was progressively added to the consensus to improve detection sensitivity.

Homology searches of XerA recombinase

Thermococcales XerA orthologs were searched by BLASTp analysis using the amino acid sequence of P. abyssi XerA (NP_126073.1). A second predictive method was performed using SYNTTAX web service [15] available at http://archaea.u-psud.fr/synttax.

Core genome procedure

The core genome procedure was conducted as follows. We designed a C# script to construct protein orthologous groups by non-redundant bi-directional BLASTs. Every BLAST score was normalized to the score obtained by aligning the query and hit proteins to themselves. Proteins showing normalized bi-directional BLAST scores > 30% were considered orthologous, as recommended by Lerat et al. [16]. A second C# script was designed to query the orthologous groups and define the core genome, which consists of all protein genes present at least once in every genome of the dataset. A 'single core' dataset was derived from this core genome by excluding orthologous classes containing more than a single representative per genome.
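The procedure can be sketched in Python as follows. The exact normalization and grouping rules used by the authors' C# script are not given in the text, so this sketch assumes that each score is divided by the query's self-score and that groups are formed by single linkage over reciprocal hits; all names are hypothetical.

```python
from collections import defaultdict


def normalized(scores, query, hit):
    """Score of query vs hit divided by the query's self-score (assumed rule)."""
    return scores.get((query, hit), 0.0) / scores[(query, query)]


def orthologous(scores, a, b, cutoff=0.30):
    """True if both normalized directions exceed the 30% cutoff."""
    return normalized(scores, a, b) > cutoff and normalized(scores, b, a) > cutoff


def ortholog_groups(scores, proteins):
    """Single-linkage groups over orthologous pairs (grouping rule assumed)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for i, a in enumerate(proteins):
        for b in proteins[i + 1:]:
            if orthologous(scores, a, b):
                parent[find(a)] = find(b)
    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    return list(groups.values())


def classify(groups, genome_of, genomes):
    """Return (general core, single core) lists of groups."""
    general, single = [], []
    for group in groups:
        counts = defaultdict(int)
        for p in group:
            counts[genome_of[p]] += 1
        if all(counts[g] >= 1 for g in genomes):
            general.append(group)
            if all(counts[g] == 1 for g in genomes):
                single.append(group)
    return general, single
```

A group with a paralog in one genome stays in the general core but is excluded from the single core, which mirrors the distinction drawn in the text.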

Core genome chromosomal positioning

For each gene composing the single core, we calculated the mean distance to the predicted origin of replication and its standard deviation (SD) using an in-house C# script. The core genes were then successively ranked by mean distance and SD to highlight the presence of clusters.
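A minimal Python sketch of this computation is given below. It assumes that the distance is the shortest arc between gene and oriC on the circular chromosome, expressed as a percentage of genome length, and that the population standard deviation is used; neither detail is stated in the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def ori_distance_pct(gene_pos, ori_pos, genome_len):
    """Shortest circular distance from gene to oriC, as % of genome length."""
    d = abs(gene_pos - ori_pos) % genome_len
    return 100.0 * min(d, genome_len - d) / genome_len


def rank_core_genes(positions, origins, lengths):
    """positions[gene][genome] -> coordinate; returns (gene, mean, sd) rows
    sorted by mean distance, then by standard deviation."""
    rows = []
    for gene, per_genome in positions.items():
        d = [ori_distance_pct(pos, origins[g], lengths[g])
             for g, pos in per_genome.items()]
        rows.append((gene, mean(d), pstdev(d)))
    return sorted(rows, key=lambda r: (r[1], r[2]))
```

Genes with a small standard deviation keep a similar relative distance to oriC in all genomes, which is the property used to define conserved clusters.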

P. abyssi genome expression

In order to quantify the expression level of every gene in P. abyssi, we used RNA-seq data obtained across several growth phases as described in Ref. [17]. As the sequencing was strand-specific, the read alignment respects the strand of the DNA molecule. The CompareOverlapping tool from the S-mart toolbox [18] was used (with the -c option to respect the strand constraint) in order to determine the number of overlapping reads for every CDS feature defined in the NC_000868.1 entry from the NCBI repository. For each gene, the RPKM measurement defined in Ref. [19] was computed based on the number of overlapping reads, a read size of 40 nt, and a total of 5587560 aligned reads. We used the RPKM measure of each gene as an estimate of its expression level.
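For reference, RPKM stands for reads per kilobase of feature per million aligned reads. The Python sketch below uses this usual definition with the total read count quoted above as default; how the 40 nt read size entered the authors' computation is not specified, so it is left out here, and the function name is hypothetical.

```python
def rpkm(read_count, cds_length_bp, total_aligned_reads=5_587_560):
    """Reads per kilobase of CDS per million aligned reads."""
    return 1e9 * read_count / (cds_length_bp * total_aligned_reads)
```

With this definition, a 1 kb CDS overlapped by 1000 reads would have an RPKM of about 179 in this dataset.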

Results

Thermococcales genomic dataset

At the time of writing, 21 Thermococcales genomes have been completely sequenced and annotated. They are publicly available at the NCBI repository and consist of 13 Thermococcus, 7 Pyrococcus and 1 Palaeococcus (Table 1). Thermococcales carry a single ∼2 Mb chromosome and encode an average of 2100 proteins. Evolutionary relationships among the various species are illustrated by a phylogenetic tree of their 16S ribosomal RNA genes (Fig. 1). Genomic sequences were retrieved as described in Materials and Methods. The comparative genomic analysis presented here is based on this entire dataset. The first step of this analysis consisted in the identification of chromosomal landmarks such as the origin and terminus of DNA replication followed in a second step by the comparison of the protein content at the genomic level.
Table 1

List of Thermococcales species with a complete genome sequence available.

Species | Bioproject | GI | Genes | Size (Mb) | GC% | Optimum temperature | Habitat | Reference
Palaeococcus pacificus DY20341 | PRJNA207495 | 664800204 | 2046 | 1.86 | 43.0 | 80 °C | Aquatic | [57]
Pyrococcus abyssi GE5 | PRJNA62903 | 14518450 | 1875 | 1.77 | 44.71 | 103 °C/90 °C | Aquatic | [58]
Pyrococcus furiosus DSM 3638 | PRJNA57873 | 18976372 | 2225 | 1.90 | 40.77 | 100 °C/90 °C | Aquatic | [59]
Pyrococcus furiosus COM1 | PRJNA169620 | 397650687 | 2113 | 1.91 | 40.79 | 100 °C | Aquatic | [60]
Pyrococcus horikoshii OT3 | PRJNA57753 | 14589963 | 2000 | 1.73 | 41.88 | 98 °C/95 °C | Aquatic | [61]
Pyrococcus sp. NA2 | PRJNA66551 | 332157643 | 2028 | 1.86 | 42.74 | 93 °C | Aquatic | [62]
Pyrococcus sp. ST04 | PRJNA167261 | 389851449 | 1839 | 1.73 | 42.30 | 95 °C | Aquatic | [63]
Pyrococcus yayanosii CH1 | PRJNA68281 | 337283511 | 1952 | 1.72 | 51.64 | 98 °C | Aquatic | [64]
Thermococcus barophilus MP | PRJNA54733 | 315229765 | 2257 | 2.01 | 41.76 | 85 °C | Aquatic | [65]
Thermococcus eurythermalis strain A501 | PRJNA251677 | 700302025 | 2183 | 2.12 | 53.47 | 85 °C | Aquatic | [66]
Thermococcus gammatolerans EJ3 | PRJNA59389 | 240102057 | 2210 | 2.05 | 53.56 | 88 °C | Aquatic | [67]
Thermococcus guaymasensis DSM11113 | PRJNA230529 | 744793172 | 2170 | 1.92 | 52.86 | 88 °C | Aquatic | Zhang, X. et al., 2015
Thermococcus kodakarensis KOD1 | PRJNA58225 | 57639935 | 2358 | 2.09 | 52.00 | 85 °C | Aquatic | [68]
Thermococcus litoralis DSM 5473 | PRJNA82997 | 530547444 | 2575 | 2.22 | 43.09 | 83 °C | Aquatic | [69]
Thermococcus nautili strain 30-1 | PRJNA237737 | 589908590 | 2288 | 1.97 | 54.84 | 87.5 °C | Aquatic | [70]
Thermococcus onnurineus NA1 | PRJNA59043 | 212223144 | 2026 | 1.85 | 51.27 | 80 °C | Terrestrial | [71]
Thermococcus sibiricus MM 739 | PRJNA59399 | 242397997 | 2107 | 1.85 | 40.20 | 78 °C | Oil | [72]
Thermococcus sp. 4557 | PRJNA70841 | 341581088 | 2181 | 2.01 | 56.08 | ND | Aquatic | [73]
Thermococcus sp. AM4 | PRJNA54735 | 350525682 | 2279 | 2.08 | 54.78 | 80 °C | Aquatic | [74]
Thermococcus sp. CL1 | PRJNA168259/PRJNA167371 | 390960176 | 2090 | 1.95 | 55.82 | 85 °C | Aquatic | [75]
Thermococcus sp. ES1 | PRJNA230233 | 573023865 | 2090 | 1.95 | 40.30 | 82 °C | Aquatic | [76]
Fig. 1

Phylogenetic tree of the 21 sequenced Thermococcales. The phylogeny of the Thermococcales dataset was calculated with PhyML using the 16S ribosomal RNA genes as described in Material and Methods.

Prediction of Thermococcales DNA replication origins

The duplication and transmission of genetic information without loss is of fundamental importance for living cells. Cell division must be accompanied by DNA replication executed with appropriate timing and frequency. In all organisms, replication initiates at specific region(s) of the genome known as the origin of replication (oriC) site(s). Eukaryotic DNA replication is initiated at multiple origins at different times across linear chromosomes. In eukaryotes, the origin recognition complex (ORC) contains six separate polypeptides, Orc1-6. Comparative genomic analysis of whole archaeal genome sequences shows that the archaeal machinery responsible for DNA replication is largely homologous to that of eukaryotes and is clearly distinct from its bacterial counterpart [20], [21]. It has been shown experimentally that the archaeal origin binding protein is homologous to the related eukaryotic Orc1 and Cdc6 proteins [22]. The fine mapping of the three replication origins in Sulfolobus solfataricus led to the identification of origin recognition boxes (ORBs) and mini-ORBs [23]. ORBs are repeated sequences located on both sides of A/T-rich regions and were shown to be the binding sites for Cdc6 proteins [23]. ORBs from different species share sequence similarity with a consensus sequence referred to as mini-ORB. It was shown that mini-ORBs are sufficient to bind Cdc6 proteins and that Cdc6 from one organism (Cdc6-1 of S. solfataricus) can bind ORBs from other species (P. furiosus, Halobacterium NRC1) in vitro [23]. ORB sites are well conserved across many archaeal species and specific binding of ORB sequences by Cdc6 is likely to be a common mechanism for origin recognition in Archaea [22], [24], [25], [26]. Several archaeal species such as S. solfataricus, Sulfolobus acidocaldarius, Haloferax volcanii and Aeropyrum pernix possess multiple oriC per chromosome [23], [27], [28], [29].
Multiple chromosomal replication origins might have arisen by capture of viral or plasmidic replication origins and their respective associated initiator factors [21]. On the other hand, single origins were found in Methanothermobacter thermautotrophicus [24] and mapped precisely in the Thermococcales genus Pyrococcus [22], [30]. In order to compare our genomic dataset, it was fundamental to identify a common and unique genomic feature shared by all 21 Thermococcales genomes under study. Since the origin of replication was shown to be unique in these genomes, we proceeded with a computational prediction of their respective locations. Several bioinformatics techniques have been used to locate origins of replication in prokaryotic genomes: they are based on the measure of asymmetric nucleotide composition on leading and lagging strands. Cumulative GC-skew plots are commonly used for this purpose [31], [32], [33], [34]. Thermococcales oriC for the species P. abyssi, P. horikoshii and P. furiosus have been located using other skewed sequences such as GGTT and GGGT [6], [30]. However, these two particular skews and the remaining 254 tetranucleotide combinations failed to reliably predict Thermococcus origins (data not shown). Alternative scoring methods such as Z-curve calculation have been used successfully for the archaea Methanocaldococcus jannaschii, Methanosarcina mazei, Halobacterium sp. strain NRC-1 and S. solfataricus P2 [9]. Cumulative GC skew and Z-curve methods were tested on Thermococcales genomes using the Ori-Finder 2 web service [9], and the results obtained with four representative genomes are shown in Supplemental Fig. S1. Our results show that the cumulative GC skew method fails to locate replication origins in Thermococcales. The Z-curve approach succeeds for a few genomes such as P. abyssi and T. kodakarensis but does not provide a prediction for the remaining genomes.
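For readers unfamiliar with the cumulative GC-skew approach discussed above, a minimal Python sketch follows. It is not the Ori-Finder 2 implementation; the window size, the use of non-overlapping windows and the function names are assumptions. In genomes with a clear compositional asymmetry, the extrema of the cumulative curve are typically taken as candidate origin and terminus positions.

```python
def cumulative_gc_skew(sequence, window=1000):
    """Cumulative (G - C) / (G + C) over consecutive non-overlapping windows."""
    sequence = sequence.upper()
    total, curve = 0.0, []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        if g + c:  # skip windows without G or C to avoid division by zero
            total += (g - c) / (g + c)
        curve.append(total)
    return curve


def skew_extrema(curve, window=1000):
    """Start coordinates of the windows holding the curve minimum and maximum."""
    lo = min(range(len(curve)), key=curve.__getitem__)
    hi = max(range(len(curve)), key=curve.__getitem__)
    return lo * window, hi * window
```

In a heavily rearranged chromosome the curve shows many local extrema instead of a single clean minimum and maximum, which is consistent with the failure reported here for Thermococcales.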
Clearly, methods based on Z-curve and DNA composition bias or skew were inoperative for the robust prediction of replication origins in Thermococcales. Therefore, in order to map the position of the replication origins we adopted a different approach based on the systematic detection of biological sequences associated with the initiation of DNA synthesis. As described above, these repeated sequences, called ORBs, are clustered at or near the replication origin and are often closely associated with the Cdc6 gene, which encodes a protein involved in the initiation of replication [10]. All Thermococcales carry a single Cdc6 gene except Thermococcus sp. CL1, which carries a second putative Cdc6-related gene, CL1_0695. Using the published archaeal mini-ORB sequences [10], the web service FITBAR [11] was used to build a consensus sequence and detect its occurrences genome-wide, as described in Materials and Methods. A unique oriC could be detected unambiguously in all Thermococcales from the dataset with a p-value < 0.005 (Table 2 and Suppl. Fig. S2). No putative ORB sequence could be found near the second Cdc6-related gene of Thermococcus sp. CL1 and this observation is in agreement with Ori-Finder 2 predictions (data not shown). The association between oriC and Cdc6 was found in all genomes except Thermococcus litoralis and Thermococcus sibiricus, where the oriC-Cdc6 distance is 453 kb and 349 kb, respectively. Synteny analysis using the SYNTTAX web service [15] indicated that in the Thermococcus and Palaeococcus genera, oriC is located between Cdc6 and the Rad51 ortholog RadA (Suppl. Fig. S3A). Like its bacterial RecA and eukaryal Rad51 orthologs, RadA is involved not only in double-strand break repair but also in DNA replication by rescuing collapsed replication forks [35]. In the Pyrococcus genus, Cdc6 and oriC are also immediately adjacent whereas RadA is not syntenic (Suppl. Fig. S3B).
In all cases, the origin of replication is located in extended non-translated regions or overlaps small computer-predicted orphan genes (Suppl. Fig. S3A&B). A prediction of clustered ORB sequences obtained with the FITBAR web service [11] was used to localize oriCs as shown in Supplemental Table S1. Our analysis indicates that the most robust oriC predictions are those based solely on mini-ORB clusters. The positions of these clusters were therefore considered as bona fide oriC (Table 2, column 2). Replication origin positioning was then used as the first common reference to align and orient all genomes in the dataset (Suppl. Fig. S2).
Table 2

Prediction of oriC and dif in Thermococcales.

Species | Putative oriC: position on chromosome (ORB cluster coord.) | Putative oriC: Cdc6 coord. | Putative dif: sequence (28 bp; left arm, spacer, right arm) | Putative dif: position on chromosome | Putative dif: intergenic location
Palaeococcus pacificus DY20341 | 1858353..0 | 583..1839 | TTTGGATATAATCAACATTATATCTAAA | 1158048 | Yes
Pyrococcus abyssi GE5 | 122701..123499 | 121402..122700 | ATTGGATATAATCGGCCTTATATCTAAA | 1220264 | Yes
Pyrococcus furiosus DSM 3638 | 15355..16235 | 16236..17498 | TTTAGATATAATCAGCCTTATATCTAAA | 659548 | Yes
Pyrococcus furiosus COM1 | 1479769..1480649 | 1478506..1479768 | TTTAGATATAATCAGCCTTATATCTAAA | 462638 | Yes
Pyrococcus horikoshii OT3 | 110790..111561 | 109476..110789 | TTTAGATATAATCAGCCTTATATCTAAA | 736581 | Yes
Pyrococcus sp. NA2 | 579324..580109 | 578064..579323 | ND | |
Pyrococcus sp. ST04 | 227904..228761 | 228762..230021 | ND | |
Pyrococcus yayanosii CH1 | 1426398..1427171 | 1427172..1428431 | TTTAGATATAATGATCCTTATATCTAAA | 1058381 | Yes
Thermococcus barophilus MP | 1672620..1673707 | 1670448..1671713 | TTGTCATATAATATGCCTTATATCTAAA | 880625 | Yes
Thermococcus eurythermalis strain A501 | 425720..426421 | 423614..424867 | TTTAGATATAATGTACCTTATATCTAAA | 1862025 | Yes
Thermococcus gammatolerans EJ3 | 126739..127591 | 125431..126738 | TTTGGATATAATGTACCTTATATCTAAA | 1457065 | Yes
Thermococcus guaymasensis DSM11113 | 813701..814368 | 1594403..1595665 | TTTAGATATAATGTGCCTTATATCTCAA | 100930 | Yes
Thermococcus kodakarensis KOD1 | 1711251..1712157 | 1712158..1713405 | TTTTGATATAATGTACCTTATATGACAA | 483614 | Yes
Thermococcus litoralis DSM 5473 | 974680..975085 | 1594403..1595665 | TTTGGATATAATGTGCCTTATATGACAA | 1867166 | No
Thermococcus nautili strain 30-1 | 1603522..1604207 | 1605068..1606321 | TTGAGATATAATGTACCTTATATCTAAA | 772784 | Yes
Thermococcus onnurineus NA1 | 1510250..1510926 | 1508116..1509363 | TTTAGATATAATGTGTCTTATATCTAAA | 854799 | Yes
Thermococcus sibiricus MM 739 | 1783451..1784177 | 1434100..1435362 | TTGTCATATAATAAGCCTTATATCTAAA | 689121 | No
Thermococcus sp. 4557 | 1373703..1374410 | 1376165..1377412 | TTTTCCTATAATGTGCCTTATATCTAAA | 97343 | Yes
Thermococcus sp. AM4 | 1530315..1531266 | 1529070..1530314 | TTTGGATATAATGTGCCTTATATCCAAA | 849102 | Yes
Thermococcus sp. CL1 | 1018000..1018309 | 1020367..1021614 | TTTGGATATAATGTACCTTATATCCAAA | 1704316 | Yes
Thermococcus sp. ES1 | 1754560..1755481 | 1752377..1753639 | TTTAGATATAATGAATCTTATATGACAA | 1028150 | Yes
Thermococcales dif consensus | | | WTKDSMTATAATVDDYMTTATATSHMAA | |
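
The dif consensus in the last row of Table 2 is written with IUPAC degenerate nucleotide codes. As an illustration only, the Python sketch below expands such a consensus into a regular expression and reports forward-strand matches. This is a simpler exact-match technique than the PSSM search performed with FITBAR, and the function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

DIF_CONSENSUS = "WTKDSMTATAATVDDYMTTATATSHMAA"  # last row of Table 2


def consensus_to_pattern(consensus):
    """Expand an IUPAC consensus into a regular-expression string."""
    return "".join(f"[{IUPAC[c]}]" for c in consensus)


def find_sites(consensus, sequence):
    """Return 0-based start positions of forward-strand matches.

    A lookahead is used so that overlapping matches are also reported.
    """
    lookahead = re.compile(f"(?=({consensus_to_pattern(consensus)}))")
    return [m.start() for m in lookahead.finditer(sequence.upper())]
```

A complete search would also scan the reverse-complement strand; that step is omitted here for brevity.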

Prediction of Thermococcales DNA replication termination sites

As shown above, the cumulative GC-skew cannot be used reliably to predict the location of terC, where Thermococcales terminate bidirectional DNA replication. So far, terC sites have received much less attention than oriC. To our knowledge, neither biological nor sequence data are available to define where replication forks meet. In accordance with the bacterial paradigm, archaeal DNA replication forks are believed to terminate in the vicinity of dif sites [14], [36]. These dif sites are present in a single copy per genome and are used by a Xer-like recombinase to resolve chromosome dimers, a critical step before their segregation into daughter cells [37]. The 28-nt dif site is composed of two inverted repeats of 11 base pairs (each one specific for one of the two Xer recombinases) separated by a central hexanucleotide; the XerCD/dif recombination system is widespread in the bacterial domain [38]. The efficiency of the archaeal XerA/dif system has been demonstrated in vitro [14]. By sequence homology search, XerA orthologs were found in single copy in all Thermococcales (data not shown). In order to identify dif sites in our dataset, we followed the same methodology used for oriC, as described above. The biological dif sites proposed by Cortez et al. [14] were used to build a consensus for genome-wide searching using FITBAR [11]. Bona fide unique dif sites could be identified for 19 genomes out of 21 (Table 2 and Suppl. Fig. S2). The dif site positions of Pyrococcus sp. NA2 and Pyrococcus sp. ST04 were estimated to be opposite their respective predicted oriC.
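
The 11-6-11 architecture described above can be checked on any 28-nt site with a short Python sketch. The function names are hypothetical; the sketch simply splits the site into left arm, spacer and right arm, and counts how many positions of the left arm agree with the reverse complement of the right arm.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_dif(site):
    """Split a 28-nt dif site into (left arm, spacer, right arm)."""
    if len(site) != 28:
        raise ValueError("expected a 28-nt site")
    return site[:11], site[11:17], site[17:]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def arm_identity(site):
    """Number of positions (out of 11) where the left arm equals the
    reverse complement of the right arm."""
    left, _, right = split_dif(site.upper())
    return sum(a == b for a, b in zip(left, reverse_complement(right)))
```

For the P. abyssi site listed in Table 2, `arm_identity` gives 9 matching positions out of 11, so the two arms form an imperfect inverted repeat.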

Core genome

Early chromosomal alignments demonstrated the high level of recombination and rearrangement in Thermococcales genomes [6]. These observations indicate that these genomes evolve rapidly, which might suggest that their genetic content is also highly variable among species. In order to quantify this genomic drift, we submitted our dataset to a recursive systematic comparison of the predicted protein sequences they encode. Each Thermococcales genome encodes an average of 2100 proteins. All the corresponding sequences were compared as described in Material and Methods in order to rank them into orthologous groups. These groups could then be queried to extract common proteins, defined as the 'core genome', as well as species-specific or genus-specific proteins and their combinations (Fig. 2). We have used two genetic subsets to define the core: a distinction was made between the 'general core', which contains protein orthologs and paralogs present in every genome, and a more restrictive 'single core', which groups only single-copy orthologs shared by all genomes. The general core and single core amount to 790 and 668 proteins respectively (Fig. 2 and Suppl. Table S2A&B). A detailed gene list of the 668-protein single core is presented in Supplemental Table S3. The same procedure allowed the identification of genus-specific proteins as well. The Pyrococcus and Palaeococcus genera encoded respectively 19 and 116 specific proteins, whereas a single Thermococcus-specific protein was found. As shown in Table 3, these proteins could be ranked into functional groups as defined in the archaeal clusters of orthologous genes (arCOGs) [39]. The core genome comprises proteins of the following classes: information storage and processing (32%), metabolism (30%), poorly characterized (27%) and cellular processes and signaling (11%). This high conservation is in sharp contrast with the very limited chromosomal alignment observed in these organisms [6].
Thus it seemed important to analyze whether this genomic conservation would be clustered to particular chromosomal locations.
Fig. 2

Venn diagram of core and genus-specific protein counts. Core proteins, genus-specific proteins and their combinations were computed as described in Materials and Methods.

Table 3

ArCOG assignment of the Thermococcales core genes.

ArCOG class | Function | 790 core | 668 core
Information storage and processing 32% (34%) | Translation, ribosomal structure and biogenesis | 149 | 140
 | RNA processing and modification | 0 | 0
 | Transcription | 52 | 43
 | Replication, recombination and repair | 51 | 45
 | Chromatin structure and dynamics | 0 | 0
Cellular processes and signaling 11% (10%) | Cell cycle control, cell division, chromosome partitioning | 11 | 8
 | Nuclear structure | 0 | 0
 | Defense mechanisms | 11 | 8
 | Signal transduction mechanisms | 5 | 4
 | Cell wall/membrane/envelope biogenesis | 14 | 12
 | Cell motility | 7 | 5
 | Cytoskeleton | 0 | 0
 | Extracellular structures | 0 | 0
 | Intracellular trafficking, secretion, and vesicular transport | 8 | 8
 | Posttranslational modification, protein turnover, chaperones | 31 | 22
 | Mobilome: prophages, transposons | 0 | 0
Metabolism 30% (27%) | Energy production and conversion | 52 | 28
 | Carbohydrate transport and metabolism | 33 | 30
 | Amino acid transport and metabolism | 45 | 36
 | Nucleotide transport and metabolism | 28 | 25
 | Coenzyme transport and metabolism | 41 | 36
 | Lipid transport and metabolism | 12 | 12
 | Inorganic ion transport and metabolism | 25 | 11
 | Secondary metabolites biosynthesis, transport and catabolism | 5 | 4
Poorly characterized 27% (29%) | General function prediction only | 128 | 115
 | Function unknown | 82 | 76

In column 1, percentages outside parentheses refer to the 790 core genes and percentages in parentheses refer to the 668 single core genes.

Core genome positioning

In Eukarya, genes involved in related and essential functions often cluster on the chromosome and are co-expressed, which correlates with elevated expression rates [40], [41]. In Archaea and Bacteria, these genes belong to single transcription units or operons, which provide tight co-regulation in addition to expression polarity [42]. Furthermore, bacterial genomes display a non-random gene organization at a higher level such as macrodomains [43] or at multiple scales [44]. Additional chromosomal structuring involves positioning of essential genes preferentially on the leading strand [45] and clustering of transcription and replication genes in the proximity of the bacterial origin of replication [46]. The archaeal chromosome organization has not been investigated in depth, with the exception of a few Crenarchaeota. It was shown that S. solfataricus and S. acidocaldarius are equipped with three origins of replication surrounded by a higher density of core or essential genes; furthermore, these same regions are more highly expressed [36]. These reports prompted us to investigate the genomic architecture of the euryarchaeal order Thermococcales. For each genome in the dataset, we constructed a detailed physical map indicating the position of each gene. We used our oriC and dif site predictions to determine the polarity of each gene relative to the orientation of the replication forks (Fig. 3 and Suppl. Fig. S2). These maps could be used to calculate the proportion of genes whose transcription is collinear with the orientation of DNA replication. Out of the 19 genomes where dif could be predicted, 16 display a higher proportion of genes encoded on the leading strand (Suppl. Table S4). Plotting of 'single core' genes onto the same circular physical maps indicated an even higher proportion of leading strand-encoded genes for 16 genomes (Suppl. Table S4).
Since previous studies have shown that essential Sulfolobus genes are clustered near the origin of replication [36], we investigated whether this is the case in Thermococcales as well. We therefore calculated the genomic distance to the respective predicted oriC for each single core ortholog (Suppl. Table S3). Computation of their mean distance and standard deviation allowed the definition of 17 gene clusters whose distance to oriC remains relatively invariable across species (Table 4). The locations of these clusters for each Thermococcales genome are shown in Supplemental Fig. S2; they often correlate with GC-skew variations.
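
The leading-strand assignment described above can be sketched in Python as follows. The sketch assumes a circular chromosome with one oriC and one terminus (taken here as the dif position), that each fork travels from oriC to the terminus along its own replichore, and that gene strand is coded as +1 or -1; the function names are hypothetical and the authors' script may differ.

```python
def on_clockwise_replichore(pos, ori, ter, genome_len):
    """True if pos lies on the arc travelled clockwise from ori to ter."""
    return (pos - ori) % genome_len < (ter - ori) % genome_len


def is_codirectional(pos, strand, ori, ter, genome_len):
    """True if the gene is transcribed in the direction of fork movement.

    On the clockwise replichore the fork moves towards increasing
    coordinates, so + strand genes are co-directional; on the other
    replichore the same holds for - strand genes.
    """
    clockwise = on_clockwise_replichore(pos, ori, ter, genome_len)
    return (strand == 1) == clockwise


def leading_fraction(genes, ori, ter, genome_len):
    """Fraction of (position, strand) genes co-directional with replication."""
    hits = sum(is_codirectional(p, s, ori, ter, genome_len) for p, s in genes)
    return hits / len(genes)
```

Applying `leading_fraction` first to all genes and then to the single core genes of a genome would give the two proportions that are compared in the text.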
Fig. 3

Graphical correlation between core-free genomic regions and integration of mobile elements in Thermococcus kodakarensis. The physical map corresponding to Thermococcus kodakarensis was drawn proportionally. The outermost numbered cyan bars indicate the clusters of core genes. Each black bar positions a single gene of the entire genome: the outer bars correspond to genes transcribed in the same polarity as DNA replication; the inner bars refer to the opposite orientation. Similarly, red bars correspond to single 'core genes' with the same orientation convention as above. Bright green bars indicate the location of clusters of species-specific genes (integrated mobile elements). Purple and green bars correspond to GC skew values calculated in windows of 1000 bp, shifted by 500 bp, with the purple and green bars indicating values below and above the average genomic GC skew, respectively. Predicted origins of replication and dif sites are shown as green circles and red squares, respectively. The positions of the four integrated elements (TKV1 to TKV4) as well as the predicted dark matter islands are represented in blue.

Table 4

Thermococcales conserved clusters characteristics.

Reference mean expression levels: pangenomic, 668.5; single core, 896.7; clusters, 1978.8.

Cluster | oriC distance, mean (%) | oriC distance, standard deviation (%) | Number of genes | Mean expression level | Relevant encoded protein(s)
01 | 0.33 | 0.44 | 3 | 478.9 | Hypothetical
02 | 2.69 | 1.91 | 2 | 221.1 | Molybdopterin converting factor, subunit 2
03 | 5.17 | 3.42 | 2 | 2551.7 | Hypothetical
04 | 5.39 | 3.23 | 3 | 557.2 | KEOPS complex KAE1
05 | 7.36 | 4.34 | 7 | 877.6 | V-type ATP synthase, 7 subunits
06 | 8.25 | 3.41 | 3 | 268.2 | Preprotein translocase
07 | 9.14 | 4.67 | 2 | 357.5 | Oligopeptide transporters
08 | 12.94 | 5.18 | 5 | 2926.0 | RNA polymerase
09 | 17.76 | 3.90 | 27 | 3626.6 | Ribosomal proteins
10 | 20.89 | 3.63 | 10 | 2234.8 | Ribosomal proteins - RNA polymerase
11 | 22.40 | 5.77 | 5 | 482.4 | Thymidylate kinase
12 | 23.46 | 4.47 | 3 | 1011.2 | DNA primase
13 | 24.62 | 5.45 | 3 | 234.9 | Mevalonate kinase
14 | 26.50 | 5.92 | 7 | 1535.2 | Ribosomal proteins - RNA polymerase
15 | 33.34 | 6.01 | 2 | 486.7 | Glutamyl-tRNA(Gln) amidotransferase
16 | 34.14 | 5.44 | 2 | 840.6 | Translation initiation factor IF-2
17 | 38.58 | 5.63 | 2 | 1685.0 | Ribosomal protein

Expression of core genes and conserved gene clusters

Recent experiments have shown that core genes are more strongly expressed in the model organism E. coli [47]. It was therefore important to verify this observation in Thermococcales. The next logical step consisted of analyzing the correlation between gene position and level of gene expression. We used the pangenomic gene expression data recently measured in P. abyssi using RNA-seq [17]. As shown in Table 4, the mean expression level of the 17 gene clusters described above indicates that they are more highly transcribed than single core genes, which in turn are more highly expressed than non-core genes. The largest clusters, 8, 9 and 10, were found to be the most highly expressed; they contain genes encoding RNA polymerase subunits and ribosomal proteins. Remarkably, these clusters are positioned at one-quarter of the genome length, suggesting that a high selective pressure is acting to constrain them at this particularly favorable location.

Localization of organism-specific genes

The positioning of the 'single core' genes on the chromosomal maps revealed, for all genomes, a number of large areas devoid of core genes (Fig. 3 and Suppl. Fig. S2). We observed that clusters containing 3 or more species-specific genes overlap these blank regions. Since species-specific clusters very likely correspond to the integration of mobile elements such as plasmids or viruses, we can extrapolate that these blank regions are integrated mobile elements shared by several genomes. Contrary to what was observed in Sulfolobales [48], the integration of mobile elements in Thermococcales is not confined to a specific location and seems to occur randomly on the chromosome (Suppl. Fig. S2). To confirm this observation, we mapped onto the T. kodakarensis genomic map the four known integrated elements (TKV1 to TKV4) [49] and the predicted dark matter islands [50]; all are located in core-free regions (Fig. 3).
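
The detection of species-specific clusters can be illustrated with a short sketch. It assumes that a cluster is a run of at least three consecutive species-specific genes in chromosomal order, which is one plausible reading of "clusters containing 3 or more species-specific genes"; the function name and the boolean input format are hypothetical, and runs that wrap around the end of the circular gene list are not merged.

```python
from itertools import groupby


def specific_clusters(flags, min_size=3):
    """Find runs of consecutive species-specific genes.

    flags    -- list of booleans in chromosomal gene order,
                True meaning the gene is species-specific
    min_size -- minimum run length to report (3, following the text)

    Returns a list of (first_index, last_index) pairs, inclusive.
    """
    runs = []
    index = 0
    for is_specific, group in groupby(flags):
        length = len(list(group))
        if is_specific and length >= min_size:
            runs.append((index, index + length - 1))
        index += length
    return runs


if __name__ == "__main__":
    # Toy gene order: False = shared gene, True = species-specific gene
    toy = [False, True, True, True, False, True, True,
           False, True, True, True, True]
    print(specific_clusters(toy))
```

The reported index ranges could then be compared with the coordinates of the core-free regions to test for overlap.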

Discussion

With the exception of three methanogens, all archaeal genomes sequenced to date encode at least one Cdc6/Orc1 protein which initiates chromosomal DNA replication at one or more oriC origins [51]. In most prokaryotes, including several Archaea, chromosomal oriCs can be predicted on the basis of DNA composition using GC-skew [52] or Z-curve algorithms [53]. The comparative genomics analysis presented here confirms the initial observation that Thermococcales chromosomes are highly rearranged. In these genomes, DNA sequence scrambling has reached such a high level that commonly observed prokaryotic chromosomal landmarks such as oriC and terC are no longer readily identifiable by measuring DNA composition biases. It was indeed reported that pure in silico approaches can be unreliable due to frequent genome rearrangements [54]. Nevertheless, the regions corresponding to the origin and termination of replication could be predicted by means of biological sequence sites determined either biochemically or by analogy to bacterial systems. In most Archaea, replication initiates at ORB sites specifically recognized and bound by Cdc6 [22]. Using the well-documented ORB sequences [10], unique origins of replication could be predicted unambiguously for all 21 genomes. They are located in close proximity to RadA, which is also the genomic context of Cdc6 in 19 of the 21 genomes. The chromosomal location of terC was identified by means of the XerC binding site (dif) as defined by Cortez et al. [14]. A unique corresponding site could be identified with high confidence in 19 of the 21 genomes. The locations of oriC and dif in each genome define the respective replichores, which appear asymmetrical in most Thermococcales and extremely asymmetrical in Pyrococcus yayanosii. This observation raises the question of whether terC and dif are co-localized. By analogy to bacterial systems, it is commonly accepted that DNA replication termination and dif sites coincide [14], [36].
On the other hand, an extensive computational analysis of bacterial genomes has shown a lack of correlation between dif position and the degree of GC skew, suggesting that replication termination does not occur strictly at dif sites [55]. However, it is difficult to extrapolate replication features between Archaea and Bacteria since they use very different replication proteins. Recent evidence has shown that in the crenarchaeon S. solfataricus, replication termination and dimer resolution are temporally and spatially distinct processes [56]. Since this organism carries three functional oriCs whereas a single one is found in Thermococcales, it is once again difficult to transpose replication features across archaeal phyla. In the absence of experimental data and of a functional cumulative GC skew in Thermococcales, we can neither prove nor disprove that terC and dif positions are distinct.

To assess whether the observed genomic rearrangement could be reflected at the protein level as well, we conducted an extensive ranking of each protein into orthologous groups using a discriminant threshold of 30% similarity. This procedure allowed us to characterize the core genome of Thermococcales as well as genus- and species-specific proteins. The 21 genomes considered here share 790 orthologs, which corresponds to ∼40% of their total proteins. From the core genome, we isolated the subset of proteins found only once per genome. The genes encoding these 668 'single core' proteins were plotted onto circular chromosome maps, which revealed several interesting features. First, the 'single core' genes are not evenly distributed along the chromosome: a number of very extensive areas without core genes are readily observable in all 21 genomes. This phenomenon can be interpreted as the result of recent acquisitions of (non-essential) genetic information through horizontal transfer.
In a further analysis we were indeed able to show that clusters of strain-specific genes, which presumably correspond to integrated mobile elements, are located precisely within these regions. A second feature is the conservation of clusters of core genes at particular locations on the chromosome across Thermococcales. A series of 17 clusters could be identified with a standard deviation of mean distance to origin ≤6%. Despite a high level of genomic rearrangement, the absolute distance between these clusters and the origin of replication remains remarkably constant. These clusters are not confined to oriC-proximal regions but are scattered along the entire chromosome. It is interesting to note that the individual clusters do not belong to the same replichore in every organism; however, their distance to oriC is maintained in a mirrored fashion. The size of each cluster is variable and ranges from 2 to 27 genes, often expressed in operons. The largest clusters group essential genes involved in protein translation (cluster 9, 27 genes), gene transcription and protein translation (cluster 10, 10 genes; cluster 14, 7 genes) and energy metabolism (cluster 5, 7 genes). A third feature of the 'single core' is its enrichment in genes encoded on the leading strand. This is particularly true of the largest clusters, for which a net variation in GC skew is also readily apparent and very likely reflects an orientation bias of the genes composing the clusters. Indeed, we computed that in 16 organisms out of 19, the core genome is enriched in genes expressed in the same orientation as DNA replication. We were able to show that most of the large clusters display a significantly higher expression rate, which further correlates conserved gene position with essential biological functions. The positional conservation of essential genomic subregions is found in the three domains of life [40], [41], [42].
This work has shown that this property is particularly relevant in the archaeal order Thermococcales due to the high level of rearrangement of their chromosomes. These small and heavily scrambled genomes were able to maintain highly expressed key genes in the most favorable chromosomal positions and to transcribe them in a polarity compatible with DNA replication. We hypothesize that genome shuffling is instrumental in adapting to challenging extreme environments.
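
The "mirrored" distance to oriC discussed above can be made concrete with a small sketch. It assumes that the distance is measured along the shorter arc of the circular chromosome and expressed as a percentage of genome length, so that a cluster on either replichore yields the same value; this is an interpretation consistent with the values of Table 4 (all below 50%), not a description of the authors' code. Function names are hypothetical, and the sample standard deviation is used because the source does not specify which estimator was applied.

```python
from statistics import mean, stdev


def ori_distance_pct(position, ori, genome_len):
    """Distance from oriC along the shorter arc, in % of genome length.

    position, ori -- coordinates in bp, each in [0, genome_len)
    The result lies between 0 and 50 and is identical for positions
    that mirror each other on the two replichores.
    """
    d = abs(position - ori) % genome_len
    d = min(d, genome_len - d)
    return 100.0 * d / genome_len


def positional_spread(distances_pct):
    """Mean and sample standard deviation of one cluster's oriC
    distances across genomes (one value per genome)."""
    return mean(distances_pct), stdev(distances_pct)


if __name__ == "__main__":
    # Toy example: the same cluster in three hypothetical genomes,
    # given as (cluster position, oriC position, genome length) in bp.
    toy = [(200_000, 0, 2_000_000),
           (1_650_000, 50_000, 1_900_000),
           (380_000, 150_000, 2_100_000)]
    dists = [ori_distance_pct(p, o, n) for p, o, n in toy]
    avg, sd = positional_spread(dists)
    # 6% is the standard deviation cutoff quoted in the text
    print(dists, avg, sd, sd <= 6.0)
```

In the second toy genome the cluster lies on the opposite replichore, yet its folded distance is comparable to the others, which is the behavior described as maintenance "in a mirrored fashion".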

Conclusion

Evolution considerations

All the above observations indicate that a remarkable degree of 'order' has been maintained across Thermococcales even though they display highly scrambled chromosomes. Nevertheless, these organisms display an astonishingly short cell cycle in extreme and resource-deficient environments. This apparent paradox motivated our analysis. The data presented here lead us to propose that Thermococcales chromosome shuffling introduces an increased genome variability which is actively used by natural selection (1) to maintain highly expressed key essential genes in favorable and invariant chromosomal positions and (2) to continuously adapt and optimize the positioning of the constant flow of new genes acquired by horizontal transfer, in order to allow allopatric speciation. The molecular mechanism by which Thermococcales rearrange their chromosomes is presently being investigated.

Review 1.  DNA replication in the archaea.

Authors:  Elizabeth R Barry; Stephen D Bell
Journal:  Microbiol Mol Biol Rev       Date:  2006-12       Impact factor: 11.056

2.  Archaeal proviruses TKV4 and MVV extend the PRD1-adenovirus lineage to the phylum Euryarchaeota.

Authors:  Mart Krupovic; Dennis H Bamford
Journal:  Virology       Date:  2008-03-04       Impact factor: 3.616

Review 3.  The cell cycle of archaea.

Authors:  Ann-Christin Lindås; Rolf Bernander
Journal:  Nat Rev Microbiol       Date:  2013-07-29       Impact factor: 60.633

4.  Complete genome sequence of hyperthermophilic Pyrococcus sp. strain NA2, isolated from a deep-sea hydrothermal vent area.

Authors:  Hyun Sook Lee; Seung Seob Bae; Min-Sik Kim; Kae Kyoung Kwon; Sung Gyun Kang; Jung-Hyun Lee
Journal:  J Bacteriol       Date:  2011-05-20       Impact factor: 3.490

5.  A conserved mechanism for replication origin recognition and binding in archaea.

Authors:  Alan I Majerník; James P J Chong
Journal:  Biochem J       Date:  2008-01-15       Impact factor: 3.857

6.  Replication termination and chromosome dimer resolution in the archaeon Sulfolobus solfataricus.

Authors:  Iain G Duggin; Nelly Dubarry; Stephen D Bell
Journal:  EMBO J       Date:  2010-11-26       Impact factor: 11.598

7.  S-MART, a software toolbox to aid RNA-Seq data analysis.

Authors:  Matthias Zytnicki; Hadi Quesneville
Journal:  PLoS One       Date:  2011-10-06       Impact factor: 3.240

8.  RNA at 92 °C: the non-coding transcriptome of the hyperthermophilic archaeon Pyrococcus abyssi.

Authors:  Claire Toffano-Nioche; Alban Ott; Estelle Crozat; An N Nguyen; Matthias Zytnicki; Fabrice Leclerc; Patrick Forterre; Philippe Bouloc; Daniel Gautheret
Journal:  RNA Biol       Date:  2013-07-02       Impact factor: 4.652

9.  The dif/Xer recombination systems in proteobacteria.

Authors:  Christophe Carnoy; Claude-Alain Roten
Journal:  PLoS One       Date:  2009-09-03       Impact factor: 3.240

10.  SyntTax: a web server linking synteny to prokaryotic taxonomy.

Authors:  Jacques Oberto
Journal:  BMC Bioinformatics       Date:  2013-01-16       Impact factor: 3.169
