Yu Kang, Chaohao Gu, Lina Yuan, Yue Wang, Yanmin Zhu, Xinna Li, Qibin Luo, Jingfa Xiao, Daquan Jiang, Minping Qian, Aftab Ahmed Khan, Fei Chen, Zhang Zhang, Jun Yu.
Abstract
The prokaryotic pangenome partitions genes into core and dispensable genes. The order of core genes, albeit assumed to be stable under selection in general, is frequently interrupted by horizontal gene transfer and rearrangement, but how a core-gene-defined genome maintains its stability or flexibility remains to be investigated. Based on data from 30 species, including 425 genomes from six phyla, we grouped core genes into syntenic blocks in the context of a pangenome according to their stability across multiple isolates. A subset of the core genes, often species specific and lineage associated, formed a core-gene-defined genome organizational framework (cGOF). Such cGOFs are either single segmental (one-third of the species analyzed) or multisegmental (the rest). Multisegment cGOFs were further classified into symmetric or asymmetric according to segment orientations toward the origin-terminus axis. The cGOFs in Gram-positive species are exclusively symmetric and often reversible in orientation, as opposed to those of the Gram-negative bacteria, which are all asymmetric and irreversible. Meanwhile, all species showing strong strand-biased gene distribution contain symmetric cGOFs and often specific DnaE (α subunit of DNA polymerase III) isoforms. Furthermore, functional evaluations revealed that cGOF genes are hub-associated with regard to cellular activities, and the stability of cGOFs provides efficient indexes for scaffold orientation, as demonstrated by assembling virtual and empirical genome drafts. cGOFs show species specificity, and the symmetry of multisegmental cGOFs is conserved among taxa and constrained by DNA polymerase-centric strand-biased gene distribution. The definition of species-specific cGOFs provides powerful guidance for genome assembly and other structure-based analyses.
IMPORTANCE: Prokaryotic genomes are frequently interrupted by horizontal gene transfer (HGT) and rearrangement. It is important to know whether there is a set of genes that is not only conserved in position among isolates but also functionally essential for a given species, and to evaluate the stability or flexibility of such genome structures across lineages. Based on a large body of multi-isolate pangenomic data, our analysis reveals that a subset of core genes is organized into a core-gene-defined genome organizational framework, or cGOF. Furthermore, the lineage-associated cGOFs of Gram-positive and Gram-negative bacteria behave differently: the former, composed of 2 to 4 segments, have their fragments symmetrically rearranged around the origin-terminus axis, whereas the latter show more complex segmentation and are partitioned asymmetrically into chromosomal structures. The definition of cGOFs provides new insights into prokaryotic genome organization and efficient guidance for genome assembly and analysis.
Year: 2014 PMID: 25425232 PMCID: PMC4251990 DOI: 10.1128/mBio.01867-14
Source DB: PubMed Journal: mBio Impact factor: 7.867
cGOF characteristics of representative species
| cGOF class and species | Gram stain | Phylum | Habitat[a] | DnaE group[b] | No. of segments | No. of cGOF genes | No. of core genes | cGOF/core genes (%) | Genome size (Mb) | No. of coding genes | GC (%) | LeGP[f] | No. of samples |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single segment | | | | | | | | | | | | | |
| | + | | H | 1 | 1 | 1,305 | 1,305 | 100.0 | 1.94 | 1,570 | 60.50 | 0.70 | 11 |
| | + | | H | 2 | 1 | 1,570 | 1,573 | 99.8 | 2.47 | 2,276 | 53.55 | 0.62 | 13 |
| | + | | H | 2[c] | 1 | 1,369 | 1,370 | 99.9 | 2.32 | 2,078 | 52.18 | 0.57 | 15 |
| | + | | S | 3 | 1 | 3,102 | 3,109 | 99.8 | 5.52 | 5,608 | 35.35 | 0.75 | 13 |
| | + | | S | 3 | 1 | 2,778 | 2,780 | 99.9 | 4.40 | 4,363 | 43.93 | 0.74 | 11 |
| | + | | H, S, A | 3 | 1 | 2,283 | 2,303 | 99.1 | 3.92 | 3,604 | 28.08 | 0.83 | 10 |
| | + | | H | 3 | 1 | 1,455 | 1,455 | 100.0 | 2.84 | 2,525 | 32.85 | 0.76 | 12 |
| | − | | A | 2 | 1 | 1,385 | 1,550 | 89.4 | 4.59 | 3,939 | 44.77 | 0.54 | 12 |
| | − | | H | 1 | 1 | 1,486 | 2,542 | 58.5 | 5.10 | 4,766 | 50.67 | 0.55 | 19 |
| | − | | H | 1 | 1 | 814 | 814 | 100.0 | 1.14 | 1,026 | 52.80 | 0.66 | 10 |
| Symmetric | | | | | | | | | | | | | |
| | + | | H | 1 | 2 | 865 | 885 | 97.7 | 2.46 | 2,005 | 59.98 | 0.66 | 10 |
| | + | | H | 2 | 2 | 2,666 | 2,666 | 100.0 | 4.40 | 3,941 | 65.57 | 0.59 | 22 |
| | + | | H | 2 | 2 | 1,648 | 1,649 | 99.9 | 2.52 | 2,295 | 60.05 | 0.60 | 10 |
| | + | | S | 3 | 2 | 2,808 | 2,812 | 99.8 | 4.00 | 3,929 | 46.23 | 0.75 | 11 |
| | + | | S | 3 | 2 | 2,264 | 2,982 | 75.9 | 6.03 | 6,055 | 35.15 | 0.75 | 11 |
| | + | | S, A | 3 | 2 | 797 | 798 | 99.9 | 2.94 | 2,906 | 38.04 | 0.79 | 29 |
| | + | | H | 3 | 4[d] | 948 | 962 | 98.5 | 2.09 | 1,999 | 41.22 | 0.79 | 16 |
| | + | | H | 3 | 4 | 794 | 1,197 | 66.3 | 2.11 | 2,035 | 39.70 | 0.79 | 26 |
| | + | | H | 3 | 4 | 1,173 | 1,173 | 100.0 | 1.86 | 1,808 | 38.53 | 0.78 | 9 |
| Asymmetric | | | | | | | | | | | | | |
| | NA[e] | | A | NA[e] | 7 | 1,677 | 1,827 | 91.8 | 2.65 | 2,745 | 35.17 | 0.49 | 10 |
| | NA | | A | 1 | 9 | 472 | 754 | 62.6 | 1.86 | 2,050 | 35.98 | 0.49 | 12 |
| | − | | H | 1 | 7 | 1,046 | 1,204 | 86.9 | 2.22 | 1,953 | 51.58 | 0.53 | 14 |
| | − | | H | 1 | 6 | 916 | 953 | 96.1 | 1.69 | 1,674 | 30.51 | 0.62 | 11 |
| | − | | H | 1 | 6 | 638 | 923 | 69.1 | 1.63 | 1,508 | 38.91 | 0.59 | 17 |
| | − | | A | 1 | 5 | 1,065 | 1,271 | 83.8 | 3.97 | 3,689 | 39.02 | 0.59 | 15 |
| | − | | A | 1 | 13 | 498 | 995 | 49.1 | 1.90 | 1,619 | 32.25 | 0.61 | 12 |
| | − | | A | 1 | 2 | 2,148 | 2,200 | 97.6 | 3.50 | 3,097 | 38.35 | 0.57 | 12 |
| | − | | H, S, A | 2 | 32 | 497 | 1,234 | 40.3 | 6.10 | 5,516 | 61.84 | 0.55 | 11 |
| | − | | H | 1 | 12 | 2,204 | 2,268 | 97.2 | 4.87 | 4,595 | 52.13 | 0.59 | 29 |
| | − | | H | 1 | 30 | 1,283 | 2,290 | 56.1 | 4.75 | 4,072 | 47.65 | 0.59 | 12 |
[a] H, host; S, soil; A, aquatic.
[b] The three DnaE groups are classified based on the presence of different DNA polymerase III gene isoforms and other related mutator genes: 1, dnaE1-dnaE1; 2, dnaE1-dnaE1-dnaE2; 3, polC-dnaE3-polV.
[c] This species is proposed to be a member of the DnaE2 group, since the dnaE2 gene has been found in almost all species of this genus and is sometimes carried on plasmids that are not included in the chromosomal sequences.
[d] This species is proposed to have a four-segment symmetric cGOF, but the arm segment is very short and transfers between the opposite positions along the origin-terminus axis.
[e] NA, not available.
[f] LeGP, leading-strand gene proportion.
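The LeGP column can be illustrated with a minimal sketch. It assumes a circular chromosome with known origin and terminus coordinates and uses the common convention that a gene counts as leading-strand when it is transcribed in the direction of replication fork movement. The function names, the `(start, strand)` gene representation, and the toy coordinates are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# A gene is (start_coordinate, strand); strand is +1 (forward) or -1 (reverse).
Gene = Tuple[int, int]


def on_leading_strand(start: int, strand: int, origin: int,
                      terminus: int, genome_len: int) -> bool:
    """Return True if the gene is co-oriented with replication fork movement."""
    # Clockwise distances from the origin, wrapping around the circle.
    gene_offset = (start - origin) % genome_len
    ter_offset = (terminus - origin) % genome_len
    # Genes between origin and terminus (clockwise) lie on the replichore whose
    # fork moves toward increasing coordinates, so forward-strand genes lead.
    if gene_offset < ter_offset:
        return strand == 1
    # On the other replichore the fork moves toward decreasing coordinates.
    return strand == -1


def legp(genes: List[Gene], origin: int, terminus: int, genome_len: int) -> float:
    """Fraction of genes encoded on the leading strand."""
    if not genes:
        raise ValueError("no genes supplied")
    leading = sum(on_leading_strand(s, d, origin, terminus, genome_len)
                  for s, d in genes)
    return leading / len(genes)


# Hypothetical toy genome: 1,000 bp, origin at 0, terminus at 500.
toy_genes = [(100, 1), (200, -1), (600, -1), (700, 1), (900, -1)]
print(legp(toy_genes, origin=0, terminus=500, genome_len=1000))  # 0.6
```

In this toy case three of the five genes follow their replication fork, giving a proportion of 0.6, on the same scale as the LeGP values listed in the table.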
FIG 1 The workflow of cGOF definition. cGOF is a subset of the order-stable core genes of a given pangenome and is divided into multiple segments when genome rearrangement occurs. All multiple-segment cGOFs are grouped as symmetric or asymmetric according to their symmetry in segmentation and rearrangement. The downward red arrows indicate the origin of replication, and black dashed lines indicate the origin-terminus axis. Segments are colored according to their movement patterns: black bars indicate those immobile with respect to the origin, dark and light green bars indicate arm segments that exchange their locations and orientations, and red bars indicate segments that are either locally inverted or moved to other locations. Chromosomal regions without cGOF genes are indicated with thin lines.
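The segmentation idea in this workflow can be sketched on a toy example. The sketch assumes that the core genes of a reference isolate are numbered 1..N in chromosomal order and that a second isolate is written as a signed list of those numbers, with a negative sign marking an inverted gene. It is a generic breakpoint-style split, not the authors' published procedure, and the function name is hypothetical.

```python
from typing import List


def collinear_segments(signed_order: List[int]) -> List[List[int]]:
    """Split a signed gene order into maximal runs that keep reference adjacency.

    Two neighbouring genes stay in the same segment when the second equals
    the first plus one. This covers forward runs (1, 2, 3) as well as
    inverted runs (-6, -5, -4), which read 4, 5, 6 on the opposite strand.
    """
    if not signed_order:
        return []
    segments = [[signed_order[0]]]
    for prev, cur in zip(signed_order, signed_order[1:]):
        if cur == prev + 1:
            segments[-1].append(cur)   # adjacency preserved, extend the run
        else:
            segments.append([cur])     # breakpoint, start a new segment
    return segments


# Hypothetical isolate in which reference genes 4-6 are inverted in place.
print(collinear_segments([1, 2, 3, -6, -5, -4, 7, 8]))
# [[1, 2, 3], [-6, -5, -4], [7, 8]]
```

For simplicity the sketch treats the gene order as linear, so a segment that spans the first and last positions of a circular chromosome would be reported as two runs.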
Additional characteristics of symmetric and asymmetric cGOFs
| Parameter | Symmetric cGOFs | Asymmetric cGOFs | P value[a] |
|---|---|---|---|
| Gram staining (no./total) | | | 0.045 × 10⁻³ |
| Positive | 9/9 | 0/11 | |
| Negative | 0/9 | 9/11 | |
| NA[b] | 0/9 | 2/11 | |
| Phylum (no./total) | | | 0.17 × 10⁻³ |
| | 6/9 | 0/11 | |
| | 3/9 | 0/11 | |
| | 0/9 | 9/11 | |
| Others | 0/9 | 2/11 | |
| DnaE (no./total) | | | 3.4 × 10⁻³ |
| DnaE1 | 1/9 | 9/11 | |
| DnaE2 | 2/9 | 1/11 | |
| DnaE3 | 6/9 | 0/11 | |
| NA | 0/9 | 1/11 | |
| Habitats (no./total) | | | 70 × 10⁻³ |
| Host | 6/9 | 5/11 | |
| Soil | 2/9 | 0/11 | |
| Aquatic | 0/9 | 5/11 | |
| Multiple | 1/9 | 1/11 | |
| Leading-strand gene proportion (%) | 72.2 ± 8.3 | 57.3 ± 3.9 | 0.090 × 10⁻³ |
| Genome size (Mb) | 3.16 ± 1.39 | 3.19 ± 1.54 | 0.96 |
| GC content (%) | 47.2 ± 11.5 | 42.1 ± 9.8 | 0.31 |
[a] The P value was calculated using the chi-square test for count data and Student’s t test for measurement data.
[b] NA, not available.
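As a worked illustration of the chi-square test on count data, the sketch below recomputes the Gram-staining comparison from the counts in this table. It assumes that the NA species form their own category, giving a 3 × 2 table with 2 degrees of freedom, and it relies on the fact that the chi-square survival function reduces to exp(−x/2) in that case. The function name is hypothetical, and the study's own software and settings are not stated here.

```python
import math
from typing import List, Tuple


def chi_square(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: Gram positive, Gram negative, NA. Columns: symmetric, asymmetric.
gram_counts = [[9, 0], [0, 9], [0, 2]]
stat, dof = chi_square(gram_counts)

# Closed-form P value, valid only because dof == 2 (an assumption of this sketch).
p_value = math.exp(-stat / 2)
print(stat, dof, p_value)  # statistic close to 20, dof 2, P close to 4.5e-05
```

Under these assumptions the result is about 4.5 × 10⁻⁵, in line with the 0.045 × 10⁻³ reported for Gram staining. Tables with other degrees of freedom need a general chi-square distribution function instead of the closed form.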
FIG 2 Segmentation and rearrangement of symmetric cGOFs. (A) cGOF segmentation and rearrangement in three Streptococcus spp. Segments are colored as follows: black, segment with origin site; red, segment with terminus site; light and dark green, left and right arm segments on both sides; gray, a potential location of the arm segment of Streptococcus suis to distinguish it from the standard four-segment cGOF of the other species. (B) Multilocus sequence typing (MLST) trees of Streptococcus pneumoniae (left) and Streptococcus pyogenes (right). The four types of four-segment symmetric cGOF are indicated by the color of the solid circles at the center of each ring; isolates in the MLST trees are colored accordingly.
FIG 3 Asymmetric cGOF segmentation and permutation. (A) MLST tree for 29 Salmonella enterica isolates. (B) cGOF segment orders in S. enterica. The colors of the solid circles represent the leading isolate names, and solid circles in the center of the rings indicate the segment orders.
FIG 4 cGOF of E. coli. (A) The plot illustrates the average number of E. coli core genes (blue) and cGOF genes (red) for n = 2, 5, 10, 15, 20 … 50 genomes, based on a maximum of 500 random combinations of genomes for each n. (B) cGOF gene distribution in a virtual E. coli genome. cGOF genes are depicted by thin lines. The spaces between neighboring cGOF genes are scaled to the average distance between genes in 19 E. coli genomes. The outer and inner layers represent positive and negative strands, respectively. The downward arrow points to the replication origin.
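The resampling behind a curve like panel A can be sketched as follows: for each n, take combinations of n genomes (all of them when there are few, otherwise a capped random sample), intersect their gene sets, and average the sizes. The gene sets, the function name, and the fixed random seed are hypothetical. Only the cap of 500 combinations comes from the caption.

```python
import itertools
import math
import random
from statistics import mean
from typing import Dict, Optional, Set


def mean_core_size(gene_sets: Dict[str, Set[str]], n: int,
                   max_combinations: int = 500,
                   rng: Optional[random.Random] = None) -> float:
    """Average number of genes shared by every genome in a combination of size n."""
    names = sorted(gene_sets)
    if not 1 <= n <= len(names):
        raise ValueError("n must be between 1 and the number of genomes")
    if math.comb(len(names), n) <= max_combinations:
        # Few enough combinations: enumerate them all.
        combos = list(itertools.combinations(names, n))
    else:
        # Otherwise draw random combinations (repeats across draws are possible).
        rng = rng or random.Random(0)
        combos = [tuple(rng.sample(names, n)) for _ in range(max_combinations)]
    sizes = [len(set.intersection(*(gene_sets[g] for g in combo)))
             for combo in combos]
    return mean(sizes)


# Hypothetical toy pangenome with four genomes.
toy = {
    "g1": {"a", "b", "c", "d"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "b", "d"},
    "g4": {"a", "c", "d"},
}
for n in (2, 3, 4):
    print(n, mean_core_size(toy, n))
```

On the toy data the average shared-gene count drops from 2.5 (n = 2) to 1.75 (n = 3) to 1 (n = 4), the same downward trend the core-gene curve is expected to show as genomes are added.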
FIG 5 Function of cGOF genes in E. coli DH10B. (A) Partition of cGOF, non-cGOF, and dispensable genes in DH10B, where cGOF and non-cGOF genes are both core genes. (B) Distribution of cGOF and non-cGOF genes in COG categories. Red asterisks indicate categories where cGOF genes are significantly enriched, and green asterisks indicate those of non-cGOF genes (P < 0.05). (C) Genes and their actions with other genes in the gene-gene interaction network. Red, green, and blue solid circles denote cGOF, non-cGOF core, and dispensable genes, respectively. The radius of each solid circle is scaled to the number of the corresponding gene actions.