
Shigella strains are not clones of Escherichia coli but sister species in the genus Escherichia.

Abstract

Shigella species and Escherichia coli are closely related organisms. Early phenotyping experiments and several recent molecular studies put Shigella within the species E. coli. However, the whole-genome-based, alignment-free and parameter-free CVTree approach shows convincingly that four established Shigella species, Shigella boydii, Shigella sonnei, Shigella flexneri and Shigella dysenteriae, are distinct from E. coli strains, and form sister species to E. coli within the genus Escherichia. In view of the overall success and high resolution power of the CVTree approach, this result should be taken seriously. We hope that the present report may promote further in-depth study of the Shigella-E. coli relationship.

Year: 2012 PMID: 23395177 PMCID: PMC4357666 DOI: 10.1016/j.gpb.2012.11.002

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Although descriptions of bacillary dysentery can be traced back to ancient records, the aetiologic agent was recognized only in the late 19th century. In 1898 Shiga gave a detailed description of what was called Bacillus dysenteriae, which was later assigned to a new genus, Shigella. Four Shigella species, Shigella dysenteriae, Shigella boydii, Shigella sonnei and Shigella flexneri, have been identified and listed in several editions of the Bergey’s Manual, including the latest one [1]. However, it has been known since the 1970s that DNA–DNA reassociation studies and a few other phenotyping experiments could not distinguish these species from Escherichia coli strains (see, e.g., [2,3]). Therefore, these Shigella organisms and E. coli were considered “one species genetically” [4]. Recent molecular studies further validated the closeness of the Shigella species and E. coli. Pupo et al. referred to all Shigella strains as “forms of E. coli” by using multilocus enzyme electrophoresis (MLEE) and a housekeeping gene sequence study [5]. Later on these authors simply called the Shigella species “clones of E. coli” [6], suggesting that the Shigella species may have originated from different ancestral strains of E. coli and have undergone convergent evolution to their present status. Ogura et al. [7] further constructed a neighbor-joining tree by using concatenated nucleotide sequences of 345 orthologous CDS groups from 25 sequenced strains (19 E. coli and 6 Shigella). The Shigella strains again were assigned as E. coli strains [7]. As sequences of more and more complete genomes become available, the use of housekeeping genes has been extended to the “core genome”. For example, 2034 genes from the “core genome” were selected to construct phylogenetic relationships (22 E. coli and 7 Shigella in [8], see their Figure 3; or 49 E. coli and 7 Shigella in [9], see their Figure 1). In all the aforementioned studies, the Shigella species were mixed up with the E. coli strains. 
Investigation using 16S rRNA segments and in silico multilocus sequence typing (MLST) based on a small number of housekeeping genes [10] led to more scattered results. Even a recent “alignment-free” study using so-called feature frequency profiles [11] placed the Shigella species into the E. coli strains. There has been a consensus that the Shigella species are indeed E. coli strains and the nomenclature of the genus Shigella and species included within this genus has been kept for historical and medical reasons. No wonder that the Shigella strains were called E. coli “in disguise” [12] or “Machiavellian masqueraders” [13].

Figure 1

Numerals in parentheses indicate the number of genomes in a branch.

On the other hand, it is curious that, despite the genetic closeness of the Shigella species and E. coli strains, certain distinctive “morphological” features do show up. Besides the diagnosable clinical difference of the dysentery they cause, there are some other observable dissimilarities. For example, E. coli strains usually have flagella and are motile, but Shigella species do not, though their flagella genes may be expressed under some rare, not yet fully understood circumstances [14]. As any phylogenetic conclusion drawn from the analysis of a selected set of sequence segments or genes cannot be unambiguously convincing, there is an urgent need for methods that are not based on any special choice of sequences or genes and that do not require any adjustment of parameters. A few years ago we developed such a whole-genome-based, alignment-free, and parameter-free method [15,16], called CVTree in accordance with the name of the public domain web server CVTree [17,18]. The CVTree results clearly show that the four Shigella species as well as all the E. coli strains are well-defined monophyletic clusters of their own; the Shigella species are not clones of E. coli, but members of the genus Escherichia on the same footing as the E. coli species. The only possible change in nomenclature concerns merging the two genera, Shigella and Escherichia, into one genus, but not absorbing the Shigella strains into the E. coli species. Though challenging to the current consensus described above, in view of the overall success of the CVTree approach and its high resolution power (see, e.g., [19,20]), this conclusion cannot be simply ignored or negated.

Results and discussion

We shall not reproduce the 2070-genome CVTrees in this report. An interested reader may generate the result by going to the CVTree web server and ticking the appropriate names in the list of built-in genomes. We base our discussion on collapsed subtrees cut from the 2070-genome CVTrees. Figure 1 shows the Escherichia-Shigella branch in CVTrees at different Ks in the “collapsed-tree” notation. At K = 3 (not shown), there was a monophyletic Shigella{9} branch, but one of the E. coli genomes (the “engineered” Waksman strain derivative KO11FL) escaped from the Escherichia cluster, violating the monophyleticity of the latter. The situation improves for K > 3. Figure 1 and Figure 2 provide examples of convergence of the branching scheme with increasing K. K = 4 is better than K = 3, and K = 5 and 6 are the best, while K = 7 may be slightly worse (see our previous publications [19,20]). An important and consistent finding is that all the Shigella species as well as all the E. coli strains form monophyletic clusters of their own. The Shigella species are never included in the E. coli branch. Shigella species are sister species to E. coli but not strains within the E. coli monophyletic branch. We note that the position of the newly sequenced genome of E. blattae in Figure 1 requires further study, but this does not affect the E. coli-Shigella relationship, which is the main concern of this work.

Figure 2

These six monophyletic clusters agree well with the phylogroups commonly used to characterize the E. coli population and are therefore labeled accordingly as A, B1a, B1b, B2, D, and E. Numerals in parentheses give the number of genomes in each group as indicated in the first column of Table S1.

The results of this whole-genome-based and alignment-free CVTree analysis convincingly reconcile the seeming contradiction between the genetic closeness and the “morphological” differences mentioned in the “Introduction” section. The grouping of the 54 E. coli strains within the monophyletic cluster (Figure 2) reflects the evolution and taxonomy of the strains in much the same way as revealed in many previous studies using different methods (see, e.g., [7-11]). It is remarkable that the six monophyletic clusters within the E. coli{54} branch agree well with the phylogroups commonly used to characterize the E. coli population. This is why we use the phylogroup labels A, B1 (split into B1a and B1b), B2, D, and E to name the six groups in Table S1. Group A contains the commensal strains and their derivatives: the K-12 strains (MG1655, W3110, BW2952, DH1 and DH10B) and the B strains (BL21 and REL606) [21]. The Waksman strains (W [22] and its derivative KO11FL [23]) and the commensal strains IAI1 [24], SE11 [25], enterotoxigenic (ETEC) E24377A and enteroaggregative (EAEC) 55989 form group B1b [26]. The virulent enterohemorrhagic E. coli (EHEC) O157:H7 strains [27-30] and their O55:H7 precursors [8,31] form a monophyletic cluster E. The three non-O157 EHEC phylogroup B1 strains (O26, O103 and O111) [7] join the other B1 cluster, B1a. The many uropathogenic (UPEC) strains of phylogroup B2 form a large lowermost cluster B2. Note that though the separation of E. coli strains into clusters agrees basically with [7-11] and other studies, the Shigella strains always stay clearly outside the E. coli monophyletic branch. As the main aim of this report is to emphasize the fact that Shigella species are members of the genus Escherichia, not strains of E. coli, we postpone the detailed comparison of the inner structure of the subclusters within the E. coli{54} branch to a later publication.

We mention in passing that a similar story is told by the Yersinia pestis and Y. pseudotuberculosis strains in the CVTrees. Strains from these two species could not be distinguished by DNA-DNA hybridization. Therefore, a proposal was made to combine these two species into one. However, “… the change was rejected by the Judicial Commission because of possible danger to public health if there was confusion regarding Y. pestis, the plague bacillus” [32]. In the same 2070-genome CVTrees, we see Yersinia{19} K3K4K5K6K7, Y. pestis{12} K3K4K5K6K7, Y. pseudotuberculosis{4} K4K5K6K7 and Y. enterocolitica{3} K3K4K5K6K7. Consequently, the genus Yersinia and the three species therein are all well-defined and there is nothing for the taxonomic Judicial Commission to worry about.

It should be pointed out that we did not carry out any case study for a group of selected organisms. Instead, we generated CVTrees for all 2062 Archaea and Bacteria genomes, then cut out and scrutinized the branch of interest. The results demonstrated the high resolution power of CVTrees at the subspecies level and below. This resolution is beyond the reach of the 16S rRNA analysis. Concatenation of a large number of nucleotide or protein sequences, as done in [5-9], may lead to seemingly comparable resolution, but the somewhat subjective selection of sequences or genes brings about ambiguity and makes the conclusion less convincing.

With the progress of the new generations of sequencing techniques, the cost of sequencing a bacterial genome will soon drop below that of an average phenotyping experiment and the number of sequenced prokaryotic genomes keeps growing rapidly. Among the genomes released at the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/), there are more and more strains coming from the same species. For example, for the time being, complete genome sequences of ten or more strains are available for Chlamydia trachomatis, Corynebacterium diphtheriae, Helicobacter pylori, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae, Streptococcus pyogenes, Sulfolobus islandicus, Y. pestis, etc. Once the genomes have been sequenced, the interrelationship of the strains can be obtained at no additional cost by simply submitting the genomes to the CVTree web server. We encourage researchers to try out this convenient and effective tool.

Materials and methods

Since the CVTree approach has been described repeatedly in previous publications (see [15-20] and references therein), we only give a brief summary in order to introduce notations and concepts needed in what follows. CVTree is a whole-genome-based approach. It makes use of all the protein products encoded in a given genome. In this way it circumvents the problem of lateral gene transfer (LGT), as LGT and lineage-dependent gene loss are merely mechanisms of genome evolution. The user also avoids the tedious task of finding orthologous proteins, since all genomes are orthologous as they are descended from a common ancestor. The methodology of CVTree must be alignment-free due to the extreme diversity of bacterial genomes in their size and gene content. By using a sliding window of width K, a primary protein sequence made of L amino acids is replaced by (L − K + 1) peptides of length K. The number of K-peptides from all the protein products in a genome is counted and these counts are put in lexicographic order of all possible K-peptides over the 20 amino acid letters to form a raw composition vector (CV) of dimension 20^K. Then a random background caused by neutral mutations is subtracted from each raw count to highlight the role of natural selection by using a (K − 2)th order Markovian prediction formula. The subtraction procedure is crucial to the success of CVTree. A recalculated CV represents a species and a dissimilarity/distance measure is defined between each pair of CVs. Then a phylogenetic tree is constructed by using the standard neighbor-joining algorithm, which has been proved to be a robust quartet-based method [33]. Being alignment-free renders the method parameter-free, as sequence alignment involves many parameters embodied in the elements of scoring matrices and gap penalties. The peptide length K is not a parameter. Longer Ks emphasize species-specificity, while shorter Ks reflect common features between different species. We never adjust the value of K. 
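The counting and subtraction steps summarized above can be illustrated with a minimal sketch. The text does not spell out the prediction formula or the distance, so the code assumes the forms usually quoted for CVTree in [15,16]: the predicted frequency p0(a1…aK) = p(a1…aK−1) p(a2…aK) / p(a2…aK−1), the component (p − p0)/p0, and the distance (1 − C)/2 with C the cosine of the angle between two CVs. All function names are hypothetical and the vectors are stored sparsely rather than with all 20^K components.

```python
from collections import Counter
from math import sqrt


def k_peptide_counts(proteins, k):
    """Count overlapping K-peptides; a protein of length L contributes L - K + 1."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def frequencies(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def composition_vector(proteins, k):
    """Sparse CV with the (K-2)th-order Markov background subtracted (assumed formula)."""
    if k < 3:
        raise ValueError("this sketch expects K >= 3")
    p_k = frequencies(k_peptide_counts(proteins, k))
    p_k1 = frequencies(k_peptide_counts(proteins, k - 1))
    p_k2 = frequencies(k_peptide_counts(proteins, k - 2))

    # Group (K-1)-peptides by their first K-2 letters so that overlapping
    # pairs (u, v) with u[1:] == v[:-1] can be enumerated quickly.
    by_prefix = {}
    for v in p_k1:
        by_prefix.setdefault(v[:-1], []).append(v)

    cv = {}
    for u in p_k1:
        for v in by_prefix.get(u[1:], ()):
            word = u + v[-1]
            predicted = p_k1[u] * p_k1[v] / p_k2[u[1:]]
            observed = p_k.get(word, 0.0)  # unobserved words get component -1
            cv[word] = (observed - predicted) / predicted
    return cv


def cv_distance(a, b):
    """Distance (1 - C) / 2, where C is the cosine between two sparse CVs."""
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an all-zero composition vector")
    dot = sum(x * b[w] for w, x in a.items() if w in b)
    return (1 - dot / (norm_a * norm_b)) / 2
```

A matrix of such pairwise distances would then be passed to a neighbor-joining implementation; that step is omitted here.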
Five trees are calculated for K = 3–7 (there is no need to go beyond K = 7) and the improved agreement of the tree topology with taxonomy when K increases provides an additional angle to evaluate the quality of the resulting trees.

In order to facilitate the use of the new method by practicing biologists, a web server entitled CVTree was published in 2004 [17]. A significantly improved version was released in 2009 [18]. Just by entering the URL (http://tlife.fudan.edu.cn/cvtree/) into a browser, users can enjoy playing with CVTree. The built-in dataset is updated automatically at the beginning of each month from the NCBI FTP site. Users may also upload their own data to CVTree; these data will be kept only for 48 h after the last run of the job. The results may be displayed online or sent back to users by email. In the latter case, there is a directory named Collapsed-trees with many files in Newick (.nwk) or plain text format.

The notion of collapsed trees requires special explanation. Although statistical re-sampling methods such as bootstrap or jackknife have been designed to check the stability and self-consistency of the CVTree results [34], the CVTrees are verified by direct comparison with prokaryote systematics at all taxonomic ranks from domain down to genera and species. In doing so, the monophyleticity of a branch is taken as a guideline. When all genomes from one and the same taxon in the input dataset appear in the same branch and no other genomes fall in, one may collapse the branch to a single leaf named after the taxon. For example, Escherichia_coli{54} means that all 54 E. coli genomes appear in a monophyletic branch at a given K. In fact, we have the E. coli strains making a monophyletic branch at all K-values from 4 to 7, which is denoted as “Escherichia_coli{54} K4K5K6K7”. For the time being, such “convergence lists” have to be obtained by manual inspection of the corresponding files returned via email by the CVTree web server. 
Automatic generation of such lists at all taxonomic ranks will be implemented in the next release of the CVTree web server. Throughout this paper we use the abbreviation CVTree to denote the method, the CVTree web server, and the phylogenetic tree obtained by using the CVTree web server.

In the present study, we have used all the prokaryote genomes released at the NCBI FTP site as of 30 September 2012, excluding 14 tiny highly degenerated genomes of endosymbiont bacteria. The 54 E. coli genomes, listed in Table S1 in the Supplementary material, are divided into six groups, corresponding to the six monophyletic clusters within the monophyletic E. coli{54} branch in CVTrees for K = 4–7 (see Figure 2). We note that 49 and 53 E. coli genomes from GenBank were used in [9] and [10], respectively. There are minor differences in the lists as we used all the genomes released by NCBI with accession numbers starting with NC_ in order to have better comparability. The 9 Shigella genomes used in the present study are listed in Table S2. When constructing the phylogenetic trees, we used all 133 Archaea genomes and 1929 Bacteria genomes, including the 54 E. coli and 9 Shigella genomes. The 14 excluded endosymbiont genomes (Candidatus Carsonella, C. Hodgkinia, C. Sulcia, C. Tremblaya, and C. Zinderia) were left out because they would violate the trifurcation of the three main domains of life. Eight Eukarya genomes were included as outgroups. Altogether this led to a tree-building job with 133 + 1929 + 8 = 2070 genomes.
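The collapsing rule and the “convergence list” notation described above can be sketched as follows. This is only an illustration under stated assumptions, not the web server's implementation: trees are assumed to be rooted and binary and are written as nested Python tuples instead of Newick files, and the function names and the toy genome labels (ec1, sh1, out) are hypothetical.

```python
def leaf_set(node):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def is_monophyletic(tree, members):
    """True if some branch contains all the members and no other genome."""
    members = frozenset(members)

    def visit(node):
        if leaf_set(node) == members:
            return True
        if isinstance(node, str):
            return False
        return any(visit(child) for child in node)

    return visit(tree)


def convergence_label(trees_by_k, taxon, members):
    """Build a label such as 'Escherichia_coli{54} K4K5K6K7'."""
    ks = [k for k in sorted(trees_by_k) if is_monophyletic(trees_by_k[k], members)]
    label = f"{taxon}{{{len(members)}}}"
    if ks:
        label += " " + "".join(f"K{k}" for k in ks)
    return label


# Toy example: at K = 3 one E. coli genome escapes its cluster, at K = 4 it does not.
toy_trees = {
    3: ((("ec1", ("sh1", "sh2")), "ec2"), "out"),
    4: ((("ec1", "ec2"), ("sh1", "sh2")), "out"),
}
print(convergence_label(toy_trees, "Shigella", {"sh1", "sh2"}))
print(convergence_label(toy_trees, "Escherichia_coli", {"ec1", "ec2"}))
```

In this toy input the Shigella label lists both K values, while the E. coli label lists only K = 4, mirroring the situation reported for the real trees at K = 3.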

Authors’ contributions

GZ and BH posed the problem. ZX maintained the CVTree web server. GZ, ZX and BH collected data and performed the calculation. GZ and BH analyzed the results and wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.

28 in total

1. Shigella and Escherichia coli at the crossroads: machiavellian masqueraders or taxonomic treachery?

Authors:
Journal: J Med Microbiol Date: 2000-07 Impact factor: 2.472

2. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach.

Authors: Ji Qi; Bin Wang; Bai-lin Hao
Journal: J Mol Evol Date: 2004-01 Impact factor: 2.395

Review 3. Escherichia coli in disguise: molecular origins of Shigella.

Authors: Ruiting Lan; Peter R Reeves
Journal: Microbes Infect Date: 2002-09 Impact factor: 2.700

4. Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance.

Authors: Bailin Hao; Ji Qi
Journal: J Bioinform Comput Biol Date: 2004-03 Impact factor: 1.122

5. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7.

Authors: N T Perna; G Plunkett; V Burland; B Mau; J D Glasner; D J Rose; G F Mayhew; P S Evans; J Gregor; H A Kirkpatrick; G Pósfai; J Hackett; S Klink; A Boutin; Y Shao; L Miller; E J Grotbeck; N W Davis; A Lim; E T Dimalanta; K D Potamousis; J Apodaca; T S Anantharaman; J Lin; G Yen; D C Schwartz; R A Welch; F R Blattner
Journal: Nature Date: 2001-01-25 Impact factor: 49.962

6. Expression of flagella and motility by Shigella.

Authors: J A Girón
Journal: Mol Microbiol Date: 1995-10 Impact factor: 3.501

7. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12.

Authors: T Hayashi; K Makino; M Ohnishi; K Kurokawa; K Ishii; K Yokoyama; C G Han; E Ohtsubo; K Nakayama; T Murata; M Tanaka; T Tobe; T Iida; H Takami; T Honda; C Sasakawa; N Ogasawara; T Yasunaga; S Kuhara; T Shiba; M Hattori; H Shinagawa
Journal: DNA Res Date: 2001-02-28 Impact factor: 4.458

8. Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics.

Authors: G M Pupo; R Lan; P R Reeves
Journal: Proc Natl Acad Sci U S A Date: 2000-09-12 Impact factor: 11.205

9. Polynucleotide sequence divergence among strains of Escherichia coli and closely related organisms.

Authors: D J Brenner; G R Fanning; F J Skerman; S Falkow
Journal: J Bacteriol Date: 1972-03 Impact factor: 3.490

10. CVTree: a phylogenetic tree reconstruction tool based on whole genomes.

Authors: Ji Qi; Hong Luo; Bailin Hao
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971
