
Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra.

Mark A Ragan¹, Guillaume Bernard¹, Cheong Xin Chan¹.

Abstract

From 1971 to 1985, Carl Woese and colleagues generated oligonucleotide catalogs of 16S/18S rRNAs from more than 400 organisms. Using these incomplete and imperfect data, Carl and his colleagues developed unprecedented insights into the structure, function, and evolution of the large RNA components of the translational apparatus. They recognized a third domain of life, revealed the phylogenetic backbone of bacteria (and its limitations), delineated taxa, and explored the tempo and mode of microbial evolution. For these discoveries to have stood the test of time, oligonucleotide catalogs must carry significant phylogenetic signal; they thus bear re-examination in view of the current interest in alignment-free phylogenetics based on k-mers. Here we consider the aims, successes, and limitations of this early phase of molecular phylogenetics. We computationally generate oligonucleotide sets (e-catalogs) from 16S/18S rRNA sequences, calculate pairwise distances between them based on D2 statistics, compute distance trees, and compare their performance against alignment-based and k-mer trees. Although the catalogs themselves were superseded by full-length sequences, this stage in the development of computational molecular biology remains instructive for us today.

Keywords: 16S ribosomal RNAs; k-mers; molecular phylogenetics; oligomers

Year: 2014; PMID: 24572375; PMCID: PMC4008546; DOI: 10.4161/rna.27505

Source DB: PubMed; Journal: RNA Biol; ISSN: 1547-6286; Impact factor: 4.652

Introduction

"A basic goal of biology is to account for the evolution of the cell. Emergence of the translation apparatus is the single most important event in this evolution, for capacity to translate is what defines genotype and phenotype." From our vantage point, informed as we are by petabytes of sequence and structural data from all manner of organisms, it is easy to forget how little was known of the molecular basis of life when Carl Woese began his career in research. Carl was awarded his PhD in 1953, the year Watson and Crick published the double-helical structure of DNA. At about the same time Keller et al. localized protein synthesis to a “microsomal” fraction of the cell, within which RNA-rich particles, later termed ribosomes, were soon discovered. In 1959–1961 two large RNA molecules were revealed as components of the ribosome, one sedimenting at 16S and the other at 23S. In early publications, Carl described how DNA, RNA, and microsomal fractions behaved during the germination of bacterial spores. In late 1960 he initiated research on the genetic code, and over the next few years made fundamental contributions to our understanding of its origin, universality, and specificity. He was among the first to consider translation in an explicitly evolutionary perspective and emphasized the role of RNA, for example in refocusing the basis of genetic code specificity away from steric interactions among amino acids: “in an important sense, the codon ‘chooses’ its amino acid, not the reverse.” Through these early years, the structure of RNAs remained unclear; indeed, not until the early 1960s was it established that RNAs were linear polymers, i.e., that they can be referred to as having a sequence. In the early 1950s, Fred Sanger and collaborators had developed a stepwise experimental strategy to reveal the structure of insulin as a sequence of amino acids. 
Each chain was enzymatically cleaved into oligopeptides; these were separated and laboriously characterized, and from the fragmentary sequences large portions of the original protein sequence were reassembled. By about 1960 it was becoming clear that protein sequences were non-random and contained regions with different degrees of conservation. A similar strategy was soon applied to elucidate the structure of some viral RNAs. RNAs were digested with pancreatic ribonuclease and the products separated using chromatography and electrophoresis, yielding mono-, di-, and tri-nucleotides consistent with an unbranched linear sequence of ribonucleotides linked by phosphodiester bonds. Ten of these products, with lengths from one to four nucleotides, could be readily identified based on their electrophoretic mobility alone, while others were identified via a combination of strategies. Differences in the relative abundance of dinucleotides were interpreted as demonstrating differences in “the sequential arrangement of nucleotides” that were imagined to underlie biological differences among viruses. The Sanger protocol was later modified to introduce an initial digestion with ribonuclease T1, thereby generating only a single product from G. With the separation technology then available, about 40 oligomers in the length range from one to five nucleotides could be resolved; this was not sufficient to distinguish Escherichia coli 16S from 23S rRNA, although in due course methodological improvements provided access to higher oligomers. Sequences of tRNA and 5S rRNA yielded to other protocols that generated overlapping fragments; but in 1964, when Carl took up an appointment to the faculty of the University of Illinois, no RNA molecule had been fully sequenced.

The evolutionary approach to conserved structure and function

Ideas of interrelationships among structure, function, and evolution run deep in the history of biology. Karl von Baer, the French transcendental morphologists, and others variously glimpsed parts of this nexus, albeit from pre-Darwinian perspectives and with a mechanical interpretation of function. The appearance of protein sequences in the early 1960s brought renewed interest in relationships among ancestry, conserved and variable regions of sequence and structure, and molecular function. In a 1969 letter to Francis Crick, Carl referred to this history embedded in molecules as the cell’s “internal fossil record.” Carl understood that a comparative approach would likewise reveal which regions of RNAs were conserved, hence, functionally important. Already in 1961 he had compared nucleotide compositions of 16S and 23S rRNA fractions in different bacteria, and in the 1969 letter to Crick he wrote of his “important and nearly irreversible decision” to “determine primary structures for a number of genes in a very diverse group of organisms, on the hope that by deducing rather ancient ancestor sequences for these genes, one will eventually be in the position of being able to see features of the cell’s evolution. The obvious choice of molecules here lies in the components of the translation apparatus. What more ancient lineages are there?” Carl directed some effort to 5S rRNA and 23S rRNA but his main focus was on 16S rRNA. Beginning in 1971, Carl and his coworkers at Illinois, and in due course collaborators in Halifax and Munich, generated oligonucleotide catalogs for 16S rRNAs from about 400 organisms. Their comparative approach quickly bore fruit, with the observation that the sets of oligonucleotides from Escherichia coli and Bacillus megaterium 16S rRNAs were much more similar than expected by chance: "It is important to explain the existence of sequence homology between these two 16S rRNA species. 
If it reflects the fact that certain portions of their common ancestral primary structure are locked into the present sequences due to stringent constraints imposed by structural and/or functional considerations, then the conservation becomes highly significant. However, were the frequency of occurrence of mutations in rRNA cistrons to be sufficiently low for some reason, then the bulk of the observed conservation could merely reflect the fact that mutations had not occurred in those regions in either organism, and conservation would be of trivial significance." The second, alternative hypothesis is amenable to experiment, and their comparison with a third 16S rRNA, represented by a partial catalog from Alcaligenes faecalis, and with the “unrelated” 14S rRNA of Rhodopseudomonas spheroides and the 18S rRNA of yeast, may have constituted the first validation in computational molecular biology. Although this was not a proof, additional sequences could be brought into the comparison until the argument for homology became undeniable. Pechman and Woese also concluded that “(i)n a molecule as large as the 16S rRNA, all residues are clearly not equivalent in their importance to molecular function.” Some residues are neutral and would be replaced quickly on an evolutionary timescale, whereas others are functionally constrained such that their replacement would have to be compensated by a “more or less simultaneous” change of other residues. The more deeply such a “replacement unit” is entangled into molecular function, the longer its mutational “half-life,” and the more informative it might be on basal features in the tree. 16S/18S rRNA was a “compound, non-linear chronometer” whose broad-range applicability arises not from its size per se, but rather because each of its more-or-less independent structural domains embeds covariance sets that inform on different scales of evolutionary time, much as the hands of a clock separately indicate hours, minutes, and seconds. 
For Carl, the “ultimate goal in comparative studies of rRNA sequence is to construct a chronometric model of the molecule that permits its potential as an evolutionary measuring device to be fully exploited.” He formalized this deeply structural (i.e., not purely statistical or cladistic) concept as covariance sets of nucleotides. In due course, sets of co-varying positions would be mapped onto folded structure; but in the meantime, the path to covariance sets lay through oligonucleotide catalogs and signature analysis.

Oligonucleotide catalogs

In this context, a catalog is the list of oligomers identified following enzymatic digestion of an RNA or protein. Complete digestion of an RNA with T1 ribonuclease yields non-overlapping oligonucleotides that end in G. Although at first only short oligonucleotides could be resolved and identified, by the mid-1970s the upper limit on length had been pushed well into the teens, and in one case to 24. Incompletely characterized oligomers, those with modified bases, and termini, were often included in these catalogs; short oligomers (for 16S rRNAs, 5-mers and below) contributed no additional resolving power, and were often not reported. An RNA dinucleotide catalog was presented by Reddi and catalogs with larger oligonucleotides were published by Rushizky and Knight, Sanger, and others. More than 30 16S/18S rRNAs had been oligo-cataloged by 1975, more than 170 by 1980, and more than 400 by 1985. Most of these data were transferred to punch cards and organized as a database with search, comparison, and tree-inference tools.

Comparing catalogs and computing trees

Sydney Fox and Paul Homeyer compared partial amino acid compositions in seed globulins of six plants, and in 24 protein types mostly from animals (Table I of ref. 44). The idea of combinatorially based diversity can be discerned in their publication, but Fox and Homeyer did not discuss sequences per se. Importantly, however, they interpreted these composition data as showing that “protein synthesis has not, in the main, yet become sufficiently diverse through molecular evolution to yield substantially unrelated proteins.” In modern terminology, protein structure as reflected in 1-mers did not seem to be evolving so fast that historical signal would be lost. This had not been shown before, and set the scene for the subsequent development of molecular phylogenetics. As we mention above in the context of primary-structural determination, peptides from protease digests could be separated in two dimensions by paper chromatography and electrophoresis, and compared by eye for similarities and differences. František Šorm and colleagues were arguably the first to use patterns and frequencies of di-, tri-, and tetra-peptides not only to explore regularities in protein structure, but also to compare “proteins which have the same function but differ in their origin (different animal species), and proteins of similar function and a common origin.” As summarized by Williams et al., Šorm thought that his work demonstrated that “even where complete sequences are not known, the number of peptides common to two proteins can be used to show similarity of their primary structures.” New techniques were needed to compare sets of oligomers. Two sequences might be compared by eye (e.g., ref. 50), but this is neither scalable nor statistically rigorous. 
Citing a standard statistical text, Carl selected for this purpose the binary association coefficient (S): twice the sum of nucleotides in oligonucleotides common to a pair of catalogs, divided by the total number of nucleotides in the two catalogs. Short oligomers were omitted, and no background correction was made (see ref. 53). Carl was nonetheless distrustful about comparing catalogs in this (or any other automated) way: the oligonucleotide data were biased (ribonuclease T1 does not cleave randomly, and electrophoresis at low pH separates some oligonucleotides more cleanly than it does others), and families of similar, probably homologous, oligonucleotides were mostly ignored. But more fundamentally for Carl, S values could not capture molecular structure. Later, when full sequences had become available, Carl plotted pairwise S values between catalogs against percent similarity of aligned 16S rRNA sequences, revealing an imprecise relationship for S less than about 0.40, i.e., most of them. Carl criticized his earlier catalog approach as (1) not having resolved branching orders among major bacterial divisions and subdivisions, and (2) failing to resolve branching order of rapidly evolving lineages such as the planctomycetes. Catalogs and pairwise S values could not offer the resolving power that was available from the rRNA chronometer as read via sequences; nor should we “consider the second hand when timing the seasons.” Given a matrix of pairwise S values, a dendrogram could be computed by average linkage clustering. The first rRNA oligonucleotide trees appeared in 1976 and 1977. Fox et al. asserted that although this approach is phyletic, “it is clear from the molecular nature of the data” that the topology “would closely resemble, if not be identical to, that of a phylogenetic tree based upon such ancestral catalogs.” These trees might be a guide to relatedness and relative antiquity (e.g., ref. 40), but Carl did not delineate taxa solely on the basis of trees.
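As a rough illustration of the association coefficient described above, the following Python sketch computes S for a pair of catalogs. The function name, the `min_len` parameter, and the toy oligonucleotides are illustrative assumptions; treating each catalog as a set (i.e., ignoring copy number) is likewise an assumption rather than a documented detail of the original calculation.

```python
def association_coefficient(catalog_a, catalog_b, min_len=6):
    """Binary association coefficient S between two oligonucleotide catalogs.

    S = 2 * (nucleotides in oligomers common to both catalogs)
          / (nucleotides in catalog A + nucleotides in catalog B)

    Oligomers shorter than min_len are omitted; no background correction.
    Assumption: each catalog is treated as a set of distinct oligomers.
    """
    a = {o for o in catalog_a if len(o) >= min_len}
    b = {o for o in catalog_b if len(o) >= min_len}
    shared = sum(len(o) for o in a & b)
    total = sum(len(o) for o in a) + sum(len(o) for o in b)
    return 2 * shared / total if total else 0.0


# Toy catalogs (invented for illustration, not empirical data).
cat_a = ["AAUCCG", "CCUAAG", "ACUUUAG"]
cat_b = ["AAUCCG", "UUUAACG", "ACG"]   # "ACG" is dropped (length < 6)
print(association_coefficient(cat_a, cat_b))   # 2*6 / (19 + 13) = 0.375
```

Identical catalogs give S = 1 and catalogs with no shared oligomer give S = 0, so S behaves as a similarity rather than a distance.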

Signatures

More important than trees was the “internal fossil record” revealed through signatures. Carl defined a signature as a “set of oligonucleotides that is characteristic of (unique to) a group of organisms,” but immediately relaxed this to allow oligonucleotides to “occur in half or more of the members of the group, but are either not found in other organisms or occur only sporadically therein.” Slightly different formulations were offered later. Modulo this relaxation, signatures were synapomorphies (Carl Woese, personal communication to MAR, 30 August 1988). Carl immersed himself in the details. As related by George Fox, during the heyday of the oligonucleotide work “Carl had established routines that allowed him to be with the fingerprints 8 hours a day, 5 days a week. He went to great lengths to avoid interruptions and non-research related activities.” Carl’s knowledge of patterns of oligonucleotide occurrence and co-variation, and his ability to map details immediately onto folded structure, convinced one of us (MAR) that he had an exquisitely detailed mental map of 16S rRNA structure and evolution, as Emanuel Margoliash surely had for cytochrome c. In any case, to Carl a signature was a deeply structural and chronometric construct, not to be entrusted to generic (or even purpose-built) software. Carl’s group managed and compared signatures, and computed trees, with the aid of mainframe computing. Tom Macke wrote a program “sig” that could map the distribution of oligonucleotides, including degenerate ones, across a set of catalogs. Similar programs are mentioned by Sobieski et al. In those years, hardware and operating systems were far less standardized than today, and it was not straightforward to exchange programs, much less to offer remote access. All these factors conspired to make signature analysis à la Woese somewhat opaque to outsiders, including the numerical taxonomy and cladistics communities. Zablen et al. clearly articulate the value of shared derived characters; Fox et al. describe an approach seemingly inspired by parsimony; and Carl mentions parsimony analyses elsewhere in passing, e.g., reference 35. Once 16S rRNA structures became available, Carl mapped these signatures onto folded structure. Taxa could at last be recognized by three criteria (page 236 of ref. 35): coherence by S, shared sequence signature, and higher-order molecular structure.

Oligonucleotides and k-mers

Sequences or regions thereof can be arranged relative to each other to reveal similarities and differences; the term “alignment” was introduced for such operations in 1960, although the concept has deeper roots in genetics, computer science, and other fields. Peptides and proteins were aligned first, then tRNAs in 1966 and 5S rRNA in 1971. These early alignments were based on visual inspection, but as the comparison problem began to be described more precisely for analysis using electronic computers, three not unrelated classes of approaches emerged. In today’s terminology these are the sliding-window, dot-matrix, and k-mer spectrum approaches. Dot-matrix methods were prefigured by Walter Fitch and others prior to their formal description by Gibbs and MacIntyre. Adrian Gibbs and colleagues considered the dot-matrix to subsume the sliding-window approach, and to be “similar in principle” to a method explained by Saul Needleman to Fitch in 1965 and later introduced as the first algorithm for full-length sequence alignment. Applied to molecular sequences, all these approaches find regions of local identity (or similarity). Like oligonucleotides matched between two catalogs, these local regions are not of predefined length; rather, their frequency spectrum (number at each increment of length) is determined by the degree and pattern of pairwise sequence similarity, and by data quality. Alternatively, sequence analysis can be approached using a fixed word length. In the BLAST algorithm for example, the query sequence is hashed into regions of predetermined length. Similar operations are encountered in diverse areas of mathematics, computer science, and information theory e.g., for sequence compression, indexing, or retrieval. Reflecting these diverse origins and applications, short perfectly matched strings of predetermined length are variously termed k-mers, words, or n-grams. 
A common thread is that these strings provide a fast approach to detecting a signal of similarity. K-mers find utility in many areas of genomics including genome size estimation, assembly, clustering, and studies on sequence periodicity and lateral genetic transfer. In molecular phylogenetics, k-mers have long been used to capture phylogenetic signal. Gibbs et al. used dipeptide frequencies (k = 2) to compute phylogenetic trees based on sequences of cytochromes c, hemoglobins, and other proteins. Blaisdell did likewise for a broader set of proteins, with k = 2 and k = 3. More recently, tree inference has used values of k in the range 3–5 for proteins, and longer k has been proposed for nucleotides. Below we look back on Carl’s oligonucleotide catalogs as a source of data for phylogenetic inference. With the benefit of complete 16S/18S rRNA sequences, we ask about the accuracy and coverage of T1 oligonucleotide catalogs, and compare Carl’s clustering diagrams with trees based on multiple alignment of complete sequences and inference methods. Because most original T1 catalogs are no longer accessible in an electronic format, we computationally reconstruct e-catalogs from full-length rRNA sequences of the 13 organisms examined by Woese and Fox, compare them with selected empirical catalogs, calculate D2 statistics, and compute a neighbor-joining (NJ) tree. We then do the same for a more complete set of bacteria. Thereafter, we extract k-mers (at different values of k) from the full-length sequences, again calculate D2 statistics, and compute NJ trees. This allows us to explore similarities and differences between oligonucleotide catalogs and modern k-mer spectra in phylogenetics.

Results

Trees from aligned full-length sequences

As a reference topology, we inferred a tree based on full-length 16S/18S rRNA sequences of the 13 organisms in Woese and Fox or very close relatives. Multiple sequence alignment (i.e., not leveraging the folded structure of rRNA) followed by fast maximum-likelihood (Fig. 1A) or Bayesian inference (Fig. 1B) yielded trees differing from each other in two respects: the position of the cyanobacterial/chloroplast subtree within the bacteria, and branching order within eukaryotes. These disagreements correspond to very short internal edges and poor bootstrap support in the likelihood tree (Fig. 1A). We followed the same approach to infer trees from aligned full-length 16S rRNA sequences from eight proteobacteria (Fig. 4 of ref. 40), with a Synechocystis rRNA as outgroup (Fig. 2A and B).

Figure 1. Trees for 16S/18S rRNAs in the three-kingdom data set inferred via multiple sequence alignment of full-length rRNAs using MUSCLE and (A) RAxML or (B) MRBAYES; (C) computed via neighbor-joining from the similarity matrix in reference 74; (D) calculated via D2S and neighbor-joining from our e-catalogs; and calculated via D2S and neighbor-joining from k-mer spectra at (E) k = 6, (F) k = 8, (G) k = 12, or (H) k = 16. To facilitate comparison, all trees were rooted similarly (arbitrarily on archaea), except for (C) in which trees were rooted independently on archaea (left), bacteria (middle), and eukaryotes (right).

Figure 2. Trees for 16S rRNA in the proteobacterial data set inferred via multiple sequence alignment of full-length rRNAs using MUSCLE and (A) RAxML or (B) MRBAYES; (C) calculated via D2S and neighbor-joining from our e-catalogs; and calculated via D2S and neighbor-joining from k-mer spectra at (D) k = 6, (E) k = 8, (F) k = 12, or (G) k = 16. To facilitate comparison, all trees were rooted similarly on the 16S rRNA of the cyanobacterium Synechocystis.

Trees from published S matrices

Woese and Fox present a matrix of pairwise association coefficients (S) between oligonucleotide catalogs (length ≥ 6), but do not depict the tree these data imply. We converted these S values to distances, and computed the NJ tree (Fig. 1C). Rooted at any point outside the three clusters of sequences, this tree clearly reveals three main lines of descent. Woese and Fox do not treat branching structure within each kingdom, but the topology we reconstruct within the bacterial lineage is congruent with the cluster diagram published at about the same time by Balch et al. Later, with data from additional bacteria, Chlorobium assumed a more-basal position. However, Synechocystis and the Lemna chloroplast appear paraphyletic, as do Methanobrevibacter and Methanothermobacter among Archaea.
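The step from a similarity matrix to an NJ topology can be sketched as follows. This is a minimal, topology-only neighbor-joining in Python, not the software used in the study. The conversion d = 1 − S is a placeholder assumption, because this section does not state how S values were converted to distances; the function names and the four-taxon toy matrix are likewise hypothetical.

```python
from itertools import combinations


def s_to_distance(s):
    # ASSUMPTION: simple complement; the actual conversion is not given here.
    return 1.0 - s


def neighbor_joining(labels, dist):
    """Topology-only neighbor-joining.

    labels: list of taxon names.
    dist:   dict mapping frozenset({a, b}) -> distance.
    Returns a tuple of the last three nodes; joined nodes are nested tuples.
    """
    nodes, d = list(labels), dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (a, b)
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (d[frozenset((a, c))]
                                                + d[frozenset((b, c))]
                                                - d[frozenset((a, b))])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Toy similarity matrix (invented): A/B are similar, C/D are similar.
S = {("A", "B"): 0.9, ("C", "D"): 0.7, ("A", "C"): 0.4,
     ("A", "D"): 0.4, ("B", "C"): 0.4, ("B", "D"): 0.4}
D = {frozenset(pair): s_to_distance(s) for pair, s in S.items()}
print(neighbor_joining(["A", "B", "C", "D"], D))
```

For four taxa the two cherries (A,B) and (C,D) tie on the Q criterion, so either may be joined first; both outcomes describe the same unrooted topology. Branch lengths are deliberately omitted to keep the sketch short.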

Computational generation of e-catalogs

We had hoped to generate trees from the original oligonucleotide catalog data underlying Woese and Fox but were able to access only six of the 13 catalogs, and part of a seventh (George Fox has more recently recovered others for us). So instead, starting with full-length 16S/18S rRNA sequences from the same or very closely related organisms (Table 1), we computationally generated sets of oligonucleotides, mimicking digestion with ribonuclease T1. Fragments at the 5′ and 3′ termini were included, and oligomers of length < 6 were removed. We refer to these sets as e-catalogs.
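The in silico digestion described above can be sketched in Python as follows. The function names and the short test sequence are hypothetical, and the sketch assumes an idealized T1 digest: cleavage after every G, with no partial digestion and no handling of modified bases. Terminal fragments are kept and then subjected to the same length filter as all others, which is our reading of the description rather than a confirmed detail.

```python
import re
from collections import Counter


def t1_digest(seq):
    """Idealised RNase T1 digest: cleave on the 3' side of every G."""
    rna = seq.upper().replace("T", "U")   # accept DNA-style GenBank records
    # Each fragment ends in G; a trailing G-free 3' terminal fragment is kept.
    return re.findall(r"[^G]*G|[^G]+$", rna)


def e_catalog(seq, min_len=6):
    """Count fragments of length >= min_len (copy number retained)."""
    return Counter(f for f in t1_digest(seq) if len(f) >= min_len)


# Invented toy sequence, not a real rRNA.
toy = "AATCCGACGTTTAACGCCTAAGATCACCT"
print(t1_digest(toy))          # ['AAUCCG', 'ACG', 'UUUAACG', 'CCUAAG', 'AUCACCU']
print(sorted(e_catalog(toy)))  # ['AAUCCG', 'AUCACCU', 'CCUAAG', 'UUUAACG']
```

Returning a `Counter` keeps copy number available; callers that want a catalog of unique oligonucleotides can simply take its keys.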

Table 1. All 16S/18S rRNA sequences used in this study, their GenBank accession numbers, and their inclusion in our re-analysis of rRNAs from (A) three kingdoms (ref. 74) and (B) proteobacteria (Fig. 4 of ref. 40). For proteobacteria in our analysis B, we identify class (α-, β-, γ-, or δ-proteobacteria).

Source organism	GenBank accession	Analysis
Mus musculus	X00686.1	A
Saccharomyces cerevisiae	V01335.1	A
Spathiphyllum wallisii	AF207023.1	A
Methanobacterium ruminantium	NR_074117.1	A
Methanoculleus marisnigri	NR_074174.1	A
Methanosarcina barkeri	NR_074253.1	A
Methanothermobacter thermautotrophicus	NR_074260.1	A
Bacillus firmus	JQ282815	A
Chlorobium vibrioforme	M62791	A
Corynebacterium diphtheriae	NR_103937.1	A
Lemna minor chloroplast	NC_010109.1*	A
Synechocystis sp.	NR_074311.1	A, B
Rhodobacter sphaeroides (α)	NR_029215.1	B
Rhodospirillum rubrum (α)	NR_074249.1	B
Rhizobium leguminosarum (α)	D14513.1	B
Alcaligenes faecalis (β)	AF155147.1	B
Desulfovibrio desulfuricans (δ)	NR_036778.1	B
Escherichia coli (γ)	NR_102804.1	A, B
Yersinia pestis (γ)	NR_074199.1	B
Pseudomonas aeruginosa (γ)	NR_074828.1	B

*, positions 106162–107648.

Comparison of empirical and e-catalogs

To determine the extent to which our e-catalogs recapitulate Carl’s empirical T1 catalogs (and can thus stand in for the latter in tree inference), we compared e-catalogs and original T1 catalogs for Escherichia coli and Methanobacterium ruminantium M-1 (later renamed Methanobrevibacter ruminantium M1). For the purpose of this comparison, we ignored base modifications (e.g., treated A* as identical to A) and copy number, and resolved ambiguities in the empirical data in favor of a match. Table 2 demonstrates that our e-catalogs recapitulate the empirical oligonucleotides very well, although not perfectly. Mismatches likely arise due to weak, diffuse (e.g., Figure 1 of ref. 52), or incompletely resolved spots on paper electrophoresis (e.g., Figure 1 of ref. 16), although sequencing errors, covalent modifications, and/or strain differences cannot be ruled out. It is clear from Table 3 that the landmark recognition of three kingdoms, and molecular-systematic studies on numerous groups of bacteria and archaea, were based on data representing fewer than 40% of the positions in the 16S rRNA. This is less worrisome than might be thought; many of these oligonucleotides map to one side of a helical region, such that much of the “missing” information is in fact represented as the reverse complement (see Figure 2 of ref. 17).
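The comparison just described, and the coverage figures in Table 3, can be sketched in Python as follows. The function names are hypothetical, and the notation assumed for modified bases (any non-ACGU character attached to a base, as in A*) is an illustration only. The sketch ignores copy number when matching and does not attempt to resolve ambiguous empirical entries, which the authors handled in favor of a match.

```python
import re


def normalize(oligo):
    """Drop modification marks (e.g. 'A*' -> 'A'); assumed notation."""
    return re.sub(r"[^ACGU]", "", oligo.upper())


def count_matches(empirical, e_cat):
    """Number of distinct empirical oligomers also present in the e-catalog."""
    return len({normalize(o) for o in empirical} & {normalize(o) for o in e_cat})


def coverage_percent(nucleotides_covered, sequence_length):
    """Share of the full-length sequence represented by catalog oligomers."""
    return 100.0 * nucleotides_covered / sequence_length


# Invented toy catalogs; the modified base A* is treated as A.
print(count_matches(["AA*UCCG", "CCUAAG"], ["AAUCCG", "UUUAACG"]))

# Arithmetic check against Table 3 (E. coli, empirical catalog).
print(round(coverage_percent(584, 1542), 1))   # 37.9
```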

Table 2. Numbers of unique oligonucleotides in empirical 16S rRNA catalogs, and of k-mers in e-catalogs. Escherichia coli empirical catalog from Uchida et al. as corrected by Magrum et al., and Methanobacterium ruminantium M-1 (later renamed Methanobrevibacter ruminantium M1) empirical catalog from Fox et al. For the calculation of matching, modifications of bases are ignored and ambiguities are resolved favorably.

Oligomer length or k	E. coli empirical	E. coli e-catalog	E. coli match	M. ruminantium empirical	M. ruminantium e-catalog	M. ruminantium match
6^a	21^b	21	21	22	22	20
7	17	16	16	15	16	13
8	10	11	10	14	15	13
9	13	12	12	10	9	8
≥10	11	13	10	11	12	10
Total	72	73	69	72	74	64

a Includes the 5′ terminus. b Uchida et al. report one 6-mer sequence twice, once as unmodified and once as modified; for the purposes of this table we count them once.

Table 3. Nucleotide coverage of full-length 16S rRNA sequence by oligonucleotides in empirical catalogs, and k-mers in e-catalogs, of Escherichia coli and Methanobacterium ruminantium M-1 (Methanobrevibacter ruminantium M1). For catalogs, see Table S1. Multiple (non-unique) instances are counted (note that Fox et al. do not report multiple occurrences, which in any case were rare for oligonucleotides ≥ 6). Full-length sequences are NR_102904.1 and NR_074117.1, respectively.

16S rRNA source	Nucleotides covered/sequence length (empirical)	Coverage (%)	Nucleotides covered/sequence length (e-catalog)	Coverage (%)
E. coli	584/1542	37.9	590/1542	38.3
M. ruminantium	572/1436	39.8	601/1436	41.9

Trees from e-catalogs

From the e-catalogs we calculated pairwise distances via the D2S statistic (Materials and Methods), and computed an NJ tree (Fig. 1D). This tree shows the three-kingdom structure. Topology within the archaeal (methanogen) subtree agrees with that in Fox et al., and with our k-mer trees (Fig. 1E–H, for which see below); for simplicity we call this the 2M+2M topology within Archaea. Among bacteria, the Synechocystis-chloroplast and Bacillus-Corynebacterium pairs seen in the alignment-based trees are apparent here too, but Escherichia and Chlorobium rRNAs no longer form a monophyletic group, instead appearing as adjacent branches. Pairwise S values for these bacterial catalogs are in the range 0.19–0.34. Recalling that Carl called attention to the imprecise relationship between S and full-length sequence similarity especially at S < 0.40, we selected a different bacterial data set (from Fig. 4 of ref. 40) with pairwise S values in the range 0.31–0.78, and again calculated D2S values and distances. The topology of this tree (Fig. 2C) agrees with the alignment-based references (Fig. 2A and B) and differs from that implied by Fox et al. only in the relative branching positions of the most-basal branches; that is, again the differences correspond with the smallest S values, and short internal edges.
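For orientation, the figure captions name the statistic used for these trees as D2S. The authors' exact formulation is given in their Materials and Methods, which lies outside this section; the form below is the one commonly given in the alignment-free literature and should be read as an assumption about what was computed, not a transcription of it. Here X_w and Y_w are the counts of word w in two sequences of lengths n and m, k is the word length, and p_w is the probability of w under a background model (typically the product of single-letter frequencies).

```latex
\[
D_2^{S} \;=\; \sum_{w \in \mathcal{A}^{k}}
  \frac{\tilde{X}_w \,\tilde{Y}_w}{\sqrt{\tilde{X}_w^{2} + \tilde{Y}_w^{2}}},
\qquad
\tilde{X}_w = X_w - (n - k + 1)\,p_w,
\quad
\tilde{Y}_w = Y_w - (m - k + 1)\,p_w .
\]
```

Centering each count on its expectation and normalizing each term is what distinguishes this from the plain D2 statistic, which is simply the sum of X_w Y_w over all words. How the statistic was adapted to e-catalogs, whose oligonucleotides vary in length, is not specified in this section.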

K-mer trees from full-length sequences

We extracted k-mers from full-length sequences at selected values of k between 6 and 16, calculated pairwise D2S values and distances, and used these to compute NJ trees for the three-kingdom and bacterial data sets. The three-kingdom structure, and branching order within Archaea, do not depend on choice of k within this range; branching order within bacteria, and within eukaryotes, does (Fig. 1E–H). The expected cyanobacterium–chloroplast and Bacillus–Corynebacterium pairs are apparent at k = 6, 8, 12, and 16, while the other two bacterial sequences, Escherichia coli and Chlorobium vibrioforme, show no consistent position. This is perhaps unsurprising, as even today basal branching in the bacterial tree can scarcely be resolved. As above, we therefore examined a less-divergent bacterial data set (from Fig. 4 of ref. 40). At k = 6, 8, or 12 (Fig. 2D–F) our D2S-based NJ trees agree with the alignment-based reference (Fig. 2A and B). Even at k = 16 (Fig. 2G), much of the expected internal structure is preserved.
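Extraction of a k-mer spectrum can be sketched in Python as below. For brevity the sketch computes the plain D2 statistic (the sum over words of the product of their counts in the two sequences), which is a simpler relative of the D2S statistic named in the figure captions, not the statistic used for the trees. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Counts of all overlapping words of length k in a sequence."""
    rna = seq.upper().replace("T", "U")
    return Counter(rna[i:i + k] for i in range(len(rna) - k + 1))


def d2(seq_x, seq_y, k):
    """Plain D2: sum over words w of count_x(w) * count_y(w)."""
    cx, cy = kmer_spectrum(seq_x, k), kmer_spectrum(seq_y, k)
    # Counter returns 0 for words absent from seq_y.
    return sum(n * cy[w] for w, n in cx.items())


# Invented toy sequences.
print(kmer_spectrum("ACGUACGU", 2))   # AC, CG, GU occur twice; UA once
print(d2("ACGUACGU", "ACGUU", 2))     # 2*1 + 2*1 + 2*1 + 0 = 6
```

Unlike a T1 catalog, every position of the sequence contributes to the spectrum, and a sequence of length n yields n − k + 1 overlapping words.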

Discussion

From about 1971 through the mid-1980s, Carl Woese and colleagues generated T1 oligonucleotide catalogs for more than 400 organisms, mostly bacteria and archaea, with the aim of understanding the nexus among structure, function, and evolution for the RNA components of the translational apparatus. Using tools that in retrospect seem basic (nuclease digestion, radiolabelling, paper electrophoresis, binary association coefficients, clustering algorithms, and simple statistical models of expected similarity), Carl and his colleagues revolutionized the way we view the living world. Recognition of the three kingdoms of life, a phylogenetic backbone of the microbial world, and natural groupings of various size, taxonomic depth, and biological specialization all arose from Carl’s interpretation of the molecular fossil record internal to 16S/18S rRNA, via the deeply structural idea of the molecular chronometer that intertwines structure, sequence, and evolution for sufficiently large rRNA molecules. For this fundamental biology to have emerged and withstood the test of time, T1 oligonucleotide catalogs (incomplete sets of unordered, short, and somewhat noisy sequences) must carry phylogenetic signal. To be sure, their power of resolution wears thin at greater depths (low S values), but this is true as well for complete sequences using modern methods. Empirical oligonucleotide catalogs sample surprisingly little of the full-length sequence (Table 3), although rather more of its information content (see above). Carl, who was using these catalogs (along with other approaches) to reconstruct full-length sequences, was well aware of this, but argued that oligonucleotides of length ≤ 4, which accounted for much of the sequence not represented in the catalogs, were in any case uninformative about homology; length 5 was “marginal.” The same argument had earlier been made for short oligopeptides in tryptic digests (e.g., ref. 22). 
By contrast, k-mers represent the entire sequence, including base-paired regions and uninformative regions along with informative ones. Three kingdoms are apparent in all the trees we compute from e-catalogs or k-mers, as is the 2M+2M arrangement within archaea (methanogens). By contrast, within bacteria the branching order is somewhat unstable, particularly for the more-basal branches. Interestingly, the same features are poorly resolved in a modern curated resource, with structure-guided multiple alignment of full-length sequences, and RAxML inference of trees. As for the eukaryotic subtree, the inability of 18S rRNA sequence analysis to resolve the branching order of the green plant, fungal, and animal lineages is well known. It has not been our aim here to illustrate the full spectrum of so-called alignment-free approaches and methods, nor to compute k-mer trees for other genes, proteins, concatenated gene sets, or full genomes. We hope that these analyses will stimulate reflection and, where warranted, deeper analysis of how and why catalog-based methods could underpin the revolutionary era in microbiology associated with Carl Woese. Thanks to next-generation and community sequencing technologies, microbiology again faces large, imperfect, and not entirely familiar data; new analytical, comparative, and computational approaches are in play, while non-evolutionary directions beckon. Carl understood that only an evolutionary framework could link genotype with phenotype, and molecular structure with function. With the support of colleagues and great personal determination, Carl built that framework. His life and achievements are, and will long remain, an inspiration.

Materials and Methods

Data

All 16S/18S rRNA sequences used in this study are listed in Table 1. We obtained full-length rRNA sequences from organisms and strains that are identical, or as closely related as possible, to those examined by Woese and Fox (Table S1). For closer examination of the bacterial lineage, we selected from the organisms in Figure 4 of reference 40 so as to represent the four major proteobacterial groupings: α (3), β (1), δ (1), and γ (3), with the 16S rRNA of the cyanobacterium Synechocystis sp. as outgroup (Table S2).

Generation of e-catalogs

To mimic T1 RNase digestion, we computationally cleaved each full-length sequence (Table 1) immediately 3′ of each guanine (G) residue, yielding a set of strings that end in G. Terminal fragments were included in each set, while strings of length < 6 were removed. For ease of handling, we ordered each list first by increasing size, then alphabetically. Our e-catalogs of the Woese and Fox three-kingdom organism set are presented in Table S1, and those of the bacterial set in Table S2.
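The cleavage procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `e_catalog` and the parameter `min_len` are hypothetical, and we assume that repeated oligomers are kept as separate entries, which the text does not specify.

```python
def e_catalog(seq, min_len=6):
    """Mimic T1 RNase digestion: cleave immediately 3' of every G.

    Returns fragments of length >= min_len, ordered by increasing
    length and then alphabetically. Duplicates are kept (assumption).
    """
    seq = seq.upper()
    fragments = []
    start = 0
    for i, base in enumerate(seq):
        if base == "G":
            fragments.append(seq[start:i + 1])  # fragment ends in G
            start = i + 1
    if start < len(seq):
        # 3'-terminal fragment; it need not end in G
        fragments.append(seq[start:])
    kept = [f for f in fragments if len(f) >= min_len]
    return sorted(kept, key=lambda f: (len(f), f))


# Toy sequence, not a real rRNA
print(e_catalog("ACUUAGGCCAUUAAGUCGAACCUUA"))
```

On the toy input, the short fragments `G` and `UCG` are discarded, and the terminal fragment `AACCUUA` is retained even though it does not end in G.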

Phylogenetic analysis

For each of the two sets of full-length rRNA sequences, we performed multiple sequence alignment using MUSCLE at default settings, then inferred trees using MRBAYES and RAxML. MRBAYES parameter settings were: MCMC ngen = 5 000 000 generations, nchain = 4, burnin = 2 500 000 generations. RAxML (-m GTRGAMMA) was run with 100 bootstraps. For the k-mer-based approach, for each sequence set we applied D2 statistics independently at k = 6, 8, 12, and 16, yielding a score for each possible pair of sequences within each set. These scores were transformed via logarithmic representation of the geometric mean, to generate a distance. The pairwise distance d between sequences a and b is defined as

$$d_{ab} = \left| \ln \left( \frac{D_{ab}}{\sqrt{D_{aa} D_{bb}}} \right) \right|$$

where $D_{ab}$ is the pairwise score, and $D_{aa}$ and $D_{bb}$ are the respective self-matching scores. The resulting distance matrix generated for each k was used to reconstruct a phylogenetic tree using the neighbor program in PHYLIP v3.69 (evolution.genetics.washington.edu/phylip). Similarly, using the in silico oligomer catalog for each full-length sequence as input for D2, we calculated pairwise scores and distances for all pairs within a sequence set. The resulting distance matrix was used to compute a tree using neighbor in PHYLIP v3.69.
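As a rough illustration of the k-mer scoring and distance transform described above, the sketch below uses the plain D2 statistic (the sum, over all words, of the product of their counts in the two sequences). That choice is an assumption: the study may have used a normalized D2 variant. The distance is taken as the absolute logarithm of the pairwise score divided by the geometric mean of the two self-matching scores. All function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(counts_a, counts_b):
    """Plain D2: sum over shared words of count_a * count_b (assumption)."""
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


def d2_distance(counts_a, counts_b):
    """|ln(D_ab / sqrt(D_aa * D_bb))|; infinite if no words are shared."""
    d_ab = d2_score(counts_a, counts_b)
    if d_ab == 0:
        return math.inf
    d_aa = d2_score(counts_a, counts_a)
    d_bb = d2_score(counts_b, counts_b)
    return abs(math.log(d_ab / math.sqrt(d_aa * d_bb)))


def distance_matrix(seqs, k):
    """seqs: dict of name -> sequence. Returns (names, square matrix)."""
    names = list(seqs)
    counts = {name: kmer_counts(seqs[name], k) for name in names}
    matrix = [[0.0 if a == b else d2_distance(counts[a], counts[b])
               for b in names] for a in names]
    return names, matrix


# Toy sequences, not real rRNAs
names, matrix = distance_matrix({"a": "ACGUACGU", "b": "ACGUACGA"}, k=3)
print(names, matrix)
```

The same `d2_score` and `d2_distance` functions accept any word counts, so an e-catalog could be scored by passing `Counter(catalog)` in place of the k-mer counts. The resulting matrix would then need to be written in PHYLIP distance format before being passed to neighbor.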

73 in total

Review 1. Alignment-free sequence comparison-a review.

Authors: Susana Vinga; Jonas Almeida
Journal: Bioinformatics Date: 2003-03-01 Impact factor: 6.937

2. EVOLUTION OF HEMOGLOBIN IN PRIMATES.

Authors: R L HILL; J BUETTNER-JANUSCH; V BUETTNER-JANUSCH
Journal: Proc Natl Acad Sci U S A Date: 1963-11 Impact factor: 11.205

3. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2006-08-23 Impact factor: 6.937

4. The role of microsomes in the incorporation of amino acids into proteins.

Authors: E B KELLER; P C ZAMECNIK; R B LOFTFIELD
Journal: J Histochem Cytochem Date: 1954-09 Impact factor: 2.479

5. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

6. Complete nucleotide sequence of a 16S ribosomal RNA gene from Escherichia coli.

Authors: J Brosius; M L Palmer; P J Kennedy; H F Noller
Journal: Proc Natl Acad Sci U S A Date: 1978-10 Impact factor: 11.205

Review 7. Construction of phylogenetic trees.

Authors: W M Fitch; E Margoliash
Journal: Science Date: 1967-01-20 Impact factor: 47.728

8. Partial sequences of 16S rRNA and the phylogeny of blue-green algae and chloroplasts.

Authors: L Bonen; W F Doolittle
Journal: Nature Date: 1976-06-24 Impact factor: 49.962

9. The phylogeny of prokaryotes.

Authors: G E Fox; E Stackebrandt; R B Hespell; J Gibson; J Maniloff; T A Dyer; R S Wolfe; W E Balch; R S Tanner; L J Magrum; L B Zablen; R Blakemore; R Gupta; L Bonen; B J Lewis; D A Stahl; K R Luehrsen; K N Chen; C R Woese
Journal: Science Date: 1980-07-25 Impact factor: 47.728

10. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.

Authors: Sylvain Forêt; Miriam R Kantorovitz; Conrad J Burden
Journal: BMC Bioinformatics Date: 2006-12-18 Impact factor: 3.169
