
Partitional Classification: A Complement to Phylogeny.

Abstract

The tree of life is currently an active object of research, although, alongside vertical gene transmission, nonvertical gene transfers have proved to play a significant role in the evolutionary process. To overcome this difficulty, trees of life are now constructed from genes hypothesized to be vital, on the assumption that these are all transmitted vertically. This view has been challenged. As a frame for this discussion, we developed a partitional taxonomical system clustering taxa at a high taxonomical rank. Our analysis (1) selects RNase P RNA sequences of bacterial, archaeal, and eucaryal genera from genetic databases, (2) submits the aligned sequences to k-medoid analysis to obtain clusters, (3) establishes the correspondence between clusters and taxa, (4) constructs from the taxa a new type of taxon, the genetic community (GC), and (5) classifies the GCs: Archaea-Eukaryotes, contrastingly different from the six others, which are all bacterial. The GCs would be the broadest frame within which to carry out the phylogenies.


Keywords: RNase P RNA; bioinformatics; classification; cluster analysis; evolution; k-medoid analysis

Year: 2016 PMID: 27346943 PMCID: PMC4912232 DOI: 10.4137/EBO.S38288

Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625

Introduction

Partitional cluster analyses (PCAs) constitute a diverse body of methods.1,2 To our knowledge, very few taxonomic studies have used PCAs, though these methods were recommended for the classification of organisms by a number of their founders.3,4 The reason for this lies in one of the ideals of evolutionary biology, ie, to unravel the history of living beings in the form of a single phylogenetical tree, the tree of life (TOL), and simultaneously to classify them, the two activities being hypothesized to be inseparable. In fact, each proposed TOL is a tree organizing certain taxa of the three domains of life,5 not necessarily based on the same molecules or other characters, and thus not congruent from one author to another.5–7 Another case against the TOL is nonvertical gene transfer, namely lateral gene transfer (LGT), endosymbiosis, and chimerism. LGTs have been known since the end of the 1970s but were considered significant in the evolutionary process only much later.8 Endosymbiosis and chimerism are also invoked to explain the occurrence of major evolutionary events (eg, the emergence of Eucarya). Mitochondria were shown to have evolved from Alphaproteobacteria, chloroplasts from Cyanobacteria, and nuclei at least partially from Archaea.9,10 In a eukaryotic cell, exchanges of material between organellar and nuclear DNA occur, a phenomenon called chimerism. The different origins of gene acquisition launched a debate about the method to classify the Living World. Most authors have persisted in constructing TOLs from genes hypothesized to be unaffected by LGT – the core genes –11,12 mainly involved in transcriptional and/or translational mechanisms.
Currently, strenuous efforts are made to combine the different published phylogenetic trees, taxonomical tools, and open bioinformatic systems to approach a comprehensive TOL.13 But others criticized the phylogenetic method more deeply, arguing that LGT is still involved in a number of informational genes, and called for other representations.14,15 Without interfering in this discussion, we propose to construct a taxonomy based on degree of identity (DI) rather than degree of relationship. We defined the DI between two taxa as the overall distance calculated on evolutionary traits stemming both from gene vertical transmission and nonvertical transfers. The DIs were computed on the aligned DNA sequences coding for the RNA of RNase P – a universal ribozyme involved in the maturation of the tRNAs by cleaving its 5′ extremity. RNase P is an endonuclease generally comprising one RNA and a variable number of protein subunits – 1 in bacteria, 4–5 in archaea, and 8–10 in eukaryotes.16,17 Except in the plants studied18 and the mitochondrion of man19 where the RNA is absent, the latter is generally the catalytic part and is widespread in a large number of taxa across the three domains. RNase P RNA contains highly conserved regions, ie, the catalytic domain forming loops or hairpins, and highly variable regions linking them, hence the relevance of the choice of this molecule for classification. Compared with 16S–18S rRNA, RNase P RNAs are smaller sequences leading to comparable results with far less machine time. A higher rate of nucleotide variation explains some discrepancies between the phylogenies performed with one or the other molecule.7,20

Methods

The material

Our initial material consisted of 564 DNA sequences coding for complete RNase P RNAs, carried by 564 different taxa (genera) and pooled together from three genetic databases, ie, Rfam, Noncode, and GenBank. The sequences obtained from Rfam originated from several built-in files where they were already displayed aligned, but this alignment was useless to us since it was performed within each file; the lengths of the sequences differed from one file to another. This length difference was further increased by the addition of the unaligned sequences coming from Noncode and GenBank. Besides, this raw material was heterogeneous with respect to gaps, since it was an admixture of aligned and unaligned sequences. The 564 sequences were then sorted in such a way that the nth sequence corresponded to the nth item of Dataset3.txt – the file of the carriers of the sequences (cf. below). The sequences were gathered into file Dataset0.txt, then stripped of any gaps they contained and multiply aligned (with MUSCLE21 and Algorithms S1–S3, Figs. S1–S3). Algorithms, pieces of text, tables, and figures referred to with ‘S+a number’ are found in the supplementary file SupplementaryMaterial.pdf; this file and datasets 0–9 are referred to in the SupplementaryMaterial section. The sequences resulting from these modifications were of equal length (2059 characters) and constituted file Dataset1.txt. They were then converted by an appropriate codification (Algorithm S4) into numeric vectors of equal length composed of 8236 numerals (4 × 2059), each either 0 or 9 (Fig. S4). These vectors composed file Dataset2.txt and were the objects on which our PCA was applied.
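The codification step can be pictured with a minimal sketch. Algorithm S4 is not reproduced in the text, so the four-numeral code below (one position per nucleotide, gaps coded as all zeros, 1 as the "on" value) is an assumption chosen only to reproduce the stated vector length of 8236 numerals for 2059 aligned characters; the names `CODE`, `GAP`, `encode`, and `manhattan` are hypothetical.

```python
# Hypothetical 4-numeral code per aligned character (assumption, not Algorithm S4).
CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
GAP = (0, 0, 0, 0)  # gaps and any other symbol carry no signal in this sketch


def encode(aligned_seq):
    """Turn one aligned sequence into a flat list of 4 numerals per character."""
    vector = []
    for ch in aligned_seq.upper().replace("U", "T"):
        vector.extend(CODE.get(ch, GAP))
    return vector


def manhattan(u, v):
    """Manhattan distance between two numeric vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length (align the sequences first)")
    return sum(abs(a - b) for a, b in zip(u, v))


a = encode("AC-GT")
b = encode("ACTGA")
print(len(a))           # 20 numerals for 5 aligned characters
print(manhattan(a, b))  # 3: gap vs T counts 1, T vs A counts 2
```

Under this assumed code, a substitution contributes 2 to the Manhattan distance and a gap against a nucleotide contributes 1; a different codification would weight these events differently.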

The analyses

Our analyses developed into the following three steps: (1) a k-medoid analysis revealing a number of clusters among which the sampled sequences were distributed, (2) the study of the overlap between the clusters and operational taxonomic units (OTUs), and (3) a hierarchical clustering of the clusters assimilated to the OTUs, from which we derived a typology of cluster families, very strongly overlapping reunions of OTUs, ie, the genetic communities (GCs).

The genera and their taxonomic position

The taxonomic position (TP) of a given genus was defined as a sequence of nesting taxa in decreasing ranks, ie, domain, kingdom (for eukaryotes only), phylum, class, and order, each containing the genus. This information is easily available in taxonomic databases and in the literature. File Dataset3.txt contains 564 genera and their TP (the rows). Each genus corresponding to the nth row of Dataset3.txt is the carrier of the sequence corresponding to the nth row of Dataset2.txt (for more detail, see SupplementaryMaterial.pdf, section S1).

k-medoid analysis on our data

We carried out a k-medoid analysis on file Dataset2.txt with the following parameters: (1) d = Manhattan distance, (2) n = number of sequences, (3) k0 = number of clusters optimizing the partition, (4) M = method = either clustering large applications (CLARA) or partitioning around medoids (PAM) (we performed both analyses), and (5) when CLARA was applied, N = number of samples to be drawn = 100 (subsection S3.3). Number k0 was obtained with Mardia’s cluster variation method, ie, $k_0 \approx \sqrt{n/2}$.22 The analyses (Algorithm S5 launching CLARA and Algorithm S7 launching PAM) resulted in (1) the construction of the clusters around their medoids, (2) the assignment of the genera to each of the clusters, and (3) the computation of the cluster means. The k0 clusters formed our cluster partition. This analysis constructed an optimal partition of clusters gathering the most similar genera.
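As an illustration of the parameters listed above, the sketch below computes a rule-of-thumb number of clusters, taken here to be round(sqrt(n/2)), which gives 17 for n = 564, and runs a simple alternating k-medoid heuristic with Manhattan distance. This is not the authors' Algorithm S5 or S7 and not the full PAM swap search; the function names and the toy data are assumptions made for the example.

```python
import math
import random


def mardia_k(n):
    """Rule-of-thumb number of clusters, k ~ sqrt(n / 2), rounded."""
    return round(math.sqrt(n / 2))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def assign(points, medoids):
    """Label every object with the position of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: manhattan(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternating heuristic: assign objects, then re-pick each cluster's medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest summed distance
            # to the other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(manhattan(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


print(mardia_k(564))  # 17

toy = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
medoids, labels = k_medoids(toy, k=2)
print(labels)  # the first three and the last three objects share a label
```

Because each medoid is an actual object of the sample, the procedure never needs a mean or median of the coded vectors, which is the property the method relies on.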

Contingency table crossing clusters and taxa

The genera were distributed among the k taxa $T_i$ and the k0 clusters $C_j$, crossed to form a contingency table (CT), with $n_{ij}$ representing the number of genera within both $T_i$ and $C_j$. Per taxon $T_i$, Cmax is the cluster containing the largest number of its genera; $n_{i.}$ is the number of genera of the taxon; and $\delta_i$, the degree of membership to a cluster (DMTC), is defined as the percentage of the genera of the taxon found within Cmax, ie, $\delta_i = 100 \max_j n_{ij} / n_{i.}$ (listed as d_i in Table 1). Per cluster $C_j$, Tmax is the taxon contributing the largest number of genera; $n_{.j}$ is the number of genera of the cluster; and $\tau_j$, the taxonomic specificity (TS), is defined as the percentage of the genera of the cluster belonging to Tmax, ie, $\tau_j = 100 \max_i n_{ij} / n_{.j}$. Such a CT is illustrated in Table 1.

Table 1

The OTUs crossed with the 17 clusters.

CLUSTERS
OTUs	C₁	C₂	C₃	C₄	C₅	C₆	C₇	C₈	C₉	C₁₀	C₁₁	C₁₂	C₁₃	C₁₄	C₁₅	C₁₆	C₁₇	n_i.	d_i
A	38	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	40	95
At	0	1	0	0	1	33	5	2	1	0	0	0	0	0	0	0	0	43	77
Ba1	0	11	1	0	2	0	0	0	1	0	0	0	0	0	0	0	0	15	73
FL	0	0	1	0	0	0	0	0	0	0	10	0	0	0	0	0	0	11	91
Cy	0	0	0	0	0	0	0	0	0	1	0	18	0	0	0	0	0	19	95
Co1	5	3	6	0	3	0	0	2	0	2	0	0	2	0	0	0	0	23	26
Ng	0	0	0	0	0	0	0	0	0	7	0	0	0	0	0	0	0	7	100
Al1	0	4	1	3	1	0	0	1	0	0	0	0	0	1	0	0	0	11	36
Al2	0	5	0	26	1	0	0	0	0	0	0	0	0	1	0	0	0	33	79
Al3	0	0	0	5	1	0	0	0	0	0	0	0	0	18	0	0	0	24	75
Bu	0	0	7	0	1	0	0	0	0	0	0	0	0	0	1	9	0	18	50
Ga1	0	4	0	1	19	0	0	0	5	0	0	0	0	0	14	0	0	43	44
Ga2	0	17	0	0	7	0	0	0	6	0	0	0	0	0	0	0	0	30	57
E1	9	0	6	0	1	0	0	0	0	0	0	0	37	0	0	0	4	57	65
E2	3	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	22	27	82
E3	0	2	0	0	0	0	0	0	0	0	0	1	0	0	1	1	0	5	40
n._j	55	48	23	35	37	33	5	5	13	10	10	19	41	20	16	10	26
τ_j	69	35	30	74	51	100	100	40	46	70	100	95	90	90	88	90	85

Note: The boldfaced numbers correspond to the intersection of each OTU with its Cmax and represent the number of genera within the OTU and Cmax in question.
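The marginal quantities of Table 1 can be recomputed from its cell counts. The sketch below transcribes the 16 × 17 table and derives, for a taxon, its Cmax and DMTC (largest cell of the row as a percentage of the row total) and, for a cluster, its TS (largest cell of the column as a percentage of the column total). The helper names `dmtc`, `c_max`, and `ts` are hypothetical.

```python
# Cell counts of Table 1: one row per OTU, one column per cluster C1..C17.
TABLE = {
    "A":   [38, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "At":  [0, 1, 0, 0, 1, 33, 5, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "Ba1": [0, 11, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "FL":  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0],
    "Cy":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 18, 0, 0, 0, 0, 0],
    "Co1": [5, 3, 6, 0, 3, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0],
    "Ng":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0],
    "Al1": [0, 4, 1, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "Al2": [0, 5, 0, 26, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "Al3": [0, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0],
    "Bu":  [0, 0, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 9, 0],
    "Ga1": [0, 4, 0, 1, 19, 0, 0, 0, 5, 0, 0, 0, 0, 0, 14, 0, 0],
    "Ga2": [0, 17, 0, 0, 7, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0],
    "E1":  [9, 0, 6, 0, 1, 0, 0, 0, 0, 0, 0, 0, 37, 0, 0, 0, 4],
    "E2":  [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 22],
    "E3":  [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
}


def dmtc(row):
    """Degree of membership to a cluster: share of the taxon inside its Cmax (%)."""
    return 100 * max(row) / sum(row)


def c_max(row):
    """1-based index of the cluster holding most genera of the taxon."""
    return row.index(max(row)) + 1


def ts(table, j):
    """Taxonomic specificity of cluster C_j (1-based): share held by Tmax (%)."""
    column = [row[j - 1] for row in table.values()]
    return 100 * max(column) / sum(column)


print(c_max(TABLE["A"]), round(dmtc(TABLE["A"])))  # 1 95
print(round(ts(TABLE, 1)))                         # 69
```

The same functions apply unchanged to the IFBU and LFBU tables, since only the rows differ.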

Definite and indefinite taxa

Each taxon of a TP is a definite taxon, ie, corresponding to an acknowledged taxonomical category. We considered these taxa as mathematical sets of genera; the reunions of the most similar taxa of a TP not corresponding to an officially defined taxonomical category were the indefinite taxa.

Functional biological units and OTUs

A functional biological unit (FBU) is a definite or indefinite taxon with a given set of known evolutionary characters and useful for the construction of the OTUs.23–26 An OTU is an FBU that meets the requirement appropriate to a given study – in our case, a strong overlap with the clusters. The idea is to verify whether the clusters strongly match the OTUs, so that a typology of the clusters can be assimilated to a partitional taxonomy of the OTUs. We started our partitional classification analysis with κ initial FBUs (IFBUs) and constructed the OTUs in two steps: (1) from the IFBUs to κ′(<κ) larger FBUs (LFBUs) and (2) from the LFBUs to κ″(<κ′) OTUs. The IFBUs, LFBUs, and OTUs were crossed with the k0 clusters to build up CTs.

Taxonomic interpretation of the clusters

Two independent analyses applied on the LFBUs were carried out to interpret the clusters taxonomically: (1) a statistical one based on an overlap index (OI) and (2) a correspondence analysis (CA).22,27

Statistical method based on overlap index

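Three overlap indices between any taxon $T_i$ and cluster $C_j$ were proposed, and the best one among them was selected. With $n_{ij}$ the number of genera shared by $T_i$ and $C_j$, and $n_{i.}$ and $n_{.j}$ the numbers of genera in $T_i$ and $C_j$, respectively, these are (1) $\omega_{ij} = 2 n_{ij} / (n_{i.} + n_{.j})$ (Dice index), (2) $\omega_{ij} = n_{ij} / (n_{i.} + n_{.j} - n_{ij})$ (Jaccard index), and (3) $\omega_{ij} = n_{ij} / \sqrt{n_{i.} n_{.j}}$ (cosine index).28 Dice, Jaccard, and cosine OIs were calculated between the κ′ LFBUs and the k0 clusters. For each LFBU $T_i$, the maximal OI (MOI), defined as $\omega_{i.} = \max_j \omega_{ij}$, was computed. This number describes the overlap between LFBU $T_i$ and its Cmax and reflects, if above a threshold ωinf determined statistically, a specific association between Cmax – necessarily one of the $C_j$ – and LFBU $T_i$, the latter being a revealed OTU. We selected the best OI (with the strongest MOI) for partitional classification.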
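The three overlap indices can be written directly on the counts of a contingency table. The sketch below uses the standard set-overlap definitions of the Dice, Jaccard, and cosine indices, which is an assumption about the exact form used in the study, and takes the Flavobacteria row and the cluster totals of Table 1 as input; the function names are hypothetical.

```python
import math


def dice(n_ij, n_i, n_j):
    return 2 * n_ij / (n_i + n_j)


def jaccard(n_ij, n_i, n_j):
    return n_ij / (n_i + n_j - n_ij)


def cosine(n_ij, n_i, n_j):
    return n_ij / math.sqrt(n_i * n_j)


def moi(row, col_totals, index=cosine):
    """Maximal overlap index of one taxon and the 1-based cluster reaching it."""
    n_i = sum(row)
    scores = [index(n_ij, n_i, n_j) for n_ij, n_j in zip(row, col_totals)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best + 1


# Row FL and cluster totals n_.j as given in Table 1.
FL = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0]
TOTALS = [55, 48, 23, 35, 37, 33, 5, 5, 13, 10, 10, 19, 41, 20, 16, 10, 26]

value, cluster = moi(FL, TOTALS)
print(round(value, 2), cluster)  # 0.95 11
```

For this row the cosine MOI of 0.95 with cluster C11 agrees with the value reported for FL in Table 3; swapping `index=dice` or `index=jaccard` gives the other two indices on the same counts.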

Correspondence analysis

CA was carried out with Algorithm S8 from a CT crossing κ′ LFBUs (Dataset7.txt) with the k0 clusters.

A hierarchical cluster analysis to infer the partitional classification

Algorithm S9 performed a hierarchical cluster analysis (HCA)1 on the means of the taxon-specific clusters and the mean of cluster C2 obtained with Algorithm S6 with (1) Manhattan distance as the dissimilarity index and (2) Ward as the agglomerative method. These means were identified with the OTUs. We considered as taxon-specific the clusters having seven or more members and a TS ≥ 50. The large cluster C2 was also processed despite its low TS since it showed an interesting bimodal distribution. If the analysis showed that these clusters could be assimilated to OTUs, the inferred cluster typology would be equivalent to a taxonomic system of the OTUs based on the DIs. In this system, we gather the most similar clusters into cluster families (CFs) assimilated to the reunions of the OTUs showing the highest DIs. Such reunions of taxa were called GCs.
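The selection rule for the clusters entering the HCA can be checked against the margins of Table 1. The sketch below applies the two stated criteria (at least seven members, TS of at least 50) to the cluster sizes and TS values printed in the last two rows of that table and then adds cluster C2, as described above; the function name `taxon_specific` is hypothetical.

```python
# Cluster sizes n_.j and taxonomic specificities tau_j, copied from Table 1.
SIZES = [55, 48, 23, 35, 37, 33, 5, 5, 13, 10, 10, 19, 41, 20, 16, 10, 26]
TS = [69, 35, 30, 74, 51, 100, 100, 40, 46, 70, 100, 95, 90, 90, 88, 90, 85]


def taxon_specific(sizes, ts, min_size=7, min_ts=50):
    """Return the 1-based indices of clusters meeting both criteria."""
    return [j + 1 for j, (n, t) in enumerate(zip(sizes, ts))
            if n >= min_size and t >= min_ts]


selected = sorted(taxon_specific(SIZES, TS) + [2])  # C2 added despite its low TS
print(selected)  # [1, 2, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16, 17]
```

The 13 clusters retained this way are exactly those appearing in the seven cluster families listed in the caption of Figure 2; clusters C3, C7, C8, and C9 fail one of the two criteria.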

Abbreviated taxon names

A = Archaea, Ab = Acidithiobacillales, Ac = Actinopterygii, Ae = Aves = Ae1 ∪ Ae2, Ae1 = Taeniopygia, Ae2 = Gallus, AE = Archaea or Eucarya = A ∪ E, Af = Afrosoricida, Ai = Ascidiacea, Al2 = α-Proteobacteria 2 = Rz ∪ Ro ∪ Sh, Al3 = α-Proteobacteria 3 = Rh ∪ Mg, Am = Aeromonadales, An = Alteromonadales, Ar = Arthropoda, AT = Actinobacteria, Av = Alveolata, Ay = Artiodactyla, Ba1 = Bacteroidetes 1 = BT ∪ CT ∪ SB, BT = Bacteroidia, Bu = Burkholderiales, Ch = Chiroptera, Ci = Cnidaria, Cm = Chromatiales, Cn = Carnivora, Co1 = Clostridia 1 = Cs ∪ Se, Cp = Cephalochordata, Cs = Clostridiales, CT = Cytophagia, Cy = Cyanobacteria, Dd = Didelphimorphia, Dp = Diprotodontia, E1 = Eucarya 1 = Ac ∪ Av ∪ Ex ∪ Ho ∪ Ae1 ∪ Ai ∪ Ar ∪ Cp ∪ Fn ∪ Ma1 ∪ Ml ∪ Ne ∪ Pl ∪ Pt, E2 = Eucarya 2 = Ma2 ∪ Ae2, E3 = GL ∪ Hr ∪ Ec ∪ Ci, Ec = Echinodermata, En = Enterobacteriales, Ex = Excavata, FL = Flavobacteria, Fn = Fungi, Ga1 = γ-Proteobacteria 1 = Ab ∪ Am ∪ An ∪ En ∪ Ps ∪ Vi ∪ Xa, Ga2 = γ-Proteobacteria 2 = Cd ∪ Cm ∪ Gais ∪ Lg ∪ Mc ∪ Oc ∪ Pd ∪ Tt, Gais = γ-Proteobacteria incertae sedis, GL = Glaucophyta, Ho = Choanomonada, Hr = Chromalveolata, Hy = Hyracoidea, La = Lagomorpha, Lg = Legionellales, Ma1 = Mammalia 1 = Af ∪ Ay ∪ Dp ∪ La ∪ Rd ∪ Sc ∪ Ty, Ma2 = Mammalia 2 = Cn ∪ Ch ∪ Dd ∪ Hy ∪ Pe ∪ Mo, Mc = Methylococcales, Mg = Magnetococcales, Ml = Mollusca, Mo = Monotremata, Ne = Nematoda, Oc = Oceanospirillales, Pd = Pseudomonadales, Pe = Perissodactyla, Pl = Placozoa, Ps = Pasteurellales, Pt = Platyhelminthes, Rd = Rodentia, Rh = Rhodobacterales, Ri = Rickettsiales, Ro = Rhodospirillales, Rz = Rhizobiales, SB = Sphingobacteria, Sc = Scandentia, Se = Selenomonadales, Sh = Sphingomonadales, Tt = Thiotrichales, Ty = Tylopodes, Vi = Vibrionales, Xa = Xanthomonadales.

Results

Relevant taxa

Results from the k-medoid analysis

Our data showed that the optimal number of clusters, obtained with Mardia’s cluster variation method, was k0 = 17. Our k-medoid analysis, carried out with method CLARA, resulted in (1) the assignment of each of the 564 genera to one of the 17 clusters (Dataset4.txt) and (2) the computation of the mean vectors of the 17 clusters (Dataset5.txt). We performed the same analysis with method PAM and obtained an almost identical assignment of the genera to the 17 clusters (Dataset6.txt) except for four among the 564 genera (Sebaldella, Liberibacter, Novosphingobium, and Nautilia). We decided to proceed to the analyses with CLARA (see Discussion section).

The three successive CTs

The 564 genera were distributed into three successive CTs, in which taxa were crossed with the same k0 = 17 clusters: (1) A CT involving k = 100 IFBUs (Table S1). (2) A CT on k′ = 33 LFBUs (Table S2). Each of these taxa is the reunion of IFBUs that are included in the same taxon of the immediate higher rank (TIHR), as displayed in Dataset3.txt, and that share the same Cmax. For example, LFBU Archaea (A) is the reunion of IFBUs Crenarchaeota (Cr), Euryarchaeota (Er), Korarchaeota (Kr), and Thaumarchaeota (TH); the genera of the member IFBUs of A overwhelmingly belong to cluster Cmax = C1. (3) A CT involving k″ = 16 OTUs (Table 1). The OTUs are heuristically defined as (a) LFBUs having a DMTC ≥ 50 and represented by seven or more genera and (b) LFBUs belonging to the same TIHR as other member OTUs, ie, Al1, Ga1, and E3 (69.3% of the sampled genera).

Correspondence between cluster groups and taxa

With the statistical method based on the OIs

Tables S3–S5 display the overlap between the LFBUs and the clusters, assessed respectively with the Dice, Jaccard, and cosine indices (MOIs $\omega_{i.}$ in the right margin of the tables). From these tables, we calculated (1) ō and $\sigma_O$, respectively the mean and standard deviation of random variable O taking on the values $\omega_{i.}$, and (2) the threshold ωinf (κ being the number of taxa involved), after normality of the $\omega_{i.}$ was verified (Table 2). Each LFBU with $\omega_{i.}$ ≥ ωinf was considered as significantly superimposed on its Cmax cluster, which we called its corresponding cluster. (We called these LFBUs candidate OTUs.)

Table 2

Comparison of statistic descriptors of variable O for the three OIs (analysis on the LFBUs).

OI TYPE	TABLE	ō (MOI)	σ^O	JB	ω_inf	NUMBER OF TAXA WITH ω_i. > ω_inf
Dice index	S3	0.54	0.28	ns	0.46	11
Jaccard index	S4	0.42	0.26	ns	0.34	10
Cosine index	S5	0.56	0.26	ns	0.49	11

Abbreviations: JB, Jarque–Bera normality test statistic49; ō, mean of the MOIs; ns, nonsignificant.

Tables S6–S8 present the three OIs between the candidate OTUs and the clusters. The cosine index was the best OI, with the highest mean MOI and, jointly with the Dice index, the largest number of candidate OTUs above ωinf (cf. Table 2); it was thus chosen as our OI for the rest of the study.

From the CA

We applied the CA to Dataset7.txt and obtained file Dataset8.txt, the listing of the analysis, from which we plotted CA diagrams (Fig. 1). The results of the analysis, ie, the relationships between cluster and OTU as revealed by the CA, are reported in Table 3.

Figure 1

Plot diagrams inferred by CA. Inertia rates in brackets next to the factorial axes (FAs). Squares with C in gray are clusters. Dots with abbreviated names in black are taxa. Factorial planes generated by two factorial axes: (A) F1 and F2; (B) F1 and F3; (C) F3 and F4; (D) F4 and F5; (E) F5 and F6; (F) F5 and F7.

Table 3

Comparative results between the OIA and CA.

OIA			CA
OTUs	C_max	MOIs	ASSOCIATED CLUSTERS	FACTORIAL PLANES
FL	C₁₁	0.95	C₁₁	F1 × F2
Cy	C₁₂	0.95	C₁₂	F1 × F2
E2	C₁₇	0.83	C₁₇	F1 × F3
Al3	C₁₄	0.82	C₁₄	F3 × F4
At	C₆ (C₇)	0.78	C₆, C₇	F3 × F4
E1	C₁₃	0.78	C₁₃	F1 × F3
A	C₁	0.76	C₁	F1 × F3
Co1	C₈	0.73	C₈	F5 × F7
Al2	C₄	0.72	C₄	F3 × F4
Bu	C₁₆ (C₃)	0.67	C₁₆	F4 × F5
Ga2	C₁₅ (C₉)	0.53	C₁₅	F4 × F5
Ng	C₁₀	0.45	C₁₀	F5 × F7
Ga1	C₂ (C₉)	0.34	C₂	F4 × F5
Ba1	C₂	0.32	C₂	F1 × F12

Notes: OTUs sorted in decreasing order of MOI. Clusters in parentheses, in OIA, cluster with the second largest number of genera for a given taxon.

Abbreviations: OIA, overlap index analysis; CA, correspondence analysis.

Both methods give the same association between cluster and taxon (Table 3). Remarkably, (1) the associated clusters unveiled by CA are the Cmaxs of the descriptive method and (2) the taxa revealed as overlapping the clusters were all OTUs as determined in the previous subsection. A solid correspondence between clusters and OTUs is thus highlighted. The clusters are identified with the OTUs.

The DIs revealed by HCAs

The HCA on the cluster means (Dataset5.txt), restricted to the taxon-specific clusters, calculated the distances between them, organized these distances in a distance matrix (Dataset9.txt), and drew from the latter our dendrogram (Fig. 2). We associated the CFs with their corresponding GCs (cf. Table 4).

Figure 2

HCA dendrogram. Distance = DI = Manhattan; aggregation method = Ward. Cut at distance ca. 3200. A1–A7 = cluster families: A1 = {C1, C13, C17}; A2 = {C2, C5, C10}; A3 = {C4, C6, C14}; A4 = {C11}; A5 = {C12}; A6 = {C15}; and A7 = {C16}.

Table 4

Cluster families inferred by the HCA of the TSCs.

CLUSTER FAMILIES Aj
T_i	A1	A2	A3	A4	A5	A6	A7	n_i.	ω_i.
AE	115 (0.99)	2 (0.02)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	117	0.99
Ga	0 (0)	47 (0.68)	1 (0.01)	0 (0)	0 (0)	14 (0.46)	0 (0)	62	0.68
Ba1	0 (0)	13 (0.41)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	13	0.41
Al	0 (0)	12 (0.17)	54 (0.71)	0 (0)	0 (0)	0 (0)	0 (0)	66	0.71
AT	0 (0)	2 (0.04)	33 (0.59)	0 (0)	0 (0)	0 (0)	0 (0)	35	0.59
FL	0 (0)	0 (0)	0 (0)	10 (1)	0 (0)	0 (0)	0 (0)	10	1
Cy	0 (0)	1 (0.03)	0 (0)	0 (0)	18 (0.97)	0 (0)	0 (0)	19	0.97
Bu	0 (0)	1 (0.03)	0 (0)	0 (0)	0 (0)	1 (0.08)	9 (0.90)	11	0.90
n_.j	116	12	22	19	68	57	61	355

Notes: T_i, PGCs. At the intersection of T_i and A_j: n_ij = number of genera in both taxon T_i and cluster family A_j; n_i. = number of genera in taxon T_i; and n_.j = number of genera in cluster family A_j. In brackets, OI between PGCs and CFs. ω_i. = MOI of T_i. From calculation, ωinf = 0.69.

The dendrogram plot highlighted a typology with seven cluster families (CFs): A1 = {C1, C13, C17}; A2 = {C2, C5, C10}; A3 = {C4, C6, C14}; A4 = {C11}; A5 = {C12}; A6 = {C15}; and A7 = {C16}. We identified each CF with a potential GC (PGC) defined as the reunion of the OTUs corresponding to the clusters composing the CF. For example, to C1 corresponds Archaea, to C13 Eucarya 1, and to C17 Eucarya 2. Hence, with A1, we could identify the PGC obtained by reuniting these three taxa. We considered the typology displayed in Table 4 to be good on the basis of the MOIs (calculated from the data of Table 4). The PGCs with MOI ≥ ωinf, boldfaced, were defined as GCs. Hence, the GCs are (1) the Archaea and Eukaryotes altogether (AE), (2) the Burkholderiales (Bu), (3) Bacteroidetes 1 (Ba1), (4) the Cyanobacteria (Cy), (5) the γ-Proteobacteria (Ga), (6) the α-Proteobacteria (Al), and (7) the Actinobacteria (AT). The genera processed numbered 333, accounting thus for 59% of sample S.

Discussion

Justification of our methodological principle

LGT and endosymbiosis may have played a key role in the emergence of new groups in certain circumstances (such as after massive extinctions or radical changes in their environment). These events could have introduced novelties in organisms, shared thereafter by their descendants via classical vertical gene transmission if these gene acquisitions conferred increased selective advantages on the bearers.10,29 Hence, entire historical communities could have emerged this way, introducing evolutionary discontinuities, possibly the GCs. We propose that phylogenies could be unraveled within the GCs. The construction of the TOL implicitly accepts the hypothesis of the constancy of the molecular clock – at least stochastically – throughout the geological eras, within the organisms classified. However, it has been shown that in the remote past, radiation rates coupled with atmosphere composition varied, entailing a variation of the rate of molecular evolution between the taxa.30–32 TOLs based on core genes might trace back the phylogenies of only parts of organisms, if these are phylogenetically too distant. The aim of a sound taxonomical system being the objective comparison of whole organisms, we suggest carrying out phylogenetical taxonomy only on restricted groups where one can take nearly for granted that the overwhelming part of the genetic material has been acquired by vertical transmission, as for instance in the Metazoa or γ-Proteobacteria. Thus, we propose to apply partitional clustering mainly to higher ranked taxa and phylogenetical analyses principally to lower ranked taxa, when the molecular clock can be reasonably calibrated and the genes shown to be transmitted vertically.
One might object to partitional clustering that it is equivalent to rootless tree analyses, as used in previous studies.33–35 In our opinion, the two approaches are distinct, and the main differences between them are as follows: (1) In a rootless phylogeny, one poses a hypothesis on the relationships between taxa of a given group, which would constitute a community of related taxa exclusively sharing a set of characters between themselves, hypothesized to be relevant for the group and supposed to be possessed by a common (unknown) ancestor. These characters are termed polarized. A rootless tree, like any tree, is a hypothetico-deductive construction. (2) On the contrary, partitional clustering is not based upon an a priori hypothesis. The global DI between the taxa is revealed by structures underpinning the data. This approach is inductive. Our analysis revives the long-standing debate between the proponents of the deductive methods and those of the inductive methods in systematics and evolutionary biology.36,37 Deductive methods have been favored for the last three decades, and inductive methods, on the contrary, hardly mentioned. However, though the deductive methods have been extremely useful and fruitful in the explanation of many evolutionary phenomena, inductive methods can also deliver very interesting information.38,39

The choice of the k-medoid analysis

We chose to apply k-medoid analysis because, contrary to k-means and k-medians analyses, it does not rely on means or medians, which are not appropriate to our data (binary numerals). In addition, k-medoid analyses are less influenced by outliers, and they are more robust than k-means or k-medians analyses, ie, their results depend less on the initial conditions (the choice of the first centroids).40 There are two methods for k-medoid analysis on a given sample, ie, PAM and CLARA.41 PAM handles all the objects and is appropriate for relatively small samples. CLARA, on the other hand, selects, from a large sample, a series of randomly drawn subsamples. The seeds are selected in each subsample by means of a program similar to PAM; thereafter, the objects of the entire sample are assigned to each of these seeds by means of a chosen distance index (either Euclidean or Manhattan). CLARA is best suited for large files since the complexity of this algorithm rises arithmetically and not exponentially like in k-means and k-medians analyses. This property makes it possible to process large samples of long sequences in a reasonable time period and on portable computers. We compared the two methods and found that among the 564 genera analyzed, only 4 genera were not assigned to the same cluster. Hence, for us, the methods are comparable, and we can use either method, perhaps with a preference for CLARA to minimize the complexity of the algorithm.
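The CLARA procedure described above (draw subsamples, find seeds in each with a PAM-like search, assign all objects to the seeds, keep the best set) can be sketched as follows. This is a simplified illustration, not the program used in the study: the greedy swap search, the default subsample size of 40 + 2k, the small default number of samples, and all function names are assumptions.

```python
import random


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def total_cost(points, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def pam_like(sample, k, rng):
    """Greedy swap search on a small sample (PAM-style, simplified)."""
    medoids = rng.sample(sample, k)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                # Accept a swap only if it strictly lowers the cost.
                if total_cost(sample, trial) < total_cost(sample, medoids):
                    medoids, improved = trial, True
    return medoids


def clara(points, k, n_samples=5, sample_size=None, seed=0):
    rng = random.Random(seed)
    size = min(len(points), sample_size or 40 + 2 * k)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, size)
        medoids = pam_like(sample, k, rng)
        cost = total_cost(points, medoids)  # judged on the entire data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    labels = [min(range(k), key=lambda c: manhattan(p, best[c])) for p in points]
    return best, labels


data = [[i, 0] for i in range(10)] + [[100 + i, 0] for i in range(10)]
medoids, labels = clara(data, k=2)
print(labels)  # two groups of ten objects, one label each
```

The expensive swap search only ever sees a subsample, while the full data set is touched once per sample for the cost evaluation, which is where the saving over running PAM on all objects comes from.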

The GCs

Our analysis revealed taxa, ie, the GCs, overlapping the cluster families very significantly and gathering the most similar organisms, ie, the genera whose DIs between themselves are smallest. This may be explained by the fact that mathematical clustering does not assemble the genera randomly. Organisms are hypercomplex systems highly constrained phenotypically, hence also genetically. This mere fact probably imposed on them a relatively small number of solutions for their structuration, reflected by the strong genetic resemblance of the organisms within a small number of sets. Figure 2 shows that the nonbacterial organisms are genetically less differentiated than the bacterial ones, the largest between-cluster distance (LBCD) of A1 being about 2775 and that of the reunion of the other cluster families ca 5080. Cluster family A1 is remarkable in the sense that Eucarya 2 (one of the avian and about half of the mammalian orders) is contained in cluster C17, which is more distant from cluster C13 (comprising almost the rest of the Eukaryotes) than C1 (the cluster gathering almost all the Archaea). One of the possible explanations lies in the acquisition of extra protein subunits partially involved in catalytic activity in the eucaryal genera, a situation that would have correlatively entailed a weaker involvement of the RNA subunit in that activity, and consequently a structural simplification of the latter. This could explain some structural convergence between very distinct groups of nonbacterial genera in CF A1.16 A huge gap exists between the Eukaryotes and Archaea on one hand and the Bacteria on the other hand. The LBCD between these two groups is 6780. Thus, GC Archaea-Eucarya forms a consistent group, in opposition to the remaining GCs, which form another, equally consistent group of GCs, all bacterial.
Remarkably, within this group, the GCs Burkholderiales (A7) and Cyanobacteria (A5) are more distant from their neighboring bacterial CFs than the nonbacterial clusters are from one another. Conversely, two composite GCs, the one containing most of the γ-Proteobacteria and Bacteroidetes 1 (A2) on one hand, and the other composed of the α-Proteobacteria and the Actinobacteria (A3) on the other hand, are less diversified than the nonbacterial GC (respective LBCDs ≃ 1590 and 2320). A number of taxa of sample S are not members of any of the GCs, namely those which are scattered among the clusters with no preferential connections (and thus showing a weak DMTC), or those connected to clusters C1, C2, or C7, which do not enter in the composition of the cluster families. Of the first category, one can mention Bacilli; and of the second, one can mention Clostridia 23 and Negativicutes, and δ- and ε-Proteobacteria. Such a result is not compatible with the systematics inferred from the phylogeny based on 16S/18S rRNA. However, the heterogeneity of the Firmicutes and the Proteobacteria highlighted by our analysis was also revealed in a number of phylogenetic studies on universal molecules other than 16S/18S rRNA,20,42,43 inciting the authors to question the monophyly of these taxa. Two biases can be encountered in classification based upon aligned sequences, namely the convergence of homologous blocks resulting from plesiomorphic sequence position,42 and the compensatory base changes not necessarily leading to a phenotypic differentiation (in the case of noncoding RNA, no change in secondary structures).44,45 This remark, however, applies not only to our study but also to the vast majority of the current phylogenetic studies exclusively involving the primary structures. The method is dependent on the sequences existing in the genetic databases.
The material obtained has a strong influence on the optimization of k-medoid analysis, hence on k0 – the optimal number of clusters – and consequently not all the genera will necessarily be processed. But this problem also exists in phylogenetic analysis, where a decision is always made concerning a hypothesis, necessarily concealing – in parts of a tree in construction – uncertainties or lack of knowledge.

Conclusion

The seven GCs would be the result of the plurality of the sources of genetic heritage that would render the history linking them blurred and tremendously difficult to unravel. The nonbacterial GC is distinct from all the other, bacterial, GCs taken altogether. And within the bacterial GCs, surprisingly, Actinobacteria have a relatively strong DI with α-Proteobacteria, which again does not mean that α-Proteobacteria are more related to Actinobacteria than they are to γ-Proteobacteria. The same holds for Burkholderiales – an order of β-Proteobacteria – which show a smaller DI with the Bacteroidetes (Flavobacteria) than with the other Proteobacteria. This shows that the dendrogram interpreting the DIs is not a phylogeny but adds information to it, contributing hopefully to the construction of a taxonomy at the highest ranks, when all cellular organisms are compared, perhaps more based on partitional than purely phylogenetical reasoning. Interestingly, each GC is genetically so consistent that this does not seem fortuitous. It appeared to us very likely that vertical gene transmission did play a great role in this internal coherence. Therefore, we propose that the seven GCs be the broadest frames for phylogenetic reconstructions. At the highest rank, ie, that of the domain, our results are strikingly compatible with the tripartite division of the Living World present in the TOL of Woese et al5; Archaea-Eucarya is, also with our method, the sister group of all the remaining known cellular organisms, ie, bacteria, but at the same time, our proposition introduces an uncertainty principle in the search for the phylogenetic relationships between all of the cellular organisms. We based our analysis on a universal albeit single molecule, and further studies on other molecules or parts of the genome are needed to check consistency and thus validate the method.
Some of the validating approaches, with appropriate modifications, could be applied to our method, eg, benchmarking.46,47 We could compare our method with other classificatory systems, eg, the Clusters of Orthologous Groups of proteins for prokaryotic or eukaryotic organisms (COG/KOG).48
