
Methods of combinatorial optimization to reveal factors affecting gene length.

Abstract

In this paper we present a novel method for genome ranking according to gene lengths. The main outcomes described in this paper are the following: the formulation of the genome ranking problem, presentation of relevant approaches to solve it, and the demonstration of preliminary results from prokaryotic genomes ordering. Using a subset of prokaryotic genomes, we attempted to uncover factors affecting gene length. We have demonstrated that hyperthermophilic species have shorter genes as compared with mesophilic organisms, which probably means that environmental factors affect gene length. Moreover, these preliminary results show that environmental factors group together in ranking evolutionary distant species.


Keywords: adaptation; clustering; dimension-reduction techniques; evolution of prokaryotes; factor analysis; machine learning; orthologs; ranking; rating

Year: 2012 PMID: 23300345 PMCID: PMC3528112 DOI: 10.4137/BBI.S10525

Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322

Introduction

Evolution of protein coding sequences remains an obscure issue in the molecular evolution of both prokaryotes and eukaryotes. In this paper we focus exclusively on prokaryotes. In prokaryotes, the main driving force in shaping gene length is point mutation, whereas in eukaryotes variations in gene length are often also caused by other factors,1 such as insertion of mobile elements (transposons and retrotransposons), a change to an alternative translation start,2 and so on. We will refer to protein coding sequences simply as “genes.” The majority of prokaryotic genes have homologs in a variety of genomes, and different homologs of a gene may vary significantly in length. Point mutations can change gene length in several ways. One possible path for lengthening (or shortening) a gene is a stop codon shift.3 A single nucleotide substitution can destroy the existing stop codon, leading to uninterrupted translation up to the next stop codon in the gene’s reading frame, or create a premature stop codon via a nonsense mutation. Furthermore, frameshifts caused by short indels near the 3′-end of a gene may lead to premature stop codons (shortening) or to translation past the existing stop codon (lengthening). A start codon drift can also occur: mutation of a start codon may shorten a gene, and a combination of mutations in the upstream region may lengthen it. Whether variations in gene length are neutral (some genes become longer than their predecessors while others become shorter, and the sizes of these fractions differ randomly from organism to organism) or depend on organismal evolution and adaptation is still an open question. This paper is an attempt to review several relevant methods. We hypothesize that ranking genomes according to the lengths of their genes is the most appropriate approach to revealing evolutionary driving factors. 
For example, we expect that hyperthermophiles should have the shortest genes. The genome ranking problem may be presented in the following way: given a set of annotated prokaryotic genomes, the task is to rank each prokaryotic genome according to its gene lengths. Until now, this problem has been addressed by a (naive) dimension-reduction technique (DRT), the averaging method: denote the length of gene k in genome i as $r_{ik}$; the rating of genome i is then obtained by averaging as $x_i = \frac{1}{|R_i|} \sum_{k \in R_i} r_{ik}$, where $R_i$ is the set of genes in the ith genome.4,5 This method has many drawbacks, especially pronounced for annotated genomes. For example, in Skovgaard et al,6 the following weakness of the method is mentioned: in microbial genomes, some annotated genes are actually not protein-coding genes, but rather open reading frames that occur purely by chance; as a result, too many short genes are annotated across genomes. Even taking only those genes that have orthologs in other genomes does not remove this main weakness. Below we present an example of averaging applied to a sparse data set. Suppose that we want to rank genomes A, B, and C according to gene families a, b, and c. From the first row of Table 1, we see that genome A contains genes a and c with lengths 800 and 100, respectively; genome A does not have gene b. An intuitive order of genomes implied by the gene lengths is A, C, B. On the other hand, ranking by average gene length gives the order B, A, C, which does not agree with the intuitive order: it contradicts the ordering of the genomes (B, C) according to gene b, and the orderings of the genomes (A, B) and (B, C) according to gene c.

Table 1

Following an example from Hochbaum et al.20

Genome	Gene family a	Gene family b	Gene family c	Average	A-ranking	Intuitive ranking
A	800		100	450	2	1
B		200	600	400	1	3
C	900	100	500	500	3	2
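
The disagreement between the averaging order and the per-family orders in Table 1 can be checked directly. The following is a minimal Python sketch, not part of the original analysis; the function names are chosen here for illustration, and an absent gene family is represented by None.

```python
# Table 1 data: gene lengths per genome; None marks an absent gene family.
lengths = {
    "A": {"a": 800, "b": None, "c": 100},
    "B": {"a": None, "b": 200, "c": 600},
    "C": {"a": 900, "b": 100, "c": 500},
}

def average_length(genes):
    """Mean length over the gene families present in one genome."""
    present = [v for v in genes.values() if v is not None]
    return sum(present) / len(present)

def averaging_order(table):
    """Genomes sorted from shortest to longest average gene length."""
    return sorted(table, key=lambda g: average_length(table[g]))

def contradicted_pairs(table, order):
    """Return (family, shorter genome, longer genome) triples whose
    within-family order is reversed by the given genome order."""
    position = {g: i for i, g in enumerate(order)}
    families = sorted({k for genes in table.values() for k in genes})
    out = []
    for k in families:
        for g in table:
            for h in table:
                lg, lh = table[g][k], table[h][k]
                if lg is None or lh is None:
                    continue  # the pair cannot be compared on this family
                if lg < lh and position[g] > position[h]:
                    out.append((k, g, h))
    return out

order = averaging_order(lengths)
print(order)                                         # ['B', 'A', 'C']
print(contradicted_pairs(lengths, order))            # three reversed pairs
print(contradicted_pairs(lengths, ["A", "C", "B"]))  # []
```

The averaging order reverses one pair on family b and two pairs on family c, whereas the intuitive order A, C, B reverses none.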

Hence, we present the corrected formulation of the genome ranking problem: given a set of prokaryotic genomes, a set of genes from a given gene family (GF), and the gene lengths of each genome-GF pair, the genome ranking problem is to rate each prokaryotic genome according to its gene lengths. The available data is sparse: prokaryotic genomes do not contain every GF. Figure 1 shows that less than 2.5% of all genomes contain all possible COGs.

Figure 1

Histogram of number of genomes contained in each COG.

The main outcomes described in this paper are the following: the formulation of the genome ranking problem, presentation of relevant methodologies to solve it, and demonstration of preliminary results from prokaryotic genomes ordering. This paper is organized as follows. Section 2 reviews ordering methods: finding a ranking vector by optimization of a sum of Kendall’s Tau rank-correlation coefficients and obtaining a ranking vector by dimension-reduction methods. Section 3 compares the prokaryotic genome ranking problem with the sports rating problem. Section 4 presents an example of genome ranking using a small dataset. Section 5 is the main contribution of this paper, that is, applicability of the genome ranking to uncover factors affecting gene length.

Review of Dimension-Reduction Techniques

In this section we focus on the unsupervised learning problem of revealing hidden factors. Unsupervised learning techniques can be classified as cluster-analysis or dimension-reduction techniques. Different applications of clustering techniques to the problem of genome classification are reviewed in Bolshoy and Volkovich.7 In particular, the problem of genome classification based on gene lengths is also studied in Bolshoy and Volkovich8 and Korenblat et al.9 The cluster-analysis approach addresses the following problem: given information about n objects, cluster these objects into groups so that objects belonging to the same cluster are similar in some sense. Cluster-analysis methods such as k-means, hierarchical clustering, and Gaussian mixture models aim to find a partition of objects so that the objects in each subset (cluster) share some common traits. Genome clustering methods define a partition of genomes into clusters, with genomes belonging to the same cluster sharing common properties, such as a phylogenetic signal, as shown in Korenblat et al,9 or common regulatory signals, as suggested in Kozobay-Avraham et al.10,11 Cluster analysis can substantially assist in revealing factors affecting gene length. Clustering is not a straightforward, natural way to reveal these factors; however, it may be a useful step in heuristically constructed rating procedures. Here we would like to present alternative, more natural approaches, that is, dimension-reduction techniques (DRTs). DRTs solve the following problem: given an n × k matrix, R, find the n × k′ matrix with k′ < k that best captures the content of the original matrix according to certain criteria. For the genome ranking problem, R is the matrix containing the gene lengths of k GFs in n prokaryotic genomes, and the output is an n × 1 vector that captures the relative “gene-length tendency” of the prokaryotic genomes. 
Some of the widely used DRTs are principal component analysis (PCA),12–14 factor analysis (FA),14 multidimensional scaling (MDS),15,16 and averaging. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the principal components) of the original variables with the largest variance. The first principal component is the linear combination with the largest variance; in this sense, it is the one-dimensional vector that best captures the information contained in the original data. Factor analysis assumes that the measured variables depend on some unknown common factors. The goal of FA is to uncover them. Typical examples include variables defined as various test scores of individuals, as such scores are thought to be related to a common “intelligence” factor. Here the measured variables are the gene lengths in a prokaryotic genome, and the indeterminate factors of interest are taxonomic, environmental, or genomic properties affecting the prokaryotic genome’s proclivity for shorter gene lengths. Given n items in a k-dimensional space and an n × n matrix of distances among the items, MDS produces a k′-dimensional, k′ < k, representation of the items such that the pairwise distances among the n points in the new space are similar to the distances in the original data. For PCA and FA, missing data pose serious problems.17–19 In the genome ranking problem, assuming full data is equivalent to assuming that every prokaryotic genome has representatives of all gene families. This assumption does not hold for a heterogeneous set of prokaryotic genomes. The PCA and FA methods require that the missing values are estimated and artificially imputed. While modern implementations of PCA and FA can handle imputed values, they require the imputed values to be consistent with an underlying stochastic model for the data. There is not enough data to fit an underlying stochastic model for the genome-ranking task. 
Thus, PCA and FA, in spite of being so popular in other fields, are not appropriate for solving our problem. We are grateful to Hochbaum for bringing to our attention the conceptually similar problem of rating customers by their adoption promptness.20 It seems that rating customers, as well as genome ranking, may be undertaken by the separation-deviation approach21 that was used for group decision making22 and country-credit risk rating.23 Another approach to the ranking problem is to solve an optimization problem using the Kendall tau rank correlation coefficient.24 This coefficient provides a measure of the degree of correspondence between two vectors. In particular, it assesses how well the order of the elements of the vectors is preserved. In Hochbaum et al,20 it was noted that finding the customer rating vector that maximizes the Kendall tau rank correlation coefficient is a good alternative to other optimization methods (reviewed below in 2.1–2.3), and the authors believe that it is appropriate to use the Kendall tau rank correlation coefficient for measuring how well the customer ratings recover the “true ranking.” However, in Hochbaum et al,20 it is also mentioned that finding a solution that maximizes the Kendall tau rank correlation coefficient is a difficult task because the problem is non-deterministic polynomial-time (NP)-hard.25 Another drawback of this approach is that it ignores the absolute values of dissimilarity between elements. In spite of these problems, we believe that using the Kendall tau rank-correlation coefficient is a superior way to reveal the true ordering of the genomes, at least in the case of a relatively small number of genomes.

Review of multidimensional scaling for genome ranking

Multidimensional scaling (MDS) aims to approximate given nonnegative dissimilarities, δij, among pairs of objects, i and j, by distances between points in an m-dimensional MDS configuration X. Here X, the configuration, is an n × m matrix with the coordinates of the n objects in $\mathbb{R}^m$. The most common function to measure the fit between the given dissimilarities, δij, and distances, dij(X), is STRESS, defined as $\sigma(X) = \sum_{i<j} w_{ij} \left(\delta_{ij} - d_{ij}(X)\right)^2$, where wij is a given nonnegative weight reflecting the importance or precision of the dissimilarity δij. Note that wij can be set to 0 if δij is unknown. dij(X) is a vector norm, defined as $d_{ij}(X) = \left(\sum_{s=1}^{m} |x_{is} - x_{js}|^q\right)^{1/q}$ with given parameter q ≥ 1. Usually, dij(X) is the L2 norm (q = 2) or the L1 norm (q = 1). In a useful MDS technique, the three-way MDS, for each pair of objects we are given K dissimilarity measures from different “replications” (different paralogs in our case). This technique is referred to as three-way MDS because the input is a three-dimensional matrix, [δijk], as opposed to the two-dimensional matrix in “classic” MDS. The objective function of three-way MDS is defined as26 $\sigma(X) = \sum_{k=1}^{K} \sum_{i<j} w_{ijk} \left(\delta_{ijk} - d_{ij}(X)\right)^2$. Unidimensional scaling (UDS) is the important one-dimensional case of MDS where the configuration X is an n × 1 matrix. Therefore UDS seeks to approximate the given dissimilarities by distances between points in a one-dimensional space. In our particular application, ranking prokaryotic genomes according to their gene length proclivity, the input data is a matrix R with rik giving the gene length (relative to GF) of genome i for GF k. This matrix is, in general, incomplete and has many missing elements. The objective is to assign each genome i to a scale x such that xi most accurately recovers the across-genome gene lengths. In order to solve our problem, we can set up the following three-way UDS problem:

$$\min_{x} \sum_{k=1}^{K} \sum_{i<j} w_{ijk} \left(|r_{ik} - r_{jk}| - |x_i - x_j|\right)^2. \quad (3)$$

Here the objective is that genomes with high dissimilarities have dissimilar gene lengths and should be placed “far from each other” in the desired scale x. 
We have to note the most serious drawback of formulating our genome ranking problem as the three-way UDS problem (3): by calculating the dissimilarities as |rik − rjk|, it ignores the so-called directionality of dominance, that is, the sign of (rik − rjk). There are ways to overcome this difficulty: some papers27,28 consider the case where the dissimilarities are given in a complete skew-symmetric matrix (ie, δij = −δji).
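
The loss of directionality can be illustrated numerically. The sketch below assumes a three-way UDS objective of the form Σk Σi<j wijk (|rik − rjk| − |xi − xj|)² with wijk = 1 when both genomes have GF k and 0 otherwise; the function name and the toy scale are illustrative choices, and the data are those of Table 1.

```python
from itertools import combinations

def uds_stress(r, x):
    """Three-way UDS objective with 0/1 weights: a pair (i, j) contributes
    for gene family k only when both genomes have a length for k."""
    total = 0.0
    for i, j in combinations(range(len(r)), 2):
        for rik, rjk in zip(r[i], r[j]):
            if rik is None or rjk is None:
                continue  # weight w_ijk = 0 for missing entries
            total += (abs(rik - rjk) - abs(x[i] - x[j])) ** 2
    return total

# Rows are genomes A, B, C; columns are gene families a, b, c (Table 1).
r = [[800, None, 100], [None, 200, 600], [900, 100, 500]]

x = [0.0, 500.0, 400.0]          # an arbitrary candidate scale
mirrored = [-v for v in x]       # the same scale with the order reversed

print(uds_stress(r, x))          # 90000.0
print(uds_stress(r, mirrored))   # 90000.0
```

A scale and its mirror image receive the same value, so this objective alone cannot tell which end of the scale corresponds to short genes.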

Genome ranking via the separation-deviation model

Consider a set of genomes, identified by the index i. Let rik be the median length of the GF k in a genome i; that is, if there is more than one representative of a GF in a given genome, the median length is chosen to represent the entire set of paralogs. Each of the n genomes is associated with a K-dimensional vector ri = (ri1,…,riK), recording the gene lengths related to the indexed GF. In the event that the genome does not have a gene, the corresponding entry in the vector is set to zero. A zero entry may be regarded as either “missing” or “absent”; statistically speaking, the second option is much more frequent. This means that the dissimilarity matrix [δij] is skew-symmetric and assumed to have no missing entries. (Nevertheless, zero elements rik will be treated differently from non-zeroes through ωijk values.) One of the important features of the separation-deviation (SD) model is that it takes a collection of pairwise comparisons rik − rjk between the objects (genomes) to be classified as an input. In this particular application, the SD-model uses the gene lengths to create pairwise comparisons among the different genomes. (In this sense, formulas (4) and (5) are similar to (3) and dissimilar to (6).) There are certain advantages and disadvantages of using δijk ≡ rik − rjk. For example, a single genome-GF pair can have several possibly conflicting pairwise comparisons. We are interested in differentiating between genomes that have shorter genes and genomes that have longer genes. In this respect, it is important to emphasize that we are not concerned with predicting the length a gene will have when a certain genome acquires it, that is, the absolute gene length in each genome. The main motivation for considering pairwise comparisons is as follows: while the specific lengths of different genes might have a high variation, the relative differences in the lengths might have less variation. 
So, for example, the genome of Helicobacter pylori has genes a and c with lengths 800 and 100, respectively, and the genome of Bacillus subtilis has the same genes with lengths 900 and 156, respectively. Considering the H. pylori gene lengths alone, it is impossible to determine whether this genome adopts short or long genes; however, comparing the genes of H. pylori with those of B. subtilis, we can determine that H. pylori has “shorter genes” than B. subtilis. Problems (4) and (5) of the SD-model use the following weights. Let us set ωijk equal to 1 if both genomes i and j have a gene k and set ωijk equal to 0 otherwise. Following Hochbaum et al,20 we set wijk equal to ωijk. Similarly, we set vik equal to 1 if genome i has a gene k, and set vik equal to 0 otherwise. In problems (4) and (5) the parameter M should be chosen in such a way that the separation penalty is the dominant term in the optimization problem; the deviation penalty is only used to choose among the feasible solutions with minimum separation penalty. The SD optimization model is efficiently solvable, resulting in a scalar value for each prokaryotic genome representing its overall score based on gene lengths. There are numerous advantages but also some concerns regarding usage of the SD-model for the genome ranking problem. First we mention the advantages of the SD-model: (1) the SD-model is an approach for unsupervised learning and hence does not require a training set; (2) the SD-model works well (without any need for data preprocessing) in situations where the information matrix is sparse; (3) the SD algorithm has polynomial time complexity; (4) the SD-model does not rely on specified distributions for different classes, and there is no requirement of any specific sample size. There are also concerns as to whether the following properties of the SD-model are good or bad for the purpose of genome ranking: (1) the SD-model can use subjective and unreliable judgments as an input and produce a misleading output with realistic-looking confidence levels; (2) the SD-model uses a collection of pairwise comparisons rik − rjk between the objects (genomes) as an input. In this manuscript we scrutinize another approach, presented in the following subsection, which does not take into account the value of the difference rik − rjk.
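
For orientation, a separation-deviation objective built from the weights wijk and vik and the parameter M defined above can be written in the following general shape: a separation term that penalizes disagreement between xi − xj and the pairwise comparison rik − rjk, plus a deviation term that penalizes the distance of xi from rik. The specific penalty functions below (quadratic in the first problem, absolute value in the second) are assumptions made for illustration only and should not be read as the exact formulas (4) and (5).

```latex
% Assumed illustrative forms; the penalty functions are not taken from the text.
% Quadratic penalties:
\min_{x} \; M \sum_{k=1}^{K} \sum_{i} \sum_{j} w_{ijk}
      \bigl( (x_i - x_j) - (r_{ik} - r_{jk}) \bigr)^2
    + \sum_{k=1}^{K} \sum_{i} v_{ik} \, (x_i - r_{ik})^2

% Absolute-value penalties:
\min_{x} \; M \sum_{k=1}^{K} \sum_{i} \sum_{j} w_{ijk}
      \bigl| (x_i - x_j) - (r_{ik} - r_{jk}) \bigr|
    + \sum_{k=1}^{K} \sum_{i} v_{ik} \, | x_i - r_{ik} |
```

In either form, a large M makes the first (separation) sum dominant, so the second (deviation) sum only selects among solutions that are already optimal, or nearly so, with respect to separation.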

Maximization of Kendall tau rank correlation coefficient

Kendall’s rank correlation coefficient τ provides a distribution-free test of independence and a measure of the strength of dependence between two variables a and b. If a is just [1, 2, …, n], then τ measures how well b is sorted. Let us denote υ = [1, 2, …, n]. A scale x is a permutation of υ. In our application (ranking prokaryotic genomes according to their gene lengths) the input is a matrix R with rik containing the gene length (relative to GF) of genome i for GF k. The goal is to assign each genome i to a scale x such that xi most accurately recovers the across-genome gene lengths. “Most accurately” here means achieving the maximum of

$$\sum_{k=1}^{K} \frac{2}{m_k (m_k - 1)} \sum_{i<j} \omega_{ijk}\, \mathrm{sign}(x_i - x_j)\, \mathrm{sign}(r_{ik} - r_{jk}), \quad (6)$$

where ωijk equals 1 if both genomes i and j have GF k and 0 otherwise, and mk is the number of non-zero elements rik for GF k. Because solving this optimization problem is NP-hard,25 heuristic methods, such as the simulated annealing procedure (SAP) or other metaheuristics, should be used. In the next section we present our implementation of SAP and the results of its application.
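
For a handful of genomes the maximization can be done by exhaustive search, which makes the objective concrete. The sketch below is written in Python for illustration; it assumes the objective is the sum over gene families of the Kendall tau coefficient between the scale x and the family's lengths, computed only over genomes that have the family (None marks a missing entry). The function names are hypothetical, and the data are those of Table 1.

```python
from itertools import combinations, permutations

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def tau_for_family(x, column):
    """Kendall tau between scale x and one gene-family column,
    restricted to the genomes in which the family is present."""
    present = [i for i, v in enumerate(column) if v is not None]
    m = len(present)
    if m < 2:
        return 0.0  # no comparable pair of genomes for this family
    s = sum(sign(x[i] - x[j]) * sign(column[i] - column[j])
            for i, j in combinations(present, 2))
    return 2.0 * s / (m * (m - 1))

def total_tau(x, r):
    """Sum of per-family tau values for scale x and length matrix r."""
    return sum(tau_for_family(x, column) for column in zip(*r))

def best_scale_brute_force(r):
    """Try every permutation of 1..n; feasible only for very small n."""
    n = len(r)
    return max(permutations(range(1, n + 1)), key=lambda x: total_tau(x, r))

# Rows are genomes A, B, C; columns are gene families a, b, c (Table 1).
r = [[800, None, 100], [None, 200, 600], [900, 100, 500]]

best = best_scale_brute_force(r)
print(best, total_tau(best, r))   # (1, 3, 2) 3.0
```

The best scale assigns positions 1, 3, and 2 to genomes A, B, and C, which is the intuitive order A, C, B. The number of permutations grows factorially with the number of genomes, which is why a heuristic search is needed beyond toy examples.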

Comparison of the Prokaryotic Genome Ranking Problem with the Problem of Sports Rating

In many sports there is an officially accepted world ranking. Tennis has its Association of Tennis Professionals (ATP) world ranking, golf has the Sony world ranking, and football (soccer, in the United States and Canada) has its Fédération Internationale de Football Association (FIFA) ranking. These rankings create some measure of a player’s or a team’s success and are sometimes used for seeding or prize money. The approaches to ranking differ; however, they may be useful for solving our problem. A genome is analogous to a player (or a team), a gene family (GF) is analogous to a tournament, and gene lengths correspond to tournament points. The ATP tennis ranking is based on tournament performance and bonus points. A player gains tournament points depending on how far he or she progresses in a tournament and on the quality of the tournament. For our problem, formula (6) would be changed to a weighted sum, $\sum_{k} \omega_k \tau_k(x)$, where $\tau_k(x)$ is the Kendall tau coefficient between the scale x and the gene lengths of GF k, and ωk reflects the “quality” of GF k (developing this approach is in progress). An exception to the ad hoc rating systems used in most sports is the Elo rating system utilized in chess. This purely quantitative rating system is based on exponential smoothing of a player’s rating depending on the actual proportion of victories compared with that expected given the ratings of the opponents. Various authors have suggested using exponential smoothing methods for rating in other sports, such as those described by Strauss and Arnold29 for racquetball and Clarke30 for squash. However, there are difficulties in ranking tennis players: the tournaments are played on different surfaces (grass, clay, synthetic, etc.). Most players have a favorite surface, and their performance level changes with different surfaces. We may see a direct analogy to protein families here, so the Elo rating may be less appropriate for the genome ranking problem.
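
The Elo update mentioned above can be written in a few lines. This is a minimal Python sketch of the commonly used chess form of the rule; the scale constant 400 and the step size k = 32 are conventional values assumed here, not parameters taken from this paper.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """New rating of A after one game; score_a is 1 (win), 0.5 (draw), 0 (loss).
    The rating moves by k times the gap between actual and expected score."""
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))

print(elo_expected(1500, 1500))     # 0.5
print(elo_update(1500, 1500, 1.0))  # 1516.0
```

The update depends on a single expected score per pairing, which is the property that becomes questionable when performance depends on the surface, or, by analogy, on the gene family.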

Ranking Genomes From a Small Dataset

Description of the database of clusters of orthologous groups of proteins

There are several ways to define a gene (protein) family.31–33 The presented algorithm is evaluated on a subset of the database of clusters of orthologous groups of proteins (COGs). The principles of the database construction are described elsewhere.34–37 The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The data in COGs are updated continuously as new prokaryotic genomes are sequenced. For example, at some point in the year 2012, proteins from a total of about 1500 complete genomes were assigned to more than 4500 COGs. COGs are widely used in comparative genomics by a number of research groups.38–42

100-genomes’ dataset

The results presented below are obtained using a subset containing randomly selected genomes from the National Center for Biotechnology Information (NCBI) genomic database and COGs present in these genomes. This small dataset R1 consists of 100 genomes out of 1496. It contains 9 archaeal and 91 eubacterial genomes. Table 2 contains a list of these genomes.

Table 2

List of genomes in a ranking order.

Rank	Domain	Name
1	Archaea	Archaeoglobus fulgidus dsm 4304
2	Bacteria	Thermotoga sp. rq2
3	Archaea	Thermococcus onnurineus na1
4	Archaea	Thermoplasma volcanium gss1
5	Bacteria	Thermotoga neapolitana dsm 4359
6	Archaea	Thermoplasma acidophilum dsm 1728
7	Archaea	Pyrococcus abyssi ge5
8	Bacteria	Aquifex aeolicus vf5
9	Bacteria	Campylobacter concisus 13826
11	Archaea	Thermococcus sibiricus mm 739
12	Archaea	Pyrococcus horikoshii ot3
12	Bacteria	Campylobacter curvus 525.92
13	Bacteria	Helicobacter felis atcc 49179
13	Bacteria	Dictyoglomus thermophilum h-6–12
15	Bacteria	Streptococcus pneumoniae p1031
16	Bacteria	Streptococcus agalactiae a909
18	Bacteria	Bacillus cereus atcc 14579
19	Bacteria	Mycoplasma pulmonis uab ctip
20	Bacteria	Bacillus cytotoxicus nvh 391–98
20	Bacteria	Listeria monocytogenes serotype 4b
20	Archaea	Methanosalsum zhilinae dsm 4017
21	Bacteria	Streptococcus agalactiae 2603v/r
22	Bacteria	Caldicellulosiruptor bescii dsm 6725
25	Bacteria	Bacillus amyloliquefaciens dsm 7
26	Bacteria	Mycoplasma fermentans m64
27	Bacteria	Rickettsia canadensis str. mckiel
28	Bacteria	Ureaplasma parvum serovar 3
29	Bacteria	Francisella sp. tx077308
29	Bacteria	Streptococcus zooepidemicus
30	Bacteria	Melissococcus plutonius atcc 35311
30	Bacteria	Mycoplasma leachii pg50
31	Bacteria	Bacillus pumilus safr-032
32	Bacteria	Pediococcus pentosaceus atcc 25745
34	Bacteria	Mycoplasma genitalium g37
35	Bacteria	Enterococcus faecalis v583
39	Bacteria	Legionella pneumophila str. paris
40	Bacteria	Natranaerobius thermophilus
40	Bacteria	Brachyspira pilosicoli 95/1000
40	Bacteria	Ruminococcus albus 7
41	Bacteria	Bacillus thuringiensis str. al hakam
41	Bacteria	Brevibacillus brevis nbrc 100599
41	Bacteria	Geobacter uraniireducens rf4
42	Bacteria	Geobacter lovleyi sz
44	Bacteria	Neisseria meningitidis 053442
44	Bacteria	Coxiella burnetii rsa 331
45	Bacteria	Mycoplasma pneumoniae m129
46	Bacteria	Maribacter sp. htcc2170
47	Bacteria	Laribacter hongkongensis hlhk9
48	Bacteria	Pseudogulbenkiania sp. nh8b
51	Bacteria	Zobellia galactanivorans
52	Bacteria	Dechloromonas aromatica rcb
53	Bacteria	Sodalis glossinidius str. ‘morsitans’
54	Bacteria	Erwinia amylovora atcc 49946
55	Archaea	Halalkalicoccus jeotgali b3
55	Bacteria	Escherichia coli bw2952
55	Bacteria	Gramella forsetii kt0803
55	Bacteria	Lactobacillus gasseri atcc 33323
55	Bacteria	Borrelia turicatae 91e135
60	Bacteria	Klebsiella variicola at-22
60	Bacteria	Candidatus riesia pediculicola usda
61	Bacteria	Salmonella enterica subsp. arizonae
61	Bacteria	Eubacterium eligens atcc 27750
62	Bacteria	Sphingobacterium sp. 21
63	Bacteria	Methylomonas methanica mc09
63	Bacteria	Dyadobacter fermentans dsm 18053
64	Bacteria	Yersinia enterocolitica subsp
67	Bacteria	Chlamydophila pneumoniae ar39
68	Bacteria	Cronobacter turicensis z3032
69	Bacteria	Spirochaeta smaragdinae dsm 11293
71	Bacteria	Yersinia pseudotuberculosis pb1/+
72	Bacteria	Pelodictyon phaeoclathratiforme
72	Bacteria	Tropheryma whipplei tw08/27
73	Bacteria	Xanthomonas oryzae
73	Bacteria	Desulfovibrio vulgaris
74	Bacteria	Dinoroseobacter shibae dfl 12
75	Bacteria	Acidiphilium cryptum jf-5
77	Bacteria	Thauera sp. mz1t
77	Bacteria	Magnetococcus marinus mc-1
78	Bacteria	Prosthecochloris aestuarii dsm 271
79	Bacteria	Anaerolinea thermophila uni-1
81	Bacteria	Sinorhizobium meliloti 1021
82	Bacteria	Bordetella petrii dsm 12804
84	Bacteria	Chloroflexus aggregans dsm 9485
84	Bacteria	Arcanobacterium haemolyticum
85	Bacteria	Corynebacterium glutamicum r
87	Bacteria	Cyanothece sp. pcc 7822
87	Bacteria	Starkeya novella dsm 506
87	Bacteria	Gluconacetobacter diazotrophicus
88	Bacteria	Rhodopseudomonas palustris dx-1
90	Bacteria	Rhodospirillum centenum sw
91	Bacteria	Xanthobacter autotrophicus py2
92	Bacteria	Mycobacterium leprae br4923
93	Bacteria	Intrasporangium calvum dsm 43043
94	Bacteria	Streptomyces scabiei 87.22
94	Bacteria	Streptomyces griseus subsp.
96	Bacteria	Burkholderia rhizoxinica hki 454
97	Bacteria	Haliangium ochraceum dsm 14365
98	Bacteria	Salinibacter ruber m8
99	Bacteria	Rothia dentocariosa atcc 17931
100	Bacteria	Bifidobacterium animalis

Figure 1 (blue bars) illustrates the input data. The complete set related to 1496 genomes consists of 4692 COGs, and, naturally, there are COGs that have no members in any genome of R1. Therefore, we removed those COGs that are present in fewer than 35% of the genomes of R1. After this filtering, our dataset contained 1409 COGs (Fig. 1, red bars).
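
The filtering step can be sketched as follows. This is an illustrative Python fragment, assuming that the 35% threshold refers to the fraction of genomes in which a COG is present; the function name, the genome labels g1 to g4, and the COG labels are made up for the example.

```python
def filter_cogs(cog_members, n_genomes, min_fraction=0.35):
    """Keep the COGs that occur in at least min_fraction of the genomes.

    cog_members maps a COG identifier to the set of genomes containing it.
    """
    return {
        cog: genomes
        for cog, genomes in cog_members.items()
        if len(genomes) / n_genomes >= min_fraction
    }

# Hypothetical toy input: four genomes and three made-up COG labels.
cog_members = {
    "COG_A": {"g1", "g2", "g3", "g4"},  # present in 100% of genomes
    "COG_B": {"g1", "g3"},              # present in 50%
    "COG_C": {"g2"},                    # present in 25%, filtered out
}

kept = filter_cogs(cog_members, n_genomes=4)
print(sorted(kept))   # ['COG_A', 'COG_B']
```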

Gene-length matrix preparation

The original data is transformed to the following format: to each (genome, COG) pair we assign one standardized protein length. For a given COG, each organism is represented by a calculated length, namely the median length of all paralogous proteins. For example, there are 8 paralogs of the high-affinity tryptophan transporter gene (mtr, sdaC, tdcC, tnaB, tyrP, yhaO, yhjV, and yqeG) in the genome of Escherichia coli K12, taxonomy id 83333. These paralogs, with lengths 403, 409, 414, 415, 423, 425, 429, and 443, appear as members of COG0814 (amino acid permeases). Only one triplet (83333, COG0814, 419), where 419 is the median of these lengths, would be included in the input data. The number of genes varies from genome to genome. Consequently, genomes of the dataset R1 are represented by different numbers of COGs, ranging from small mycoplasmas and ureaplasma, the smallest and simplest self-replicating organisms, with genome sizes from about 540 kb and fewer than 300 COGs, to long genomes with more than 900 COGs (see Fig. 2).
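
The reduction of paralogs to one length per (genome, COG) pair can be sketched with the Python standard library. The function name and the record layout (taxonomy id, COG, length) are illustrative assumptions; the numbers are those of the E. coli K12 example above.

```python
from collections import defaultdict
from statistics import median

def collapse_paralogs(records):
    """Map each (taxonomy id, COG) pair to the median length of its members.

    records is an iterable of (taxonomy id, COG identifier, protein length).
    """
    groups = defaultdict(list)
    for taxid, cog, length in records:
        groups[(taxid, cog)].append(length)
    return {key: median(values) for key, values in groups.items()}

# The eight COG0814 paralogs of Escherichia coli K12 (taxonomy id 83333).
paralog_lengths = [403, 409, 414, 415, 423, 425, 429, 443]
records = [(83333, "COG0814", n) for n in paralog_lengths]

print(collapse_paralogs(records))   # {(83333, 'COG0814'): 419.0}
```

With an even number of paralogs the median is the mean of the two middle values, here (415 + 423) / 2 = 419.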

Figure 2

Number of COGs that each genome contains. Genomes are ordered as in Table 2.

Simulated annealing

A variety of combinatorial optimization strategies are available for optimization (3), starting with a brute-force approach and continuing with deterministic and stochastic approaches.43 The strategy of local pairwise interchange (LOPI) does not guarantee global optimality, but it is very efficient,44 and, when enhanced by stochastic techniques, brings good results (manuscript in preparation). Simulated annealing45,46 is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. We used an acceptance probability function of the form $\alpha(\tau, \tau_{new}, t) = \min\left(1, e^{(\tau_{new} - \tau)/t}\right)$, where τ and τnew are the objective values of the current and the candidate ordering and t is the temperature. The algorithm was implemented in R using the mpiR package to enable parallel processing on the HPC Wales computer cluster.
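
A minimal single-process sketch of simulated annealing with pairwise interchange moves is given below. It is written in Python for illustration and is not the authors' parallel R implementation. It assumes that the objective is a sum of per-family Kendall tau coefficients over the genomes that have each family, that the acceptance probability is the standard Metropolis form min(1, exp((τnew − τ)/t)), and that the temperature follows a geometric cooling schedule; all parameter values are arbitrary.

```python
import math
import random
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def total_tau(x, r):
    """Sum over gene families of Kendall tau between scale x and the family's
    lengths, restricted to genomes that have the family (None = missing)."""
    total = 0.0
    for column in zip(*r):
        present = [i for i, v in enumerate(column) if v is not None]
        m = len(present)
        if m < 2:
            continue
        s = sum(sign(x[i] - x[j]) * sign(column[i] - column[j])
                for i, j in combinations(present, 2))
        total += 2.0 * s / (m * (m - 1))
    return total

def anneal(r, steps=2000, t_start=1.0, cooling=0.995, seed=0):
    """Search for a permutation x of 1..n with a high total_tau(x, r)."""
    rng = random.Random(seed)
    n = len(r)
    x = list(range(1, n + 1))
    score = total_tau(x, r)
    best, best_score = x[:], score
    t = t_start
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        x[i], x[j] = x[j], x[i]                  # pairwise interchange
        new_score = total_tau(x, r)
        delta = new_score - score
        # Metropolis rule min(1, exp(delta / t)); improvements always pass.
        accept = 1.0 if delta >= 0 else math.exp(delta / t)
        if rng.random() < accept:
            score = new_score
            if score > best_score:
                best, best_score = x[:], score
        else:
            x[i], x[j] = x[j], x[i]              # undo the rejected swap
        t *= cooling                             # geometric cooling
    return best, best_score

# Rows are genomes A, B, C; columns are gene families a, b, c (Table 1).
r = [[800, None, 100], [None, 200, 600], [900, 100, 500]]
best, best_score = anneal(r)
print(best, best_score)
```

Recomputing the full objective after every swap keeps the sketch short; for large matrices only the terms involving the two swapped genomes would need to be updated.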

Results and their interpretation

We performed a random selection of 1050 COGs twice (overlap was 777 COGs). For the two subsets of COGs, the resulting rankings are significantly correlated (Fig. 3); the Kendall tau correlation coefficient is 0.908 (2-sided P value < 0.001). Lowest and highest ranks agree the most, while genomes from the middle portion of the ordering show the most deviation.

Figure 3

Comparison of rankings produced by two different incomplete subsets of COGs.

Figure 2 shows a modest positive tendency to rank genomes that have a smaller number of COGs below those with a larger number. However, the number of COGs in a genome is not the major force that affects the rank of a genome. Table 2, which presents the resulting ranking of the genomes, aids in understanding the driving forces behind the ranking. Different taxonomic groups appear to be tightly clustered within the ordering. For example, the majority of archaeal genomes are placed at the top of the ranking table. Figure 3 and our calculations performed on other genome subsets (unpublished data) have led us to discuss at this stage only the most stable groups: the top (ranked 1–16 in Table 2) and the bottom (ranked 85–100). Among the top 16, 13 hyperthermophiles are the clear majority. There are both archaea and eubacteria in this group. Among sequenced archaea, there are many organisms living in extreme environments, such as volcanic hot springs, and they are all in the top group. Next to Archaea are Thermotogae bacteria. The name of this phylum is derived from the existence of many of these organisms at high temperatures. Bacteria of the Aquificae phylum live in harsh environmental settings. Representatives of the Dictyoglomi phylum are also extremely thermophilic. In addition to hyperthermophiles, two campylobacters and one helicobacter complete this group. There are no other campylobacters or helicobacters in R1. The two species of archaea that are not hyperthermophiles are placed at ranks 20 and 50. The opposite end of the spectrum is occupied by Actinobacteria.47 Many species of Actinobacteria are found in soil, including some of the most common soil life, playing important roles in decomposition and humus formation.

Conclusions

We have demonstrated the efficacy of the Kendall tau rank-correlation coefficient for solving the problem of genome ranking. The proposed method is stable, yielding meaningful results on a small test set of prokaryotic genomes. The presented results agree with our prior intuitive ordering, placing thermophilic species at the top of the ranking table. The simulated annealing approach, in combination with a parallel implementation of the developed algorithm, yielded an efficient method that can be scaled up to include all prokaryotic genomes. Maximization of an average Kendall tau rank correlation coefficient is suitable for assessing the ordering of genomes since it has a simple interpretation. For one column, maximization of tau means sorting. For the full input matrix (no missing values), such maximization is equivalent to sorting the table (the input matrix) to get the best average sorting. An extension of this interpretation to our sparse input data is not straightforward but seems very reasonable. Our approach can be used for various ranking problems in which each subject has multiple categories that need to be taken into account, for example, the ranking of athletes, consumer products, and so on. The method can be further improved by introducing different weights for categories based on their relative importance.
