Normalization of Complete Genome Characteristics: Application to Evolution from Primitive Organisms to Homo sapiens.

Kenji Sorimachi¹, Teiji Okayasu², Shuji Ohhira³.

Abstract

Normalized nucleotide and amino acid contents of complete genome sequences can be visualized as radar charts. The shapes of these charts depict the characteristics of an organism's genome. The normalized values calculated from the genome sequence theoretically exclude experimental errors. Further, because normalization is independent of both target size and kind, this procedure is applicable not only to single genes but also to whole genomes, which consist of a huge number of different genes. In this review, we discuss the applications of the normalization of the nucleotide and predicted amino acid contents of complete genomes to the investigation of genome structure and to evolutionary research from primitive organisms to Homo sapiens. Some of the results could never have been obtained from the analysis of individual nucleotide or amino acid sequences but were revealed only after the normalization of nucleotide and amino acid contents was applied to genome research. The discovery that genome structure was homogeneous was obtained only after normalization methods were applied to the nucleotide or predicted amino acid contents of genome sequences. Normalization procedures are also applicable to evolutionary research. Thus, normalization of the contents of whole genomes is a useful procedure that can help to characterize organisms.

Keywords: Amino acid composition; Chargaff’s parity rules; Cluster analysis; Evolution; Genome; Normalization; Nucleotide content; Phylogenetic trees.

Year: 2015 PMID: 26085808 PMCID: PMC4467310 DOI: 10.2174/1389202916666150119215716

Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236

INTRODUCTION

Molecular biology and paleontology have contributed to our understanding of evolution through phylogenetic trees based on nucleotide or amino acid sequence changes and on morphological changes detected in fossils, respectively. In these studies, evolutionary divergence has been evaluated as the degree of similarity or difference in gene structures based on nucleotide sequences, or in fossil shapes and sizes. Numerous phylogenetic trees have been drawn with single genes such as cytochrome c [1], tRNA [2-4], and rRNA [5], and fossils that indicate new species have been found in many different places. It has been almost two decades since the complete genome of Haemophilus influenzae was first analyzed in 1995 [6], and the first human genome draft was reported in 2001 [7, 8]. New data are continuously accumulating and, to date, the complete genomes of 285 eukaryotes, 2898 bacteria, and 173 archaea have been sequenced. However, common data analysis methods that focus on the nucleotide or amino acid sequences of single genes have been inadequate for evaluating the whole-genome characteristics of many species, although whole genomes of Homo sapiens, H. neanderthalensis, the Denisova specimen, and chimpanzees (Pan troglodytes) were recently compared [9]. Regarding the normalization of complete genomes consisting of a large number of different genes, the choice of target is particularly important: the four nucleotide contents are simple, whereas the 20 amino acid compositions are more complex because of many influencing factors. For example, the amino acid composition may differ according to transcriptional levels and analytical methods. To evaluate normalization methods for high-throughput RNA sequencing data analysis, Dillies et al. [10] compared the data obtained from real and simulated datasets of four species, H. sapiens, Entamoeba histolytica, Aspergillus fumigatus, and Mus musculus, and from seven different analyses. 
However, the ideal normalization method has not yet been obtained. Indeed, we encountered difficulties in our early studies regarding the gene expression of whole genomes consisting of many different genes [11]. To our knowledge, that article was the first report to show the amino acid composition predicted from a complete genome. Finally, we decided to assume that all genes are equally expressed in the entire genome, and that the data are coincidentally consistent between genomic and experimental analyses based on cell hydrolysates [11], as shown in a later section of this review. This assumption has been applied in all our subsequent studies to date. Normalization, which is independent of target size and type, is nevertheless a useful approach for characterizing whole genomes. In particular, in normalization based on nucleotide and amino acid sequences, experimental errors are theoretically excluded, and the normalized values, which are accurate, can fully represent the characteristics of the target organisms. In the normalization of short nucleotide or peptide sequences, the normalized values are equal when the ratios of each of the four nucleotides to the total nucleotides are equal among samples, even though their sequences differ. However, for polynucleotide, polypeptide, or genome sequences that consist of large numbers of nucleotides, the normalized values are never equal among samples. In addition, by visualizing the values, complicated phenomena can be easily understood [12]. For example, investigations have been carried out based on molecular structure changes, DNA structure [13], evolution-driven protein structural changes [14], drug resistance mechanisms [15], cancer-associated single nucleotide polymorphisms [16], and protein–protein interactions involving point mutations [17]. 
In this review, we introduce the methods that have been used for the normalization of complete genome characteristics such as nucleotide and amino acid contents, and show some of the results related to genome structure and evolution.
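The basic normalization described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' software: the function name `normalized_content` and the two short sequences are hypothetical. It shows that two short sequences with different orders but the same base ratios yield identical normalized values.

```python
from collections import Counter

def normalized_content(sequence, alphabet="ACGT"):
    """Return the fraction of each symbol of `alphabet` found in `sequence`."""
    seq = sequence.upper()
    # Count only symbols that belong to the alphabet (ambiguous bases are ignored).
    counts = Counter(ch for ch in seq if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return {symbol: counts[symbol] / total for symbol in alphabet}

# Two hypothetical short sequences: different order, same composition.
seq_a = "AACGTT"
seq_b = "TACGTA"

content_a = normalized_content(seq_a)
content_b = normalized_content(seq_b)
print(content_a)
print(content_b)
print("equal normalized values:", content_a == content_b)
```

The same function can be applied to predicted protein sequences by passing the 20 amino acid letters as `alphabet`.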

Homogeneous Genome Structure

We first investigated whether it was reasonable to assume that the characteristics of a whole genome, which consists of a huge number of different genes with different nucleotide sequences, can be expressed simply by normalization of the nucleotide and amino acid contents. We found that when a mouse complementary DNA (cDNA) dataset [18] consisting of 10,465 genes was divided into two nearly equal halves, the amino acid compositions predicted from the first 5, 10, 50, 100, 500, and 1,000 genes and from the full halves (5,232 and 5,233 genes), taken in the order listed in the dataset, were similar between the two halves and within each half (Fig. ). Indeed, the amino acid compositions based on more than 10 genes resembled each other and that of the whole dataset (Fig. ). This finding implies that genome structure is made up of putative small units that encode sequences with similar amino acid compositions, even though each gene in these small units has a different nucleotide sequence (Fig. ). To investigate this result further, the complete genome of Treponema pallidum, which consists of 1,031 genes [19], was divided into two nearly equal halves (Fig. ), and the amino acid composition of each half and of the complete genome was calculated. The amino acid compositions of each half and of the whole genome were very similar (Fig. ). Next, we divided all the genes into nine groups of 103 genes each and one group of 104 genes, and then divided the 10 groups into 20 half-size groups of about 52 genes each (Fig. ). The amino acid compositions for all the genes in each group were all similar to each other (Fig. and ). Thus, we have shown that the genomes were homogeneously constructed with putative units with similar amino acid compositions encoded in the open reading frames, even though each gene encodes a clearly different amino acid sequence. 
We found that the amino acid composition of the 3,236 amino acid residues encoded by the first 10 genes in one of the units resembled the amino acid composition encoded by all the genes in the genome. The amino acid compositions calculated from the first gene encoding 151–677 amino acid residues in each of the 10 units differed from each other and from that of the complete genome (Fig. ). However, the largest gene, which encoded 1,517 amino acid residues, also had an amino acid composition that was similar to that of the complete genome. In general, the size of the coding unit that showed similar amino acid composition was reported to be between 3,000 and 7,000 amino acid residues [20, 21]. This coding unit size is the same for bacteria, archaea, and eukaryotes. The fact that genome structure is homogeneous suggests that mutations occur synchronously over the whole genome. This idea led to the construction of phylogenetic trees based on different genes in the same organism. However, phylogenetic trees are not absolute because their form depends on the analytical methods and traits that are used to construct them.
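The grouping procedure used in this section can be sketched as follows. This is a hedged illustration under our own assumptions: the helper names (`split_in_order`, `amino_acid_composition`, `max_abs_difference`) are hypothetical, and genes are represented only by their index, since the actual sequences are not reproduced here. The sketch checks that splitting 1,031 genes in listed order gives nine groups of 103 and one of 104, and shows how the compositions of two groups could be compared.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def split_in_order(items, n_groups):
    """Split `items` into consecutive groups; the last group absorbs the remainder."""
    base = len(items) // n_groups
    groups = [items[i * base:(i + 1) * base] for i in range(n_groups - 1)]
    groups.append(items[(n_groups - 1) * base:])
    return groups

def amino_acid_composition(proteins):
    """Normalized amino acid composition pooled over a list of protein sequences."""
    counts = Counter()
    for protein in proteins:
        counts.update(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acid residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def max_abs_difference(comp_a, comp_b):
    """Largest per-residue difference between two normalized compositions."""
    return max(abs(comp_a[aa] - comp_b[aa]) for aa in AMINO_ACIDS)

# Gene indices stand in for the 1,031 genes and the 10,465 cDNA entries.
ten_groups = split_in_order(list(range(1031)), 10)
two_halves = split_in_order(list(range(10465)), 2)
print([len(g) for g in ten_groups])
print([len(h) for h in two_halves])

# Hypothetical toy proteins, only to show how two groups would be compared.
group_1 = ["MKVLA", "MAGLK"]
group_2 = ["MKLAV", "MGLAK"]
print(max_abs_difference(amino_acid_composition(group_1),
                         amino_acid_composition(group_2)))
```

With real data, each group would hold the predicted protein sequences of its genes, and a small `max_abs_difference` between groups would correspond to the similarity described above.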

Normalization of Nucleotide and Amino Acid Contents

Normalized nucleotide and predicted amino acid contents of complete genomes can be visualized on radar charts where the proportion of each of the four nucleotides or 20 amino acids is plotted on one of the radii or on bar graphs. The normalized values for complete genome, polynucleotide, and polypeptide sequences are always different for different species. Thus, organisms can be compared based on the shapes of the resultant polygons in the radar charts, the values of the 4 and 20 angle points, and their combinations, which can be used as traits in Ward’s clustering analyses [22]. A neighbor-joining method [23] yielded consistent results using amino acid and nucleotide contents as traits [24, 25]. The normalized value is indicated by the length of the bar in the graphs. It is also possible to compare organisms with only one single point value, where each single value represents the whole organism, using normalized nucleotide or amino acid contents calculated from the complete genome. Indeed, we found that the purine content of complete mitochondria sequences was enough to classify vertebrates into two groups, aquatic and terrestrial, although the pyrimidine content did not produce the same result [unpublished data]. Consistent classifications have been obtained from Ward’s clustering analysis using normalized amino acid or nucleotide contents as well as from a neighbor-joining method using 16S rRNA gene sequences [24, 25].
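As a rough sketch of how normalized values map onto a radar chart, the code below places each value on one of k evenly spaced radii and returns the polygon vertices. It also computes a single-value summary, the purine content (A + G). The function names and the nucleotide contents are hypothetical and chosen only for illustration; the layout convention (first radius at the top, clockwise order) is our assumption.

```python
import math

def radar_vertices(values):
    """Return (x, y) polygon vertices for values placed on evenly spaced radii."""
    k = len(values)
    vertices = []
    for i, value in enumerate(values):
        # Assumed convention: first radius points up, later radii go clockwise.
        angle = math.pi / 2 - 2 * math.pi * i / k
        vertices.append((value * math.cos(angle), value * math.sin(angle)))
    return vertices

def purine_content(content):
    """Single-value summary: fraction of purines (A + G)."""
    return content["A"] + content["G"]

# Hypothetical normalized nucleotide contents (they sum to 1).
content = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
vertices = radar_vertices([content[base] for base in "ACGT"])
for base, (x, y) in zip("ACGT", vertices):
    print(base, round(x, 3), round(y, 3))
print("purine content:", purine_content(content))
```

For amino acid compositions the same function would be called with 20 values, giving the 20 angle points of the polygon.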

Primitive Life Forms

It is impossible to evaluate directly the characteristics of primitive organisms that are now extinct. However, because natural rules are universal, some of their characteristics can be predicted based on rules obtained from extant organisms. Bacteria that are found as fossils seem to be most closely related to primitive life forms [11, 26]. The normalized amino acid composition of Escherichia coli shows a characteristic pattern that resembles a “starfish shape” on the radar chart (Fig. ). The protist Monosiga brevicollis, which is thought to be close to the origin of multicellular organisms, shows an amino acid composition pattern similar to that of E. coli on the radar chart (Fig. ). The normalized amino acid composition pattern of hagfish (Eptatretus burgeri), which may be close to the origin of vertebrates [27], also resembles that of E. coli. The amino acid composition pattern of human (Homo sapiens) also resembles that of E. coli (Fig. ). These patterns from the complete genomes imply that the encoded amino acid composition of genomes is similar from bacteria to eukaryotes. The amino acid compositions of 11 Gram-positive and 12 Gram-negative bacteria were classified into two groups, “S-type” represented by Staphylococcus aureus and “E-type” represented by Escherichia coli, based on their patterns of amino acid compositions predicted from the complete genome [28]. The amino acid composition based on the plasmid resembled that based on the parent complete genome [28]. This two-group classification was independent of Gram staining [28]. The cellular amino acid composition of bacteria was first analyzed using cell hydrolysates by Sueoka [29]. We also independently analyzed the cellular amino acid compositions of cell hydrolysates obtained from bacteria, archaea, and eukaryotes [26]. The basic “starfish shape” pattern was obtained from all these cell hydrolysates [26]. Differences in the “starfish shape” reflect the evolutionary changes that have occurred in genomes. 
It is curious that the amino acid composition of complete genomes resembles the amino acid composition obtained from cell hydrolysates, because the two values are obtained by different methods and from different sources. However, we subsequently found that this coincidence is a result of the homogeneous genome structure. In cell hydrolysates, because differences in gene expression levels are channeled within putative small units that show similar amino acid compositions after normalization, the total cellular amino acid composition is independent of the expression and transcription levels of each gene. Thus, the cellular and genome-normalized amino acid compositions were almost equal. It can be speculated that primitive organisms may have had amino acid compositions similar to those of extant organisms, because all extant organisms, including bacteria, archaea, and eukaryotes, have similar normalized amino acid compositions [11, 26]. This conclusion was obtained only after the nucleotide or amino acid contents of sequences were normalized across whole genomes, and it is independent of whether cell hydrolysates or complete genomes are used.
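The argument that homogeneous units make the cellular composition independent of expression levels can be illustrated with a toy calculation. This is a simplified sketch under our own assumptions, not the authors' analysis: each unit contributes to the hydrolysate in proportion to a hypothetical expression weight, and a three-residue alphabet with made-up values stands in for the 20 amino acids. When all units share the same composition, any set of weights gives the same mixture; when the units differ, the mixture depends on the weights.

```python
def mixed_composition(unit_compositions, expression_weights):
    """Expression-weighted average of the normalized compositions of several units."""
    total_weight = sum(expression_weights)
    residues = unit_compositions[0].keys()
    return {
        r: sum(w * comp[r] for comp, w in zip(unit_compositions, expression_weights))
        / total_weight
        for r in residues
    }

# Hypothetical homogeneous genome: every unit has the same composition.
unit = {"A": 0.5, "G": 0.3, "L": 0.2}
homogeneous = [dict(unit), dict(unit), dict(unit)]
print(mixed_composition(homogeneous, [1, 1, 1]))
print(mixed_composition(homogeneous, [10, 0.1, 3]))   # same result, different weights

# Hypothetical heterogeneous genome: the mixture now depends on the weights.
heterogeneous = [{"A": 0.5, "G": 0.3, "L": 0.2}, {"A": 0.2, "G": 0.3, "L": 0.5}]
print(mixed_composition(heterogeneous, [1, 1]))
print(mixed_composition(heterogeneous, [3, 1]))
```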

Rules Governing Genome Evolution

We can speculate about past and future evolutionary phenomena based on the rules that govern the evolution of present-day organisms. By plotting the normalized nucleotide contents separately for the four nucleotides, it was found that the contents of three nucleotides could be expressed by the content of the fourth using three linear regression equations [30]. This rule was applicable to complete genome sequences and to coding and non-coding regions, using chromosomal [30], chloroplast [31], or plant mitochondrial genomes [31]. The nucleotide relationships were heteroskedastic in animal mitochondria [31], while the relationships between homonucleotides and their analogues, and between heteronucleotides and their analogues, were linear and heteroskedastic, respectively, in animal mitochondria. These were divided into two groups, high G/C and low G/C content [32]. When the nucleotide contents of coding or non-coding regions were plotted against the nucleotide contents of complete genomes, two regression lines based on chloroplast and plant mitochondria sequences were found to be closed at the edge point, because the maximum nucleotide contents in either the coding or non-coding regions against the complete genome nucleotide contents were equal in the two organelles [31]. Normalization of nucleotides was first carried out by Chargaff [33] to characterize cellular DNA consisting of double strands, as Chargaff’s first parity rule [G = C, T = A, and (G + A) = (C + T)] [33]. Subsequently, the rule was expanded to single-stranded DNA forming double-stranded DNA, as Chargaff’s second parity rule [34]. The first rule was discovered before the Watson and Crick DNA model was proposed [35]. Based on the double-stranded DNA structure, Chargaff’s first parity rule is easily understood. 
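Why the first parity rule follows from base pairing can be checked with a short sketch. The strand below is hypothetical, and the helper names are ours. Counting bases over a strand together with its reverse complement necessarily gives G = C and A = T, whereas the single strand alone need not satisfy these equalities.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Return the complementary strand, read 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]

def duplex_counts(strand):
    """Base counts over both strands of the duplex formed by `strand`."""
    return Counter(strand.upper()) + Counter(reverse_complement(strand))

strand = "ATGGCGTACGGA"            # hypothetical single strand
single = Counter(strand)
duplex = duplex_counts(strand)

print("single strand:", dict(single))
print("duplex:", dict(duplex))
print("G == C:", duplex["G"] == duplex["C"])
print("A == T:", duplex["A"] == duplex["T"])
print("G + A == C + T:", duplex["G"] + duplex["A"] == duplex["C"] + duplex["T"])
```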
However, the second parity rule, based on similar nucleotide relationships in single-stranded DNA, has been a puzzle in molecular biology, because it is impossible to imagine how pairs of G and C, and of A and T, formed in the single DNA strand. Using normalization of nucleotide contents, this historic puzzle was solved mathematically based on homogeneous genome structure [36]. Now, the results obtained from a huge genome dataset based on interspecies characteristics have been found to be consistent with Chargaff’s rules [30, 37]. When the cellular nucleotide content of one nucleotide was fixed, the nucleotide contents of the other three nucleotides were naturally determined in chromosomes, chloroplasts, and plant mitochondria. Furthermore, not only the codon distributions but also the amino acid compositions were subsequently determined [30, 31]. In all organisms and cellular organelles except animal mitochondria, nucleotide contents can be expressed using dot plots with two duplicate points on the diagonal lines of a 0.5 square, which has been referred to as the diagonal genome universe [38].
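An idealized sketch may help to show why fixing one nucleotide content determines the other three. It assumes exact parity (G = C and A = T) and normalized contents that sum to 1, so that A = T = 0.5 - G; real genomes follow these relationships only approximately, and the published regression equations are not reproduced here. The G values and helper names are hypothetical. Under this assumption, a least-squares fit of C, A, and T against G recovers lines with slopes of 1, -1, and -1, which lie on the diagonals of a square with side 0.5.

```python
def contents_from_g(g):
    """Nucleotide contents implied by a given G content under exact parity."""
    if not 0.0 <= g <= 0.5:
        raise ValueError("under exact parity, G content must lie in [0, 0.5]")
    return {"G": g, "C": g, "A": 0.5 - g, "T": 0.5 - g}

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical G contents of five idealized genomes.
g_values = [0.15, 0.20, 0.25, 0.30, 0.35]
genomes = [contents_from_g(g) for g in g_values]

fits = {}
for base in "CAT":
    fits[base] = fit_line(g_values, [genome[base] for genome in genomes])
    slope, intercept = fits[base]
    print(f"{base} = {slope:.3f} * G + {intercept:.3f}")
```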

Origin of Life

All phylogenetic trees start from a single origin. However, this feature is derived from the algorithms that are used to calculate similarities in the nucleotide or amino acid sequences of organisms. Charles Darwin described evolution and natural selection in “On the Origin of Species by Means of Natural Selection, or, the Preservation of Favoured Races in the Struggle for Life”, published in 1859. Since then, the concepts of “a single origin” and “a common ancestor” of species have been generally accepted. Phylogenetic trees tend to represent the single origin of species. We have shown that all organisms started from a single origin based on the result that nucleotide relationship lines, which represent not only the divergence of organisms but also cell organelles, closed at a single point [32]. In addition, it has been shown that vertebrates diverged from the low G/C invertebrate group [39]. These findings based on nucleotide content relationship lines have been confirmed by phylogenetic trees based on Ward’s clustering analysis using 20 amino acid contents as traits [25, 40].

Phylogenetic Trees

Nucleotide and amino acid sequences reflect evolutionary divergence of organisms [1-5], and the changes in these sequences have been used to construct phylogenetic trees. Morphological changes in fossils have also been used to construct phylogenetic trees. Similarly, because normalized nucleotide and amino acid contents can be visualized in the form of radar charts, the resultant patterns can be compared computationally by multivariate analyses, using the four nucleotide contents or predicted 20 amino acid contents of chromosomal DNA as traits. The genome sequences of the organisms examined were classified into two groups, GC-rich and AT-rich [41]. It is possible to change the number of traits used; however, it should be noted that while increasing the number of traits yields better results, increasing the number of samples yields worse results because the probability of coincidence increases [25]. Using amino acid compositions predicted from complete mitochondrial genomes to draw the trees separated vertebrates into aquatic and terrestrial groups, whereas some exceptions were observed when nucleotide contents were used to draw the trees [24]. When 16S rRNA nucleotide sequences were used, the results were consistent with those obtained from mitochondrial amino acid compositions, with some minor differences [24, 25]. When the normalized amino acid compositions of vertebrate and invertebrate complete mitochondrial genomes were used, the groups were separated cleanly into two large clusters, vertebrates and invertebrates (Fig. ). In invertebrates, starfish (Echinodermata) formed a small cluster, and squids and octopus (Mollusca) were grouped into the same cluster. Vertebrates were further classified into three major clusters, mammals, fish, and a mixture of reptiles and amphibians. For example, primates (human, chimpanzee, and gorilla) formed a small cluster. Thus, close species fell into the same cluster, and did not split into different clusters. 
These results indicate that the normalized values of amino acid and nucleotide contents calculated from complete genomes could be used to characterize organisms and to construct phylogenetic trees. Our results based on complete mitochondrial genomes revealed that hemichordates (Balanoglossus carnosus and Saccoglossus kowalevskii) and Xenoturbella bocki, which were classified into the low G/C content invertebrate group, were closer to vertebrates than to invertebrates [40]. The protist (Monosiga brevicollis) and the cephalochordate (Branchiostoma belcheri) were classified into the low G/C and high G/C content invertebrate groups, respectively [32].
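Clustering with normalized contents as traits can be sketched with a small pure-Python version of Ward's criterion. This is a minimal illustration rather than the software used in the studies cited above: at each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares, n_a n_b / (n_a + n_b) times the squared distance between centroids. The taxon labels and two-trait profiles are hypothetical; in practice each profile would hold the 4 nucleotide or 20 amino acid contents.

```python
def ward_merge_cost(size_a, centroid_a, size_b, centroid_b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared_distance

def ward_clustering(labels, profiles):
    """Return the list of merges (members_a, members_b, cost) in the order performed."""
    clusters = [([label], list(profile)) for label, profile in zip(labels, profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_merge_cost(len(clusters[i][0]), clusters[i][1],
                                       len(clusters[j][0]), clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        members_i, centroid_i = clusters[i]
        members_j, centroid_j = clusters[j]
        n_i, n_j = len(members_i), len(members_j)
        # Size-weighted mean of the two centroids becomes the merged centroid.
        merged_centroid = [(n_i * x + n_j * y) / (n_i + n_j)
                           for x, y in zip(centroid_i, centroid_j)]
        merges.append((members_i, members_j, cost))
        del clusters[j]            # j > i, so delete j first to keep index i valid
        del clusters[i]
        clusters.append((members_i + members_j, merged_centroid))
    return merges

# Hypothetical trait profiles for four taxa forming two tight pairs.
labels = ["taxon_a1", "taxon_a2", "taxon_b1", "taxon_b2"]
profiles = [(0.10, 0.20), (0.11, 0.21), (0.30, 0.05), (0.32, 0.07)]
merges = ward_clustering(labels, profiles)
for members_a, members_b, cost in merges:
    print(members_a, "+", members_b, "cost =", round(cost, 5))
```

The order of the merges corresponds to the branching order of the resulting tree, with the merge cost playing the role of branch height.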

CONCLUSION

The normalized nucleotide and amino acid contents calculated from complete genomes can be accurately presented not only as numbers but also as shapes in radar charts. Normalization avoids the introduction of experimental errors and normalized nucleotide and amino acid contents have been used successfully to characterize genomes. For example, the concept of genome homogeneity based on putative small units was recognized based on the normalization of genome characteristics. In addition, phylogenetic trees that were constructed with the normalized nucleotide or amino acid contents predicted from complete genomes were found to give completely reasonable results. Thus, normalization of nucleotide or amino acid contents is a useful way of characterizing whole genomes consisting of numerous nucleotides and different genes.

CONFLICT OF INTEREST

The author(s) confirm that this article content has no conflict of interest.

30 in total

1. Codon evolution is governed by linear formulas.

Authors: K Sorimachi; T Okayasu
Journal: Amino Acids Date: 2008-01-08 Impact factor: 3.520

2. Organisms can essentially be classified according to two codon patterns.

Authors: T Okayasu; K Sorimachi
Journal: Amino Acids Date: 2008-04-01 Impact factor: 3.520

3. 16S ribosomal DNA amplification for phylogenetic study.

Authors: W G Weisburg; S M Barns; D A Pelletier; D J Lane
Journal: J Bacteriol Date: 1991-01 Impact factor: 3.490

4. Genetic code origins: tRNAs older than their synthetases?

Authors: L Ribas de Pouplana; R J Turner; B A Steer; P Schimmel
Journal: Proc Natl Acad Sci U S A Date: 1998-09-15 Impact factor: 11.205

Review 5. Applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady-state systems.

Authors: K C Chou
Journal: Biophys Chem Date: 1990-01 Impact factor: 2.352

6. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

7. Functional annotation of a full-length mouse cDNA collection.

Authors: J Kawai; A Shinagawa; K Shibata; M Yoshino; M Itoh; Y Ishii; T Arakawa; A Hara; Y Fukunishi; H Konno; J Adachi; S Fukuda; K Aizawa; M Izawa; K Nishi; H Kiyosawa; S Kondo; I Yamanaka; T Saito; Y Okazaki; T Gojobori; H Bono; T Kasukawa; R Saito; K Kadota; H Matsuda; M Ashburner; S Batalov; T Casavant; W Fleischmann; T Gaasterland; C Gissi; B King; H Kochiwa; P Kuehl; S Lewis; Y Matsuo; I Nikaido; G Pesole; J Quackenbush; L M Schriml; F Staubli; R Suzuki; M Tomita; L Wagner; T Washio; K Sakai; T Okido; M Furuno; H Aono; R Baldarelli; G Barsh; J Blake; D Boffelli; N Bojunga; P Carninci; M F de Bonaldo; M J Brownstein; C Bult; C Fletcher; M Fujita; M Gariboldi; S Gustincich; D Hill; M Hofmann; D A Hume; M Kamiya; N H Lee; P Lyons; L Marchionni; J Mashima; J Mazzarelli; P Mombaerts; P Nordone; B Ring; M Ringwald; I Rodriguez; N Sakamoto; H Sasaki; K Sato; C Schönbach; T Seya; Y Shibata; K F Storch; H Suzuki; K Toyo-oka; K H Wang; C Weitz; C Whittaker; L Wilming; A Wynshaw-Boris; K Yoshida; Y Hasegawa; H Kawaji; S Kohtsuki; Y Hayashizaki
Journal: Nature Date: 2001-02-08 Impact factor: 49.962

8. Studies on adaptability of binding residues and flap region of TMC-114 resistance HIV-1 protease mutants.

Authors: Rituraj Purohit; Vidya Rajendran; Rao Sethumadhavan
Journal: J Biomol Struct Dyn Date: 2011-08

9. Codon usage in Homo sapiens: evidence for a coding pattern on the non-coding strand and evolutionary implications of dinucleotide discrimination.

Authors: C Alff-Steinberger
Journal: J Theor Biol Date: 1987-01-07 Impact factor: 2.691

10. New 3D graphical representation of DNA sequence based on dual nucleotides.

Authors: Xiao-Qin Qi; Jie Wen; Zhao-Hui Qi
Journal: J Theor Biol Date: 2007-09-01 Impact factor: 2.691