
Classification analysis of dual nucleotides using dimension reduction.

Zhao-Hui Qi, Jian-Min Wang, Xiao-Qin Qi.

Abstract

We introduce a new approach to investigate the dual nucleotide compositions of 11 Gram-positive and 12 Gram-negative eubacteria recently studied by Sorimachi and Okayasu. The approach first obtains a set of 16-dimensional vectors of dual nucleotides from the complete genome of an organism by means of the PN-curve. Each vector of the set corresponds to a single gene of the genome. We then reduce the 16-dimensional vector set to two dimensions by principal components analysis (PCA). The reduction avoids the possible loss of information caused by averaging all 16-dimensional vectors. Finally, we suggest a 2D graphical representation based on the 2-dimensional vectors to investigate the classification patterns among different organisms.

Substances:
DNA, Bacterial

Year: 2009 PMID: 19481099 PMCID: PMC7126582 DOI: 10.1016/j.jtbi.2009.05.011

Source DB: PubMed Journal: J Theor Biol ISSN: 0022-5193 Impact factor: 2.691

Introduction

Recently, graphical techniques have emerged as a powerful tool for the visualization and analysis of complicated biological systems. These methods can provide an intuitive picture and help people gain useful insights, and they have been used to deal with a wide variety of biological problems. For instance, various graphic schemes have been successfully used to study enzyme-catalyzed systems (King and Altman, 1956; Chou et al., 1979; Chou, 1980; Chou and Forsen, 1980, Chou and Forsen, 1981; Chou and Liu, 1981; Cornish-Bowden, 1979; Myers and Palmer, 1985; Zhou and Deng, 1984; Chou, 1989, Chou, 1990; Lin and Neet, 1990; Kuzmic et al., 1992; Andraos, 2008), protein folding kinetics (Chou, 1990, Chou, 1993), codon usage (Chou and Zhang, 1992; Zhang and Chou, 1994), HIV reverse transcriptase inhibition mechanisms (see Althaus et al., 1993a, Althaus et al., 1993b, Althaus et al., 1993c, as well as a review article, Chou et al., 1994), and base frequency distribution in the anti-sense strands (Chou et al., 1996). Recently, the images of cellular automata were also used to represent biological sequences (Xiao et al., 2005a, Xiao et al., 2005b), predict protein subcellular location (Xiao et al., 2006a, Xiao et al., 2006b), investigate HBV gene missense mutation (Xiao et al., 2005a, Xiao et al., 2005b) and HBV viral infections (Xiao et al., 2006a, Xiao et al., 2006b), predict protein structural classes (Xiao et al., 2008) and G-protein-coupled receptor functional classes (Xiao et al., 2009), and analyze the fingerprint of SARS coronavirus (Wang et al., 2005; Gao et al., 2006). 
Graphic approaches have also been used recently to examine the similarities and dissimilarities among the coding sequences of different species (Qi et al., 2007; Qi and Qi, 2007, Qi and Qi, 2009; Qi and Fan, 2007; Yao et al., 2006), analyze the network structure of the amino acid metabolism (Shikata et al., 2007), and study cellular signaling networks (Diao et al., 2007). Another useful graphic method, the radar chart, has been used to illustrate differences in amino acid compositions to predict protein subcellular localization (Chou and Elrod, 1999). Radar charts have also been applied in a similar manner to classify organisms (Sorimachi and Okayasu, 2004, Sorimachi and Okayasu, 2008a, Sorimachi and Okayasu, 2008b; Okayasu and Sorimachi, 2009). Quite recently, Sorimachi reported some interesting results based on graphical analyses (Sorimachi and Okayasu, 2008a, Sorimachi and Okayasu, 2008b; Sorimachi, 2009). In Sorimachi and Okayasu (2004), 23 eubacteria (11 Gram-positive and 12 Gram-negative) were classified into two groups, “S-Type” represented by Staphylococcus aureus and “E-Type” represented by Escherichia coli, based on the patterns of their amino acid compositions, which were determined from the complete genomes and displayed as radar charts. The study shows that amino acid compositions are useful values to investigate genomic structures and biological evolution. In this paper, we introduce a new approach to investigate the 23 eubacteria studied by Sorimachi and Okayasu (2004). The method consists of two parts: (i) the PN-curve, a 3D graphical representation of DNA sequences presented in our earlier study (Qi and Fan, 2007), and (ii) principal components analysis (PCA), a projection method that analyzes a data set and reduces it from a high-dimensional space. Here, we first obtain a set of 16-dimensional vectors of dual nucleotides from the complete genome of an organism by means of the PN-curve. Each vector of the set corresponds to a single gene of the genome. Then the 16-dimensional vector set is reduced to two dimensions by PCA. A 2D graphical representation based on the 2-dimensional vector set is proposed to investigate the classification patterns among different organisms.

Methods

PN-curve and its applications

The PN-curve is a 3D graphical representation of DNA sequences presented in our earlier study (Qi and Fan, 2007). It considers a 4×4 matrix in which the rows and columns are assigned to pairs of nucleotides (PNs). Given an arbitrary DNA primary sequence, the PN-curve is generated by a map φ defined on (AA), (AT), (AG), (AC),…, (CG) and (CC), which are the cumulative occurrence numbers of AA, AT, AG, AC,…, CG and CC, respectively, in the subsequence from the first base to the nth base in the sequence. Unlike the important geometric curves Z (Zhang, 1997) and Z′ (Zhang and Zhang, 2004), the PN-curve is not unique, because the 16 kinds of PNs can be arranged in 16! different ways. However, we are only interested in the PN-curve as a source of numerical parameters that may capture characteristics of DNA sequences, and the choice of arrangement has no impact on the extracted numerical parameters. For a given gene or genome, there is a corresponding cumulative PN-profile, or PN-curve; the two terms are used interchangeably in this paper. Note that the essence of the cumulative PN-profile is to display the variations of the PN content along a gene or genome. The derivative of the PN-curve with respect to the PN content is used to construct 16-component vectors related to the cumulative PN content. Each 16-component vector consists of the percentages of the 16 kinds of PNs: AA, AT, AG, AC,…, CG and CC. For a given gene, there is one 16-component vector that reveals the pattern of its PN composition. The genome of an organism, however, contains thousands of genes, and it is not reasonable to judge the cluster tendency of the patterns from a single pattern derived from a single gene. 
There are two usual ways to capture the tendency of a genome: (i) all genes are concatenated into one very long sequence, and a single 16-component vector from the cumulative PN-profile represents the cluster pattern of PN compositions; (ii) each gene generates its own 16-component vector by the PN-profile, and the average of all vectors illustrates the cluster pattern. However, both ways may hide some detail. For example, an average of 5 can arise from many possible pairs: 1 and 9, 2 and 8, 6 and 4, or 5 and 5. Obviously, the average may hide the distinction among these choices. We therefore use a dimension reduction method to uncover the detail hidden in the full set of vectors.
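The 16-component vector of a single gene can be illustrated with a minimal Python sketch. It is not the authors' program. It assumes that PNs are counted with an overlapping window that advances one base at a time, that the order of the 16 PNs is an arbitrary choice, and that fractions summing to 1 are returned instead of percentages. The names `PAIRS` and `pn_vector` are hypothetical.

```python
from itertools import product

# The 16 pairs of nucleotides (PNs); this ordering is an arbitrary choice.
PAIRS = ["".join(p) for p in product("ATGC", repeat=2)]


def pn_vector(sequence):
    """Return the fraction of each of the 16 PNs in one sequence.

    Assumptions (not stated in the source): pairs are counted with an
    overlapping window of step 1, and any pair containing a symbol other
    than A, T, G or C is skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        # Sequence too short (or no valid pair): return an all-zero vector.
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]
```

Applying `pn_vector` to every gene of a genome gives the full vector set; averaging those vectors corresponds to way (ii) above, which is the step that can hide differences between genes.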

Dimension reduction method based on principal components analysis

Principal components analysis (Jackson and Wiley, 1991) is a projection method that analyzes a data set and reduces it from a high-dimensional space to a few hidden variables while keeping information on its variability. PCA and its many extensions have been successfully applied to a range of problems (Costa et al., 2009; Du et al., 2006). Since the patterns in the 16-component vectors of the genome of an organism can be hard to find in a high-dimensional space, where graphical representation is not available, grouping the variability into a few variables is an important step to visualize and consequently uncover the information. In Wang et al. (2008), an effective dimension-reducing approach was introduced for predicting membrane protein types. Here, we reduce the 16-dimensional vectors to two dimensions by using PCA. Then we give the 2D graphical representation of the patterns of PN compositions and use it to observe intuitively the evolution patterns among different organisms. We now give a simple description of PCA. Assume that the mean of the samples $x$ of the space $\mathbb{R}^{n}$ is $\mu$. Write the singular value decomposition of the covariance matrix $\Sigma$ as $\Sigma = U \Lambda U^{T}$, where $U^{T} U = I$. Matrix $U$ is an orthogonal matrix. The diagonal matrix $\Lambda$ is made up of the eigenvalues of $\Sigma$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. The principal component transformation is $y = U^{T}(x - \mu)$. Then a new data set is obtained by applying this transformation to every sample. The mean and covariance matrix of $y$ are 0 and the diagonal matrix $\Lambda$, respectively. Now we ignore the components of lesser significance and keep only the most important components. The final data set then has fewer dimensions than the original. In this paper, the original data set has 16 dimensions, and the final data set has only two dimensions, obtained by choosing only the first two eigenvectors.
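As a hedged illustration of the reduction described above, the following Python sketch centers the data, diagonalizes the covariance matrix and projects onto the eigenvectors with the largest eigenvalues. It is not the authors' implementation. It assumes NumPy is available and that each row of the input is one gene's 16-component vector; the function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(data, n_components=2):
    """Project the rows of `data` onto the leading principal components.

    Returns the projected data and all eigenvalues in descending order.
    """
    x = np.asarray(data, dtype=float)
    mu = x.mean(axis=0)                     # sample mean of each component
    centered = x - mu                       # subtract the mean from every row
    cov = np.cov(centered, rowvar=False)    # covariance matrix (columns = variables)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]       # reorder so the largest comes first
    u = eigvecs[:, order[:n_components]]    # leading eigenvectors as columns
    return centered @ u, eigvals[order]
```

With `n_components=2`, each gene becomes one point in the plane, and the covariance of the projected points is diagonal with the two largest eigenvalues on the diagonal.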

Genomic data used for this study

Complete genome sequences were downloaded from NCBI GenBank (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). The analysis was performed on 11 Gram-positive and 12 Gram-negative bacteria (Sorimachi and Okayasu, 2004). The genome sequences used for this study are summarized in Table 1.

Table 1

Genomes used for this study.

Strain	Accession (GenBank)	RefSeq identifier	Total length (bp)	Genes
Staphylococcus aureus Mu50	BA000017.4	NC_002758	2,878,529	2775
Streptococcus pyogenes M1	AE004092.1	NC_002737	1,852,441	1811
Bacillus subtilis	AL009126.2	NC_000964	4,214,630	4225
Clostridium perfringens 13	BA000016.3	NC_003366	3,031,430	2786
Listeria monocytogenes	AL591824.1	NC_003210	2,944,528	2940
Mycoplasma pulmonis	AL445566.1	NC_002771	963,879	815
Mycoplasma genitalium	L43967.2	NC_000908	580,076	525
Mycoplasma pneumoniae	U00089.2	NC_000912	816,394	733
Ureaplasma urealyticum	CP001184.1	NC_011374	874,478	692
Mycobacterium tuberculosis	AE000516.2	NC_002755	4,403,837	4293
Mycobacterium leprae	AL450380.1	NC_002677	3,268,203	2770
Rickettsia prowazekii	AJ235269.1	NC_000963	1,111,523	886
Borrelia burgdorferi	AE000783.1	NC_001318	910,724	875
Campylobacter jejuni	CP000538.1	NC_008787	1,616,554	1707
Helicobacter pylori 26695	AE000511.1	NC_000915	1,667,867	1630
Helicobacter pylori J99	AE001439.1	NC_000921	1,643,831	1535
Escherichia coli	U00096.2	NC_000913	4,639,675	4467
Salmonella typhi	AL513382.1	NC_003198	4,809,037	4711
Vibrio cholerae	AE003852.1	NC_002505	2,961,149	2889
	AE003853.1	NC_002506	1,072,315	1119
Yersinia pestis	AL590842.1	NC_003143	4,653,728	4103
Neisseria meningitidis	AL157959.1	NC_003116	2,184,406	2065
Haemophilus influenzae	L42023.1	NC_000907	1,830,138	1789
Treponema pallidum	AE000520.1	NC_000919	1,138,011	1095

Applications

Calculations

Dual nucleotide contents at various base positions were calculated computationally by the PN-curve. For each sequence, the PN-curve generates a 16-dimensional vector of dual nucleotide contents. The complete genome of a species consists of thousands of genes, and each gene corresponds to one vector, so we obtain a vector set corresponding to the complete genome of the species. In order to visualize and uncover the information hidden in the vector set, we reduce the 16-dimensional vectors to two dimensions by using PCA. Then we give the 2D graphical representation of the patterns of PN compositions and use it to observe intuitively the evolution patterns among different organisms. We developed two programs. The first, a Perl program named “GenomePNs.pl”, generates the 16-dimensional vector set of a complete genome. Its input is a “*.ffn” file from NCBI GenBank, and its output is a file called “percentage_PNs.txt”. The second, “DimReductionAnaly.m”, is a MATLAB program that reduces the 16-dimensional vector set to two dimensions by the PCA algorithm and visualizes the result. Its input is “percentage_PNs.txt” and its output is a 2-dimensional graphic representation.
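The first step of this pipeline can be sketched in Python as follows. This is a stand-in under stated assumptions, not a port of “GenomePNs.pl”: it assumes the “*.ffn” file is a multi-record FASTA file with one record per gene, that PNs are counted with an overlapping window, and that the output holds one tab-separated row of 16 percentages per gene. The actual layout of “percentage_PNs.txt” is not given in the text, and all function names here are hypothetical.

```python
import io
from itertools import product

# Arbitrary but fixed ordering of the 16 pairs of nucleotides.
PAIRS = ["".join(p) for p in product("ATGC", repeat=2)]


def read_fasta(handle):
    """Yield (header, sequence) for each record of a FASTA-formatted stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pn_percentages(sequence):
    """Percentages of the 16 PNs, counted with an overlapping window (assumed)."""
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values()) or 1   # avoid division by zero for empty genes
    return [100.0 * counts[p] / total for p in PAIRS]


def write_vectors(fasta_handle, out_handle):
    """Write one tab-separated row per gene and return the number of rows."""
    rows = 0
    for _header, seq in read_fasta(fasta_handle):
        row = "\t".join(f"{value:.4f}" for value in pn_percentages(seq))
        out_handle.write(row + "\n")
        rows += 1
    return rows
```

Both arguments are ordinary text streams, so the function can be called with files opened by `open(...)` or, for a quick check, with `io.StringIO` objects.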

Results

The patterns of dual nucleotide compositions based on the complete genomes of the eubacteria in Table 1 are presented as 2-dimensional dot-cluster graphs in Fig. 1.

Fig. 1

2-dimension dot-matrix graphs of dual nucleotides determined from the complete genomes of the eubacteria of Table 1. As in Fig. 1 of Sorimachi and Okayasu (2004), blue represents Gram-positive bacteria; red represents Gram-negative bacteria; green represents mycoplasmas, which lack a cell wall. (For interpretation of the references to the color in this figure legend, the reader is referred to the web version of this article.)

To characterize the pattern of dual nucleotide compositions and to classify eubacteria, we focused on particular dot-clusters. A close look at Fig. 1 shows that the dots in some graphs fall mainly into two clusters, while those in other graphs fall mainly into one cluster. We now divide the grid coordinate system into two regions, I and II, as shown in Fig. 2. The concentration of the dot-clusters differs markedly between two main groups: “S-Type” represented by S. aureus and “E-Type” represented by E. coli. The concepts of “S-Type” and “E-Type” were introduced by Sorimachi and Okayasu (2004). Here, the dot-clusters are concentrated mainly inside Region I in “S-Type”, whereas they are concentrated mainly inside Region II in “E-Type”. The two groups are thus separated from each other by these dot-clusters.

Fig. 2

Grid coordinate system is divided into two regions: Region I (x∈[−0.05, 0.05], y∈[−0.05, 0.05]) and Region II (outside Region I).

By using the PCA algorithm, the eubacteria are classified into two groups, “S-Type” and “E-Type”, based on the dual nucleotide compositions calculated from the complete genomes. Within “S-Type”, the patterns of the dot-clusters still differ considerably from each other. According to the relative location of the main clusters, “S-Type” can be divided into two subgroups: (i) S. aureus Mu50, Str. pyogenes M1, B. subtilis, R. prowazekii, C. perfringens 13 and B. burgdorferi, whose dots are concentrated mainly in the left dot-clusters; and (ii) L. monocytogenes, C. jejuni, M. pulmonis, H. pylori J99, H. influenzae, M. genitalium and H. pylori 26695, whose dots are concentrated mainly in the right dot-clusters. Similarly, “E-Type” can also be divided into two subgroups. The first includes the bacteria whose dots fall into two clusters: E. coli, S. typhi, V. cholerae and Y. pestis. The second includes the bacteria whose dots are concentrated mainly in one dot-cluster: M. tuberculosis, M. leprae, N. meningitidis, T. pallidum, M. pneumoniae and U. urealyticum. These results show that the bacteria in Table 1 are grouped into two classes, “S-Type” (represented by S. aureus) and “E-Type” (represented by E. coli), based on their genomic structures. Okayasu and Sorimachi (2009) likewise reported that both “S-Type” and “E-Type” species can be classified further into subgroups based on amino acid compositions or codon usages. Similar results have also been obtained by Sorimachi and Okayasu (2004).
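The region-based reading of the dot-clusters can be made concrete with a small Python sketch. The square Region I, x∈[−0.05, 0.05] and y∈[−0.05, 0.05], is taken from the Fig. 2 caption. The numerical rule used here for "mainly inside Region I" (more than half of the points) is an assumption, since the text gives no threshold, and the function names are hypothetical.

```python
def in_region_one(x, y, half_width=0.05):
    """True when (x, y) lies in Region I, the square [-0.05, 0.05] x [-0.05, 0.05]."""
    return abs(x) <= half_width and abs(y) <= half_width


def fraction_in_region_one(points, half_width=0.05):
    """Fraction of 2D points that fall inside Region I."""
    points = list(points)
    if not points:
        raise ValueError("no points given")
    inside = sum(1 for x, y in points if in_region_one(x, y, half_width))
    return inside / len(points)


def label_type(points, threshold=0.5):
    """Label a genome from its projected points.

    Assumption: "mainly inside Region I" is read as a fraction above
    `threshold`; the source does not state a numerical cut-off.
    """
    return "S-Type" if fraction_in_region_one(points) > threshold else "E-Type"
```

The input is the list of 2-dimensional points obtained for one genome after PCA, one point per gene.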

Discussion

By using data on dual nucleotides derived from complete genomes, our method can be applied to analyze genomic structures and to provide their 2-dimensional graphic representation by the PCA algorithm. We used the method to investigate the dual nucleotide compositions of the 11 Gram-positive and 12 Gram-negative eubacteria in Table 1. The amino acid compositions of these eubacteria have been studied by Sorimachi and Okayasu (2004), whose results classified them into two groups, “S-Type” represented by S. aureus and “E-Type” represented by E. coli. Similarly, we classified these eubacteria into two groups, “S-Type” and “E-Type”, according to the particular dot-clusters derived from the complete genome data by PCA. Compared with the results of Sorimachi and Okayasu, our results differ for three eubacteria: H. influenzae, M. pneumoniae and U. urealyticum. Here, H. influenzae belongs to “S-Type”, while M. pneumoniae and U. urealyticum belong to “E-Type”. We do not think that these differences imply a failure of either our approach or Sorimachi and Okayasu's scheme. The proportion of differing assignments is low, only 3 of 23 eubacteria (13%), which is acceptable from a statistical viewpoint. The present study demonstrates that dual nucleotide compositions are useful values to investigate genomic structures and biological evolution. The more similar the structures of the dot-clusters are, the more similar the organisms are. That is to say, the structures of evolutionarily closely related species are more similar, while those of evolutionarily distant species differ more. Closely observing Fig. 1, we find that in “S-Type” the more similar species groups are the following: S. aureus Mu50, Str. pyogenes M1 and B. subtilis; L. monocytogenes, H. pylori J99, H. influenzae and H. pylori 26695. In “E-Type” the more similar species groups are the following: E. coli, S. typhi, Y. pestis; M. tuberculosis, M. leprae, T. pallidum and U. urealyticum. Similar results can be found in Fig. 1 of Sorimachi and Okayasu (2004).

References (37 in total)

1. Mixtures of tight-binding enzyme inhibitors. Kinetic analysis by a recursive rate equation.

Authors: P Kuzmic; K Y Ng; T D Heath
Journal: Anal Biochem Date: 1992-01 Impact factor: 3.365

2. Numerical characterization of DNA sequences based on digital signal method.

Authors: Zhao-Hui Qi; Xiao-Qin Qi
Journal: Comput Biol Med Date: 2009-03-03 Impact factor: 4.589

3. A new schematic method in enzyme kinetics.

Authors: K C Chou
Journal: Eur J Biochem Date: 1980-12

4. An extension of Chou's graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways.

Authors: G P Zhou; M H Deng
Journal: Biochem J Date: 1984-08-15 Impact factor: 3.857

5. Graphical rules for non-steady state enzyme kinetics.

6. Graphical rules for enzyme-catalysed rate laws.

7. GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes.

8. A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences.

9. Using cellular automata images and pseudo amino acid composition to predict protein subcellular location.

10. A novel fingerprint map for detecting SARS-CoV.
