Literature DB >> 25747773

A new method to cluster DNA sequences using Fourier power spectrum.

Tung Hoang¹, Changchuan Yin¹, Hui Zheng¹, Chenglong Yu², Rong Lucy He³, Stephen S-T Yau⁴.

Abstract

A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods.

Entities: Chemical Disease Gene Species

Keywords: Genes; Moments; Phylogenetic trees

Mesh：

Substances：
DNA, Mitochondrial
DNA

Year: 2015 PMID： 25747773 PMCID： PMC7094126 DOI： 10.1016/j.jtbi.2015.02.026

Source DB: PubMed Journal: J Theor Biol ISSN： 0022-5193 Impact factor: 2.691

Introduction

In the last few decades, several methods to classify genes and proteins have been proposed. Most of these methods are alignment-based in which optimal alignments are obtained by using selected scoring systems. These methods provide accurate classification of biological sequences, and several algorithms have been developed and successfully applied (Katoh et al., 2002, Edgar, 2004, Larkin et al., 2007). Nevertheless, their major drawback is due to significantly high time and memory consumption which is not suitable when a quick clustering needs to be made, for example on a new deadly virus (Marra et al., 2003). Henceforth, an alignment-free technique is a trending method that often gives much faster classification on the same dataset (Vinga and Almeida, 2003, Yau et al., 2008, Yu et al., 2011, Yu et al., 2013). For example, the k-mer method is among the most popular alignment-free methods. In order to measure how different the two sequences are, the set of k-mers, or subsequences of length k, in the two biological sequences are collected and then the evolutionary distance between them is computed (Vinga and Almeida, 2003, Pandit and Sinha, 2010). The k-mer method gives comparable results to alignment-based methods while being computationally faster (Blaisdell, 1989). Discrete Fourier Transform (DFT) is a powerful tool in signal and image processing. During recent years, DFT has been increasingly used in DNA research, such as gene prediction, protein coding region, genomic signature, hierarchical clustering, periodicity analysis (Tiwari et al., 1997, Anastassiou, 2000, Kotlar and Lavner, 2003, Vaidyanathan and Yoon, 2004, Afreixo et al., 2004, Afreixo et al., 2009, Tenreiro Machado et al., 2011). A DFT power spectrum of a DNA sequence reflects the nucleotide distribution and periodic patterns of that sequence, and it has been applied to identify protein coding regions in genomic sequences (Fukushima et al., 2002, Yin and Yau, 2005, Yin and Yau, 2007). In this paper we provide a new alignment-free method to classify DNA sequences based on the DFT power spectrum. The method is tested and compared to other state-of-the-art methods on various datasets for speed and accuracy.

Materials and method

Mathematical background

In signal processing, sequences in time domain are commonly transformed into frequency domain to make some important features visible. Via that transformation, no information is lost but some hidden properties could be revealed (Oppenheim et al., 1989). One of the most common transformations is Discrete Fourier Transform (Oppenheim et al., 1989). For a signal of length , the DFT of the signal at frequency k is for . The DFT power spectrum of a signal at frequency k is defined as Notice that by definition, . The DFT is often used to find the frequency components of a signal buried in a noisy time domain. For example, let y be a signal containing a 60 Hz sinusoid of amplitude 0.8 and a 140 Hz sinusoid of amplitude 1. This signal can be corrupted by a zero-mean random noise: The frequencies can hardly be identified by looking at the original signal as in Fig. 1 (a), but can be seen quite clearly when the signal is transformed to frequency domain by taking the DFT (Fig. 1(b)).

Fig. 1

Signal in time domain and frequency domain. (a) Signal Corrupted with Zero−Mean Random Noise. and (b) Single−Sided Power Spectrum

Moment vectors

For a DNA sequence composed of nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T), one typical way to get numerical representation is to use binary indicator sequences. The values of these sequences are either 0 or 1 indicating the absence or presence of a specific nucleotide. Specifically, for a given DNA sequence of length N, we define u of the same length as follows: are defined similarly. For example, for the sequence AGTCTTACGA, the corresponding indicator sequence of nucleotide A is u =1000001001. The DFT of u is U where for . The DFT power spectrum of u is PS where . The corresponding power spectrum for nucleotides is defined similarly. In general, for a gene sequence of length N, let be the number of nucleotide in that sequence, respectively. It is difficult to compare numerical sequences with different lengths, so we cannot cluster genes and genomes based on their power spectra sequences. One common approach to get over this problem is to use mathematical moments, e.g. for nucleotide A defines jth moment , where α be scaling factors. We want higher moments to converge to zero, i.e. essential information is kept in the first few moments. Thus, the chosen normalization factors α must reflect the nature of the sequences. Let us examine the binary indicator sequence of one nucleotide, A, in more detail. By Parseval׳s theorem (Oppenheim et al., 1989), The left side is actually N , i.e. the number of 1 in the A binary sequence. Hence, . So it is reasonable for α to be a power of N . As stated above, we want moments converge to zero gradually so that information loss is minimal, thus is the best choice (which will be verified later). Therefore With this normalization, . Our experimental results on various datasets proved that this is a good normalization. However, by re-examining the formula, we find that a slight modification can be made to get better outcomes. From Section 2.1, we know . Thus might hold large weight compared to PS(k) for other index k, which in turn leads to unnecessary memory consumption and computations for higher moments. Therefore, the terms are removed from the moments, and M 1 becomes . From the first modified moment, we know how to adjust the scaling factor for the -th moment in general, i.e. α must be a power of . Thus the new normalization is The fact that higher moments tend to zero is verified as follows: where . Notice that , thus it is obvious that . This fact also shows that is the best scaling factor, as for , so moments tend to zero very fast, thus much information can be lost, and for , so moments tend to zero much slower, thus more computational time and memory storage are needed. Additionally, due to symmetric property of DFT coefficients (Oppenheim et al., 1989), we only have to consider the first half of power spectrum. Therefore, the moments are improved as follows: The moments for other nucleotides are given similarly. Then the first few moments are used to construct vectors in Euclidean space. Our experimental results show that three moments are sufficient for an accurate clustering. Thus, each gene or genome sequence can be realized as a geometric point in the 12-dimensional Euclidean space, i.e. . Pairwise Euclidean distances between these points are calculated to cluster the gene or genome sequences. We call this Power Spectrum Moments method, or PS-M method. Power spectrum has been rarely used to study the phylogenetic analysis of DNA sequences because it is difficult to do comparison on sequences with different lengths. Zhao et al. (2011) came up with the idea of using normalized and centralized moments to compare sequences of different lengths. Motivated by that idea, we discovered a way to scale moments naturally, and only normalized moments are used to construct the Euclidean vectors. Discarding the first coefficient is another novelty of our PS-M method. The first coefficient holds significant weight and is highly dependent on the number of the respective nucleotide, which is redundant information. We also exclude the second half of the power spectrum due to its symmetry, and we give proof as to why the moment vectors converge to zero. Experimental results in the following section show that 12-dimensional moment vectors are enough to cluster genomes correctly.

Results

The PS-M method is tested on different datasets that range from small to medium size, as well as short to long genomes. In order to compare and analyze various genomic data, moment vectors are calculated and a matrix of Euclidean pairwise distances between those vectors is constructed. To cluster data into biological groups, a phylogenetic tree is built based on the distance matrix using the UPGMA method (Sokal, 1958). The running time of our PS-M method is compared to three state-of-the-art methods. The first is the alignment-based method ClustalW (Larkin et al., 2007) implemented on MEGA 6 (Tamura et al., 2013). The second is MAFFT (Katoh et al., 2002), another alignment-based method using fast Fourier transform. The last method is the alignment-free k-mer method (Vinga and Almeida, 2003) which is implemented on Matlab by Vinga and Almeida (2003). The running time is recorded in Table 1 .

Table 1

Running time comparison.

Datasets	Our method	MAFFT	k-mer	ClustalW
Mammals	4 s	NA	18 min 15 s	3 h 15 min
Influenza A	0.6 s	22 s	12 s	1 min 55 s
HRV	5 s	17 min 40 s	47 min 28 s	8 h 10 min
Coronavirus	6 s	NA	69 min 12 s	11 h 40 min
Bacteria	9 min 41 s	NA	NA	NA

Running time comparison. Even though reasonably accurate, ClustalW is significantly time consuming, as it aims to achieve the best possible results neglecting speed and efficiency. MAFFT runs much faster than ClustalW, but it sacrifices some accuracy in exchange. Moreover, errors occurred when we tested the datasets showing limitations of the MAFFT method. The errors are discussed in detail in the next section. Meanwhile, both k-mer and PS-M methods are alignment-free, and both attempt to improve speed and efficiency with little sacrifice of accuracy. Thus, our phylogeny results are directly compared to the k-mer method in this study. Phylogenies of the ClustalW method are included in the supplementary materials for reference (Figure S1–S4). Phylogenies of our method and the k-mer method are drawn using Matlab and MEGA 6 respectively (Tamura et al., 2013). Computations in this research are implemented on a PC with configuration of Intel Core i7 CPU 2.40 GHz and 8 GB RAM.

Mammals

The mitochondrial genome is not highly conserved and has a rapid mutation rate, thus it is suitable for examining the mode and tempo of molecular evolution (Brown et al., 1982). The PS-M method was tested on a mitochondrial DNA dataset of 31 mammalian genome sequences from GenBank, each sequence has a length range from 16,300 to 17,500 nucleotides. This dataset was analyzed by Deng et al. (2011) using the natural vector method. In our method, these 31 genomes are clustered correctly into 7 groups: Primates, Cetacea and Artiodactyla, Perissodactyla, Rodentia, Lagomorpha, Carnivore, and Erinaceomorpha (Fig. 2 ), which is consistent with the work of Deng et al. (2011). Meanwhile, as Fig. 3 illustrates, a majority part of Carnivore is misplaced by the k-mer method.

Fig. 2

Phylogenetic tree of 31 mammalian mitochondrial genomes by our method, drawn by Mega 6.

Fig. 3

Phylogenetic tree of 31 mammalian mitochondrial genomes by the k-mer method, k=5.

Phylogenetic tree of 31 mammalian mitochondrial genomes by our method, drawn by Mega 6. Phylogenetic tree of 31 mammalian mitochondrial genomes by the k-mer method, k=5.

Influenza A viruses

We test the efficiency of the PS-M method at a gene level. Influenza A viruses have been a major health threat to both human society and animals (Alexander, 2000). Influenza A viruses are single-stranded, segmented RNA viruses, which are classified based on the viral surface proteins hemagglutinin (HA or H) and neuraminidase (NA or N) (Webster et al., 1992). Eighteen H serotypes (H1 to H18) and eleven N (N1 to N11) serotypes of Influenza A viruses have been identified. Influenza A viruses are the most dangerous due to their wide natural host range, including birds, horses, swines, and humans; and they are known to have high degree of genetic and antigenic variability (Palese and Young, 1982, Garten et al., 2009). Influenza A viruses have caused many pandemic flues, some of the most lethal subtypes are H1N1, H2N2, H5N1, H7N3, and H7N9. These subtypes will be chosen to test the efficiency of our method. Specifically, we will examine segment 6 of Influenza A virus genome, which encodes for neuraminidase (NA). As illustrated by the phylogenetic trees, the virus A/turkey/VA/505477-18/2007(H5N1) is not correctly clustered in the k-mer method (Fig. 5). On the other hand, the dataset is classified correctly into biological groups by our proposed PS-M method (Fig. 4 ).

Fig. 5

Phylogenetic tree of 38 Influenza A viruses based on segment 6 by the k-mer method, k=4.

Fig. 4

Phylogenetic tree of 38 Influenza A viruses based on segment 6 by our method, drawn by Mega 6.

Phylogenetic tree of 38 Influenza A viruses based on segment 6 by our method, drawn by Mega 6. Phylogenetic tree of 38 Influenza A viruses based on segment 6 by the k-mer method, k=4.

Human rhinovirus

The efficiency of PS-M method was examined on a large size dataset, Human Rhinoviruses (HRV). HRV are associated with upper and lower respiratory diseases, particularly in patients with asthma. They are the predominant cause of the common cold and are responsible for more than one-half of cold-like illnesses. Past works have classified HRV into three genetically distinct groups within the genus Enterovirus and the family Picornaviridae (Palmenberg et al., 2009, Deng et al., 2011). In their paper, Palmenberg et al. (2009) clustered the complete HRV genomes, consisting of three groups HRV-A, HRV-B, HRV-C of 113 genomes and three outgroup sequences HEV-C. While the genomes were well classified, the running time was quite high due to the use of multiple sequence alignment to construct the evolutionary tree. By our method, the phylogenetic tree is reconstructed and the three groups of HRV are clearly separated and are distinguished from the outgroup HEV-C (Fig. 6 ). On the other hand, a small part of HRV-A is grouped on the same clade with HRV-B in the k-mer method (Fig. 7 ).

Fig. 6

Phylogenetic tree of HRV by our method, drawn by Mega 6.

Fig. 7

Phylogenetic tree of HRV by the k-mer method, k=4.

Phylogenetic tree of HRV by our method, drawn by Mega 6. Phylogenetic tree of HRV by the k-mer method, k=4.

Coronavirus

Coronaviruses, a genus of the Coronaviridae family, can cause a variety of severe diseases in respiratory and gastrointestinal tract (HCoV-229E and HCoV-OC43), or even life-threatening pneumonia (severe acute respiratory syndrome, or SARS). Thus, identification and classification of coronaviruses, especially human coronaviruses, are important and have been extensively studied so far. We cluster the set of 30 coronaviruses and 4 outgroup non-coronavirus sequences using the PS-M method. This coronaviruses dataset has been used by various authors before, e.g. Woo et al. (2005) and Yu et al. (2010). In van der Hoek et al. (2004) the authors put the newly discovered human coronavirus NL63 into the same group with human coronavirus 229E (group 1 in our work), which is separated from other two human coronaviruses groups, HCoV-OC43 (group 2 in our work) and SARS (group 4 in our work). In Woo et al. (2005) the authors agreed to include the newfound HCoV-HKU1 in group 2 but also claimed that it is a distinct member within the group and it is not very closely related to the rest of the group. Yu et al. (2010) noticed that HCoV-HKU1 is an individual coronavirus between SARS group 4 and the traditional group 2, thus they proposed HCoV-HKU1 belongs to a new group 5. In our phylogenetic tree below (Fig. 8 ), HCoV-NL63 and HCoV-229E are clustered into group 1, which is similar to the work of van der Hoek et al. (2004). Moreover, group 5 is close to both group 4 and group 2 but separated from them, thus the result is consistent with the work of Woo et al. (2005) and Yu et al. (2010). Based on Fig. 9 , the HCoV-NL63 is misplaced from group 1 by the k-mer method.

Fig. 8

Phylogenetic tree of coronavirus using our method, drawn by Mega 6.

Fig. 9

Phylogenetic tree of coronavirus using the k-mer method, k=6.

Phylogenetic tree of coronavirus using our method, drawn by Mega 6. Phylogenetic tree of coronavirus using the k-mer method, k=6.

Bacteria

Methods like multiple sequence alignment cannot handle large data. The bacteria genome, consisting of millions of base pairs, is a useful dataset for checking a method׳s efficiency. The PS-M method was tested on the set of 30 bacterial genomes, consisting of 8 families: Enterobacteriaceae, Bacilleceae, Rhodobacteriaceae, Spirochaetaceae, Desulfovibrionaceae, Clostridiaceae, Burkholderiaceae, and Staphylococcaceae. Most of the data has a length between 3 and 5 million base pairs. As illustrated by the phylogenetic tree in Fig. 10 , the dataset is well clustered into 8 groups by the PS-M method in about 10 min. None of the three state-of-the-art methods (ClustalW, MAFFT, k-mer) are able to cluster these genomes.

Fig. 10

Phylogenetic tree of 30 bacteria species by our method, drawn by Mega 6.

Discussion

EMBL-EBI (http://www.ebi.ac.uk/Tools/msa/mafft/) is among the most common online tool for MAFFT. We tested our datasets on this site but there were errors for coronavirus and mammals genomes, namely “Raw Tool Output” as it appeared on the EBI website. Since the input DNA sequences worked for all other methods, the error seems to be the tool׳s problem. Therefore, despite MAFFT׳s reasonable biological classification and fast processing time, it has some drawbacks that is likely to require some restriction on the underlying dataset. Another obvious disadvantage of this online tool is that it does not allow us to save the phylogenetic tree directly to the computer, and it does not have circular tree option for dataset with a large number of elements. Thus, we do not include the phylogeny of HRV and Influenza A generated by MAFFT in this paper due to these reasons, even though the method can reasonably classify these datasets. However, the running time of MAFFT for HRV and Influenza A is still recorded in Table 1. The k-mer method is alignment-free and it gives fairly good classification with an acceptable running time. But its major disadvantage is that we do not know which value of k will produce the best classification. In our work, we had to try various values of k and then choose the value with the best outcome. ClustalW is by far among the best alignment-based method. It often gives the most accurate biological classification, but its running time is significantly high in exchange. As a result, it is not able to perform well on large datasets. From the performance comparison (Table 1), we can see that our method is much faster than other methods while still providing comparable biological classification. Most notably, none of the above three methods were able to cluster large genome like bacteria, while our method could do well in about 10 min. The MATLAB code for our method can be accessed from: http://www.mathworks.com/matlabcentral/fileexchange/49026.

29 in total

10. Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector.

Authors: Shaojun Pei; Rui Dong; Rong Lucy He; Stephen S-T Yau
Journal: Comput Struct Biotechnol J Date: 2019-07-11 Impact factor: 7.271

A new method to cluster DNA sequences using Fourier power spectrum.

Introduction

Materials and method

Mathematical background

Moment vectors

Results

Mammals

Influenza A viruses

Human rhinovirus

Coronavirus

Bacteria

Discussion

1. Frequency-domain analysis of biomolecular sequences.

Review 2. Alignment-free sequence comparison-a review.

3. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

4. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions.

5. Spectrum and symbol distribution of nucleotide sequences.

6. A Fourier characteristic of coding sequences: origins and a non-Fourier approximation.

7. Clustal W and Clustal X version 2.0.

8. Prediction of probable genes by Fourier analysis of genomic sequences.

9. A novel clustering method via nucleotide-based Fourier power spectrum analysis.

10. Real time classification of viruses in 12 dimensions.

1. Periodic power spectrum with applications in detection of latent periodicities in DNA sequences.

2. Whole-genome single nucleotide variant distribution on genomic regions and its relationship to major depression.

3. A latent genetic subtype of major depression identified by whole-exome genotyping data in a Mexican-American cohort.

4. Using a Classifier Fusion Strategy to Identify Anti-angiogenic Peptides.

5. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity.

6. Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data.

7. Genomic signal processing for DNA sequence clustering.

8. A coevolution analysis for identifying protein-protein interactions by Fourier transform.

9. A novel strategy for clustering major depression individuals using whole-genome sequencing variant data.

10. Large-Scale Genome Comparison Based on Cumulative Fourier Power and Phase Spectra: Central Moment and Covariance Vector.