Gerardo Mendizabal-Ruiz, Israel Román-Godínez, Sulema Torres-Ramos, Ricardo A Salido-Ruiz, Hugo Vélez-Pérez, J Alejandro Morales.
Abstract
Genomic signal processing (GSP) methods, which convert DNA data to numerical values, have recently been proposed; they make it possible to apply existing digital signal processing methods to genomic data. One of the most widely used methods for exploring data is cluster analysis, the unsupervised classification of patterns in data. In this paper, we propose a novel approach to cluster analysis of DNA sequences based on GSP methods and the K-means algorithm. We also propose a visualization method that facilitates inspection and analysis of the results and of possible hidden behaviors. Our results support the feasibility of using the proposed method to find and easily visualize interesting features of sets of DNA data.
Keywords: COX1; DNA; Genomic signal processing; K-means; Sequence clustering
Year: 2018 PMID: 29379686 PMCID: PMC5786891 DOI: 10.7717/peerj.4264
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
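The pipeline outlined in the abstract (DNA to numerical signal, then K-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the record only states that GSP methods and K-means are combined, and a later table mentions a sequence-to-PSD transform. The binary indicator mapping, the periodogram, the interpolation onto `n_bins` frequencies, the plain Lloyd-style K-means, and all function names below are assumptions chosen for illustration.

```python
import numpy as np

BASES = "ACGT"

def voss_indicators(seq):
    """Map a DNA string to four binary indicator signals, one per base (assumed mapping)."""
    codes = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    return np.stack([(codes == ord(b)).astype(float) for b in BASES])

def power_spectrum(seq, n_bins=64):
    """Sum the periodograms of the indicator signals and resample them to n_bins values."""
    x = voss_indicators(seq)
    x = x - x.mean(axis=1, keepdims=True)  # remove the DC component
    psd = (np.abs(np.fft.rfft(x, axis=1)) ** 2).sum(axis=0) / x.shape[1]
    # Interpolate onto a common frequency grid so that sequences of
    # different lengths give feature vectors of equal size.
    src = np.linspace(0.0, 0.5, psd.size)
    dst = np.linspace(0.0, 0.5, n_bins)
    return np.interp(dst, src, psd)

def assign(features, centroids):
    """Return the index of the nearest centroid (squared Euclidean distance) per row."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(features, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; empty clusters keep their previous centroid."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(features, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = features[labels == j]
            if len(members):
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(features, centroids), centroids

if __name__ == "__main__":
    # Toy sequences, not data from the paper.
    seqs = ["ACGT" * 50, "ACGT" * 50, "AAACCC" * 40, "AAACCC" * 40]
    feats = np.array([power_spectrum(s) for s in seqs])
    labels, _ = kmeans(feats, k=2)
    print(labels)
```

Resampling every spectrum to the same number of bins is one simple way to compare sequences of different lengths without alignment; other normalizations are equally plausible.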
Figure 1. Depiction of the structure of the proposed DNA cluster visualization plot for a value of k = 8.
Figure 2. Depiction of the selected organisms and their correspondence in the Tree of Life.
The respective hierarchic markings for each class are shown next to them. A detailed list of names and their KEGG entries is in Table S1. *Two organisms, Galdieria sulphuraria (gsl:JL72_p19) and Chondrus crispus (ccp:ChcroMp03), a thermoacidophilic Cyanoalgae and Irish moss, respectively, do not have a reported Kingdom in the Tree of Life and were both assigned the Kingdom label ‘Unknown’.
Figure 3. DNA clustering for marker COXI with k = 6.
Figure 4. DNA clustering for marker COXI with k = 17.
Figure 5. DNA clustering for marker COXI with k = 35.
Figure 6. K-means decomposition analysis.
Figure 7. Mean square distances between each cluster centroid and the sequences assigned to that cluster by the proposed method using k = 6, compared with the mean square distances between the centroid generated from the sequences of each of the six kingdoms and those sequences.
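The quantity compared in Figure 7 can be sketched as follows. This is an illustrative computation under stated assumptions, not the authors' code: each sequence is assumed to be represented by a numeric feature vector, the centroid of a group is taken as the mean of its members (which is what converged K-means centroids are), and the function name `mean_square_distances` is hypothetical. The same function applies to K-means labels and to kingdom labels.

```python
import numpy as np

def mean_square_distances(features, labels):
    """Mean squared Euclidean distance from each group's centroid to its members.

    `labels` may be K-means cluster indices or kingdom names; the centroid
    of a group is computed as the mean of its member feature vectors.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for group in np.unique(labels):
        members = features[labels == group]
        centroid = members.mean(axis=0)
        sq_dist = ((members - centroid) ** 2).sum(axis=1)
        result[group.item()] = float(sq_dist.mean())
    return result

if __name__ == "__main__":
    # Toy feature vectors, not data from the paper.
    feats = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
    print(mean_square_distances(feats, [0, 0, 1]))
    print(mean_square_distances(feats, ["a", "a", "b"]))
```

A lower value for a K-means cluster than for the corresponding kingdom grouping indicates that the cluster is more compact in feature space, which is the kind of comparison the figure reports.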
Performance comparison of STARS with respect to ClustalW and UCLUST on sets of different sizes of COXI sequences.
| Number of sequences | STARS, k = 6 (s) | STARS, k = 17 (s) | STARS, k = 35 (s) | ClustalW (s) | UCLUST (s) / No. of resulting clusters |
|---|---|---|---|---|---|
| 8 | 0.034 | – | – | 2.8 | 1 / 8 |
| 17 | 0.038 | 0.071 | – | 9.85 | 1 / 17 |
| 35 | 0.071 | 0.098 | 0.167 | 32.68 | 1 / 35 |
| 70 | 0.160 | 0.188 | 0.279 | 130.91 | 1 / 70 |
| 141 | 0.383 | 0.655 | 0.770 | 485.02 | 1 / 138 |
Performance comparison of STARS with respect to UCLUST on sets of different sequences.
| Dataset | Number of sequences | Average sequence length | Number of clusters | Sequence to PSD transform (s) | STARS (s) | UCLUST (s) |
|---|---|---|---|---|---|---|
| A | 31 | 16,695 | 6 | 1.92 | 0.70 | 1 |
| B | 38 | 1,407 | 4 | 0.27 | 0.03 | 1 |
| C | 116 | 7,154 | 17 | 3.05 | 1.69 | 1 |
| D | 34 | 27,567 | 12 | 3.38 | 1.23 | 12 |
| E | 30 | 3,361,393 | 8 | 392.98 | 281.70 | – |