Anderson R Santos, Marcos A Santos, Jan Baumbach, John A McCulloch, Guilherme C Oliveira, Artur Silva, Anderson Miyoshi, Vasco Azevedo.
Abstract
BACKGROUND: Singular value decomposition (SVD) is a powerful technique for information retrieval; it helps uncover relationships between elements that are not prima facie related. SVD was initially developed to reduce the time needed for information retrieval and analysis of very large data sets in the complex internet environment. Since information retrieval from large-scale genome and proteome data sets has a similar level of complexity, SVD-based methods could also facilitate data analysis in this research area.
Year: 2011 PMID: 22369633 PMCID: PMC3287580 DOI: 10.1186/1471-2164-12-S4-S11
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Dataset2 schema. Construction scheme for a set of species that were used as a negative control for the partitioning techniques.
Using the distance matrix that correctly separated the Aves cluster: K-Means compared to ASAP
| Number of species joined by clusters | | | | Linnaean taxonomy levels in common by clusters | | | | Common Linnaean taxonomy levels frequency (cLtlf) by cluster | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 100 | 100 | 100 | 100 |
| 14 | 27 | 14 | 25 | 10 | 9 | 10 | 9 | 140 | 243 | 140 | 225 |
| 4 | 1 | 4 | 7 | 12 | 13 | 12 | 8 | 48 | 13 | 48 | 56 |
| 7 | 17 | 4 | 7 | 8 | 8 | 10 | 8 | 56 | 136 | 40 | 56 |
| 2 | 2 | 9 | 2 | 13 | 11 | 9 | 12 | 26 | 22 | 81 | 24 |
| 6 | 1 | 4 | 4 | 10 | 13 | 10 | 10 | 60 | 13 | 40 | 40 |
| 5 | 1 | 6 | 4 | 9 | 13 | 10 | 12 | 45 | 13 | 60 | 48 |
| 11 | 1 | 8 | 1 | 9 | 13 | 8 | 13 | 99 | 13 | 64 | 13 |
This table displays the results of K-Means and ASAP on a cluster of 60 species obtained in the first ASAP clustering round, when 76 species were separated into clusters.
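The third column group appears to be the product of the first two: the Figure 5 caption describes the quality measure as the number of clustered elements multiplied by the common Linnaean taxonomy levels. A minimal Python sketch, assuming that reading, reproduces one cLtlf column from the first column of each group; the function name `cltlf` is illustrative and does not come from the paper.

```python
def cltlf(species_joined, levels_in_common):
    """Return species count times shared Linnaean levels, one value per cluster."""
    if len(species_joined) != len(levels_in_common):
        raise ValueError("both lists must describe the same clusters")
    return [n * lv for n, lv in zip(species_joined, levels_in_common)]


# First column of the "species joined" group and of the "levels in common"
# group in the table above (eight clusters each).
species = [10, 14, 4, 7, 2, 6, 5, 11]
levels = [10, 10, 12, 8, 13, 10, 9, 9]

scores = cltlf(species, levels)
print(scores)       # [100, 140, 48, 56, 26, 60, 45, 99]
print(sum(scores))  # 574
```

The printed values match the first cLtlf column of the table row by row, which supports the assumed definition for this table.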
Inferring quality from clustering methods
| Algorithm/ software | Rank | N | Min cLtlf | Max cLtlf | Mean cLtlf | cLtlf clusters sum (∑cLtlf) | cLtlf standard deviation (σ) | Linnaean clusters quality (∑cLtlf/σ) | Linnaean clusters quality gain (K09/K60)% | cLtlf median | Median clusters quality gain (K09/K60)% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AQBC-javaml | K09 | 8 | 32 | 180 | 71.25 | 570 | 52.27 | 10.90 | 49.58% | 42.50 | 26.87% |
| AQBC-javaml | K60 | 8 | 0 | 220 | 64.38 | 515 | 70.64 | 7.29 | | 33.50 | |
| EM-weka | K09 | 8 | 40 | 120 | 70.12 | 561 | 31.53 | 17.79 | 48.99% | 57.00 | 1.79% |
| EM-weka | K60 | 8 | 16 | 160 | 70.25 | 562 | 47.06 | 11.94 | | 56.00 | |
| Kmeans-weka | K09 | 8 | 30 | 180 | 69.38 | 555 | 46.70 | 11.88 | 9.26% | 61.50 | -2.38% |
| Kmeans-weka | K60 | 8 | 16 | 180 | 69.88 | 559 | 51.39 | 10.88 | | 63.00 | |
| Kmeans-R | K09 | 8 | 40 | 140 | 71.62 | 573 | 34.48 | 16.62 | 9.21% | 62.00 | 6.90% |
| Kmeans-R | K60 | 8 | 26 | 140 | 71.75 | 574 | 37.72 | 15.22 | | 58.00 | |
| K-Medoids-R | K09 | 8 | 24 | 160 | 70.12 | 561 | 44.37 | 12.64 | 15.92% | 60.00 | 13.21% |
| K-Medoids-R | K60 | 8 | 26 | 180 | 68.50 | 548 | 50.24 | 10.91 | | 53.00 | |
| MDBC-weka | K09 | 8 | 30 | 180 | 69.38 | 555 | 46.70 | 11.88 | 9.26% | 61.50 | -2.38% |
| MDBC-weka | K60 | 8 | 16 | 180 | 69.88 | 559 | 51.39 | 10.88 | | 63.00 | |
| ASAP-in house | K09 | 8 | 13 | 225 | 70.25 | 562 | 67.68 | 8.30 | 27.51% | 52.00 | 197.14% |
| ASAP-in house | K60 | 8 | 13 | 243 | 69.12 | 553 | 84.92 | 6.51 | | 17.50 | |
All evaluated partitioning algorithms showed improved Linnaean clusters quality when run on the optimized distance matrix created with the best-performing kdc parameters tested.
Figure 2. kdcSearch algorithm schema. Main procedures, datasets and products. Multiple rectangles mean recurring calls.
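The two derived columns can be checked from the table headers, which give the Linnaean clusters quality as ∑cLtlf/σ and the gain as a K09/K60 ratio. The sketch below uses the ASAP rows; reading the gain as (K09/K60 − 1) × 100 is an assumption that appears consistent with the tabulated percentages, and the function names are illustrative.

```python
def lcq(cltlf_sum, cltlf_sd):
    """Linnaean clusters quality: sum of cLtlf divided by its standard deviation."""
    return cltlf_sum / cltlf_sd


def gain_percent(k09, k60):
    """Relative gain of the K09 value over the K60 value, in percent (assumed form)."""
    return (k09 / k60 - 1.0) * 100.0


# ASAP rows of the table: (sum of cLtlf, standard deviation) for each rank.
asap_k09 = lcq(562, 67.68)
asap_k60 = lcq(553, 84.92)

quality_gain = gain_percent(asap_k09, asap_k60)
median_gain = gain_percent(52.00, 17.50)

print(round(asap_k09, 2), round(asap_k60, 2))  # 8.3 6.51
print(round(quality_gain, 2))                  # 27.51
print(round(median_gain, 2))                   # 197.14
```

Because the inputs are the rounded values printed in the table, small differences in the last digit are possible for other rows.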
Linnaean taxonomy levels
| Linnaean Taxonomy levels | ||
|---|---|---|
| 14 | ||
| 13 | ||
| 12 | ||
| 11 | ||
| 10 | ||
| 9 | ||
| 8 | ||
| 7 | ||
| 6 | ||
| 5 | ||
| 4 | ||
| 3 | ||
| 2 | ||
| 1 | ||
Linnaean taxonomy levels used to classify the species in this paper. The numbers denote an increasing degree of nomenclature specialization.
Function Finalize: sample data
| 06clusters k03 | 06clusters k06 | 08clusters k06 | 08clusters k09 | 08clusters k45 | 12clusters k12 | 14clusters k18 | 14clusters k21 | 14clusters k60 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 100 | | 100 | 100 | 100 | 100 | 100 | | | |
| | | 200 | | 243 | 216 | 240 | 250 | 220 | | | |
| | | 56 | | 13 | 13 | 13 | 13 | 13 | | | |
| | | 13 | | 136 | 64 | 90 | 88 | 112 | | | |
| | | 100 | | 22 | 22 | 22 | 22 | 22 | | | |
| | | 45 | | 13 | 13 | 13 | 13 | 20 | | | |
| | | 24 | | 13 | 16 | 16 | 13 | 13 | | | |
| | | 40 | | 13 | 24 | 24 | 13 | 24 | | | |
| | | | | | 40 | 40 | 30 | 13 | | | |
| | | | | | 48 | 13 | 13 | 13 | | | |
| | | | | | 13 | 13 | 13 | 13 | | | |
| | | | | | 13 | 13 | 13 | 13 | | | |
| | | | | | | 13 | 13 | 13 | | | |
| | | | | | | 13 | 13 | 13 | | | |
The cLtlf statistic for all of the partitionings of species obtained with nine kdc values that separate the positive control group in the function Finalize of the kdcSearch algorithm, along with three kdc values as a negative control (-).
Function Finalize: sample statistics
| ASAP/ Clusters | Rank | N | Min cLtlf | Max cLtlf | Mean cLtlf | cLtlf clusters sum (ΣcLtlf) | cLtlf standard deviation (σ) | Linnaean clusters quality (ΣcLtlf/σ) | cLtlf median |
|---|---|---|---|---|---|---|---|---|---|
| 08clusters | K06 | 8 | 13 | 200 | 72.25 | 578 | 60.65 | 9.53 | 50.50 |
| 08clusters | K45 | 8 | 13 | 243 | 69.12 | 553 | 84.92 | 6.51 | 17.50 |
| 12clusters | K12 | 12 | 13 | 216 | 48.50 | 582 | 59.10 | 9.85 | 23.00 |
| 14clusters | K18 | 14 | 13 | 240 | 44.50 | 623 | 63.29 | 9.84 | 14.50 |
| 14clusters | K21 | 14 | 13 | 250 | 43.36 | 607 | 66.12 | 9.18 | 13.00 |
| 14clusters | K60 | 14 | 13 | 220 | 43.00 | 602 | 60.68 | 9.92 | 13.00 |
Comparison of the Lcq values and the ΣcLtlf medians for partitionings of the species obtained in Table 4.
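Each row of these statistics can be recomputed from the cLtlf values of one partitioning. The following sketch is an illustration, not the paper's code: it takes the eight cLtlf values of the sample data that sum to 553 and assumes σ is the sample (n − 1) standard deviation, which appears to reproduce the 08clusters K45 row. The function name `summarize` is hypothetical.

```python
import statistics


def summarize(cltlf_values):
    """Summary statistics for the cLtlf values of one partitioning."""
    total = sum(cltlf_values)
    # Assumption: sigma is the sample (n - 1) standard deviation.
    sd = statistics.stdev(cltlf_values)
    return {
        "n": len(cltlf_values),
        "min": min(cltlf_values),
        "max": max(cltlf_values),
        "mean": statistics.mean(cltlf_values),
        "sum": total,
        "sd": sd,
        "lcq": total / sd,  # Linnaean clusters quality, sum / sigma
        "median": statistics.median(cltlf_values),
    }


# Eight cLtlf values from the sample data whose sum is 553.
values = [100, 243, 13, 136, 22, 13, 13, 13]
row = summarize(values)

for name, value in row.items():
    print(name, round(value, 2))
```

With these inputs the rounded standard deviation, quality and median come out as 84.92, 6.51 and 17.5, in line with the tabulated row; the population standard deviation (`statistics.pstdev`) would give a smaller σ and a different quality value.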
Figure 3. Exploring the number of species in the … The number of species grouped into the Aves cluster as a function of rank value and number of clusters. Ordinates are multiplied by the respective maximum Linnaean taxonomy levels shared by species in Figure 5.
Figure 4. Exploring … The number of Linnaean levels shared by all species is plotted against rank value and number of imposed clusters. Ordinates are multiplied by the respective number of species that produced Figure 5.
Figure 5. Determining the best algorithm parameters. Aves cluster quality as a function of rank value and different numbers of clusters. The number of clustered elements multiplied by the maximum number of common Linnaean taxonomy levels shared between species gives the quality measure.
Figure 6. Determining the best algorithm parameters at the first algorithm recurrence step. Aves cluster quality measured with fewer species than in dataset2. It is now possible to cluster the Aves species separately, and the algorithm adjustment that best fits this cluster is preferred. Higher curves do not represent better quality.
Figure 7. 60 species from the Stuart data set. Unrooted tree for the 60-species data set, generated from a distance matrix created with the ASAP algorithm. The original algorithm from this paper provided the distance matrix. Blue labels denote clusters.
Eight clusters from the 60-species data set
| Cluster | Number of species joined | Linnaean taxonomy levels in common | Deepest Linnaean taxonomy level |
|---|---|---|---|
| 1 | 10 | 10 | |
| 2 | 25 | 9 | |
| 3 | 7 | 8 | |
| 4 | 7 | 8 | |
| 5 | 2 | 12 | |
| 6 | 4 | 10 | |
| 7 | 4 | 12 | |
| 8 | 1 | 13 | |
Eight clusters created from the first recurrence algorithm execution calibrated with a rank value of nine. Species were grouped according to their deepest evolutionary relatedness based on Linnaean taxonomy levels. Clusters 2, 5 and 8 belong to the mammalian class.