Loai Abdallah, Malik Yousef.
Abstract
BACKGROUND: Advances in molecular biology have produced large and complex data sets, so a clustering approach that is able to capture the actual structure and the hidden patterns of the data is required. Moreover, the geometric space may not reflect the actual similarity between the different objects. As a result, in this research we use a clustering-based space that converts the geometric space of the molecular data to a categorical space based on clustering results. We then use this space to develop a new classification algorithm.
Keywords: Classification; Ensemble clustering; k-means
Year: 2020 PMID: 32082410 PMCID: PMC7017541 DOI: 10.1186/s13015-020-0162-7
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 2 The workflow for creating the EC categorical space based on the k-means clustering algorithm. The original data is the input to the workflow. The outcome is a new dataset, named EC data, in a categorical space with dimension k. The sign ≪ indicates that k is dramatically smaller than the original data dimension N
Fig. 1 Example of clustering data
EC space for 20 points with the number of clusters (nmc) set to 11
| Point/k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|
| Point 1 | c0 | c2 | c3 | c2 | c2 | c4 | c5 | c4 | c4 | c5 |
| Point 2 | c0 | c0 | c3 | c3 | c2 | c4 | c4 | c4 | c4 | c2 |
| Point 3 | c0 | c2 | c2 | c4 | c5 | c5 | c6 | c6 | c6 | c6 |
| Point 4 | c1 | c0 | c0 | c3 | c3 | c2 | c2 | c3 | c3 | c3 |
| Point 5 | c0 | c0 | c3 | c3 | c2 | c2 | c4 | c2 | c2 | c2 |
| Point 6 | c0 | c2 | c3 | c2 | c4 | c4 | c5 | c4 | c4 | c5 |
| Point 7 | c0 | c2 | c3 | c2 | c4 | c4 | c5 | c5 | c5 | c4 |
| Point 8 | c0 | c2 | c2 | c4 | c4 | c5 | c6 | c6 | c6 | c6 |
| Point 9 | c1 | c0 | c0 | c3 | c3 | c2 | c2 | c3 | c3 | c3 |
| Point 10 | c0 | c2 | c3 | c2 | c4 | c4 | c5 | c5 | c4 | c5 |
| Point 11 | c0 | c2 | c2 | c2 | c4 | c5 | c6 | c5 | c5 | c4 |
| Point 12 | c0 | c2 | c2 | c2 | c4 | c5 | c6 | c5 | c5 | c4 |
| Point 13 | c0 | c2 | c2 | c2 | c4 | c5 | c6 | c5 | c5 | c4 |
| Point 14 | c0 | c2 | c3 | c2 | c2 | c4 | c5 | c4 | c4 | c5 |
| Point 15 | c0 | c2 | c2 | c2 | c4 | c5 | c6 | c5 | c5 | c4 |
| Point 16 | c0 | c2 | c3 | c2 | c4 | c4 | c5 | c5 | c4 | c5 |
| Point 17 | c0 | c2 | c3 | c2 | c4 | c5 | c5 | c5 | c5 | c4 |
| Point 18 | c0 | c2 | c3 | c2 | c2 | c4 | c5 | c4 | c4 | c5 |
| Point 19 | c0 | c0 | c3 | c3 | c2 | c2 | c4 | c2 | c2 | c2 |
| Point 20 | c0 | c2 | c2 | c2 | c4 | c5 | c6 | c5 | c5 | c4 |
The first column is the point name; the second column is the result of k-means assigning each point to one of two clusters (c0 and c1); the third column is the result of k-means assigning each point to one of three clusters, and so on.
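The construction of such an EC table can be sketched as follows. This is a minimal illustration, not the authors' implementation: the functions `kmeans_labels` and `build_ec_space`, the fixed iteration count, the seeding scheme, and the synthetic two-dimensional points are all assumptions made for the example. It runs a plain Lloyd's k-means once for each k from 2 to nmc and records the cluster label of every point as one categorical column.

```python
import random


def kmeans_labels(points, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means (illustrative); returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from k data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def build_ec_space(points, nmc):
    """One categorical row per point: its cluster label for k = 2..nmc."""
    columns = [kmeans_labels(points, k, seed=k) for k in range(2, nmc + 1)]
    return [tuple(f"c{col[i]}" for col in columns) for i in range(len(points))]


# Hypothetical input: 20 synthetic 2-D points drawn around two centers.
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10)]
points += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(10)]

ec = build_ec_space(points, nmc=11)
print(len(ec), len(ec[0]))  # 20 rows, 10 categorical columns (k = 2..11)
```

Each row of `ec` plays the role of one line of the table above; points whose rows are identical share the same EC representation.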
The Cercopithecidae vs Malvaceae data with k = 30
| Size | Unique points | #Points | Ratio unique points | Ratio all |
|---|---|---|---|---|
| 1 | 305 | 305 | 67.929% | 34.116% |
| 2 | 68 | 136 | 30.290% | 15.213% |
| 3 | 22 | 66 | 14.699% | 7.383% |
| 4 | 18 | 72 | 16.036% | 8.054% |
| 5 | 11 | 55 | 12.249% | 6.152% |
| 6 | 5 | 30 | 6.682% | 3.356% |
| 7 | 5 | 35 | 7.795% | 3.915% |
| 10 | 4 | 40 | 8.909% | 4.474% |
| 13 | 3 | 39 | 8.686% | 4.362% |
| 8 | 3 | 24 | 5.345% | 2.685% |
| 9 | 2 | 18 | 4.009% | 2.013% |
| 29 | 1 | 29 | 6.459% | 3.244% |
| 14 | 1 | 14 | 3.118% | 1.566% |
| 31 | 1 | 31 | 6.904% | 3.468% |
| Total | 449 | 894 | | |
The total number of points is 894, which is the sum of the #Points column. The number of unique points is 449, which is the sum of the Unique points column. #Points is the product of Size and Unique points. Ratio unique points is #Points divided by the total number of unique points (449), while Ratio all is #Points divided by the total number of points (894)
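The arithmetic behind this table can be checked with a short script. It is only a sketch that re-derives the columns from the (Size, Unique points) pairs listed above; the variable names are chosen for the example. As the tabulated values work out, the Ratio unique points column equals #Points divided by 449 and the Ratio all column equals #Points divided by 894.

```python
# (Size, Unique points) pairs copied from the table above.
rows = [(1, 305), (2, 68), (3, 22), (4, 18), (5, 11), (6, 5), (7, 5),
        (10, 4), (13, 3), (8, 3), (9, 2), (29, 1), (14, 1), (31, 1)]

total_unique = sum(unique for _, unique in rows)        # sum of "Unique points"
total_points = sum(size * unique for size, unique in rows)  # sum of "#Points"

for size, unique in rows:
    n_points = size * unique                 # "#Points" column
    ratio_unique = n_points / total_unique   # "Ratio unique points" column
    ratio_all = n_points / total_points      # "Ratio all" column
    print(f"{size:>4} {unique:>4} {n_points:>4} "
          f"{ratio_unique:.3%} {ratio_all:.3%}")

print(total_unique, total_points)  # 449 894
```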
GrpClassifierEC: EC classifier results with a k value of 49, compared with a random forest applied to the EC samples and with regular classifiers applied to the original data (k is the number of clusters)
| Data set | #Points | #EC samples | Ratio | EC sensitivity | EC specificity | EC F-measure | EC accuracy | Acc. diff. vs EC-RF | Acc. diff. vs RF | Acc. diff. vs DTT | Acc. diff. vs KNN | EC-RF sensitivity | EC-RF specificity | EC-RF accuracy | DTT accuracy | KNN accuracy | RF accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aves vs Embryophyta | 1068 | 726 | 68% | 0.97 | 0.92 | 0.97 | 0.96 | 0.02 | 0.01 | 0.05 | 0.02 | 0.84 | 0.97 | 0.93 | 0.91 | 0.93 | 0.95 |
| Cercopithecidae vs Malvaceae | 894 | 593 | 66% | 0.98 | 0.97 | 0.98 | 0.98 | 0.08 | 0.05 | 0.10 | 0.07 | 0.84 | 0.94 | 0.90 | 0.88 | 0.91 | 0.93 |
| Embryophyta vs Laurasiatheria | 953 | 652 | 68% | 0.96 | 0.92 | 0.96 | 0.95 | 0.08 | 0.04 | 0.10 | 0.07 | 0.94 | 0.72 | 0.87 | 0.85 | 0.88 | 0.91 |
| Fabaceae vs Nematoda | 2642 | 1004 | 38% | 0.85 | 0.89 | 0.84 | 0.87 | 0.02 | -0.01 | 0.04 | 0.00 | 0.92 | 0.76 | 0.85 | 0.83 | 0.88 | 0.89 |
| Hexapoda vs Aves | 2840 | 2087 | 73% | 0.85 | 0.95 | 0.86 | 0.92 | 0.10 | 0.03 | 0.11 | 0.10 | 0.61 | 0.91 | 0.83 | 0.81 | 0.82 | 0.89 |
| Laurasiatheria vs Brassicaceae | 1209 | 570 | 47% | 0.93 | 0.93 | 0.94 | 0.93 | 0.05 | 0.01 | 0.05 | 0.02 | 0.86 | 0.90 | 0.88 | 0.89 | 0.91 | 0.92 |
| Malvaceae vs Fabaceae | 1401 | 749 | 53% | 0.69 | 0.87 | 0.68 | 0.82 | 0.16 | 0.05 | 0.15 | 0.12 | 0.84 | 0.22 | 0.67 | 0.67 | 0.70 | 0.77 |
| Brassicaceae vs Hexapoda | 2584 | 870 | 34% | 0.84 | 0.96 | 0.84 | 0.93 | 0.02 | 0.00 | 0.03 | 0.01 | 0.97 | 0.74 | 0.92 | 0.90 | 0.93 | 0.94 |
| Hominidae vs Cercopithecidae | 1829 | 1059 | 58% | 0.72 | 0.91 | 0.73 | 0.86 | 0.15 | 0.09 | 0.20 | 0.14 | 0.25 | 0.87 | 0.70 | 0.66 | 0.71 | 0.76 |
| Monocotyledons vs Homo sapiens | 2625 | 1460 | 56% | 0.92 | 0.93 | 0.92 | 0.92 | 0.10 | 0.03 | 0.09 | 0.04 | 0.84 | 0.82 | 0.83 | 0.83 | 0.88 | 0.89 |
| Average | | | 56% | 87% | 92% | 87% | 91% | 8% | 3% | 9% | 6% | 79% | 78% | 84% | 82% | 85% | 89% |
Fig. 3 Distribution of the sizes of the groups of points, comparing nmc = 30 and nmc = 50
The table shows a list of clades used in the study
| Data set | Number of precursors | Number of unique precursors |
|---|---|---|
| Hominidae | 3629 | 1326 |
| Brassicaceae | 726 | 535 |
| Hexapoda | 3119 | 2050 |
| Monocotyledons (Liliopsida) | 1598 | 1402 |
| Nematoda | 1789 | 1632 |
| Fabaceae | 1313 | 1011 |
| Pisces (Chondrichthyes) | 1530 | 682 |
| Virus | 306 | 295 |
| Aves | 948 | 790 |
| Laurasiatheria | 1205 | 675 |
| Rodentia | 1778 | 993 |
| | 1828 | 1223 |
| Cercopithecidae | 631 | 503 |
| Embryophyta | 287 | 278 |
| Malvaceae | 458 | 419 |
| Platyhelminthes | 424 | 381 |
The first column gives the name of the clade, the second column the number of precursors available in miRBase, and the third column the number of precursors after preprocessing the data
Ten datasets
| Positive data | Negative data |
|---|---|
| Aves | Embryophyta |
| Cercopithecidae | Malvaceae |
| Embryophyta | Laurasiatheria |
| Fabaceae | Nematoda |
| Hexapoda | Aves |
| Laurasiatheria | Brassicaceae |
| Malvaceae | Fabaceae |
| Brassicaceae | Hexapoda |
| Hominidae | Cercopithecidae |
| Monocotyledons | Homo sapiens |
The first column gives the name of the first clade (the positive data), and the second column the name of the second clade (the negative data)
Fig. 4 The accuracy of the classifiers over different levels of training sample size
GrpClassifierEC: EC classifier results with a k value of 30, compared with a random forest applied to the EC samples and with regular classifiers applied to the original data
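| Data set | #Points | #EC samples | Ratio | EC sensitivity | EC specificity | EC F-measure | EC accuracy | Acc. diff. vs EC-RF | Acc. diff. vs RF | Acc. diff. vs DTT | Acc. diff. vs KNN |
|---|---|---|---|---|---|---|---|---|---|---|---|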
| Aves vs Embryophyta | 1068 | 513 | 48% | 0.86 | 0.94 | 0.85 | 0.92 | -0.01 | -0.03 | 0.02 | -0.01 |
| Cercopithecidae vs Malvaceae | 894 | 449 | 50% | 0.94 | 0.92 | 0.94 | 0.94 | 0.04 | 0.01 | 0.06 | 0.03 |
| Embryophyta vs Laurasiatheria | 953 | 493 | 52% | 0.94 | 0.83 | 0.94 | 0.91 | 0.04 | 0.00 | 0.06 | 0.03 |
| Fabaceae vs Nematoda | 2642 | 536 | 20% | 0.78 | 0.88 | 0.79 | 0.84 | -0.01 | -0.05 | 0.01 | -0.04 |
| Hexapoda vs Aves | 2840 | 1647 | 58% | 0.76 | 0.92 | 0.78 | 0.88 | 0.05 | -0.01 | 0.07 | 0.06 |
| Laurasiatheria vs Brassicaceae | 1209 | 406 | 34% | 0.89 | 0.88 | 0.89 | 0.88 | 0.00 | -0.04 | 0.00 | -0.03 |
| Malvaceae vs Fabaceae | 1401 | 451 | 32% | 0.55 | 0.80 | 0.53 | 0.73 | 0.07 | -0.04 | 0.06 | 0.03 |
| Brassicaceae vs Hexapoda | 2584 | 542 | 21% | 0.77 | 0.95 | 0.78 | 0.91 | -0.01 | -0.03 | 0.01 | -0.02 |
| Hominidae vs Cercopithecidae | 1829 | 786 | 43% | 0.61 | 0.87 | 0.63 | 0.80 | 0.10 | 0.04 | 0.14 | 0.09 |
| Monocotyledons vs Homo sapiens | 2625 | 855 | 33% | 0.86 | 0.87 | 0.86 | 0.87 | 0.04 | -0.03 | 0.03 | -0.01 |
| Average | | | 39% | 80% | 89% | 80% | 87% | 3% | -2% | 5% | 1% |
k is the number of clusters. Each “Accuracy difference” value is the accuracy of the EC classifier minus the accuracy of the other classifier. A positive value indicates that the EC classifier is better than the corresponding classifier. EC-RF is a random forest applied to the EC data; RF is a random forest applied to the original data. DTT is a decision tree and KNN is k-nearest neighbors, both applied to the original data
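As a worked check of the accuracy-difference definition, the snippet below recomputes the differences for the Cercopithecidae vs Malvaceae row of the k = 49 table, using the rounded accuracies printed there. It is an illustrative sketch with variable names chosen for the example. In other rows the recomputed value can differ from the printed one by 0.01, presumably because the printed differences were derived from unrounded accuracies; that explanation is an assumption.

```python
# Accuracies for Cercopithecidae vs Malvaceae, copied from the k = 49 table.
ec_accuracy = 0.98
other_accuracy = {"EC-RF": 0.90, "RF": 0.93, "DTT": 0.88, "KNN": 0.91}

# Accuracy difference = EC classifier accuracy - other classifier accuracy.
diff = {name: round(ec_accuracy - acc, 2) for name, acc in other_accuracy.items()}
print(diff)  # {'EC-RF': 0.08, 'RF': 0.05, 'DTT': 0.1, 'KNN': 0.07}

# The Ratio column is #EC samples divided by #Points (593 / 894 for this row).
ratio_percent = round(593 / 894 * 100)
print(ratio_percent)  # 66
```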