Abstract
The concept of depth induces an ordering from centre outwards in multivariate data. Most depth definitions are unfeasible for dimensions larger than three or four, but the Modified Band Depth (MBD) is a notable exception that has proven to be a valuable tool in the analysis of high-dimensional gene expression data. This depth definition relates the centrality of each individual to its (partial) inclusion in all possible bands formed by elements of the data set. We assess (dis)similarity between pairs of observations by accounting for such bands and constructing binary matrices associated to each pair. From these, contingency tables are calculated and used to derive standard similarity indices. Our approach is computationally efficient and can be applied to bands formed by any number of observations from the data set. We have evaluated the performance of several band-based similarity indices against that of other classical distances in standard classification and clustering tasks on a variety of simulated and real data sets. However, the method is not restricted to these indices; extending it to other similarity coefficients is straightforward. Our experiments show the benefits of our technique, with some of the selected indices outperforming, among others, the Euclidean distance.
Year: 2021 PMID: 34732744 PMCID: PMC8566472 DOI: 10.1038/s41598-021-00678-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Computation of binary matrices. (a) Contingency table for the binary vectors associated to a pair of objects. (b) Representation of four quantitative observations with five variables, which define six possible 2-bands, one per pair of curves. The gray region between the two black curves is one such band. The first coordinate of one curve (green dot) falls outside the band; this corresponds to a 0 in entry (5,1) of its associated Boolean matrix. The first coordinate of the curve in red falls inside the band; the entry (5,1) of its matrix is 1, accordingly. The corresponding Boolean product is 0, meaning that the band does not include both coordinates. (c) Binary contingency table for coordinate k and the number j of curves forming a band, constructed from the matrices sketched in (b).
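The construction in Figure 1 can be sketched in a few lines. The code below is a minimal illustration, assuming 2-bands defined by the coordinate-wise envelopes of each pair of curves; the function names are illustrative and not taken from the paper's code. For each observation it builds a Boolean matrix of band-by-coordinate inclusions, then derives Jaccard and Ochiai similarities for a pair of observations from the contingency counts of their flattened matrices.

```python
import numpy as np
from itertools import combinations

def band_inclusion_matrices(X):
    """For each observation, build a Boolean matrix whose (b, k) entry is True
    when coordinate k of that observation lies inside the b-th 2-band.
    X has shape (n_obs, n_coords); returns an array (n_obs, n_bands, n_coords)."""
    n, d = X.shape
    bands = list(combinations(range(n), 2))  # all 2-bands, one per pair of curves
    M = np.zeros((n, len(bands), d), dtype=bool)
    for b, (i, j) in enumerate(bands):
        lo = np.minimum(X[i], X[j])          # lower envelope of the band
        hi = np.maximum(X[i], X[j])          # upper envelope of the band
        M[:, b, :] = (X >= lo) & (X <= hi)
    return M

def pair_similarities(M, a, b):
    """Contingency counts between the flattened Boolean matrices of
    observations a and b, and two classical similarity indices."""
    u, v = M[a].ravel(), M[b].ravel()
    n11 = np.sum(u & v)                      # both inside
    n10 = np.sum(u & ~v)                     # only a inside
    n01 = np.sum(~u & v)                     # only b inside
    jaccard = n11 / (n11 + n10 + n01)
    ochiai = n11 / np.sqrt((n11 + n10) * (n11 + n01))
    return jaccard, ochiai
```

Each band-based index in the tables below follows the same pattern, differing only in how the contingency counts are combined.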
Figure 2. Method workflow. Calculations relative to the red blocks can be skipped to alleviate the computational cost.
Description of the model parameters. One matrix is the identity matrix of dimension n; the other is an n×n matrix with all entries equal to zero, except the i-th position on the main diagonal, which equals 1. Models 1 and 2 consist of two Gaussian clusters; models 3–5 are Gaussian mixture models with fixed levels of overlap.
| Model | Dimension | Means | Covariance Matrices |
|---|---|---|---|
| Model 1 | 100 | | |
| Model 2 | 10 | | |
Figure 3. Classification error rates in the simulated models. Distribution of the classification error rates over 200 test sets from Models 1–5 after running kNN, using the classical distances and the band-based dissimilarity indices for bands formed by 2 and 3 observations. Best and second best means are indicated with dark and light gray boxes, respectively. The number of observations in each class of the train and test sets is 100 and 25, respectively.
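A kNN classifier needs only the precomputed test-by-train dissimilarity matrix, so any band-based index can be plugged in once similarities are converted to dissimilarities. The helper below is an illustrative sketch, not the paper's code:

```python
import numpy as np

def knn_predict(D, y_train, k=3):
    """k-nearest-neighbour classification from a precomputed dissimilarity
    matrix D (rows: test objects, columns: training objects): each test
    object takes the majority label among its k closest training objects."""
    nearest = np.argsort(D, axis=1)[:, :k]   # indices of the k smallest dissimilarities
    votes = np.asarray(y_train)[nearest]     # labels of those neighbours
    return np.array([np.bincount(row).argmax() for row in votes])
```

The classification error rate reported in the figures is then simply the fraction of predicted labels that disagree with the true test labels.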
Average clustering error rate over 200 simulations for all the simulated models; standard errors are written in brackets. Best and second best performance are highlighted in bold and italics, respectively.
| Dissim. | M1 | M2 ( | M2 ( | M3 | M4 | M5 |
|---|---|---|---|---|---|---|
| A | 0.3877 (0.0685) | 0.2461 (0.0701) | 0.2186 (0.0800) | 0.4020 (0.0817) | 0.5440 (0.0817) | 0.6095 (0.0484) |
| A | 0.3875 (0.0679) | 0.2502 (0.0724) | 0.2159 (0.0813) | 0.4024 (0.0845) | 0.5474 (0.0845) | 0.6108 (0.0495) |
| D | 0.3326 (0.0839) | 0.1747 (0.0659) | 0.1524 (0.0668) | 0.2975 (0.0833) | 0.4476 (0.0833) | 0.5171 (0.0675) |
| D | 0.3310 (0.0842) | 0.1782 (0.0762) | 0.1550 (0.0681) | 0.2950 (0.0833) | 0.4467 (0.0833) | 0.5195 (0.0684) |
| Eucl | 0.2072 (0.0821) | 0.3945 (0.0844) | 0.1486 (0.0559) | 0.2476 (0.0559) | 0.3004 (0.0813) | |
| F | 0.3611 (0.0769) | 0.2080 (0.0710) | 0.1782 (0.0695) | 0.3482 (0.0800) | 0.4924 (0.0800) | 0.5612 (0.0576) |
| F | 0.3594 (0.0791) | 0.2019 (0.0743) | 0.1742 (0.0687) | 0.3397 (0.0845) | 0.4889 (0.0845) | 0.5530 (0.0605) |
| J | 0.3618 (0.0769) | 0.2102 (0.0729) | 0.1794 (0.0714) | 0.3495 (0.0819) | 0.4979 (0.0819) | 0.5637 (0.0582) |
| J | 0.3609 (0.0763) | 0.2110 (0.0761) | 0.1813 (0.0735) | 0.3472 (0.0816) | 0.4975 (0.0816) | 0.5653 (0.0596) |
| Manh | 0.2300 (0.0875) | 0.0372 (0.0252) | 0.1492 (0.1184) | 0.1721 (0.0610) | 0.2828 (0.0610) | 0.3397 (0.0874) |
| Mk | 0.3064 (0.0923) | 0.1022 (0.0441) | 0.1097 (0.0533) | 0.2699 (0.0809) | 0.4060 (0.0809) | 0.4742 (0.0745) |
| Mk | 0.2774 (0.0950) | 0.0694 (0.0337) | 0.0868 (0.0548) | 0.2207 (0.0743) | 0.3472 (0.0743) | 0.4152 (0.0873) |
| Mk | 0.2536 (0.0948) | 0.0484 (0.0280) | 0.0954 (0.0748) | 0.1900 (0.0662) | 0.3064 (0.0662) | 0.3727 (0.0898) |
| Mk | 0.2272 (0.0932) | 0.4356 (0.0479) | 0.1556 (0.0590) | 0.2650 (0.0590) | 0.3231 (0.0802) |
| Mk | 0.2566 (0.0918) | 0.0324 (0.0239) | 0.4436 (0.0439) | 0.1659 (0.0579) | 0.2948 (0.0579) | 0.3508 (0.0821) |
| Mk | 0.2816 (0.0854) | 0.0363 (0.0250) | 0.4492 (0.0375) | 0.1795 (0.0619) | 0.3171 (0.0619) | 0.3846 (0.0824) |
| O | 0.3132 (0.0866) | 0.1592 (0.0650) | 0.1404 (0.0625) | 0.2753 (0.0821) | 0.4153 (0.0821) | 0.4828 (0.0733) |
| O | 0.3124 (0.0888) | 0.1578 (0.0702) | 0.1433 (0.0677) | 0.2752 (0.0819) | 0.4159 (0.0819) | 0.4862 (0.0754) |
| Pears | 0.4630 (0.0282) | 0.2374 (0.0329) | 0.4340 (0.0464) | 0.1311 (0.0489) | 0.2323 (0.0489) | 0. |
| RR | 0.3998 (0.0620) | 0.2640 (0.0889) | 0.2456 (0.1030) | 0.4491 (0.0877) | 0.5738 (0.0877) | 0.6358 (0.0459) |
| RR | 0.4064 (0.0600) | 0.2680 (0.0902) | 0.2592 (0.1048) | 0.4541 (0.0869) | 0.5785 (0.0869) | 0.6408 (0.0447) |
| S | 0.0645 (0.0440) | 0.2736 (0.0649) | ||||
| S | 0.0598 (0.0461) | |||||
| SM | 0.3187 (0.0717) | 0.1989 (0.0585) | 0.1376 (0.0453) | 0.2968 (0.0718) | 0.4286 (0.0718) | 0.5021 (0.0613) |
| SM | 0.3498 (0.0706) | 0.1988 (0.0543) | 0.1342 (0.0470) | 0.2975 (0.0715) | 0.4273 (0.0715) | 0.5128 (0.0625) |
Average ARI over 200 simulations for all the simulated models; standard errors are written in brackets. Best and second best performance are highlighted in bold and italics, respectively.
| Dissim. | M1 | M2 ( | M2 ( | M3 | M4 | M5 |
|---|---|---|---|---|---|---|
| A | 0.0598 (0.0702) | 0.2702 (0.1356) | 0.3356 (0.1619) | 0.1945 (0.0917) | 0.1064 (0.0504) | 0.0825 (0.0337) |
| A | 0.0597 (0.0697) | 0.2632 (0.1361) | 0.3427 (0.1635) | 0.1953 (0.0960) | 0.1050 (0.0491) | 0.0833 (0.0343) |
| D | 0.1317 (0.1126) | 0.4351 (0.1660) | 0.4960 (0.1687) | 0.3418 (0.1223) | 0.2074 (0.0754) | 0.1672 (0.0572) |
| D | 0.1341 (0.1145) | 0.4318 (0.1782) | 0.4896 (0.1730) | 0.3458 (0.1222) | 0.2076 (0.0769) | 0.1652 (0.0571) |
| Eucl | 0.3636 (0.1722) | 0.0641 (0.1215) | 0.6188 (0.1171) | 0.4818 (0.1087) | 0.4293 (0.1006) | |
| F | 0.0917 (0.0962) | 0.3549 (0.1574) | 0.4276 (0.1613) | 0.2637 (0.1018) | 0.1553 (0.0622) | 0.1235 (0.0436) |
| F | 0.0950 (0.1000) | 0.3712 (0.1703) | 0.4380 (0.1608) | 0.2769 (0.1117) | 0.1633 (0.0659) | 0.1312 (0.0475) |
| J | 0.0910 (0.0946) | 0.3508 (0.1620) | 0.4259 (0.1625) | 0.2627 (0.1029) | 0.1518 (0.0640) | 0.1216 (0.0435) |
| J | 0.0917 (0.0945) | 0.3507 (0.1692) | 0.4221 (0.1659) | 0.2656 (0.1044) | 0.1543 (0.0635) | 0.1204 (0.0453) |
| Manh | 0.3155 (0.1788) | 0.8576 (0.0909) | 0.5436 (0.2817) | 0.5698 (0.1185) | 0.4289 (0.1050) | 0.3742 (0.1013) |
| Mk | 0.1757 (0.1376) | 0.6371 (0.1370) | 0.6168 (0.1529) | 0.3887 (0.1242) | 0.2543 (0.0820) | 0.2095 (0.0668) |
| Mk | 0.2266 (0.1597) | 0.7438 (0.1147) | 0.6918 (0.1581) | 0.4738 (0.1293) | 0.3318 (0.0922) | 0.2795 (0.0886) |
| Mk | 0.2716 (0.1724) | 0.8172 (0.1000) | 0.6739 (0.2020) | 0.5333 (0.1236) | 0.3904 (0.0989) | 0.3328 (0.0978) |
| Mk | 0.3257 (0.1760) | 0.0165 (0.0353) | 0.6043 (0.1192) | 0.4539 (0.1066) | 0.3963 (0.0956) | |
| Mk | 0.2632 (0.1582) | 0.8754 (0.0878) | 0.0110 (0.0281) | 0.5804 (0.1153) | 0.4070 (0.1000) | 0.3563 (0.0946) |
| Mk | 0.2121 (0.1341) | 0.8612 (0.0913) | 0.0064 (0.0214) | 0.5526 (0.1203) | 0.3714 (0.0949) | 0.3135 (0.0877) |
| O | 0.1613 (0.1283) | 0.4761 (0.1700) | 0.5283 (0.1649) | 0.3773 (0.1244) | 0.2437 (0.0793) | 0.2031 (0.0675) |
| O | 0.1641 (0.1346) | 0.4828 (0.1757) | 0.5225 (0.1771) | 0.3782 (0.1239) | 0.2444 (0.0834) | 0.2014 (0.0693) |
| Pears | 0.0014 (0.0143) | 0.2745 (0.0701) | 0.0163 (0.0336) | 0.6568 (0.1090) | ||
| RR | 0.0464 (0.0610) | 0.2469 (0.1619) | 0.2945 (0.1921) | 0.1467 (0.0913) | 0.0818 (0.0466) | 0.0659 (0.0302) |
| RR | 0.0404 (0.0567) | 0.2406 (0.1620) | 0.2690 (0.1902) | 0.1419 (0.0900) | 0.0794 (0.0477) | 0.0630 (0.0304) |
| S | 0.7952 (0.1371) | 0.5841 (0.0982) | 0.5604 (0.0864) | |||
| S | 0.8117 (0.1456) | |||||
| SM | 0.1433 (0.1030) | 0.3699 (0.1326) | 0.5286 (0.1272) | 0.3370 (0.1040) | 0.2206 (0.0671) | 0.1778 (0.0507) |
| SM | 0.1011 (0.0912) | 0.3684 (0.1238) | 0.5392 (0.1329) | 0.3342 (0.1027) | 0.2215 (0.0709) | 0.1678 (0.0514) |
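The Adjusted Rand Index (ARI) reported above measures chance-corrected agreement between the recovered clustering and the true class labels: 1 means identical partitions, values near 0 mean agreement no better than random. A self-contained sketch of the standard formula (not the paper's code):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same objects."""
    n = len(labels_true)
    ct = Counter(zip(labels_true, labels_pred))        # contingency table counts
    rows = Counter(labels_true)                        # cluster sizes, true partition
    cols = Counter(labels_pred)                        # cluster sizes, predicted partition
    sum_ij = sum(comb(v, 2) for v in ct.values())      # agreeing pairs
    sum_a = sum(comb(v, 2) for v in rows.values())
    sum_b = sum(comb(v, 2) for v in cols.values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total                   # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the index is invariant to relabelling of the clusters, which is why dendrogram branch identities do not matter when scoring.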
Figure 4. Classification error rates in real data sets. Distribution of kNN classification error rates for the lymphoma (top left), colon (top middle), leukemia with two (top right) and three classes (bottom left), RNAseq storage conditions (bottom middle) and pan-cancer (bottom right) data sets, using 10-fold cross-validation with selection-bias correction.
Figure 5. Clustering the lymphoma data set. Dendrograms and heatmaps for the lymphoma data set, using the Euclidean distance (left), the Simpson index (middle) and the Ochiai index (right). The Euclidean distance merges 2 DLBCL and 2 FL samples with the CLL group. The Simpson index identifies the structure better, misplacing 1 DLBCL in the FL group and 1 DLBCL in the CLL group. The Ochiai index completely restores the three classes.
Figure 6. Clustering the colon data set. Dendrograms and heatmaps for the colon data set, using the Euclidean distance (left), the Simpson index (middle) and the Ochiai index (right). The types of samples are difficult to separate, and one branch with mixed leaves emerges in all three dendrograms.
Figure 7. Clustering the leukemia data set. Dendrograms and heatmaps for the leukemia data set, using the Euclidean distance (left), the Simpson index (middle) and the Ochiai index (right). The colour labels in the columns and the rows correspond to the 2-class and 3-class scenarios, respectively.
Figure 8. Clustering the Pan-cancer data set. Dendrograms for the Pan-cancer data set, using the Euclidean distance (top), the Simpson (middle) and the Ochiai (bottom) indices. Wrongly assigned samples are coloured in the dendrogram.
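The dendrograms in Figures 5 to 9 are built from the band-based indices after converting each similarity s in [0, 1] to a dissimilarity, e.g. 1 - s. A minimal sketch with SciPy, in which the function name and the 1 - s conversion are illustrative assumptions rather than the paper's pipeline:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_similarity(S, n_clusters, method="average"):
    """Hierarchical clustering from a symmetric similarity matrix S in [0, 1]:
    convert to a dissimilarity (1 - S), condense it, build the tree, cut it."""
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)                         # self-dissimilarity must be 0
    Z = linkage(squareform(D, checks=False), method=method)
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

The linkage matrix `Z` is what `scipy.cluster.hierarchy.dendrogram` would plot; cutting it with `fcluster` yields the flat labels scored in the tables above.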
FFvsFFPE data. Experimental design according to the tissue of origin and the storage condition.
| Tissue type | Storage condition | Number of samples |
|---|---|---|
| Bladder carcinoma | FF | 18 |
| FFPE | 12 | |
| Colon carcinoma | FF | 12 |
| FFPE | 16 | |
| Normal colon | FF | 6 |
| Normal liver | FFPE | 4 |
| Prostate carcinoma | FF | 7 |
| FFPE | 7 | |
| Normal tonsil | FFPE | 4 |
Figure 9. Clustering the FFvsFFPE data set. Dendrograms and heatmaps for the FFvsFFPE data set, using the Euclidean distance (left), the Simpson (middle) and the Ochiai (right) indices. Main branches are highlighted with black dots.