Pooled variable scaling for cluster analysis.

Jakob Raymaekers, Ruben H Zamar.

Abstract

MOTIVATION: Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods using scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. Therefore, the results from these methods are very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling variables is a very delicate issue in cluster analysis: A bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is not satisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research.
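As a small illustration of the sensitivity to measurement units described above, the following sketch uses three made-up observations (the values and the helper name `nearest_to_first` are hypothetical, not taken from the paper) and shows that the Euclidean nearest neighbour of a point can change when one variable is merely re-expressed in different units.

```python
import numpy as np

# Three hypothetical observations: (height in metres, weight in kilograms).
X = np.array([
    [1.60, 60.0],
    [1.90, 61.0],
    [1.62, 65.0],
])

def nearest_to_first(data):
    """Return the index of the row closest to row 0 in Euclidean distance."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(d)) + 1

# The same data with height expressed in centimetres instead of metres.
X_cm = X * np.array([100.0, 1.0])

nn_m = nearest_to_first(X)      # neighbour when height is in metres
nn_cm = nearest_to_first(X_cm)  # neighbour when height is in centimetres
print(nn_m, nn_cm)
```

With height in metres the weight difference dominates the distance; in centimetres the height difference dominates, so the two calls return different neighbours even though the underlying data are identical.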
RESULTS: We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures, such as the SD and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real-data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer tissue obtained from human patients.
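The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea of scaling by a pooled within-cluster standard deviation, not the authors' implementation. It assumes that each variable is clustered on its own with a plain one-dimensional k-means and that the number of clusters per variable (`ks`) is supplied by hand; how that number should be chosen is not stated here. The function names `kmeans_1d` and `pooled_sd` are hypothetical.

```python
import numpy as np

def kmeans_1d(x, k, n_iter=100):
    """Plain Lloyd iterations on one variable (assumed clustering step).

    Centres start at evenly spaced quantiles so the result is deterministic.
    """
    centres = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        new = np.array([
            x[labels == j].mean() if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

def pooled_sd(x, k):
    """Pooled within-cluster SD of one variable, given k univariate clusters."""
    labels = kmeans_1d(x, k)
    ss = sum(
        ((x[labels == j] - x[labels == j].mean()) ** 2).sum()
        for j in range(k) if np.any(labels == j)
    )
    return np.sqrt(ss / (len(x) - k))

# Simulated data: one informative variable (two well-separated groups)
# and one pure-noise variable.
rng = np.random.default_rng(0)
informative = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])
noise = rng.normal(0, 1, 200)
X = np.column_stack([informative, noise])

ks = [2, 1]  # assumed number of univariate clusters per variable
scales = np.array([pooled_sd(X[:, j], k) for j, k in enumerate(ks)])
X_pooled = X / scales           # pooled scaling
X_sd = X / X.std(axis=0, ddof=1)  # ordinary SD scaling, for comparison
```

In this simulated example the overall SD of the informative variable is inflated by the gap between the two groups, so SD scaling compresses exactly the separation that makes the variable useful. The pooled scale measures only the within-group spread and leaves that separation largely intact. With `k = 1` the pooled SD reduces to the ordinary sample SD.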
AVAILABILITY AND IMPLEMENTATION: An R-implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Year:  2020        PMID: 32282889     DOI: 10.1093/bioinformatics/btaa243

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  2 in total

1.  Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans).

Authors:  Alfred Ultsch; Jörn Lötsch
Journal:  BMC Bioinformatics       Date:  2022-06-16       Impact factor: 3.307

2.  Comprehensive Analyses of Ferroptosis-Related Alterations and Their Prognostic Significance in Glioblastoma.

Authors:  Yuan Tian; Hongtao Liu; Caiqing Zhang; Wei Liu; Tong Wu; Xiaowei Yang; Junyan Zhao; Yuping Sun
Journal:  Front Mol Biosci       Date:  2022-06-03