Accelerating high-dimensional clustering with lossless data reduction.

Bahjat F Qaqish, Jonathon J O'Brien, Jonathan C Hibbard, Katie J Clowers.

Abstract

MOTIVATION: For cluster analysis, high-dimensional data are associated with instability, decreased classification accuracy and high computational burden. The latter challenge can be eliminated as a serious concern. For applications where dimension reduction techniques are not implemented, we propose a temporary transformation which accelerates computations with no loss of information. The algorithm can be applied to any statistical procedure that depends only on Euclidean distances and can be implemented sequentially to enable analyses of data that would otherwise exceed memory limitations.
RESULTS: The method is easily implemented in common statistical software as a standard pre-processing step. The benefit of our algorithm grows with the dimensionality of the problem and the complexity of the analysis. Consequently, our simple algorithm not only decreases the computation time for routine analyses but also opens the door to performing calculations that might otherwise have been too burdensome to attempt.
AVAILABILITY AND IMPLEMENTATION: R, Matlab and SAS/IML code for implementing lossless data reduction is freely available in the Appendix.
CONTACT: obrienj@hms.harvard.edu.
© The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
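
ILLUSTRATIVE SKETCH: The abstract does not spell out the transformation, so the following is only a sketch under a stated assumption, not the authors' Appendix code. It assumes the reduction replaces an n x p data matrix (n samples, p features, p much larger than n) with an n x n matrix whose rows have exactly the same pairwise Euclidean distances, obtained here from a Cholesky-type factor of the n x n Gram matrix of inner products. Because a squared distance depends only on inner products (d_ij^2 = G_ii + G_jj - 2 G_ij), any factor L with L L^T = G preserves every distance. The Gram matrix can also be accumulated one block of features at a time, which is one plausible reading of the sequential, memory-limited use mentioned above. All function names are hypothetical and only the Python standard library is used.

```python
import math
import random


def gram_matrix(rows):
    """Return the n x n matrix of inner products between the given rows."""
    n = len(rows)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            g[i][j] = g[j][i] = sum(a * b for a, b in zip(rows[i], rows[j]))
    return g


def add_column_block(g, block):
    """Add one block of features to the running Gram matrix, in place.

    `block` holds the same n samples restricted to a subset of the columns,
    so the full n x p matrix never has to be in memory at once.
    """
    part = gram_matrix(block)
    for i in range(len(g)):
        for j in range(len(g)):
            g[i][j] += part[i][j]


def cholesky_rows(g, tol=1e-12):
    """Return lower-triangular L with L L^T = g for symmetric PSD g.

    Pivots at or below tol * (largest diagonal entry) are treated as zero
    and their column is left at zero (assumption: g is positive semidefinite).
    """
    n = len(g)
    threshold = tol * max(g[i][i] for i in range(n))
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = g[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if pivot <= threshold:
            continue
        low[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (g[i][j] - s) / low[j][j]
    return low


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Synthetic data for illustration only: 5 samples, 2000 features.
random.seed(0)
n, p = 5, 2000
data = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

# One-shot reduction: 5 x 2000 becomes 5 x 5.
reduced = cholesky_rows(gram_matrix(data))

# Sequential reduction: accumulate the Gram matrix 500 features at a time.
g = [[0.0] * n for _ in range(n)]
for start in range(0, p, 500):
    add_column_block(g, [row[start:start + 500] for row in data])
reduced_seq = cholesky_rows(g)

# Largest discrepancy between original and reduced pairwise distances.
pairs = [(i, j) for i in range(n) for j in range(i)]
max_err = max(
    abs(euclidean(data[i], data[j]) - euclidean(reduced[i], reduced[j]))
    for i, j in pairs
)
max_err_seq = max(
    abs(euclidean(data[i], data[j]) - euclidean(reduced_seq[i], reduced_seq[j]))
    for i, j in pairs
)
```

Any distance-based procedure (hierarchical clustering, k-means, resampling-based methods) could then be run on `reduced` instead of `data`; in this sketch the two discrepancy values differ from zero only by floating-point rounding. In practice a library factorization (for example QR or Cholesky from a numerical package) would replace the hand-written loop.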

Year:  2017        PMID: 28520900      PMCID: PMC5870568          DOI: 10.1093/bioinformatics/btx328

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

