
Penalized unsupervised learning with outliers.

Daniela M Witten

Abstract

We consider the problem of performing unsupervised learning in the presence of outliers, that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an "error" term, and we penalize the errors using a group lasso penalty to encourage most of the observations' errors to exactly equal zero. We show that this approach can be used to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored.
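The error-term idea in the abstract can be illustrated with a minimal sketch for the K-means case. This is not the paper's implementation. It assumes the objective is the sum over observations of ||x_i - e_i - mu_c(i)||^2 plus lam times the sum of ||e_i||_2, and it minimizes that objective by alternating updates. The function names, the `lam` parameter, and the requirement to pass starting centers explicitly are all hypothetical choices made for this illustration.

```python
import numpy as np

def group_soft_threshold(residuals, lam):
    """Row-wise minimizer of ||r - e||^2 + lam * ||e||_2 over e.

    Rows with norm at most lam / 2 are set exactly to zero; longer rows
    are shrunk toward zero by lam / 2 in norm.
    """
    norms = np.linalg.norm(residuals, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / (2.0 * np.maximum(norms, 1e-12)))
    return residuals * scale

def kmeans_with_outliers(X, init_centers, lam, n_iter=50):
    """Alternating minimization of the assumed penalized K-means objective.

    Returns (labels, centers, E). Rows of E that are nonzero mark the
    observations flagged as outliers.
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    labels = np.zeros(X.shape[0], dtype=int)
    E = np.zeros_like(X)
    for _ in range(n_iter):
        # Step 1: assign each observation to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: update the per-observation error terms (group lasso step).
        E = group_soft_threshold(X - centers[labels], lam)
        # Step 3: recompute each center from the error-corrected data.
        Z = X - E
        for j in range(centers.shape[0]):
            members = labels == j
            if members.any():
                centers[j] = Z[members].mean(axis=0)
    return labels, centers, E

# Two tight groups of four points plus one distant point.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [10, 10], [11, 10], [10, 11], [11, 11],
              [30, 30]])
labels, centers, E = kmeans_with_outliers(X, [[0, 0], [10, 10]], lam=6.0)
flagged = np.flatnonzero(np.linalg.norm(E, axis=1) > 0)
```

In this toy example only the distant point ends up with a nonzero error row, so `flagged` contains only its index. Larger values of `lam` flag fewer observations, and a sufficiently large value recovers ordinary K-means updates. Like ordinary K-means, the sketch depends on the starting centers and can get stuck in a poor local solution.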


Keywords:  M-estimation; group lasso; k-means clustering; outliers; principal components analysis; robust; unsupervised learning

Year: 2013
PMID: 23875057
PMCID: PMC3716393
DOI: 10.4310/sii.2013.v6.n2.a5

Source DB: PubMed
Journal: Stat Interface
ISSN: 1938-7989
Impact factor: 0.582


