CUR matrix decompositions for improved data analysis.

Michael W Mahoney, Petros Drineas.

Abstract

Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high "statistical leverage" and, thus, in a very precise statistical sense, exert a disproportionately large "influence" on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.
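The leverage-based selection described in the abstract can be illustrated with a short sketch. The abstract does not spell out the sampling rule, so the code below assumes one common formulation: the leverage score of column j is the squared norm of the j-th column of the top-k right singular vectors, normalized to sum to 1, and column j is kept independently with probability min(1, c * score). The function names (`leverage_scores`, `select_columns`, `cur`) and the parameters `k` (target rank) and `c` (oversampling factor) are hypothetical, chosen for illustration only.

```python
import numpy as np


def leverage_scores(A, k):
    """Normalized leverage scores of the columns of A relative to its
    top-k right singular subspace (assumes k <= min(A.shape))."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Rows of Vt[:k] are orthonormal, so the squared entries sum to k;
    # dividing by k makes the scores sum to 1.
    return np.sum(Vt[:k, :] ** 2, axis=0) / k


def select_columns(A, k, c, rng):
    """Assumed sampling rule: keep column j with probability min(1, c * score_j)."""
    probs = np.minimum(1.0, c * leverage_scores(A, k))
    keep = rng.random(A.shape[1]) < probs
    return np.flatnonzero(keep)


def cur(A, k, c, rng):
    """Build C from actual columns, R from actual rows, and U = C^+ A R^+."""
    cols = select_columns(A, k, c, rng)
    rows = select_columns(A.T, k, c, rng)  # rows of A are columns of A.T
    C = A[:, cols]
    R = A[rows, :]
    # An explicit rcond keeps round-off-level singular values from being
    # inverted when C or R is numerically rank deficient.
    U = np.linalg.pinv(C, rcond=1e-10) @ A @ np.linalg.pinv(R, rcond=1e-10)
    return C, U, R, cols, rows
```

A possible usage, with a synthetic rank-3 matrix standing in for real data:

```python
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
C, U, R, cols, rows = cur(A, k=3, c=20, rng=rng)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

Because `C` and `R` are slices of `A`, the indices in `cols` and `rows` point back to concrete data elements, which is the interpretability property the abstract emphasizes.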

Year:  2009        PMID: 19139392      PMCID: PMC2630100          DOI: 10.1073/pnas.0803205106

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


  8 in total

1.  Fundamental patterns underlying gene expression profiles: simplicity from complexity.

Authors:  N S Holter; M Mitra; A Maritan; M Cieplak; J R Banavar; N V Fedoroff
Journal:  Proc Natl Acad Sci U S A       Date:  2000-07-18       Impact factor: 11.205

2.  Singular value decomposition for genome-wide expression data processing and modeling.

Authors:  O Alter; P O Brown; D Botstein
Journal:  Proc Natl Acad Sci U S A       Date:  2000-08-29       Impact factor: 11.205

3.  From paragraph to graph: latent semantic analysis for information visualization.

Authors:  Thomas K Landauer; Darrell Laham; Marcia Derr
Journal:  Proc Natl Acad Sci U S A       Date:  2004-03-22       Impact factor: 11.205

4.  A network analysis of committees in the U.S. House of Representatives.

Authors:  Mason A Porter; Peter J Mucha; M E J Newman; Casey M Warmbrand
Journal:  Proc Natl Acad Sci U S A       Date:  2005-05-16       Impact factor: 11.205

5.  Intra- and interpopulation genotype reconstruction from tagging SNPs.

Authors:  Peristera Paschou; Michael W Mahoney; Asif Javed; Judith R Kidd; Andrew J Pakstis; Sheng Gu; Kenneth K Kidd; Petros Drineas
Journal:  Genome Res       Date:  2006-12-06       Impact factor: 9.043

6.  Molecular characterisation of soft tissue tumours: a gene expression study.

Authors:  Torsten O Nielsen; Rob B West; Sabine C Linn; Orly Alter; Margaret A Knowling; John X O'Connell; Shirley Zhu; Mike Fero; Gavin Sherlock; Jonathan R Pollack; Patrick O Brown; David Botstein; Matt van de Rijn
Journal:  Lancet       Date:  2002-04-13       Impact factor: 79.321

7.  A genome-wide transcriptional analysis of the mitotic cell cycle.

Authors:  R J Cho; M J Campbell; E A Winzeler; L Steinmetz; A Conway; L Wodicka; T G Wolfsberg; A E Gabrielian; D Landsman; D J Lockhart; R W Davis
Journal:  Mol Cell       Date:  1998-07       Impact factor: 17.970

8.  Vector algebra in the analysis of genome-wide expression data.

Authors:  Finny G Kuruvilla; Peter J Park; Stuart L Schreiber
Journal:  Genome Biol       Date:  2002-02-13       Impact factor: 13.583

  22 in total

Review 1.  Big-Data Science in Porous Materials: Materials Genomics and Machine Learning.

Authors:  Kevin Maik Jablonka; Daniele Ongari; Seyed Mohamad Moosavi; Berend Smit
Journal:  Chem Rev       Date:  2020-06-10       Impact factor: 60.622

2.  Online Decentralized Leverage Score Sampling for Streaming Multidimensional Time Series.

Authors:  Rui Xie; Zengyan Wang; Shuyang Bai; Ping Ma; Wenxuan Zhong
Journal:  Proc Mach Learn Res       Date:  2019-04

Review 3.  Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry.

Authors:  Nico Verbeeck; Richard M Caprioli; Raf Van de Plas
Journal:  Mass Spectrom Rev       Date:  2019-10-11       Impact factor: 10.946

4.  Enhancement of plant metabolite fingerprinting by machine learning.

Authors:  Ian M Scott; Cornelia P Vermeer; Maria Liakata; Delia I Corol; Jane L Ward; Wanchang Lin; Helen E Johnson; Lynne Whitehead; Baldeep Kular; John M Baker; Sean Walsh; Anuja Dave; Tony R Larson; Ian A Graham; Trevor L Wang; Ross D King; John Draper; Michael H Beale
Journal:  Plant Physiol       Date:  2010-06-21       Impact factor: 8.340

5.  Challenges of Big Data Analysis.

Authors:  Jianqing Fan; Fang Han; Han Liu
Journal:  Natl Sci Rev       Date:  2014-06       Impact factor: 17.275

6.  Gaussian Process Regression for Materials and Molecules.

Authors:  Volker L Deringer; Albert P Bartók; Noam Bernstein; David M Wilkins; Michele Ceriotti; Gábor Csányi
Journal:  Chem Rev       Date:  2021-08-16       Impact factor: 60.622

7.  Optimal Subsampling for Large Sample Logistic Regression.

Authors:  HaiYing Wang; Rong Zhu; Ping Ma
Journal:  J Am Stat Assoc       Date:  2018-06-06       Impact factor: 5.033

Review 8.  Non-negative matrix factorization of multimodal MRI, fMRI and phenotypic data reveals differential changes in default mode subnetworks in ADHD.

Authors:  Ariana Anderson; Pamela K Douglas; Wesley T Kerr; Virginia S Haynes; Alan L Yuille; Jianwen Xie; Ying Nian Wu; Jesse A Brown; Mark S Cohen
Journal:  Neuroimage       Date:  2013-12-19       Impact factor: 6.556

9.  An Extended DEIM Algorithm for Subset Selection and Class Identification.

Authors:  Emily P Hendryx; Béatrice M Rivière; Craig G Rusin
Journal:  Mach Learn       Date:  2021-03-21       Impact factor: 2.940

10.  Combining Machine Learning and Computational Chemistry for Predictive Insights Into Chemical Systems.

Authors:  John A Keith; Valentin Vassilev-Galindo; Bingqing Cheng; Stefan Chmiela; Michael Gastegger; Klaus-Robert Müller; Alexandre Tkatchenko
Journal:  Chem Rev       Date:  2021-07-07       Impact factor: 60.622
