
MODEL-BASED FEATURE SELECTION AND CLUSTERING OF RNA-SEQ DATA FOR UNSUPERVISED SUBTYPE DISCOVERY.

David K Lim, Naim U Rashid, Joseph G Ibrahim.

Abstract

Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application in biomedical research is delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes are informative in discriminating between clusters, and what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose Feature Selection and Clustering of RNA-seq (FSCseq), a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. Maximization is carried out by a penalized Classification EM algorithm, which allows us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.
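As an illustration of two ingredients named in the abstract, the following is a minimal Python sketch, not the FSCseq implementation. It shows the standard SCAD penalty for a single coefficient and how posterior probabilities of cluster membership could be computed for one new sample under a finite mixture. The negative binomial count model, the default a = 3.7, and every function and parameter name here (scad_penalty, nb_logpmf, posterior_membership, size_factor, log_means, dispersions, mix_props) are assumptions made for illustration; the abstract does not specify them, nor how the penalty is applied within the model.

```python
import math


def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty for one coefficient (a > 2, lam > 0).

    Linear near zero, quadratic in the middle, constant for large |theta|.
    """
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def nb_logpmf(y, mu, phi):
    """Log-pmf of a negative binomial with mean mu and dispersion phi.

    Parameterized so that the variance is mu + phi * mu**2 (assumed model).
    """
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def posterior_membership(counts, size_factor, log_means, dispersions, mix_props):
    """Posterior cluster probabilities for a single sample.

    counts[j]       : observed count for gene j
    size_factor     : the sample's normalization factor (hypothetical input)
    log_means[k][j] : log baseline mean of gene j in cluster k
    dispersions[j]  : dispersion of gene j
    mix_props[k]    : mixing proportion of cluster k
    """
    log_post = []
    for k, pi_k in enumerate(mix_props):
        ll = math.log(pi_k)
        for j, y in enumerate(counts):
            mu = size_factor * math.exp(log_means[k][j])
            ll += nb_logpmf(y, mu, dispersions[j])
        log_post.append(ll)
    # Normalize on the log scale to avoid underflow.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Toy usage: two genes, two clusters with opposite expression patterns.
example_post = posterior_membership(
    counts=[100, 5],
    size_factor=1.0,
    log_means=[[math.log(100), math.log(5)], [math.log(5), math.log(100)]],
    dispersions=[0.1, 0.1],
    mix_props=[0.5, 0.5],
)
```

In this toy setup the sample's counts match the first cluster's means, so the first posterior probability dominates. A classification EM step would then assign the sample to the cluster with the largest posterior probability.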

Keywords:  RNA-seq; batch; clustering; confounders

Year: 2021; PMID: 34457104; PMCID: PMC8386505; DOI: 10.1214/20-aoas1407

Source DB: PubMed; Journal: Ann Appl Stat; ISSN: 1932-6157; Impact factor: 2.083


