BACKGROUND: Many methods for analyzing microarray data group together genes with similar expression patterns across all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as traditional clustering requires; for example, genes that are strongly up-regulated or down-regulated together across only a subset of conditions. Equally important is the need to learn which conditions are decisive in forming such gene sets, and how those conditions relate to diverse covariates such as disease diagnosis or prognosis.
RESULTS: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information-theoretic metrics. To make the method easy to use, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets and highlight the sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, or samples). Statistically significant associations for the highlighted gene sets were shown by global analysis of Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of the observed variation.
CONCLUSION: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from the major methods now in routine use. In test uses, and judging by publicly available gene annotations, this method appears to identify numerous sets of biologically relevant genes. It has proven especially valuable when there are many diverse conditions (tens to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. The approach also shows promise in other domains, such as multi-spectral imaging datasets.
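The general idea of pairing principal components analysis with an information-theoretic score can be illustrated with a minimal sketch. The abstract does not say which metrics the authors use, so the Shannon entropy of each component's squared condition loadings is an assumed stand-in here, and it is not the published package. All function names (`pca_by_svd`, `loading_entropy`, `top_indices`, `make_toy_data`) are hypothetical, and the data are synthetic.

```python
import numpy as np


def pca_by_svd(expr, n_components):
    """PCA of a genes x conditions matrix via SVD.

    Returns per-gene scores, per-condition loadings, and the fraction of
    variance explained by each retained component.
    """
    centered = expr - expr.mean(axis=0, keepdims=True)  # center each condition
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained


def loading_entropy(loading):
    """Shannon entropy (bits) of a component's normalized squared loadings.

    Assumed metric for illustration: low entropy means a few conditions
    dominate the component; high entropy means it is spread over many.
    """
    p = loading ** 2 / np.sum(loading ** 2)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))


def top_indices(values, k):
    """Indices of the k largest absolute values, largest first."""
    return np.argsort(-np.abs(values))[:k]


def make_toy_data(n_genes=200, n_conditions=20, seed=0):
    """Synthetic data: genes 0-9 are up-regulated in conditions 0-2 only."""
    rng = np.random.default_rng(seed)
    expr = rng.normal(size=(n_genes, n_conditions))
    expr[:10, :3] += 8.0
    return expr


expr = make_toy_data()
scores, loadings, explained = pca_by_svd(expr, n_components=3)
entropies = [loading_entropy(row) for row in loadings]
driving_conditions = top_indices(loadings[0], 3)  # conditions behind component 1
candidate_genes = top_indices(scores[:, 0], 10)   # genes most affected by it
print("entropy per component (bits):", np.round(entropies, 2))
print("driving conditions:", sorted(driving_conditions.tolist()))
print("candidate genes:", sorted(candidate_genes.tolist()))
```

In this toy setting, the leading component concentrates its loadings on the three perturbed conditions, and the genes with the largest absolute scores on that component are the planted set. Real expression data would need normalization, a rule for choosing how many components to keep, and a significance test such as the Gene Ontology enrichment analysis mentioned above.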