MOTIVATION: Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently, with findings reported in top journals and the mainstream media. Microarrays are the genotyping technology of choice in GWAS because they permit the exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The statistical analyses that GWAS use to determine association between loci and disease start from genotype calls (AA, AB or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls, and various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays and different sample batches has a substantial influence on the accuracy of genotype calls made by existing algorithms. Failure to account for these sources of variability can adversely affect the quality of findings reported by GWAS.
RESULTS: We developed a method based on an enhanced version of the multi-level model used by CRLMM version 1. Two key differences are that we now account for variability across batches and improve the assessment of each individual call. The new model permits the development of quality metrics for SNPs, samples and batches of samples. Using three independent datasets, we demonstrate that CRLMM version 2 outperforms both CRLMM version 1 and Birdseed, the algorithm provided by Affymetrix. The main advantage of the new approach is that it enables the identification of low-quality SNPs, samples and batches.
AVAILABILITY: Software implementing the method described in this article is available as free and open source code in the crlmm R/Bioconductor package.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Authors: Matthew E Ritchie; Benilton S Carvalho; Kurt N Hetrick; Simon Tavaré; Rafael A Irizarry Journal: Bioinformatics Date: 2009-08-06 Impact factor: 6.937
Authors: Lori D Bash; Thomas P Erlinger; Josef Coresh; Jane Marsh-Manzi; Aaron R Folsom; Brad C Astor Journal: Am J Kidney Dis Date: 2008-12-24 Impact factor: 8.860
Authors: Shin Lin; Benilton Carvalho; David J Cutler; Dan E Arking; Aravinda Chakravarti; Rafael A Irizarry Journal: Genome Biol Date: 2008-04-03 Impact factor: 13.583
Authors: Sarah E Reese; Kellie J Archer; Terry M Therneau; Elizabeth J Atkinson; Celine M Vachon; Mariza de Andrade; Jean-Pierre A Kocher; Jeanette E Eckel-Passow Journal: Bioinformatics Date: 2013-08-19 Impact factor: 6.937
Authors: Santiago Herrera; Wilfred C de Vega; David Ashbrook; Suzanne D Vernon; Patrick O McGowan Journal: Epigenetics Date: 2018-12-05 Impact factor: 4.528
Authors: Daniela O Procopio; Laura M Saba; Henriette Walter; Otto Lesch; Katrin Skala; Golda Schlaff; Lauren Vanderlinden; Peter Clapp; Paula L Hoffman; Boris Tabakoff Journal: Alcohol Clin Exp Res Date: 2012-12-27 Impact factor: 3.455
Authors: David E Larson; Christopher C Harris; Ken Chen; Daniel C Koboldt; Travis E Abbott; David J Dooling; Timothy J Ley; Elaine R Mardis; Richard K Wilson; Li Ding Journal: Bioinformatics Date: 2011-12-06 Impact factor: 6.937