
R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips.

Matthew E Ritchie, Benilton S Carvalho, Kurt N Hetrick, Simon Tavaré, Rafael A Irizarry

Abstract

Illumina produces a number of microarray-based technologies for human genotyping. An Infinium BeadChip is a two-color platform that types between 10^5 and 10^6 single nucleotide polymorphisms (SNPs) per sample. Despite being widely used, there is a shortage of open source software to process the raw intensities from this platform into genotype calls. To this end, we have developed the R/Bioconductor package crlmm for analyzing BeadChip data. After careful preprocessing, our software applies the CRLMM algorithm to produce genotype calls, confidence scores and other quality metrics at both the SNP and sample levels. We provide access to the raw summary-level intensity data, allowing users to develop their own methods for genotype calling or copy number analysis if they wish.
AVAILABILITY AND IMPLEMENTATION: The crlmm Bioconductor package is available from http://www.bioconductor.org. Data packages and documentation are available from http://rafalab.jhsph.edu/software.html.

Year:  2009        PMID: 19661241      PMCID: PMC2752620          DOI: 10.1093/bioinformatics/btp470

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

In recent years, large-scale genome-wide association studies have provided significant insight into the genetics underpinning many complex diseases (Grant and Hakonarson, 2008). High-density microarrays, which allow many single nucleotide polymorphisms (SNPs) to be genotyped simultaneously in a sample at low cost, have been the technology driving this research. Illumina Inc. (San Diego, CA, USA) is a major provider of such arrays.

Illumina BeadChips are composed of a number of rectangular strips, each containing many randomly arranged, replicated beads. For Infinium genotyping, beads are coupled with specific 50mer probes designed to be complementary to the sequence adjacent to the SNP site, and the two alleles (A, B) are discriminated using either a red or green dye (Steemers et al., 2006). Data are acquired by scanning each strip at different wavelengths using Illumina's scanning device followed by automatic image analysis (Galinsky, 2003). A robust summary of the intensity in each channel for each SNP assayed is reported in proprietary idat files. BeadChips of varying SNP density and sample format (single, duo, quad) are available for human genotyping. Some contain non-polymorphic probes for assessing copy number variation.

Many algorithms that take summarized allele A and B signals as inputs to produce genotypes (AA, AB, BB) have been developed for Affymetrix SNP arrays (Carvalho et al., 2007; Hua et al., 2007; Rabbee and Speed, 2006; Xiao et al., 2007). A smaller number of Illumina-specific methods (Giannoulatou et al., 2008; Teo et al., 2007), including Illumina's GenCall algorithm in BeadStudio/GenomeStudio, are also available. Software for the analysis of Illumina data such as beadarray (Dunning et al., 2007), beadarraySNP and lumi (Du et al., 2008) is available in R/Bioconductor (Gentleman et al., 2004); however, current packages do not deal specifically with Infinium BeadChip data. In this article, we present the crlmm package for Illumina genotyping. Our software extracts summarized intensities, performs normalization and applies the CRLMM algorithm (Carvalho et al., 2007) to remove chip- and SNP-specific biases and call genotypes.

2 METHODS

To begin, summarized data are read from idat files (two per array, one for each channel) using the function readIdatFiles. Binary idat files are a convenient starting point, as they are routinely output by the scanning software, provide a compact representation of the data and have a consistent format (unlike output from Illumina's BeadStudio/GenomeStudio software, which is exported at the user's discretion, meaning the raw signals needed for the analysis are not always available). Access to the raw data allows for low-level plotting to help visualize trends and biases that may be present (Fig. 1A and B). It also allows alternative genotyping algorithms, which require data on the raw scale, to be applied.

Fig. 1. (A) Plot of the log2 allele B (green) intensity by strip (labelled by Row.Column position), which steadily increases from rows 1 to 10 down the BeadChip. This effect is less prominent in the allele A (red) channel (data not shown). The source of this trend is related to the way post-hybridization reagents are applied to the BeadChip and scan order, and its presence motivates strip-level normalization. (B) A smoothed scatter plot of M versus S for a typical array, where darker regions indicate a higher density of points. This plot shows intensity-dependent effects in M which vary for the AA and BB genotypes, and motivate the three-component mixture model in CRLMM. The curves represent the smoothing splines that model this effect. (C) SNR for 60 arrays, with the median (solid line) and median–median absolute deviation (dashed line) SNR values plotted. Lower scores correspond to poorer separation between the genotype clouds depicted in (B). This metric can be used to flag low-quality arrays to exclude from further analysis.

Next, the allele A (X Raw) and allele B (Y Raw) signals are normalized between channels and samples simultaneously using strip-level quantile normalization. The between-channel aspect of the normalization [also recommended in Oosting et al. (2007) and Staaf et al. (2008)] aims to remove any dye-bias effects, while the strip-level component corrects for intensity gradients which can occur within BeadChips (Fig. 1A). Normalization at the strip level has also proven useful for data from Illumina's gene expression BeadChips (Wei Shi, personal communication). By default, the strip-level quantiles are standardized against a reference distribution obtained from HapMap samples (International HapMap Consortium, 2007) run on the same platform to correct for lab and batch effects. After normalization, the CRLMM genotyping algorithm (Carvalho et al., 2007; Lin et al., 2008) is applied.
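
The idea behind quantile normalization across channels and samples can be illustrated with a minimal Python sketch. This is not the crlmm implementation (which is written in R): the function name `quantile_normalize` and the toy intensities are hypothetical, and the reference distribution used here is simply the mean of the sorted columns rather than one derived from HapMap samples.

```python
from statistics import mean

def quantile_normalize(columns):
    """Map every column (one channel of one sample within a strip)
    onto a common distribution: the mean of the sorted columns."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across columns.
    reference = [mean(col[k] for col in sorted_cols) for k in range(n)]
    normalized = []
    for col in columns:
        # Rank each value within its own column (ties broken by position).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy strip: allele A (red) and allele B (green) intensities
# for two samples, so channels and samples are normalized together.
strip = [
    [120.0, 900.0, 450.0, 60.0],     # sample 1, allele A
    [300.0, 2100.0, 1000.0, 150.0],  # sample 1, allele B (brighter channel)
    [100.0, 800.0, 500.0, 70.0],     # sample 2, allele A
    [280.0, 1900.0, 1100.0, 160.0],  # sample 2, allele B
]
normalized = quantile_normalize(strip)
```

After this step every column shares the same set of values, so a systematic brightness difference between the two dyes no longer appears as a shift between channels, while the rank order of SNPs within each column is preserved.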
For each array, SNP-specific log-ratios, M = log2(A) − log2(B), and average intensities, S = [log2(A) + log2(B)]/2, are calculated, where A and B are the normalized allele A and allele B intensities. As noted for Affymetrix data (Carvalho et al., 2007), S appears to have an effect on M. The effect appears to be a smooth function of S, but only applies to the AA and BB intensities (Fig. 1B). To remove this effect, we fit a three-component mixture model with a spline used to model the smooth function. This model is fitted per array via the expectation-maximization (EM) algorithm using a random sample of data-points. Due to the different chemistry used for Illumina genotyping, the fragment length covariate described in Carvalho et al. (2007) can be ignored.

Next, a two-level hierarchical model is applied. SNP-specific means and standard deviations (SDs) are obtained for each genotype via supervised learning using HapMap data. Independent genotype calls (available from http://www.hapmap.org/) provide the true states for samples that have been genotyped using the respective BeadChip platform. Normalized signals from these arrays are then used to estimate robustly the genotype means and SDs. The intensity-dependent splines from the EM (which explain the between-SNP variation) and the SNP-specific genotype means and SDs (obtained from training data) are combined in the model. New genotype calls are assigned by choosing the class that minimizes the negative log likelihood.

CRLMM produces a number of metrics for quality assessment (Lin et al., 2008). Confidence scores for each call are provided using the log-likelihood ratio test from the hierarchical model. The sample-specific SNR (signal-to-noise ratio) assesses the separation of the three genotypes within an array. Lower SNR values indicate poorer quality, and this metric can be used to exclude samples from further analysis (Fig. 1C). SNP-specific quality is measured as the minimum distance between the heterozygote centre and either of the two homozygous centres.
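
The core quantities above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the CRLMM algorithm itself: it omits the spline correction and the hierarchical model, assumes one normal distribution of M per genotype, and uses hypothetical function names and made-up SNP parameters. The confidence value shown (the gap in negative log likelihood between the best and second-best genotype) is an assumed stand-in for the log-likelihood ratio described in the text.

```python
import math

def m_and_s(allele_a, allele_b):
    """Log-ratio M and average log-intensity S for one SNP on one array."""
    log_a, log_b = math.log2(allele_a), math.log2(allele_b)
    return log_a - log_b, (log_a + log_b) / 2

def neg_log_likelihood(m, centre, sd):
    """Negative log density of a normal distribution evaluated at m."""
    return 0.5 * math.log(2 * math.pi * sd ** 2) + (m - centre) ** 2 / (2 * sd ** 2)

def call_genotype(m, params):
    """Choose the genotype with the smallest negative log likelihood.

    params maps genotype -> (mean of M, SD of M) for this SNP.
    Returns the call and the gap to the runner-up (assumed confidence).
    """
    scored = sorted((neg_log_likelihood(m, mu, sd), g) for g, (mu, sd) in params.items())
    (best_nll, best), (second_nll, _) = scored[0], scored[1]
    return best, second_nll - best_nll

def snp_quality(params):
    """Minimum distance between the heterozygote centre and a homozygous centre."""
    het = params["AB"][0]
    return min(abs(params["AA"][0] - het), abs(params["BB"][0] - het))

# Hypothetical SNP-specific parameters, as if learned from training data.
snp_params = {"AA": (2.0, 0.3), "AB": (0.0, 0.3), "BB": (-2.0, 0.3)}

m, s = m_and_s(allele_a=4096.0, allele_b=1024.0)
genotype, confidence = call_genotype(m, snp_params)
quality = snp_quality(snp_params)
```

In this toy case the allele A signal is four times the allele B signal, so M sits on the AA centre and the AA class has the smallest negative log likelihood. A SNP whose heterozygote centre lies close to a homozygous centre would receive a small `quality` value, reflecting poorly separated genotype clusters.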

The preprocessing and genotyping steps above are performed by the crlmmIllumina function. All code is written in R (R Development Core Team, 2009) and existing Biobase classes are used to store the data. The software requires chip-specific data packages (available at http://rafalab.jhsph.edu/software.html) that store basic SNP annotation information and various parameters used by CRLMM. We also provide the hapmap370k data package, which contains idats from 40 HapMap samples hybridized to HumanHap 370K Duo BeadChips, and a user guide that provides example R code to analyse these samples (see Supplementary Material). On a 64-bit Linux system, reading these data takes ∼90 s and uses up to 1.2 GB of RAM, while normalization and genotyping take a further 470 s and use up to 3.3 GB of RAM. This equates to processing around 600 SNPs per second.
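
As a rough check of the quoted throughput, the arithmetic can be reproduced in Python. The SNP count is an assumption: roughly 370,000 SNPs per array is inferred from the chip name and is not stated in the text. All 40 samples are treated as one joint run.

```python
# Timings quoted in the text, in seconds.
read_seconds = 90
genotype_seconds = 470

# Assumption: about 370,000 SNPs per array, inferred from the chip name.
n_snps = 370_000

total_seconds = read_seconds + genotype_seconds
snps_per_second = n_snps / total_seconds
print(f"{snps_per_second:.0f} SNPs per second over {total_seconds} s")
```

Under this assumption the rate comes out in the same range as the figure of around 600 SNPs per second given above.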

3 DISCUSSION

The crlmm package provides bioinformaticians with an additional tool outside of Illumina's proprietary software for analysing Infinium BeadChip data. Our software also facilitates the analysis of Affymetrix SNP chips, and the use of a consistent algorithm and framework to process both platforms allows data from different studies to be combined more easily. Implementation in R/Bioconductor gives users the opportunity to exploit other tools that have been adapted for Illumina data. For example, if raw bead-level data were available, the BASH spatial artefact detection method (Cairns et al., 2008) in the beadarray package could be applied. Once summarized, the data could be further processed using crlmm. The CRLMM algorithm can be applied to new versions of Illumina BeadChips for humans and other species, provided that the necessary training data and prior information on genotype calls are available. Future work will benchmark the performance of our method against other genotyping algorithms tailored to Illumina data, such as Illuminus (Teo et al., 2007) and Illumina's own algorithms in BeadStudio/GenomeStudio. Furthermore, tools for copy number analysis are being developed in the crlmm package.

REFERENCES

1.  Automatic registration of microarray images. II. Hexagonal grid.

Authors:  Vitaly L Galinsky
Journal:  Bioinformatics       Date:  2003-09-22       Impact factor: 6.937

2.  A genotype calling algorithm for Affymetrix SNP arrays.

Authors:  Nusrat Rabbee; Terence P Speed
Journal:  Bioinformatics       Date:  2005-11-02       Impact factor: 6.937

3.  Whole-genome genotyping with the single-base extension assay.

Authors:  Frank J Steemers; Weihua Chang; Grace Lee; David L Barker; Richard Shen; Kevin L Gunderson
Journal:  Nat Methods       Date:  2006-01       Impact factor: 28.547

4.  Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data.

Authors:  Benilton Carvalho; Henrik Bengtsson; Terence P Speed; Rafael A Irizarry
Journal:  Biostatistics       Date:  2006-12-22       Impact factor: 5.899

5.  SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays.

Authors:  Jianping Hua; David W Craig; Marcel Brun; Jennifer Webster; Victoria Zismann; Waibhav Tembe; Keta Joshipura; Matthew J Huentelman; Edward R Dougherty; Dietrich A Stephan
Journal:  Bioinformatics       Date:  2006-10-24       Impact factor: 6.937

6.  A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays.

Authors:  Yuanyuan Xiao; Mark R Segal; Y H Yang; Ru-Fang Yeh
Journal:  Bioinformatics       Date:  2007-04-25       Impact factor: 6.937

7.  beadarray: R classes and methods for Illumina bead-based data.

Authors:  Mark J Dunning; Mike L Smith; Matthew E Ritchie; Simon Tavaré
Journal:  Bioinformatics       Date:  2007-06-22       Impact factor: 6.937

8.  High-resolution copy number analysis of paraffin-embedded archival tissue using SNP BeadArrays.

Authors:  Jan Oosting; Esther H Lips; Ronald van Eijk; Paul H C Eilers; Károly Szuhai; Cisca Wijmenga; Hans Morreau; Tom van Wezel
Journal:  Genome Res       Date:  2007-01-31       Impact factor: 9.043

9.  A second generation human haplotype map of over 3.1 million SNPs.

Authors:  International HapMap Consortium (Kelly A Frazer; Dennis G Ballinger; David R Cox; et al.)
Journal:  Nature       Date:  2007-10-18       Impact factor: 49.962

10.  Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays.

Authors:  Shin Lin; Benilton Carvalho; David J Cutler; Dan E Arking; Aravinda Chakravarti; Rafael A Irizarry
Journal:  Genome Biol       Date:  2008-04-03       Impact factor: 13.583
