Florian Privé (1,2), Keurcien Luu (2), Michael G B Blum (2,3), John J McGrath (1,4,5), Bjarni J Vilhjálmsson (1)
1. National Centre for Register-Based Research, Aarhus University, Aarhus 8210, Denmark.
2. Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, La Tronche 38700, France.
3. OWKIN France, Paris 75010, France.
4. Queensland Brain Institute, University of Queensland, St. Lucia, 4072 Queensland, Australia.
5. Queensland Centre for Mental Health Research, The Park Centre for Mental Health, Wacol, 4076 Queensland, Australia.
Abstract
MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.
RESULTS: For example, we find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in their analyses of genetic data, as well as of other omics data.
AVAILABILITY AND IMPLEMENTATION: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
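The shrinkage bias of projected PCs mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the bigsnpr implementation and does not use genotype data; it assumes two equally sized groups with Gaussian features, and all function names (`simulate`, `first_pc`, `project`, `mean_abs`) are hypothetical helpers introduced for illustration. It computes PC1 on a training sample with many more features than individuals, then projects new individuals drawn from the same two groups and compares the typical magnitude of their scores with that of the training scores.

```python
import math
import random


def simulate(n_per_group, n_features, shift, rng):
    """Two groups with feature means +shift and -shift, plus unit Gaussian noise."""
    rows = []
    for sign in (1.0, -1.0):
        for _ in range(n_per_group):
            rows.append([sign * shift + rng.gauss(0.0, 1.0) for _ in range(n_features)])
    return rows


def first_pc(train, rng, n_iter=200):
    """Return (feature means, unit-norm PC1 loadings) of the training data.

    Uses power iteration on the n x n Gram matrix of the centered data,
    which is cheaper than working with the p x p covariance when p >> n.
    """
    n, p = len(train), len(train[0])
    means = [sum(row[j] for row in train) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in train]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]

    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random starting vector
    for _ in range(n_iter):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]

    # Convert the top eigenvector of the Gram matrix into feature loadings.
    loadings = [sum(centered[i][j] * vec[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(w * w for w in loadings))
    return means, [w / norm for w in loadings]


def project(rows, means, loadings):
    """PC1 scores: center with the training means, then take the dot product."""
    return [sum((x - m) * w for x, m, w in zip(row, means, loadings)) for row in rows]


def mean_abs(scores):
    return sum(abs(s) for s in scores) / len(scores)


rng = random.Random(1)
train = simulate(20, 1000, 0.2, rng)       # 40 individuals, 1000 features
new = simulate(20, 1000, 0.2, rng)         # 40 new individuals, same groups

means, loadings = first_pc(train, rng)
train_scores = project(train, means, loadings)
new_scores = project(new, means, loadings)

# A ratio below 1 means the projected individuals sit closer to 0 on PC1
# than the individuals used to compute the PCA.
ratio = mean_abs(new_scores) / mean_abs(train_scores)
print(f"mean |PC1| (training):  {mean_abs(train_scores):.2f}")
print(f"mean |PC1| (projected): {mean_abs(new_scores):.2f}")
print(f"shrinkage ratio:        {ratio:.2f}")
```

In this toy setting the projected scores are pulled toward zero relative to the training scores even though both samples come from the same two groups, because the training scores also absorb noise that the PC was fitted to. The parameter values (20 individuals per group, 1000 features, shift 0.2) are arbitrary choices for the illustration.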