Efficient toolkit implementing best practices for principal component analysis of population genetic data.

Florian Privé, Keurcien Luu, Michael G B Blum, John J McGrath, Bjarni J Vilhjálmsson.

Abstract

MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls. These include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) shrinkage bias in projected PCs, (iii) undetected sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr.
RESULTS: We find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in analyses of genetic data, as well as of other omics data.
AVAILABILITY AND IMPLEMENTATION: The R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press.
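
The shrinkage bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the bigsnpr implementation and does not use genotype data: it assumes Gaussian features, two equal-sized groups and many more features than samples, and every name in it (`shrinkage_demo`, `shift`, `group_gap`) is hypothetical. It computes PC1 on a training set, projects new samples drawn from the same two groups onto it, and compares how far apart the groups sit in each set of scores.

```python
import numpy as np


def shrinkage_demo(n_train=100, n_test=100, n_features=2000, shift=0.25, seed=0):
    """Return the PC1 gap between two groups for training and projected samples.

    Illustrative assumption: Gaussian noise features, with the two group means
    differing by `shift` on every feature (not real genotypes).
    """
    rng = np.random.default_rng(seed)

    def simulate(n):
        # Half of the samples are labelled -1, the other half +1.
        labels = np.repeat([-1.0, 1.0], n // 2)
        X = rng.standard_normal((n, n_features)) + (shift / 2) * labels[:, None]
        return X, labels

    X_train, lab_train = simulate(n_train)
    X_test, lab_test = simulate(n_test)

    # PCA on the training samples only (centering uses the training means).
    center = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - center, full_matrices=False)
    pc1 = Vt[0]

    # Naive projection: multiply the centered new samples by the PC1 loadings.
    train_scores = (X_train - center) @ pc1
    test_scores = (X_test - center) @ pc1

    def group_gap(scores, labels):
        # Absolute distance between the two group means along PC1.
        return abs(scores[labels > 0].mean() - scores[labels < 0].mean())

    return group_gap(train_scores, lab_train), group_gap(test_scores, lab_test)


if __name__ == "__main__":
    train_gap, test_gap = shrinkage_demo()
    print(f"PC1 group gap, training samples:  {train_gap:.2f}")
    print(f"PC1 group gap, projected samples: {test_gap:.2f}")
```

Under these assumptions the projected samples are pulled toward zero relative to the training samples, because PC1 partly fits noise that is specific to the training set. This is why naively projected individuals can appear closer to the center than reference individuals of the same group.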

Year: 2020; PMID: 32415959; PMCID: PMC7750941; DOI: 10.1093/bioinformatics/btaa520

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937


