BACKGROUND: Non-random patterns of genetic variation exist among individuals in a population owing to a variety of evolutionary factors. As a result, populations are structured into genetically distinct subpopulations. As genotypic datasets grow ever larger, it becomes increasingly difficult to correctly estimate the number of subpopulations and to assign individuals to them. Computationally efficient non-parametric methods, chiefly those based on Principal Components Analysis (PCA), are therefore increasingly relied upon for population structure analysis. Current PCA-based methods can accurately detect structure; however, their accuracy in resolving subpopulations and assigning individuals to them is wanting. When subpopulations are closely related to one another, they overlap in PCA space and appear as a conglomerate. This problem is exacerbated when some subpopulations in the dataset are genetically far removed from others. We propose a novel PCA-based framework that addresses this shortcoming.

RESULTS: We developed a novel population structure analysis algorithm, iterative pruning PCA (ipPCA), which assigns individuals to subpopulations and infers the total number of subpopulations present. Genotypic data from simulated and real population datasets with different degrees of structure were analyzed. For datasets with simple structure, the subpopulation assignments made by ipPCA were largely consistent with those of the STRUCTURE, BAPS and AWclust algorithms. In contrast, highly structured populations containing many closely related subpopulations could be accurately resolved only by ipPCA, and not by the other methods.

CONCLUSION: The algorithm is computationally efficient and is not constrained by dataset complexity. This systematic approach to subpopulation assignment removes the need for prior population labels, which could be advantageous when cryptic stratification is encountered in datasets containing individuals otherwise assumed to belong to a homogeneous population.
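
The abstract does not state how ipPCA splits a group or decides when to stop splitting, so the following is only a minimal sketch of the general idea of re-running PCA on progressively smaller groups, not the authors' implementation. Everything named here is an assumption made for illustration: the functions `ip_pca`, `top_pc_and_ratio`, `two_means_1d` and `simulate_genotypes`, the eigenvalue-ratio stopping heuristic with its threshold of 5.0, the 2-means split on the first principal component, and the toy genotype data.

```python
import numpy as np


def top_pc_and_ratio(G):
    """Center SNP columns; return PC1 scores and a top/median eigenvalue ratio."""
    X = G - G.mean(axis=0)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    eig = s ** 2
    med = np.median(eig)
    # Assumed heuristic: a dominant top eigenvalue hints at remaining structure.
    ratio = eig[0] / med if med > 0 else 0.0
    return U[:, 0] * s[0], ratio


def two_means_1d(x, n_iter=50):
    """Plain 2-means on a 1-D array, seeded at its minimum and maximum."""
    c = np.array([x.min(), x.max()], dtype=float)
    for _ in range(n_iter):
        right = np.abs(x - c[0]) > np.abs(x - c[1])  # True = nearer c[1]
        new_c = np.array([x[~right].mean(), x[right].mean()])
        if np.allclose(new_c, c):
            break
        c = new_c
    return right


def ip_pca(G, threshold=5.0, min_size=10):
    """Return one integer subpopulation label per row (individual) of G."""
    labels = np.zeros(G.shape[0], dtype=int)
    pending = [np.arange(G.shape[0])]
    next_label = 0
    while pending:
        idx = pending.pop()
        if len(idx) >= min_size:
            scores, ratio = top_pc_and_ratio(G[idx])
            if ratio > threshold:
                # Structure detected: split the group and revisit each half
                # with a fresh PCA computed on that half alone.
                right = two_means_1d(scores)
                pending.append(idx[~right])
                pending.append(idx[right])
                continue
        # No further structure detected: this group is a terminal subpopulation.
        labels[idx] = next_label
        next_label += 1
    return labels


def simulate_genotypes(n_per_pop=30, seed=0):
    """Hypothetical toy data: A and B are close, C is far removed from both."""
    rng = np.random.default_rng(seed)
    freqs = np.full((3, 300), 0.1)
    freqs[0, 200:] = 0.9  # A differs from B only on the last 100 SNPs
    freqs[2, :200] = 0.9  # C differs from A and B on the first 200 SNPs
    freqs[2, 200:] = 0.5
    G = np.vstack([rng.binomial(2, f, size=(n_per_pop, 300)) for f in freqs])
    truth = np.repeat(np.arange(3), n_per_pop)
    return G.astype(float), truth
```

With `G, truth = simulate_genotypes()`, calling `labels = ip_pca(G)` yields one label per individual, and `labels.max() + 1` is the inferred number of subpopulations. Re-running PCA inside each group is what lets closely related groups separate once a distant group no longer dominates the leading component, which is the situation the abstract describes.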