Peizhou Liao (1), Glen A Satten (2), Yi-Juan Hu (1). (1) Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA. (2) Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA.
Abstract
Motivation: Inferring population structure is important for both population genetics and genetic epidemiology. Principal components analysis (PCA) has been effective in ascertaining population structure with array genotype data but can be difficult to use with sequencing data, especially when low depth leads to uncertainty in called genotypes. Because PCA is sensitive to differences in variability, PCA using sequencing data can result in components that correspond to differences in sequencing quality (read depth and error rate), rather than differences in population structure. We demonstrate that even existing methods for PCA specifically designed for sequencing data can still yield biased conclusions when used with data having sequencing properties that are systematically different across different groups of samples (i.e. sequencing groups). This situation can arise in population genetics when combining sequencing data from different studies, or in genetic epidemiology when using historical controls such as samples from the 1000 Genomes Project.
Results: To allow inference on population structure using PCA in these situations, we provide an approach that is based on using sequencing reads directly without calling genotypes. Our approach is to adjust the data from different sequencing groups to have the same read depth and error rate so that PCA does not generate spurious components representing sequencing quality. To accomplish this, we have developed a subsampling procedure to match the depth distributions in different sequencing groups, and a read-flipping procedure to match the error rates. We average over subsamples and read flips to minimize loss of information. We demonstrate the utility of our approach using two datasets from 1000 Genomes, and further evaluate it using simulation studies.
Availability and implementation: TASER-PC software is publicly available at http://web1.sph.emory.edu/users/yhu30/software.html.
Contact: yijuan.hu@emory.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
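The subsampling and read-flipping ideas described under Results can be illustrated with a minimal sketch. It is not the TASER-PC implementation: it assumes a single biallelic site, reads coded as 0 (major allele) or 1 (minor allele), and a symmetric error model in which a read is miscalled with the same probability in either direction. Under that assumption, flipping each read independently with probability p = (e1 - e0) / (1 - 2 e0) raises the error rate from e0 to e1. All function names are hypothetical, and averaging the minor-read fraction over replicates is only one possible way to average over subsamples and read flips.

```python
import numpy as np


def flip_probability(err_from, err_to):
    """Per-read flip probability that raises a symmetric error rate
    from err_from to err_to (assumed model, not taken from the paper)."""
    if not 0.0 <= err_from <= err_to < 0.5:
        raise ValueError("need 0 <= err_from <= err_to < 0.5")
    # Solve err_from * (1 - p) + (1 - err_from) * p = err_to for p.
    return (err_to - err_from) / (1.0 - 2.0 * err_from)


def subsample_reads(reads, target_depth, rng):
    """Draw target_depth reads without replacement; keep all reads
    if the site already has no more than target_depth reads."""
    reads = np.asarray(reads, dtype=np.int8)
    if reads.size <= target_depth:
        return reads.copy()
    return rng.choice(reads, size=target_depth, replace=False)


def flip_reads(reads, p_flip, rng):
    """Flip each 0/1 read to the other allele with probability p_flip."""
    reads = np.asarray(reads, dtype=np.int8)
    flips = rng.random(reads.size) < p_flip
    return np.where(flips, 1 - reads, reads)


def harmonized_minor_fraction(reads, target_depth, p_flip, n_rep=200, seed=0):
    """Average the minor-read fraction over repeated subsamples and flips.
    Expects at least one read at the site."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_rep):
        sub = subsample_reads(reads, target_depth, rng)
        fractions.append(flip_reads(sub, p_flip, rng).mean())
    return float(np.mean(fractions))


if __name__ == "__main__":
    # Hypothetical high-depth sample: 40 reads, 18 carrying the minor allele.
    reads = np.array([1] * 18 + [0] * 22)
    p = flip_probability(err_from=0.001, err_to=0.01)
    print(harmonized_minor_fraction(reads, target_depth=6, p_flip=p))
```

In this sketch the higher-quality group is degraded to match the lower-quality one, so the target depth and target error rate would be taken from the group with shallower and noisier data; averaging over replicates reduces the extra noise that a single random subsample or flip would introduce.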